PostgreSQL Getting Parallel Query 83
New submitter iamvego writes: A major feature PostgreSQL users have requested for some time now is to have the query planner "parallelize" a query. Now, thanks to Robert Haas and Amit Kapila, this has now materialized in the 9.6 branch. Robert Haas writes in his blog entry that so far it only supports splitting up a sequential scan between multiple workers, but should hopefully be extended to work with multiple partitions before the final release, and much more beside in future releases.
Re: (Score:1)
Re: (Score:3, Funny)
You mean 500 johns or 500 joins? Sounds painful either way.
Re: (Score:2)
particularly in parallel.
Re: (Score:1)
Maybe what they meant is that the bottleneck should be disk, RAM, bus, and/or network I/O for the vast majority of sub-queries if "done right". In other words, a single core should be able to computationally sift rows fast enough to keep up with row scans from multiple "sources" for up to say 8 sources, which may be the max practical number of sources.
However, I imagine that a complex or "mathy" equation in a query could make it be CPU intensive. But maybe that's relatively rare.
Re: (Score:2)
However, I imagine that a complex or "mathy" equation in a query could make it be CPU intensive. But maybe that's relatively rare.
Good point.
Re: (Score:2)
Not a dual core Pentium D or Athlon X2 from 2005?
Re:How does this compare with other databases? (Score:4, Informative)
Re: (Score:1)
The parent did not reference "sharding". He referenced shard-query, which is a mechanism that can parallelize a single query using relational algebra.
https://www.percona.com/blog/2011/05/14/distributed-set-processing-with-shard-query/
Re: (Score:3)
Re: (Score:1)
I've used shard-query -- an early version of it. See the link I posted. It uses relational algebra to break a single query up into discrete parallel queries and execute them in parallel then combine the result set, on either a single server or multiple servers. It is able to do this outside the DB, and it could probably be ported to pgsql as well.
Re: (Score:2, Informative)
It's worth pointing out that while this is a useful feature for some applications, it won't make any difference whatsoever for many more. PostgreSQL has been multi-threaded for executing concurrent queries for a very long time, so if you're running a database for many users, you probably won't notice any difference with this functionality. If you have some particular query that takes a heck of a long time like MRP runs, then this will matter to you.
Re: (Score:2)
Re: (Score:2, Insightful)
Oracle has had this capability for at least a decade.
On the other hand it is often worth it to rewrite your application entirely or ditch the customer that needs that particular function to not have to deal with Oracle.
Sounds like Terradata (Score:2)
Re: (Score:2)
> Is this similar to MySQL / MariaDB's shard query? Do the commercial databases have this?
Since the beginning of time by comparison...
This improvement a plus (Score:2)
Re: (Score:2)
Patent trolls in 3 2 1 (Score:1)
I believe Oracle owns using more than 1 cpu in a query if I am correct so their own rdbms looks faster. Maybe it is now being enabled if it expires or maybe not.
I don't want to be sued using it
Re: (Score:2)
I believe Oracle owns using more than 1 cpu in a query if I am correct...
Well, Sybase had it well over 20 years ago...
Re: Patent trolls in 3 2 1 (Score:2)
So it was baked into SQL Server since the beginning
Re: (Score:2)
So it was baked into SQL Server since the beginning
I believe so, but that was a long time ago. Could be my timeline is off and it was added after the MS purchase.
New features every few months (Score:5, Informative)
Just a few months back, lwn.net had a longish story on PostgreSQL. They were scoring a victory with the "UPSERT" command addition in 9.5, which with speed updates old records, OR inserts a new one, if none. A big feature on your commercial databases. Apparently, PostgreSQL's biggest worry lately is that it has so many developers adding cool new features that there's some resource lacks maintaining and cleaning the base code. (Possibly unfair oversimplification of lwn.net story.)
I discovered PostgreSQL to get a free geodatabase for mapping, with the PostGIS plug-in...the open plug-in architecture being one of the greatest things about a FLOSS database. After nearly 25 years with Oracle and thinking everything else was a toy by comparison, PG blew me away. Amazing features, high performance, reliable. It's an amazing project, and this news is both impressive and unsurprising.
Re: (Score:2)
This one on the other hand seems like baby steps:
One rather enormous limitation of the current feature is that we only generate Gather nodes immediately on top of Parallel Seq Scan nodes. This means that this feature doesn't currently work for inheritance hierarchies (which are used to implement partitioned tables) because there would be an Append node in between. Nor is it possible to push a join down into the workers at present. (...) With things as they are, about the only case that benefits from this feature is a sequential scan of a table that cannot be index-accelerated but can be made faster by having multiple workers test the filter condition in parallel.
No partitioning, no joins, right now the only thing you can speed up is a simple table scan where you can't/won't use an index. This is more "proof of concept" parallelism than a useful feature right now. I guess in a release or two this will be a big thing.
Re: (Score:3)
The biggest issue with parallelism is that a lot of stuff can't be parallelized in a way that makes sense. The way it is done (dispatching and gathering nodes) only makes sense if the query takes a really long time, otherwise there is a lot of overhead that destroys any type of speedup and could actually make everything else slower. Typically multi-threading in databases is done to speed up multiple independent queries, not a single query.
Re: (Score:2)
The biggest issue with parallelism is that a lot of stuff can't be parallelized in a way that makes sense. The way it is done (dispatching and gathering nodes) only makes sense if the query takes a really long time, otherwise there is a lot of overhead that destroys any type of speedup and could actually make everything else slower. Typically multi-threading in databases is done to speed up multiple independent queries, not a single query.
Define "really long time", in the example he's running a query that takes less than a second and gets it down to <250ms. Sure, it's not useful for transaction processing but I got many queries running on millions of rows where anything from a minute to an hour is more like it and milliseconds are peanuts. I would think most people have at least some reports that would benefit.
Re:New features every few months (Score:5, Insightful)
Re: (Score:1)
It's like the FreeBSD of RDBMS!
Re: (Score:1)
Having worked with some of the core committers, and attended conferences (and dinner) with others, I'm not the least bit surprised. I consider myself a pretty competent developer, and every single one of them made me feel like a total rookie. Also, none of them had the chest thumping ego that you see from the heads of a lot of other opensource projects. It was refreshing.
Re: (Score:1)
I miss pgsql (Score:2)
I only ever got to develop a single project on pgsql and I regret that. This was back in 2001. MySQL was pretty immature at the time but had the enormous install base. I went with PostgreSQL because it was more mature. It never let me down. The deployment went fine, it ran great, customer used it on and off for about 6 years and then it was just no longer needed.
Fast forward to 2011, ten years later, and now I'm running the show and developing a point of sale for the family business I'm in and I run wit
Re: (Score:2, Interesting)
That and the master-master replication suite from Percona.
I think that is a good reason to pick MySQL.
As much as I like Postgres..... it seems to be a heck of a lot easier to do replication with MySQL and put together a highly-survivable system.
I'm not even sure how to even start to go about doing it with Postgres.... although in the past; I have had a Cold/Warm standby Postgres with Slony-L based replication; It was quite frankly, a PITA.
Re: (Score:2)
Re: (Score:2)
The native solution (streaming replication with many options) has been production stable since 9.1 and only increased in features, speed, and reliability.
Re: (Score:2)
Pg replication systems are dime a dozen, pick one, and implement it.
Re: (Score:2, Insightful)
... and the replication systems are typically not worth much more than a dime, sadly.
We have a pretty beefy set up; 4x 16 Core Xeon DB servers with 128 GB of RAM each and Enterprise SSDs, serving hundreds of instances of like-schema databases, one per (organizational) customer, serving an aggregate peak of about 1,000 queries/second in a mixed read/write load.
And we've never been able to get replication to work reliably, ever. In every case we've ever tried, we've seen a net reduction in reliability. Every
Re: (Score:1)
Even with streaming replication in 9.3 onward? And tuning it? This a single instance with multiple schemas? Or separate instances?
Re: (Score:2)
You're doing something wrong, then. Streaming replication works fine for us on a fairly similarly-sized setup.
Re:I miss pgsql (Score:5, Informative)
MySQL/MariaDB are still toys in comparison to PostgreSQL.
Postgres has recursive CTEs, regex replacement, native JSON support (as a record type and trivially convert every query return type), and even base64 decoding and xpath parsing.
MySQL has had some nice features for years, like REPLACE, but since the 9.x branch, CTEs can do that and more. And now PG has UPSERT for simplicity. Replication has always been great with MySQL, but PG's replication is now easy to administer. I've relied on Postgres' streaming replication since 9.1 in production and it's been great for years.
Re: (Score:2, Informative)
MySQL has had some nice features for years, like REPLACE, but since the 9.x branch
The features MySQL has are good enough for 99% of real-world web applications.
Yes, Postgres has more, but the extra features it has don't necessarily add much value for most programs.
MySQL multi-master replication features are immensely valuable by comparison, and Postgres lacking them has prevented me from using Postgres, more than once.
Re: (Score:2)
> The features MySQL has are good enough for 99% of real-world web applications.
It doesn't even have complete SQL support.
While that might not matter for a trivial web application. For the more interesting ones that include non-trivial development teams, that will cause problems.
Re: (Score:1)
http://bdr-project.org/docs/st... [bdr-project.org]
Re: (Score:1)
Yes, Postgres has more, but the extra features it has don't necessarily add much value for most programs.
MySQL multi-master replication features are immensely valuable
I guess it depends on your programs. It does take effort to maintain two or more connections (one for write, the others for reads).
If your use case needs balanced multi-master replication, but simple features, you should use a NoSQL solution.
Postgres is amazing for reporting where you can bring anything you want in a single query, including JSON output (without using plsql procedures), and bulk updates are fantastic, with CTEs for selects and layers of them for updates/deletes, especially great with regex_r
Re: (Score:2)
> I will eventually migrate off MySQL but I don't know if it'll be MariaDB
So from MySQL to still MySQL?
Re: (Score:2)
How many grams of that righteous mineral rhodium do you charge?
Re: (Score:1)
Interesting, it appears Greenplum has recently been open sourced [infoworld.com].
Fuck Yeah. PostgreSQL FTW (Score:1)
I've been using Postgres for well over a decade now, and I still love it. Yes, you have to tune it, like any powerful tool.
Granted this first pass is only for sequential scans, but those are the simplest to parallelize and generally the slowest. Some queries rely on table scans as not every column can be indexed.
Postgres' growing feature set is amazing. Thanks team!
mongoDB (Score:1)
It will be interesting to see how it compares with MongoDB. MongoDB does not use joins. MongoDB is webscale. [youtube.com]
I hope this doesn't compromise overall performance (Score:2)
Re: (Score:2)
Re: (Score:2)
I wouldn't call it a "niche" exactly. It's one of the major main use cases for employing an RDBMS.
We just have a lot of "database people" with very limited experience and a limited mindset.
Re: (Score:1)
A more typical usage is where there are many smaller queries.
Typical usage of what? If I have an OLTP system, for a transactional web based system, sure i'd agree. But if I am operating a data warehouse with fact tables housing hundreds of millions of rows, or trying to run largish reports on top of my OLTP system (say for state/fed reporting, or financial reporting), my "typical usage" is not many smaller queries.
The last two projects I have been on, turning on auto-parrallism in Oracle has made huge performance gains. Not just for reads, but also when enabled fo
Working on similar feature (Score:1)