First "Real" Benchmark for PostgreSQL 275
anticlimate writes "A new benchmark published on SPEC shows PostgreSQL's performance approaching Oracle's and on par with or surpassing MySQL's (though the test hardware for the other DB systems differs somewhat). The test was put together by PostgreSQL core developers working at Sun. They certainly are not unbiased, but this is the first 'real' benchmark with PostgreSQL — according to Josh Berkus's blog. The main difference compared to earlier benchmarks (and anecdotes) seems to be the tuning of PostgreSQL."
Re:on the playground... (Score:1, Informative)
like BSD.. no it's not dead yet.
Bad firehose! (Score:5, Informative)
The current version of PostgreSQL now has its first real benchmark [ittoolbox.com], a SPECjAppServer2004 [spec.org] submission [spec.org] from Sun Microsystems. The results required substantial tuning [sun.com] of many performance-related PostgreSQL parameters, some of which are set to extremely low values in the default configuration — a known issue, and part of why many untuned PostgreSQL installations appear sluggish compared to their rivals. The speed result is close to, but slightly faster than, an earlier Sun submission using MySQL 5 [spec.org] (with enough hardware differences to make a direct comparison of those results unfair), and comes close to keeping up with Oracle on similarly priced hardware — but with a large software savings. Having a published result on the level playing field of an industry-standard benchmark like SPECjAppServer2004, with documentation on all the tuning required to reach that performance level, should make PostgreSQL an easier sell to corporate customers who are wary of adopting open-source applications for their critical databases.
Re:on the playground... (Score:2, Informative)
Re:I do not think it means what you think it means (Score:5, Informative)
Here's a more useful one: All SPEC jAppServer2004 Results Published by SPEC [spec.org]
The benchmarks aren't standardized enough for any useful comparison. The hardware and configurations vary in almost every one.
Re:on the playground... (Score:5, Informative)
Um, no. DB2 [ibm.com] these days runs on most major UNIX variants (HP-UX, Solaris, AIX, IRIX, etc.), Linux and Windows. It's used quite often, in fact. Most Enovia/VPM installations use DB2 backends, for instance. Modern versions use XML along with regular relational database stores and are very, very up-to-date technology-wise. Very scalable.
We finally have PROOF (but not real proof) (Score:3, Informative)
MySQL 5 on AMD Opteron 285 [spec.org]
The UltraSPARC has 8 cores on 1 chip and 16GB of memory.
The Opteron has 4 cores and 8GB of memory.
The UltraSPARC should smoke it every time.
Re:on the playground... (Score:1, Informative)
Databases are even being used in many situations that would be better addressed using flat files.
Re:The best way to truly compare (Score:4, Informative)
Re:performance isn't the issue (Score:5, Informative)
Do people really get that hung up on using postgres as the initial user instead of root?!? That is the ONLY difference that I can see.
Now I can agree that maybe PostgreSQL doesn't really "target" an audience, but that is also because they are a true open source project. They don't have a commercial version. I really don't think Linus sits around saying "OK, we need to add this feature so that more Fortune 500s will adopt Linux", or "we need to add feature xyz so this will appeal to small businesses". MySQL has a "target" audience because they are selling something. If you aren't selling anything, by definition you don't have a target.
Re:The best way to truly compare (Score:1, Informative)
Re:on the playground... (Score:1, Informative)
SQLite is a small and fast database that works from local files. Again, it's often embedded, and there are bindings for all major scripting languages. You can run into concurrency issues using it under heavy load, though.
Flat files are simple and fast, but you have the same concurrency issue and need to lock the file for writes.
XML is typically flat files, though Sleepycat and others offer XML databases. XML is mainly useful for interchange, where several apps need access to the same data. Leaner formats like YAML and JSON are gaining ground for serialized data.
Then you get the full blown RDBMS which are first choice for large scale apps with multiple clients. Multi user ERP, accounts packages, busy web forums - it's the right choice for these apps. Using MySQL or Postgres to catalog your personal DVD collection or for a personal blog/forum with only a handful of users is worse than overkill - it's ridiculous.
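To illustrate the parent's point about small, local-file databases, here is a minimal sketch using Python's stdlib sqlite3 bindings (the table and data are made up for the example):

```python
import sqlite3

# An in-memory DB for the demo; pass a filename instead to persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dvds (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO dvds VALUES (?, ?)",
                 [("Brazil", 1985), ("Alien", 1979)])
conn.commit()

# Full SQL with no server process and no setup -- right-sized for a
# personal DVD catalog, unlike a full MySQL/Postgres install.
rows = conn.execute("SELECT title FROM dvds WHERE year < 1980").fetchall()
print(rows)  # [('Alien',)]
```

The trade-off the parent mentions is real: SQLite writers take a database-wide lock, which is where the concurrency trouble under heavy load comes from.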
Re:Mod parent way up! (Score:4, Informative)
Re:VACUUM FULL / VACUUM ANALYZE (Score:3, Informative)
Re:Mod parent way up! (Score:5, Informative)
You can do way better than that with PostgreSQL, at least, and I suspect with MySQL as well. I wrote a benchmark similar to yours, but a good bit more complex. I had two tables, one of which was seeded and another which was populated by the benchmark. The benchmark table had six columns (int, timestamp, 4x bigint), a primary key (int + timestamp), four check constraints (on the bigints), a foreign key constraint (int, to the seeded table), and two indexes (one int, one timestamp). I would do a commit every 75k rows, with 24 such commits per iteration and 30 passes per benchmark run, so 54 million rows total. I also used a thread pool, and there are two reasons for that. First, some amount of parallelism improves DB performance. Second, it more accurately simulated our predicted usage patterns of the database. We ran my benchmark against PSQL and IBM DB2.
The results were interesting (at least, I thought they were). First, PSQL can only handle about 10 threads doing work at once; past about 10 threads, the DB completely falls apart. DB2, however, could handle more busy threads than Linux could, with a very gradual (and linear) degradation in performance past about 25 threads. I stopped testing at 100 threads. Second, PSQL's inserts per second (IPS) rate was cut in half by the end of the benchmark. DB2 followed a similar trend until about 5 million rows, at which point IPS went back up to where it started and stayed there without moving. Third, DB2 was I/O-bound, whereas PSQL was CPU-bound. I suspect that's why DB2 was able to handle an order of magnitude greater concurrency: more threads just meant the CPUs had something to do while waiting on the disks. However, it does mean that PSQL might do better with faster CPUs, whereas DB2 would not (it'd just be able to handle more threads).
And the numbers: DB2 averaged 1100 IPS, PSQL 600. Note that for the first million rows or so PSQL was faster: it just eventually dropped down to ~400 IPS after ten million rows or so, killing the average. Of course, since this table would never have fewer than 54M rows - actually, it would typically have 160M - the IPS I got at the end was the one that mattered. Also, this was on a pretty weak server, at least for this kind of workload. With more (and faster) cores, more memory, and more spindles, I'm pretty sure you could increase those numbers by 50% or more. With tuning, perhaps that much again.
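A heavily simplified sketch of the benchmark described above, using stdlib sqlite3 so it's self-contained (the original ran against PSQL and DB2 with a thread pool, 75k-row commits, and 54M rows; the schema and row counts here are scaled-down stand-ins):

```python
import sqlite3, time

BATCH = 1000    # original: 75_000 rows per commit
BATCHES = 5     # original: 24 commits/iteration x 30 passes

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # set before any DML so it takes effect
conn.execute("CREATE TABLE seed (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO seed VALUES (?)", [(i,) for i in range(10)])
# Same shape as the benchmark table: int + timestamp PK, 4 checked bigints,
# an FK to the seeded table, and a secondary index.
conn.execute("""
    CREATE TABLE bench (
        seed_id INTEGER REFERENCES seed(id),
        ts      TIMESTAMP,
        a BIGINT CHECK (a >= 0), b BIGINT CHECK (b >= 0),
        c BIGINT CHECK (c >= 0), d BIGINT CHECK (d >= 0),
        PRIMARY KEY (seed_id, ts)
    )""")
conn.execute("CREATE INDEX bench_ts ON bench (ts)")

start = time.time()
rows = 0
for batch in range(BATCHES):
    conn.executemany(
        "INSERT INTO bench VALUES (?, ?, ?, ?, ?, ?)",
        [((batch * BATCH + i) % 10, batch * BATCH + i, i, i, i, i)
         for i in range(BATCH)])
    conn.commit()          # one commit per batch, as in the original
    rows += BATCH
elapsed = time.time() - start
print(rows, "rows in", round(elapsed, 2), "s")
```

The interesting part of the parent's result is how the inserts-per-second figure trends as the table grows, which a toy run this small can't reproduce; the sketch only shows the batched-commit structure.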
Re:Mod parent way up! (Score:3, Informative)
Postgres certainly plays nicely with self joins and natural joins: the query turned out faster than trying an iterative stored proc, so we just access a view from the proc.
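The "self join wrapped in a view" pattern looks something like this — a hedged sketch using stdlib sqlite3 with a made-up employee/manager table, since the parent doesn't show their actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, "
             "name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [(1, "alice", None), (2, "bob", 1), (3, "carol", 1)])

# The self join: each employee paired with their manager's name.
# Callers (e.g. a stored proc) just select from the view, set-based,
# instead of looping row by row.
conn.execute("""
    CREATE VIEW emp_with_manager AS
    SELECT e.name AS employee, m.name AS manager
    FROM emp e JOIN emp m ON m.id = e.manager_id""")

rows = conn.execute(
    "SELECT employee, manager FROM emp_with_manager ORDER BY employee"
).fetchall()
print(rows)  # [('bob', 'alice'), ('carol', 'alice')]
```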
Re:PostgreSQL's Core Developers @ Sun? (Score:3, Informative)
Re:Mod parent way up! (Score:1, Informative)
It was also on a server loaded down by apache2/samba/X/mysql/psql and whatever else I've installed to play with over the past 2 years. So while it's not realistic as a production environment, I thought I'd share my experience, since I thought postgres would have handled at least close to mysql. I never imagined the performance would be 5 times slower, though. I'll shoot you an email for your config files, since now I'm curious.
Also, to the person who suggested I use a flat text file: I inserted with no indexes for speed and rebuilt the indexes afterwards. Furthermore, I was already importing from a flat text file; I wanted to do better than a linear search, and the DBs compress the data, which helps keep disk space usage down (MySQL went from roughly 7 GB of text to a 3.6 GB MYD file).
Re:What are the tuning parameters? (Score:2, Informative)
http://www.varlena.com/GeneralBits/Tidbits/perf.h
http://www.revsys.com/writings/postgresql-perform
There is also good material in chapter 4 (Performance) of "PostgreSQL" by Douglas and Douglas (Sams Publishing Developer's Library)
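For a rough idea, the parameters most often raised from their conservative defaults look something like this (values are purely illustrative for a dedicated box with a few GB of RAM, not the settings from the Sun submission; names, units, and syntax vary between PostgreSQL versions, so check your version's docs):

```
# postgresql.conf -- illustrative values only, tune for your hardware
shared_buffers = 50000          # shared page cache; the classic too-low default
work_mem = 16384                # per-sort/hash working memory (KB)
maintenance_work_mem = 131072   # for VACUUM, CREATE INDEX (KB)
effective_cache_size = 262144   # planner's estimate of the OS cache (pages)
checkpoint_segments = 32        # spread out checkpoint I/O
wal_buffers = 64                # WAL write buffering
```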
Re:I do not think it means what you think it means (Score:2, Informative)
PostgreSQL setup: 8 cores, 16GB RAM [spec.org].
Oracle setup: 64 cores, 512GB RAM [spec.org].
Re:Bias (Score:3, Informative)
I don't know if the same holds true for PostgreSQL, but I would be extremely surprised, since Postgres has had all of those features for years, and anyone designing a replication scheme would take them into account. MySQL, in contrast, had replication first, and now the designers of the additional features have to take the replication scheme into account. That's much harder, I'd imagine: views, triggers and stored procs all have some semblance of a standard, or at the very least a well-defined feature set, and sometimes you can't bend a feature set to fit an already existing replication scheme.
Also, even if you own the server, converting an existing MyISAM DB to InnoDB can be a huge chore. I've done this for at least 3 customers, and every time it is a much more painful process than you'd think, mostly because existing data in MyISAM almost always has referential problems after any real-world usage (like it's been live for more than a month). Someone has deleted a customer, and now you have orders that are orphaned; someone ran a bad update that set the FK wrong in the orders table, and MySQL happily obliged, and now you've got orphaned data... To me, the most BASIC functionality of an RDBMS is that it enforces the relationships you give it. I don't even count MySQL w/MyISAM as a database. Might as well just use a bunch of flat comma-separated files for tables.
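The orphaned-rows problem is easy to demonstrate. A sketch using stdlib sqlite3 (whose FK enforcement is, conveniently for the analogy, also off by default — the tables here are made up):

```python
import sqlite3

# Autocommit mode, so the PRAGMA below isn't swallowed by an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id))""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")

# FK checks off (MyISAM-style): deleting the customer orphans the order.
conn.execute("DELETE FROM customers WHERE id = 1")
orphans = conn.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
    WHERE c.id IS NULL""").fetchall()
print(orphans)  # [(10,)] -- an order pointing at nothing

# With enforcement on (InnoDB/PostgreSQL-style), the DELETE is rejected.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("INSERT INTO customers VALUES (2)")
conn.execute("INSERT INTO orders VALUES (11, 2)")
try:
    conn.execute("DELETE FROM customers WHERE id = 2")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("delete rejected:", rejected)  # delete rejected: True
```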
Re:Which MySQL? (Score:4, Informative)
The point IS pick'n'mix as you put it. [ from parent ]
I think this is a huge misconception, and completely backwards. MySQL allows you to change storage engines, but the behavior at the semantic level changes. That's the antithesis of modularity: the semantic behavior should remain constant, while the performance changes. If you change both, that's not modularity, that's a new configuration that breaks client applications.
MySQL is configureware, like PHP. Everything you get is a trade-off. Want full text indexes (MyISAM)? You have to give up transactions (InnoDB). But the marketing material always just says "Yup, we have full text indexes (MyISAM), SQL compliance (strict mode=on), transactions (InnoDB), large number of apps (strict mode=off)". Of course, many of the features are mutually exclusive. When PostgreSQL says it has a feature, it's really there.
Just need some in-memory lookups? Memory table engine. Clustered data? NDB.
Seems like the memory engine and NDB are the same thing: http://dev.mysql.com/doc/refman/5.0/en/mysql-clus
Now compare with PostgreSQL. PostgreSQL has one "storage-engine", but there are many access methods to that storage engine. There's Btree, GiST, and GIN. On top of the access methods, there are also a lot of plans. There's sequential scan, index scan (good for lookups or some sorting needs), bitmap index scan (good for AND/OR with other bitmap index scans, and always orders the I/O in file order), hash join, hash aggregate, merge join, nested loop, group aggregate, etc.
Look at all those algorithms for accessing the single storage mechanism. It's amazing. MySQL doesn't have all those, and even if it did have the algorithms, it couldn't use them. How would MySQL know when to use a hash join and when to use a merge join? PostgreSQL collects statistics, calculates expected costs, and chooses the best plan based on the given query (called a cost-based planner). MySQL uses a rule-based planner (does it have an index? ok, let's use it then). To PostgreSQL, these two queries are different:
(1) SELECT * FROM parents WHERE number_of_children=2;
(2) SELECT * FROM parents WHERE number_of_children=20;
The stats collector would know that #1 will match many more records than #2. It will probably choose a sequential scan for #1, since it needs to read every page anyway. It will probably choose an index scan for #2. Now, if that's in a subselect, postgresql will know that #1 will return a lot of records, and maybe choose a different join algorithm than if it were #2. Again, PostgreSQL chooses these plans for you based on cost -- no special configuration required.
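The cost comparison the planner makes can be sketched in a few lines. This is a toy model, not PostgreSQL's actual cost function (the real planner weighs page counts, random vs sequential I/O costs, index correlation, and more); the per-row costs here are invented for illustration:

```python
# Toy cost-based planner: pick the cheaper of a sequential scan
# (read every row once, cheap per row) and an index scan (random
# I/O per matching row, expensive per row).
def choose_plan(total_rows, matching_rows,
                seq_cost_per_row=1.0, index_cost_per_row=4.0):
    seq_cost = total_rows * seq_cost_per_row
    index_cost = matching_rows * index_cost_per_row
    return "seq scan" if seq_cost <= index_cost else "index scan"

# number_of_children = 2 matches many rows: reading every page anyway,
# so the sequential scan wins.
print(choose_plan(1_000_000, 400_000))  # seq scan
# number_of_children = 20 matches few rows: the index scan wins.
print(choose_plan(1_000_000, 50))       # index scan
```

The statistics are what make this work: without an estimate of how many rows each predicate matches, a rule-based planner has nothing to compare.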
If you want modular, it's being able to replace a full text GiST index with a full text GIN index and the application never knows the difference except performance. And by the way, the DDL in PostgreSQL is transactional, so you don't even have to restart the application when you do some major restructuring of the table (like replacing a table with an updatable view).
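Transactional DDL in a nutshell: schema changes can be bundled into a transaction and rolled back like any other change. A sketch using stdlib sqlite3, which also supports this (the behaviour in PostgreSQL is analogous; table and index names are made up):

```python
import sqlite3

# Autocommit mode so we can issue BEGIN/ROLLBACK explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE docs (id INTEGER, body TEXT)")
conn.execute("CREATE INDEX docs_old ON docs (body)")

conn.execute("BEGIN")
conn.execute("DROP INDEX docs_old")              # restructure inside a txn
conn.execute("CREATE INDEX docs_new ON docs (id, body)")
conn.execute("ROLLBACK")                         # abort: nothing happened

names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # ['docs_old'] -- the rollback undid both DDL statements
```

Had that been a COMMIT instead, running clients would simply see the new index on their next query; no restart, which is the point.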
Re:on the playground... (Score:2, Informative)
Mod parent(s) up, up, up (Score:1, Informative)
PS: Oracle is not Fortran - PL/SQL is derived from Ada and therefore shares much of the Pascal/Modula feel.