Is the One-Size-Fits-All Database Dead?

jlbrown writes "In a new benchmarking paper, MIT professor Mike Stonebraker and colleagues demonstrate that specialized databases can have dramatic performance advantages over traditional databases (PDF) in four areas: text processing, data warehousing, stream processing, and scientific and intelligence applications. The advantage can be a factor of 10 or higher. The paper includes some interesting 'apples to apples' performance comparisons between commercial implementations of specialized architectures and relational databases in two areas: data warehousing and stream processing." From the paper: "A single code line will succeed whenever the intended customer base is reasonably uniform in their feature and query requirements. One can easily argue this uniformity for business data processing. However, in the last quarter century, a collection of new markets with new requirements has arisen. In addition, the relentless advance of technology has a tendency to change the optimization tactics from time to time."
This discussion has been archived. No new comments can be posted.

  • Prediction... (Score:5, Insightful)

    by Ingolfke ( 515826 ) on Tuesday January 09, 2007 @11:00PM (#17534140) Journal
    1) More and more specialized databases will begin cropping up.
    2) Mainstream database systems will modularize their engines so they can be optimized for different applications and they can incorporate the benefits of the specialized databases while still maintaining a single uniform database management system.
    3) Someone will write a paper about how we've gone from specialized to monolithic...
    4) Something else will trigger specialization... (repeat)

    Dvorak if you steal this one from me I'm going to stop reading your writing... oh wait.
  • one size fits 90% (Score:5, Insightful)

    by JanneM ( 7445 ) on Tuesday January 09, 2007 @11:03PM (#17534158) Homepage
    It's natural to look at the edges of any feature or performance envelope: people who want to store petabytes of particle accelerator data, run complex queries to serve a million web pages a second, or have hundreds of thousands of employees doing concurrent things to the backend.

    But for most uses of databases - or any back-end processing - performance just isn't a factor, and hasn't been for years. Enron may have needed a huge data warehouse system; "Icepick Johnny's Bail Bonds and Securities Management" does not. Amazon needs the cutting edge in customer management; "Betty's Healing Crystals Online Shop (Now With 30% More Karma!)" not so much.

    For the large majority of uses - whether you measure in aggregate volume or number of users - one size really fits all.
  • Imagine that.... (Score:5, Insightful)

    by NerveGas ( 168686 ) on Tuesday January 09, 2007 @11:09PM (#17534210)
    ... a database mechanism particularly written for the task at hand will beat a generic one. Who would have thought?

    steve

    (+1 Sarcastic)
  • Dammit (Score:5, Insightful)

    by AKAImBatman ( 238306 ) * <akaimbatman@gmaYEATSil.com minus poet> on Tuesday January 09, 2007 @11:15PM (#17534238) Homepage Journal
    I was just thinking about writing an article on the same issue.

    The problem I've noticed is that too many applications are becoming specialized in ways that are not handled well by traditional databases. The key example of this is forum software. Truly hierarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system. Yet we keep trying to cram it into SQL databases, then get surprised when we're hit with performance problems and security issues. It's simply the wrong way to go about solving the problem.

    As anyone with a compsci degree or equivalent experience can tell you, creating a custom database is not that hard. In the past it made sense to go with off-the-shelf databases because they were more flexible and robust. But now that modern technology is causing us to fight with the databases just to get the job done, the time saved from generic databases is starting to look like a wash. We might as well go back to custom databases (or database platforms like BerkeleyDB) for these specialized needs.
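
    A minimal sketch of what such a custom datastore could look like for hierarchical forum data, using BerkeleyDB through Perl's DB_File module. The key scheme, fields, and sample data are invented for illustration, not taken from any real forum package: keying each post by its thread path lets a single ordered B-tree scan return a whole subtree, with no joins at all.

        # Hypothetical sketch: a hierarchical forum store on BerkeleyDB via DB_File.
        # Keys are thread paths ("story/parent/child"), so one B-tree range scan
        # returns an entire subtree in display order.
        use strict;
        use warnings;
        use DB_File;

        my %posts;
        my $db = tie %posts, 'DB_File', 'forum.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
            or die "Cannot open forum.db: $!";

        # Store a comment under its thread path (the value could just as well be a
        # frozen structure or a blob; plain text keeps the sketch short).
        sub add_post {
            my ($path, $author, $body) = @_;
            $posts{$path} = join "\0", $author, $body;
        }

        # Fetch a post and all of its descendants with a single ordered scan.
        sub get_subtree {
            my ($path) = @_;
            my @found;
            my ($k, $v) = ($path, '');
            for (my $st = $db->seq($k, $v, R_CURSOR); $st == 0; $st = $db->seq($k, $v, R_NEXT)) {
                last unless index($k, $path) == 0;    # walked past the subtree
                my ($author, $body) = split /\0/, $v, 2;
                push @found, { path => $k, author => $author, body => $body };
            }
            return @found;
        }

        add_post('17534140',          'Ingolfke',    'Prediction...');
        add_post('17534140/17534238', 'AKAImBatman', 'Dammit, I was about to write that.');
        print "$_->{path}: $_->{author}\n" for get_subtree('17534140');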
  • Duh (Score:5, Insightful)

    by Reality Master 101 ( 179095 ) <<moc.liamg> <ta> <101retsaMytilaeR>> on Tuesday January 09, 2007 @11:18PM (#17534258) Homepage Journal

    Who thinks that a specialized application (or algorithm) won't beat a generalized one in just about every case?

    The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly -- programmer time is expensive, and hardware is cheap.

    See also: high level compiled languages versus assembly language*.

    (*and no, please don't quote the "magic compiler" myth... "modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.)

  • Re:Duh (Score:2, Insightful)

    by kfg ( 145172 ) on Wednesday January 10, 2007 @12:00AM (#17534606)
    The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly. . .

    . . .has some level of definable and guaranteed data integrity.

    KFG
  • by suv4x4 ( 956391 ) on Wednesday January 10, 2007 @12:06AM (#17534660)
    We're all sick of "new fad: X is dead?" articles. Please reduce lameness to an acceptable level!
    Can't we get used to the fact that specialized & new solutions don't magically kill an existing popular solution to a problem?

    And it's not a recent phenomenon, either. I bet it goes back to when the first proto-journalistic phenomena formed in early human societies, and it haunts us to this very day...

    "Letters! Spoken speech dead?"

    "Bicycles! Walking on foot dead?"

    "Trains! Bicycles dead?"

    "Cars! Trains dead?"

    "Aeroplanes! Trains maybe dead again this time?"

    "Computers! Brains dead?"

    "Monitors! Printing dead yet?"

    "Databases! File systems dead?"

    "Specialized databases! Generic databases dead?"

    In a nutshell: don't forget that a database is a very specialized form of storage system; you can think of it as a very special sort of file system. It didn't kill file systems (as noted above), so specialized systems will thrive just as well without killing anything.
  • Re:Duh (Score:5, Insightful)

    by wcbarksdale ( 621327 ) on Wednesday January 10, 2007 @12:07AM (#17534670)
    Also, to successfully hand-optimize you need to remember a lot of details about instruction pipelines, caches, and so on, which is fairly detrimental to remembering what your program is supposed to do.
  • Re:Duh (Score:2, Insightful)

    by mparker762 ( 315146 ) on Wednesday January 10, 2007 @12:20AM (#17534776) Homepage
    Only someone who hasn't recently replaced some critical C code with assembler and gotten a substantial improvement would say that. This was MSVC 2003, which isn't the smartest C compiler out there, but not a bad one for the architecture. Still, a few hours with the assembler and a few more hours doing some timings to fine-tune things improved the CPU performance of this particular service by about 8%.

    Humans have been writing optimized assembler for decades; the compilers are still trying to catch up. Modern hand-written assembler isn't necessarily any trickier or more clever than the old stuff (it's actually a bit simpler). Yes, compilers are using complicated and advanced techniques, but it's still all an attempt to approximate what humans do easily and intuitively. Artificial intelligence programs use complicated and advanced techniques too, but no one would claim that this suddenly makes philosophy any harder.

    Your second point about the sophistication of the CPUs is true but orthogonal to the original claim. These sophisticated CPUs don't know who wrote the machine code; they do parallel execution and branch prediction and so forth on hand-optimized assembly just like they do on compiler-generated code. Which is one reason (along with extra registers and less segment BS) that it's easier to write and maintain assembler nowadays, even well-optimized assembler.

  • Re:Dammit (Score:3, Insightful)

    by AKAImBatman ( 238306 ) * <akaimbatman@gmaYEATSil.com minus poet> on Wednesday January 10, 2007 @12:30AM (#17534868) Homepage Journal
    But is this actually happening? Has slashdot had to abandon general-purpose RDBMS?

    I wasn't referring to Slashdot in particular, but rather general web forum software. Your PhpBB, vBulletins, and JForums of the world are more along the lines of what I'm referring to. After dealing with the frustrations of setting up, managing, and hacking projects like these, I've come to the conclusion that the backend datastore is the problem. The relational theories still hold true, but the SQL database implementations simply aren't built with CLOBs and BLOBs in mind.

    That being said, Slashdot is a fairly good example of how they've worked around the limitations of their backend database at a cost equalling or far exceeding the cost of building a customized data store. A costly venture that bit them in the rear [slashdot.org] when they reached their maximum post count.

    Not that I'm criticizing Slashcode. Hindsight is 20/20. It's just becoming more and more apparent that for some applications the cost of using an off-the-shelf database has become greater than the cost of building a custom datastore.
  • Re:Dammit (Score:2, Insightful)

    by Tablizer ( 95088 ) on Wednesday January 10, 2007 @12:46AM (#17534984) Journal
    The relational theories still hold true, but the SQL database implementations simply aren't built with CLOBs and BLOBs in mind.

    That is very true. They don't seem to have perfected the performance handling of highly variable "cells".

    That being said, Slashdot is a fairly good example of how they've worked around the limitations of their backend database at a cost equalling or far exceeding the cost of building a customized data store. A costly venture that bit them in the rear

    Last night we crossed over 16,777,216 comments in the database. The wise amongst you might note that this number is 2^24, or in MySQLese an unsigned mediumint. Unfortunately, like 5 years ago we changed our primary keys in the comment table to unsigned int (32 bits, or 4.1 billion) but neglected to change the index that handles parents.


    It would be nice if more RDBMSes offered flexible integers, so that you didn't have to pick a size up front. Fixed sizes (smallint, int, long) are from an era when variable-width column calculations were too expensive CPU-wise. Since then CPU has become cheap relative to "pipeline" issues, so variable-width columns can be just as efficient as fixed ones while taking only the space they need.

    But it would not have been hard for Slashdot to use a big integer up front. They chose to be stingy and made a gamble; it was not forced on them. It may have cost a few cents more early on, but it would have prevented that disaster. Plus, bleep happens no matter what technology you use. I am sure dedicated-purpose databases have their own gotchas and trade-off decision points. Being dedicated probably means they are less road-tested, too.
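
    For the curious, the fix for the kind of overflow quoted above is a one-line schema change (though rebuilding a 16-million-row table is not quick). A rough sketch through Perl's DBI; the table and column names are made up for illustration, since the real Slashcode schema isn't reproduced here.

        # Hypothetical sketch: spot an index column that is close to overflowing its
        # integer type and widen it. Table/column names are invented, not Slashcode's.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('DBI:mysql:database=forum', 'user', 'password',
                               { RaiseError => 1 });

        # MEDIUMINT UNSIGNED tops out at 2^24 - 1 = 16,777,215.
        my ($max_cid) = $dbh->selectrow_array('SELECT MAX(cid) FROM comments');
        my $mediumint_limit = 2**24 - 1;

        if ($max_cid > 0.9 * $mediumint_limit) {
            warn "cid is at $max_cid of $mediumint_limit -- widening the parent key\n";
            # Rebuilds the table; on a big comment table this blocks writers for a while.
            $dbh->do('ALTER TABLE comments MODIFY parent INT UNSIGNED NOT NULL DEFAULT 0');
        }

        $dbh->disconnect;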
         
  • Re:Duh (Score:2, Insightful)

    by suv4x4 ( 956391 ) on Wednesday January 10, 2007 @02:20AM (#17535618)
    This was MSVC 2003 which isn't the smartest C compiler out there, but not a bad one for the architecture. Still, a few hours with the assembler and a few more hours doing some timings to help fine-tune things improved the CPU performance of this particular service by about 8%... These sophisticated CPU's don't know who wrote the machine code, they do parallel execution and branch prediction and so forth on hand-optimized assembly just like they do on compiler-generated code. Which is one reason (along with extra registers and less segment BS) that it's easier to write and maintain assembler nowadays, even well-optimized assembler.

    Do you know which kinds of instructions, when ordered in groups of four, will execute at once on a Core Duo? Incidentally, they're not the ones that will on a Pentium 4.

    I hope you're happy with your 8% improvement; enjoy it until your next CPU upgrade, which will require a different approach to assembly optimization.

    The advantage of a compiler is that compiling for a target CPU is a matter of a compiler switch, so compiler programmers can concentrate on performance and smart use of the CPU specifics, and you can concentrate on your program features.

    If you were that concerned about performance in the first place, you'd use a compiler provided by the processor vendor (Intel, I presume) and use the Intel libraries for processor-specific implementations of the common math and algorithmic routines applications need.

    Most likely this would have given you more than an 8% boost while keeping your code somewhat less bound to a specific CPU than assembler does.

    An example of "optimization surprise" i like, is the removal of the barrel shifter in Pentium 4 CPU-s. You see, lots of programmers know that it's faster (on most platforms) to bit shift, and not multiply by 2, 4, 8, etc (or divide).

    But bit shifting on P4 is handled by the ALU, and is slightly slower than multiplication (why, I don't know, but it's a fact). Code "optimized" for bit shifting would be "antioptimized" on P4 processors.

    I know some people adapted their performance critical code to meet this new challenge. But then what? P4 is obsolete and instead we're back to the P3 derived architecture, and the barrel shifter is back!

    When I code a huge and complex system, I'd rather buy a 8% faster machine and use a better compiler than have to manage this hell each time a CPU comes out.
  • Re:Duh (Score:4, Insightful)

    by try_anything ( 880404 ) on Wednesday January 10, 2007 @02:59AM (#17535854)
    Modern compilers apply optimizations at such a sophisticated level that it would be a nightmare for a human to maintain a comparably optimized solution.

    There are three quite simple things that humans can do that aren't commonly available in compilers.

    First, a human gets to start with the compiler output and work from there :-) He can even compare the output of several compilers.

    Second, a human can experiment and discover things accidentally. I recently compiled some trivial for loops to demonstrate that array bounds checking doesn't have a catastrophic effect on performance. With the optimizer cranked up, the loop containing a bounds check was faster than the loop with the bounds check removed. That did not inspire confidence.

    Third, a human can concentrate his effort for hours or days on a single section of code that profiling revealed to be critical and test it using real data. Now, I know JIT compilers and some specialized compilers can do this stuff, but as far as I know I can't tell gcc, "Compile this object file, and make the foo function as fast as possible. Here's some data to test it with. Let me know on Friday how far you got, and don't throw away your notes, because we might need further improvements."

    I hope I'm wrong about my third point (please please please) so feel free to post links proving me wrong. You'll make me dance for joy, because I do NOT have time to write assembly, but I have a nice fast machine here that is usually idle overnight.

  • Re:Death to Trees! (Score:3, Insightful)

    by suv4x4 ( 956391 ) on Wednesday January 10, 2007 @03:02AM (#17535874)
    Anyhow, I *wish* file systems were dead. They have grown into messy trees that are unfixable because trees can only handle about 3 or 4 factors and then you either have to duplicate information (repeat factors), or play messy games, or both.

    You know, I've seen enough RDBMS designs to know the "messiness" is not the fault of the file systems (or of databases, for that matter).

    Sets have more issues than you describe, and you know very well that Vista had lots of set-based features that were later scaled back, hidden, and reduced - not because WinFS was dropped (the sets in Vista don't use WinFS; they work with indexing too), but because they were terribly confusing to users.
  • by sqlgeek ( 168433 ) on Wednesday January 10, 2007 @03:12AM (#17535930)
    I don't think that you know Oracle very well. Let's say you want to scale, so you want clustering or grid functionality -- built into Oracle. Let's say that you want to partition your enormous table into one physical table per month or quarter -- built in. Oh, and if you query the whole giant table, you'd like parallel processes to run against each partition, balanced across your cluster or grid -- yeah, that's built in too. Let's say you almost always fetch a group of data together rather than piece by piece, so you want it physically colocated to reduce disk I/O -- built in.

    This is why you pay a good wage for your Oracle data architect & DBA -- so that you can get people who know how to do these sort of things when needed. And honestly I'm not even scratching the surface.

    Consider a data warehouse for a giant telecom in South Africa (with a DBA named Billy, in case you wondered). You have over a billion rows in your main fact table, but you're only interested in a few thousand of those rows. You have an index on dates, another index on geographic region, and another index on customer. Any one of those indexes will reduce the 1.1 billion rows to tens of millions of rows, but all three restrictions together will reduce it to a few thousand. What if you could read the three indexes, perform bitmap comparisons on the results to get only the rows that match all three, and then fetch just those few thousand rows from the 1.1-billion-row table? Yup, that's built in, and Oracle does it for you behind the scenes.

    Now yeah, you can build a faster single-purpose DB. But you'd better have a goddamned lot of dev hours allocated to the task. My bet is that you'll come out way ahead in cash and time to market with Oracle, a good data architect, and a good DBA. Any time you want to put your money on the line, you let me know.
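
    A rough sketch of the access path described above, driven from Perl with DBD::Oracle. The schema, index names, and bind values are invented for illustration; the point is only that the bitmap indexes are plain DDL and the query is ordinary SQL - the AND-ing of bitmaps happens inside the optimizer, not in application code.

        # Hypothetical sketch: single-column bitmap indexes on a large fact table,
        # combined by the optimizer (BITMAP AND) so only the matching rows are fetched.
        # Table, columns, and bind values are invented for illustration.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Oracle:DWH', 'scott', 'tiger', { RaiseError => 1 });

        # One bitmap index per dimension key (DDL would normally live in a migration
        # script; it is inlined here to keep the sketch self-contained).
        $dbh->do($_) for (
            'CREATE BITMAP INDEX calls_call_date_bx ON calls (call_date)',
            'CREATE BITMAP INDEX calls_region_bx    ON calls (region_id)',
            'CREATE BITMAP INDEX calls_customer_bx  ON calls (customer_id)',
        );

        # An ordinary query; Oracle can AND the three bitmaps together before touching
        # the fact table, so a billion-row table costs only a few thousand row fetches.
        my $sth = $dbh->prepare(q{
            SELECT call_date, duration_sec, charge
              FROM calls
             WHERE call_date   = TO_DATE(?, 'YYYY-MM-DD')
               AND region_id   = ?
               AND customer_id = ?
        });
        $sth->execute('2007-01-09', 42, 1234567);

        while (my ($when, $secs, $charge) = $sth->fetchrow_array) {
            print "$when: ${secs}s, $charge\n";
        }
        $dbh->disconnect;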
  • by TubeSteak ( 669689 ) on Wednesday January 10, 2007 @03:23AM (#17535984) Journal
    For the large majority of uses - whether you measure in aggregate volume or number of users - one size really fits all.
    I'm willing to concede that...
    But IMO it is not 100% relevant.

    Large corporate customers usually have a large effect on what features show up in the next version of [software]. Software companies put a lot of time & effort into pleasing their large accounts.

    And since performance isn't a factor for the majority of users, they won't really be affected by any performance losses resulting from increased specialization/optimizations. Right?
  • by mysticgoat ( 582871 ) on Wednesday January 10, 2007 @04:15AM (#17536252) Homepage Journal

    As any English teacher will tell you, any language that will support great poetry and prose will also make it possible to write the most gawdawful cr*p. Perl bestows great powers, but the perl user must temper his cleverness with wisdom if he is to truly master his craft.

    However in this specific case Google reveals that

    ## The Y Combinator
    sub Y (&) {
        my $le = shift;
        return &{
            sub {
                &{ sub { my $f = shift; &$f($f) } }
                 (sub { my $f = shift; &$le(sub { &{ &$f($f) }(@_) }) });
            }
        };
    }
    was simply "borrowed" from y-combinator.pl [synthcode.com]. This is an instance of Perl being used in a self-referential manner to add a new capability (the Y combinator allows recursion of anonymous subroutines (why anyone would bother to do such an arcane thing comes back to the English teacher's remarks)). Self-referential statements are always difficult to understand because, well, they just are that way (including this one).
  • Re:Duh (Score:3, Insightful)

    by RAMMS+EIN ( 578166 ) on Wednesday January 10, 2007 @07:07AM (#17537176) Homepage Journal
    Also, the compiler may know more CPUs than you do. For example, do you know the pairing rules for instructions on an original Pentium? The differences one must pay attention to when optimizing for a Thoroughbred Athlon vs. a Prescott P4 vs. a Yonah Pentium-M vs. a VIA Nehemiah? GCC does a pretty good job of generating optimized assembly code for each of these from the same C source code. If you were to do the same in assembly, you would have to write separate code for each CPU, and know the subtle differences as well as the compiler does.
  • Re:Perl & CSV (Score:3, Insightful)

    by shotgunefx ( 239460 ) on Wednesday January 10, 2007 @10:24AM (#17538836) Journal
    Outside of "Golfing", I'd strongly disagree. I don't think the community encourages it for the most.

    This is from someone who's spent the last seven years with Perl and in the community. YMMV

  • Re:Prediction... (Score:3, Insightful)

    by Pseudonym ( 62607 ) on Wednesday January 10, 2007 @07:17PM (#17548346)

    Z39.50 is actually much, much more than mere "text searching". If you think hard about the way that you interact with a library catalogue or Google compared with how you interact with a RDBMS, you'll realise there are quite a few more differences than just "text searching".

    Think about highly heterogeneous data. Libraries, for example, might index books, periodicals, audio-visual items and online resources such as journals. Google indexes web pages, Usenet news articles, PDF documents and so on. And you can search them all by "title".

    Think about "result sets" instead of sequences of tuples. When you search google, or a library catalogue, what you get is a bunch of summary information which you page through, then eventually retrieve the record that you want. Or you might refine your query by adding new search terms or sorting your results by some key. The key data structure here is the "result set": a sequence of record numbers. Everything happens to result sets. You sort your results by state, or intersect the set with another query. The whole process is record-oriented. SQL, on the other hand, is data-oriented: the central data structure is a sequence of tuples, and tuples contain real data.

    I hear you objecting that there are ways to do this in SQL, and you'd be right. But in this kind of application, it's always going to be at the expense of a lot more time (more processing grunt required, or less opportunity to exploit disk locality) or much more disk space, if only because of the extra indirection required. If you have terabytes of information, this bites, and bites hard. You wouldn't use Google or your library catalogue if it were ten times slower.

    SQL is optimised for the case where data is "right there". Z39.50 is optimised for the case where accessing real data is expensive, because it might involve parsing XML or PDF. People complain about how supposedly inefficient XML data is, but the fact is, there's no better way to do text with structure. The real problems are a) people use XML for things that aren't structured text, and b) relational databases can't handle it with reasonable efficiency at the moment.

    Yes, I know, SQL will eventually be able to handle things like this. But it's not there yet.
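
    A toy sketch of the result-set style described above, in Perl since that's what is already floating around this thread. Everything here (the inverted index, fetch_record, the data) is invented for illustration: queries return lists of record numbers, sets get intersected, and the expensive record bodies are fetched only for the page actually displayed.

        # Toy result-set model: searches yield sorted record numbers, set operations
        # work on those ID lists, and full records are fetched only for the page shown.
        use strict;
        use warnings;
        use List::Util qw(min);

        # A "search" consults an inverted index and returns sorted record numbers.
        my %title_index = (
            'database'  => [3, 7, 19, 42, 105],
            'benchmark' => [7, 42, 88],
        );
        sub search_title { @{ $title_index{ lc $_[0] } || [] } }

        # Intersect two sorted ID lists -- the core result-set operation.
        sub intersect {
            my ($x, $y) = @_;
            my @out;
            my ($i, $j) = (0, 0);
            while ($i < @$x && $j < @$y) {
                if    ($x->[$i] < $y->[$j]) { $i++ }
                elsif ($x->[$i] > $y->[$j]) { $j++ }
                else  { push @out, $x->[$i]; $i++; $j++ }
            }
            return \@out;
        }

        # Only now touch the expensive part: pulling and parsing the actual record.
        sub fetch_record { my ($id) = @_; return "record #$id (parsed from XML/PDF here)" }

        my $hits      = intersect([search_title('database')], [search_title('benchmark')]);
        my $page_size = 10;
        for my $id (@{$hits}[0 .. min($#$hits, $page_size - 1)]) {
            print fetch_record($id), "\n";
        }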
