Is the One-Size-Fits-All Database Dead? 208

Posted by kdawson on Tuesday January 09, 2007 @10:50PM from the specialized-and-optimized dept.

jlbrown writes "In a new benchmarking paper, MIT professor Mike Stonebraker and colleagues demonstrate that specialized databases can have dramatic performance advantages over traditional databases (PDF) in four areas: text processing, data warehousing, stream processing, and scientific and intelligence applications. The advantage can be a factor of 10 or higher. The paper includes some interesting 'apples to apples' performance comparisons between commercial implementations of specialized architectures and relational databases in two areas: data warehousing and stream processing." From the paper: "A single code line will succeed whenever the intended customer base is reasonably uniform in their feature and query requirements. One can easily argue this uniformity for business data processing. However, in the last quarter century, a collection of new markets with new requirements has arisen. In addition, the relentless advance of technology has a tendency to change the optimization tactics from time to time."

Is the One-Size-Fits-All Database Dead?

This discussion has been archived. No new comments can be posted.

Search 208 Comments Log In/Create an Account

Comments Filter:

Noticed how roll your own is faster? (Score:2, Interesting)

by BillGatesLoveChild ( 1046184 ) writes: on Tuesday January 09, 2007 @10:58PM (#17534126) Journal

Have you noticed when you code your own routines for manipulating data (in effect, your own application specific database) you can produce stuff that is very, very fast? In the good old days of the Internet Bubble 1.0 I took an application specific database like this (originally for a record store) and generalized it into a generic database capable of handling all sorts of data. But every change I made to make the code more general also made it less efficient. The end result wasn't bad by any means: we solid it as an eCommerce database to a number of solutions, but as far as the original record store database went, the original version was by far the best. Yes. I *know* generic databases with fantastic optimization engines designed by database experts should be faster, but noticed how much time you have to spend with the likes of Oracle or MySQL trying to get it to do what to you is an exceedingly obvious way of doing something?

Re:Noticed how roll your own is faster? (Score:5, Interesting)

by smilindog2000 ( 907665 ) writes: <bill@billrocks.org> on Tuesday January 09, 2007 @11:16PM (#17534244) Homepage

I write all my databases with the fairly generic DataDraw database generator. The resulting C code is faster that if you wrote it manually using pointers to C structures (really). http:datadraw.sourceforge.net [sourceforge.net]. Its generic, and faster than anything EVER.

Re:Prediction... (Score:4, Interesting)

by Tablizer ( 95088 ) writes: on Tuesday January 09, 2007 @11:47PM (#17534506) Journal

2) Mainstream database systems will modularize their engines so they can be optimized for different applications and they can incorporate the benefits of the specialized databases while still maintaining a single uniform database management system.

I agree with this prediction. Database interfaces (such as SQL) do not dictate implimentation. Ideally, query languages only ask for what you want, not tell the computer how to do it. As long as it returns the expected results, it does not matter if the database engine uses pointers, hashes, or gerbiles to get the answer. It may however require "hints" in the schema about what to optimize. Of course, you will sacrifice general-purpose performance to speed up a specific usage pattern. But at least they will give you the option.

It is somewhat similar to what "clustered indexes" do in some RDBMS. Clusters improve the indexing by a chosen key at the expense of other keys or certain write patterns by physically grouping the data by that *one* chosen index/key order. The other keys still work, just not as fast.

This has been known for years already (Score:3, Interesting)

by TVmisGuided ( 151197 ) writes: <(moc.liamg) (ta) (pmuj.nala)> on Tuesday January 09, 2007 @11:56PM (#17534566) Homepage

Sheesh...and it took someone from MIT to point this out? Look at a prime example of a high-end, heavily-scaled, specialized database: American Airlines' SABRE. The reservations and ticket-sales database system alone is arguably one of the most complex databases ever devised, is constantly (and I do mean constantly) being updated, is routinely accessed by hundreds of thousands of separate clients a day...and in its purest form, is completely command-line driven. (Ever see a command line for SABRE? People just THINK the APL symbol set looked arcane!) And yet this one system is expected to maintain carrier-grade uptime or better, and respond to any command or request within eight seconds of input. I've seen desktop (read: non-networked) Oracle databases that couldn't accomplish that!

Re:Duh (Score:3, Interesting)

by suv4x4 ( 956391 ) writes: on Tuesday January 09, 2007 @11:57PM (#17534580)

"modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.

Only people who haven't seen recent advancements in CPU design and compiler architecture will say what you just said.

Modenr compilers apply optimizations on a so sophisticated level that would be a nightmare for a human to support such a solution optimized.

As an example, modern Intel processors can process certain "simple" commands in parallel and other commands are broken apart into simpler commands, processed serially. I'm simplifying the explanation a great deal, but anyone who read about how a modern CPU works, branch prediction algorithms and so on is familiar with the concept.

Of course "they can beat human written assembly code in just about every case" is an overstatement, but still, you gotta know there's some sound logic & real reasons behind this "myth".

Re:Perl & CSV (Score:4, Interesting)

by nuzak ( 959558 ) writes: on Wednesday January 10, 2007 @12:39AM (#17534930) Journal

> It was my understanding that after you've got associative arrays you can get to any other conceivable data structure

Once you have lambda you can get to any conceivable data structure. The question is, do you really want to?

sub Y (&) { my $le=shift; return &{sub {&{sub {my $f=shift; &$f($f)}}(sub {my $f=shift; &$le(sub {&{&$f($f)}(@_)})});}}}

Re:Prediction... (Score:3, Interesting)

by Pseudonym ( 62607 ) writes: on Wednesday January 10, 2007 @01:56AM (#17535468)

Interfaces like SQL don't dictate the implementation, but they do dictate the model. Sometimes, the model that you want is so far from the interface language, that you need to either extend or replace the interface language for the problem to be tractable.

SQL's approach has been to evolve. It isn't quite "there" for a lot of modern applications. I can forsee a day when SQL can efficiently model all the capabilities of, say, Z39.50, but we're not there now.

Death to Trees! (Score:3, Interesting)

by Tablizer ( 95088 ) writes: on Wednesday January 10, 2007 @02:50AM (#17535770) Journal

Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems

Very specialized? Please explain. Anyhow, I *wish* file systems were dead. They have grown into messy trees that are unfixable because trees can only handle about 3 or 4 factors and then you either have to duplicate information (repeat factors), or play messy games, or both. They were okay in 1984 when you only had a few hundred files. But they don't scale. Category philosophers have known since before computers that hierarchy taxonomies were limited.

The problem is that the best alternative, set-based file systems, have a longer learning curve than trees. People pick up hierarchies pretty fast, but sets take longer to click. Power does not always come easy. I hope that geeks start using set-oriented file systems and then others catch up. The thing is that set-oriented file systems are enough like relational that one might as well use relational. If only the RDBMS were performance-tuned for file-like uses (with some special interfaces added).

Re:Perl & CSV (Score:5, Interesting)

by patio11 ( 857072 ) writes: on Wednesday January 10, 2007 @04:08AM (#17536216)

I think it implements a Y combinator. Then again, it could just print out "Just another perl hacker". But I'm guessing on the Y combinator. Lets break it down so its readable:

sub Y (&) {
my $le=shift;
return &{
sub { ## SUB_A
&{
sub { ## SUB_B
my $f=shift;
&$f($f)
}
} ##Close SUB_A's block
(sub { ## SUB_C
my $f=shift;
&$le(sub { ##SUB_D
&{
&$f($f)
}
(@_)
}## END SUB_D
)} ##END SUB_C
); ##End the block enclosing SUB_C
} ## END SUB_A
} ## Close the return line
} ##Close sub Y

Y can have any number of parameters you want (this is sort of a "welcome to Perl, n00b, hope you enjoy your stay" bit of pain). The first line of the program assigns le to the first parameter and pops that one off the list. That & used in the next line passes the rest of the list to the function he's about to declare. So we're going to be returning the output of that function evaluated on the remaining argument list. Clear so far?

OK, moving on to SUB_A. We again use the & to pass the list of arguments through to ... another block. This one actually makes sense if you look at it -- take the first argument from the list, evaluate it as a function on itself. We're assuming that is going to return a function. Why? Because that opening parent means we have arguments, such as they are, coming to the function.

OK, unwrapping the arguments. There is only one argument -- a block of code encompassing SUB_C. (Wasted 15 minutes figuring that out. Thats what I get for doing this in Notepad instead of an IDE that would auto-indent for me. Friends don't let friends read Perl code.)

By now, bits and pieces of this are starting to look almost easy, if no closer to actual readable computer code. We reuse the function we popped from the list of arguments earlier, and we use the same trick to get a second function off of the argument list. We then apply that function to itself, assume the result is a function, and then run that function on the rest of the argument list. Then we pop that up the call stack and we're, blissfully, done.

So, now that we understand WTF this code is doing, how do we know its the Y combinator? Well, we've essentially got a bunch of arguments (f, x, whatever). We ended up doing LAMBDA(f,(LAMBDA(x,f (x x)),(LAMBDA(x,f (x x)))) . Which, since I took a compiler class once and have the nightmares to prove it, is the Y combinator.

Now you want to know the REALLY warped thing about this? I program Perl for a living (under protest!), I knew the answer going in (Googled the code), and I have an expensive theoretical CS education which includes all of the concepts trotted out here... and the Perl syntax STILL made me bloody swim through WTF was going on.

I. Hate. Perl.

And the reason I hate Perl, more than the fact that the language makes it *possible* to have monstrosities like that one-liner, is that the community which surrounds the language actively encourages them.
Read the rest of this comment...

Re:No specifics (Score:2, Interesting)

by dedrop ( 308627 ) writes: on Wednesday January 10, 2007 @05:24AM (#17536612)

There's a reason for that. Many years ago, the Wisconsin database group (David DeWitt in particular) authored one of the first popular database benchmarks, the Wisconsin benchmarks. They showed that some databases performed embarrassingly poorly, which made a lot of people really angry. In fact, Larry Ellison got so angry, he tried to get DeWitt fired (Ellison wasn't clear on the concept of tenure). Since then, major databases have a "DeWitt clause" in their end-user license, which says that the name of the database can't be used when reporting benchmark results.

And this years ahead of Microsoft not allowing users to benchmark Vista at all!

Creative Commons License (Score:3, Interesting)

by pfafrich ( 647460 ) writes: <rich@noSPaM.singsurf.org> on Wednesday January 10, 2007 @07:27AM (#17537278) Homepage

Has anyone noticed the This article is published under a Creative Commons License Agreement, its the first time I've seen this applied to an academic paper. Another small step for the open-content movement.

Re:Noticed how roll your own is faster? (Score:3, Interesting)

by fingusernames ( 695699 ) writes: on Wednesday January 10, 2007 @01:21PM (#17541934) Homepage

Back in the late 90s, I worked on a data warehouse project. We tried Oracle, and had an Oracle tuning expert work with us. However, we couldn't get the performance we needed. We wound up developing a custom "database" system, where data was extracted from the source databases (billing, CDRs, etc.) and de-normalized into several large tables in parallel. The de-normalization performed global transformations and corrections. Those tables were then loaded into shared memory (64bit HP multi-CPU system with a huge amount of RAM for those days, 32GB IIRC), indices were built, and a highly optimized algorithm (over time it kept getting tighter and smaller) was used to join the data based on various criteria using standard, left, right and some hybrid methods. The join algorithm operated on pointers to tables of pointers. Initially, developers used a PERL script to pre-process simple pseudo-SQL into C code/macros, that would be linked to their report application. As the project grew, I developed a SQL-derived language that was run through a cross-compiler to generate the C code and macros to link to applications. That language supported joins, views, temporary tables, and some other useful features that enabled developers to work quickly in implementing report requests. The system was very fast for our purposes, performing fraud analysis and sales trends analysis nightly. In parallel to that analysis on a different server, the de-normalized data was also exported to a Redbrick database so users could perform desktop reporting over historical data. I was the overall technical architect for system, and the developer of the joining system and the SQL-like language and compiling/development tools. I'm sure that today though there are data warehouse specific tools that would eliminate most of that.

Larry

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Is the One-Size-Fits-All Database Dead? 208

Is the One-Size-Fits-All Database Dead? More Login

Is the One-Size-Fits-All Database Dead?

Noticed how roll your own is faster? (Score:2, Interesting)

Re:Noticed how roll your own is faster? (Score:5, Interesting)

Re:Prediction... (Score:4, Interesting)

This has been known for years already (Score:3, Interesting)

Re:Duh (Score:3, Interesting)

Re:Perl & CSV (Score:4, Interesting)

Re:Prediction... (Score:3, Interesting)

Death to Trees! (Score:3, Interesting)

Re:Perl & CSV (Score:5, Interesting)

Re:No specifics (Score:2, Interesting)

Creative Commons License (Score:3, Interesting)

Re:Noticed how roll your own is faster? (Score:3, Interesting)

Related Links Top of the: day, week, month.

Slashdot Top Deals

Slashdot