Beyond Relational Databases 360
CowboyRobot writes "Relational databases were developed in the 1970s as a way of improving the efficiency of complex systems.
But modern warehousing of data results in terabytes of information that needs to be organized, and the growing prevalence of mobile devices points to the increasing need for intelligent caching on the local hardware.
According to the ACM, the future of database architecture must include more modularity and configuration.
Although no concrete solutions are included, the article is a good overview of the problems with modern data systems."
KISS (Score:5, Insightful)
1) Overly complex
2) Don't scale
3) Tied to a single platform/implementation
4) Poor performance
It's typical to see all four in a single try!
SQL, on the other hand:
1) Reasonably simple API
2) Scales to very large databsaes
3) Cross-platform/architecture
4) Performs very well.
Given the insane amount of inertia SQL has, it will extend into an object model, rather than be replaced by one. (EG: C/C++)
Just because something is "old" (Score:4, Insightful)
Just because people are too stupid to take the time to read and understand the theory and learn the application doesn't mean the technology is no longer relevant.
Of course no solutions are proposed. There are none because relational theory is correct, and appropriate for real database driven applications. Little crap bulletin boards can use MySQL.
Netcraft confirms relational databases are dead!
Wake me up when it's ready (Score:3, Insightful)
Have been crying for the need to replace relational databases since the early nineties at least.
We can all see where that got them.
. . . no concrete solutions are included. . . (Score:3, Insightful)
KFG
Re:I did not RTFA (Score:5, Insightful)
Or the summary
mySQL suits me quite well.
That's nice. It won't handle a multi-terabyte database, though. That's the domain of Terabase, Oracle, and (blech) DB2. It's also what the article is about.
The power of PHP and mySQL is all I need.
And a moped is all you need to get to work. If you want to haul 300 metric tons of rock from point A to point B, you need a dump truck. Again, that's what this article is about.
Back on topic, this entire article is mostly speculative for the moment. A lot of excellent work has been done in OODB and XMLDB designs, but no singular design has yet emerged to solve all our woes. For example, I love the Prevayler [prevayler.org] concept. It solves a lot of problems, lowers data access times, and provides for complete data security. It also isn't usable or scalable without a lot more design work.
The future will hold some very interesting things, but for now we'll have to keep inventing until we come up with a consolidated solution.
Re:Databases? Bah! (Score:2, Insightful)
Re:I did not RTFA (Score:5, Insightful)
Re:SQL Dying...film at 11...NOT! (Score:3, Insightful)
In the future, databases will be more "good."
While we can't yet go into detail as to how this will all work, suffice it to say that we have a pretty solid idea what the future holds.
Why 'Beyond'? (Score:5, Insightful)
Essayist Clay Shirky has gone to far as to suggest that MySQL is at the center of a whole new software movement [shirky.com].
In my experience with Web applicaions the chief problem with the RDBMS seems to be that it does not do text indexing and search very well, so I have to keep a second store of data in something like Lucene.
The other major problem is the level of skill required to tune the database to achieve high-performance SQL queries, so hopefully the RDBMS will evolve with more self-configuration capability.
The article, which I only skimmed, actually addresses these two concerns but seems to pooh-pooh the notion of simply refining the existing RDBMS systems. Instead it says " Old-style database systems solve old-style problems; we need new-style databases to solve new-style problems. "
The paper seems awfully squishy on what this means. The clearest I found was a call to "produce a storage engine that is more configurable so that it can be tuned to the requirements of individual applications."
But this call for new highly modular/configurable storage "engines" seems to me to require at least as much fussy care and feeding as a traditional RDBMS. You're just replacing one DBA with another. And throwing out decades of refinement in the process.
The raison d'etre of the RDBMS is to allow the programmer to treat storage as a black box while gaining nifty ACID features. Extending this to text indexing seems logical.
Kind of OT SQL Question (Score:2, Insightful)
As I am designing more and more complex web apps, I am constantly having to think of new, innovative ways to design the tables and databases and am currently making it up as I go. Does anyone have a reccomendation for books/sites that talk about good design proactices, that is not "How to use SQL" and relatively agnostic on the specific brand on DB?
Sorry for the OT post, its just something that has been bugging me for a while
the usual database blatherings (Score:2, Insightful)
by MARGO SELTZER, SLEEPYCAT
Sleepycat? The guys who make a brain-dead key/value database with no data manipulation or integrity capabilities? Who are they to educate others on the topic of relational databases? (Sleepycat's products are useful tools, but they are not true databases).
while data management has become almost synonymous with RDBMS, however, there are an increasing number of applications for which lighter-weight alternatives are more appropriate.
Ahh, so the proper title of this paper should be: "Beneath Relational Databases" or "Below Relational Databases". Because the relational model is a *complete* model for data storage and manipulation, so if you have a subset of this functionality, you are not "beyond" it.
As argued by Stonebraker, the relational vendors have been providing the illusion that an RDBMS is the answer to any data management need. For example, as data warehousing and decision support have emerged as important application domains, the vendors have adapted products to address the specialized needs that arise in these new domains. They do this by hiding fairly different data management implementations behind the familiar SQL front end. This model breaks down, however, as one begins to examine emerging data needs in more depth.
Well, the mention of Stonebraker's name as an authority on databases is generally an indicater of a content-free paper, but let's be sure we're talking about the same thing: the relational *model* is a *complete* model. There is no other more effective model, in fact as far as I know, there are no other complete models!
So if you want to use the relational model as a foundation to build new database products, go right ahead. If you're talking the same old vendor BS about "post relational" or "XML" (hierarchical) or "object" (network and/or hierarchical), then please shut up!!
My feeling is when he says "in depth", he means "less depth".
As more documents are created, transmitted, and operated in XML, these translations become unnecessary, inefficient, and tedious. Surely there must be a better way. Native XML data stores with XQuery and XPath access patterns represent the next wave of storage evolution. While new items are constantly added to and removed from an XML repository, the documents themselves are largely read-only.
Uh, yes there is a better way: create an XML data type in a relational database with a full set of XML operators. The relational model doesn't care about data types.
I have no interest in giving up the general relational model for a hierarchic model (rejected decades ago as not being general enough) based on a TEXT FILE FORMAT.
Stream processing is a bit of an outcast in this laundry list of data-intensive applications.
I smell Stonebraker.. yes, it's an outcast because stream processing has nothing to do with data storage!!!
Some argue that database architecture is in need of a revolution akin to the RISC revolution in computer hardware
Yes, all these people need to study and understand the relational model which was developed 30 years ago and is still the only complete data model. The relational model can be described in half a page, and consists of a small number of core operations from which any possible data storage and manipulation need can be developed. Stop thinking about implementations, think about the *model* and then use that develop new implementations!!
Old-style database systems solve old-style problems; we need new-style databases to solve new-style problems.
What does this mean exactly? I need to store and manipulate data without limitations. The relational model offers this. What is "old" or "new" here? I'm not going to switch to an ad-hoc subset of the relational model because it's "new".
This "paper" (wasn't there one a couple weeks from some Microsoft dude, which was equally useless?) commits the same old sins: 1) look at existin
Re:KISS (Score:4, Insightful)
As for SQL I do not agree with point 4: SQL does not perform very well. We're in the age of Ghz processors, fast disk drives and it *still* is an performance issue to add a few million records to a database? What SQL sorely lacks is a recognition that the 4th generation of software languages was a bad mistake, and get back to a third generation language: explicit indices, explicit loops over these indices and fast (compiled) execution of said loops. Just freaking program the database instead of waving chicken bones at it.
Re:Relational Filesystems (Score:3, Insightful)
Because current RDBMS designs are unsuitable for filesystems. Relational theory still holds (just as it does for OODBs), but the physical design should be quite different if it's going to be effecient.
As I said, this has been beaten to death in the research communities. BeOS even included a DBFS design [nobius.org], but it went largely unused. NTFS also has all the necessary stuff in it, but Microsoft constantly removes it in final releases. ReiserFS has DBFS features, but these also go largely unused.
I think the problem is that making effective use of a DBFS requires a very different set of applications. i.e. If the applications are aware of the functionality, then they can assist the user and provide useful support. But without this form of OS and application support, the user will find that the metadata is nothing but added confusion.
Re:KISS (Score:4, Insightful)
In other words, relational databases are very nice and elegant but their interface (SQL) is bad and should be replaced.
Also relational databases by themselves can't supply the needs of a typical enterprise and that's where technologies such as OLAP are built on top of RDBMS to make certain data manipulations efficient.
Re:AS-400 (Score:3, Insightful)
AS/400's are hopelessly complex even to seasoned IT professionals such as myself, and they're only around is not because people like them but because the work SO GOD DAMNED WELL :)
point being, tractors work well to but you dont drive one day to day do you? :)
Re:Just because something is "old" (Score:3, Insightful)
Little crap bulletin boards can use MySQL.
Like Slashdot :) Remember yesterday's fiasco where posts were migrating into other articles and the time before that where it happened and the time before that. :)
Why people don't use PostgreSQL is beyond me - unless they don't like fact it is under a BSD license instead of the GPL (please don't get me started on that).
Re:Just because something is "old" (Score:2, Insightful)
Instead of blaming the technology and tools, they should improve skills in the Sytem Artchitects and all the way down the road the people involved in Software Development.
Re:A thought on XML documents (Score:5, Insightful)
Re:How about.... (Score:2, Insightful)
1) a table is *not* a class or an object.
Good advice. A table is simply a snapshot of a relational value. True relational values have no top/bottom or left/right ordering so you can't really show them on paper or on the screen. It doesn't make sense to map classes or objects to tables.
2) Learn how to normalize. A badly (or flat out not) normalized database threatens data integrity by violating the once-and-only once rule.
Normalization has nothing to do with integrity per se. You can add constraints to a denormalized database that make it logically equivalent to a normalized one. I'm not aware of any "once and only once" rule in relational theory.
Normalization is about dependencies: your data shouldn't have certain types of dependencies in it.
As a rule of thumb if the table has more than 20 fields in it you should review your data model and make sure it is properly normalized.
Wrong, the number of fields has nothing to do with normalization. And you should review your data model even if it has 1 field.
Ditch Raid 5. 0+1 will give better perfomance in most cases. Manager like Raid 5 because it is cheap, you get what you pay for.
Yes, 0+1 is faster, but lots of RAM is even faster. However this is a physical detail and not anything to do with the relational model.
6) Learn a little theory, it won't hurt you. In fact it can save a large amount of time and trouble.
This should be point #1 with a double underline. The lack of understanding and even hostility toward theory is frightening in the IT industry. If the author of the paper above had any foundation knowledge her paper would be much different.
8) Avoid XML. Too much bloat.
Just remember that XML is FILE FORMAT like Jpeg, CSV, and you'll do fine.
9) Learn how to use indices on tables.
Christ, I hope people know this one! They should also realize that an index is a PHYSICAL manifestion of a logical idea: keys.
Re:5th normal form model of perfection. (Score:3, Insightful)
If a few joins (assuming you've got indexes, etc setup) make it too slow you are too close to the edge.
Saving on hardware but spending more in dealing with data problems is false economy.
Re:KISS (Score:4, Insightful)
This is a BIG IMPORTANT POINT. But, unfortunately, the people that are architecting the databases in these cases usually don't realize that they're producing bad designs. All they see is "The database is slow. Why the *!?$ is it so slow!?" There's usually one or two answers: 1) Your schema is organized poorly. 2) You're just plain juggling too much data at once. And let's not forget the deadly combination of both 1 and 2. 100M rows, for example, is a lot of data, any way you look at it. If you've got 100 bytes per row (remarkably small in many cases), that's 10G of data. If you're shuffling through all that data every time, yeah, it's gonna be slow. Ya might want to look into reorganizing, or archiving.
Re:KISS (Score:2, Insightful)
> 1) Reasonably simple API...
The outrageous success of SQL is in part because it complies with the "second law of axiomatic design": a good design has minimum entropy. Prior to SQL a lot of application knowledge was duplicated in the database which increased entropy. Also a lot of knowledge of the physical structure of the data was embedded in the programs which also increased entropy. This is bad because the more redundant information the system contains the harder it is to change.
With SQL you can change the physical structure without rewriting all your programs. And you can change your code without affecting the DBMS design eg you can get the rows back in a different order. In the bad old days if you wanted things in a different order you had to reload the database and recompile all your programs.
Even though SQL uses 3 to 10 times as many CPU cycles as hierarchical DBMSs or ISAM files it took over because of this flexibility.
Similarly, I think that one of the reasons OO programming has not delivered the productivity benefits it was supposed to is that OO somehow encourages people to embed a lot of duplicated knowledge of the application into the class structure. This duplicated information makes it hard to change the system. I have seen this again and again. In contrast, before OO a lot of this information ended up in data or in a database, where it could be easily changed.
JDBC further standardizes SQL dialects (Score:3, Insightful)
XML documents are an hierarchical database (Score:3, Insightful)
that's why XML databases flopped
Re:KISS (Score:4, Insightful)
Google's web index and desktop search facility is a database. I don't know about point 1, but Google definitely blows any relational database out of the water on point 2 to 4.
Google is a very unique case. There are two things in Google's advantage that most RDBMS system have to take into account:
1. Google does not have to update in realtime. If I add a page to my website it is not immediately available on Google. Contrast this to a normal RDBMS where if I add a record it must be available immediately. Google is much more similar to a Data Warehouse than a RDBMS.
2. Websites indexes are more easily parallelized because complex joining of data is not needed. Naively, Google can store all websites starting with 'A' on a server, 'B' on a server, etc. You can store the User table and the Address table on separate database servers and expect to query on users and their addresses with any sort of performance.
3. Going along with 3, the queries expected out of most RDBMS systems are much much more complex than any queries expected out of Google right now. I'm assuming you haven't seen any complex financial reports or statistics that have been generated from a RDBMS. Databases do alot more than "select name from user where id=1234".
Brian
Re:KISS (Score:3, Insightful)
Google's performance and value are both amazing. But it's easy to drive yourself nuts (if you're an enterprise software architect, which I am) trying to get that kind of performance out of other types of application. You see, Google has two major advantages over nearly all other large data-backed applications:
1. There is no "right" answer
Google keeps their ranking and indexing schemes proprietary, so nobody can say what "should" come back from a given search. Indeed, execute the same query at the same time from different machines, or the same machine at different times, and you sometimes get wildly varying results. SEO folks call this the "Google Dance".
2. Writes are asynchronous with reads
As near as anyone can tell, the Google index is rebuilt by their crawler over the span of a few weeks, and then the whole new index is exported to production machines over the span of a few days. Only the crawler writes data to the index, and the index as it's being built is not read by end-user clients; the production index is *only* read.
These advantages let Google use a distributed, loosely coupled, inconsistent server farm made of cheap boxes that needn't be in sync with one another. It doesn't matter if queries to two of them give different answers, after all.
Contrast this with (e.g.) an online bookstore; once you order a book, every node in every part of the system needs to know about that order in order to keep everything consistent (stock level monitoring, end-user purchase tracking, and so forth). This is a much, much harder problem to solve.
So again, not to take anything away from Google's tech (which I more or less worship), but it's not a fair point of comparison for most large enterprise apps.
Re:SQL isn't a database (Score:3, Insightful)
Quite a bit actually. In my experience most only support MySQL. Why? They don't understand the value of data integrity and the features MySQL lacks/lacked. As a result they have had to poorly implement things that the DB itself should do, and when they try to "port" this to another DB they run into all sorts of issues.
And SQL is a language, not an API.
So they give up.