Databases Software

The NoSQL Ecosystem

abartels writes 'Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.'
This discussion has been archived. No new comments can be posted.

  • by Tablizer ( 95088 ) on Tuesday November 10, 2009 @01:26AM (#30042546) Journal

    The performance claims will probably be disputed by Oracle whizzes. However, the "rigid schema" claim bothers me. RDBMSes can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array). Non-existent columns in any given row are treated as Null/empty instead of an error. Perhaps tables could also be created just by inserting a row into the (new) target table, with no need for explicit schema management. Constraints, such as "required" or "number", could be added incrementally as the schema solidifies. We have dynamic app languages, so why not dynamic RDBMSes too? Let's fiddle with and stretch the RDBMS before outright tossing it. Maybe also overhaul or enhance SQL. It's a bit long in the tooth.
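
    A minimal sketch of that idea in Python (illustrative only; not how any shipping RDBMS actually works):

        # A "dynamic relational" table: rows are maps, missing columns read
        # as None, and constraints get bolted on as the schema solidifies.
        class DynamicTable:
            def __init__(self):
                self.rows = []
                self.constraints = {}   # column name -> validator function

            def insert(self, **row):
                for col, check in self.constraints.items():
                    if col in row and not check(row[col]):
                        raise ValueError("constraint failed on " + col)
                self.rows.append(row)

            def select(self, column):
                # A non-existent column reads as None, not as an error.
                return [r.get(column) for r in self.rows]

            def add_constraint(self, column, check):
                self.constraints[column] = check

        t = DynamicTable()                 # no explicit schema management
        t.insert(name="Alice", age=30)
        t.insert(name="Bob")               # no age column: fine, reads as None
        t.add_constraint("age", lambda v: isinstance(v, int))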

    More at:
    http://geocities.com/tablizer/dynrelat.htm [geocities.com]
    (And you thought geocities was dead.)

  • by Just Some Guy ( 3352 ) <kirk+slashdot@strauser.com> on Tuesday November 10, 2009 @01:30AM (#30042562) Homepage Journal

    I'm a huge PostgreSQL fan and took classes in formal database theory in college. I'm saying this as someone who understands and thoroughly appreciates relational databases: I'm starting to love schema-less systems. I've only been playing with CouchDB for a few weeks but can certainly see what such stores bring to the table. Specifically, a lot of the data I've stored over the years doesn't neatly map to a predefined tuple, and while one-to-one tables can go a long way toward addressing that, they're certainly not the most elegant or efficient or convenient representation of arbitrary data.
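
    For the flavor of it, a sketch against CouchDB's plain HTTP interface (assumes a local server on the default port; database and field names are made up):

        # Two documents with different shapes living in the same database;
        # no predefined tuple, no ALTER TABLE. Assumes CouchDB on :5984.
        import requests

        base = "http://localhost:5984/scratch"
        requests.put(base)   # create the database if it doesn't exist

        requests.post(base, json={"type": "person", "name": "Mary",
                                  "jobs": ["dev", "dba", "tester"]})
        requests.post(base, json={"type": "note",
                                  "text": "no schema to migrate"})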

    I'm certainly not going to stop using an RDBMS for most purposes, but neither am I going to waste a lot of time trying to shoehorn an ever-changing blob into one. Each tool has its place and I'm excited to see what niche this ecosystem evolves to fill.

  • Re:bad design (Score:4, Interesting)

    by JavaPunk ( 757983 ) on Tuesday November 10, 2009 @01:38AM (#30042600)
    Yes it does (look through 50TB of data), and how would you design it? It has to access all of your friends and find their postings. Robert Johnson gave an excellent talk on Facebook's design two weeks ago at OOPSLA (it should be in the ACM digital library soon). He stated that there is no clear segregation of data; the (friend) network is too connected, and extracting groups of friends isn't possible. Basically they have a huge MySQL farm with memcached on top. Loading an inbox will hit multiple servers (maybe even a different server for each of your friends) across the farm.
  • by QuoteMstr ( 55051 ) <dan.colascione@gmail.com> on Tuesday November 10, 2009 @01:51AM (#30042656)

    We didn't start with relational databases. RDBMSes were responses to the seductive but unmanageable navigational databases [wikipedia.org] that preceded them. There were good reasons for moving to relational databases, and those reasons are still valid today.

    Computer Science doesn't change because we're writing in JavaScript now instead of PL/1.

  • by Tablizer ( 95088 ) on Tuesday November 10, 2009 @01:57AM (#30042666) Journal

    Notice that this blog is from a company pushing a cloud based solution.

    That is indeed suspicious. But if they want to sell clouds, then make an RDBMS that *does* scale across cloud nodes instead of bashing SQL. (SQL as a language doesn't define implementation; that's one of its selling points.) It may be that since there's not one out yet, they instead hype the existing non-RDBMSes that can span clouds.

    (I agree that SQL could use some improvements, such as named sub-queries instead of massive deep nesting to make one big run-on statement. Some dialects already have this to some extent.)
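
    (For what it's worth, the "named sub-queries" idea is roughly what the SQL WITH clause, i.e. common table expressions, already gives you in those dialects. A sketch via Python's sqlite3, which needs SQLite 3.8.3+ for CTEs:)

        # One named sub-query instead of a deeply nested run-on statement.
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE orders (cust TEXT, total REAL)")
        con.executemany("INSERT INTO orders VALUES (?, ?)",
                        [("a", 10.0), ("a", 5.0), ("b", 7.0)])

        rows = con.execute("""
            WITH per_cust AS (
                SELECT cust, SUM(total) AS spent FROM orders GROUP BY cust
            )
            SELECT cust FROM per_cust WHERE spent > 8
        """).fetchall()      # -> [('a',)]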
             

  • Re:hmm (Score:3, Interesting)

    by Prof.Phreak ( 584152 ) on Tuesday November 10, 2009 @01:59AM (#30042674) Homepage

    Depends. We've been using Netezza with ~100T of data, and... well... it takes seconds to search tables that are 30T in size. I'd imagine Teradata, greenplum and other parallel db's get similar performance---all while using standard SQL with all the bells and whistles you'd normally expect Oracle SQL to have (windowing functions, etc.).

  • by johnlcallaway ( 165670 ) on Tuesday November 10, 2009 @02:19AM (#30042772)
    I was an admin on a system that spread the data across 10 database servers. Each server had a complete set of some data, like accounts, but the system was designed so that ranges of accounts stored their transaction-type data on a specific server, and each server held about the same number of accounts and transactions. As data came in, it was temporarily housed on the incoming server until a background process picked it up and moved it to the 'correct' one. This is a very simplistic view, but the reality was that it worked quite well. Occasionally, there was a re-balancing that had to be done. But it was very scalable. The incoming data wasn't so time-sensitive that it mattered if it took a few hours to get moved; everything was still OK. When an 'online' session needed data, it knew which server to connect to to get it. Processing was done overnight on each server, then summarized and combined as needed.
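
    The routing piece of that design is simple in outline; a hypothetical sketch (server names and ranges invented):

        # Map an account id to the server that owns its range.
        SHARDS = [
            (0,      99999,  "db1.example.com"),
            (100000, 199999, "db2.example.com"),
            # ... one entry per server; rebalanced occasionally
        ]

        def server_for(account_id):
            for lo, hi, host in SHARDS:
                if lo <= account_id <= hi:
                    return host
            raise KeyError("no server owns account %d" % account_id)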

    So yes ... people have been coming up with innovative ways to solve these problems for a very long time.

    And they will continue to do so.
  • I/O bottleneck (Score:2, Interesting)

    by Begemot ( 38841 ) on Tuesday November 10, 2009 @02:24AM (#30042800)

    Let's not forget where the bottleneck is - the I/O. It's expensive but once you build a fast and solid storage system, correctly configure it and partition your data properly over a sufficiently large number of hard drives, RAIDs, LUNs etc., you might be able to use SQL. We run a database of 10TB on MS SQL with hundreds of millions of records with an equal rate of reads and writes and could not be happier.

  • Re:hmm (Score:5, Interesting)

    by buchner.johannes ( 1139593 ) on Tuesday November 10, 2009 @02:45AM (#30042874) Homepage Journal

    The first sign for me that someone is selling bullshit is when they try to act like this is some never-before-seen problem, when in fact there is a good four decades of research on database optimization.

    Your point is valid, but I think there is more to it. And the problems these solutions try to solve are quite old too. For example:

    Ever tried to design a database, but got the requirement that you should be able to reconstruct the modification history? It boils down to never deleting, 'deleted' flag fields, and other ugliness. A multi-version relational database would be nice; you actually don't need modification/delete operations in this scenario, just 'updates' that add to the previous status. CouchDB [blogspot.com] does append operations.
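
    In miniature, the multi-version idea looks something like this (a sketch over sqlite3; table and column names invented):

        # Append-only: every "update" is a new version row; nothing is
        # ever deleted, so any past state can be reconstructed.
        import sqlite3, time

        con = sqlite3.connect(":memory:")
        con.execute("""CREATE TABLE account_versions
                       (id INTEGER, balance REAL, valid_from REAL)""")

        def update(acct_id, balance):
            con.execute("INSERT INTO account_versions VALUES (?, ?, ?)",
                        (acct_id, balance, time.time()))

        def balance_as_of(acct_id, t):
            # The latest version no newer than time t.
            return con.execute("""SELECT balance FROM account_versions
                                  WHERE id = ? AND valid_from <= ?
                                  ORDER BY valid_from DESC LIMIT 1""",
                               (acct_id, t)).fetchone()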

    In some cases you may not need a complete SQL database, just key->value relations, but have them scaling very well. http://project-voldemort.com/ [project-voldemort.com] states: "It is basically just a big, distributed, persistent, fault-tolerant hash table." Then they state that they provide horizontal scalability, which MySQL doesn't (OTOH, we should really look at Oracle for these things).
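
    The core trick behind such stores is hash partitioning; a toy sketch (the real systems add replication, failover, and consistent hashing on top):

        # Route each key to a node by hashing it; get/put never touch
        # more than one node. The node list is invented for illustration.
        import hashlib

        NODES = ["node0", "node1", "node2"]
        tables = {n: {} for n in NODES}     # stand-ins for remote stores

        def node_for(key):
            digest = hashlib.md5(key.encode()).hexdigest()
            return NODES[int(digest, 16) % len(NODES)]

        def put(key, value):
            tables[node_for(key)][key] = value

        def get(key):
            return tables[node_for(key)].get(key)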

    And you can't really say MapReduce/Hadoop [apache.org] is pointless.

  • Re:hmm (Score:3, Interesting)

    by phantomfive ( 622387 ) on Tuesday November 10, 2009 @03:07AM (#30042972) Journal

    Ever tried to design a database, but got the requirement that you should be able to reconstruct the modification history? It boils down to never deleting, 'deleted' flag fields, and other ugliness.

    I did it by taking an exact copy of every INSERT, DELETE, or UPDATE query and dumping it into a special table in the database (along with a stack trace of where it was called from). To reconstruct, I could just run those commands straight from the database, to whatever point was desired. It was simple, straightforward and efficient, although I'm sure someone else has a better idea.
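
    Something like this, minus the stack trace (a sketch over sqlite3; names invented):

        # Record every mutating statement, then replay the log against a
        # fresh database to rebuild state up to any point.
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        con.execute("CREATE TABLE audit_log (seq INTEGER PRIMARY KEY, stmt TEXT)")

        def run_logged(stmt):
            con.execute("INSERT INTO audit_log (stmt) VALUES (?)", (stmt,))
            con.execute(stmt)

        run_logged("INSERT INTO users VALUES (1, 'me')")
        run_logged("UPDATE users SET name = 'you' WHERE id = 1")

        def replay(target, upto_seq):
            # target must already have the base schema (here: users).
            for (stmt,) in con.execute(
                    "SELECT stmt FROM audit_log WHERE seq <= ? ORDER BY seq",
                    (upto_seq,)):
                target.execute(stmt)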

  • by Tablizer ( 95088 ) on Tuesday November 10, 2009 @03:08AM (#30042976) Journal

    What prevents indexing a dynamic-relational DB? I said that you don't need a data-definition language, but that doesn't mean one *must* skip the DDL (for things such as indexes). Another thing to explore is auto-indexing: if many queries keep filtering by a given column, the system could automatically put an index on it.
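
    Auto-indexing in miniature (a sketch over sqlite3; the threshold and names are invented, and the column name is assumed to come from trusted code):

        # Count how often each column is filtered on; create an index the
        # first time a column crosses the threshold.
        import sqlite3
        from collections import Counter

        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
        filter_counts = Counter()
        THRESHOLD = 100

        def query_eq(column, value):
            filter_counts[column] += 1
            if filter_counts[column] == THRESHOLD:
                con.execute("CREATE INDEX idx_t_%s ON t (%s)" % (column, column))
            return con.execute(
                "SELECT * FROM t WHERE %s = ?" % column, (value,)).fetchall()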

  • Re:bad design (Score:4, Interesting)

    by Ragzouken ( 943900 ) on Tuesday November 10, 2009 @03:33AM (#30043072)

    "Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation"."

    When was the last time you tried to use Facebook or Facebook chat and didn't get failed transport requests, unsent chat messages, unavailable photos, or random blank pages?

  • Re:bad design (Score:3, Interesting)

    by gutter ( 27465 ) <ian...ragsdale@@@gmail...com> on Tuesday November 10, 2009 @03:53AM (#30043152) Homepage

    Sounds like you don't know much about Erlang. Erlang processes are MUCH lighter weight than unix processes, and are designed to scale to millions of processes. Generally, you want one Erlang process for each concurrent task in the system, like maybe one process for each active chat session. So, having 5 million Erlang processes would be as designed.

  • This again (Score:3, Interesting)

    by Twillerror ( 536681 ) on Tuesday November 10, 2009 @03:56AM (#30043158) Homepage Journal

    Wow, an "object oriented" database discussion again. I've never read one of these :P I've only been doing this 15 years, and I lost count of these talks a long time ago.

    What is the difference between schema-less and schema-rigid anyway? I don't see what that has to do with performance. The real issue is uptime and transaction support. People want to add a column or index without taking the system down. That is different from dealing with PBs of data. Most table structures can easily deal with that much data.

    If you have a DB that is big you have lots of outs. Pay ... get the Enterprise version of whatever. Break it into many DBs/tables and merge together. Archive. Archiving, I bet, will get most people by. Does eBay really need all that bidding info for items more than a few weeks old? Only for analysis, maybe. Move that old stale data out of the active, heavily hit data tiers.

    The fact remains that MySQL should be able to scale to TBs of data. The fact that it can't is a failure of the product. All the others have been able to for a while. Why can't it? I don't know ... maybe the fact that it uses a F'in different file for each index on a table. If you don't understand how old school that is, start using Paradox. Just because it is open source doesn't mean it has to be so damn out of date. Please, for the love of god, save multiple tables/indexes in the same pre-sized file ... god.

    Google has all the power to go and use something different. Google gets to cheat. Google is a collection of pretty static data. They scan the internet a lot, but imagine if every time you did a search Google had to scan every web page on the planet, index them, and then give you search results. That would be impractical for sure. So for now they just store big collections of blobs and a big fast index for searching keywords and links to pages. Impressive nonetheless, but it's not like your typical app. GMail is ... funny that it is the one system they've had problems with. Even then, EMAIL DOESN'T CHANGE. It's user specific, but it's still f'in static. GoogleTastic if you ask me.

    The fact is people are using RDBMSes right now to solve real world problems. Some startup finds a way to tweak MySQL to do something cool and posts it on a blog ... then all of a sudden RDBMS is dead. RDBMS is fine; it will be fine for at least 10 years if not longer. In that time it will evolve as well, so that it will be around for even longer. MySQL in 5 years will have online index addition, performance-hitless online column addition, partitioning, geo indexing, XML columns, BigASS table support, Oracle RAC-like support, and a thousand other features that some RDBMSs have today and some will not see for even longer. Then developers that spent all that cash developing custom shit will revert and post comments like this one.

    That's the way it goes in software development. The middle tier gets bigger, gets inept, custom shit comes out, it gets integrated into the middle tier shit....continue;

    Instead of pronouncing death, start talking about how dated a 2-dimensional result set is. JOINs should return N-dimension result sets, similar to XML, with butt loads of metadata. ODBC/JDBC are dated ... so update them.

    select u.login, ul.when from users u join user_logins ul as logins.login ON ul.user_id = u.user_id where u.name = 'me' should yield something like a nested XML packet instead of duplicated crap when there is more than one user_logins row.
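
    The flattening complained about above is easy to undo client-side; a sketch (data invented, and assuming the join rows arrive sorted by login):

        # Fold a flat users x user_logins join into a nested structure
        # instead of repeating the user columns once per login row.
        from itertools import groupby

        flat = [("me", "2009-11-09"), ("me", "2009-11-10")]   # (login, when)
        nested = [{"login": login, "logins": [when for _, when in rows]}
                  for login, rows in groupby(flat, key=lambda r: r[0])]
        # -> [{'login': 'me', 'logins': ['2009-11-09', '2009-11-10']}]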

  • Re:hmm (Score:3, Interesting)

    by h4rm0ny ( 722443 ) on Tuesday November 10, 2009 @05:14AM (#30043464) Journal

    It was simple, straightforward and efficient, although I'm sure someone else has a better idea.

    I'd love someone to post it if they do. We use the same method, and the one time we had to replay the sequence to get what we wanted, it took most of a day. Yes, that was because our last snapshot "starting point" was nearly a week old, but nonetheless ... if technology has moved on and there's a better way of doing this, then I'm sure a lot of us will be interested.

  • by Errol backfiring ( 1280012 ) on Tuesday November 10, 2009 @05:43AM (#30043582) Journal

    MS-Access had some really great features: it could be accessed both with SQL and with a blazingly fast ISAM-style library (blazingly fast because it ran almost on the bare OS). I am still missing anything like it on Linux. SQLite is a file-system database, but why on earth should it parse full-blown SQL at runtime, and why on earth should my program write another program in SQL at runtime just to load some data? Get serious. Parsing and building SQL is just overhead, and parsing SQL in particular is no easy and light task.

    Since I switched to OO programming, most (95%) of my queries are "This table/index. Number 5 please." In essence that is the get/put method, or the ISAM style method. I really would like something like that to exist on Linux. The closest thing around is MySQL's HANDLER statement, but that can only be used for constant data (because it does dirty reads) and for reading only.

    SQLite could even be faster if it just accepted some basic "get row by index" and "put row by index" commands that do not try to parse, optimize or outsmart anything. The problem with "modern" databases is that they are either "SQL" or "NoSQL". That's awful. Some programs speak SQL (because of compatibility, because it is a reporting program, or just because the programmer does not know anything else) and some programs are better off with direct row management. That does not mean that the data should not be accessible by both kinds of programs. I really wish that the regular SQL databases would develop ISAM-style access methods. Programming would be a hell of a lot easier then, and the programs themselves would speed up significantly as well.
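
    The closest thing to that get/put style in the standard Python toolbox is the dbm module; a sketch (file name invented; no secondary indexes, so it's only a partial answer):

        # "This table. Number 5 please." -- keyed get/put with no SQL
        # parsing anywhere in the path.
        import dbm

        with dbm.open("accounts.db", "c") as table:    # "c": create if missing
            table[b"5"] = b"row payload for record 5"  # put by key
            row = table[b"5"]                          # get by key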

    This is no idle remark. I worked a lot with MS-Access, and most rants about it being slow come from the fact that most programmers treat the file-system database as a server. So it must emulate a server and do a lot of household parsing, without even having a physical server to take the load.
    But if you know how to program a file-system database with ISAM-style methods, MS-Access is by far the fastest database I ever encountered. No joke. Really. It can be fast because there is no need to do all that household work just to dig up a row.

  • by QuoteMstr ( 55051 ) <dan.colascione@gmail.com> on Tuesday November 10, 2009 @06:43AM (#30043804)

    Your question reminds me of the people who say, "if flight recorders are so strong, why don't we just build the whole plane out of the stuff they use to make them?" You might as well ask, "if DNS is so great, why don't we implement filesystems in terms of it?" Your post demonstrates that you haven't considered context and purpose.

    Relational databases are models. You can certainly describe DNS in terms of a relational schema. In principle, you could construct a wrapper and query it with SQL. But there's no reason to do that, because with something as simple as DNS, the full power of a relational query engine doesn't buy you much.

    Most datasets aren't that simple.

    Furthermore, DNS is an open standard that needs to be accessible in as simple a way as possible. Complicating it with relational semantics wouldn't have been worthwhile (because of DNS's relative simplicity), and would have significantly hampered DNS's interoperability.

    That is, if relational databases had existed when DNS was implemented, which they didn't.

    Furthermore, DNS is a distributed, decentralized database. You couldn't use an RDBMS (the software that realizes the abstract model of a relational database) to manage it even if you wanted to. That doesn't apply to most datasets, which, however large, are still managed by a single organization, and which are accessed by software under the control of that organization.

    Your comparison really makes no sense whatsoever. The vast majority of databases aren't put under the same constraints as DNS, and so can take advantage of the much greater flexibility an RDBMS affords.

    You're basically arguing that we can't have efficient engines in automobiles because a few of them might need to tow 18-ton trailers and withstand mortar rounds. It's ridiculous.

  • Re:Why worry? (Score:5, Interesting)

    by Anonymous Coward on Tuesday November 10, 2009 @06:53AM (#30043850)
    You laugh, but the things I see done in Excel on a daily basis in production environments, getting a LOT of work done, are a testament to its power. It is one of the best rapid application development platforms in existence. People with no CS background are programming away in a functional style and getting shit done without even realising they are programming. It could be so much better, but it's still best of breed. And yes, I have tried, and seen others try, OO et al. Forget it. Let's not go down that worn old road.
  • by sco08y ( 615665 ) on Tuesday November 10, 2009 @08:14AM (#30044202)

    However, the "rigid schema" claim bothers me. RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array).

    You described an entity-attribute-value model, which winds up reinventing half the DBMS, poorly. Don't worry, *everyone* builds one once, until they realize it's a bad idea.

    Constraints, such as "required" or "number" can incrementally be added as the schema becomes solidified.

    A "rigid" schema is preventing a ton of totally redundant code being written on the app side. All those constraints wind up in the schema because your UI designer doesn't want to consider that Mary might have 5 addresses or 6 mothers or work 7 jobs simultaneously. And your UI tester doesn't want to test an exploding combinatorial number of possibilities.

    I'd like to see, however, a decent type system, proper logical / physical separation, etc.

    Maybe also overhaul or enhance SQL. It's a bit long in the tooth.

    I'm starting from scratch. [github.com] (Currently I'm slowly retyping about 40 pages into LaTeX...)

  • by QuoteMstr ( 55051 ) <dan.colascione@gmail.com> on Tuesday November 10, 2009 @08:42AM (#30044360)

    Right. Don't forget PostgreSQL too. Really, the problem here is MySQL. Hell, look at the "tips and tricks" comments for this story: they all deal with ways to work around deficiencies in MySQL (and old versions of MySQL at that).

    The guy who recommends using the first two characters of the MD5 hash to select a table is particularly hilarious. Doesn't he realize that's what a database index already does, and that databases (even MySQL) will do that for him?

  • Re:hmm (Score:3, Interesting)

    by popeyethesailor ( 325796 ) on Tuesday November 10, 2009 @11:53AM (#30046344)

    Why not use the DB features? Most enterprise-y databases have PITR (point-in-time recovery) features. Although it's not designed for that sort of thing, it could be used in such a fashion.
    Most DBs do the same thing you guys do, i.e., use a transaction log. The transaction log can be replayed to get to a point-in-time state. The one disadvantage is that it's all or nothing, i.e., you can't do it for specific transactions (although I'm sure some DBA will wander in and correct me on this ;)

  • Re:bad design (Score:2, Interesting)

    by vajrabum ( 688509 ) on Tuesday November 10, 2009 @12:20PM (#30046776)
    Bloom filters give constant-time probabilistic answers to set membership questions in a very space-efficient manner. Moreover, set union and intersection for the filters can be computed by simple AND and OR operations, also in constant time. The downside is that delete is hard. That union and intersection property means that it's easy to distribute queries over an arbitrary number of machines. Sounds kind of perfect for implementing a distributed index for searching, no?
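
    For the uninitiated, a minimal Bloom filter sketch (parameters invented; a real deployment sizes m and k to the expected item count and false-positive rate):

        # k hash functions set k bits per item; membership re-checks the
        # same bits. May return false positives, never false negatives.
        # OR-ing (AND-ing) two filters with identical m and k gives the
        # filter of the union (an approximation of the intersection).
        import hashlib

        class BloomFilter:
            def __init__(self, m=1024, k=3):
                self.m, self.k, self.bits = m, k, 0

            def _positions(self, item):
                for i in range(self.k):
                    h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
                    yield int(h, 16) % self.m

            def add(self, item):
                for p in self._positions(item):
                    self.bits |= 1 << p

            def __contains__(self, item):
                return all(self.bits >> p & 1 for p in self._positions(item))

        f = BloomFilter()
        f.add("alice")
        assert "alice" in f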
  • Re:hmm (Score:1, Interesting)

    by Anonymous Coward on Tuesday November 10, 2009 @02:10PM (#30048782)

    That might be part of it, but the big selling point of SQLite is its license, which is a lot less restrictive than Berkeley DB's.

    SQLite is kind of cool, but it really doesn't seem very well optimized as far as accessing the filesystem below it. Unless your whole database fits in RAM, it can really thrash your drives something terrible. Needless to say, it doesn't work well with large databases, mostly because of this. Even with lots of research and tweaking on high-end hardware, I have never gotten SQLite to perform very well.

    Firebird can be embedded just like SQLite (single-file database, one library to link to, etc.), also has a very permissive license, has tons more features (stored procedures, etc.), and performs orders of magnitude better than SQLite. Its license is also more permissive than MySQL's, and it has more features (but MySQL does have multiple database engines and more indexing types). I'm successfully using Firebird for terabyte-size databases and it works well. I'm not sure why more people don't use it.

    (What about PostgreSQL, you ask? Meh. Decent license, but it can't be embedded; it performs better than SQLite but much worse than Firebird and MySQL; feature-wise it's about on par with Firebird.)

  • Re:bad design (Score:3, Interesting)

    by Pseudonym ( 62607 ) on Wednesday November 11, 2009 @08:37PM (#30067950)

    Bloom filters are not as useful as they once were for large-scale indexing. As memory sizes increase, the tradeoff between precision and space efficiency changes. It's just as easy to distribute a hash table or a radix trie across multiple machines these days.

    A more common modern use is when you have data which is logically tabular, with potentially many "columns" which can contain arbitrary-sized objects, but the table is expected to be sparse. Traditional SQL table representations rely on predetermined maximum sizes for data values to optimise their representation, which is inappropriate for this because it would waste space. However, you also don't want to waste time accessing disk to find that a value isn't there. Using a Bloom filter costs a small amount of space (enough to fit in a small "descriptor") but can potentially save a huge number of disk seeks.
