
The NoSQL Ecosystem 381

abartels writes 'Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.'
This discussion has been archived. No new comments can be posted.

  • by BBCWatcher ( 900486 ) on Tuesday November 10, 2009 @01:27AM (#30042548)

    I think I've heard of non-relational databases before. There's a particularly famous one, in fact. What could it be []? Let's see: first started shipping in 1969, now in its eleventh major version, JDBC and ODBC access, full XML support in and out, available with an optional paired transaction manager, extremely high performance, and holds a very large chunk of the world's financial information (among other things). It also ranks up there with Microsoft Windows as among the world's all-time highest grossing software products.

    ....You bet non-relational is still highly relevant and useful in many different roles. Different tools for different jobs and all.

  • by Tablizer ( 95088 ) on Tuesday November 10, 2009 @02:07AM (#30042708) Journal

    IMS is very efficient for known query patterns, but not very flexible for stuff not anticipated. This is a common characteristic of non-relational databases: optimize for specific query paths at the expense of general queries (variety).

    Often IMS data is exported and re-mapped nightly or periodically to a RDBMS so that more complex queries can be performed on the adjusted copy. The down-side is that it's several hours "old".

    Note that it's also possible to optimize RDBMS for common queries using well-planned indexing and techniques such as clustered indexes, which put the physical data in the same order as the primary or target key. Whether that can be as fast as non-relational techniques is hard to say. It may depend on the skills of the tuner.
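    The clustered-index technique mentioned above can be sketched with SQLite standing in for a full RDBMS: a WITHOUT ROWID table stores rows physically ordered by the primary key, which is roughly what a clustered index does in other engines. The table and column names here are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A WITHOUT ROWID table stores rows physically ordered by the primary
# key -- roughly what a clustered index does in other engines.
con.execute("""CREATE TABLE events (
    user_id INTEGER, ts INTEGER, payload TEXT,
    PRIMARY KEY (user_id, ts)) WITHOUT ROWID""")
con.executemany("INSERT INTO events VALUES (?, ?, ?)",
                [(u, t, "x") for u in range(100) for t in range(50)])

# A filter on the leading key column reads contiguous rows instead of
# scanning the whole table.
plan = con.execute("EXPLAIN QUERY PLAN "
                   "SELECT * FROM events WHERE user_id = 7").fetchone()
print(plan[-1])  # a SEARCH via the primary key, not a full SCAN
```

    Whether this matches the raw speed of a hand-tuned hierarchical store depends, as the comment says, on the workload and the tuner.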

  • Re:bad design (Score:1, Informative)

    by Anonymous Coward on Tuesday November 10, 2009 @02:58AM (#30042942)

    Bloom Filter []

    Been around before it was used to describe computer graphics lighting effects.
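    For reference, a minimal Bloom filter looks something like this. This is a toy Python sketch; real implementations size the bit array and hash count for a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, tunable false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions from a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("slashdot")
print(bf.might_contain("slashdot"))  # members always hit
print(bf.might_contain("digg"))      # almost certainly a miss
```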

  • Re:hmm (Score:3, Informative)

    by QuoteMstr ( 55051 ) <> on Tuesday November 10, 2009 @04:00AM (#30043174)

    I don't think you've thought clearly about the problem.

    If a JOIN is causing problems because it's causing too much non-local data access, then you're going to run into the same problem when you re-code the JOIN in the application. In fact, it might hit you worse because you won't benefit from the database's query optimizer.

    The solution is clearly to improve locality of reference. You can do that by duplicating some data, denormalizing the database, and so on. But you can do all those things just as easily within a RDBMS, and without losing the other benefits a RDBMS gives you.

    Really, your problem is that some of the things RDBMSes allow hurt when a database grows beyond a certain size. The solution is to not do the things that hurt, not ditch the things that RDBMSes do allow.

    It's like complaining that your feet are sore if you walk 20 miles, then cutting off your leg to make it stop.
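    The "duplicate some data inside the RDBMS" point can be made concrete: denormalize a hot column and let a trigger keep the copy consistent, so reads stay local and the database still enforces correctness. A hypothetical sketch in SQLite (the schema is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
-- posts carries a denormalized copy of authors.name, so the hot
-- read path never needs a join against authors.
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                    author_name TEXT, body TEXT);
-- The RDBMS itself keeps the duplicate consistent.
CREATE TRIGGER sync_name AFTER UPDATE OF name ON authors
BEGIN
    UPDATE posts SET author_name = NEW.name WHERE author_id = NEW.id;
END;
INSERT INTO authors VALUES (1, 'alice');
INSERT INTO posts VALUES (10, 1, 'alice', 'first post');
UPDATE authors SET name = 'alice2' WHERE id = 1;
""")
print(con.execute("SELECT author_name FROM posts WHERE id = 10")
      .fetchone()[0])  # join-free read, still consistent
```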

  • Hstore (Score:1, Informative)

    by Anonymous Coward on Tuesday November 10, 2009 @06:07AM (#30043674)

    You are aware of PostgreSQL's hstore: a type representing, basically, a name-value mapping (think Perl hash or Python dictionary). You can put an index on it, answering queries like "find all records where the field has the mapping foo => bar", or "find all records containing the mappings {foo => bar, baz => grumble}", and more.

    Cool stuff.
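    For readers without Postgres handy, the query pattern hstore supports can be imitated with SQLite's JSON functions. This is only a rough stand-in; hstore's containment operators and GiST/GIN indexing are richer than this sketch.

```python
import sqlite3
import json

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, attrs TEXT)")
rows = [(1, {"foo": "bar", "baz": "grumble"}),
        (2, {"foo": "other"}),
        (3, {"baz": "grumble"})]
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(i, json.dumps(a)) for i, a in rows])

# "find all records where the field has the mapping foo => bar"
hits = con.execute(
    "SELECT id FROM items WHERE json_extract(attrs, '$.foo') = 'bar'"
).fetchall()
print(hits)  # only record 1 matches
```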

  • Re:Why worry? (Score:2, Informative)

    by Yoozer ( 1055188 ) on Tuesday November 10, 2009 @06:40AM (#30043794) Homepage
    That's when you tell customers about MSDE (now SQL Server Express) [] which does the job a lot better without breaking the bank.
  • by Errol backfiring ( 1280012 ) on Tuesday November 10, 2009 @07:09AM (#30043916) Journal
    I did profile my code. It is not my gut feeling, but my experience.
  • by Shivetya ( 243324 ) on Tuesday November 10, 2009 @07:11AM (#30043928) Homepage Journal

    I work on a very large DB2 system. Enterprise systems cost money because they work. There still seems to be this ignorant, self-absorbed counterculture which believes that big iron and the like isn't cool (compared to "look what I can build in my basement"), so it cannot work.

    Between radix, sparse, derived, and encoded-vector indexes, I can pretty much serve up anything my partners want, whether they are connected as native or foreign DB2, JDBC, or ODBC. With the tools I have at my disposal I can analyze statements presented by developers to ensure I have the access paths needed for their work and guide them to better data retrieval. I can tell if their choices result in full table scans, index probes, hash tables, RRN tables, etc. If I need support, it's a phone call away.

    I do not care who my client is; data is my job. As such I need tools so reliable that the only concerns I have are what my customer is doing and how I can make their request better. When they query 5 TB tables and don't even notice a delay, I think I am doing just fine.

  • Re:bad design (Score:3, Informative)

    by Muad'Dave ( 255648 ) on Tuesday November 10, 2009 @08:54AM (#30044436) Homepage

    ...they call themselves "No SQL" and then lash out at relational databases.

    Had you read the article, you would've seen that the "No" in NoSQL stands for Not Only, not No, as in none whatsoever. I welcome any and all research into better, tighter synergy between databases and object persistence.

  • Re:hmm (Score:3, Informative)

    by larry bagina ( 561269 ) on Tuesday November 10, 2009 @09:51AM (#30044844) Journal
    Create AFTER INSERT, UPDATE, and DELETE triggers which file the data (as well as the action, timestamp, and user) into an audit table.
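    A minimal sketch of that audit-trigger pattern, using SQLite (which, unlike a server RDBMS, has no session user, so that column is omitted here; the schema is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE audit (action TEXT, row_id INTEGER, old_balance INTEGER,
                    new_balance INTEGER,
                    ts TEXT DEFAULT CURRENT_TIMESTAMP);
-- One trigger per action files the change into the audit table.
CREATE TRIGGER audit_ins AFTER INSERT ON accounts BEGIN
    INSERT INTO audit (action, row_id, new_balance)
    VALUES ('insert', NEW.id, NEW.balance);
END;
CREATE TRIGGER audit_upd AFTER UPDATE ON accounts BEGIN
    INSERT INTO audit (action, row_id, old_balance, new_balance)
    VALUES ('update', NEW.id, OLD.balance, NEW.balance);
END;
CREATE TRIGGER audit_del AFTER DELETE ON accounts BEGIN
    INSERT INTO audit (action, row_id, old_balance)
    VALUES ('delete', OLD.id, OLD.balance);
END;
INSERT INTO accounts VALUES (1, 100);
UPDATE accounts SET balance = 150 WHERE id = 1;
DELETE FROM accounts WHERE id = 1;
""")
for row in con.execute("SELECT action, old_balance, new_balance FROM audit"):
    print(row)  # one audit row per insert, update, and delete
```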
  • by jellomizer ( 103300 ) on Tuesday November 10, 2009 @09:54AM (#30044868)

    The problem with SQL isn't really SQL's fault; it's that most people don't know how to use it properly. It is amazing how many people with "SQL" on their resumes cannot do joins, cursors, or grouping. I don't consider myself an SQL expert, but I am good at it and I can get the job done. I know that if a query takes too long to run, there is probably some optimization I can do to improve speed: creating temporary tables with a reduced dataset, or changing a single nested SQL call to a cursor loop (or the other way around, depending on which will run faster). A lot of times these optimizations are the difference between 40 minutes and 30 seconds, giving the same correct results. A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c
    What this will do is go down the x table, O(n), then match up the correct record from y; with the correct indexing each lookup takes O(ln(n)), so the overall performance is O(n ln(n)), which is asymptotically much faster.
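    The two access strategies being contrasted here can be sketched in plain Python. This is a toy model of nested-loop versus hash/index joins, not how any particular engine actually plans queries (as a reply in this thread points out, modern optimizers do not naively materialize the cross product):

```python
# Nested-loop join: every x row is checked against every y row -- O(n*m).
def nested_loop_join(xs, ys):
    return [(x["a"], y["b"]) for x in xs for y in ys if x["c"] == y["c"]]

# Hash/index join: build a lookup on y.c once, then probe it per x row,
# for roughly O(n + m) work when keys hash well.
def hash_join(xs, ys):
    index = {}
    for y in ys:
        index.setdefault(y["c"], []).append(y)
    return [(x["a"], y["b"]) for x in xs for y in index.get(x["c"], [])]

xs = [{"a": i, "c": i % 10} for i in range(100)]
ys = [{"b": i * 2, "c": i % 10} for i in range(100)]
# Both strategies produce the same join result.
assert sorted(nested_loop_join(xs, ys)) == sorted(hash_join(xs, ys))
```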

  • Re:hmm (Score:3, Informative)

    by mzito ( 5482 ) on Tuesday November 10, 2009 @09:54AM (#30044874) Homepage

    I sort of agree with you, from the perspective that there are crusaders on either side: people who insist that traditional RDBMSes are the Only Way, and people like you who insist they've "never been trialed under real-world conditions". Both statements are clearly incorrect on their face.

    However, there are a multitude of features that these systems have that are not available in NoSQL systems, or only available in such a watered-down form that it's unfair to compare the two. A list:

    - On-disk encryption
    - Compression
    - Schema/data versioning (present one picture of data to one set of clients, while presenting another layout of the same data to another set during a data migration)
    - Automated failover between servers, clusters, facilities, datacenters
    - "Flashback" - say "I want to run a query against my data as it looked last week at 3pm", and it just works.
    - Shared-disk clustering

    As far as transactions go, they may be a "joke" for scalability (not quite sure what that means), but they're awfully useful when dealing with sensitive information you need ACID compliance for. For example, I would prefer my bank not use an "eventual consistency" model when dealing with my credit card transactions.

    Now, as I said above, a relational database *may not* be the right decision for your application. But the idea that relational databases don't scale is ridiculous. I've seen petabyte data warehouses running Teradata that absolutely scream through data. I've seen Oracle systems that do tens of thousands of write transactions per second, and several times that in reads. They exist.
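    The credit-card point is exactly what ACID transactions buy you: a transfer either fully happens or fully doesn't, with no intermediate state ever visible. A small SQLite illustration (toy schema, invented for this example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
            "balance INTEGER CHECK (balance >= 0))")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])

def transfer(amount):
    try:
        with con:  # one atomic transaction: both updates commit, or neither
            con.execute("UPDATE accounts SET balance = balance - ? "
                        "WHERE id = 1", (amount,))
            con.execute("UPDATE accounts SET balance = balance + ? "
                        "WHERE id = 2", (amount,))
    except sqlite3.IntegrityError:
        pass  # overdraft violates the CHECK; the whole transfer rolls back

transfer(60)  # succeeds: 100 -> 40, 0 -> 60
transfer(60)  # would overdraw account 1, so nothing changes anywhere
print(con.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(40,), (60,)]
```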

  • Re:bad design (Score:2, Informative)

    by AvitarX ( 172628 ) <`gro.derdnuheniwydnarb' `ta' `em'> on Tuesday November 10, 2009 @10:24AM (#30045188) Journal

    But that is my point really, big ass hardware is still quite cheap.

    An inefficient but easy-to-create-and-manipulate design that scales affordably has benefits.

    When the workload scales to 10,000 servers (April 2008), the inefficiency is not problematic.

    If the super-efficient, embedded-style program were easy to maintain and scaled so well, it might take only 5,000 servers, or even 1,000; but how well would it scale to 15,000, or to 3,000 with new features being added? (As of October 2009, Facebook has 30,000 servers.)

    Being lightweight is nowhere near as important as being able to throw big-ass hardware at a problem.

    And if that big-ass hardware is lots of little pieces, because the system scales easily that way, it could be less expensive than fewer, larger pieces overall (not that larger pieces would be an inevitable outcome of efficient embedded-style programming, simply a likely one).

  • by Just Some Guy ( 3352 ) <> on Tuesday November 10, 2009 @11:40AM (#30046144) Homepage Journal

    A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c

    Dude. That is so unbelievably wrong. First, implicit (comma) joins are inner, not left: your results will differ from the original query. Second, please name one popular database released in the last 3 years that implements inner joins with predicates in the way you describe. I can't speak for the others, but PostgreSQL sure as hell doesn't:

    => select count(1) from invoice;

    => select count(1) from ship;

    => select invoice.invid from invoice, ship where invoice.shipid = ship.shipid and ship.name_delpt = 'redacted';

    Each of those queries against our live production database ran in under a second (and I only edited the input and output of the final query). PostgreSQL may be quick, but I promise you it didn't have time or RAM to create 825,129,958,136 tuples and then winnow out the non-matches. Maybe you're stuck on an ancient version of a DB that was crappy to start with, but the rest of us don't put up with the same insanities you describe.

  • Re:bad design (Score:2, Informative)

    by vajrabum ( 688509 ) on Tuesday November 10, 2009 @03:07PM (#30049636)
    But because they can be used to partition the filter across machines, a Bloom filter can in effect serve as an index. Each machine that stores a portion of the filter gets all the queries that might apply to it, and sends any results up to a machine that dispatches what you might call pre-queries on the Bloom filters to the machines where the data and traditional indexes are stored. If the search vendor implements delete--Google doesn't really, and this is the reason why--then you simply recompute the Bloom filters when they become sufficiently out of date. That point can be determined by tracking how often you get a false positive. Index lookups are slow for large data sets not because it takes that long to return an individual result, but because there are so many queries. Bloom filters allow you to reduce the number of traditional index lookups, and to dispatch the ones that must be computed only to the machines where the data is available.
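    The dispatch scheme described above can be sketched as follows. A plain Python set stands in for each shard's Bloom filter; a real filter would occasionally return a false positive, costing one wasted probe, but would never miss data:

```python
class Shard:
    """One storage node: its records plus a compact membership summary."""

    def __init__(self, records):
        self.records = records       # key -> value (the "real" data + index)
        self.summary = set(records)  # stand-in for this shard's Bloom filter

    def might_have(self, key):
        # A real Bloom filter may say "maybe" for absent keys,
        # but never says "no" for present ones.
        return key in self.summary

    def lookup(self, key):
        return self.records.get(key)

shards = [Shard({"a": 1, "b": 2}), Shard({"c": 3}), Shard({"d": 4})]

def dispatch(key):
    # Only shards whose filter matches receive the expensive index probe.
    probed = [s for s in shards if s.might_have(key)]
    results = [v for s in probed
               for v in [s.lookup(key)] if v is not None]
    return len(probed), results

print(dispatch("c"))  # probes one shard; the other two are skipped
```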
