The NoSQL Ecosystem (381 comments)

Posted by kdawson on Tuesday November 10, 2009 @01:12AM from the no-relation dept.

abartels writes 'Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.'

This discussion has been archived. No new comments can be posted.

bad design (Score:2, Insightful)

by girlintraining ( 1395911 ) writes: on Tuesday November 10, 2009 @01:18AM (#30042510)

So... every time I open my inbox in Facebook, it has to search through 50TB of data? That sounds like a design problem. What has always floored me is why people think everything needs to be stuffed into a database. Terabyte sized binary blobs? You know, there's a certain point where people need to stop and actually think about the implimentation.

hmm (Score:5, Insightful)

by buddyglass ( 925859 ) writes: on Tuesday November 10, 2009 @01:18AM (#30042514)

With regard to scalability, it strikes me that the problem isn't so much SQL but the fact that current SQL-based RDBMS implementations are optimized for smaller data sets.

Re:bad design (Score:5, Insightful)

by munctional ( 1634709 ) writes: on Tuesday November 10, 2009 @01:29AM (#30042560)

Ever heard of Bloom filters? Sharding? Indexes? They are clearly not doing a table scan on 50 TB of data every time you open your Facebook inbox.
You know, there's a certain point where people need to stop and actually think about the implimentation.
Um, they do. They regularly blog about their solutions to their problems and open source their solutions and contributions to existing projects. They come up with amazing solutions to their large scale problems. They're running over five million Erlang processes for their chat system!
http://developers.facebook.com/news.php?blog=1 [facebook.com]
http://github.com/facebook [github.com]
Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation".
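For readers who have not met a Bloom filter: below is a minimal sketch in Python, not anything taken from Facebook's code. The class name, sizes, and the per-shard use are assumptions made for illustration. The idea is that a small bit array can answer "definitely not here" or "possibly here", so a lookup can skip shards that cannot hold the data.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive num_hashes positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical use: one filter per shard, recording which user IDs it stores.
shard_filter = BloomFilter()
for user_id in ("alice", "bob"):
    shard_filter.add(user_id)
```

A caller would check `shard_filter.might_contain(user_id)` before touching the shard at all, and only pay for a real lookup when the answer is "possibly".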

Re:hmm (Score:5, Insightful)

by phantomfive ( 622387 ) writes: on Tuesday November 10, 2009 @01:38AM (#30042602) Journal

The biggest problem is the cloud. A lot of cloud APIs don't allow full relational database access, so now it seems we are coming up with all these justifications for why we don't really need it. Notice that this blog is from a company pushing a cloud based solution.

Re:Hashes are your friend (Score:3, Insightful)

by MightyMartian ( 840721 ) writes: on Tuesday November 10, 2009 @01:43AM (#30042616) Journal

In the olden days you didn't have centralized message stores. That's largely a relic of PC-based networking schemes like Novell, Lotus Notes and Exchange. The Unix model used individual mailboxes (in fact, the whole breakdown was for all of a user's data being in their own hierarchy). Obviously the Unix mailbox scheme wasn't that great as we started saving many megabytes of data, so you create indexed systems, but each user's mail is still effectively independent. I've used Pine to navigate my old mbox archives and it can move through even unindexed email at speeds that put bloated monsters like Exchange to shame.
Clearly the issue with scalability in general is simply one of optimization. If you're returning relatively small pieces of information, then an RDBMS is the way to go. If all your databases are basically blobs, well then it's probably not going to be that effective. I still feel that blobs are heavily abused.
I think part of the problem with RDBMSs is simply that a lot of people don't use them properly, and create the bottlenecks through bad design.

Re:hmm (Score:5, Insightful)

by MightyMartian ( 840721 ) writes: on Tuesday November 10, 2009 @01:48AM (#30042642) Journal

That's my take as well. We have these crippled semi-databases that lack a lot of useful features that anyone who has used RDBMSs over the last few decades has gotten used to, so suddenly it becomes a justification game; "Well, SQL doesn't deliver the output we need, so here's some half-way-to-SQL tools which are really better, kinda... oh yes, and Netcraft confirms it, SQL is dying!!!!"
I have a feeling that this is part hype, part inept programmers who don't actually understand SQL or database optimization. The first sign for me that someone is selling bullshit is when they try to act like this is some never-before-seen problem, when in fact there is a good four decades of research on database optimization.

Re:hmm (Score:5, Insightful)

by KalvinB ( 205500 ) writes: on Tuesday November 10, 2009 @01:48AM (#30042644) Homepage

For the vast majority of use cases, large data sets can be made logically small with indexes or physically small with hashes.
If you're dealing with massive data you're probably not dealing with complex relationships. E-mail servers associate data with only one index: the e-mail address. Google only associates content with keywords. E-mail servers logically and physically separate e-mail folders. Google logically and physically separates the datasets for various keywords. So by the time you hit it, it knows instantly where to look for what you want. You don't have a whole complex system of relationships between the data. It looks at the keywords, finds the predetermined results for each, and combines the results.
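The "predetermined results per keyword, then combine" idea can be sketched as a toy inverted index. This is only an illustration in Python with made-up documents and a hypothetical `search` helper; it says nothing about how Google actually stores or ranks anything.

```python
from collections import defaultdict

# Made-up corpus: document ID -> text.
documents = {
    1: "nosql databases scale out",
    2: "relational databases use sql",
    3: "sql and nosql databases compared",
}

# Build the index once: keyword -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)


def search(*keywords):
    """Return the sorted IDs of documents containing every keyword."""
    # Each keyword is a single lookup; no relationships are traversed.
    postings = [index.get(word, set()) for word in keywords]
    return sorted(set.intersection(*postings)) if postings else []
```

With this data, `search("nosql", "databases")` returns `[1, 3]` and `search("sql", "nosql")` returns `[3]`. Each keyword's result set could live on a different machine, which is the physical separation described above.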

Re:Dynamic Relational: change it, DON'T toss it (Score:2, Insightful)

by Prodigy Savant ( 543565 ) writes: on Tuesday November 10, 2009 @02:19AM (#30042774) Homepage Journal

What you are suggesting is to mimic a key-value design with something like a json or serialized data as the value.
This would work if you never had to index on any of the values in the JSON. All your SQL queries must have their WHERE clauses running off the key.
This is a problem that CouchDB and MongoDB solve.
I am not trying to paint SQL in an unflattering shade -- there would still be a lot of situations where an RDBMS design would be optimal. In fact, I am currently working on a MongoDB/MySQL hybrid solution for a large web site (larger than /. )
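To make the limitation concrete, here is a small sketch using Python's built-in sqlite3 module. The table name `kv`, the sample users, and both helper functions are assumptions for illustration only. The value column is treated as an opaque serialized string, so a lookup by key uses the primary-key index while a search on a field inside the value has to read and decode every row.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")

users = {
    "user:1": {"name": "alice", "city": "Oslo"},
    "user:2": {"name": "bob", "city": "Lima"},
    "user:3": {"name": "carol", "city": "Oslo"},
}
conn.executemany(
    "INSERT INTO kv (k, v) VALUES (?, ?)",
    [(key, json.dumps(value)) for key, value in users.items()],
)


def get(key):
    """Lookup by key: served by the primary-key index."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None


def find_by_city(city):
    """Lookup by a field inside the blob: every row is read and decoded."""
    matches = []
    for key, blob in conn.execute("SELECT k, v FROM kv ORDER BY k"):
        if json.loads(blob).get("city") == city:
            matches.append(key)
    return matches
```

`get("user:2")` is cheap at any table size; `find_by_city("Oslo")` grows linearly with the table, which is the gap that document stores with secondary indexes on fields aim to close.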

Re:hmm (Score:5, Insightful)

by mzito ( 5482 ) writes: on Tuesday November 10, 2009 @02:21AM (#30042788) Homepage

Uh, no, that is not correct. Relational DBMSes such as Oracle, Teradata, DB2, even SQL Server are all designed to scale into the multi-terabyte to petabyte range. The issue is one of a couple of things:
- Cost - "real" relational databases are expensive. I once had a conversation with someone who worked at Google, who talked about how much infrastructure they have written/built/maintain to deal with MySQL. Many of those problems were solved in an "enterprise" DBMS 3-10 years ago. However, the cost of implementing one of those enterprise DBMS is so high that it is cheaper to build application layer intelligence on top of a stupid RDBMS than purchase something that works out of the box
- Workload style - most of the literature around tuning DBMS is for OLTP or DSS workloads. Either small question, small response time (show me the five last things I bought from amazon.com) or big question, long response time (look through the last two years worth of shipping data and figure out where the best places to put our distribution centers would be). Many of these workloads are combos - there could be very large data sets and complex data interdependencies, with low latency requirements. It may be possible to write good SQL that does these things (in fact, I know a couple luminaries in the SQL space that will claim just that), but the community knowledge isn't there.
- Application development - when you're building your app from scratch, you can afford to work around "quirks" (bugs) and "gaps" (fatal flaws) to get what you need. This dovetails with the other issues, but when your core business is building infrastructure, it's worth your while to deal with this. When your core business is selling insurance or widgets, or whatever, it is not.
None of this is to say that the "nosql" movement is a bad thing, or that there's no reason for its existence, or that no one should bother looking at it. However, there is a definite trend of "this is so much better than SQL" for no good reason. SQL has scaled for years, and I know loads of companies who work with terabytes and terabytes of data on a single database without any issue.
A far more interesting discussion is the data warehouse appliance space - partitioning SQL down to a large number of small CPUs and pushing those as close to the disk as possible.

Re:bad design (Score:3, Insightful)

by kestasjk ( 933987 ) * writes: on Tuesday November 10, 2009 @02:24AM (#30042798) Homepage

They use bloom filters for messaging? What for?

Re:hmm (Score:3, Insightful)

by CaptainZapp ( 182233 ) * writes: on Tuesday November 10, 2009 @03:48AM (#30043134) Homepage

I have a feeling that this is part hype, part inept programmers who don't actually understand SQL or database optimization. The first sign for me that someone is selling bullshit is when they try to act like this is some never-before-seen problem, when in fact there is a good four decades of research on database optimization.
Thank you very much for this comment; you put it far more eloquently than the venting I was about to grace this thread with. The real kicker though is
There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.
This statement is just so full of shit. And the real larff riot, for me at least, is when people or shops employing MySQL (for heaven's sake!) make such statements.
Ej, folks: Rigid schema design is an asset, not a liability!

Re:hmm (Score:5, Insightful)

by QuoteMstr ( 55051 ) writes: <dan.colascione@gmail.com> on Tuesday November 10, 2009 @04:04AM (#30043190)

I think I'd rather see the opposite: That non-relational DBs become the mainstream, and they have SQL added for the odd occasion it is useful. Relational has some nice properties for ad-hoc querying, but for everything else they are a nuisance.
Berkeley DB [wikipedia.org] is a very good non-relational database with multiple language bindings, several storage engines, and transaction support. It's been around for 24 years, and has seen some appreciable use.
But that use was nothing compared to the database explosion that SQLite [sqlite.org] brought about when it was released. SQLite is almost exactly like Berkeley DB, except that it has a SQL engine on top. Almost everyone is using SQLite, and many Berkeley DB users are moving over to it.
Why? Because SQLite is relational! That constitutes some serious evidence that relational databases are more than "a nuisance".

Perhaps you just aren't that popular? (Score:1, Insightful)

by Anonymous Coward writes: on Tuesday November 10, 2009 @04:14AM (#30043232)

Has this ever occurred to you: Maybe people just choose not to answer you? :)

Re:Why worry? (Score:3, Insightful)

by Linker3000 ( 626634 ) writes: on Tuesday November 10, 2009 @05:27AM (#30043502) Journal

Oh Great! I have just migrated 5 offices from a veterinary management system based around Access 97 onto the new, MS-SQL-based one.
How can I expect to maintain my value to the company if they stick with old, reliable systems instead of moving onto more sophisticated 'solutions' that require a shit-load of tweaking and technical guesswork to keep them running smoothly?

Re:hmm (Score:2, Insightful)

by tkinnun0 ( 756022 ) writes: on Tuesday November 10, 2009 @05:44AM (#30043590)

What if you ARE dealing with massive data AND complex relationships?

Re:And I am missing it greatly on Linux (Score:3, Insightful)

by QuoteMstr ( 55051 ) writes: <dan.colascione@gmail.com> on Tuesday November 10, 2009 @06:08AM (#30043684)

One of the big attractions of using a database to store your information is having a consistent API for accessing your data. I'm not convinced that what you want, having both SQL and non-SQL methods to access the same dataset, is ever actually useful. The overhead SQL imposes is actually minuscule compared to the cost of data access itself.
If you go the Berkeley DB route, you're going to need to build an application-level data access layer anyway. If you have a complex query to perform, just do it through that access layer.
On the other hand, if you use a SQL engine, you can go "small and light" simply by using "small and light" queries. There's no particular reason you can't simply run SELECT * FROM mytable WHERE id=? repeatedly, incrementing id each time.
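As a rough sketch of that "small and light" usage, here is the same query issued repeatedly through Python's built-in sqlite3 module. The `payload` column, the sample rows, and the `fetch` helper are assumptions added for illustration; only `mytable`, `id`, and the query text come from the paragraph above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, payload) VALUES (?, ?)",
    [(i, f"record-{i}") for i in range(1, 6)],
)


def fetch(record_id):
    """Single-row lookup by primary key, used like a key-value get."""
    return conn.execute(
        "SELECT * FROM mytable WHERE id = ?", (record_id,)
    ).fetchone()


# Run the same parameterized statement repeatedly, incrementing id each time.
rows = [fetch(record_id) for record_id in range(1, 6)]
```

Because the statement text never changes and only the bound parameter does, the engine can reuse its parsed form, so the SQL layer adds little on top of the primary-key lookup itself.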

Re:bad design (Score:4, Insightful)

by Zombywuf ( 1064778 ) writes: on Tuesday November 10, 2009 @06:17AM (#30043718)

The problem is when people don't think about the solution and apply the cargo cult mentality. Facebook uses Eeeerlaaaang therefore we should. Facebook wrote its own database, therefore we should. People end up writing their own database engines that do exactly the same thing as modern relational engines, with all the bugs that were fixed in the relational engines 10 years ago (5 for Microsoft). Even MS SQL will split a large GROUP BY aggregate operation (which takes 3 lines to specify) across multiple CPUs by turning it into a map-reduce problem, and it will do this all without you having to be aware of it. Oracle (and many others, Oracle's is supposed to be the best) will maintain multiple concurrent versions of your data in order to allow multiple users to work with a snapshot that doesn't change under them while others are changing the data, and this happens transparently. You can go ahead and implement all this stuff yourself if you want, in C and sockets, call me when you're done, in 10-20 years.
The real issue I have with the NoSQL people is they're a bunch of whiny babies, who haven't even taken the time to understand the problem before lashing out at the first thing they see. Just the name tells you this, they call themselves "No SQL" and then lash out at relational databases. SQL is a terrible language, which really needs replacing, but it is only one possible language for querying relational databases. Relational databases represent several decades of research into how to query data in a fault tolerant scalable way as a standing implementation, re-implementing them is a waste of time.
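To show what "turning a GROUP BY into a map-reduce problem" means structurally, here is a hand-rolled sketch in Python. It is not how SQL Server or any other engine implements it; the `sales` data and function names are made up, and the thread pool only illustrates the shape of the plan (CPython threads do not give real CPU parallelism for this kind of work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up rows; the SQL equivalent is:
#   SELECT region, SUM(amount) FROM sales GROUP BY region
sales = [("east", 10), ("west", 5), ("east", 7),
         ("north", 3), ("west", 1), ("east", 2)]


def partial_sums(chunk):
    """Map step: aggregate one slice of the rows independently."""
    totals = Counter()
    for region, amount in chunk:
        totals[region] += amount
    return totals


def group_by_sum(rows, workers=2):
    """Split rows into chunks, aggregate each, then merge the partials."""
    size = max(1, -(-len(rows) // workers))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sums, chunks))
    merged = Counter()
    for partial in partials:  # reduce step: add up per-chunk totals
        merged.update(partial)
    return dict(merged)
```

`group_by_sum(sales)` gives `{"east": 19, "west": 6, "north": 3}`. Everything in this sketch is what the engine does behind the single GROUP BY statement, which is the point being made above.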

solution looking for a problem? (Score:4, Insightful)

by timmarhy ( 659436 ) writes: on Tuesday November 10, 2009 @07:33AM (#30044022)

SQL databases, if designed properly, DO handle enormous datasets. The problem starts when you have wits designing the database and then managers attempting to use the DB for purposes it wasn't meant for.

Re:bad design (Score:3, Insightful)

by QuoteMstr ( 55051 ) writes: <dan.colascione@gmail.com> on Tuesday November 10, 2009 @08:14AM (#30044208)

What makes you think that relational calculus can't be extended to support spatial information [mysql.com]? After all, it's just another kind of index.

Re:hmm (Score:4, Insightful)

by geminidomino ( 614729 ) * writes: on Tuesday November 10, 2009 @08:41AM (#30044354) Journal

It's not without precedent. Drop all the features of SQL databases that make them a good idea and you end up with MySQL.
(Burn, baby, burn)

Re:bad design (Score:3, Insightful)

by jcnnghm ( 538570 ) writes: on Tuesday November 10, 2009 @09:00AM (#30044466)

Yes it does (look through 50TB of data), and how would you design it?
When a user posts a message, I would have the web server pass the message to a server that listens for messages that are being sent. That server would collect the mail, then place it as a payload package in the messaging queue once either a fixed number of mail recipients (probably around 500) has accumulated or a fixed time (probably 500 ms) has passed, whichever comes first. When the payload reaches the front of the queue, the messaging server working on the payload would parse through all the messages building a model of all the data it needs to render all of the messages. It would then send a low priority FQL multiquery requesting all of the data it needs to render and send all of the requests. From there, the messaging server would render both the updated view of the mail when viewing the thread, and view of the thread when viewing the inbox. These would be passed to a persistent memcached setup.
An FQL query would be generated for each user that would increment their inbox message counter, remove the memcached key of the old thread preview from the array of keys representing their inbox while prepending the new key to the array, and append the key to the array representing the thread. When this was assembled for all mail, another low priority multi-query would be sent committing this change.
At this point I'd purge the old thread preview keys from the persistent memcached setup, and store the raw data in a table indexed by both the thread preview key, and the mail view key. The raw data would be stored in case a design change ever necessitated re-rendering all of the mail, or in the case of a user name change.
Finally, I would generate and send an e-mail to each user telling them they have a new message.
This is complex, but it also means that to render an inbox, the only thing that has to be done is to retrieve the array of message thread preview keys, and request each thread preview by key from memcached. Of course, this collection could also be cached.
Note: I intentionally left out some things in the interest of time, like sent message display, read and unread flagging, spam filtering, new message highlighting, and I'm sure others. It shouldn't be difficult to see how this basic model can be expanded to cover these cases.
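A stripped-down sketch of the read path described above, in Python. A plain dict stands in for memcached, there is no queue, batching, or FQL, and every name here (`cache`, `inbox_keys`, `deliver`, `render_inbox`, the key format) is hypothetical. It only shows the core trick: render previews at write time, keep an ordered array of preview keys per user, and make rendering the inbox a series of key lookups.

```python
from collections import defaultdict

cache = {}                      # stand-in for the persistent memcached setup
inbox_keys = defaultdict(list)  # user -> preview keys, newest first
_versions = defaultdict(int)    # (user, thread) -> current preview version


def deliver(recipient, thread_id, sender, body):
    """Write path: pre-render the thread preview and update the key array."""
    _versions[(recipient, thread_id)] += 1
    version = _versions[(recipient, thread_id)]
    new_key = f"preview:{recipient}:{thread_id}:v{version}"
    old_key = f"preview:{recipient}:{thread_id}:v{version - 1}"

    cache[new_key] = f"{sender}: {body[:40]}"  # rendered once, at write time

    keys = inbox_keys[recipient]
    if old_key in keys:
        keys.remove(old_key)    # drop the stale preview of this thread
    keys.insert(0, new_key)     # prepend the fresh one
    cache.pop(old_key, None)    # purge the old rendered preview


def render_inbox(user):
    """Read path: fetch each pre-rendered preview by key, nothing else."""
    return [cache[key] for key in inbox_keys[user]]


deliver("alice", "t1", "bob", "hi")
deliver("alice", "t2", "carol", "lunch?")
deliver("alice", "t1", "bob", "are you there")
```

After those three deliveries, `render_inbox("alice")` returns `["bob: are you there", "carol: lunch?"]`: the updated thread moves to the top and the read touches no message data at all.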

Re:Dynamic Relational: change it, DON'T toss it (Score:3, Insightful)

by pla ( 258480 ) writes: on Tuesday November 10, 2009 @09:17AM (#30044570) Journal

RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array). Non-existent columns in any given row are treated as Null/empty instead of an error. Perhaps tables can also be created just by inserting a row into the (new) target table. No need for explicit schema management.

Aaaaaaaand, congratulations, you've described "fixing" the problem of schema flexibility by using an RDBMS as a non-relational flat hashed memory storage area, with at least three layers of indirection (not even counting underlying complexity of the DB engine itself).

Aside from why the hell you would ever do this in favor of, y'know, just using a flat block of real memory (since you've given the application the fun task of memory management below what the OS usually handles, with all the overhead of framing each read or write as an SQL query)... Well, no. I have no aside, just what I've written.

Sorry, I'll grant that you have a clever solution to a problem, but a far more effective solution would throw away the problem itself and not try to frame everything in terms of DBM - Kinda like Amazon did.
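For reference, the semantics in the quoted proposal (rows as maps, missing columns read as null, tables created on first insert) fit in a few lines of Python once the database is taken out of the picture. The `DynamicStore` class below is purely hypothetical and is offered only to show how little is left of "relational" at that point.

```python
from collections import defaultdict


class DynamicStore:
    """Rows are plain maps; tables appear on first insert; no schema."""

    def __init__(self):
        self._tables = defaultdict(list)

    def insert(self, table, **columns):
        # Inserting into an unknown table silently creates it.
        self._tables[table].append(dict(columns))

    def select(self, table, *columns):
        # A column a row never set reads as None instead of raising.
        rows = self._tables.get(table, [])
        return [tuple(row.get(col) for col in columns) for row in rows]


store = DynamicStore()
store.insert("pets", name="Rex", species="dog")
store.insert("pets", name="Tom")  # no species given
```

`store.select("pets", "name", "species")` returns `[("Rex", "dog"), ("Tom", None)]`, and selecting from a table that was never written returns an empty list. It is a hash map of lists of hash maps, with none of the constraints or joins an RDBMS would add.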

Re:I/O bottleneck (Score:3, Insightful)

by cervo ( 626632 ) writes: on Tuesday November 10, 2009 @09:50AM (#30044838) Journal

No offense, but you probably have no idea what you are talking about. MS-SQL is a relatively solid product. SQL Server 2000 and SQL Server 2005 are pretty stable and can easily handle rather large data sets (in the TB). Of all the Microsoft products, personally Visual Studio and SQL Server are my favorites. I like PostgreSQL as well, so I'm not strictly a Microsoft fan. But an awful lot of companies are realizing that MS SQL can manage their data much cheaper than Oracle can. Of course PostgreSQL can do it even cheaper...but many companies like to pay $$ to sleep better at night.

Re:bad design (Score:2, Insightful)

by Hal_Porter ( 817932 ) writes: on Tuesday November 10, 2009 @09:52AM (#30044848)

For something on the scale of Facebook, 5 GB of wasted overhead for the chat system would not scare me.
It's not about the cost of the memory. Big systems tend to run more slowly because of locality effects. Systems running byte code run more slowly too. I think the Facebook guys have been saved by big ass hardware, not an efficient design.

Re:hmm (Score:3, Insightful)

by mbourgon ( 186257 ) writes: on Tuesday November 10, 2009 @10:54AM (#30045568) Homepage

Mod parent up. That way you're not dealing with the statements themselves, just the data. And you can add the UserID to the Audit table - then find the most recent row for that particular person, or get the most recent row for each ID and apply that.
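The two lookups mentioned here might look like the following, sketched with Python's built-in sqlite3 module. The parent post is not shown, so the `audit` table layout, its column names, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit (
        audit_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        record_id INTEGER NOT NULL,
        user_id   TEXT NOT NULL,
        new_value TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO audit (record_id, user_id, new_value) VALUES (?, ?, ?)",
    [(1, "ann", "draft"), (2, "bob", "open"), (1, "bob", "final"),
     (2, "ann", "closed"), (1, "ann", "archived")],
)

# Most recent audit row for each record ID.
latest_per_record = conn.execute("""
    SELECT a.record_id, a.user_id, a.new_value
    FROM audit AS a
    JOIN (SELECT record_id, MAX(audit_id) AS last_id
          FROM audit GROUP BY record_id) AS m
      ON a.audit_id = m.last_id
    ORDER BY a.record_id
""").fetchall()

# Most recent audit row written by one particular person.
latest_by_ann = conn.execute(
    "SELECT record_id, new_value FROM audit "
    "WHERE user_id = ? ORDER BY audit_id DESC LIMIT 1",
    ("ann",),
).fetchone()
```

Here `latest_per_record` comes out as `[(1, "ann", "archived"), (2, "ann", "closed")]` and `latest_by_ann` as `(1, "archived")`. The audit table holds data rows rather than SQL statements, so both questions are ordinary queries.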

