
Digg Says Yes To NoSQL Cassandra DB, Bye To MySQL 271

Posted by timothy
from the can-you-trust-a-db-by-that-name? dept.
donadony writes "After Twitter, it's now Digg that has decided to replace MySQL and most of their infrastructure components and move away from LAMP to another architecture called NoSQL that is based on Cassandra, an open source project that develops a highly scalable second-generation distributed database. Cassandra was open sourced by Facebook in 2008 and is licensed under the Apache License. The reason for this move, as explained by Digg, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced them into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."

  • Reddit (Score:4, Informative)

    by Gudeldar (705128) on Friday March 12, 2010 @10:54PM (#31460748)
    Reddit also recently switched [reddit.com] to Cassandra.
  • by h4rr4r (612664) on Friday March 12, 2010 @11:34PM (#31461092)

    Postgres, for people who care about their data.

  • by h4rr4r (612664) on Friday March 12, 2010 @11:35PM (#31461110)

    Fits, before that mysql was the best way to store data no one cared about.

  • by QuoteMstr (55051) <dan.colascione@gmail.com> on Friday March 12, 2010 @11:47PM (#31461190)

    The page you cited, on column-oriented databases, describes an implementation strategy that's applicable to many types of databases. There are database engines that present a perfectly normal SQL interface to a column store, and there's actually a direct link to LucidDB [wikipedia.org] from the article. Likewise, there's nothing stopping a Cassandra-like database from serializing its on-disk bits the other way around.

    Column-orientation has nothing to do with the "NoSQL" databases that are in vogue. It's completely orthogonal. You're talking about using vectors or linked lists when everyone else is arguing over whether to serialize data with XML or JSON.

  • Re:Good for them (Score:4, Informative)

    by QuoteMstr (55051) <dan.colascione@gmail.com> on Saturday March 13, 2010 @12:00AM (#31461274)

    But since most developers model their domains Object Oriented, why is MySql the default choice for any small application? Why not a document database or a native oo one?

    The relational model is consistent and easy to work with. It's easy to specify constraints that describe what the data should look like, and to allow several applications to interact with the data. It's also easier to optimize a database when you can describe discrete queries instead of directly following links from program code as you would in a navigational/object/document/etc. database.

    Furthermore, application data models aren't all that object-oriented. Most of the time, the manipulated data types (say, "story", "post", and "user") fall into well-defined categories that correspond well to rows in a table. The few mismatches are easily dealt with in application code.

    Sure, using an object database might be "easier" for the first 15 minutes, but you'll kick yourself when you have to manipulate it in any kind of sophisticated fashion.
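
A minimal sketch of the point above about constraints, using Python's built-in sqlite3 module rather than MySQL or PostgreSQL. The `users` and `stories` tables, their columns, and the `try_insert` helper are hypothetical names made up for illustration; they are not from the thread.

```python
import sqlite3

def build_schema(conn):
    """Create hypothetical users/stories tables with declarative constraints."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE stories ("
        " id INTEGER PRIMARY KEY,"
        " author_id INTEGER NOT NULL REFERENCES users(id),"
        " title TEXT NOT NULL,"
        " diggs INTEGER NOT NULL DEFAULT 0 CHECK (diggs >= 0))"
    )

def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
build_schema(conn)

ok_user = try_insert(conn, "INSERT INTO users (id, name) VALUES (?, ?)", (1, "alice"))
ok_story = try_insert(
    conn, "INSERT INTO stories (author_id, title) VALUES (?, ?)", (1, "Hello")
)
# A story whose author does not exist violates the foreign key.
orphan = try_insert(
    conn, "INSERT INTO stories (author_id, title) VALUES (?, ?)", (99, "No author")
)
# A negative digg count violates the CHECK constraint.
negative = try_insert(
    conn,
    "INSERT INTO stories (author_id, title, diggs) VALUES (?, ?, ?)",
    (1, "Bad", -5),
)
story_count = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
```

The database rejects the bad rows itself, so every application that talks to it gets the same guarantees without repeating the checks in its own code.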

  • by RelliK (4466) on Saturday March 13, 2010 @12:13AM (#31461362)

    Go with PostgreSQL. Reliable, standards-compliant, fast.

  • by prockcore (543967) on Saturday March 13, 2010 @12:20AM (#31461406)

    Reddit also switched from memcachedb to Cassandra for their kvstore. From research to launch took 10 days.

  • by Tablizer (95088) on Saturday March 13, 2010 @01:38AM (#31461814) Homepage Journal

    Sigh. Most people seem to be stuck on following trends--in pretty much every aspect of their lives. Why think when you can conform to the crowd?

    One can potentially make good money surfing bullshit. It's like the dot-com bubble: get in early, lie about your ability, rake in big bucks, and then get out and move on to the next hype bubble while the last one crashes on those left holding the bag.

    However, I do believe there's perhaps a place for big non-relational databases. They tend to be single-purpose, and suit situations where few will care much if a few records are lost every week or so. If you have a million customers who only make money for you from occasional ad clicks, then losing a few dozen due to lack of A.C.I.D. is not going to be a bottleneck from a business standpoint. And the info can be delay-copied into an RDBMS where traditional statistics and reports can be done.

  • by Neoncow (802085) on Saturday March 13, 2010 @01:49AM (#31461860) Journal

    The reddit blog discussed the issue recently.

    They claim it is not an EC2 issue [reddit.com], but simply the site getting bigger than it was designed to.

    Their latest entry [reddit.com] discusses why they switched to Cassandra. I guess we'll wait for next week to see if the expected performance benefits materialize.

  • by Anonymous Coward on Saturday March 13, 2010 @02:35AM (#31462094)

    i can't tell from the 4 lines of text buried in ads that is this supposed article, but i'm guessing this "nosql" still uses an sql database backend?

    and why wouldn't a relational database system not be perfect for facebook?

    1) NoSQL databases are just that NO SQL, there is no relational database involved.

    2) No, relational models are not good for Facebook-style data. Facebook uses a lot of trees, networks and graphs, none of which are easy to store in a relational system. Facebook also has a lot of dynamic schema requirements, and again SQL does not cope with this well. At the scale that Facebook operates at, they are forced to use techniques like sharding and partitioning of their data sets, at which point a lot of what makes the relational model useful becomes difficult to use, e.g. joins across database servers are really hard to do.
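
A rough sketch of why sharding makes joins awkward, assuming a toy setup in which plain Python dicts stand in for separate database servers. The shard count, the `shard_for` function and the user/post layout are all hypothetical and are not a description of Facebook's or Digg's actual systems.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(key):
    """Pick a shard with a stable hash of the key (crc32, so it is repeatable)."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SHARDS

# Each dict stands in for a separate database server.
user_shards = [dict() for _ in range(NUM_SHARDS)]
post_shards = [dict() for _ in range(NUM_SHARDS)]

def put_user(user_id, name):
    user_shards[shard_for(user_id)][user_id] = {"name": name}

def put_post(post_id, author_id, text):
    # Posts are sharded by post id, so one author's posts are scattered.
    post_shards[shard_for(post_id)][post_id] = {"author_id": author_id, "text": text}

def posts_with_author_names():
    """What a posts-to-users JOIN turns into once the data is sharded:
    scan every post shard, then look up each author on whichever
    shard holds that user."""
    rows = []
    for shard in post_shards:
        for post_id, post in shard.items():
            author_id = post["author_id"]
            author = user_shards[shard_for(author_id)][author_id]
            rows.append((post_id, author["name"], post["text"]))
    return sorted(rows)

put_user(1, "alice")
put_user(2, "bob")
put_post(10, 1, "first")
put_post(11, 2, "second")
put_post(12, 1, "third")
joined = posts_with_author_names()
```

On real servers every lookup in that loop is a network round trip, and no single database can plan or optimize the join, which is the overhead the summary complains about.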

  • by Billly Gates (198444) on Saturday March 13, 2010 @02:51AM (#31462152) Journal

    PostgreSQL is a real relational database that supports views, nested SQL, triggers, foreign keys, and even statistical analysis.

    I think MySQL supports foreign keys now, and my info might be dated. But if a database does not support foreign keys then it's not a real relational database, and MySQL had that problem for years [slashdot.org].

    Once you switch over, you'll find that processor-intensive tasks that took minutes can be done in seconds with the features I described above. You can save a lot of time on complex queries with PostgreSQL.

  • by alexkorban (627031) on Saturday March 13, 2010 @03:58AM (#31462404) Homepage
    I have worked with large PostgreSQL databases (150GB or so) and really, Postgres isn't a solution. You run into issues anyway when some of your tables contain millions or even billions of rows. At that stage things like vacuuming or altering the schema start to become damn near impossible, and even querying starts to become a bottleneck.

    Now how do you scale that if your database is still growing? Postgres doesn't have a decent clustering solution that I know of, so your options are either to roll your own, or to scale vertically. Both of those are expensive options.

    Based on my experience, I don't think that relational databases are appropriate for really large databases, and at present the only realistic option is horizontal scaling which is a lot easier with things like Cassandra or MongoDB.
  • by ducomputergeek (595742) on Saturday March 13, 2010 @04:48AM (#31462582)

    When you're dealing with the TB/PB range, you call Teradata. At last check they handle 4 of the 5 largest databases in the world, including eBay/PayPal's 13PB monster and Walmart.

  • It's "Not Only SQL" (Score:4, Informative)

    by Otis_INF (130595) on Saturday March 13, 2010 @05:02AM (#31462620) Homepage

    The 'n' stands for 'Not' and the 'o' stands for 'Only', so it's wrong to read it as NO SQL; it should be seen as Not Only SQL. In other words: not a move away from SQL, but exploring other options besides SQL.

  • by Anonymous Coward on Saturday March 13, 2010 @05:28AM (#31462690)

    Scale. Getting MySQL to survive in a world where you need hundreds of machines to host the data layer and manage the 100k+ operations per second is actually quite hard. The replication layer is laughable at best, and fault tolerance towards disk and machine failure is awful.

    These systems have no SQL in them at all. Hence the name. In Cassandra you have what amounts in Python to a set of dict objects. It's a large, hashed table that stores key-value objects.
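
To make the "set of dict objects" description concrete, here is a simplified sketch of that mental model in plain Python. It is not the Cassandra client API: the `keyspace` variable and the `insert` and `get` helpers are hypothetical, and the timestamp check is only a stand-in for Cassandra's timestamp-based conflict resolution.

```python
import time

# column family -> row key -> column name -> (value, timestamp)
keyspace = {"Users": {}}

def insert(column_family, row_key, column, value, timestamp=None):
    """Write one column; the write with the newest timestamp wins."""
    ts = time.time() if timestamp is None else timestamp
    row = keyspace[column_family].setdefault(row_key, {})
    current = row.get(column)
    if current is None or ts >= current[1]:
        row[column] = (value, ts)

def get(column_family, row_key, column):
    """Read one column value, or None if the row or column is absent."""
    cell = keyspace[column_family].get(row_key, {}).get(column)
    return None if cell is None else cell[0]

insert("Users", "alice", "email", "old@example.com", timestamp=1)
insert("Users", "alice", "email", "new@example.com", timestamp=2)
# A late-arriving write with an older timestamp is ignored.
insert("Users", "alice", "email", "stale@example.com", timestamp=1)
# Rows in the same column family need not share columns.
insert("Users", "bob", "karma", 42, timestamp=1)
```

There is no schema and no query planner here: everything is a lookup by row key, and anything fancier has to be written in application code.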

  • Re:"NoSQL"? (Score:3, Informative)

    by shic (309152) on Saturday March 13, 2010 @05:28AM (#31462696)

    Second, there's nothing wrong with SQL as a language

    I beg to differ - SQL is preposterously baroque!

    That said, if your problem is of a particular kind, it is a perfectly reasonable, practical solution.

  • Re:"NoSQL"? (Score:1, Informative)

    by Anonymous Coward on Saturday March 13, 2010 @06:00AM (#31462778)

    Many of the systems that support SQL as a wrapper do so at a much lower scalability and performance rating.

    The reason it's NoSQL vs SQL is that SQL comes with a mindset of "complicated queries." When you say SQL, people think of transactions, joins, wheres and such. NoSQL is by design far simpler than that. It's pushing the complexity into the application layer, and as such it must be thought of as something that is specifically not SQL.

  • by roman_mir (125474) on Saturday March 13, 2010 @06:12AM (#31462818) Homepage Journal

    I just read your comment and checked the PostgreSQL DB I am working with. It's only 1.7GB at this point, but growing, and the largest table has 12.6 million rows. This DB is heavily used by a number of background processes, which select, insert, update and delete large volumes of data, and by 14 people at this point, who each run about 400 various reports per day as well as updating some data. The average time that a single user has to wait is 6 seconds per report. Those reports are optimized, of course, but they normally span between 1 day and one month worth of sales data, the average being 1 week, while in a day there are on average 5000 sales (the DB grows by that number of sales a day, plus various other product data, client data etc.). The DB is on a single quad-core Intel 5504 with 12GB of RAM and RAID 1 on two of Intel's 160GB X25 SSDs, on a Gigabit network. It is used by the app server, which has 2 x quad-core Intel 5405s, 16GB RAM, Java 6 and Tomcat 6 for the front end, with a number of back end systems also talking to the DB from the app server.

    My point is that for this given setup, PostgreSQL is showing good performance, however I am sure there are differences in the data model setup that really can kill or make the DB work.

  • by alexkorban (627031) on Saturday March 13, 2010 @06:33AM (#31462888) Homepage
    Oh, absolutely, I'm not surprised that your setup works well, Postgres is a great RDBMS. Of course, how you design your schema matters a great deal too.

    But here is another issue I thought of: backup. For our database it was 24 hours to do a full restore, which isn't practical. The only reasonable solution I know is to use replication, which is a nuisance with Postgres and adds maintenance overhead (keeping the schemas in sync). I'd prefer to have built-in redundancy. Again, I think you get that with Cassandra and MongoDB.

    I guess in a few years we'll probably end up with something that combines good properties of both key-value stores (redundancy and scalability) and RDBMS (powerful query language, transactions).
  • by maxume (22995) on Saturday March 13, 2010 @09:51AM (#31463574)

    Or you could just sporge some jargonistic keywords together in an attempt to advertise your get-rich-slowly scheme.

  • by jbellis (142590) <jonathan@carnage ... r.com minus city> on Saturday March 13, 2010 @10:51AM (#31463898) Homepage

    Teradata and the other big relational DB products (Vertica, Greenplum, etc.) are all _analytical_ databases, designed for small numbers of complex queries, where adding new data to the system takes minutes if not hours. They are completely unsuitable for running a live application against.

  • by DarkOx (621550) on Saturday March 13, 2010 @12:19PM (#31464420) Journal

    You need a good RDBMS engine, and as much as people pooh-pooh MSSQL Server, it's a good engine. I have used it for databases in the 150TB range. If you do your schema right, build your indexes correctly, and plan your partitions and file groups well, you can get great performance out of affordable hardware. Now, you do need to maintain this thing, or develop the automation around building those partitions and moving data into and out of them based on tombstones or some other criteria, or you get underwater real fast.

    I don't care what technology you pick; if you are going to deal with that much data you need to:
    1. Understand the problem well.
    2. Spend the time with whatever tools you select to really understand how they work, and build whatever you need to fill in where they are deficient.

    When you start doing anything that big, it's not plug and play anymore, no matter how you go about it.

  • by Bill, Shooter of Bul (629286) on Saturday March 13, 2010 @04:58PM (#31466604) Journal

    While insightful and informative in its own right, that isn't a logical response to my post.

    He was asking for an alternative to MySQL. I was pointing out that moving from MySQL to PostgreSQL was not done by large companies with a lot of smart people working for them, because any performance improvements were not worth it. PostgreSQL's vertical and horizontal scalability did not represent an improvement over MySQL. I didn't even mention vertical vs horizontal scalability. In the end you end up with a raw number saying we can handle X many requests in our total system, regardless of the individual performance numbers of any part of the system.

    You're right, he probably isn't the lead engineer of Flickr and probably doesn't need Cassandra's power, but I think it really says something that while a lot of these companies are switching away from MySQL, they aren't switching towards PostgreSQL. But as always, anyone considering any kind of switch must do their due diligence in assessing the potential performance improvements of any new solution.
