Digg Says Yes To NoSQL Cassandra DB, Bye To MySQL
donadony writes "After Twitter, now it's Digg who's decided to replace MySQL and most of their infrastructure components and move away from LAMP to another architecture called NoSQL that is based on Cassandra, an open source project that develops a highly scalable second-generation distributed database. Cassandra was open sourced by Facebook in 2008 and is licensed under the Apache License. The reason for this move, as explained by Digg, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced them into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."
Reddit (Score:4, Informative)
Re:Which DB is better? (Score:5, Informative)
Postgres, for people who care about their data.
Re:Facebook, Twitter and now Digg (Score:4, Informative)
Fits, before that mysql was the best way to store data no one cared about.
Re:Which DB is better? (Score:4, Informative)
The page you cited, on column-oriented databases, describes an implementation strategy that's applicable to many types of databases. There are database engines that present a perfectly normal SQL interface to a column store, and there's actually a direct link to LucidDB [wikipedia.org] from the article. Likewise, there's nothing stopping a Cassandra-like database from serializing its on-disk bits the other way around.
Column-orientation has nothing to do with the "NoSQL" databases that are in vogue. It's completely orthogonal. You're talking about using vectors or linked lists when everyone else is arguing over whether to serialize data with XML or JSON.
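To make the point concrete, here is a minimal Python sketch showing that row versus column orientation is a storage-layout choice, not a data-model choice. The records and field names are invented for illustration; the same logical table is held both ways and the same query returns the same answer.

```python
# Hypothetical records; the field names are made up for this example.
rows = [
    {"id": 1, "title": "Digg moves to Cassandra", "diggs": 300},
    {"id": 2, "title": "Postgres release notes", "diggs": 42},
    {"id": 3, "title": "MySQL replication tips", "diggs": 17},
]

# Row-oriented layout: one record per entry (what `rows` already is).
# Column-oriented layout: one list per field, aligned by position.
columns = {name: [record[name] for record in rows] for name in rows[0]}


def total_diggs_row_store(table):
    # Walks every whole record just to read a single field.
    return sum(record["diggs"] for record in table)


def total_diggs_column_store(table):
    # Reads only the one column it needs.
    return sum(table["diggs"])


print(total_diggs_row_store(rows), total_diggs_column_store(columns))
```

Either layout could sit behind a SQL interface or behind a key-value API; the query interface does not care which one is used on disk.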
Re:Good for them (Score:4, Informative)
The relational model is consistent and easy to work with. It's easy to specify constraints that describe what the data should look like, and to allow several applications to interact with the data. It's also easier to optimize a database when you can describe discrete queries instead of directly following links from program code as you would in a navigational/object/document/etc. database.
Furthermore, application data models aren't all that object-oriented. Most of the time, the manipulated data types (say, "story", "post", and "user") fall into well-defined categories that correspond well to rows in a table. The few mismatches are easily dealt with in application code.
Sure, using an object database might be "easier" for the first 15 minutes, but you'll kick yourself when you have to manipulate it in any kind of sophisticated fashion.
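As a rough illustration of the constraints and discrete queries mentioned above, here is a small sketch using Python's built-in sqlite3 module. The table and column names ("users", "stories", "score") are assumptions made up for the example, and SQLite merely stands in for whichever relational engine you prefer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE stories (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        title   TEXT NOT NULL,
        score   INTEGER NOT NULL DEFAULT 0 CHECK (score >= 0)
    );
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO stories (user_id, title, score) VALUES (1, 'Hello', 5)")


def try_insert(sql):
    """Return True if the database accepted the row, False if a constraint rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# The database, not each application, rejects bad data.
orphan_ok = try_insert("INSERT INTO stories (user_id, title) VALUES (99, 'No such user')")
negative_ok = try_insert("INSERT INTO stories (user_id, title, score) VALUES (1, 'Bad', -1)")

# A declarative query: the engine decides how to perform the join.
top = conn.execute("""
    SELECT users.name, stories.title
    FROM stories JOIN users ON users.id = stories.user_id
    ORDER BY stories.score DESC
""").fetchall()
print(orphan_ok, negative_ok, top)
```

Every application that talks to this database gets the same guarantees without re-implementing the checks itself.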
Re:Which DB is better? (Score:5, Informative)
Go with PostgreSQL. Reliable, standards-compliant, fast.
Re:Facebook, Twitter and now Digg (Score:3, Informative)
Reddit also switched from memcachedb to Cassandra for their kvstore. From research to launch took 10 days.
Re:Allergic reaction to MySQL (Score:2, Informative)
One can potentially make good money surfing bullshit. It's like the dot-com bubble: get in early, lie about your ability, rake in big bucks, and then get out and move on to the next hype bubble while the last one crashes on those left holding the bag.
However, I do believe there's perhaps a place for big non-relational databases. They tend to be single-purpose and suited to situations where few will care much if a few records are lost every week or so. If you have a million customers who only make money for you from occasional ad clicks, then losing a few dozen due to lack of A.C.I.D. is not going to be a bottleneck from a business standpoint. And the info can be delay-copied into an RDBMS where traditional statistics and reports can be done.
Re:Reddit's reliability has been shitty lately. (Score:4, Informative)
The reddit blog discussed the issue recently.
They claim it is not an EC2 issue [reddit.com], but simply the site getting bigger than it was designed to.
Their latest entry [reddit.com] discusses why they switched to Cassandra. I guess we'll wait for next week to see if the expected performance benefits materialize.
Re:so does it use sql or not? (Score:3, Informative)
i can't tell from the 4 lines of text buried in ads that is this supposed article, but i'm guessing this "nosql" still uses an sql database backend?
and why wouldn't a relational database system not be perfect for facebook?
1) NoSQL databases are just that: no SQL. There is no relational database involved.
2) No, relational models are not a good fit for Facebook-style data. Facebook uses a lot of trees, networks and graphs, none of which are easy to store in a relational system. Facebook also has a lot of dynamic schema requirements, which SQL again does not cope with well. And at the scale Facebook operates at, they are forced to use techniques like sharding and partitioning of their data sets, at which point a lot of what makes the relational model useful becomes difficult to use; joins across database servers, for example, are really hard to do.
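A toy Python sketch of why sharding makes joins painful. Everything here is an assumption for illustration: the shard count, the hash function, and the in-memory dicts standing in for separate database servers. It is not how Facebook or any particular system does it.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count for the example

# Each "shard" stands in for a separate database server.
shards = [{"users": {}, "posts": []} for _ in range(NUM_SHARDS)]


def shard_for(user_id):
    # Stable hash so a given user always maps to the same shard.
    return shards[zlib.crc32(str(user_id).encode()) % NUM_SHARDS]


def add_user(user_id, name):
    shard_for(user_id)["users"][user_id] = name


def add_post(user_id, text):
    # Posts are stored on the same shard as their author.
    shard_for(user_id)["posts"].append({"user_id": user_id, "text": text})


def feed(friend_ids):
    """What would be one JOIN on a single server becomes a fan-out in application code."""
    result = []
    for friend_id in friend_ids:
        shard = shard_for(friend_id)  # potentially a different server each time
        name = shard["users"][friend_id]
        for post in shard["posts"]:
            if post["user_id"] == friend_id:
                result.append((name, post["text"]))
    return result


for uid, name in [(1, "alice"), (2, "bob"), (3, "carol")]:
    add_user(uid, name)
for uid, text in [(1, "hi"), (2, "hello"), (3, "hey")]:
    add_post(uid, text)

print(feed([1, 3]))
```

On one server, `feed` would be a single join between users and posts; once the rows live on different machines, the application has to do the lookup and merge itself.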
Re:Which DB is better? (Score:4, Informative)
PostgreSQL is a real relational database that supports views, nested SQL, triggers, foreign keys, and even statistical analysis.
I think MySQL supports foreign keys now, so my info might be dated. But if a database does not support foreign keys then it's not a real relational database, and MySQL had that problem for years [slashdot.org].
Once you switch over, you will find that processor-intensive tasks that took minutes can be done in seconds with the PostgreSQL features I described above. You can save a lot of time on complex queries with PostgreSQL.
Re:Which DB is better? (Score:4, Informative)
Now how do you scale that if your database is still growing? Postgres doesn't have a decent clustering solution that I know of, so your options are either to roll your own, or to scale vertically. Both of those are expensive options.
Based on my experience, I don't think that relational databases are appropriate for really large databases, and at present the only realistic option is horizontal scaling which is a lot easier with things like Cassandra or MongoDB.
Re:Allergic reaction to MySQL (Score:4, Informative)
When you're dealing with the TB/PB range, you call Teradata. At last check they handle 4 of the 5 largest databases in the world, including eBay/PayPal's 13PB monster and Walmart's.
It's "Not Only SQL" (Score:4, Informative)
The 'N' stands for 'Not' and the 'o' stands for 'Only', so it's wrong to read it as NO SQL; it should be read as Not Only SQL. In other words: not a move away from SQL, but exploring other options besides SQL.
Re:so does it use sql or not? (Score:1, Informative)
Scale. Getting MySQL to survive in a world where you need hundreds of machines to host the data layer and handle 100k+ operations per second is actually quite hard. The replication layer is laughable at best, and fault tolerance for disk and machine failure is awful.
These systems have no SQL in them at all. Hence the name. In Cassandra you have what amounts in Python to a set of dict objects. It's a large hashed table that stores key-value objects.
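The "set of dict objects" picture can be sketched in a few lines of Python. This is only a rough mental model built from plain dicts, not Cassandra's actual API; the `insert` and `get` helpers and the column family names are hypothetical.

```python
# Mental model only: keyspace -> column family -> row key -> {column name: value}
keyspace = {"Users": {}, "Stories": {}}


def insert(column_family, row_key, columns):
    # Rows need not share a schema: each row holds whatever columns it is given.
    keyspace[column_family].setdefault(row_key, {}).update(columns)


def get(column_family, row_key, column=None):
    # Everything is a lookup by key; there is no query language and no join.
    row = keyspace[column_family].get(row_key, {})
    return row if column is None else row.get(column)


insert("Users", "alice", {"email": "alice@example.com"})
insert("Users", "alice", {"karma": 12})  # adds a column to an existing row
insert("Users", "bob", {"email": "bob@example.com", "location": "SF"})

print(get("Users", "alice"))
print(get("Users", "bob", "location"))
```

What the real system adds on top of this picture is distribution: the row keys are hashed across many machines and replicated.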
Re:"NoSQL"? (Score:3, Informative)
Second, there's nothing wrong with SQL as a language
I beg to differ - SQL is preposterously baroque!
That said, if your problem is of a particular kind, it is a perfectly reasonable, practical solution.
Re:"NoSQL"? (Score:1, Informative)
Many of the systems that support SQL as a wrapper do so at a much lower scalability and performance rating.
The reason it's NoSQL vs SQL is that SQL comes with a mindset of "complicated queries." When you say SQL, people think of transactions, joins, WHERE clauses and such. NoSQL is by design far simpler than that. It pushes the complexity into the application layer, and as such it must be thought of as something that is specifically not SQL.
Re:Which DB is better? (Score:3, Informative)
I just read your comment and checked the PostgreSQL DB I am working with. It's only 1.7GB at this point, but growing, and the largest table has 12.6 million rows. This DB is heavily used by a number of background processes, which select, insert, update and delete large volumes of data, and by 14 people at this point, who each run about 400 various reports per day as well as updating some data. The average time that a single user has to wait is 6 seconds per report. Those reports are optimized of course, but they normally span between one day and one month worth of sales data, the average being one week, while in a day there are on average 5000 sales (the DB grows by that number of sales a day, plus various other product data, client data etc.). The DB is on a single quad-core 5504 Intel with 12GB of RAM, RAID 1 on two of Intel's 160GB X25 SSDs, and a Gigabit network. This DB is used by the app server, which has 2 x quad-core 5405 Intels, 16GB RAM, and Java 6 and Tomcat 6 for the front end, with a number of back-end systems also talking to the DB from the app server.
My point is that for this given setup, PostgreSQL is showing good performance, however I am sure there are differences in the data model setup that really can kill or make the DB work.
Re:Which DB is better? (Score:2, Informative)
But here is another issue I thought of: backup. For our database it was 24 hours to do a full restore, which isn't practical. The only reasonable solution I know is to use replication, which is a nuisance with Postgres and adds maintenance overhead (keeping the schemas in sync). I'd prefer to have built-in redundancy. Again, I think you get that with Cassandra and MongoDB.
I guess in a few years we'll probably end up with something that combines good properties of both key-value stores (redundancy and scalability) and RDBMS (powerful query language, transactions).
Re:Seems odd to be keeping PhP (Score:2, Informative)
Or you could just sporge some jargonistic keywords together in an attempt to advertise your get-rich-slowly scheme.
Re:Allergic reaction to MySQL (Score:5, Informative)
Teradata and the other big relational DB products (Vertica, Greenplum, etc.) are all _analytical_ databases, designed for small numbers of complex queries, where adding new data to the system takes minutes if not hours. They are completely unsuitable for running a live application against.
Re:Which DB is better? (Score:3, Informative)
Use a good RDBMS engine, and as much as people pooh-pooh MS SQL Server, it's a good engine. I have used it for databases in the 150TB range. If you do your schema right, build your indexes correctly, and plan your partitions and file groups well, you can get great performance out of affordable hardware. Now, you do need to maintain this thing, or develop the automation around building those partitions and moving data into and out of them based on tombstones or some other criteria, or you get underwater real fast.
I don't care what technology you pick; if you are going to deal with that much data you need to:
1. Understand the problem well.
2. Spend the time with whatever tools you select to really understand how they work, and build whatever you need to fill in where they are deficient.
When you start doing anything that big, it's not plug and play anymore no matter how you go about it.
Re:Which DB is better? (Score:3, Informative)
While insightful and informative in its own right, that isn't a logical response to my post.
He was asking for an alternative to MySQL. I was pointing out that moving from MySQL to PostgreSQL was not done by large companies with a lot of smart people working for them, because any performance improvements were not worth it. PostgreSQL's vertical and horizontal scalability did not represent an improvement over MySQL. I didn't even mention vertical vs horizontal scalability. In the end you end up with a raw number saying we can handle X many requests in our total system, regardless of the individual performance numbers of any part of the system.
You're right, he probably isn't the lead engineer of Flickr and probably doesn't need Cassandra's power, but I think it really says something that while a lot of these companies are switching away from MySQL, they aren't switching towards PostgreSQL. But as always, anyone considering any kind of switch must do their due diligence in assessing the potential performance improvements of any new solution.