Please create an account to participate in the Slashdot moderation system


Forgot your password?
Databases Social Networks

How Twitter Is Moving To the Cassandra Database 157

MyNoSQL has up an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston March 11th, has announced their lineup of speakers and panelists including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
This discussion has been archived. No new comments can be posted.

How Twitter Is Moving To the Cassandra Database

Comments Filter:
  • by Lunix Nutcase ( 1092239 ) on Tuesday February 23, 2010 @03:36PM (#31248842)

    Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?

  • by azmodean+1 ( 1328653 ) on Tuesday February 23, 2010 @04:08PM (#31249450)

    I think you're missing the point here, the problem with RDBMSs isn't that they are "slow" per-se, which implies that they just need some good ol' fashioned optimization. The problem is that there is a cost associated with the data integrity guarantees they make (usually appears in scalability bottlenecks rather than in pure computational inefficiencies), regardless of how good the implementation is, and if you don't need some of those guarantees, you can dispense with them and end up with better performance (again, this typically means better scalability). Additionally, this is the kind of bottleneck that you just can't throw more resources at. Sure you can find the bottleneck and beef up that particular component to do more transactions/second, but at a certain point you've isolated the bottleneck on a world-class server that is doing nothing but that, and it's still a bottleneck. At that point (preferably long before you reach that point) you have to look at transitioning to an infrastructure that makes some kind of tradeoff that allows the removal of the bottleneck, which is what NoSQL does.

    I doubt Twitter wants very many RDBMS-type data coherency guarantees at all. 160-character text strings with a similarly-sized amount of metadata, and no real-time delivery guarantees? Sounds like their database can get pretty inconsistent without messing things up badly. It seems to me they would be well served by using a database that offers just what they want/need in that area and better performance.

    Oh and this:

    Is there really a huge issue with rdbms speeds?

    yes, and what are you smoking that you would even ask this question?

  • by Heretic2 ( 117767 ) on Tuesday February 23, 2010 @04:51PM (#31250104)

    I love how ass backwards twitter has always been with learning how to scale their 90s infrastructure up. I remember when they called out the Ruby community because they didn't understand MySQL replication and memcached.

    I guess without a profit model they couldn't use a real RDBMS like Oracle. EFD (Enterprise Flash Drive) support anyone? 11g supports EFD on native SSD block-levels. Write scale? How about 1+ million transactions/sec on a single node Oracle DB using <$100K worth of equipment and licenses? Anyway, I've built HUGE databases for a long time, odds are most of you have interfaced with them. Just because it's free and open-source doesn't make it cheap.

    I love FOSS don't get me wrong, but best-in-class is best-in-class. I only use FOSS when it happens to be best-in-class. I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter anyone?

  • by u38cg ( 607297 ) <> on Tuesday February 23, 2010 @05:42PM (#31250930) Homepage
    Does Twitter really have loads which are more difficult to manage than, say, the BBC, CNN, Google, or Wikipedia? I would have thought serving up a fairly straightforward page, a stylesheet, a background image and the tweets or twits or whatever they're called can't be that difficult compared to, say, Facebook.

This universe shipped by weight, not by volume. Some expansion of the contents may have occurred during shipment.