Forgot your password?
typodupeerror
Databases Social Networks

How Twitter Is Moving To the Cassandra Database 157

Posted by kdawson
from the big-table-doesn't-capture-the-half-of-it dept.
MyNoSQL has up an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston March 11th, has announced their lineup of speakers and panelists including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
This discussion has been archived. No new comments can be posted.

How Twitter Is Moving To the Cassandra Database

Comments Filter:
  • network issues? (Score:5, Insightful)

    by QuietLagoon (813062) on Tuesday February 23, 2010 @03:26PM (#31248592)
    We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast it would saturate the backplane of our network.

    .

    First time I have ever heard anyone say that a database was too fast. Maybe there are network problems that also need to be addressed.

  • Re:hmmm (Score:1, Insightful)

    by Anonymous Coward on Tuesday February 23, 2010 @03:29PM (#31248666)

    Maybe in 5 years slashdot will get with it.

    Do you realize how many years it took Slashdot to just remove their HTML table layout from Slashcode? I wouldn't bet on a major backend change for Slashdot, ever.

  • by Gruuk (18480) on Tuesday February 23, 2010 @03:44PM (#31248994)

    Scaling. If something turns out to be robust and fast enough for Twitter, it is definitely of interest to anyone working on significantly large and busy websites.

  • by roman_mir (125474) on Tuesday February 23, 2010 @03:45PM (#31249030) Homepage Journal

    who cares what twuufter is running off.

    The more interesting aspect of all of this 'NoSQL' movement is how they believe that if they achieve some speed improvement against some relational databases, how that makes them so much better.

    If you don't really need a database to run your 'website', then who cares if you use flat files or an in memory hashmap for all your data needs? Databases are not being replaced by NoSQL in projects that need databases. The projects that may not have ever needed databases may benefit by this NoSQL idea, but if you actually need a database... well, you better be really good at working around all kinds of problems that this will create for you.

    I think that relational databases are good at what they do and that many projects may not need them, but if you do need them on the back end, you will end up with them on the back end. Of-course there maybe some caching/hashmaps/files on the front end but at the back stuff will be sorted out within a real datastore.

    Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.

  • I'm Reluctant (Score:1, Insightful)

    by Anonymous Coward on Tuesday February 23, 2010 @03:51PM (#31249160)

    I'm reluctant to believe that Twitter is a good technology bellwether. Twitter seems to have so many technology issues, fail whales, outages, security breeches...

    SO, I'm left wondering; what does this move say? Does it say that Cassandra is so bad that Twitter is using it? Or does it say that a fail whale population boom is imminent?

  • by Lunix Nutcase (1092239) on Tuesday February 23, 2010 @03:56PM (#31249246)

    Yes, because twitter is the epitome of robustness and speed. Oh wait... Just in the 2 months of this year alone they've had something like 4 outages.

  • by AndrewNeo (979708) on Tuesday February 23, 2010 @03:56PM (#31249250) Homepage

    I think their point is not everything needs an RDBMS, whereas before it was the 'go to' method of storing data.

  • by Anonymous Coward on Tuesday February 23, 2010 @04:04PM (#31249390)

    Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.

    Surely that's the point. It isn't possible to practically scale RDBMSs up to the sort of scale you need for a huge website such as Amazon. The requirement to continue to meet all of the constraints of the relational model makes it very hard to split databases over a large cluster without a lock-bound hell. There are two solutions to this - either you spend a vast amount of effort trying to get the relational model to scale a bit, or you bite the bullet and relax the relational model's constraints.

    Don't get me wrong - there are good reasons why the relational model has constraints in the data model to ensure ACID qualities. However beyond a certain point it is easier to deal with the problems that come from using a different model than it is to stretch a conventional RDBMs and deal with the problems of keeping multiple distributed copies of data consistent.

    Take the collection of user reviews and product pictures on a large site like Amazon. Does this need the analytical power of a RDBMS? No. Does it need something a lot more advanced then "flat files or an in memory hash-map" in order to scale to heavy loads across multiple continents? Yes. That's the sort of thing NoSQL databases are working on.

    In general your attitude reminds me of the people who thought personal computers would always be toys. "Proper work" would be done on mainframes/supercomputers and trivial office tasks may as well be done on paper. Well, mainframes / supercomputers are still faster than personal computers, but few people would claim the PC had no impact on the office.

  • Re:I'm Reluctant (Score:3, Insightful)

    by binarylarry (1338699) on Tuesday February 23, 2010 @04:11PM (#31249496)

    Twitter's only moving to this new database written in Java because everyone else is.

  • by Monkeedude1212 (1560403) on Tuesday February 23, 2010 @04:33PM (#31249816) Journal

    I suppose then why would we care if any site made any random change to any part of its infrastructure?

    Twitter is a -very- busy site.

    They are changing their infrastructure to accomodate. Here's what they looked at, here is what they chose. If you are looking for something with equal performance, you don't have to shop around.

  • Re:network issues? (Score:3, Insightful)

    by b0bby (201198) on Tuesday February 23, 2010 @04:50PM (#31250096) Homepage

    I know next to nothing about NoSQL, but what they're talking about there seems to be using BinaryMemtable for the one-time move of data. You can see that you wouldn't want to "saturate the backplane of our network" for several days while that completes, so they're using a slower method & throttling it. It will take a week to do the move, but everything else will keep working.

  • by Abcd1234 (188840) on Tuesday February 23, 2010 @04:51PM (#31250098) Homepage

    Or: use the right tool for the job. The only difference is, now alternative tools actually exist.

  • by roman_mir (125474) on Tuesday February 23, 2010 @04:52PM (#31250120) Homepage Journal

    your question is answered in my post: google does not need a database for ACID properties.

    Can you complain much if in one location google gives you results that are very different for the same search query as for the same query in a different location at the same time? Well, if you do complain, you can ask google for your money back.

  • by kriston (7886) on Tuesday February 23, 2010 @04:53PM (#31250134) Homepage Journal

    No way. Their architecture is about as "best guess" engineering as Facebook. I don't think that's actually what engineering is. "Maybe this one will work?"

    In the meantime, I have not been able to update my avatar image on Twitter, and TwitPic-like feature is still a faint glimmer in Twitter's amateur eyes. Speaking of missed opportunities, why drive so much traffic to Twitter parasites Bit.ly, TwitPic, TinyURL, Twitition, TwitLonger?

    What in the world are Twitter's engineers actually DOING should be the real question.

  • by guru42101 (851700) on Tuesday February 23, 2010 @05:37PM (#31250848)

    I've never dealt with an EMP but a more realistic threat with similar effects would be planning for a hurricane or earthquake. I used to work at a international bank and we had to deal with both (offices in FL and CA). For the most part the best solution was to have an identical setup at another office and having all applications available via VPN and/or web access. We had a separate pipe that was used only for backup data transfers. The DB transaction logs were written both locally and remotely. All user files saved to the server were immediately copied to the backup server. On several occasions the systems were tested due to black/brownouts. The users were sent home where they could work just as effectively as the office.

    Our general emergency plan for hurricanes (I worked at the FL office we used the CA office as our backup). Was to let the users go well in advance of the hurricane and switch CA to being our primary servers with FL as the backup. Once the users were settled then they could continue working from home. The only way we would be screwed is if a hurricane and earthquake happened simultaneously. At that point we'd have to restore VM backups on hardware located at the main corporate offices in NYC or Sydney.

  • by roman_mir (125474) on Tuesday February 23, 2010 @05:43PM (#31250970) Homepage Journal

    You know, the truth is, most data is still stored in individual files, not in databases. So RDBMSs were always a very niche thing used for projects because they are understood and it's easier to develop for them if you really have massive data requirements.

    Files - that's what many projects even today use, not databases. This is basically what they are going back to - files with whatever window dressing on top - a facade of hashes, it's all key/value pairs. It is, my friends, the old old idea of property files.

    I mean, really, I wrote a system in August that uses property files for storage as a database. Property file as a database - because it works. But that's a storage method. So in the NoSQL space they also do clustering by replication across nodes, but it does not really matter much if the data is all the same on all nodes.

    But you can do the same with an RDBMS, really, you can skip the principles of ACID and replicate across nodes and hope that it's good enough. Maybe the implementation for things like 'Cassandra' allows faster replication than what is normally done in an RDBMS, but just you wait and see how the RDBMSs of tomorrow provide a few flags to do the same thing in some 'partial ACID mode' with quick replication.

    This is intended for applications that do not really care about consistency of data - Google does not care. Twewter does not care. Amazon has to jump through more hoops I am sure than Tweeter, because real money is involved.

"The only way for a reporter to look at a politician is down." -- H.L. Mencken

Working...