
Digg Says Yes To NoSQL Cassandra DB, Bye To MySQL

Posted by timothy
from the can-you-trust-a-db-by-that-name? dept.
donadony writes "After Twitter, it's now Digg that has decided to replace MySQL and most of its infrastructure components, moving away from LAMP to a NoSQL architecture based on Cassandra, an open source project that develops a highly scalable second-generation distributed database. Cassandra was open sourced by Facebook in 2008 and is licensed under the Apache License. The reason for this move, as explained by Digg, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced them into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."
This discussion has been archived. No new comments can be posted.

  • Nothing new ... (Score:2, Interesting)

    by tomhudson (43916) on Friday March 12, 2010 @09:51PM (#31460728)
    Cassandra is basically a sloppy implementation of UniVerse and related products. Why sloppy? Because the idea of a separate file access for each column sucks - use a union or struct as necessary, people!
  • Re:Wow... (Score:4, Interesting)

    by Anrego (830717) * on Friday March 12, 2010 @10:41PM (#31461154)

    Don't be too quick to put Java down. It's slower, but it scales fairly well.

  • by Anonymous Coward on Friday March 12, 2010 @10:48PM (#31461198)

    On a related note, Reddit's performance and reliability have dropped off significantly since switching to Amazon's "Cloud", and dropped off even further after this switch to Cassandra.

    The constant 503 errors, plus horrendous load times when it does manage to work, have driven me and many others away from Reddit. That's why I'm posting here on Slashdot.

    Cloud hosting is a stupid idea for anything beyond a blog getting 10 hits per day. All the talk about scalability is pure bunk. I mean, even with the extensive knowledge and infrastructure of Amazon, the Reddit site is slow (and it wasn't like that before they switched).

  • by QuoteMstr (55051) on Friday March 12, 2010 @11:31PM (#31461460)

    Let me ask the question a different way then: which particular tasks related to founding a company would you personally perform in exchange for $2 billion, but not in exchange for $1 million? Would you work longer hours? Talk to your family less?

    I cannot conceive of incentive to work increasing appreciably after about $1 million. We can talk about the exact figure, but clearly $2 billion is ludicrous for a private individual.

    Excessive compensation is rent seeking and harms society in numerous ways: it distorts the political process through over-concentration of resources; it leads to production of luxury goods that have less utilitarian benefit than mass-market ones; and worst of all, excessive compensation leads to financial bubbles because it causes too many dollars to chase too few investment opportunities.

  • Re:Nothing new ... (Score:3, Interesting)

    by hibiki_r (649814) on Friday March 12, 2010 @11:55PM (#31461574)

    Come on, it cannot be any sloppier than actual UniVerse: it performs extremely poorly on large files, especially when record sizes vary wildly. I've seen in-memory files in which any insert or update operation took 5+ seconds! In my experience, even Postgres on far weaker hardware just spanks UniVerse, even on the simple queries where it should have an advantage. If you ever need to read two or three files, either by hand or through I dictionary entries, UniVerse is orders of magnitude slower. When you add the low quality of the system monitoring and debugging tools that are available for it, it turns into one big stinker.

    If Cassandra is any slower, it'd have to lock the system up while idle.

  • by uncqual (836337) on Saturday March 13, 2010 @12:25AM (#31461754)
    One aspect of the "cloud" (as in EC2) is that you can not only scale up easily (for $ of course), you can scale down easily (to save $).

    When you have fixed "in house" infrastructure to handle peak loads, there's not a lot of motivation to power off absolutely as many servers as you possibly can when you're not at peak load - all you save is the energy costs (and, if you're using remote hosting, you don't get rewarded for this except for whatever value you attach to feeling "green"). You still pay for the floor space, the machines, and perhaps some sort of maintenance contracts regardless of whether the server is powered up or down.

    Using EC2 (depending on how you've structured it - some dedicated, some non-dedicated instances etc), if utilization drops to 80% over 20 instances, the temptation is to release a couple instances to save a couple bucks and drive utilization up to 90% on the remaining instances -- with potentially unfortunate consequences.

    Although I have no idea, I wonder if Reddit is just releasing instances too aggressively now "because they can" in order to save money? If so, the finger should be pointed at Reddit, not the cloud (or EC2 specifically).
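
The arithmetic behind the EC2 example above can be checked with a short sketch. The function name is hypothetical, and it assumes the total load stays constant and spreads evenly across the remaining instances, which real traffic may not do.

```python
def utilization_after_release(instances: int, utilization: float, released: int) -> float:
    """Average utilization of the remaining instances.

    Assumption (not from the thread): total load is constant and is
    spread evenly over whatever instances are left.
    """
    remaining = instances - released
    if remaining <= 0:
        raise ValueError("at least one instance must remain")
    total_load = instances * utilization  # load in "fully busy instance" units
    return total_load / remaining


# Starting point from the comment: 20 instances at 80% utilization.
for released in (2, 3, 4):
    new_util = utilization_after_release(20, 0.80, released)
    print(f"release {released}: {new_util:.1%}")
```

Under these assumptions, releasing two instances lands near the 90% figure in the comment, and releasing four leaves no headroom at all, which is one way the "unfortunate consequences" could arise.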
  • Re:Good for them (Score:2, Interesting)

    by Anonymous Coward on Saturday March 13, 2010 @12:40AM (#31461830)

    In my experience developing applications in both the business and gaming industries, most applications beyond a simple cookbook app/crappy blog are highly object oriented. How else can you explain the wealth of approaches like ORM mappers, the repository and active record patterns, etc ? They are just patches on the relational model to make them friendly to application code. If your domain objects are consistently flat, you are probably doing something wrong. I for one do not want to use an API with Address1 - Address5 string properties. What you just listed as story, post, etc are all just objects, usually with nesting. Relational databases suck at dealing with complex object hierarchies, hence all the joins just because object A has a collection of object Bs which contain an object C.

    Can you please define what a sophisticated fashion means? Unless you are a DBA and love SQL/config work, it is far easier to write constraints using an object database. You simply use the same validation and rules you should already be using in your application. If you rely on your database alone to enforce things like required fields, atomicity, etc, then you have failed at creating a good application and likely are ripe for exploits, security holes, bad data, etc anyway. It is true that relational DBs provide certain easy facilities, but any decent Object Database provides most if not more of these same constructs in another form through its API. For instance, most object databases I have used provide some sort of transactional data structure that supports far more types of locking and concurrency/conflict management than any relational DB I have ever seen. Further, since most object databases are defined and consumed in the languages you develop against with them, the sophistication is limited to the language. I'd say you can do a lot more in Smalltalk than SQL for instance.

    If you're referring to querying, apparently you've never queried in Smalltalk, C# with LINQ, LISP, or even just using lambdas in Python or Ruby. Querying using the actual object is typically far easier than writing a SQL query. These days it is becoming increasingly rare that someone rolls all their own queries in your average app anyway (see ORMs). You'll often end up with something like an ORM translating some things from the UI into a boatload of queries, then you'll have to go and find fixes for the ORM to avoid making the application grind to a halt due to all the chatter. Although a lot of that is often the function of UI elements, ultimately there is a lot of overhead created by patching the relational and object disconnect.

    I am wondering how you think going from relational back to objects, even flat ones, is somehow easier and more consistent. You're adding an extra language, more layers, and more configuration/management for what gain? Object databases hold records for things like throughput for transactions, data population, etc. The performance thing is a myth of the past. I'd say the stumbling block if anything is simply bad developers. An RDBMS does add somewhat of an idiot-proof layer, but really in the end you just end up with even crappier code in other spots.

    Finally, you mention that discrete queries are easier to optimize. I again must disagree. If you want discrete queries, you could describe each query on an object with another object. This is exactly what any good developer should be doing with an ORM anyway. For instance, you could use the specification pattern with the repository pattern to describe and issue your queries, object DB or RDBMS. Secondly, instead of some crappy tools from the maker of the RDBMS, using an object DB I now have the full facilities of the language to do performance optimization, profiling, logging, etc rather than what a vendor provides. MSSQL provides some great tools for example, but most other DBs, while nice implementation-wise, provide horrific toolchains.

    It is true there are some problems an RDBMS is good for, but your post comes off like someone who has never really use
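
The specification and repository patterns named in the comment above can be sketched roughly as follows. Every name here (`Story`, `Spec`, `InMemoryStoryRepository`, `topic_is`, `min_diggs`) is hypothetical, and the repository is a plain in-memory list standing in for an object DB or an ORM-backed store; this is an illustration of the idea, not the poster's code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Story:
    title: str
    diggs: int
    topic: str


class Spec:
    """A query described as an object; specs compose with & and |."""

    def __init__(self, predicate: Callable[[Story], bool]):
        self._predicate = predicate

    def is_satisfied_by(self, item: Story) -> bool:
        return self._predicate(item)

    def __and__(self, other: "Spec") -> "Spec":
        return Spec(lambda s: self.is_satisfied_by(s) and other.is_satisfied_by(s))

    def __or__(self, other: "Spec") -> "Spec":
        return Spec(lambda s: self.is_satisfied_by(s) or other.is_satisfied_by(s))


class InMemoryStoryRepository:
    """Stand-in for a real store; callers only ever pass it a Spec."""

    def __init__(self, stories: Iterable[Story]):
        self._stories = list(stories)

    def find(self, spec: Spec) -> List[Story]:
        return [s for s in self._stories if spec.is_satisfied_by(s)]


def topic_is(topic: str) -> Spec:
    return Spec(lambda s: s.topic == topic)


def min_diggs(n: int) -> Spec:
    return Spec(lambda s: s.diggs >= n)


repo = InMemoryStoryRepository([
    Story("Cassandra at Digg", 271, "databases"),
    Story("MySQL tuning tips", 12, "databases"),
    Story("New phone released", 500, "gadgets"),
])
popular_db_stories = repo.find(topic_is("databases") & min_diggs(100))
print([s.title for s in popular_db_stories])
```

Because each query is a named, composable object, it stays discrete and can be logged or profiled with ordinary language tools. A backend that translates specs into SQL would need a richer representation than an opaque predicate.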

  • Re:Good for them (Score:4, Interesting)

    by QuoteMstr (55051) on Saturday March 13, 2010 @12:53AM (#31461880)

    Thanks for the comprehensive reply.

    How else can you explain the wealth of approaches like ORM mappers, the repository and active record patterns, etc ? They are just patches on the relational model to make them friendly to application code.

    ORMs are syntactic sugar for the underlying database operations. It's possible to bypass them when you need SQL's full power and access the same data store.

    I for one do not want to use an API with Address1 - Address5 string properties.

    So create a table of addresses and use foreign keys to connect them to whatever other table you'd like. Since when does a relational structure require a garbage schema like your example? But surely you know all that.

    Further, since most object databases are defined and consumed in the languages you develop against with them, the sophistication is limited to the language

    But doesn't that then preclude accessing the same data set from programs written in other languages? The beauty of SQL is that it's language-agnostic.

    You also make several points relating to toolchains and testing: sure, some databases have better tools than others. But we're talking about differences between models, not differences between particular tools.
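
The "table of addresses plus foreign keys" suggestion in the reply above might look like the following minimal sketch, using Python's built-in `sqlite3` module. The table and column names (`person`, `address`, `person_id`, `line`) are invented for illustration and do not come from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One row per address instead of Address1..Address5 columns.
    CREATE TABLE address (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(id),
        line      TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO person (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO address (person_id, line) VALUES (?, ?)",
    [(1, "1 Main St"), (1, "2 Side Ave")],
)

# Any number of addresses per person comes back through a single join.
rows = conn.execute(
    "SELECT p.name, a.line "
    "FROM person p JOIN address a ON a.person_id = p.id "
    "ORDER BY a.id"
).fetchall()
print(rows)
```

The same schema is reachable from any language with a SQL driver, which is the language-agnostic point made in the reply.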

  • by TubeSteak (669689) on Saturday March 13, 2010 @12:58AM (#31461908)

    I haven't seen any consideration from potential "NoSQL" adopters of the benefits of using a good relational database like PostgreSQL.
    If you need sloppier semantics for some cases (for example, "eventual consistency"), you can layer that on top of a robust RDBMS.

    When you're dealing with TB/PB of data that doesn't require relational capabilities, there's no reason to use a "good relational database like PostgreSQL" when you can dispense altogether with the relational aspect and its performance hit.

    NoSQL may seem like the fad du jour, but until recently, nobody was working with such enormous dynamic datasets. When you look at the growth of all these hi-tech companies, they did an incredible amount of in-house hacking to develop the software necessary to glue together their enormous hardware infrastructure.

  • by DogDude (805747) on Saturday March 13, 2010 @02:17AM (#31462258)
    I imagine with the continual growth of these social networks, high performance DB methodologies will experience tremendous growth, and perhaps even paradigm shifts in the way we logically think and design database architectures.

    Your statement that social networks push databases to their theoretical limits is laughable. Larger, more frequently accessed, more complicated databases have existed for years (decades?) before the current crop of Friendster clones existed. Just because Facebook is the largest, most "high performance" database application that you can think of doesn't make it remotely true.

    The problem of dealing with very large, frequently changing databases has already been addressed and solved. The problem is that most PHP-monkeys have -zero- database knowledge, and instead of doing the work to figure out the right way to do things, they feel like they need to re-invent the wheel. A better solution is to pick up a book written by somebody who's been working with RDBMSes for a few decades. It's not a quick fix, but this problem has already been solved many, many times over.
  • by Anpheus (908711) on Saturday March 13, 2010 @05:27AM (#31462860)

    Now, I'm not an expert on database use and don't want to come across as sarcastic, but it's my impression that a lot of the questions that are being asked of these new types of databases simply don't have past analogues, or if they did, they were solved with this sort of approach in an RDBMS, basically using an RDBMS but without the relational part. Hadoop, Google, and all these social networking sites surely aren't all just... confused? Are they?

    Please elaborate on how an RDBMS is applicable to what I guess is now called "scaling horizontally", or perhaps more formally known as sharding, or partitioning with redundancy. It's my impression that most of the RDBMS products available today are simply atrocious at this, but if you can point out which books I need to look at, and which products have good support for this sort of scale, I'd love to learn.
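
For readers unfamiliar with the term, "partitioning with redundancy" can be illustrated with a toy consistent-hash ring that maps each key to several nodes. This is a simplified sketch in the spirit of Dynamo-style stores, not Cassandra's actual implementation and not a feature of any particular RDBMS; the class name `Ring`, the node names, and the key format are all hypothetical, and node names are assumed to be unique.

```python
import bisect
import hashlib
from typing import List


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each key is stored on `replicas` consecutive nodes."""

    def __init__(self, nodes: List[str], replicas: int = 2):
        if not 1 <= replicas <= len(nodes):
            raise ValueError("replicas must be between 1 and the node count")
        self._replicas = replicas
        # One ring position per node, sorted so lookups can use bisection.
        self._points = sorted((_hash(node), node) for node in nodes)
        self._hashes = [h for h, _ in self._points]

    def nodes_for(self, key: str) -> List[str]:
        """Return the nodes holding `key`: the first node clockwise, then its successors."""
        start = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        return [
            self._points[(start + i) % len(self._points)][1]
            for i in range(self._replicas)
        ]


ring = Ring(["db1", "db2", "db3", "db4"], replicas=2)
owners = ring.nodes_for("story:42")
print(owners)
```

Each key lives on two distinct nodes, so losing one node does not lose the data, and adding a node only moves the keys adjacent to it on the ring. What this sketch leaves out (joins and transactions across shards, rebalancing, failure detection) is the hard part the question is really about.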

