Databases

Cassandra NoSQL Database 1.2 Released

Posted by Soulskill
from the onward-and-upward dept.
Billly Gates writes "The Apache Foundation released version 1.2 of Cassandra today which is becoming quite popular for those wanting more performance than a traditional RDBMS. You can grab a copy from this list of mirrors. This release includes virtual nodes for backup and recovery. Another added feature is 'atomic batches,' where patches can be reapplied if one of them fails. They've also added support for integrating into Hadoop. Although Cassandra does not directly support MapReduce, it can more easily integrate with other NoSQL databases that use it with this release."

  • Hmm. (Score:4, Interesting)

    by Anonymous Coward on Wednesday January 02, 2013 @05:16PM (#42454225)

    Maybe someone can explain this to me. I've been keeping an eye out for situations where it would make more sense to use a NoSQL solution like Mongo, Couch, etc. for a year or so now, and I just haven't found one.

    Under what circumstances do people use a data store that doesn't need data relationships?

    • Re: (Score:3, Insightful)

      by Anonymous Coward

      When the project is run by an idiot who thinks he needs to incorporate buzzwords over substance into their work.

    • by vlm (69642)

      Under what circumstances do people use a data store that doesn't need data relationships?

      A crude 1980s filesystem, on a system where they don't officially allow direct file storage but do provide a database capable of holding arbitrary binary data.

    • Re:Hmm. (Score:5, Insightful)

      by Sarten-X (1102295) on Wednesday January 02, 2013 @05:27PM (#42454345) Homepage

      Assuming you're not trolling...

      When one wants to write a ton of data as fast as possible, where the data may not actually be complete or consistent (but still useful). Something on the order of a million rows a minute is a prime candidate for a NoSQL store. Consider, for example, the sum of all posts on Facebook at any given time.

      From the other side, an application like the current trend of "Big Data" models, monitoring every aspect of every action on a website (or in a hospital, or through a retail distribution chain, or the environmental systems of a factory) to glean statistically-meaningful information also makes a good use case for NoSQL. At the expense of consistency, the store is designed to be fast and fault-tolerant, so it really doesn't matter whether the data's complete or not. For Big Data applications, which are interested only in statistics, having a few inconsistent records out of billions doesn't matter much to the end result.

      Sure, traditional RDBMSs can be tweaked and optimized to make any particular query run as fast as any NoSQL engine... but that's an expensive and time-consuming process that's often not feasible.

      • As for your first case, it's less a factor of speed than it is the content of what you are writing. If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (Blogs, posts, etc) then yes. If you need to be able to query the data at a later point and be able to run statistics on it regularly, then no, especially if accuracy in the statistic is important.

        And on the other side, NoSQL typically fails much more than it succeeds because NoSQL defers most of its lo

        • Re:Hmm. (Score:4, Informative)

          by samkass (174571) on Wednesday January 02, 2013 @05:53PM (#42454701) Homepage Journal

          If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (Blogs, posts, etc) then yes.

          I'm going to pretend you weren't trolling to address a good point here. NoSQL is very valuable for human-to-human data. I've seen it be hugely successful in cases when you only need a "human" level of precision about ordering, consistency, and detail. It eliminates single points of failure, global locks, offline operation problems, write contention, etc. It introduces problems for indexing and absolute consistency. But without widespread indexing you tend to get brute-force (Map-Reduce) or narrow-focus (offline indexes on specific field) searches. And that's okay for most humans.

        • Re:Hmm. (Score:5, Informative)

          by Sarten-X (1102295) on Wednesday January 02, 2013 @06:08PM (#42454897) Homepage

          That's almost exactly wrong.

          "Free-form crap" like blogs doesn't really care what database it's in. Use a blob in MySQL, and it won't matter. You'll be pulling the whole field as a unit and won't do analysis anyway.

          The analysis of atomic data is exactly what NoSQL stores are designed for. MapReduce programs are built to evaluate every record in the table, filter out what's interesting, then run computation on that. The computation is done in stages that can be combined later in a multistage process. Rather than joining tables to build a huge set of possibilities, then trimming that table down to a result set, the query operates directly on a smaller data set, leaving correlation for a later stage. The result is a fast and accurate statistic, though there is a loss of precision due to any inconsistent records. Hence, bigger databases are preferred to minimize the error.
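A minimal sketch of the map-then-reduce flow described above, run in a single process rather than on a cluster. The record layout and the names `map_record`, `reduce_values`, and `map_reduce` are hypothetical, chosen only for illustration; a real MapReduce job would distribute these stages across machines.

```python
from collections import defaultdict

# Hypothetical sample records; some are incomplete, as the comment describes.
records = [
    {"user": "a", "country": "US", "login_ms": 120},
    {"user": "b", "country": "US", "login_ms": 80},
    {"user": "c", "country": "DE"},                # missing login_ms: skipped
    {"user": "d", "country": "DE", "login_ms": 200},
    {"country": "US", "login_ms": 100},            # missing user: still usable
]

def map_record(record):
    """Map stage: emit (key, value) pairs, filtering out unusable records."""
    if "country" in record and "login_ms" in record:
        yield record["country"], record["login_ms"]

def reduce_values(key, values):
    """Reduce stage: collapse all values for one key into a summary."""
    return {"count": len(values), "mean_ms": sum(values) / len(values)}

def map_reduce(records):
    """Run the map stage over every record, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return {key: reduce_values(key, values) for key, values in grouped.items()}

result = map_reduce(records)
```

The incomplete German record is simply dropped in the map stage, so the statistic is still computed, only from fewer rows. That is the loss of precision the comment refers to.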

          I like the analogy of NoSQL being a cabinet full of files, though I'd alter it a little. Rather than having no idea what's in the files, we do know what they're supposed to contain, but they're old and may not be perfectly complete as expected. To find some information about the contents, we have to dive in, flip through all the files, and make an effort. Yes, some files will be useless for our search, and some will be missing important data - but we can still get information of statistical significance. Note that over time, the forms might even change, adding new fields or changing options. We might have to ask a supervisor how to handle such odd cases, which is analogous to pushing some decisions back to the application.

    • Re:Hmm. (Score:5, Informative)

      by Corporate T00l (244210) on Wednesday January 02, 2013 @05:34PM (#42454471) Journal

      You'll see these kinds of large-scale columnar stores like Cassandra or HBase being used a lot in metrics and log management projects.

      For instance, if you want to generate a histogram of login processing time over the last 90 days, you'll need to record the times of all of your individual logins to do that. If you have millions of logins per hour, that single metric alone is going to generate a lot of rows. If you're also measuring many other points throughout your system, the data starts getting unmanageable with B-tree backed databases and not of high enough value to store in RAM.

      In the past, you might deal with this by adding more sophisticated logic at the time of collection. Maybe I'll do random sampling and only pick 1 out of every 1000 transactions to store. But then, I might have a class of users I care about (e.g. users logging in from Syria compared to all users logging in around the world) where the sample frequency causes them to drop to zero. So then I have to do more complicated logic that will pick out 1 out of every 1000 transactions but with separate buckets for each country. But then every time your bucketing changes, you have to change the logic at all of the collection points. I can't always predict in advance what buckets I might need in the future.
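To make the per-country bucketing concrete, here is a rough sketch. It swaps the random pick for a deterministic counter (keep the 1st, 1001st, 2001st... event in each bucket) so the result is reproducible; the class name `BucketedSampler` and the two-letter country codes are assumptions made for the example.

```python
from collections import defaultdict

class BucketedSampler:
    """Keep 1 out of every `rate` events, counted separately per bucket."""

    def __init__(self, rate=1000):
        self.rate = rate
        self.seen = defaultdict(int)  # events seen so far, per bucket

    def should_keep(self, bucket):
        self.seen[bucket] += 1
        # Keep the first event of each run of `rate` events in this bucket.
        return (self.seen[bucket] - 1) % self.rate == 0

sampler = BucketedSampler(rate=1000)

# 2500 logins from a busy country, 3 from a rare one.
events = ["US"] * 2500 + ["SY"] * 3
kept = [country for country in events if sampler.should_keep(country)]
```

With a single global counter the three rare logins would most likely be sampled away; with one counter per bucket at least one survives. The drawback the comment points out still applies: the bucket key is baked into the collection code.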

      With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but log-oriented DBs like Cassandra are specifically architected to use). The data collected can have indexes (really more like inverted indexes, but that is a longer discussion) up front for quick query of data facets that you know you want in advance, but still retains the property of super-fast-insert-on-cheap-hardware so that you can store all of the raw data and come back for it later when there is something you didn't think of in advance, and map-reduce for the answer.
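As an illustration of "store everything raw, decide on the buckets later", the sketch below appends every login to a plain list (a stand-in for an append-only, sequentially written store, not Cassandra's actual storage engine) and builds histograms after the fact. The function names and the 100 ms bucket width are assumptions for the example.

```python
from collections import Counter

raw_log = []  # stand-in for an append-only store; writes only ever go to the end

def record_login(country, duration_ms):
    """Collection point: no sampling or bucketing logic, just append."""
    raw_log.append({"country": country, "duration_ms": duration_ms})

def histogram(events, bucket_ms=100, predicate=lambda event: True):
    """Scan every event and count those matching `predicate` per duration bucket."""
    counts = Counter()
    for event in events:
        if predicate(event):
            bucket_start = (event["duration_ms"] // bucket_ms) * bucket_ms
            counts[bucket_start] += 1
    return dict(sorted(counts.items()))

for country, ms in [("US", 40), ("US", 130), ("SY", 170), ("DE", 250), ("SY", 90)]:
    record_login(country, ms)

all_logins = histogram(raw_log)
# A question nobody planned for at collection time:
syria_only = histogram(raw_log, predicate=lambda event: event["country"] == "SY")
```

The collection code never changes when a new question comes up; only the scan does.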

      • With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but log-oriented DBs like Cassandra are specifically architected to use).

        I'll have to do some performance testing later, but you do realize that almost all relational databases support a concept known as clustered indexes, which takes advantage of sequential access, correct? Sounds like you don't understand how current relational databases work.

    • by jbolden (176878)

      It always pays to use relational over NoSQL when you can. But just like in data warehousing, where it makes sense to denormalize for performance reasons, it can make sense to organize the data around specific computations which damage the ability to use SQL.

      You won't find any good reason with normal sized data sets and normal number of joins. Computations that require large tables that need to join multiple times in complex ways that can't be overcome with tricks like indexing.... then it can make sense to

    • It's for people who were letting their programming frameworks do whatever the fuck they want with their database structures and decided to take that one step farther.

      Admittedly, I kind of like it for low(er) value things where you're likely to have some variation in the structures being inserted, like logging and tracking the status of long-running tasks (upsert and appending to arrays FTW). That's about the only use I've found for the tech, though, and I admit that even in those cases its use is largely

    • by Anonymous Coward

      I use it when I need a database that supports relationships, tons of them, and doesn't falter at the same relationship type having completely different fields. It's the same -freaking- relationship, with supporting information from several different systems.

      I use Neo4j, which is only technically NoSQL, but it has a few query languages of its own. But I always chuckle at "relational" databases because they all seem to collapse under too many relationships "X" is_a this, is_a that, is_a this2, is_a... why do

    • It's not always about the data relationships. Cassandra for example is very easy to scale horizontally (much easier than traditional databases) and can achieve very high throughput. Last time I checked (a year ago) I could get over 50,000 stores/queries per second on a cluster of cheap commodity hardware (4 servers). That result was achieved with full redundancy (n=2). Such a setup is very resilient against failure (provided clients handle failure of individual nodes correctly). Maintaining such a cluster i

    • by Bengie (1121981)
      Sharing a resource, no matter how you spin it, will cause contention. The only way to scale a resource that is both read and write heavy is to scale horizontally. This is where NOSQL takes the crown. This is just a prime example, but not the only one.
    • by snemarch (1086057)

      Under what circumstances do people use a data store that doesn't need data relationships?

      Think (huge!) web content management systems with tree-structure, component-based pages where data varies widely from each page-type, and business requirements are constantly in flux.

      While there's definitely data relationships, they're not necessarily very comfortable to map in a traditional RDBMS.

  • I'm not sure if it's a typo or a misunderstanding, but the statement in the summary about atomic batching is hilariously incorrect.

    Atomic batching has nothing to do with "patches can be reapplied if one of them fails", but rather the more pedantic yet common case where you want a set of data updates to be batched atomically, where all or none of the changes occur, but nothing in between.
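The all-or-none property can be shown with a toy in-memory store. This is only an illustration of the semantics, not how Cassandra implements atomic batches; `apply_batch` and the account data are hypothetical.

```python
def apply_batch(store, updates):
    """Apply every update or none of them.

    `updates` is a list of (key, function) pairs; each function receives the
    current value (or None if the key is absent) and returns the new value.
    """
    staged = dict(store)  # work on a copy so a failure leaves `store` untouched
    for key, update in updates:
        staged[key] = update(staged.get(key))  # may raise
    # Only reached if every update succeeded: publish all changes together.
    store.clear()
    store.update(staged)

accounts = {"alice": 100, "bob": 50}

# Succeeds: both changes land.
apply_batch(accounts, [("alice", lambda v: v - 30), ("bob", lambda v: v + 30)])

# Fails on the second update ("carol" does not exist, so v is None),
# and the first update in the batch is not applied either.
try:
    apply_batch(accounts, [("alice", lambda v: v - 10), ("carol", lambda v: v + 10)])
except TypeError:
    batch_failed = True
else:
    batch_failed = False
```

After the failed batch, `accounts` still holds the values from the first, successful batch; nothing in between is ever visible.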

    • by vlm (69642)

      sounds like a transaction

    • by FooAtWFU (699187)

      But the atomic batches in v1.2 prevent such inconsistencies, by ensuring that groups of updates are treated as indivisible (atomic) units of work: either all the updates succeed or all of them fail. If they all fail, then the batch is reapplied, and there’s no need to determine which individual updates failed or succeeded.

      Looks like there's two parts here. One of them is communicating the changeset to (one or more) nodes, then the other part is actually applying it. If the coordinator failed halfwa

  • There must be something I don't understand. For me the whole point of databases is precisely that they come with SQL to easily do even complex stuff with them.

    How can the absence of the only useful feature be a "selling" point. No SQL? No thanks?...

    • by Anonymous Coward

      SQL is anything but easy from an app development viewpoint. You have to either mix it in your code, which is ugly in itself and creates tons of potential SQL injection bugs, or you use an ORM and then your database is probably unusable using conventional tools.

      NoSQL solves the problem, as native bindings to different languages are the standard interface in this world.
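For readers unfamiliar with the injection risk mentioned above, here is a small sketch using Python's standard `sqlite3` module. The table, rows, and input string are made up for the example. It contrasts building a query by string concatenation with passing the value as a bound parameter, which is the usual mitigation when SQL is mixed into application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)

# Attacker-controlled input that closes the quote and adds an always-true clause.
malicious = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text, so the WHERE clause is rewritten.
unsafe_sql = "SELECT name FROM users WHERE name = '" + malicious + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the input is bound as a value and is only ever compared as a string.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

conn.close()
```

The concatenated query returns every row in the table, while the parameterized one returns none, since no user has that literal name.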

    • "NoSQL" is a highly-misleading name; the SQL language is really beside the point.

      The important parts of NoSQL really boil down to:
      1. Very high performance.
      2. Ability to handle extremely large data (on the order of tens or hundreds of terabytes).
      3. Natural way of dealing with non-flat, non-BLOB data.
      4. Better integration with OO languages.

      #1 and #2 both come with trade-offs, which is perfectly fine. Not all problems need ACID compliance.

      #3 & #4 really go back to the 90s, though
    • by micheas (231635)

      One of the useful features of solr/lucene is the MLT key word (which stands for More Like This).

      Another useful feature of many NOSQL databases is faceted searches with good performance.

      It seems to be a very common practice to store the data in an SQL database and duplicate that database in a nosql database to use for searching, then if the nosql database gets corrupted you rebuild from the original data and your searches are incomplete while the rebuild goes on. (worst case I've had to deal with is a couple

    • by LingNoi (1066278)

      NoSQL does have some advantages. If you have 2+GB of data in a relational database table and you wish to update the table, doing so can take a long time, during which your services will be down. Since non-relational databases allow for schemaless data, you can simply add the extra column in the code and add code for what to do if the new column doesn't exist (i.e. old data) then deploy it with zero downtime.
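A minimal sketch of that pattern, with documents represented as plain dictionaries. The field names `name` and `nickname` and the function `display_name` are hypothetical; the point is only that the reading code, not a schema migration, handles records written before the field existed.

```python
def display_name(doc):
    """Prefer the newer 'nickname' field; fall back to 'name' for old documents."""
    return doc.get("nickname") or doc["name"]

old_doc = {"name": "Ada Lovelace"}                                # written before the change
new_doc = {"name": "Grace Hopper", "nickname": "Amazing Grace"}   # written after it

names = [display_name(doc) for doc in (old_doc, new_doc)]
```

The cost is that every reader has to carry this fallback logic for as long as old documents exist.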

      These points don't really come into play until you have a huge dataset however so for most stuff I sti

  • by Anonymous Coward

    I can't believe these assholes are getting in an argument about SQL vs NoSQL. Apples and Oranges. NoSQL isn't a complete replacement, nor are RDBMSs the solve-all solution when you need to scale. Sounds like a bunch of db admins getting threatened that their jobs are going to be in jeopardy.
