Databases

Cassandra NoSQL Database 1.2 Released

Posted by Soulskill
from the onward-and-upward dept.
Billly Gates writes "The Apache Foundation released version 1.2 of Cassandra today, which is becoming quite popular among those wanting more performance than a traditional RDBMS. You can grab a copy from this list of mirrors. This release includes virtual nodes for backup and recovery. Another added feature is 'atomic batches,' where batches can be reapplied if one of them fails. They've also added support for integrating into Hadoop. Although Cassandra does not directly support MapReduce, it can more easily integrate with other NoSQL databases that use it with this release."
This discussion has been archived. No new comments can be posted.

  • Re:Hmm. (Score:5, Informative)

    by Corporate T00l (244210) on Wednesday January 02, 2013 @05:34PM (#42454471) Journal

    You'll see these kinds of large-scale columnar stores like Cassandra or HBase being used a lot in metrics and log management projects.

    For instance, if you want to generate a histogram of login processing time over the last 90 days, you'll need to record the times of all of your individual logins to do that. If you have millions of logins per hour, that single metric alone is going to generate a lot of rows. If you're also measuring many other points throughout your system, the data starts getting unmanageable with B-tree backed databases and not of high enough value to store in RAM.

    In the past, you might deal with this by adding more sophisticated logic at the time of collection. Maybe I'll do random sampling and only pick 1 out of every 1000 transactions to store. But then, I might have a class of users I care about (e.g. users logging in from Syria compared to all users logging in around the world) where the sample frequency causes them to drop to zero. So then I have to do more complicated logic that will pick out 1 out of every 1000 transactions but with separate buckets for each country. But then every time your bucketing changes, you have to change the logic at all of the collection points. I can't always predict in advance what buckets I might need in the future.
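    The per-bucket sampling described above can be sketched roughly as follows. This is only an illustration: the class name, the `rate` parameter, and the `bucket_of` callback are made up for the example, and it counts events deterministically (first event in each bucket, then every 1000th) where a real collector would more likely sample at random.

```python
from collections import defaultdict


class BucketedSampler:
    """Keep 1 out of every `rate` events, counted separately per bucket."""

    def __init__(self, rate):
        self.rate = rate
        self.seen = defaultdict(int)  # events seen so far, per bucket

    def should_keep(self, bucket):
        self.seen[bucket] += 1
        # Keep the first event in each bucket, then every `rate`-th after it,
        # so a rare bucket never drops to zero samples.
        return (self.seen[bucket] - 1) % self.rate == 0


def sample(events, rate, bucket_of):
    """Return the events kept by a fresh sampler; `bucket_of` maps event -> bucket."""
    sampler = BucketedSampler(rate)
    return [e for e in events if sampler.should_keep(bucket_of(e))]
```

    Note that `bucket_of` is baked in at the collection point, which is exactly the maintenance problem: a new bucketing scheme means redeploying this logic everywhere events are collected.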

    With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but log-oriented DBs like Cassandra are specifically architected to use). The data collected can have indexes (really more like inverted indexes, but that is a longer discussion) up front for quick query of data facets that you know you want in advance, but still retains the property of super-fast-insert-on-cheap-hardware so that you can store all of the raw data and come back for it later when there is something you didn't think of in advance, and map-reduce for the answer.

  • Re:Hmm. (Score:4, Informative)

    by samkass (174571) on Wednesday January 02, 2013 @05:53PM (#42454701) Homepage Journal

    If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (Blogs, posts, etc) then yes.

    I'm going to pretend you weren't trolling to address a good point here. NoSQL is very valuable for human-to-human data. I've seen it be hugely successful in cases when you only need a "human" level of precision about ordering, consistency, and detail. It eliminates single points of failure, global locks, offline operation problems, write contention, etc. It introduces problems for indexing and absolute consistency. But without widespread indexing you tend to get brute-force (Map-Reduce) or narrow-focus (offline indexes on specific field) searches. And that's okay for most humans.

  • Re:Hmm. (Score:5, Informative)

    by Sarten-X (1102295) on Wednesday January 02, 2013 @06:08PM (#42454897) Homepage

    That's almost exactly wrong.

    "Free-form crap" like blogs doesn't really care what database it's in. Use a blob in MySQL, and it won't matter. You'll be pulling the whole field as a unit and won't do analysis anyway.

    The analysis of atomic data is exactly what NoSQL stores are designed for. MapReduce programs are built to evaluate every record in the table, filter out what's interesting, then run computation on that. The computation is done in stages that can be combined later in a multistage process. Rather than joining tables to build a huge set of possibilities, then trimming that table down to a result set, the query operates directly on a smaller data set, leaving correlation for a later stage. The result is a fast and accurate statistic, though there is a loss of precision due to any inconsistent records. Hence, bigger databases are preferred to minimize the error.
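    A toy version of that staged flow, in plain Python rather than any real MapReduce framework, might look like the sketch below. The field names (`country`, `login_ms`) and function names are assumptions for illustration only. Inconsistent records are skipped in the map stage, and the reduce stage keeps (count, total) pairs so partial results could be combined in a later stage.

```python
from collections import defaultdict


def map_stage(records):
    """Scan every record, emit (key, value) pairs, skip incomplete ones."""
    for rec in records:
        country = rec.get("country")
        ms = rec.get("login_ms")
        if country is None or not isinstance(ms, (int, float)):
            continue  # inconsistent record: ignored rather than repaired
        yield country, ms


def reduce_stage(pairs):
    """Fold values per key into (count, total), which merges cleanly later."""
    acc = defaultdict(lambda: [0, 0.0])
    for key, value in pairs:
        acc[key][0] += 1
        acc[key][1] += value
    return {key: (count, total) for key, (count, total) in acc.items()}


def mean_by_key(reduced):
    """Final stage: turn (count, total) into a per-key average."""
    return {key: total / count for key, (count, total) in reduced.items()}
```

    The skipped records are where the loss of precision comes from: the more good records there are per key, the less the dropped ones matter.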

    I like the analogy of NoSQL being a cabinet full of files, though I'd alter it a little. Rather than having no idea what's in the files, we do know what they're supposed to contain, but they're old and may not be perfectly complete as expected. To find some information about the contents, we have to dive in, flip through all the files, and make an effort. Yes, some files will be useless for our search, and some will be missing important data - but we can still get information of statistical significance. Note that over time, the forms might even change, adding new fields or changing options. We might have to ask a supervisor how to handle such odd cases, which is analogous to pushing some decisions back to the application.

  • Re:Hmm. (Score:5, Informative)

    by Sarten-X (1102295) on Wednesday January 02, 2013 @08:31PM (#42456445) Homepage

    I think I'd been led the wrong direction on use cases for nosql solutions.

    It sounds like you probably have. There's a lot of misinformation out there parroted by folks who don't really understand NoSQL paradigms. They'll say it lacks ACID, has no schema, relations, or joins, and they'd be right, but sometimes those features aren't actually necessary for a particular application. That's why I keep coming back to statistics: Statistical analysis is perfect for minimizing the effect of outliers such as corrupt data.

    The idea of "agility" sounded good, which to my mind meant worrying less about the schema.

    Ah, but that's only half of it. You don't have to worry about the schema in a rigid form. You do still need to arrange data in a way that makes sense, and you'll need to have a general idea of what you'll want to query by later, just to set up your keys. If you're working with, for instance, Web crawling records, a URL might make a good key.

    If I need to add field to something, I add a field.

    Most NoSQL products are column-centric. Adding a column is a trivial matter, and that's exactly how they're meant to be used. Consider the notion of using columns whose names are timestamps. In an RDBMS, that's madness. In HBase, that's almost* ideal. A query by date range can simply ask for rows that have columns matching that particular range. For that web crawler, it'd make perfect sense to have one column for each datum you want to record about a page at a particular time. Perhaps just headers, content, and HTTP code each time, but that's three new columns every time a page is crawled - and assuming a sufficiently-slow crawler, each row could have entirely different sets of columns!
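    To make the timestamp-as-column-name idea concrete, here is an in-memory model using nested dicts. It is not the HBase client API; the `put` and range-query helpers and the `(timestamp, field)` column naming are invented for this sketch, with the URL as the row key.

```python
def put(table, row_key, timestamp, field, value):
    """Add one column to a row; the column name is the (timestamp, field) pair."""
    table.setdefault(row_key, {})[(timestamp, field)] = value


def columns_in_range(table, row_key, start, end):
    """Return the columns of one row whose timestamp is in [start, end)."""
    row = table.get(row_key, {})
    return {col: val for col, val in sorted(row.items()) if start <= col[0] < end}


def rows_with_columns_in_range(table, start, end):
    """Return the row keys that have at least one column in [start, end)."""
    return sorted(k for k in table if columns_in_range(table, k, start, end))
```

    Every crawl just adds columns to the row, so two rows never need to share the same set of column names.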

    But the part about no relations always seemed like a show stopper for any case I'm likely to encounter.

    It's not that there aren't relations, but that they aren't enforced. A web site might have had a crawl attempted, but a 404 was returned. It could still be logged by just having a missing content column for that particular timestamp, and only the 404 column filled. On later queries about content, a filter would ignore everything but 200 responses. For statistics about dead links, the HTTP code might be all that's queried. On-the-fly analysis can be done without reconfiguring the data store.
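    As a rough sketch of that kind of on-the-fly filtering, assume a row is a dict whose column names are `(timestamp, field)` pairs, with `code` and `content` as the field names. Both the layout and the helper names are assumptions for the example, not anything a particular store prescribes.

```python
def crawls(row):
    """Group a row's (timestamp, field) columns into one dict per crawl."""
    by_time = {}
    for (ts, field), value in row.items():
        by_time.setdefault(ts, {})[field] = value
    return by_time


def successful_content(row):
    """Content queries: ignore everything but 200 responses."""
    return {
        ts: crawl["content"]
        for ts, crawl in sorted(crawls(row).items())
        if crawl.get("code") == 200 and "content" in crawl
    }


def dead_link_times(row):
    """Dead-link statistics: only the HTTP code is consulted."""
    return sorted(ts for ts, crawl in crawls(row).items() if crawl.get("code") == 404)
```

    The 404 crawl simply has no content column; nothing in the store had to be declared or migrated to allow that.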

    It'd be nice to store user status updates in a way where I don't have to worry too much about types of update, but I can't do that if correlating 'mentions', the user that posted it, and visibility against user groups would be a problem.

    Here's one solution, taking advantage of the multi-value aspect of each row (because that's really the important part [slashdot.org]):

    Store a timestamped column for each event (status update, mention, visibility change). As you guessed, don't worry much about what each event is, but just store the details (much like Facebook's timeline thing). When someone tries to view a status, run a query to pull all events for the user, and run through them to determine the effective visibility privileges, the most recent status, and the number of "this person was mentioned" events. There's your answer.

    As you may guess, that'd be pretty slow, but we do have the flexibility to do any kind of analysis without reconfiguring our whole database. We could think ahead a bit, though, and add to our schema for a big speed boost: Whenever a visibility change happens, the new settings are stored serialized in the event. Sure, it violates normalization, but we don't really care about that. Now, our query need not replay all of the user's events... just enough to get the last status and visibility, and any "mentioned" events. That'll at least be pretty likely constant time, regardless of how long our users have been around.
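    A minimal sketch of the "replay just enough" query, covering only the last status and visibility: events are assumed to be plain dicts in oldest-first order, and the field names (`type`, `text`, `settings`) are hypothetical. Because each visibility event carries the full serialized settings, the newest-first scan can stop as soon as it has seen one of each.

```python
def current_state(events):
    """Scan newest-first; stop once the latest status and visibility are found.

    Returns (status_text, visibility_settings, events_scanned).
    """
    status = None
    visibility = None
    scanned = 0
    for event in reversed(events):
        scanned += 1
        if status is None and event["type"] == "status":
            status = event["text"]
        elif visibility is None and event["type"] == "visibility":
            # Full settings are stored in the event, so no older
            # visibility events need to be replayed.
            visibility = event["settings"]
        if status is not None and visibility is not None:
            break
    return status, visibility, scanned
```

    The `scanned` count is returned only to show how much of the history was touched; it depends on how recent the last status and visibility change are, not on the total length of the user's history.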

    Counting all those "mentioned" events might
