Cassandra NoSQL Database 1.2 Released 55
Billly Gates writes "The Apache Foundation released version 1.2 of Cassandra today which is becoming quite popular for those wanting more performance than a traditional RDBMS. You can grab a copy from this list of mirrors. This release includes virtual nodes for backup and recovery. Another added feature is 'atomic batches,' where patches can be reapplied if one of them fails. They've also added support for integrating into Hadoop. Although Cassandra does not directly support MapReduce, it can more easily integrate with other NoSQL databases that use it with this release."
Hmm. (Score:4, Interesting)
Maybe someone can explain this to me. I've been keeping an eye out for situations where it would make more sense to use a NoSQL solution like Mongo, Couch, etc. for a year or so now, and I just haven't found one.
Under what circumstances do people use a data store that doesn't need data relationships?
Re: (Score:3, Insightful)
When the project is run by an idiot who thinks they need to incorporate buzzwords over substance into their work.
Re: (Score:2)
But your old-fashioned DB isn't "Web Scale": http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html [highscalability.com]
Sorry, I love this video.
Re: (Score:1)
But your old-fashioned DB isn't "Web Scale": http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html [highscalability.com]
Sorry, I love this video.
Unfortunately, it didn't work out that well for Hitler either [youtube.com].
Re: (Score:2)
That is funny, but the best part is remembering when MySQL used to be mocked for not being ACID-compliant, robust, etc., and the comeback was "well, it's really fast."
Re: (Score:2)
Under what circumstances do people use a data store that doesn't need data relationships?
A crude 1980s filesystem, on a system where they don't officially allow direct file storage but do provide a database capable of holding arbitrary binary data.
Re:Hmm. (Score:5, Insightful)
Assuming you're not trolling...
When one wants to write a ton of data as fast as possible, where the data may not actually be complete or consistent (but still useful). Something on the order of a million rows a minute is a prime candidate for a NoSQL store. Consider, for example, the sum of all posts on Facebook at any given time.
From the other side, an application like the current trend of "Big Data" models, monitoring every aspect of every action on a website (or in a hospital, or through a retail distribution chain, or the environmental systems of a factory) to glean statistically-meaningful information also makes a good use case for NoSQL. At the expense of consistency, the store is designed to be fast and fault-tolerant, so it really doesn't matter whether the data's complete or not. For Big Data applications, which are interested only in statistics, having a few inconsistent records out of billions doesn't matter much to the end result.
Sure, traditional RDBMSs can be tweaked and optimized to make any particular query run as fast as any NoSQL engine... but that's an expensive and time-consuming process that's often not feasible.
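To make the "write a ton of data as fast as possible" case concrete, here's a rough Python sketch using the DataStax Cassandra driver. The keyspace, table, and the incoming_events() generator are all made up for illustration:

from cassandra.cluster import Cluster

def incoming_events():
    # Stand-in for whatever firehose actually feeds the cluster.
    yield ('web-01', '2013-01-02T12:00:00', 'login ok')
    yield ('web-02', '2013-01-02T12:00:01', 'login failed')

cluster = Cluster(['10.0.0.1', '10.0.0.2'])       # any reachable nodes will do
session = cluster.connect('metrics')              # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO events (source, ts, payload) VALUES (?, ?, ?)")

futures = []
for source, ts, payload in incoming_events():
    # Fire writes asynchronously; each is acknowledged as soon as the
    # requested consistency level has been satisfied.
    futures.append(session.execute_async(insert, (source, ts, payload)))

for f in futures:
    f.result()                                    # surface any write errors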
Re: (Score:2)
As for your first case, it's less a factor of speed than it is the content of what you are writing. If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (blogs, posts, etc.), then yes. If you need to be able to query the data at a later point and be able to run statistics on it regularly, then no, especially if accuracy in the statistics is important.
And on the other side, NoSQL typically fails much more than it succeeds because NoSQL defers most of its lo
Re:Hmm. (Score:4, Informative)
If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (blogs, posts, etc.), then yes.
I'm going to pretend you weren't trolling to address a good point here. NoSQL is very valuable for human-to-human data. I've seen it be hugely successful in cases where you only need a "human" level of precision about ordering, consistency, and detail. It eliminates single points of failure, global locks, offline operation problems, write contention, etc. It introduces problems for indexing and absolute consistency. But without widespread indexing you tend to get brute-force (Map-Reduce) or narrow-focus (offline indexes on specific fields) searches. And that's okay for most humans.
Re:Hmm. (Score:5, Informative)
That's almost exactly wrong.
"Free-form crap" like blogs doesn't really care what database it's in. Use a blob in MySQL, and it won't matter. You'll be pulling the whole field as a unit and won't do analysis anyway.
The analysis of atomic data is exactly what NoSQL stores are designed for. MapReduce programs are built to evaluate every record in the table, filter out what's interesting, then run computation on that. The computation is done in stages that can be combined later in a multistage process. Rather than joining tables to build a huge set of possibilities, then trimming that table down to a result set, the query operates directly on a smaller data set, leaving correlation for a later stage. The result is a fast and accurate statistic, though there is a loss of precision due to any inconsistent records. Hence, bigger databases are preferred to minimize the error.
I like the analogy of NoSQL being a cabinet full of files, though I'd alter it a little. Rather than having no idea what's in the files, we do know what they're supposed to contain, but they're old and may not be perfectly complete as expected. To find some information about the contents, we have to dive in, flip through all the files, and make an effort. Yes, some files will be useless for our search, and some will be missing important data - but we can still get information of statistical significance. Note that over time, the forms might even change, adding new fields or changing options. We might have to ask a supervisor how to handle such odd cases, which is analogous to pushing some decisions back to the application.
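To make the "filter, then compute in stages" idea above concrete, here's a toy sketch in plain Python rather than an actual Hadoop job; the record layout is invented:

# Toy illustration of the staged map/filter/reduce flow described above.
records = [
    {"country": "US", "latency_ms": 120},
    {"country": "DE", "latency_ms": 95},
    {"country": "US"},                      # incomplete record: no latency stored
    {"country": "US", "latency_ms": 210},
]

# Map/filter stage: keep only records that carry the field we care about.
mapped = [(r["country"], r["latency_ms"]) for r in records if "latency_ms" in r]

# Reduce stage: combine per-key partial results into an average per country.
totals = {}
for country, latency in mapped:
    s, n = totals.get(country, (0, 0))
    totals[country] = (s + latency, n + 1)

averages = {c: s / n for c, (s, n) in totals.items()}
print(averages)   # the incomplete record is simply ignored; the statistic tolerates it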
Re:Hmm. (Score:5, Informative)
I think I'd been led in the wrong direction on use cases for NoSQL solutions.
It sounds like you probably have. There's a lot of misinformation out there parroted by folks who don't really understand NoSQL paradigms. They'll say it lacks ACID, has no schema, relations, or joins, and they'd be right, but sometimes those features aren't actually necessary for a particular application. That's why I keep coming back to statistics: Statistical analysis is perfect for minimizing the effect of outliers such as corrupt data.
The idea of "agility" sounded good, which to my mind meant worrying less about the schema.
Ah, but that's only half of it. You don't have to worry about the schema in a rigid form. You do still need to arrange data in a way that makes sense, and you'll need to have a general idea of what you'll want to query by later, just to set up your keys. If you're working with, for instance, Web crawling records, a URL might make a good key.
If I need to add a field to something, I add a field.
Most NoSQL products are column-centric. Adding a column is a trivial matter, and that's exactly how they're meant to be used. Consider the notion of using columns whose names are timestamps. In an RDBMS, that's madness. In HBase, that's almost ideal. A query by date range can simply ask for rows that have columns matching that particular range. For that web crawler, it'd make perfect sense to have one column for each datum you want to record about a page at a particular time. Perhaps just headers, content, and HTTP code each time, but that's three new columns every time a page is crawled - and assuming a sufficiently slow crawler, each row could have an entirely different set of columns!
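Roughly what that crawler layout looks like, modeled here with plain Python dicts; the row keys, column names, and timestamps are all made up:

# Each outer key is a row key (the URL); each inner key is a column whose
# name embeds the crawl timestamp.
crawl_table = {
    "http://example.com/": {
        "1357084800:http_code": 200,
        "1357084800:headers":   "Content-Type: text/html",
        "1357084800:content":   "<html>...</html>",
        "1357171200:http_code": 404,     # later crawl: only the code column exists
    },
    "http://example.org/": {
        "1357098765:http_code": 200,
        "1357098765:content":   "<html>...</html>",
    },
}

# A "query by date range" is then just a filter over column names per row.
start, end = 1357084800, 1357100000
for url, columns in crawl_table.items():
    hits = {name: v for name, v in columns.items()
            if start <= int(name.split(":")[0]) <= end}
    print(url, hits)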
But the part about no relations always seemed like a show stopper for any case I'm likely to encounter.
It's not that there aren't relations, but that they aren't enforced. A web site might have had a crawl attempted, but a 404 was returned. It could still be logged by leaving the content column missing for that particular timestamp and filling in only the HTTP code column with the 404. On later queries about content, a filter would ignore everything but 200 responses. For statistics about dead links, the HTTP code might be all that's queried. On-the-fly analysis can be done without reconfiguring the data store.
It'd be nice to store user status updates in a way where I don't have to worry too much about types of update, but I can't do that if correlating 'mentions', the user that posted it, and visibility against user groups would be a problem.
Here's one solution, taking advantage of the multi-value aspect of each row (because that's really the important part [slashdot.org]):
Store a timestamped column for each event (status update, mention, visibility change). As you guessed, don't worry much about what each event is, but just store the details (much like Facebook's timeline thing). When someone tries to view a status, run a query to pull all events for the user, and run through them to determine the effective visibility privileges, the most recent status, and the number of "this person was mentioned" events. There's your answer.
As you may guess, that'd be pretty slow, but we do have the flexibility to do any kind of analysis without reconfiguring our whole database. We could think ahead a bit, though, and add to our schema for a big speed boost: whenever a visibility change happens, the new settings are stored serialized in the event. Sure, it violates normalization, but we don't really care about that. Now, our query need not replay all of the user's events... just enough to get the last status and visibility, and any "mentioned" events. That'll at least be roughly constant time, regardless of how long our users have been around.
Counting all those "mentioned" events might
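For what it's worth, the replay part of that scheme is simple enough to sketch in a few lines of Python. The event shapes and field names are invented, and I'm assuming the query hands events back in timestamp order:

events = [
    {"ts": 1, "type": "visibility", "groups": ["friends"]},
    {"ts": 2, "type": "status", "text": "hello world"},
    {"ts": 3, "type": "mention", "by": "alice"},
    {"ts": 4, "type": "status", "text": "still here"},
    {"ts": 5, "type": "mention", "by": "bob"},
]

visibility, last_status, mentions = None, None, 0
for e in events:                      # replay in timestamp order
    if e["type"] == "visibility":
        visibility = e["groups"]      # most recent settings win
    elif e["type"] == "status":
        last_status = e["text"]
    elif e["type"] == "mention":
        mentions += 1

print(visibility, last_status, mentions)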
Re: (Score:1)
Re: (Score:2)
A query by date range can simply ask for rows that have columns matching that particular range. For that web crawler, it'd make perfect sense to have one column for each datum you want to record about a page at a particular time. Perhaps just headers, content, and HTTP code each time, but that's three new columns every time a page is crawled - and assuming a sufficiently slow crawler, each row could have an entirely different set of columns!
Jesus Christ.
Please explain in what possible universe what you j
Re: (Score:2)
Oh dear... I seem to have offended your RDBMS-is-God sensibilities again [slashdot.org]. I do so love a good argument. I hope I can find one...
Please explain in what possible universe what you just described is better than a normal relational table where each row contains a timestamp, headers, content, and HTTP code. (And presumably a URL, although you left that out.)
One where every row has a monetary (and time) cost, which is conveniently close to the one we live in. On a huge database, pulling a specific set of rows from a date range may or may not actually align well with how the database is sharded. If you've been partitioning the table by the "URL" column, and now you want to query by the "timestamp" column for a single "URL" value, you're
Re: (Score:2)
One where every row has a monetary (and time) cost, which is conveniently close to the one we live in. On a huge database, pulling a specific set of rows from a date range may or may not actually align well with how the database is sharded. If you've been partitioning the table by the "URL" column, and now you want to query by the "timestamp" column for a single "URL" value, you're likely going to be doing all your work on a single shard, on a single server. Conversely, if you partition the table by timest
Re: (Score:1)
One thing I'm not seeing in the comments yet: don't forget that most NoSQL solutions are written for commodity hardware, which also makes them very suitable for cloud solutions. To get the same kind of performance out of a relational DB, you need expensive hardware.
Cassandra can also be made aware of the rack or data center the nodes are running in, so it can lay out its data replicas for regional data safety (think EC2 data center failures, all too common) but still offer optimal local data access.
Re: (Score:2)
Re:Hmm. (Score:5, Informative)
You'll see these kinds of large-scale columnar stores like Cassandra or HBase being used a lot in metrics and log management projects.
For instance, if you want to generate a histogram of login processing time over the last 90 days, you'll need to record the times of all of your individual logins to do that. If you have millions of logins per hour, that single metric alone is going to generate a lot of rows. If you're also measuring many other points throughout your system, the data starts getting unmanageable with B-tree backed databases and not of high enough value to store in RAM.
In the past, you might deal with this by adding more sophisticated logic at the time of collection. Maybe I'll do random sampling and only pick 1 out of every 1000 transactions to store. But then, I might have a class of users I care about (e.g. users logging in from Syria compared to all users logging in around the world) where the sample frequency causes them to drop to zero. So then I have to do more complicated logic that will pick out 1 out of every 1000 transactions but with separate buckets for each country. But then every time your bucketing changes, you have to change the logic at all of the collection points. I can't always predict in advance what buckets I might need in the future.
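That per-bucket sampling ends up looking something like this toy Python sketch (the bucket key and rate are made up), and it's exactly the kind of collection-point code you have to keep rewriting every time the buckets change:

from collections import defaultdict

RATE = 1000
counters = defaultdict(int)

def maybe_record(txn, store):
    # Sample roughly 1 in RATE transactions, but count separately per country
    # so low-traffic buckets don't drop to zero.
    bucket = txn["country"]
    counters[bucket] += 1
    if counters[bucket] % RATE == 1:   # keeps the 1st, 1001st, 2001st ... per bucket
        store.append(txn)

store = []
maybe_record({"country": "SY", "login_ms": 340}, store)
maybe_record({"country": "US", "login_ms": 120}, store)
print(store)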
With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but which log-oriented DBs like Cassandra are specifically architected to use). The data collected can have indexes (really more like inverted indexes, but that is a longer discussion) up front for quick query of data facets that you know you want in advance, but still retains the property of super-fast-insert-on-cheap-hardware, so that you can store all of the raw data and come back for it later when there is something you didn't think of in advance, and map-reduce for the answer.
Re: (Score:2)
With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but which log-oriented DBs like Cassandra are specifically architected to use).
I'll have to do some performance testing later, but you do realize that almost all relational databases support a concept known as clustered indexes, which takes advantage of sequential access, correct? Sounds like you don't understand how current relational databases work.
Fast (Score:2)
It always pays to use relational over NoSQL when you can. But just as it makes sense to denormalize for performance in data warehousing, it can sometimes make sense to organize the data around specific computations in ways that sacrifice the ability to use SQL.
You won't find any good reason with normal-sized data sets and a normal number of joins. But for computations that require large tables joined multiple times in complex ways that can't be overcome with tricks like indexing... then it can make sense to
Re: (Score:2)
It's for people who were letting their programming frameworks do whatever the fuck they want with their database structures and decided to take that one step farther.
Admittedly, I kind of like it for low(er) value things where you're likely to have some variation in the structures being inserted, like logging and tracking the status of long-running tasks (upsert and appending to arrays FTW). That's about the only use I've found for the tech, though, and I admit that even in those cases its use is largely
Re: (Score:1)
I use it when I need a database that supports relationships, tons of them, and doesn't falter at the same relationship type having completely different fields. It's the same -freaking- relationship, with supporting information from several different systems.
I use Neo4j, which is only technically NoSQL, but it has a few query languages of its own. But I always chuckle at "relational" databases because they all seem to collapse under too many relationships: "X" is_a this, is_a that, is_a this2, is_a... why do
Re: (Score:1)
It's not always about the data relationships. Cassandra for example is very easy to scale horizontally (much easier than traditional databases) and can achieve very high throughput. Last time I checked (a year ago) I could get over 50,000 stores/queries per second on a cluster of cheap commodity hardware (4 servers). That result was achieved with full redundancy (n=2). Such a setup is very resilient against failure (provided clients handle failure of individual nodes correctly). Maintaining such a cluster i
Re: (Score:2)
Re: (Score:2)
Under what circumstances do people use a data store that doesn't need data relationships?
Think (huge!) web content management systems with tree-structured, component-based pages, where the data varies widely from one page type to the next and business requirements are constantly in flux.
While there are definitely data relationships, they're not necessarily very comfortable to map in a traditional RDBMS.
batch != patch (Score:2)
I'm not sure if it's a typo or a misunderstanding, but the statement in the summary about atomic batching is hilariously incorrect.
Atomic batching has nothing to do with "patches can be reapplied if one of them fails", but rather the more pedantic yet common case where you want a set of data updates to be batched atomically, where all or none of the changes occur, but nothing in between.
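For the curious, a logged (atomic) batch in Cassandra 1.2 looks roughly like this through CQL; this sketch uses the DataStax Python driver, and the keyspace and tables are made up:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('shop')   # hypothetical keyspace

# The batch is written to a batchlog first, so either all of these writes
# end up applied or the batch is replayed until they are - a mid-batch
# failure doesn't leave only some of the writes behind.
session.execute("""
    BEGIN BATCH
        INSERT INTO orders   (order_id, user_id, total) VALUES (42, 7, 19.99);
        INSERT INTO receipts (order_id, status)         VALUES (42, 'paid');
    APPLY BATCH
""")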
Re: (Score:2)
sounds like a transaction
Re: (Score:2)
Looks like there are two parts here. One of them is communicating the changeset to (one or more) nodes; the other part is actually applying it. If the coordinator failed halfwa
Re: (Score:2)
the whole http://blogs.apache.org/ [apache.org] domain seems to return this 502 error right now. Maintenance, some other problem, or just slashdotted, even if it is an Apache domain?
They seem to be using "Apache/2.0.63 Server at blogs.apache.org Port 80" in reverse-proxy mode and my guess is the server behind it is down.
Blog entry from Google cache (Score:1)
The Apache Software Foundation Blog
Wednesday Jan 02, 2013
The Apache Software Foundation Announces Apache Cassandra v1.2
High-performance, super-robust Big Data distributed database introduces support for dense clusters, simplifies application modeling, and improves data cell storage, design, and representation.
Forest Hill, MD –2 January 2013– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source
NoSQL? Then what? (Score:2)
There must be something I don't understand. For me, the whole point of databases is precisely that they come with SQL, so you can easily do even complex stuff with them.
How can the absence of the only useful feature be a "selling" point? No SQL? No thanks...
Re: (Score:1)
SQL is anything but easy from an app development viewpoint. You have to either mix it into your code, which is ugly in itself and creates tons of potential SQL injection bugs, or you use an ORM, and then your database is probably unusable with conventional tools.
NoSQL solves the problem, as native bindings to different languages are the standard interface in this world.
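As one example of what "native bindings" look like in practice, here's a rough pymongo sketch with a made-up database and collection; the data goes in and out as ordinary language objects, with no SQL strings to assemble:

from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['blog']    # hypothetical database

db.posts.insert_one({"author": "alice",
                     "title": "Cassandra 1.2 released",
                     "tags": ["nosql", "cassandra"]})

for post in db.posts.find({"tags": "cassandra"}):         # matches array elements
    print(post["title"])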
Re: (Score:2)
The important parts of NoSQL really boil down to:
1. Very high performance.
2. Ability to handle extremely large data (on the order of tens or hundreds of terabytes).
3. Natural way of dealing with non-flat, non-BLOB data.
4. Better integration with OO languages.
#1 and #2 both come with trade-offs, which is perfectly fine. Not all problems need ACID compliance.
#3 & #4 really go back to the 90s, though
Re: (Score:1)
Re: (Score:1)
Why not LDAP?
Not speaking w/ any authority, but afaik LDAP is just an over-the-wire protocol. It says nothing about the underlying database(s) or what the directory services actually represent. That said, see OpenLDAP [openldap.org] and LDAP vs RDBMS [openldap.org].
Re: (Score:2)
One of the useful features of Solr/Lucene is the MLT keyword (which stands for More Like This).
Another useful feature of many NOSQL databases is faceted searches with good performance.
It seems to be a very common practice to store the data in an SQL database and duplicate that data in a NoSQL database to use for searching; then, if the NoSQL database gets corrupted, you rebuild from the original data, and your searches are incomplete while the rebuild goes on. (worst case I've had to deal with is a couple
Re: (Score:2)
NoSQL does have some advantages. If you have 2+ GB of data in a relational database table and you wish to alter that table's schema, doing so can take a long time, during which your services will be down. Since non-relational databases allow for schema-less data, you can simply add the extra column in the code and add code for what to do if the new column doesn't exist (i.e. old data), then deploy it with zero downtime.
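The "add code for what to do if the new column doesn't exist" part looks something like this in practice (a Python sketch; the field names are invented):

def read_user(raw):
    user = dict(raw)
    # Old rows were written before 'email_verified' existed; treat them as False.
    user.setdefault("email_verified", False)
    return user

old_row = {"name": "alice"}                       # stored before the new field existed
new_row = {"name": "bob", "email_verified": True}

print(read_user(old_row))   # {'name': 'alice', 'email_verified': False}
print(read_user(new_row))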
These points don't really come into play until you have a huge dataset however so for most stuff I sti
Apples and Oranges folks (Score:1)
I can't believe these assholes are getting into an argument about SQL vs NoSQL. Apples and Oranges. NoSQL isn't a complete replacement, nor are RDBMSs the solve-all solution when you need to scale. Sounds like a bunch of DB admins feeling threatened that their jobs are going to be in jeopardy.