MapReduce — a Major Step Backwards?
The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."
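For readers who haven't seen the model, here is a minimal single-process sketch of the two phases, using the word-count example from the MapReduce paper (the function names here are mine; a real implementation distributes these phases across a cluster):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to every record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key (the step a real cluster does over the network)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key's group of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical word-count example: map emits (word, 1), reduce sums the counts.
def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle(map_phase(lines, word_map)), word_reduce)
# result["the"] == 3, result["fox"] == 2
```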
may be missing the (data)points (Score:5, Insightful)
I don't know why this article is so harshly critical of MapReduce. They base their critique on five tenets, which they elaborate in detail in the article.
If you take the time to read the article, you'll find they use axiomatic arguments with lemmas like "schemas are good" and "separation of the schema from the application is good," etc. First, they assume that these points are relevant and germane to MapReduce. But they mostly aren't.
Also taking the five tenets listed, here are my observations:
They don't offer any proof, merely their view... However, the fact that Google used this technique to regenerate their entire internet index leads me to believe that if this were indeed a giant step backward, we must have been pretty darned evolved to step "back" into such a backwards approach.
Not sure why brute force is such a poor choice, especially given what this technique is used for.
Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.
I would also argue their bold approach to applying these techniques in such a massively aggregated architecture is at least a little novel, and based on results of how Google has used it, effective.
They're mistakenly assuming this is for database programming
See previous bullet
Are these guys just trying to stake a reputation based on being critical of Google?
Just watch. (Score:2, Insightful)
And watch. It'll be massively successful because it works.
Re:may be missing the (data)points (Score:4, Insightful)
Databases? WTF? (Score:5, Insightful)
Since when did MapReduce have anything to do with databases? It's actually about parallel computations, which are entirely different.
Money, meet mouth (Score:4, Insightful)
Now, this is not to say that a more sophisticated approach wouldn't work. It's just that when you have thousands of boxes in a few ethernet segments, communication overhead becomes quite large, so large in fact that whatever extra computation brute force requires is usually worth paying. Consider that, from what I've heard, at Google these thousands of boxes are mostly containers for RAM modules, so there's rather a lot of computation power per gigabyte to spare for a brute-force system.
Also, I would like to point out that map/reduce is demonstrated to work. Apparently quite well too. Certainly better than any hypothetical "better" massively parallel RDBMS available in a production quality implementation today.
As one of the comments on the blog ... (Score:4, Insightful)
"You seem to not have noticed that mapreduce is not a DBMS."
Exactly. These are the same sort of criticisms that you hear around memcached [danga.com] - the feature set is smaller, etc - and they make the same mistake. It's not a DBMS, and it's not supposed to be. But it does what it does quite well nonetheless!
Re:may be missing the (data)points (Score:3, Insightful)
Ideas ahead of their time? (Score:5, Insightful)
There are many classic/old techniques which are only now being used, and very successfully, precisely because the hardware simply wasn't there before.
Sometimes the "old" methods are best - you just need the horsepower to pull it off. Clever improvements only scale so long.
A completely uninformed analysis (Score:3, Insightful)
Even more importantly, you can create schemas with MapReduce by how you write your Map/Reduce functions. This is a matter of the data/function exchange (all data can be represented as a function; likewise, all functions can be represented as data). I admit ignorance of how this MapReduce system works, but I would be surprised if you couldn't get a relational database back out.
The advantage you get with MapReduce is that you aren't necessarily tied to a single representation of the data. Especially for companies like Google, which may want to create dynamic groupings of data, this could be a big win. Again, this is all speculative, as I have very little experience with these systems.
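To illustrate the point (speculatively, like the comment itself): the map function is where the "schema" lives, because it decides how each raw record parses into typed fields. A toy single-process sketch, with the record layout and names invented for illustration:

```python
from collections import defaultdict

def map_access_log(line):
    # Raw record: "2008-01-17 GET /index.html 200"
    # The parse IS the schema: field names and types exist only in this
    # function, not in the storage layer, so a different map function could
    # impose a different schema on the same bytes.
    date, method, path, status = line.split()
    return [(path, int(status))]

def reduce_hits(path, statuses):
    # One possible "view" of the data: count successful requests per path.
    return sum(1 for s in statuses if s == 200)

lines = [
    "2008-01-17 GET /index.html 200",
    "2008-01-17 GET /index.html 200",
    "2008-01-17 GET /missing 404",
]
groups = defaultdict(list)
for line in lines:
    for key, value in map_access_log(line):
        groups[key].append(value)
hits = {path: reduce_hits(path, values) for path, values in groups.items()}
# hits["/index.html"] == 2
```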
A Very Human Response (Score:3, Insightful)
FTFA (Score:5, Insightful)
That's a joke, right?
I think Google's already taken care of all the experimental evaluations you'd need.
Missing the forest for the trees... (Score:4, Insightful)
Comparing it to a DBMS on fanciness is pointless, because the DBMS solution fails where MapReduce succeeds.
Re:may be missing the (data)points (Score:4, Insightful)
Re:may be missing the (data)points (Score:3, Insightful)
Article really misses the point (Score:5, Insightful)
Also, I had a major WTF moment when I read this:
Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.
Empirical evidence to date suggests that MapReduce scales insanely well. Exhibit A: Google, which uses MapReduce running on literally thousands of servers at a time to chew through literally hundreds of terabytes of data. (Google uses MapReduce to index the entire World Wide Web!)
This in turn suggests that the authors of TFA are firmly ensconced in the ivory tower.
They complained that brute-force is slower than indexed searches. Well, nothing about MapReduce rules out the use of indexes; and for common problems, Google can add indexes as desired. (Google uses MapReduce to build their index to the Web in the first place.) And because Google adds servers by the rackful, they have quite a lot of CPU power just waiting to be used. Brute force might not be slower if you split it across thousands of servers!
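The "brute force split across thousands of servers" point can be sketched in a few lines (a toy, single-process illustration; the function names are mine):

```python
# Why an unindexed scan can still be fast: the data is partitioned into shards,
# each scanned independently, so wall-clock time falls roughly linearly with
# the number of machines.
def scan_shard(records, needle):
    # the "brute force" part: check every record in this shard
    return [r for r in records if needle in r]

def parallel_scan(all_records, needle, n_workers=4):
    # In a real cluster each shard lives on a different machine; here we just
    # partition a list to show the shape of the work.
    shards = [all_records[i::n_workers] for i in range(n_workers)]
    results = []
    for shard in shards:  # these iterations run concurrently on a real cluster
        results.extend(scan_shard(shard, needle))
    return results

found = parallel_scan(["alpha error", "beta ok", "gamma error"], "error")
# found contains the two matching records (order may differ from the input)
```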
Likewise, they complain that one can't use standard database report-generating tools with MapReduce; but if the Reduce tasks insert their results into a standard database, one could then use any standard report-generating tools.
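That workflow is easy to sketch, with SQLite standing in for the reporting database (a toy illustration; the table and names are mine):

```python
import sqlite3

# Sketch of the workflow suggested above: have the Reduce output land in an
# ordinary SQL table so that standard reporting tools can query it afterward.
def load_reduce_output(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, n INTEGER)")
    conn.executemany("INSERT INTO word_counts VALUES (?, ?)", rows)
    conn.commit()
    return conn

# Pretend these pairs came out of a Reduce phase.
conn = load_reduce_output([("fox", 2), ("dog", 1)])
total = conn.execute("SELECT SUM(n) FROM word_counts").fetchone()[0]
# total == 3, and any SQL-speaking report tool could run the same query
```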
MapReduce lets Google folks do crazy one-off jobs like ask every single server they own to check through their system logs for a particular error, and if it's found, return a bunch of config files and log files. Even if you had some sort of distributed database that could run on thousands of machines, any of which might die at any moment, and if you planned ahead and set the machines to copy their system logs into the database, I don't see how a database would be better for that task. That's just a single task I just invented as an example; there are many others, and MapReduce can do them all.
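That kind of one-off job is essentially a map function with a trivial reduce: every machine scans its own local files, and only the ones that find the error emit anything. A toy sketch, with the error string and record layout invented for illustration:

```python
# Hypothetical one-off job of the kind described above. Each dict stands in
# for one machine's local files; the map function runs on that machine.
def map_check_host(host):
    if any("ERROR: disk timeout" in line for line in host["syslog"]):
        return [(host["name"], host["config"])]  # emit only on a match
    return []

fleet = [
    {"name": "srv1", "syslog": ["boot ok", "ERROR: disk timeout"], "config": "cfg-a"},
    {"name": "srv2", "syslog": ["boot ok"], "config": "cfg-b"},
]
matches = [pair for host in fleet for pair in map_check_host(host)]
# matches == [("srv1", "cfg-a")]
```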
And one of the coolest things about MapReduce is how well it copes with failure. Inevitably some servers will respond very slowly, or will die and not respond; the MapReduce scheduler detects this and sends the Map tasks out to other servers so the job still finishes quickly. And Google keeps statistics on how often a computer is slow. At a lecture, I heard a Google guy explain how there was a BIOS bug that made one server in 50 disable some cache memory, thus greatly slowing down server performance; the MapReduce statistics helped them notice they had a problem, and isolate which computers had the problem.
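The straggler handling described above amounts to speculative re-execution. A toy sketch (the names, threshold, and timings are invented): when an attempt runs past a threshold, the scheduler launches a backup copy on another worker and takes whichever attempt finishes first.

```python
# Each "worker" here is just a function mapping a task to the time it takes.
def run_with_backup(task, workers, slow_threshold=1.0):
    primary_time = workers[0](task)        # time taken by the first attempt
    if primary_time <= slow_threshold:
        return primary_time
    backup_time = workers[1](task)         # resend the task to another machine
    return min(primary_time, backup_time)  # whichever attempt finishes first

fast_worker = lambda task: 0.2
slow_worker = lambda task: 30.0            # e.g. the cache-disabled BIOS-bug machine
finish_time = run_with_backup("map-task-7", [slow_worker, fast_worker])
# finish_time == 0.2: the straggler no longer delays the whole job
```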
MapReduce lets you run arbitrary jobs across thousands of machines at once, and all the authors of the article seem to be able to see is that it's not as database-oriented as a real database.
steveha
Indexing is useless here. (Score:5, Insightful)
This works well if you can create such a slice: a piece of data you will match against. It becomes increasingly unwieldy if there are many ways to match the data, since multiple columns mean multiple indices. And if you remove columns entirely, making records just long strings, and start matching random words in the record, an index becomes useless: hashes become bigger than the chunks of data they match against, indexing all possible combinations of words you can match against results in an index bigger than the database, and generally... bummer. Indexes don't work well against freestyle data searched in arbitrary ways.
Imagine a database whose main column is VARCHAR(255), using close to its full length; then search it with a lot of LIKE and AND clauses, picking various short pieces out of that column, with the database terabytes big. Try to invent a way to index that.
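The query shape being described can be sketched as a brute-force scan (a toy illustration; the rows and fragments are invented):

```python
# Arbitrary LIKE '%...%' fragments ANDed together over a long free-text
# column: no index over all possible substrings is feasible, so scanning
# every row is the natural (and only general) plan.
def matches(row, fragments):
    return all(frag in row for frag in fragments)

rows = [
    "error while parsing config on host alpha",
    "user login ok on host beta",
]
hits = [r for r in rows if matches(r, ["host", "error"])]
# hits == [rows[0]]: only the first row contains both fragments
```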
Google = statistical database? (Score:3, Insightful)
The thing is, if Google uses this to create their index-like structure of the internet for their search engine, and it is not exactly like an RDBMS, well, so what? The MapReduce thing seems to be targeted at large sets of data and semi-accurate data mining, not exact results. No one really cares if there are 3,000,000,000 sites or 3,000,000,002 sites with Linux in them somewhere.
Comparing an RDBMS to MapReduce is like comparing a math function to a paper graph of that function. The first gives you exact results for all data in its domain. The second gives quick, pain-free, semi-accurate results for some parts of the domain.
Now, I will not be using MapReduce but then I don't see why Google should not. It is their business.
Re:may be missing the (data)points (Score:4, Insightful)
1) No indexing.
Which means
2) Certain types of constraints probably don't work (such as UNIQUE constraints)
Which also means
3) Referential integrity checking and other things don't work.
This leads to the conclusion that the idea is good for certain types of data-intensive but not integrity-intensive applications (think Ruby on Rails-type apps) but *not* good for anything Edgar Codd had in mind....
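That said, even without indexes a batch pass can still *detect* (rather than enforce) a UNIQUE violation after the fact. A toy single-process sketch of the idea, with the names invented:

```python
from collections import defaultdict

# Map each record to its would-be-unique key, then reduce by counting:
# any key seen more than once is a violation of the constraint.
def find_duplicates(records, key_fn):
    counts = defaultdict(int)
    for rec in records:            # map: emit (key, 1) for each record
        counts[key_fn(rec)] += 1   # shuffle + reduce: sum the counts per key
    return [k for k, n in counts.items() if n > 1]

users = [{"email": "a@x.com"}, {"email": "b@x.com"}, {"email": "a@x.com"}]
dups = find_duplicates(users, lambda u: u["email"])
# dups == ["a@x.com"]
```

This is weaker than a real constraint (it catches violations only when the job runs, not at insert time), which is the trade-off the comment is pointing at.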
Re:Databases? WTF? (Score:3, Insightful)
1) The fact that MapReduce is being used for specific low-level applications does not make it intrinsically different from, or incomparable to, an RDBMS, although the comparison may not be worthwhile.
2) The more MapReduce gets used for things other than search engine calculations, the more it becomes worthwhile to do the comparison.
Re:Databases? WTF? (Score:3, Insightful)
Re:Databases? WTF? (Score:3, Insightful)
I think TFA is being silly in trying to compare MapReduce to DBMSs. Yes, of course MapReduce compares unfavorably, because it isn't a DBMS. The comment that MapReduce is "A sub-optimal implementation, in that it uses brute force instead of indexing" is particularly telling: MapReduce is not intended for situations where you would want indexing, and never was. In general, the whole article is trying to judge MapReduce on points that are completely irrelevant to what it was designed for and the way it is actually used.
Really, if MapReduce were a DBMS, then why did the creators of MapReduce also create BigTable? BigTable *is* meant to be like a database, although it omits a lot of features in favor of scalability. MapReduce and BigTable are used for completely different things. I think Jeff and Sanjay (creators of both MapReduce and BigTable) probably find it pretty amusing to see MapReduce evaluated as a DBMS.