News for nerds, stuff that matters

MapReduce — a Major Step Backwards?

Posted by ScuttleMonkey on Fri Jan 18, 2008 03:53 PM
from the weapons-of-map-reduction dept.
The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."

Related Stories

[+] Technology: Behind the Scenes At Google 196 comments
An anonymous reader writes "University of Washington TV presents "Behind the Scenes with Google." From the site: 'Search is one of the most important applications used on the internet and poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines. In this program, Jeff Dean of Google describes some of these challenges, discusses applications Google has developed, and highlights systems they've built, including GFS, a large-scale distributed file system, and MapReduce, a library for automatic parallelization and distribution of large-scale computation. He also shares some interesting observations derived from Google's web data.' "
[+] Ask Slashdot: Parallel Programming - What Systems Do You Prefer? 23 comments
atti_2410 asks: "As multi-core CPUs are finding their way into more and more computer systems, from servers to corporate desktops to home systems, parallel programming becomes an issue for application programmers outside the High Performance Computing community. Many Parallel Programming Systems have been developed in the past, yet little is known about which are in practical use or even known to a wider audience, and which are just developed, released and forgotten. Or which problems bother the actual users of parallel programming systems the most. There is not even data on the platforms that parallel programs are developed for. To shed some light on the subject, I have set up a short survey on the topic, and I would also very much like to hear your opinion here on Slashdot!" What Parallel Computing systems and software have you used that really made an impression on you, both good and bad?
[+] Is Parallel Programming Just Too Hard? 680 comments
pcause writes "There has been a lot of talk recently about the need for programmers to shift paradigms and begin building more parallel applications and systems. The need to do this and the hardware and systems to support it have been around for a while, but we haven't seen a lot of progress. The article says that gaming systems have made progress, but MMOGs are typically years late and I'll bet part of the problem is trying to be more parallel/distributed. Since this discussion has been going on for over three decades with little progress in terms of widespread change, one has to ask: is parallel programming just too difficult for most programmers? Are the tools inadequate or perhaps is it that it is very difficult to think about parallel systems? Maybe it is a fundamental human limit. Will we really see progress in the next 10 years that matches the progress of the silicon?"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by yagu (721525) * <yayagu@@@gmail...com> on Friday January 18 2008, @03:56PM (#22100086) Journal

    I don't know why this article is so harshly critical of MapReduce. They base their critique and criticism on the following five tenets, which they further elaborate in detail in the article:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications
    2. A sub-optimal implementation, in that it uses brute force instead of indexing
    3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
    4. Missing most of the features that are routinely included in current DBMS
    5. Incompatible with all of the tools DBMS users have come to depend on

    If you take the time to read the article you'll find they use axiomatic arguments with lemmas like: "schemas are good", and "Separation of the schema from the application is good", etc. First, they make the assumption that these points are relevant and germane to MapReduce. But, they mostly aren't.

    Also taking the five tenets listed, here are my observations:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications

      they don't offer any proof, merely their view... However, the fact that Google used this technique to re-generate their entire internet index leads me to believe that if this were indeed a giant step backward, we must have been pretty darned evolved to step "back" into such a backwards approach

    2. A sub-optimal implementation, in that it uses brute force instead of indexing

      Not sure why brute force is such a poor choice, especially given what this technique is used for. From wikipedia:

      MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses.
    3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

      Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.

      I would also argue their bold approach to applying these techniques in such a massively aggregated architecture is at least a little novel, and based on results of how Google has used it, effective.

    4. Missing most of the features that are routinely included in current DBMS

      They're mistakenly assuming this is for database programming

    5. Incompatible with all of the tools DBMS users have come to depend on

      See previous bullet

    Are these guys just trying to stake a reputation based on being critical of Google?

    • by CajunArson (465943) on Friday January 18 2008, @04:02PM (#22100220) Journal
      > Are these guys just trying to stake a reputation based on being critical of Google?

      I tend to agree. I could probably write a nice article about how map-reduce would be a terrible system to use in making a 3D game. Could an article like that be technically true? Sure. Would it be anything more than a logical non sequitur? Not unless Google all of a sudden came out and claimed MapReduce is the new platform for all 3D game development (not likely).
      • by abscondment (672321) on Friday January 18 2008, @06:55PM (#22102862) Homepage

        It's also terrible for painting.

        1. Since the bucket doesn't enforce any schema, you never know what color paint the bucket might hold. Heck, it could even be full of honey. You just can't know, and not being able to know is, well, like programming assembly.
        2. Buckets aren't indexed, so you're not able to find that one ounce of paint that you really want to use next. You've got to split up all of the paint into ounce cups each time and examine every cup. It's very intensive, and really slows down your painting. If you stored the paint in a B-tree of ounce cups, your search for the right ounce of paint would be much more efficient.
        3. Painting is so old. I mean, get with the program. Gold plate your house, or something newer (since newer is always better!). In fact, decades of research into titanium has determined that it'll hold up better to the elements, anyway, so you should just get titanium siding instead of painting.
        4. Painting is an incomplete process. What if you want a window? Yeah, you can't paint a window for yourself, now can you? Did you need a jacuzzi? A fireplace? A new car? Sorry! Painting doesn't support those features yet. You'd better not paint at all if you want those things.
        5. Painting, believe it or not, is incompatible with tennis. There's no racket, there's no court, and there's no ball. There's not even a net (unless you're working from a really tall building, in which case you might fall and so a net is often used). I mean, you don't even need to paint with another person. It's so... incompatible.
    • by starwed (735423) on Friday January 18 2008, @04:08PM (#22100336)
      I thought that this blog post [typicalprogrammer.com] was a pretty good sounding critique of the article in question. (Of course, I don't know a damn thing about DB, relational or otherwise. . )
    • Re: (Score:3, Insightful)

      > If you take the time to read the article you'll find they use axiomatic arguments with lemmas like: "schemas are good", and "Separation of the schema from the application is good", etc.

      Actually, it says: "The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968. Schemas are good. Separation of the schema from the application is good. High-level access languages are good." Way to conveniently drop important contextual
    • by Anonymous Coward on Friday January 18 2008, @04:11PM (#22100414)
      You missed points 6 through 9:

      6. New things are scary.
      7. Google is on their lawn.
      8. Matlock is the best television show ever.
    • Are these guys just trying to stake a reputation based on being critical of Google?

      I don't know much about database theory, but do know that Michael Stonebraker already has a reputation.

    • by samkass (174571) on Friday January 18 2008, @04:45PM (#22101040) Homepage Journal
      Speaking as someone who works for a company whose product uses a database that is neither relational nor object-oriented, I can say from experience that folks who have devoted a significant amount of their lives to mastering that methodology see anything else as a threat. There are definitely use-cases for non-relational databases-- they're used at both Google and Amazon, as well as many other places. You can either burn significant effort defending your decision to go non-relational, or you can move on and ignore these folks and produce great products. The problem is that sometimes they make good points (especially about some aspects of indexing), but it's almost always lost in the "but... but... but... you're not relational!" argument.
    • by DragonWriter (970822) on Friday January 18 2008, @04:46PM (#22101060)

      I don't know why this article is so harshly critical of MapReduce.


      The primary ground for complaint seems to be "this isn't the way we do things in the database world". Each of the complaints (except #3) boils down to this (#1: The database community had arguments a few decades back and developed, at the time, a set of conventions; MapReduce doesn't follow them and is, therefore, bad; #2: All databases use one of two kinds of indexes to accelerate data access; MapReduce doesn't and is, therefore, bad; #3: Databases do something like MapReduce, so MapReduce isn't necessary; #4: Modern databases tend to offer a variety of support utilities and features that MapReduce doesn't, so MapReduce is bad; #5: MapReduce isn't out-of-the-box compatible with existing tools designed to work with existing databases and is, therefore, bad.)

      And it's from The Database Column, a blog that from its own "About" page is comprised of experts from the database industry.

      I suspect part of the reason they are harshly critical is that this is a technology whose adoption and use in large, data-centric tasks is (regardless of efficiency) a threat to the market value of the skills in which they've invested years and $$ developing expertise.

      At the end, they note (as an afterthought) that they recognize that MapReduce is an underlying approach, and that there are projects ongoing to build DBMSs on top of MapReduce, a fact which, if considered for more than a second, explodes all of their criticism, which is entirely premised on the idea that MapReduce is intended as a general-purpose replacement for existing DBMSs, rather than a lower-level technology which is currently used stand-alone for applications for which current RDBMSs do not provide adequate performance (regardless of their other features), and on which DBMS implementations (with all the features they complain about MapReduce lacking) might, in the future, be built.
    • Re: (Score:3, Insightful)

      Map/Reduce is a very common operation in parallel processing. From my very quick look, it does seem as if the authors are right -- it looks like a quick and dirty implementation of a common operation, and not a "paradigm shift" in the slightest.
    • Did an assignment on map reduce some time ago. While I wasn't really impressed with it as a "Database", they did some really cool stuff with distributing the calculations. I did however note back then that it wasn't really useful for the general industry, but it was still a very nice piece of software.
    • by SharpFang (651121) on Friday January 18 2008, @05:18PM (#22101570) Homepage Journal
      Indexing works by picking a small slice of the data you have (as a list of hashes), and changing it into a much smaller table mapping the data onto a group of records matching it. The index is smaller and conforms to a certain strict standard, so it's very fast to brute force. Then as you get the list of indices, you brute force them, and this way you get the record.

      This works well if you can create such a slice - a piece of data you will match against. It becomes increasingly unwieldy if there are many ways to match the data - multiple columns mean multiple indices. And then if you remove columns entirely, making records just long strings, and start matching random words in the record, the index becomes useless - hashes become bigger than the chunks of data they match against, indexing all possible combinations of words you can match against results in an index bigger than the database, and generally... bummer. An index doesn't work well against freestyle data searchable in random form.

      Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.
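
To make the point above concrete, here is a rough sketch on a toy in-memory "table" of strings. The records and the helper names `build_hash_index` and `scan_like` are made up for illustration, and the hash index here is only an exact-match index (real databases also offer full-text and similar index types, which this sketch does not model). The exact-match lookup is a single dictionary access, while a `LIKE '%a%' AND LIKE '%b%'` style query on free text has to visit every record.

```python
from collections import defaultdict

# Hypothetical toy "table": each record is one free-form string column.
records = [
    "cheap cluster of unreliable machines",
    "relational schema with indexes",
    "brute force scan over a cluster",
]

def build_hash_index(rows):
    """Exact-match index: whole column value -> list of row ids."""
    index = defaultdict(list)
    for row_id, value in enumerate(rows):
        index[value].append(row_id)
    return index

def scan_like(rows, *fragments):
    """Brute-force equivalent of LIKE '%a%' AND LIKE '%b%': visits every row."""
    return [i for i, value in enumerate(rows)
            if all(fragment in value for fragment in fragments)]

index = build_hash_index(records)
exact_hits = index.get("relational schema with indexes", [])  # one dict lookup
fragment_hits = scan_like(records, "cluster", "brute")        # full scan
print(exact_hits, fragment_hits)
```

The scan is trivially splittable: each machine can run `scan_like` on its own slice of the rows, which is roughly the trade the comment describes.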
    • I thought Google searches weren't exact. You know, they were more statistical in nature. The entire algorithm is probably not based on absolute numbers (guessing, but otherwise it would not make sense).

      The thing is if Google uses this to create their index-like structure of the internet for their search engine, and it is not exactly like a RDBMS, well, so what? The MapReduce thing seems to be targeted at large sets of data and semi-accurate data mining, not exact results. No one really cares if there are 3,000
    • by mishabear (1222844) on Friday January 18 2008, @05:42PM (#22101948)
      > I don't know why this article is so harshly critical of MapReduce.
      > Are these guys just trying to stake a reputation based on being critical of Google?

      Um... yes?

      The Database Column is being coy about being a corporate blog for Vertica, a high performance database product, but in fact it is. Vertica is a commercial implementation of C-Store and was founded by Michael Stonebraker, the most prominent proponent of column based databases (get it? the database column). So yes, they have a very good reason to be hostile to Google.

      http://www.vertica.com/company/leadership [vertica.com]
      http://en.wikipedia.org/wiki/C-Store [wikipedia.org]
      http://en.wikipedia.org/wiki/Michael_Stonebraker [wikipedia.org]
      http://www.databasecolumn.com/2007/09/contributors.html [databasecolumn.com]
    • Hmmm.... ISTM that the basic critiques come down to:

      1) No indexing.

      Which means

      2) Certain types of constraints probably don't work (such as UNIQUE constraints)

      Which also means

      3) Referential integrity checking and other things don't work.

      This leads to the conclusion that the idea is good for certain types of data-intensive but not integrity-intensive applications (think Ruby on Rails-type apps) but *not* good for anything Edgar Codd had in mind....
    • Re: (Score:3, Interesting)

      What bothers me the most is how much hype it gets. I work for a company that has had a "MapReduce" implementation (used internally) for as long as Google has, and we're not getting drooled over by the tech press. I'm sure tons of companies that have had to solve similar problems have already made this tool, even though the languages and syntax involved might change between implementations, it's nothing all that great.
    • by eh2o (471262) on Saturday January 19 2008, @10:30PM (#22114458)
      MapReduce falls under the category of embarrassingly parallel algorithms. It isn't a step backwards, it just has a limited scope.

      Google's contribution (and yes it does predate them by a long time) is to point out that MapReduce is a bit more than an algorithm -- it is a design pattern. Design patterns help us write clean code by establishing a consistent vocabulary (e.g. actors, containers, operators, etc), and furthermore are important insofar as they make algorithms accessible to programmers. Right now we badly need more well-defined design patterns in the area of parallel computing as this is essentially the future of programming.
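
As a minimal sketch of the pattern being discussed, the three steps (map, group by key, reduce) fit in a few lines. This is a single-process illustration only, not Google's implementation; the names `map_reduce` and `count_words` are invented for the example.

```python
from collections import defaultdict
from functools import reduce

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-process sketch of the pattern: map, group by key, reduce."""
    groups = defaultdict(list)
    for item in inputs:                      # map phase: items are independent
        for key, value in map_fn(item):
            groups[key].append(value)        # "shuffle": group values by key
    return {key: reduce(reduce_fn, values)   # reduce phase: one key at a time
            for key, values in groups.items()}

def count_words(line):
    """Map function: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]

counts = map_reduce(["a b a", "b c"], count_words, lambda x, y: x + y)
print(counts)
```

Because each call to the map function touches only its own input, and each key is reduced independently, both loops could in principle be spread over many machines without changing the user-supplied functions.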
  • It's a technical step backwards, they're doing it all wrong, experts say you should do it this other way....

    And watch. It'll be massively successful because it works.
  • Blink blink (Score:4, Funny)

    by Thelasko (1196535) on Friday January 18 2008, @04:02PM (#22100226) Journal
    Once I saw the word paradigm in the summary I just glazed over like I do whenever our CEO gives a speech.
  • Databases? WTF? (Score:5, Insightful)

    by mrchaotica (681592) * on Friday January 18 2008, @04:02PM (#22100228)

    Since when did MapReduce have anything to do with databases? It's actually about parallel computations, which are entirely different.

    • Those newfangled [couchdb.org] document [google.com] databases [thefreedictionary.com] utilize MapReduce to gather records. I'm guessing that's what the article is about.
    • Since when did MapReduce have anything to do with databases?

      MapReduce is a tool, one of whose principal applications is conducting queries on large bodies of data consisting of records of similar structure. It, therefore, competes with traditional DBMSs to a degree.

      Now, (largely because of the limitations the authors note), it generally is only used currently for the kind of applications where setting up a traditional RDBMS to handle them would be impractical: Google developed their implementation of MapRed

        • Um, nope. You're not thinking abstractly enough, that is, you're not thinking like a computer scientist. MapReduce is a (rather obvious) framework for processing large lists of (key,data) pairs in parallel, therefore it can be compared with other such systems. Both MapReduce and RDBMSes basically compute a function on a set of (key,data) pairs.

          1) The fact that MapReduce is being used for specific low level applications does not make it intrinsically different or uncomparable to an RDBMS, although it may n

          • Re: (Score:3, Insightful)

            I guess if you consider anything that involves (key, value) pairs to be basically an RDBMS, you might as well classify almost everything as an RDBMS, which seems to make the term pointless. Why write software anymore when we can just use a database? The reality is that I would use MapReduce and MySQL to solve very different problems.

            I think TFA is being silly in trying to compare MapReduce to DBMSs. Yes, of course MapReduce compares unfavorably, because it isn't a DBMS. The comment that MapReduce is "A
      • More and more systems use databases simply as a data archive, not for primary work.
        I wouldn't count on even that being a long term trend. It takes time for people to come up with things to do with a database. Especially really big databases. Wait another ten years, and people will complain that their dumb data archives are not RDBMSes.
  • Money, meet mouth (Score:4, Insightful)

    by tietokone-olmi (26595) on Friday January 18 2008, @04:03PM (#22100242)
    Perhaps the traditional RDBMS experts will return when they can scale their paradigms to datasets that are measured in the tens of terabytes and stored on thousands of computers. Following the airplane rule the solution needs to be able to withstand a crash in a bunch of those hosts without coming unglued.

    Now, this is not to say that a more sophisticated approach wouldn't work. It's just that when you have thousands of boxes in a few ethernet segments, communication overhead becomes really quite large, so large in fact that whatever communication can be avoided by using brute-force computation will usually be worth it. Consider that from what I've heard, at Google these thousands of boxes are mostly containers for RAM modules so there's rather a lot of computation power per gigabyte available to throw away with a brute force system.

    Also, I would like to point out that map/reduce is demonstrated to work. Apparently quite well too. Certainly better than any hypothetical "better" massively parallel RDBMS available in a production quality implementation today.
    • Re: (Score:3, Interesting)

      Agreed.

      I recently read somewhere (if only I could recall the link...) that on average Google's MapReduce jobs process something in the order of 100 GB/second, 24/7/365

      I've got nothing against RDBMS... but how can you be critical about a tool that scales and performs so well? It's just a matter of selecting and using the right tool for the job.

  • ...entry says;

    "You seem to not have noticed that mapreduce is not a DBMS."

    Exactly. These are the same sort of criticisms that you hear around memcached [danga.com] - the feature set is smaller, etc - and they make the same mistake. It's not a DBMS, and it's not supposed to be. But it does what it does quite well nonetheless!
  • in that it uses brute force instead of indexing
     
    Isn't the overhead of a distributed index usually not worth the bother? This scheme sounds similar to the way Teradata handles its distribution and it manages to get a lot done with hardly any secondary indexes. I think the thinking in the article indicates standalone database server box thinking.
  • by dazedNconfuzed (154242) on Friday January 18 2008, @04:10PM (#22100388)
    it represents a specific implementation of well known techniques developed nearly 25 years ago

    There are many classic/old techniques which are only now being used - and very successfully - precisely because the hardware simply wasn't there. A recent /. post told of ray-tracing being soon used for real-time 3D gaming, and how it beats the socks off "rasterized" methods when a critical mass of polygons is involved; the techniques were well known and developed nearly 25 years ago, but only now do we have the CPU horsepower and vast fast memory capacities available for those "old" techniques to really shine. Likewise "old" "brute force" database techniques: they may not be clever and efficient like what we've been using for highly stable processing of relatively small-to-medium databases, but they work marvelously well when involving big unreliable networks of processors working on vast somewhat-incoherent databases - systems where modern shiny techniques just crumble and can't handle the scaling.

    Sometimes the "old" methods are best - you just need the horsepower to pull it off. Clever improvements only scale so long.
  • This article was written from the perspective that map-reduce based architectures are in competition with common relational database architecture. They're not.

    Certainly if you were to implement map-reduce within the confines of the relational database world, there are implementation methodologies that would need to be taken to make it easier for the RDBMS developer to work with the storage and querying mechanisms.

    The article implies that map-reduce is bad because it doesn't place restrictions common to the dat
  • by abes (82351) on Friday January 18 2008, @04:13PM (#22100462) Homepage
    Well, INDBE, but MapReduce seems like a pretty cool idea (even if it is old [which in my books does not equate to bad]). A similar argument could be made against SQL -- it's not appropriate to all solutions. It's used for most nowadays, in part because it's the simplest to use, but that doesn't make it necessarily better. It (of course) depends on what data you want to represent.

    Even more importantly, you can create schemas with MapReduce by how you write your Map/Reduce functions. This is a matter of the data/function exchange (all data can be represented as a function, likewise all functions can be represented as data). I admit ignorance of how this MapReduce system works, but I would be surprised if you couldn't get a relational database back out.

    The advantage you get with MapReduce is that you aren't necessarily tied to a single representation of data. Especially for companies like Google, which may want to create dynamic groups of data, this could be a big win. Again, this is all speculative, as I have very little experience with these systems.
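
One way to read the "schema in the Map/Reduce functions" idea, sketched under assumptions: the record type `PageHit`, the log format, and the function `parse_hit` are all hypothetical, chosen only to show that the structure of the data can live in the map function rather than in the storage layer.

```python
from typing import NamedTuple

class PageHit(NamedTuple):
    """Hypothetical record type: the 'schema' lives in application code."""
    url: str
    status: int
    size_bytes: int

def parse_hit(line):
    """Map function: turn one raw log line into a typed (key, record) pair."""
    url, status, size = line.split()
    hit = PageHit(url, int(status), int(size))
    return hit.url, hit

raw = ["/index.html 200 512", "/missing 404 0", "/index.html 200 498"]
pairs = [parse_hit(line) for line in raw]

# Reduce: total bytes per URL, roughly SELECT url, SUM(size_bytes) ... GROUP BY url.
totals = {}
for url, hit in pairs:
    totals[url] = totals.get(url, 0) + hit.size_bytes
print(totals)
```

A different job could parse the same raw lines with a different map function and get a different "view" of the data, which is the flexibility the comment speculates about. The cost is that nothing outside the code enforces the structure.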
  • by Anonymous Coward on Friday January 18 2008, @04:13PM (#22100478)
    The reaction seems straightforward enough. The MapReduce paradigm has proved to be very effective for a company that lives and breathes scalability, while it apparently ignores a whole bunch of database work that's been going on in academia. The fact that industry was able to produce something so effective without making use of all this knowledge base at least implicitly undercuts the importance of that work, and is thus threatening to the community which produced that work. Is it any surprise that the researchers whose work was completely side-stepped by this approach aren't happy with the current situation?
  • A sub-optimal implementation, in that it uses brute force instead of indexing

    As though these are the exclusive choices. TFA goes on to complain about implementing 25 year old ideas, though they are actually rather older than that--they just didn't strike the RDB types until the eighties. They proceed to insist that the system cannot scale. Arguing google's scalability is like arguing gravity.

  • FTFA (Score:5, Insightful)

    by smcdow (114828) on Friday January 18 2008, @04:19PM (#22100598) Homepage

    Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.


    That's a joke, right?

    I think Google's already taken care of all the experimental evaluations you'd need.

  • by 644bd346996 (1012333) on Friday January 18 2008, @04:21PM (#22100624)
    If you are starting with a good database, MapReduce is definitely a step backwards. But that isn't what MapReduce is designed to replace. In reality, MapReduce replaces the for loop [joelonsoftware.com], and viewed from that perspective, it is a major step forward. Most languages (C, C++, Java, etc.) define the for loop and other iteration facilities in such a way that the compiler can seldom safely parallelize the loop. MapReduce gives the programmer an easy way to convert probably 90% of their for loops into highly scalable code.
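
A small sketch of the "replaces the for loop" argument, in Python rather than the languages named above. The function `square` and the input list are invented for the example, and a thread pool is used only as a stand-in for "many workers": in CPython it will not actually speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def square(n):
    """Pure function: no shared state, so calls can run in any order."""
    return n * n

numbers = list(range(10))

# Sequential for loop: each iteration updates a shared accumulator.
total_loop = 0
for n in numbers:
    total_loop += square(n)

# Same computation as map + reduce; the map step is handed to a worker pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, numbers))
total_mapreduce = reduce(operator.add, squares, 0)

print(total_loop, total_mapreduce)
```

The loop version ties every iteration to the shared `total_loop` variable. The map version makes the independence of the iterations explicit, so the runtime is free to distribute them.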
  • "We spent all these years making these complex, elegant algorithms--see how intricate this wonderful indexing algorithm is?--and then they solve things by simply throwing cheap hardware at it. It's not *fair!*"
  • by brundlefly (189430) on Friday January 18 2008, @04:25PM (#22100720)
    The point of MapReduce is that It Works. Cheaply. Reliably. It's not a solution for the Cathedral, it's one for the Bazaar.

    Comparing it to a DBMS on fanciness is pointless, because the DBMS solution fails where MapReduce succeeds.
  • The first thing that came to my mind when I read that was the evolution of a programmer [nus.edu.sg]: when the evolving "program" started to get thin in lines again, that didn't mean it was a step backwards.
  • we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications.
    So much hype that I never even heard of it before their complaint hit Slashdot...
  • Vertica (Score:4, Interesting)

    by QuietLagoon (813062) on Friday January 18 2008, @04:42PM (#22101010)
    The column was copyright by Vertica [vertica.com]. Wouldn't they be concerned about the type of competition that MapReduce presents?
  • I gather this is a publication for DBAs. It seems they are worried about their jobs more than anything. With the map-reduce-style databases there isn't a need for any kind of special database expert. The business logic all happens in the application. There is no need for tuning indexes. You don't even need to define a schema. When things get slow any monkey can drop in another computer and you're back up to speed and ready to go.

    Traditional RDBMSes have their place, but we're going to see a lot more applica
  • "...I taped twenty cents to my transmission
    So I could shift my pair 'a dimes..."
  • by steveha (103154) on Friday January 18 2008, @04:48PM (#22101098) Homepage
    I read through the whole article, and was just bemused. According to the article, MapReduce isn't as good as a real database at doing the sorts of things real databases do well. Um, okay, I guess, but MapReduce can do quite a lot of other things that they seem to have missed.

    Also, I had a major WTF moment when I read this:

    Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.

    Empirical evidence to date suggests that MapReduce scales insanely well. Exhibit A: Google, which uses MapReduce running on literally thousands of servers at a time to chew through literally hundreds of terabytes of data. (Google uses MapReduce to index the entire World Wide Web!)

    This in turn suggests that the authors of TFA are firmly ensconced in the ivory tower.

    They complained that brute-force is slower than indexed searches. Well, nothing about MapReduce rules out the use of indexes; and for common problems, Google can add indexes as desired. (Google uses MapReduce to build their index to the Web in the first place.) And because Google adds servers by the rackful, they have quite a lot of CPU power just waiting to be used. Brute force might not be slower if you split it across thousands of servers!

    Likewise, they complain that one can't use standard database report-generating tools with MapReduce; but if the Reduce tasks insert their results into a standard database, one could then use any standard report-generating tools.
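
The idea of having the Reduce step write into an ordinary database can be sketched as follows. This is an illustration under assumptions: the log tuples, the table name `error_counts`, and the use of an in-memory SQLite database are all stand-ins, not anything described in the article.

```python
import sqlite3
from collections import Counter

# Hypothetical reduce output: error counts per host, as (key, value) pairs.
log_lines = [("host1", "ERROR"), ("host2", "OK"), ("host1", "ERROR"), ("host3", "ERROR")]
reduced = Counter(host for host, status in log_lines if status == "ERROR")

conn = sqlite3.connect(":memory:")   # stand-in for "a standard database"
conn.execute("CREATE TABLE error_counts (host TEXT PRIMARY KEY, errors INTEGER)")
conn.executemany("INSERT INTO error_counts VALUES (?, ?)", reduced.items())
conn.commit()

# From here on, ordinary SQL (and SQL-based reporting tools) can take over.
rows = conn.execute(
    "SELECT host, errors FROM error_counts ORDER BY errors DESC, host"
).fetchall()
print(rows)
```

The heavy lifting happens before the database is involved. Only the much smaller reduced result needs a schema and an index.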

    MapReduce lets Google folks do crazy one-off jobs like ask every single server they own to check through their system logs for a particular error, and if it's found, return a bunch of config files and log files. Even if you had some sort of distributed database that could run on thousands of machines, any of which might die at any moment, and if you planned ahead and set the machines to copy their system logs into the database, I don't see how a database would be better for that task. That's just a single task I just invented as an example; there are many others, and MapReduce can do them all.

    And one of the coolest things about MapReduce is how well it copes with failure. Inevitably some servers will respond very slowly, or will die and not respond; the MapReduce scheduler detects this and sends the Map tasks out to other servers so the job still finishes quickly. And Google keeps statistics on how often a computer is slow. At a lecture, I heard a Google guy explain how there was a BIOS bug that made one server in 50 disable some cache memory, thus greatly slowing down server performance; the MapReduce statistics helped them notice they had a problem, and isolate which computers had the problem.
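
A toy simulation of the rescheduling behaviour described above. It is sequential and only models outright failures (not slow machines), and every name in it (`run_with_retries`, `healthy`, `dead`) is hypothetical; the real scheduler is not described in enough detail here to reproduce.

```python
def run_with_retries(tasks, workers, map_fn):
    """Run each task on the first worker that succeeds; skip failed workers.

    `workers` maps a worker name to a callable that runs map_fn on a task
    and may raise an exception to simulate a dead or broken machine.
    """
    results, failures = {}, {}
    for task in tasks:
        for name, worker in workers.items():
            try:
                results[task] = worker(map_fn, task)
                break                      # task done, move to the next one
            except RuntimeError:
                failures[name] = failures.get(name, 0) + 1  # keep statistics
        else:
            raise RuntimeError(f"no worker could run task {task!r}")
    return results, failures

def healthy(map_fn, task):
    return map_fn(task)

def dead(map_fn, task):
    raise RuntimeError("worker did not respond")

results, failures = run_with_retries(
    tasks=[1, 2, 3],
    workers={"w1": dead, "w2": healthy},
    map_fn=lambda n: n * 10,
)
print(results, failures)
```

Re-running a task elsewhere is safe only because map tasks have no side effects. The `failures` dictionary plays the role of the per-machine statistics mentioned in the lecture anecdote.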

    MapReduce lets you run arbitrary jobs across thousands of machines at once, and all the authors of the article seem to be able to see is that it's not as database-oriented as a real database.

    steveha
  • by DragonWriter (970822) on Friday January 18 2008, @05:32PM (#22101812)
    1) They don't look like hammers,
    2) They don't work like hammers,
    3) You can already drive in a screw with a hammer,
    4) They aren't good at ripping out nails, and
    5) They aren't good at driving nails.

    Brought to you by The Hammer Column, a blog written by experts in the hammer industry, and launched by Hammertron, makers of a revolutionary new kind of hammer [vertica.com].

    • a search routine that would attempt to pick 5 records at random from a database containing potentially a billion records
      Yeah, I'd say an index wouldn't be much help in that situation. A monkey with a keyboard could probably handle it, though.
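
For what it's worth, picking a handful of records at random needs neither an index nor the record count. The sketch below uses reservoir sampling (Algorithm R), which is a technique swapped in for illustration and not something the quoted routine is known to use; the generator of fake record names is also made up.

```python
import random

def reservoir_sample(records, k, rng=random):
    """Pick k records uniformly at random in one pass (Algorithm R)."""
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < k:
                sample[j] = record         # replace with probability k/(i+1)
    return sample

# A generator stands in for a table far too large to hold in memory.
picked = reservoir_sample((f"record-{n}" for n in range(100_000)), k=5,
                          rng=random.Random(42))
print(picked)
```

It is a single brute-force pass over the data with constant memory, which is about as far from an indexed lookup as it gets.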