Google Databases Technology

Google Caffeine Drops MapReduce, Adds "Colossus"

An anonymous reader writes "With its new Caffeine search indexing system, Google has moved away from its MapReduce distributed number crunching platform in favor of a setup that mirrors database programming. The index is stored in Google's BigTable distributed database, and Caffeine allows for incremental changes to the database itself. The system also uses an update to the Google File System codenamed 'Colossus.'"

  • Sounds inefficient (Score:5, Interesting)

    by martin-boundary ( 547041 ) on Sunday September 12, 2010 @12:01AM (#33550272)
    This sounds like it's going to be highly inefficient for nonlocal calculations, or am I missing something? Basically, if the calculation at some database entry is going to require inputs from arbitrarily many other database entries which could reside anywhere in the database, then the computation cost per entry will be huge compared to a batch system.
  • by iONiUM ( 530420 ) on Sunday September 12, 2010 @12:09AM (#33550312) Journal

    I read TFA (I know, that's crazy). They don't come right out and say it, but I believe what they did is put a MapReduce-type system (MapReduce splits the work into subtasks for faster calculation) on database triggers. So basically this new system spreads a database across their file system, across many computers, and allows incremental updates that, when they occur, trigger a MapReduce-type algorithm to crunch the new update.

    This way they get the best of both worlds. At least, I think that's what they're doing; otherwise their entire system would... stop working, since MapReduce is the whole reason they can parse such large amounts of information.
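    The contrast between a batch rebuild and a trigger-style incremental update can be sketched on a toy inverted index. This is only an illustration under stated assumptions: the function names `build_index` and `update_index`, the in-memory dictionaries, and the sample documents are all hypothetical, and nothing here reflects how Caffeine or BigTable is actually implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Batch approach: rebuild the whole inverted index from every document."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def update_index(index, docs, doc_id, new_text):
    """Incremental approach: touch only the entries affected by one changed document."""
    old_words = set(docs.get(doc_id, "").lower().split())
    new_words = set(new_text.lower().split())
    # Remove the document from words it no longer contains.
    for word in old_words - new_words:
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]
    # Add the document under words that are new to it.
    for word in new_words - old_words:
        index[word].add(doc_id)
    docs[doc_id] = new_text

# Hypothetical sample data.
docs = {"a": "google drops mapreduce", "b": "google adds colossus"}
index = build_index(docs)
update_index(index, docs, "a", "google keeps bigtable")
```

    The incremental path does work proportional to the size of the change, while the batch path rescans every document, which is the trade-off the comment above is guessing at.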

  • Re:I have no idea (Score:5, Interesting)

    by icebike ( 68054 ) on Sunday September 12, 2010 @01:22AM (#33550654)

    Follow the link to the original article over on The Register, where you will find a rather lucid explanation, far better than the summary above can provide.

    Short answer:

    The old method of building their search database was essentially a batch job: run it, wait, wait, wait a long time, then swap the results into the production servers.

    The new method is continuous updates into a gigantic database spread over their entire network.

    This is why things show up in Google days, sometimes weeks ahead of the other search engines. The other guys are still trying to clone Google's old method.

  • by kurokame ( 1764228 ) on Sunday September 12, 2010 @02:26AM (#33550850)

    Caffeine is incremental, whereas MapReduce is batch-based.

    In MapReduce, you run code against each item with each operation spread across N processors, then you reduce it using a second set of code. You have to wait for the first stage to finish before running the second stage. The second stage is itself broken up into a number of discrete operations and tends to be restricted to summing results of the first stage together, and the return profile of the overall result needs to be the same as that for a single reduce operation. This is really great for applications which can be broken up in this fashion, but there are disadvantages as well.
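    A minimal sketch of the two-stage pattern described above, using word counting as the workload. The names `map_phase`, `reduce_phase`, and `map_reduce` are hypothetical, and a local thread pool stands in for the N processors; this is not Google's MapReduce, only an illustration of the barrier between the stages.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(document):
    """First stage: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def reduce_phase(word, counts):
    """Second stage: sum the partial results for one key."""
    return word, sum(counts)

def map_reduce(documents, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every map task to finish, so it acts as the
        # barrier: no reduce work can start until the slowest map is done.
        mapped = list(pool.map(map_phase, documents))

        # Group the intermediate pairs by key before reducing.
        groups = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                groups[word].append(count)

        reduced = list(pool.map(lambda item: reduce_phase(*item), groups.items()))
    return dict(reduced)

# Hypothetical sample input.
result = map_reduce(["google drops mapreduce", "google adds colossus"])
```

    Because the reduce stage cannot begin until `mapped` is complete, a single slow map task delays the whole job.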

    MapReduce is a sequence of batch operations, and generally, Lipkovitz explains, you can't start your next phase of operations until you finish the first. It suffers from "stragglers," he says. If you want to build a system that's based on series of map-reduces, there's a certain probability that something will go wrong, and this gets larger as you increase the number of operations. "You can't do anything that takes a relatively short amount of time," Lipkovitz says, "so we got rid of it."

    The problem for Google is that the disadvantages scale. The fact that you have to wait for all operations from the first stage to finish and that you have to wait for the whole thing to run before you find out if something broke can have a very high cost at high item counts (noting that MapReduce typically runs against millions of items or more, so "high" is very high). With the present size, it's apparently more advantageous to get changes committed successfully the first time, even if MapReduce might be able to compute the result faster under ideal circumstances.
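    The scaling argument can be made concrete with a little arithmetic. Assuming, purely for illustration, that each operation fails independently with the same probability p, the chance that at least one of n operations fails is 1 - (1 - p)^n. The 1% failure rate below is a made-up figure, not a number from the article, and real failures are rarely independent.

```python
def chance_of_any_failure(per_step_failure, steps):
    """Probability that at least one of `steps` independent operations fails."""
    return 1.0 - (1.0 - per_step_failure) ** steps

# Hypothetical 1% failure rate per operation, at growing operation counts.
for steps in (1, 10, 100, 1000):
    print(steps, round(chance_of_any_failure(0.01, steps), 4))
```

    Even a small per-operation failure rate makes a clean run unlikely once the chain is long enough, which is the cost the comment describes.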

    For example, why do you use ECC memory in a server? Because you have a bloody lot of memory across a bloody lot of computers running a bloody lot of operations, and failures potentially have more serious consequences than they would in a program on someone's desktop. At higher scales, non-ideal circumstances are more common and have more serious consequences. So while they still use MapReduce for some functions where it's appropriate, it's no longer appropriate for the purpose of maintaining the search index. It's just gotten too big.

  • Re:I have no idea (Score:5, Interesting)

    by A Friendly Troll ( 1017492 ) on Sunday September 12, 2010 @05:18AM (#33551384)

    This is why things show up in Google days, sometimes weeks ahead of the other search engines.

    For a hands-on example of what icebike is saying, look here:

    http://www.google.com/search?q=%22This+is+why+things+show+up+in+Google+days%2C+sometimes+weeks+ahead+of+the+other+search+engines%22 [google.com]

    Actually, Google will index Slashdot comments in a matter of minutes.

  • Mod Offtopic, please (Score:3, Interesting)

    by Khyber ( 864651 ) <techkitsune@gmail.com> on Sunday September 12, 2010 @05:52AM (#33551500) Homepage Journal

    This is going to give my Camfrog name a new meaning, as I *LOVE* screwing around with file systems. Colossus Hunter, indeed!

  • Re:It is quick (Score:3, Interesting)

    by Surt ( 22457 ) on Sunday September 12, 2010 @10:43AM (#33552588) Homepage Journal

    I assume Google polls sites, polling faster every time it finds a change and slower every time it does not. Eventually the interval wobbles around the site's probable update rate. Otherwise they'd have to trust sites to call their API with updates, and that would let any search engine which DID employ a wobbly poll strategy beat them in results.
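    The polling strategy guessed at above could look something like the following sketch. Everything here is an assumption for illustration: the function name `next_interval`, the halving and doubling factor, and the one-minute and one-day bounds are invented, and nothing is known from the article about how Google's crawler actually schedules visits.

```python
def next_interval(current, changed, minimum=60.0, maximum=86400.0, factor=2.0):
    """Return the next polling interval in seconds.

    Poll sooner after a change was seen, later after none was seen,
    and keep the result within [minimum, maximum].
    """
    if changed:
        proposed = current / factor
    else:
        proposed = current * factor
    return max(minimum, min(maximum, proposed))

# Hypothetical history of crawl results for one site, starting at hourly polls.
interval = 3600.0
for changed in (True, True, False, True, False):
    interval = next_interval(interval, changed)
```

    With alternating results the interval bounces between two neighbouring values, which is the "wobble" around the site's real update rate.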
