
Open Source Solution Breaks World Sorting Records

Posted by Soulskill
from the out-of-sorts dept.
allenw writes "In a recent blog post, Yahoo's grid computing team announced that Apache Hadoop was used to break the current world sorting records in the annual GraySort contest. It topped the 'Gray' and 'Minute' sorts in the general purpose (Daytona) category. They sorted 1TB in 62 seconds, and 1PB in 16.25 hours. Apache Hadoop is the only open source software to ever win the competition. It also won the Terasort competition last year."
  • by eddy (18759) on Saturday May 16, 2009 @10:03AM (#27979115) Homepage Journal
    Probably why the second sentence in the article is "All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value."
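    A minimal sketch of that record layout, purely for illustration (the class and method names below are made up, not the benchmark's actual code):

        import java.util.Arrays;

        // One GraySort record: 100 bytes total, the first 10 bytes are the
        // key and the remaining 90 bytes are the value.
        public class SortRecord {
            static final int RECORD_LEN = 100;
            static final int KEY_LEN = 10;

            final byte[] key;
            final byte[] value;

            SortRecord(byte[] record) {
                if (record.length != RECORD_LEN) {
                    throw new IllegalArgumentException("expected a 100-byte record");
                }
                this.key = Arrays.copyOfRange(record, 0, KEY_LEN);
                this.value = Arrays.copyOfRange(record, KEY_LEN, RECORD_LEN);
            }

            // Records sort by unsigned byte-wise comparison of their keys.
            static int compareKeys(SortRecord a, SortRecord b) {
                for (int i = 0; i < KEY_LEN; i++) {
                    int d = (a.key[i] & 0xff) - (b.key[i] & 0xff);
                    if (d != 0) return d;
                }
                return 0;
            }
        }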
  • by Anonymous Coward on Saturday May 16, 2009 @10:29AM (#27979283)
    It's 0.20 but it's stable and production ready already. I use it with HBase [apache.org] and it scales awesomely.
  • by Celeste R (1002377) on Saturday May 16, 2009 @10:31AM (#27979297)
    According to a post on the Yahoo developer forums:

    The 2005 winner used only 80 cores and achieved it in 435 seconds. So with 800 cores, what the 2007 winner achieved is 297 seconds?

    It's not only the number of cores; what's important is the logic for using the parallel nodes properly for a particular task.

    Hadoop won with 1820 cores (910 nodes w/ 2 cores each) at 209 seconds.

    I'm all for better sorting algorithms, but eventually the cost of parallelizing something outweighs the gains: the 2005 winner's run cost about 80 × 435 ≈ 35,000 core-seconds, while Hadoop's cost about 1820 × 209 ≈ 380,000. That said, Hadoop's internal filesystem is designed to be redundant, which is an important feature whenever you're dealing with large amounts of data.

    Hadoop implements Google's MapReduce model, by the way, whereas the competition didn't use it. It's nice to see MapReduce getting this kind of public exposure (a rough sketch of how a sort maps onto it is at the end of this comment).

    While better sorting algorithms -do- matter, I have to say that maintenance and running costs also matter.

    I'd also like to see how a compatible C version of this software compares with the Java version. However, as I see it, the Java overhead seems fairly limited; sorting code is wonderfully repetitive, and I'd expect that it's already been optimized a fair amount.

    By the way, the number of nodes and the hardware in those nodes for this Hadoop cluster are -optimized- for this contest.
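    Here's the promised sketch of how a sort maps onto MapReduce. This is a toy job, not Yahoo's record-setting code, and it assumes the input is a SequenceFile of Text key/value pairs: map and reduce are both the identity function, so the framework's shuffle between the two phases does all the actual sorting by key.

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

        public class IdentitySort {
            public static void main(String[] args) throws Exception {
                Job job = new Job();
                job.setJarByClass(IdentitySort.class);
                job.setJobName("identity-sort");

                // The base Mapper and Reducer pass records through unchanged;
                // the shuffle between the two phases sorts them by key.
                job.setMapperClass(Mapper.class);
                job.setReducerClass(Reducer.class);

                job.setInputFormatClass(SequenceFileInputFormat.class);
                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    With one reducer the output is totally ordered; with many reducers, each reducer's output is sorted, and a range partitioner (which is what TeraSort adds) is needed to get a global order.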

  • Re:Overlords (Score:4, Informative)

    by rackserverdeals (1503561) on Saturday May 16, 2009 @10:40AM (#27979351) Homepage Journal

    I wouldn't be surprised if they came from Star Wars.

    Actually, it came from Google. Sorta.

    Apache Hadoop is an open-source implementation of MapReduce, the framework Google uses in its search engine. I believe the details came from a paper Google released on its implementation of MapReduce.

  • by Sangui5 (12317) on Saturday May 16, 2009 @10:53AM (#27979441)

    Google's sorting results from last year (link [blogspot.com]) are much faster; they did a petabyte in 362 minutes, which works out to about 2.8 TB/min. The minute sort didn't exist last year, but Google did 1TB in 68 seconds then, so I think it may be safe to assume that they could do 1TB in under a minute this year. Google just hasn't submitted any of their runs to the competition.

    The sort benchmark page [sortbenchmark.org] lists the winning run as Yahoo's 100TB run, leaving out the 1PB run; that implies the 1PB run didn't conform to the rules, or was late, or something.

    People have commented that this is a "who has the biggest cluster" competition, but the sort benchmark also includes the 'penny' sort, which measures how much you can sort for one penny of computer time (assuming your machine lasts 3 years), and the 'Joule' sort, which measures how much energy it takes to sort a set amount of data. Not surprisingly, the big clusters appear to be neither cost efficient nor energy efficient.
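    To get a feel for the penny-sort arithmetic, here's a back-of-the-envelope calculation (the $3,000 machine price is my assumption, not part of the benchmark):

        // Back-of-the-envelope penny-sort budget: amortize an assumed
        // $3,000 machine over the benchmark's 3-year lifetime.
        public class PennySortBudget {
            public static void main(String[] args) {
                double machineCost = 3000.0;                      // assumed price, USD
                double lifetimeSeconds = 3 * 365.25 * 24 * 3600;  // ~94.7M seconds
                double dollarsPerSecond = machineCost / lifetimeSeconds;
                double budgetSeconds = 0.01 / dollarsPerSecond;   // what a penny buys
                // Prints roughly 316 seconds for this machine.
                System.out.printf("One penny buys about %.0f seconds%n", budgetSeconds);
            }
        }

    So the contest becomes how much you can sort in those ~316 seconds on cheap hardware, which rewards efficiency rather than cluster size.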

  • Re:Overlords (Score:4, Informative)

    by daemonburrito (1026186) on Saturday May 16, 2009 @11:21AM (#27979653) Journal

    "MapReduce: Simplified Data Processing on Large Clusters [google.com]." Jeffrey Dean and Sanjay Ghemawat, OSDI '04.

    They wrote about it in Beautiful Code, too (great book). MapReduce isn't complex; in fact, the name comes from the map and reduce functions that a lot of functional languages provide (yeah, I know, it's not exactly the same thing; see the tiny sketch below).

    There are many implementations of it. The Wikipedia article is pretty informative: http://en.wikipedia.org/wiki/MapReduce [wikipedia.org]. I didn't know about "BashReduce"... Heh.
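    Here's that tiny single-machine illustration of the naming, using Java streams (the analogy only; Hadoop distributes both steps and adds a sort/shuffle between them):

        import java.util.Arrays;
        import java.util.List;

        public class MapReduceAnalogy {
            public static void main(String[] args) {
                List<String> words = Arrays.asList("open", "source", "sorting");
                int totalLength = words.stream()
                        .map(String::length)       // map: transform each element
                        .reduce(0, Integer::sum);  // reduce: fold results together
                System.out.println(totalLength);   // prints 17
            }
        }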

  • by jimicus (737525) on Saturday May 16, 2009 @11:22AM (#27979657)

    Here in the UK, the patent office has been issuing software patents for some time in "anticipation" of them becoming legal at some point in the future.

    No, I don't understand that either.

  • by e9th (652576) <e9th@NOSpaM.tupodex.com> on Saturday May 16, 2009 @01:09PM (#27980393)
    Hadoop's name (and mascot) came from Doug [the project leader] Cutting's son's yellow stuffed elephant toy.
  • by Yosho (135835) on Saturday May 16, 2009 @01:54PM (#27980659) Homepage

    It usually outperforms its Java sibling by an order of magnitude.

    Do you have any actual benchmarks for that? According to the benchmarks page at the official cLucene wiki [sourceforge.net], cLucene is roughly twice as fast as the Java Lucene at indexing, and it's only about 10% faster at the actual searching. That's not even close to an order of magnitude.

  • by Yosho (135835) on Saturday May 16, 2009 @02:05PM (#27980721) Homepage

    But how much of those libraries exist to achieve Java's religious beliefs on abstraction?

    Wow, how did this get modded insightful? For one, calling the design of a programming language a "religious belief" and then asking a vague question about it, without providing even the basis of an answer, is just inflammatory.

    But the answer that anybody who knows what they're talking about will give you is: none of them. Java's abstraction mechanisms are built into the language. None of the standard libraries are necessary to support them. They take advantage of abstraction, of course, and you'd be crazy not to take advantage of one of the language's features. Try taking a look at a tree representation [sun.com] of all of the classes in the standard library. The vast majority of classes are no more than one or two levels down from the top-level Object. The things that sit deeper are typically things that are complex in any language (CORBA, GUI toolkits, etc.). It certainly looks much cleaner than many graphs I've seen of C++ libraries that abused multiple inheritance.
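    If you want to check that claim yourself, here's a quick reflection sketch (the classes in main are just examples):

        public class DepthCheck {
            // Count how many levels a class sits below java.lang.Object.
            static int depthBelowObject(Class<?> c) {
                int depth = 0;
                while (c.getSuperclass() != null) { // Object's superclass is null
                    c = c.getSuperclass();
                    depth++;
                }
                return depth;
            }

            public static void main(String[] args) {
                System.out.println(depthBelowObject(String.class));              // 1
                System.out.println(depthBelowObject(java.util.ArrayList.class)); // 3
            }
        }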
