
Open Source Solution Breaks World Sorting Records

Posted by Soulskill
from the out-of-sorts dept.
allenw writes "In a recent blog post, Yahoo's grid computing team announced that Apache Hadoop was used to break the current world sorting records in the annual GraySort contest. It topped the 'Gray' and 'Minute' sorts in the general purpose (Daytona) category. They sorted 1TB in 62 seconds, and 1PB in 16.25 hours. Apache Hadoop is the only open source software to ever win the competition. It also won the Terasort competition last year."
This discussion has been archived. No new comments can be posted.

  • by blahplusplus (757119) on Saturday May 16, 2009 @10:54AM (#27979035)

    ... truth be told, a lot of good engineering could happen if many of people's favorite commercial applications had the source distributed with them. A lot of old games, for instance, could be updated and maintained.

    I think what holds the progress of open source back is interesting projects that exist that people want to work on but are locked away under corporate lock and key.

  • by berend botje (1401731) on Saturday May 16, 2009 @11:19AM (#27979221)
    Also, you can't patent software in Europe

    Not yet, but they are working on it. They tried to sneak it through by hiding it in the amendments of an agricultural bill. Luckily Poland kept watch and raised a stink about it.

    It's not over. There is too much money to be gained for that.
  • by Rockoon (1252108) on Saturday May 16, 2009 @11:41AM (#27979363)
    I was doing some back-of-the-envelope math, and they are sorting 17.7GB/second, which at a minimum would require 177 HDs if each drive can write 100MB/sec.

    If it's not written to disk, then there is no achievement here (you don't perform 1 minute+ sorts and then throw the result away in real-world scenarios).
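The back-of-the-envelope estimate above can be reproduced with a short sketch. The 62 seconds comes from the story summary and the 100MB/sec drive speed from the comment; reading "1TB" as 2**40 bytes is an assumption, made here because it reproduces the 17.7GB/second figure (a decimal terabyte gives roughly 16.1GB/second). The function names are hypothetical.

```python
import math


def throughput_gb_per_s(total_bytes, seconds):
    """Average sort throughput in decimal gigabytes per second."""
    return total_bytes / seconds / 1e9


def min_drives(total_bytes, seconds, drive_mb_per_s=100):
    """Fewest drives needed just to write the output at the given rate.

    Assumes perfectly parallel sequential writes and ignores reading the
    input, intermediate spills, and replication.
    """
    bytes_per_s = total_bytes / seconds
    return math.ceil(bytes_per_s / (drive_mb_per_s * 1e6))


if __name__ == "__main__":
    binary_tb = 2**40    # assumption: "1TB" read as a binary terabyte
    decimal_tb = 10**12  # alternative reading
    for label, size in (("binary", binary_tb), ("decimal", decimal_tb)):
        rate = throughput_gb_per_s(size, 62)
        print(f"{label}: {rate:.1f} GB/s, at least {min_drives(size, 62)} drives")
```

Rounding the rate to 17.7GB/second before dividing gives the 177 drives quoted in the comment; rounding up from the unrounded rate gives one more. Either way this is a lower bound on write bandwidth only.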
  • by haruchai (17472) on Saturday May 16, 2009 @11:58AM (#27979475)
    Why isn't this illegal - adding unrelated legislation to a ? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?
  • Google Sort (Score:2, Interesting)

    by jlebrech (810586) on Saturday May 16, 2009 @12:06PM (#27979529) Homepage

    I'm looking forward to sorting my search results by Date, Title, Description, Author, etc.

  • by Halo1 (136547) <jonas.maebeNO@SPAMelis.ugent.be> on Saturday May 16, 2009 @12:36PM (#27979757) Homepage

    Why isn't this illegal - adding unrelated legislation to a ? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?

    The GP is confusing a bunch of things. First, the Council of Ministers threw out all limiting amendments from the European Parliament and reached a Political Agreement on a shoddy text through backdoor maneuvering by Germany and the European Commission [google.com]. That text would have turned the European Patent Office's practice of granting software patents into EU legislation.

    A Political Agreement has neither juridical nor legislative value, but it has never happened that a political agreement was later annulled and negotiations were reopened. So in this case too, even though the German, Dutch, Spanish and Danish parliaments afterwards passed motions asking to reopen the discussions, the Council's bureaucrats did not want to do that because it "would undermine the efficiency of the decision making process".

    Anyway, once you have a Political Agreement (which is reached by the representatives of the ministries responsible for the matter at hand) and nobody "wants" to discuss it anymore, the agreement can be placed as an "A item" on any EU Council of Ministers meeting, since it only needs rubber stamping in that case. In the case of the Software Patents Directive, it appeared several times as an A item on the agenda of an Agriculture and Fisheries meeting (which is presumably where the GP's confusion stems from).

    In principle, there would have been nothing wrong with that, but in this case there was no actual political agreement, and in particular Poland was very unhappy with the way it had been treated. So 4 times in a row, Poland either had this "A item" removed from the agenda (sometimes at the last minute, because the responsible Polish minister had to be informed that they were again trying to get it through at a meeting he had no business with), or turned it into a "B item", which means that it can't be rubber stamped but that they first have to talk a bit about it (which nobody wanted to do).

    In the end it still did get approved, but that whole circus helped in convincing the EU Parliament to table a resolution asking the Commission to restart the directive's process [ffii.org], and, when the Commission refused, to later squarely reject it [ffii.org].

    You can find some more of my thoughts on the Council's behaviour here [ffii.org].

  • In sorting a terabyte, Hadoop beat Google's time (62 versus 68 seconds). For the petabyte sort, Google was faster (6 hours versus 16 hours). The hardware is of course different. (from Yahoo's blog [yahoo.net] and Google's blog [blogspot.com])

    Terabyte:
        Machines: Yahoo 1,407 Google 1,000
        Disks: Yahoo 5,628 Google 12,000
    Petabyte:
        Machines: Yahoo 3,658 Google 4,000
        Disks: Yahoo 14,632 Google 48,000

    Yahoo published their network specifications, but Google did not. Clearly the network speed is very relevant.

    The two takeaway points are: Hadoop is getting faster, and it is closing in on Google's performance and scalability.
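Since the hardware differs, one crude way to compare the runs above is to normalize sort throughput by disk count. The sketch below uses the machine and disk counts from this comment, Yahoo's times from the story summary (62 seconds and 16.25 hours), and Google's times as quoted here (68 seconds and roughly 6 hours). Decimal units are an assumption, the names are hypothetical, and the metric ignores network, CPU, and replication differences.

```python
# Figures as quoted in the thread; times converted to seconds.
RUNS = {
    ("terabyte", "Yahoo"): {"bytes": 10**12, "seconds": 62, "disks": 5_628},
    ("terabyte", "Google"): {"bytes": 10**12, "seconds": 68, "disks": 12_000},
    ("petabyte", "Yahoo"): {"bytes": 10**15, "seconds": 16.25 * 3600, "disks": 14_632},
    ("petabyte", "Google"): {"bytes": 10**15, "seconds": 6 * 3600, "disks": 48_000},
}


def per_disk_mb_per_s(run):
    """Sorted data per second per disk, in decimal megabytes."""
    return run["bytes"] / run["seconds"] / run["disks"] / 1e6


if __name__ == "__main__":
    for (size, who), run in RUNS.items():
        print(f"{size:9s} {who:6s} {per_disk_mb_per_s(run):.2f} MB/s per disk")
```

On these numbers Yahoo's runs sort more data per disk in both categories, while Google's petabyte run finishes sooner by using far more disks. That is a statement about the quoted figures only, not about the software.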

  • by Anonymous Coward on Saturday May 16, 2009 @01:38PM (#27980161)

    You don't _always_ need that much main memory -- there's a concept of something called a data-flow architecture [wikipedia.org].

    The old Tandem (I think HP calls it Neoview now) does this w/ their SQL engine. Of course, you would likely still need the last step to use temporary/overflow files on disk, but the intermediate steps could potentially be done w/o data touching disk -- depends on the generated query plan or how you are "reducing" the problem.
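To illustrate the general point that a sort need not hold everything in main memory, here is a minimal external merge sort sketch. It is not Tandem's or Hadoop's implementation and not a data-flow engine; it only shows sorted runs being spilled to temporary files and then streamed through a merge, so that at most one run is fully in memory at a time. The function names and the `run_size` parameter are hypothetical, and records are assumed to be newline-free strings.

```python
import heapq
import tempfile
from itertools import islice


def spill_sorted_runs(records, run_size):
    """Sort fixed-size chunks in memory and spill each one to a temp file."""
    it = iter(records)
    runs = []
    while True:
        chunk = sorted(islice(it, run_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(record + "\n" for record in chunk)
        f.seek(0)  # rewind so the merge step can read the run back
        runs.append(f)
    return runs


def read_run(f):
    """Stream one sorted run back from disk, one record at a time."""
    for line in f:
        yield line.rstrip("\n")


def external_sort(records, run_size=1000):
    """Yield records in sorted order without holding them all in memory."""
    runs = spill_sorted_runs(records, run_size)
    try:
        # heapq.merge pulls lazily from each run, so output is streamed.
        yield from heapq.merge(*(read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()


if __name__ == "__main__":
    words = ["pear", "apple", "fig", "kiwi", "date"]
    print(list(external_sort(words, run_size=2)))
```

The merge step is a stream: a consumer can start processing the smallest records before the last run has been fully read.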

  • by jjohnson (62583) on Saturday May 16, 2009 @07:32PM (#27982707) Homepage

    There was an episode of the Simpsons where Springfield is going to be destroyed by a meteor. Congress meets to quickly pass legislation to fund the evacuation of the city. At the last moment, a Congressman steps up to the podium and says "I'd like to add a rider providing $30 million for the perverted arts". The bill is defeated.

    It's funny because it's true.
