Google Sorts 1 Petabyte In 6 Hours
krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce, 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."
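For readers unfamiliar with the model, here is a single-machine toy version of the map/shuffle/reduce pipeline with the classic word-count job. The function names are illustrative only, not Google's actual API:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-supplied mapper to every input record."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key's value list."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: map each line to (word, 1) pairs, reduce by summing.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts["the"])  # 2
```

In the real system each phase runs across thousands of machines and the shuffle moves data over the network, but the programming model the user sees is essentially this.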
Re:That's Easy (Score:5, Insightful)
I came here to post the same thing. If they sorted a petabyte of Floats, that might be pretty impressive. But if they're sorting 5-terabyte video files, their software really sucks.
Not enough info to judge the importance of this.
Need to benchmark against the best sorts (Score:5, Insightful)
Sorts have been parallelized and distributed for decades. It would be interesting to benchmark Google's approach against SyncSort [syncsort.com]. SyncSort is parallel and distributed, and has been heavily optimized for exactly such jobs. Using map/reduce will work, but there are better approaches to sorting.
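For what it's worth, the standard way these large distributed sorts work (Google's TeraSort-style runs included) is range partitioning: sample the keys to choose splitters, route each record to a partition, sort partitions independently, and concatenate. A minimal single-process sketch of the idea, not anyone's production code:

```python
import bisect
import random

def parallel_range_sort(data, n_workers=4, sample_size=64):
    """Range-partitioned sort: pick splitter keys from a sample,
    bucket each record by range, sort buckets independently, and
    concatenate. Ranges are disjoint, so no final merge is needed."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    # n_workers - 1 splitter keys define the partition boundaries.
    splitters = [sample[len(sample) * i // n_workers]
                 for i in range(1, n_workers)]
    partitions = [[] for _ in range(n_workers)]
    for x in data:
        partitions[bisect.bisect_left(splitters, x)].append(x)
    # Each partition could be sorted on a separate machine.
    return [y for part in partitions for y in sorted(part)]

data = random.sample(range(10_000), 1000)
assert parallel_range_sort(data) == sorted(data)
```

The sampling step is what keeps the partitions balanced; get the splitters wrong and one worker ends up with most of the data.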
Sort? Sort what? (Score:1, Insightful)
One quadrillion bytes, or 1 million gigabytes.
How big are the fields being sorted? Is it an exchange sort or a reference sort?
It is probably very impressive, but without a LOT of details, it is hard to know.
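The distinction the parent is drawing, roughly: an exchange sort moves the full records around during the sort, while a reference sort only sorts keys (plus pointers) and dereferences at the end, which matters enormously when records are large. A toy illustration with a made-up record layout:

```python
# Hypothetical records: a small key plus a large payload.
records = [("zebra", b"x" * 100), ("apple", b"y" * 100), ("mango", b"z" * 100)]

# Exchange-style: the full records (key + payload) are moved as we sort.
exchange_sorted = sorted(records, key=lambda r: r[0])

# Reference-style: sort only (key, index) pairs, then gather payloads once.
index = sorted(range(len(records)), key=lambda i: records[i][0])
reference_sorted = [records[i] for i in index]

assert exchange_sorted == reference_sorted
print([r[0] for r in reference_sorted])  # ['apple', 'mango', 'zebra']
```

With petabyte-scale payloads, whether you shuffle whole records or just keys changes the I/O cost by orders of magnitude, which is why the question matters.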
Libraries of congress? (Score:3, Insightful)
Honestly, how am I supposed to know what "...the amount of archived web data in the US Library of Congress as of May 2008" looks like? I've been to the Library of Congress; I've seen it. It's a metric shit-ton of books (1 shit-ton = shit * assloads^fricking lots), but I have no clue what the LoC is archiving, what rate they're going at, or what the volume of it is.
Re:Sort? Sort what? (Score:3, Insightful)
You do have to merge them all back together at the end...
But I'm sure you can do better tonight.
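The final merge of independently sorted runs is a standard k-way merge; on one machine it's a single library call (a sketch of the idea, obviously not what Google runs):

```python
import heapq

# Three sorted runs, as if produced by separate workers.
runs = [[1, 4, 9], [2, 3, 8], [0, 5, 7]]

# k-way merge with a heap: O(N log k) comparisons total.
merged = list(heapq.merge(*runs))
print(merged)  # [0, 1, 2, 3, 4, 5, 7, 8, 9]
```

At distributed scale you avoid even this step by range-partitioning up front, so each worker's sorted output occupies a disjoint key range and the results simply concatenate.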
Re:20,111 Servers ?? (Score:4, Insightful)
Oh dear. 4000*362 ~= 1440*20111 / 20. So you assumed that the sorting would scale linearly. fail.
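The arithmetic being objected to: take the petabyte run's machine-minutes, scale linearly up to 20 PB/day, and divide by the minutes in a day. The numbers do reproduce the 20,111 figure; it's the linear-scaling assumption that's questionable:

```python
machines = 4000
minutes = 6 * 60 + 2        # 362 minutes for 1 PB
pb_per_day = 20
minutes_per_day = 24 * 60   # 1440

# Machine-minutes for 20 PB, naively assuming linear scaling:
machine_minutes = machines * minutes * pb_per_day
servers = machine_minutes / minutes_per_day
print(round(servers))  # 20111
```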
Re:Need to benchmark against the best sorts (Score:1, Insightful)
I guess it's up to SyncSort to run a benchmark and publish the results, no?
Re:Need to benchmark against the best sorts (Score:4, Insightful)
>>Using map/reduce will work, but there are better approaches to sorting.
It kinda bugs me that Google trademarked (or at least named) their software after a programming modality that has been around in parallel processing for ages. MPI, for instance, has long had collective operations like MPI_Reduce that do essentially a map/reduce operation: farm out instances of a function to a cluster, gather the data back in, combine it, and present the result to someone.
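The pattern itself is indeed much older than Google: functional languages have shipped map and reduce for decades, and even Python's builtins express the same shape on one machine:

```python
from functools import reduce
from operator import add

data = range(1, 101)

# "Map" a function over the data, then "reduce" the results to one value.
squares = map(lambda x: x * x, data)
total = reduce(add, squares)
print(total)  # 338350
```

Google's contribution was the distributed runtime underneath (scheduling, shuffling, fault tolerance at thousands of machines), not the map/reduce abstraction itself.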
It kind of bugs me (in their YouTube video linked in TFA, at least) that they make it seem like this model is their brilliant idea, when all they've done is write the job-control layer under it. Other job-control layers that handle spawning new processes, fault tolerance, etc. have existed for many, many years. Maybe Google's is nicer than other packages, in the same way that Google Maps is nicer than other map packages, but I think most people like it just because they don't realize how uninspired it is.
It'd be like them coming out with Google QuickSort(beta) next.