Slashdot
Google Sorts 1 Petabyte In 6 Hours
Posted by Soulskill on Sunday November 23, @11:53AM
from the sort-of-fast dept.
krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce, 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."
Related Stories
MapReduce Goes Commercial, Integrated With SQL 99 comments
CurtMonash writes "MapReduce sits at the heart of Google's data processing — and Yahoo's, Facebook's and LinkedIn's as well. But it's been highly controversial, due to an apparent conflict with standard data warehousing common sense. Now two data warehouse DBMS vendors, Greenplum and Aster Data, have announced the integration of MapReduce into their SQL database managers. I think MapReduce could give a major boost to high-end analytics, specifically to applications in three areas: 1) Text tokenization, indexing, and search; 2) Creation of other kinds of data structures (e.g., graphs); and 3) Data mining and machine learning. (Data transformation may belong on that list as well.) All these areas could yield better results if there were better performance, and MapReduce offers the possibility of major processing speed-ups."
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

Kudos to Google (Score:5, Funny)
for knowing how important the Library of Congress metric is to us nerds!
Re:Kudos to Google (Score:5, Funny)
for knowing how important the Library of Congress metric is to us nerds!
But at least now we know Google can sort out petafiles.
Re:Kudos to Google (Score:5, Funny)
Unit conversion (Score:5, Funny)
Yay! We finally have unit conversion from 1 LoC to bytes! So...20 PB = 6LoC, means that 1 LoC = 3,333... PB :)
Re:Unit conversion (Score:4, Informative)
No, 1 PB = 12 LoC, so 1 LoC = 0.0833... PB
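The correction is easy to check with a few lines of arithmetic. This is only a sketch that takes the summary's figures at face value (1 PB is 12 LoC, 20 PB processed per day); the variable names are made up for illustration.

```python
from fractions import Fraction

# Assumption taken from the Google Blog quote in the summary:
# 1 PB is 12 times the archived web data in the Library of Congress.
LOC_PER_PB = 12

pb_per_loc = Fraction(1, LOC_PER_PB)  # size of one "LoC" in petabytes
loc_per_day = 20 * LOC_PER_PB         # 20 PB/day expressed in LoC

print(float(pb_per_loc))  # about 0.0833 PB per LoC
print(loc_per_day)        # 240 LoC per day
```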
Also, I'd like to make some kind of swimming pool reference.
That's Easy (Score:5, Interesting)
Re:That's Easy (Score:5, Insightful)
I came here to post the same thing. If they sorted a petabyte of Floats, that might be pretty impressive. But if they're sorting 5-terabyte video files, their software really sucks.
Not enough info to judge the importance of this.
Re:That's Easy (Score:5, Informative)
I think this is the data set. I could be wrong though. The article (yeah yeah) says that
In our sorting experiments we have followed the rules of a standard terabyte (TB) sort benchmark.
Which led me to this page [hp.com] that describes the data (and it's available for download).
Re:That's Easy (Score:5, Informative)
From TFA: they sorted "10 trillion 100-byte records"
Re:That's Easy (Score:5, Funny)
And yet Google doesn't even convert petabytes to Libraries of Congress in the Google calculator.
Or perhaps I got the syntax wrong.
Re:That's Easy (Score:5, Funny)
Huh? This isn't the parent post I was trying to reply to.
Need to benchmark against the best sorts (Score:5, Insightful)
Sorts have been parallelized and distributed for decades. It would be interesting to benchmark Google's approach against SyncSort [syncsort.com]. SyncSort is parallel and distributed, and has been heavily optimized for exactly such jobs. Using map/reduce will work, but there are better approaches to sorting.
Finally... (Score:5, Funny)
I will be able to catalog my pr0n in my lifetime:
Blondes, Brunettes, Red heads, Beastial^H^H^H^H^H "Other"
tagging (Score:5, Interesting)
It's not enough to sort by blond, black, gay, scat, etc. Some categories are a combination that don't belong in a hierarchy.
That is where tagging comes in. Sorting can be done on-the-fly, with no one category intrinsically more important.
Re:tagging (Score:5, Funny)
pr0n for Geeks, volume 18: Sorting On-the-Fly
Not impressive... (Score:5, Funny)
Amazing feat... (Score:5, Funny)
Today from Google, the god of all things and doer of all things good in the universe, many millions of dollars in computer equipment were able to sort lots of things, in about the amount of time you would think it would take for millions of dollars of equipment to sort things.
In other news, a woodchuck was found chucking wood as fast as a woodchuck could chuck wood.
Congrats Google, you have a HUGE data set, and an even bigger wallet.
Re:Sort? Sort what? (Score:5, Informative)
I realize, slashdot..., but maybe you could glance at the article which states:
10 trillion 100-byte records
Re:Sort? Sort what? (Score:5, Funny)
Re:Its About Time.... (Score:4, Informative)
Are you sure? It wasn't marked Vista capable.
Re:Its About Time.... (Score:4, Funny)
Not only that, the extra processors aren't covered under the EULA and require special extra licenses.
Re:20,111 Servers ?? (Score:4, Insightful)
Re:20,111 Servers ?? (Score:4, Insightful)
Oh dear. 4000*362 ~= 1440*20111 / 20. So you assumed that the sorting would scale linearly. fail.
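The identity itself does check out numerically. A quick sketch, assuming the 20,111 figure came from scaling the 4,000-machine, 362-minute run linearly up to 20 PB per 1,440-minute day (the parent's reasoning isn't shown here, so that reading is a guess):

```python
machines = 4000
sort_minutes = 6 * 60 + 2    # 6 hours 2 minutes = 362 minutes
minutes_per_day = 24 * 60    # 1440
pb_per_day = 20              # from the summary: 20 PB processed per day

# Machine-minutes spent sorting one petabyte in the announced run.
machine_minutes_per_pb = machines * sort_minutes

# Servers needed for 20 PB/day IF cost scaled linearly (the disputed assumption).
servers_if_linear = pb_per_day * machine_minutes_per_pb / minutes_per_day

print(machine_minutes_per_pb)      # 1448000
print(round(servers_if_linear))    # 20111
```

The numbers only tell you what linear scaling would imply; whether sorting actually scales that way is exactly the objection.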
Re:One ups Yahoo & Hadoop (Score:5, Interesting)
Re:MapReduce (Score:5, Informative)
The individual functions map and reduce are quite standard. The innovation here is the systems work they've done to make it work on such a large scale. All the programmer needs to worry about is implementing the two functions; they don't have to worry about distributing the work, ensuring fault tolerance, or anything else for that matter. That is the innovation.
They mention in the article that if you try to sort a petabyte you WILL get hard disk and computer failures. Hell, you can only read a terabyte hard disk a few times before you encounter unrecoverable errors. The system for executing those maps and reduces is what is important here. The important parts are in the design details, such as dealing with stragglers. If you have 4000 identical machines, you won't necessarily get equal performance. If a few of those machines have a bit flipped and started without disk cache, they might see a huge decrease in read/write performance. The system needs to recognize this and schedule the work differently. That can make a huge difference in execution time. If you graph the percent complete of an MR job, you'll often see that it quickly reaches 95% and then plateaus. The last 5% may take 20% of the time, and good scheduling is required to bring this time down.
But like I said, the innovation isn't in the idea of using a Map and Reduce function, it is the system that executes the work.
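To make the "two functions" point concrete, here is a toy, single-process sketch of a sort expressed as map and reduce. It is not Google's implementation: `run_mapreduce`, the two-character key, and the partition boundaries are all invented for illustration, and everything the comment calls the hard part (distribution, fault tolerance, straggler handling) is left out.

```python
from collections import defaultdict

def map_fn(record):
    # User-supplied: emit (key, value). Here the key is the first 2 characters.
    yield record[:2], record

def reduce_fn(key, values):
    # User-supplied: identity reduce, emitting one key's values in sorted order.
    for value in sorted(values):
        yield value

def run_mapreduce(records, boundaries):
    # Stand-in for the framework: range-partition keys so that every key in
    # partition i sorts before every key in partition i + 1.
    partitions = [defaultdict(list) for _ in range(len(boundaries) + 1)]
    for record in records:
        for key, value in map_fn(record):
            idx = sum(key >= b for b in boundaries)
            partitions[idx][key].append(value)
    # Each partition is reduced independently; concatenating the partitions
    # in order yields a globally sorted result.
    output = []
    for part in partitions:
        for key in sorted(part):
            output.extend(reduce_fn(key, part[key]))
    return output

data = ["pear", "apple", "zebra", "mango", "apricot", "kiwi"]
result = run_mapreduce(data, boundaries=["h", "q"])
print(result)
```

The user code is the two small functions at the top; in a real system the part played by `run_mapreduce` is where the work on scheduling and failure recovery would live.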