Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs (187 comments)

Posted by timothy on Sunday August 29, 2010 @10:22PM from the quick-like-double-time dept.

An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."

This discussion has been archived. No new comments can be posted.

The video card in question.. (Score:5, Informative)

by black3d ( 1648913 ) writes: on Sunday August 29, 2010 @10:33PM (#33412038)

Specifically, a GTX480 (just over 1 billion keys/sec), followed by a Tesla 2050 at around 75% of the GTX480's speed (745 million keys/sec).

Link to Technical Paper (Score:5, Informative)

by PatPending ( 953482 ) writes: on Sunday August 29, 2010 @10:59PM (#33412126)

Revisiting Sorting for GPGPU Stream Architectures [virginia.edu] (PDF)

Re:x86 (Score:5, Informative)

by emmons ( 94632 ) writes: on Monday August 30, 2010 @12:14AM (#33412374) Homepage

GPUs are highly parallel processors, but most of our computing algorithms were developed for fast single-core processors. As we figure out how to implement new solutions to old problems to take advantage of these highly parallel processors, you'll continue to see stories like this one. But there's a limit to how good they can be at certain types of problems. Read up on Amdahl's law.
Basically, traditional x86 processors are good at lots of stuff. Modern GPUs are great at a few things.
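For anyone who doesn't want to read up on it, here is a minimal sketch of Amdahl's law in Python. The 95% parallel fraction is a made-up illustration, not a figure from the paper, and the function name is just a label for this example.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup when only part of a program runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many processors exist.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the work parallelizes perfectly.
    for processors in (1, 15, 480, 1_000_000):
        print(processors, round(amdahl_speedup(0.95, processors), 2))
```

However many processors you throw at this hypothetical workload, the speedup stays below 1 / 0.05 = 20, because the serial 5% never gets faster.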

No (Score:5, Informative)

by Sycraft-fu ( 314770 ) writes: on Monday August 30, 2010 @12:29AM (#33412408)

GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations and inefficient at others. Some things they run literally a thousand times faster than a CPU; graphics rasterization is one of these (no surprise, that's their prime job). Other things they run much slower. For something to run fast on a GPU it has to meet the following requirements; the more closely it matches them, the faster it runs:
1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 480 shaders, meaning for max performance it needs to be working on at least 480 things in parallel. Things that are only somewhat parallel don't work well.
2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.
3) When a branch happens, things need to branch the same way. The shaders work in groups with regard to data and instructions. So if you have half a group branching one way and half the other, that'll slow things down, as the two halves have to be split out and done separately. So branches need to be uniform for the most part.
4) The problem set needs to fit into the RAM of the GPU. This varies: 1GB is normal for high-end GPUs, and 4-6GB is possible for special processing versions of those. The on-board memory is exceedingly fast, over a hundred gigabytes per second in most cases. However, the penalty for hitting main system RAM is heavy, since the PCIe bus offers only a fraction of that bandwidth. So you need to be able to load data into video RAM and work on it there, only occasionally going back and forth with the main system.
5) For the very best performance, your problem needs to be single-precision floating point (32-bit). That is what GPUs like best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares overall; they can do it, but again not at the same speed as single-precision FP.
Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics is one of the biggest, which is why GPUs exist. However, there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, GPUs fall flat. A normal CPU creams them performance-wise.
That's why we have the separate components. CPUs aren't as good as GPUs at what GPUs do, but they are good at everything. GPUs do particular things well, but other things not so much.
In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low-power chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.
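Point 3 above (uniform branching) can be illustrated with a toy cost model in Python. This is only a sketch: it assumes groups of 32 lanes (the size NVIDIA uses for a warp) and invented per-path costs of 10 cycles, and it ignores everything else a real GPU does.

```python
GROUP_SIZE = 32  # lanes sharing one instruction stream (an NVIDIA "warp")


def group_cost(takes_branch, cost_if=10, cost_else=10):
    """Cycles for one group: every path that any lane takes runs once for all."""
    cost = 0
    if any(takes_branch):       # at least one lane needs the "if" path
        cost += cost_if
    if not all(takes_branch):   # at least one lane needs the "else" path
        cost += cost_else
    return cost


def kernel_cost(flags, cost_if=10, cost_else=10):
    """Total cycles over all groups (hypothetical costs, toy model)."""
    total = 0
    for start in range(0, len(flags), GROUP_SIZE):
        total += group_cost(flags[start:start + GROUP_SIZE], cost_if, cost_else)
    return total


# Same number of True/False lanes, arranged differently.
uniform = [True] * 32 + [False] * 32      # each group agrees internally
mixed = [i % 2 == 0 for i in range(64)]   # every group is split in half

if __name__ == "__main__":
    print("uniform:", kernel_cost(uniform))
    print("mixed:  ", kernel_cost(mixed))
```

In this model the mixed arrangement costs twice as much as the uniform one, even though both do the same amount of useful work, because each split group has to walk both paths.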

Re:Ugh. (Score:3, Informative)

by metacell ( 523607 ) writes: on Monday August 30, 2010 @02:26AM (#33412784)

Dude, an algorithm which is O(n*log(n)) is not faster than one which is O(n) just because n*log(n) < n.
When an algorithm is O(n*log(n)), it means the actual time requirement is roughly p*n*log(n)+q, where p and q are constants specific to the algorithm.
The O(n*log(n)) algorithm is faster than the O(n) one when
p1*n*log(n)+q1 < p2*n+q2
... and for any n, it is possible to choose p1, p2, q1 and q2 so that the O(n) algorithm becomes faster.
This means, for example, that an algorithm which is O(n) isn't automatically faster than an algorithm which is O(n*log(n)) on lists with three elements or more. The O(n) algorithm may take a hundred times longer to sort a list of two elements than the O(n*log(n)) one (due to each step being more complex), and in that case the lists will need to grow some before the O(n) algorithm becomes faster.
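The p*n*log(n)+q versus p*n+q comparison above can be tried out numerically. The sketch below is in Python; the constants are invented for illustration, and the base-2 logarithm is an assumption (a different base only rescales p1).

```python
import math


def t_nlogn(n, p1, q1):
    """Modelled running time of the O(n*log(n)) algorithm."""
    return p1 * n * math.log2(n) + q1


def t_linear(n, p2, q2):
    """Modelled running time of the O(n) algorithm."""
    return p2 * n + q2


def crossover(p1, q1, p2, q2, limit=1_000_000):
    """Smallest n >= 2 where the O(n) algorithm is strictly faster, else None."""
    for n in range(2, limit + 1):
        if t_linear(n, p2, q2) < t_nlogn(n, p1, q1):
            return n
    return None


if __name__ == "__main__":
    # Hypothetical constants: each step of the O(n) algorithm costs 10x more.
    print(crossover(p1=1.0, q1=0.0, p2=10.0, q2=0.0))
    # With equal constants the O(n) algorithm wins almost immediately.
    print(crossover(p1=1.0, q1=0.0, p2=1.0, q2=0.0))
```

With a tenfold per-step penalty, the O(n) algorithm only pulls ahead once log2(n) exceeds 10, that is, for lists longer than 1024 elements.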

Re:The video card in question.. (Score:2, Informative)

by FilipeMaia ( 550301 ) writes: on Monday August 30, 2010 @02:36AM (#33412818) Homepage

The reason the GTX480 is faster is that it has 15 SMs compared to 14 on the Tesla 2050. The GTX 480 also runs at a higher clock speed (700 MHz compared to 575 MHz). Put together, this is 575/700 * 14/15 = 76.7%, which comes pretty close to the 75%.
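The same arithmetic as a quick Python check. It assumes, as the estimate above does, that sorting throughput scales linearly with SM count times clock speed; the dictionary layout and function name are just for this example.

```python
# SM counts and clock speeds as quoted in the comment above.
gtx_480 = {"sms": 15, "clock_mhz": 700}
tesla_2050 = {"sms": 14, "clock_mhz": 575}


def relative_throughput(card, baseline):
    """Ratio of (SMs x clock), assuming throughput scales linearly with both."""
    return (card["sms"] * card["clock_mhz"]) / (
        baseline["sms"] * baseline["clock_mhz"]
    )


ratio = relative_throughput(tesla_2050, gtx_480)

if __name__ == "__main__":
    print(f"{ratio:.1%}")  # 76.7%
```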

Re:Um... (Score:3, Informative)

by Anonymous Coward writes: on Monday August 30, 2010 @02:43AM (#33412832)

Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."
The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).
Here is the relevant Wikipedia article [wikipedia.org].
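A short Python sketch of the two speedup shapes mentioned above. Speedup is the serial time divided by the parallel time on p processors; the base-2 logarithm and the processor counts are assumptions made for illustration.

```python
import math


def speedup(t_serial, t_parallel):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def linear_speedup(p):
    """Perfect scaling: p processors give a p-fold speedup."""
    return float(p)


def logarithmic_speedup(p):
    """Speedup that grows only as log(p); base 2 is an assumption here."""
    return math.log2(p)


if __name__ == "__main__":
    for p in (2, 16, 512):
        print(p, linear_speedup(p), logarithmic_speedup(p))
```

Going from 16 to 512 processors multiplies the linear speedup by 32, but raises the logarithmic one only from 4 to 9.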

Re:Also (Score:1, Informative)

by Anonymous Coward writes: on Monday August 30, 2010 @03:32AM (#33412986)

Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.
Normally big-O notation is applied on a purely theoretical level where all operations are assumed to have the same base cost in terms of time to execute. This does not make the notation invalid for real world applications and implementations, however. But when doing so you have to adjust your formulas by adding the proper weighting to execution time. In practice this is usually a waste of effort as it's generally faster to just write an implementation and time it on various pieces of hardware.
In the article we're talking about, they are comparing a single implementation of a single algorithm on several pieces of hardware. So first of all, the summary shouldn't be shouting about breaking any kind of record. They weren't trying to hit any particular benchmark; it's a relative test of the hardware, not of the algorithm or implementation.
The reason big-O comparisons are useful is that if all we use is this type of test, the answer to any speed problem is simply "Get faster hardware, or buy a piece of hardware which runs this code faster". In all likelihood, there are other methods which, when implemented on the same hardware, may yield much faster results. Heck, it's possible someone else's implementation of the same algorithm may yield faster results as well.
In regards to your comments about raytracing vs. polygon rendering, all I'm going to say is that you don't have a very good concept of what raytracing really is if you think a sphere created from a million triangles will raytrace faster than one modeled as a mathematical sphere. It won't. Those demonstrations are a pure head-to-head comparison operating on a scene which has actually been optimized for a non-raycasting technique.
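The advice earlier in this comment, to just write an implementation and time it, looks roughly like this in Python. It is a CPU-only sketch with a made-up input size, using the standard timeit module; it says nothing about the GPU code in the paper.

```python
import random
import timeit


def insertion_sort(values):
    """Simple O(n^2) sort; returns a new sorted list."""
    result = list(values)
    for i in range(1, len(result)):
        item = result[i]
        j = i - 1
        # Shift larger items one slot right until the gap fits `item`.
        while j >= 0 and result[j] > item:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = item
    return result


def time_sort(sort_fn, data, repeats=3):
    """Best wall-clock time (seconds) over a few runs on a fresh copy of data."""
    runs = timeit.repeat(lambda: sort_fn(list(data)), number=1, repeat=repeats)
    return min(runs)


if __name__ == "__main__":
    random.seed(0)
    data = [random.randrange(1_000_000) for _ in range(1_000)]  # hypothetical size
    for name, fn in (("insertion_sort", insertion_sort), ("sorted", sorted)):
        print(f"{name}: {time_sort(fn, data):.6f} s")
```

The numbers only mean something relative to each other on the same machine, which is exactly the limitation described above: change the hardware and the absolute figures change with it.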

Re:No (Score:2, Informative)

by FlawedLogic ( 1062848 ) writes: on Monday August 30, 2010 @03:33AM (#33412992)

The Fermi chip in the GTX480 can actually do a double-precision op per clock cycle. Fermi was designed with DP supercomputing in mind, which is why it's so bloody expensive. To get the price down for consumer cards, they removed that ability, since graphics doesn't generally need it. Consumer cards need four ticks to do the equivalent DP op.

Re:Excel Charts (Score:4, Informative)

by w0mprat ( 1317953 ) writes: on Monday August 30, 2010 @05:51AM (#33413324)

Amen. Some tools like that would be a godsend. It could be coming. http://en.wikipedia.org/wiki/Linked_Data [wikipedia.org] http://linkeddata.org/ [linkeddata.org] - Not what you are talking about, but what you describe may result from it.

Re:The video card in question.. (Score:2, Informative)

by ericcj ( 1574601 ) writes: on Monday August 30, 2010 @09:52AM (#33414430)

Chips on the GTX 480, C2050, and C2070 come from the exact same die and wafer. The C20XX GPUs actually run at a lower clock speed for 32-bit floating-point and integer operations than a GTX 480.
C20XX series hardware is intended for physics/science/engineering calculations, where double-precision is preferred. The C20XX series is 4 times faster at double-precision calculations than the GTX 480. This is the sweet spot it is tuned for.

Re:I think the bubble sort would be the wrong way (Score:1, Informative)

by Anonymous Coward writes: on Monday August 30, 2010 @11:05AM (#33415228)

His naivety hit the real world and exploded.

