## Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs

An anonymous reader writes

*"Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."*
## The video card in question.. (Score:5, Informative)

Specifically, a GTX480 (just over 1 B keys/sec), followed by a Tesla 2050 at around 75% of the speed of the GTX480 (745 M keys/sec).

## Link to Technical Paper (Score:5, Informative)

## Re:x86 (Score:5, Informative)

GPUs are highly parallel processors, but most of our computing algorithms were developed for fast single core processors. As we figure out how to implement new solutions to old problems to take advantage of these highly parallel processors, you'll continue to see stories like this one. But there's a limit to how good they can be at certain types of problems. Read up on Amdahl's law.

Basically, traditional x86 processors are good at lots of stuff. Modern GPUs are great at a few things.
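For anyone who doesn't want to read up on it, Amdahl's law fits in a few lines. This is a minimal sketch: the function name and the sample numbers (512 workers and the parallel fractions) are made up for illustration, not taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a program can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is never sped up; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workloads on a hypothetical 512-way parallel device.
    for fraction in (0.50, 0.90, 0.99):
        print(f"{fraction:.0%} parallel: {amdahl_speedup(fraction, 512):.1f}x")
```

However many workers you add, a program that is half serial can never run more than twice as fast, which is the limit the comment is pointing at.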

## No (Score:5, Informative)

GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirements; the more closely it matches them, the faster it runs:

1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 480 shaders, meaning for max performance it needs to be working on at least 480 things in parallel. Things that are only somewhat parallel don't work well.

2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.

3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.

4) The problem set needs to fit into the RAM of the GPU. This varies: 1 GB is normal for high-end GPUs and 4-6 GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However, the penalty for hitting the main system RAM is heavy, since the bandwidth of the PCIe bus is but a fraction of that. So you need to be able to load data into video RAM and work on it there, only occasionally going back and forth with the main system.

5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares overall; they can do it, but again not at the same speed as single precision FP.
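Point 3 can be made concrete with a toy cost model. The sketch below assumes a group of 32 lanes that executes in lockstep, so the group pays for every distinct branch path that at least one lane takes. The group size, the function name, and the cycle costs are illustrative assumptions, not measurements of any particular GPU.

```python
def group_cost(takes_branch, cost_if=10, cost_else=10):
    """Cycles a lockstep group spends on one if/else under the toy model.

    takes_branch holds one boolean per lane: True if that lane takes the
    'if' side, False if it takes the 'else' side.
    """
    any_if = any(takes_branch)        # at least one lane needs the 'if' path
    any_else = not all(takes_branch)  # at least one lane needs the 'else' path
    # Each path that is needed by anyone is executed once for the whole group.
    return cost_if * any_if + cost_else * any_else


if __name__ == "__main__":
    uniform = [True] * 32                        # every lane branches the same way
    divergent = [i % 2 == 0 for i in range(32)]  # lanes alternate between paths
    print("uniform:  ", group_cost(uniform))
    print("divergent:", group_cost(divergent))
```

Under this model a single lane going the other way costs as much as half the group doing so, which is why uniform branching matters more than the raw number of branches.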

Now this is very useful. There are a lot of problems that fall in that domain. As I said, graphics would be one of the biggest, hence why they exist. However there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance wise.

That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.

In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example. You find that tiny, low power, chips can switch amazing amounts of traffic. Try it on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.

## Re:Ugh. (Score:3, Informative)

Dude, an algorithm which is O(n*log(n)) is not faster than O(n^2) just because n*log(n) < n^2.

When an algorithm is O(n*log(n)), it means the actual time requirement is p*n*log(n)+q, where p and q are constants specific to the algorithm.

The O(n*log(n)) algorithm is faster than the O(n^2) one when

p1*n*log(n)+q1 < p2*n^2+q2

... and for any n, it is possible to choose p1, p2, q1 and q2 so that the O(n^2) algorithm becomes faster.

This means, for example, that an algorithm which is O(n*log(n)) isn't automatically faster than an algorithm which is O(n^2) on lists with three elements or more. The O(n*log(n)) algorithm may take a hundred times longer to sort a list of two elements than the O(n^2) one (due to each step being more complex), and in that case the lists will need to grow some before the O(n*log(n)) algorithm becomes faster.
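A minimal sketch of the constant-factor argument, using a quadratic cost model as the competing algorithm. The constants are invented so that the n*log(n) model is a hundred times slower on a two-element list; they are not measurements of any real sort.

```python
import math


def cost_nlogn(n, p=200.0, q=0.0):
    """Modelled running time p*n*log2(n) + q (constants are assumptions)."""
    return p * n * math.log2(n) + q


def cost_quadratic(n, p=1.0, q=0.0):
    """Modelled running time p*n^2 + q (constants are assumptions)."""
    return p * n * n + q


def crossover(limit=1_000_000):
    """Smallest n >= 2 at which the n*log(n) model is cheaper, or None."""
    for n in range(2, limit + 1):
        if cost_nlogn(n) < cost_quadratic(n):
            return n
    return None


if __name__ == "__main__":
    print("ratio at n=2:", cost_nlogn(2) / cost_quadratic(2))
    print("n*log(n) model wins from n =", crossover())
```

With these made-up constants the list has to grow into the thousands before the asymptotically better algorithm actually wins; different constants move the crossover point but not the conclusion.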

## Re:The video card in question.. (Score:2, Informative)

## Re:Um... (Score:3, Informative)

Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."

The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).

Here is the relevant Wikipedia article [wikipedia.org].
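To make the two kinds of speedup concrete, here is a small sketch. The running times are hypothetical: the "linear" case assumes the work divides perfectly across p processors, and the "logarithmic" case assumes the parallel run is only log2(p) times faster.

```python
import math


def speedup(serial_time, parallel_time):
    """How many times faster the parallel run is than the serial one."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, processors):
    """Speedup per processor; 1.0 means perfect (linear) scaling."""
    return speedup(serial_time, parallel_time) / processors


if __name__ == "__main__":
    serial_time = 1000.0  # hypothetical single-processor running time
    for p in (2, 8, 64, 512):
        linear_time = serial_time / p          # assumed: perfect scaling
        log_time = serial_time / math.log2(p)  # assumed: speedup of log2(p)
        print(p,
              speedup(serial_time, linear_time),
              speedup(serial_time, log_time),
              efficiency(serial_time, log_time, p))
```

Efficiency stays at 1.0 in the linear case and falls off quickly in the logarithmic one, which is why the two are worth describing separately from big-O complexity.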

## Re:Also (Score:1, Informative)

Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.

Normally big-O notation is applied on a purely theoretical level where all operations are assumed to have the same base cost in terms of time to execute. This does not make the notation invalid for real world applications and implementations, however. But when doing so you have to adjust your formulas by adding the proper weighting to execution time. In practice this is usually a waste of effort as it's generally faster to just write an implementation and time it on various pieces of hardware.
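The "just write it and time it" approach might look like the sketch below. The insertion sort, the helper name `time_call`, and the input sizes are all illustrative choices; the numbers it prints depend entirely on the machine it runs on.

```python
import random
import time


def insertion_sort(values):
    """Plain O(n^2) insertion sort; returns a new sorted list."""
    result = list(values)
    for i in range(1, len(result)):
        item = result[i]
        j = i - 1
        # Shift larger elements right until the slot for item is found.
        while j >= 0 and result[j] > item:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = item
    return result


def time_call(func, data, repeats=3):
    """Best wall-clock time in seconds over several runs of func(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs see the same input
    for n in (100, 1_000, 2_000):
        data = [rng.randrange(1_000_000) for _ in range(n)]
        print(n, time_call(insertion_sort, data), time_call(sorted, data))
```

Taking the best of a few runs is a common way to reduce noise from other processes; it does not replace the big-O analysis, it just tells you where the constants land on one particular machine.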

In the article we're talking about, they are comparing a single implementation of a single algorithm on several pieces of hardware. So first of all the summary shouldn't be shouting about breaking any kind of record. They weren't trying to hit any particular benchmark; it's a relative test of the hardware, not the algorithm or implementation.

Using big-O and making comparisons is useful because, if all we use is this type of test, the answer to any speed problem is simply "Get faster hardware, or buy a piece of hardware which runs this code faster". In all likelihood, there are other methods which, when implemented on the same hardware, may yield much faster results. Heck, it's possible someone else's implementation of the same algorithm may yield faster results as well.

With regard to your comments about raytracing vs. polygon rendering, all I'm going to say is that you don't have a very good concept of what raytracing really is if you think a sphere created from a million triangles will raytrace faster than one modeled as a mathematical sphere. It won't; those demonstrations are a pure head-to-head comparison operating on a scene which has actually been optimized for a non-raycasting technique.

## Re:No (Score:2, Informative)

## Re:Excel Charts (Score:4, Informative)

## Re:The video card in question.. (Score:2, Informative)

Chips on the GTX 480, C2050, and C2070 come from the exact same die and wafer. The C20XX GPUs actually run at a lower clock speed for 32-bit floating-point and integer operations than a GTX 480.

C20XX series hardware is intended for physics/science/engineering calculations, where double-precision is preferred. The C20XX series is 4 times faster at double-precision calculations than the GTX 480. This is the sweet spot it is tuned for.

## Re:I think the bubble sort would be the wrong way (Score:1, Informative)

His naivety hit the real world and exploded.