Graphics Software

Using GPUs For General-Purpose Computing 396

Paul Tinsley writes "After seeing the press releases from both Nvidia and ATI announcing their next-generation video card offerings, it got me to thinking about what else could be done with that raw processing power. These new cards weigh in with transistor counts of 220 and 160 million (respectively) with the P4 EE core at a count of 29 million. What could my video card be doing for me while I am not playing the latest 3D games? A quick search brought me to some preliminary work done at the University of Washington with a GeForce4 TI 4600 pitted against a 1.5GHz P4. My favorite excerpt from the paper: 'For a 1500x1500 matrix, the GPU outperforms the CPU by a factor of 3.2.' A PDF of the paper is available here."
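The CPU side of the benchmark quoted in the story is easy to get a feel for. Below is a minimal sketch (not the paper's code; NumPy is used purely for illustration) that times the same 1500x1500 single-precision multiply and estimates the throughput:

```python
import time

import numpy as np

# The workload from the UW paper: a 1500x1500 float32 matrix multiply.
n = 1500
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0

# Each output element is a length-n dot product, so the multiply
# performs roughly 2*n**3 floating-point operations in total.
flops = 2 * n**3
print(f"{elapsed:.3f} s, ~{flops / elapsed / 1e9:.1f} GFLOP/s")
```

The GFLOP/s figure gives a rough baseline to compare against the paper's reported 3.2x GPU speedup.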
This discussion has been archived. No new comments can be posted.

Using GPUs For General-Purpose Computing

Comments Filter:
  • Googled HTML (Score:5, Informative)

    by balster neb ( 645686 ) on Sunday May 09, 2004 @02:54AM (#9098535)
    Here's a HTML version of the PDF [216.239.57.104], thanks to Google.

  • by Anonymous Coward on Sunday May 09, 2004 @02:57AM (#9098550)
    General-purpose computation using graphics hardware has been a significant topic of study for the last few years. Pointers to a lot of papers and discussion on the subject are available at: www.gpgpu.org [gpgpu.org]
  • by Knightmare ( 12112 ) on Sunday May 09, 2004 @02:59AM (#9098556) Homepage
    Yes, it's true that it has that many transistors BUT, only 29 million of them are part of the core, the rest is memory. The transistor count on the video cards does not count the ram.
  • PDF to HTML (Score:2, Informative)

    by Libraryman ( 721151 ) on Sunday May 09, 2004 @03:02AM (#9098566)
    Here [adobe.com] is a link at Adobe where you can turn any PDF into HTML.
  • Hacking the GPU (Score:5, Informative)

    by nihilogos ( 87025 ) on Sunday May 09, 2004 @03:03AM (#9098572)
    A course on using GPUs for numerical work has been offered at Caltech since last summer. Course page is here [caltech.edu].
  • by Anonymous Coward on Sunday May 09, 2004 @03:05AM (#9098580)
    Before you get excited, just remember how asymmetric the AGP bus is. Those GPUs will be of much better use when we get them as 64-bit PCI cards.
  • Siggraph 2003 (Score:5, Informative)

    by Adam_Trask ( 694692 ) on Sunday May 09, 2004 @03:21AM (#9098626)
    Check out the publication list in Siggraph 2003. There is a whole section named "Computation on GPUs" (papers listed below). And the papers for Siggraph 2004 should be out shortly.

    If you have a matrix solver, there is no telling what you can do. And I remember these papers show that the GPU is faster at the matrix calculations than doing the same stuff on the CPU.

    # Linear Algebra Operators for GPU Implementation of Numerical Algorithms
    Jens Krüger, Rüdiger Westermann

    # Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
    Jeff Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder

    # Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware
    Karl E. Hillesland, Sergey Molinov, Radek Grzeszczuk

  • here ya go (Score:4, Informative)

    by dave1g ( 680091 ) on Sunday May 09, 2004 @03:22AM (#9098630) Journal
    someone else posted this...

    www.gpgpu.org [gpgpu.org]

    Website on this topic (Score:0)
    by Anonymous Coward on Sunday May 09, @01:57AM (#9098550)
    General-purpose computation using graphics hardware has been a significant topic of study for the last few years. Pointers to a lot of papers and discussion on the subject are available at: www.gpgpu.org [gpgpu.org]
  • by LinuxGeek ( 6139 ) <djand.ncNO@SPAMgmail.com> on Sunday May 09, 2004 @03:27AM (#9098641)
    If they are ignoring the cache on the P4 EE, then why mention the Extreme Edition at all? Cache size is the only difference between the Xeon-based EE and a regular Northwood P4. Also, modern GPUs certainly do have cache. Read this old GeForce4 preview [pcstats.com] .
    The Light Speed Memory Architecture (LMA) that was present in the GeForce3 has been upgraded as well, with its major advancements in what nVidia calls Quad Cache. Quad Cache includes a Vertex Cache, Primitive Cache, Texture Cache and Pixel Caches. With functions similar to caches on CPUs, these are specific: they store exactly what they say.
    Another good article [digit-life.com] has a block diagram showing the cache structures of the GeForce FX GPU. Nvidia and ATI both keep quiet about the cache sizes on their GPUs, but that doesn't mean that the full transistor count is dedicated to the processing core.
  • by Lord Prox ( 521892 ) on Sunday May 09, 2004 @03:28AM (#9098645) Homepage
    BrookGPU [stanford.edu]
    from the BrookGPU website...
    As the programmability and performance of modern GPUs continues to increase, many researchers are looking to graphics hardware to solve problems previously performed on general purpose CPUs. In many cases, performing general purpose computation on graphics hardware can provide a significant advantage over implementations on traditional CPUs. However, if GPUs are to become a powerful processing resource, it is important to establish the correct abstraction of the hardware; this will encourage efficient application design as well as an optimizable interface for hardware designers.

    From what I understand, this project is aimed at making an abstraction layer for GPU hardware so writing code to run on it is easier and standardized.
  • Pseudo repost (Score:4, Informative)

    by grape jelly ( 193168 ) on Sunday May 09, 2004 @03:42AM (#9098682)
    I thought this looked familiar:

    http://developers.slashdot.org/developers/03/12/21 /169200.shtml?tid=152&tid=185 [slashdot.org]

    At least, I would imagine most of the comments would be the same or similar....
  • Re:Altivec (Score:2, Informative)

    by John Starks ( 763249 ) on Sunday May 09, 2004 @03:48AM (#9098702)
    I would guess the difference would be comparable. Altivec is no more impressive than the SSE/SSE2/etc. types of instructions of the modern x86.
  • Re:Not so... (Score:2, Informative)

    by Anonymous Coward on Sunday May 09, 2004 @04:21AM (#9098784)
    Hmm. My Newton has a "160Mhz StrongARM SA-110 RISC Processor". Doesn't sound like a GPU to me.
  • by BlueJay465 ( 216717 ) on Sunday May 09, 2004 @04:37AM (#9098831)
    Well they already make DSP cards for audio processing. Simply do a google(TM) search for "DSP card" and you will get [uaudio.com] several [tcelectronic.com] vendors. [digidesign.com]

    I can't imagine it would take a whole lot to hack them for just their processing power outside of audio applications.
  • by nothings ( 597917 ) on Sunday May 09, 2004 @04:42AM (#9098843) Homepage
    Transistor counts keep growing, so I keep updating this and reposting it about once a year.

    486 : 1.2 million transistors
    Pentium : 3 million transistors
    Pentium Pro : 5.5 million transistors
    Pentium 2 : 7.5 million transistors
    Nvidia TNT2 : 9 million transistors
    Alpha 21164 : 9.3 million (1994)
    Alpha 21264 : 15.2 million (1998)
    Geforce 256 : 23 million transistors
    Pentium 3 : 28 million transistors
    Pentium 4 : 42 million transistors
    P4 Northwood : 55 million transistors
    GeForce 3 : 57 million transistors
    GeForce 4 : 63 million transistors
    Radeon 9700 : 110 million transistors
    GeForce FX : 125 million transistors
    P4 Prescott : 125 million transistors
    Radeon X800 : 160 million transistors
    P4 EE : 178 million transistors
    GeForce 6800 : 220 million transistors
    here's the non-sucky version [nothings.org] since <ecode> doesn't actually preserve spacing like <pre>.
  • by NanoGator ( 522640 ) on Sunday May 09, 2004 @04:51AM (#9098862) Homepage Journal
    "The graphics card has a lot of unused computing power, nearly equal to the main processor chip in the computer if not more, that is not being used when there is no game or video being played, right?"

    Longhorn is supposed to offload a lot of the GUI stuff to the card. So yeah, it'd take advantage of untapped power of the card. However, as for other general-purpose stuff, it wouldn't be so interesting. It's kinda like comparing a Ferrari to a school bus. The Ferrari will run circles around the bus, but can only ferry 2 people. The bus can move a LOT of cargo, but not as fast as the Ferrari. We're talking about specialization here. The trick is to find ways to take what the GPU is good at and make it useful.
  • Audio DSP (Score:4, Informative)

    by buserror ( 115301 ) * on Sunday May 09, 2004 @05:23AM (#9098926)
    I've been thinking about using the GPU for audio DSP work for some time, even got to a point where I could transform some signal by "rendering" it into a texture (in a simple way, I could mix two sounds using the alpha as a factor).
    The problem is that these cards are made to be "write only" and that basically fetching anything back from them is *very* slow, which makes them totally useless for the purpose, since you *know* the results are there, but you can't fetch them in a useful/fast manner.
    I wonder if it's deliberate, to sell the "pro" cards they use for the rendering farms.
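The alpha-based mixing described in the comment above can be sketched in ordinary code; the blend equation is the same one fixed-function hardware applies per pixel. The function name and sample data here are made up for illustration:

```python
import numpy as np

def alpha_mix(src, dst, alpha):
    # Standard blend equation: out = alpha*src + (1 - alpha)*dst,
    # applied elementwise to the two "textures" (audio buffers).
    return alpha * src + (1.0 - alpha) * dst

t = np.linspace(0, 2 * np.pi, 8, dtype=np.float32)
a = np.sin(t)               # one sound, stored as a texture row
b = np.cos(t)               # the other sound
mixed = alpha_mix(a, b, 0.25)   # 25% of a, 75% of b
```

On the GPU this is one blended quad per mix; the slow part, as the comment notes, is reading `mixed` back over the bus.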
  • Re:audio stuff (Score:4, Informative)

    by Hast ( 24833 ) on Sunday May 09, 2004 @05:25AM (#9098935)
    Look at gpgpu.org. I believe they have papers on doing FFTs on GPUs. They also have a collection of papers on using GPUs as CPUs.
  • by Hast ( 24833 ) on Sunday May 09, 2004 @05:31AM (#9098945)
    Well, it's really more that the pipelines are very long. On the order of 600 pipeline stages, and that's pretty damned long. (The P4, a CPU with a deep pipeline, has 21 stages IIRC.)

    They do of course store data between those stages, and there are caches on the chip. Otherwise performance would be shot all to hell.

    I doubt that the original statement that GPU designs don't count the on chip memory is correct. That just seems like an odd way to do it.
  • Re:Dual Core (Score:3, Informative)

    by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Sunday May 09, 2004 @05:58AM (#9098990) Homepage
    Video cards are already able to run many things in parallel- they are beyond dual-core.
  • by Crazy Eight ( 673088 ) on Sunday May 09, 2004 @06:00AM (#9098993)
    QE is cool, but it doesn't do anything similar at all to what they're talking about here. FFTs on an NV30 are only incidentally related to texture mapping window contents. Check out gpgpu.org or BrookGPU. In a sense, the idea is to treat modern graphics hardware as the next step beyond SIMD instruction sets. Incidentally, e17 exploited (hardware) GL rendering of 2D graphics via evas a bit before Apple put that into OS X.
  • by Black Parrot ( 19622 ) on Sunday May 09, 2004 @06:21AM (#9099030)


    > Transistor counts keep growing, so I keep updating this and reposting it about once a year.

    For those who don't already know, what we now think of as "Moore's Law" was originally a statement about the rate of growth in the number of transistors on a chip, not about CPU speed.

  • by sonamchauhan ( 587356 ) <sonamc@PARISgmail.com minus city> on Sunday May 09, 2004 @06:38AM (#9099073) Journal
    Seems worth checking out: GPGPU.ORG [gpgpu.org] - "General-Purpose Computation Using Graphics Hardware"

    > AGP does a lot better taking data in, but it's still pretty
    > costly sending data back to the CPU.
    I've heard that mentioned a few times, is it true?

    From the AGP 3.0 spec [intel.com]:
    The AGP3.0 interface is designed to support several platform generations based upon 0.25 µm (and
    smaller) component silicon technology, spanning several technology generations. As with AGP2.0, the
    physical interface is designed to operate at a common clock frequency of 66 MHz. Its source
    synchronous data strobe operation, however, is octal-clocked and transfers eight double words
    (Dwords) of data within the span of time consumed by a single common clock cycle. The AGP3.0 data
    bus provides a peak theoretical bandwidth of 2.1 GB/s (32 bits per transfer at 533 MT/s). Both the
    common clock and source synchronous data strobe operation and protocols are similar to those
    employed by AGP2.0.


    Later on Page 96:
    Traditional AGP devices can demand up to the maximum bandwidth available over the AGP ports.
    However, the AGP system does not guarantee to deliver the requested bandwidth, nor does it guarantee
    transfers will take place within some clearly specified request/transfer latency time. ...
    This is done by the system guaranteeing to process a specified number (N) of read or write transactions of a specified size (Y) during each isochronous time period (T). An AGP3.0 device can divide this bandwidth between read and write traffic as appropriate. Further, the system transfers isochronous data over the AGP3.0 Port within a specified latency (L).

    (emphasis mine)

    I'm no expert, just asking if the "low upstream bandwidth" assumption is true. If it is, there could still be some applications (eg: simple data compression) that could use it. Also, maybe output from the VGA/DVI ports could be tapped.
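The peak figure in the spec excerpt above checks out arithmetically; a quick sketch of the math, with the numbers taken straight from the quote:

```python
# AGP 3.0 peak bandwidth, per the quoted spec: 32 bits per transfer
# at 533 MT/s (octal-clocked off the 66 MHz common clock).
bits_per_transfer = 32
transfers_per_sec = 533e6      # 8 transfers per 66 MHz common clock cycle
bytes_per_sec = bits_per_transfer / 8 * transfers_per_sec
print(f"{bytes_per_sec / 1e9:.2f} GB/s")   # 2.13 GB/s
```

Note this is the peak in one direction; the asymmetry complained about elsewhere in the thread is about readback, which the spec's peak number doesn't address.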

  • Re:audio stuff (Score:3, Informative)

    by zsazsa ( 141679 ) on Sunday May 09, 2004 @09:17AM (#9099435) Homepage
    It would be really neat if I could do some of the more complicated audio analysis (FFT etc) that requires lots of vector math using the video cards gpu.

    There's a company that actually does this. The Universal Audio UAD-1 [uaudio.com] audio DSP had a previous life as a video card and a DVD hardware accelerator. Check out this thread on the UAD forums [chrismilne.com] for more technical information.
  • Comment removed (Score:3, Informative)

    by account_deleted ( 4530225 ) on Sunday May 09, 2004 @09:49AM (#9099537)
    Comment removed based on user account deletion
  • Re:Dual Core (Score:3, Informative)

    by BrookHarty ( 9119 ) on Sunday May 09, 2004 @10:50AM (#9099787) Journal
    I can tell up to about 80-ish FPS, but I run the refresh rate at 85 or 100 for no flicker. So yes, there is a point to higher FPS. But you didn't say you played video games. And if you turn vsync off you get tearing.

    I remember a while back someone did Quake 2 benchmarks on accuracy vs FPS, and how 79 FPS (I think) was the sweet spot; faster with a lower refresh rate had a negative effect on accuracy.

    But I won't argue 20 FPS over 80; 100 seems to be the target. imho
  • by thurin_the_destroyer ( 778290 ) on Sunday May 09, 2004 @10:54AM (#9099818)
    Having done similar work for my final year project this year, I have some experience attempting general-purpose computation on a GPU. The results I received when comparing the CPU with the GPU were very different, with many of the applications coming in at 7-15 times slower on the GPU. Further, I discovered some problems, which I mention below:

    ! Matrix results
    As is mentioned earlier in the report, the graphics pipeline does not support a branch instruction. So with a limited number of assembly instructions that can be executed in each stage of the pipeline (either 128 or 256 in current cards), how is it possible for them to perform a 1500x1500 matrix multiplication? To calculate a single result, 1500 multiplications would need to take place, and even if they are really clever about how they encode the data into textures to optimise access, they would need two texture accesses for every 4 multiplications. By my calculations that is 1875 instructions, where you can only do 128 or 256.

    My tests found that, using the Cg compiler provided by NVidia, a matrix of size 26x26 could be multiplied before the unrolling of the for loop exceeded the 256-instruction limitation.

    One aspect that my evaluation did not get to examine was the possibility of reading partial results back from the framebuffer to the texture memory, along with loading a slightly modified program to generate the next partial result. They don't mention if they used this strategy, so I assume that they didn't.

    ! Inclusion of a branch instruction
    Even if a branch instruction were to be included in the vertex and fragment stages of the pipeline, it would cause serious timing issues. As a student of Computer Science, I have been taught that the pipeline operates at the speed of the slowest stage, and from designing simple pipelined ALUs, I see the logic behind it. However, if a branch instruction is included, then the fragment processing stage could become the slowest, as the pipeline stalls waiting for the fragment processor to output its information into the framebuffer. I believe it is for this reason that the GPU designers specifically did not include a branch instruction.

    ! Accuracy
    My work also found a serious accuracy issue with attempting computation on the GPU. Firstly, the GPU hardware represents all numbers in the pipeline as floating point values. As many of you can probably guess, this brings up the ever-present problem of 'floating point error'. The interface between GPU and CPU is traditionally 8-bit values. Once they are imported into the 32-bit floating point pipeline, the representation has them falling between 0 and 1, meaning that these numbers must be scaled up to their intended representations (integers between 0 and 255, for example) before computation can begin. Combine these two necessary operations and what I saw was a serious accuracy issue, where five of my nine results (in the 3x3 matrix) were one integer value out.

    While I don't claim to be an expert on these matters, I do think there is the possibility of using commodity graphics cards for general-purpose computation. However, using hardware that is not designed for this purpose holds some serious constraints in my opinion. Anyone who cares to look at my work can find it here [netsoc.tcd.ie]
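The multipass strategy this comment raises (reading partial results back and running a modified program for the next partial result) can be sketched on the CPU. The chunk size stands in for the 128/256-instruction limit discussed above; this illustrates the strategy, not actual GPU code:

```python
def multipass_dot(a, b, chunk):
    # Each iteration stands in for one rendering pass: compute a
    # partial sum small enough to fit the instruction limit, "read
    # it back", and accumulate it into the next pass.
    total = 0.0
    for i in range(0, len(a), chunk):
        total += sum(x * y for x, y in zip(a[i:i+chunk], b[i:i+chunk]))
    return total

# A 1500-element dot product split into passes of 100 multiplies
# each, so no single pass exceeds a ~256-instruction budget.
row = [1.0] * 1500
col = [2.0] * 1500
result = multipass_dot(row, col, 100)   # 15 passes, result 3000.0
```

The cost, as the rest of the thread points out, is that every "readback" between passes crosses the slow GPU-to-CPU path.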
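The accuracy issue described under "! Accuracy" above (8-bit values normalized into [0,1], scaled back up, then quantized on readback) can be illustrated with a small sketch. Truncation at readback is an assumption about where the off-by-one errors come from:

```python
import numpy as np

# Round-trip of every 8-bit value through the float pipeline:
# upload normalizes 0..255 into [0,1], the shader scales back up,
# and readback quantizes to 8 bits again. Truncating (rather than
# rounding) at readback can lose a whole code value.
x = np.arange(256, dtype=np.uint8)
f = x.astype(np.float32) / 255.0          # upload: 0..255 -> [0,1]
back = (f * 255.0).astype(np.uint8)       # readback with truncation
err = back.astype(int) - x.astype(int)    # each entry is 0 or -1
```

With rounding instead of truncation the round trip is exact; chained arithmetic between the scalings is where errors like the reported off-by-one accumulate.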
  • by mc6809e ( 214243 ) on Sunday May 09, 2004 @11:00AM (#9099850)
    Yes, it's true that it has that many transistors BUT, only 29 million of them are part of the core, the rest is memory. The transistor count on the video cards does not count the ram.

    Sure it does; it's just that the RAM isn't cache, it's mostly huge register files.

  • Re:Commodore 64 (Score:3, Informative)

    by curator_thew ( 778098 ) on Sunday May 09, 2004 @11:42AM (#9100093)
    I don't recall exactly: maybe Horizon, definitely Scandinavian. I remember because I decompiled it! What happened was that I started the demo, and unusually the disk drive kept spinning: so I turned it off, which caused the demo to fail. I tested loading, then trying to start the demo, and it didn't work, so curiosity, an Action Replay and an IRQ investigation revealed what was going on. I think it was a single-part demo: the most memorable C64 demo for me because of that trick.

  • Re:Commodore 64 (Score:1, Informative)

    by Anonymous Coward on Sunday May 09, 2004 @11:55AM (#9100181)
    This concept was being used back in 1988. The Commodore 64 (1 MHz 6510, a 6502-like microprocessor) had a peripheral 5.25" disk drive called the 1541, which itself had a 1 MHz 6510 CPU in it, connected via a serial link.

    Actually, the processor in the 1541 was an ordinary 6502. The 6510 added some memory mapping stuff that the drive didn't need.
  • by peter303 ( 12292 ) on Sunday May 09, 2004 @12:01PM (#9100235)
    GPUs pass input and output from GPU memory at 4-12 bytes per flop. This is much faster than CPUs, which are limited by bus speeds that are likely to deliver a number only every several operations. So CPU benchmarks are bogus, using algorithms that use internal memory over and over again.

    It's not always easy to reformulate algorithms to fit streaming memory and other limitations of GPUs. This issue has come up in earlier generations of custom computers. So, there are things like cyclic matrices that map multi-dimensional matrix operations into 1-D streams, and so on.

    The 2003 SIGGRAPH had a session [siggraph.org] on this topic showing you could implement a wide variety of algorithms outside of graphics.
  • by Slack3r78 ( 596506 ) on Sunday May 09, 2004 @12:24PM (#9100393) Homepage
    Actually, the GeForce 6800 includes the hardware to do just that [anandtech.com]. I'm surprised no one else has mentioned it by now, as I thought it was one of the cooler features of the new NV40 chipset.
  • by Anonymous Coward on Sunday May 09, 2004 @02:10PM (#9101050)
    ATI has had this for even longer. The all-in-wonder series uses the video card to do accelerated encoding and decoding.

    Also, I believe that mplayer, the best video player/encoder I have seen also uses openGL (and thus the video card on a properly configured system) to do playback.

    Personally, I don't think there is anything really new in this article.
  • Re:What comes next. (Score:4, Informative)

    by SmackCrackandPot ( 641205 ) on Sunday May 09, 2004 @04:56PM (#9101760)
    64-bit floating point texture filtering and blending and support for the D3D vertex and pixel shader 3.0 standard,

    That's 64 bits for a four-element vector (RGBA) or (XYZW), which is thus 16 bits per float. This is referred to as the 'half' floating point data type, as opposed to 'float' or 'double'. This is compatible with Renderman.
  • Re:Three questions (Score:3, Informative)

    by be-fan ( 61476 ) on Sunday May 09, 2004 @06:12PM (#9102106)
    1. Is anyone except Apple trying to leverage the GPU for non-3D tasks? Apple has been doing Quartz Extreme for a while but I have not heard if anyone else is doing it.
    Microsoft, for Longhorn, and freedesktop.org, for X11. Both go quite a bit beyond Quartz Extreme by using D3D/OpenGL for all drawing, not just compositing.

    3. How come GPU makers are not trying to make a CPU by themselves?
    GPUs are very different from CPUs. Graphics is almost infinitely parallelizable, so you are really just limited by how many execution units you can stick on the GPU. Assuming enough memory bandwidth, you get nearly a linear increase with increasing numbers of execution units. CPUs, on the other hand, deal with general-purpose code that has an inherent parallelism of about 3-way to 4-way at most. So CPU manufacturers have to do clever things like SMT to take advantage of increased execution resources, but mainly must concentrate on ramping up clock speed and memory bandwidth.

    Interestingly enough, GPU makers wouldn't be very good at making CPUs. GPUs are designed using high-level tools, like VHDL. This has a big impact on their maximum clock speed, but that doesn't really matter, because they can always double the number of pipelines and get a nearly 2x increase in performance. Meanwhile, CPUs are designed by hand, and tweaked to get every last MHz, because throwing twice as many execution units on the CPU wouldn't help performance much at all.
  • by cehardin ( 163989 ) on Sunday May 09, 2004 @06:59PM (#9102315)
    Utter crap, fanboy. OS X's directory structure is a basic UNIX system hidden by the file manager, with applications thrown on to '/'."

    Boy, you really have no idea what the heck you are talking about, do you? Of course the basic UNIX stuff is there, /bin, /sbin, /usr/local, all that stuff.

    Those directories have very few files in them; you will also notice a lack of init.d startup scripts. Most of the system is contained in /System.

    For example, rather than /etc/init.d, it has startup services in /System/Library/StartupItems. For instance, there is an Apache folder; in it are the scripts necessary to start Apache, along with a file which describes Apache's dependencies. Also, these startup items are multilingual. You can boot into any language you want. All of this in one folder. That's f*cking elegance, yet it is only a very small example.

    Check it out, you will see.

  • Re:video stuff (Score:1, Informative)

    by Anonymous Coward on Sunday May 09, 2004 @09:55PM (#9103178)
    Pinnacle has a video editing product called Liquid Edition that uses the GPU for processing video effects & such.
  • by cliffwoolley ( 506733 ) on Monday May 10, 2004 @12:47AM (#9103976)

    As for organizations beating slashdot to the punch on this one, that's true... but it's good to see this getting even more exposure. :)

    GPGPU (General-Purpose computation on GPUs) was a hot topic at various conferences in 2003; a number of papers were published on the subject. At SIGGRAPH 2004 [siggraph.org] there will be a full-day course [gpgpu.org] on GPGPU given by eight of the experts in the field (including myself).

    Mark Harris of NVIDIA [nvidia.com] maintains a website [gpgpu.org] dedicated to GPGPU topics, including discussion forums and news postings. Well worth a browse if you're interested in GPGPU topics.

    I look forward to seeing some of you at SIGGRAPH! :)

    --Cliff
