An Open Source Compiler From CUDA To X86-Multicore
Gregory Diamos writes "An open source project, Ocelot, has recently released a just-in-time compiler for CUDA, allowing the same programs to be run on NVIDIA GPUs or x86 CPUs and providing an alternative to OpenCL. A description of the compiler was recently posted on the NVIDIA forums. The compiler works by translating GPU instructions to LLVM and then generating native code for any LLVM target. It has been validated against over 100 CUDA applications. All of the code is available under the New BSD license."
Re:Alternative? (Score:3, Informative)
The reason is that today CUDA has a headstart and is more mature. Eventually things will probably shift to OpenCL but that takes time and people don't want to sacrifice features today.
Re:Alternative? (Score:3, Informative)
I've seen feature requests suggesting they are considering it, but at the moment too much information is lost in the PTX->LLVM step to be able to generate CAL or OpenCL.
Re:Alternative? (Score:3, Informative)
Here it is:
http://code.google.com/p/gpuocelot/issues/detail?id=30 [google.com]
Re:Alternative? (Score:2, Informative)
Pardon? OpenCL does not in any way bind you to an nVidia card. It was a standard created by Apple (not nVidia) and handed to Khronos to manage as an open standard (also not nVidia). ATI has just announced OpenCL drivers for its cards.
Re:Wait wut? (Score:3, Informative)
Which is exactly why you should be using OpenCL, not CUDA – because it lets the OpenCL driver decide whether to run it on the CPU or the GPU.
Re:Alternative? (Score:4, Informative)
He means CUDA was here first, and it does (did) lock you into Nvidia. So if you jumped on the bandwagon early, your code is Nvidia only. If you waited for a standard (OpenCL), or ported your app, then you're cross-platform.
Re:Alternative? (Score:3, Informative)
I think CUDA was out there first, and OpenCL came later.
Yes and no. CUDA and CTM/Brook+/FireStream came to life at more or less the same time, when NVIDIA and ATI realized that GPGPU (General Purpose computing on the GPU) was gaining traction in the scientific computing world (originally implemented using OpenGL and shaders).
OpenCL was essentially an effort (by Apple first and foremost, although obviously with cooperation from both NVIDIA and ATI) to get a standardized interface to SIMD multicore programming. It's actually quite close to low-level CUDA programming, although I'm not sure how close it is to the ATI solution (I've tried going through the ATI docs a couple of times, but their stuff is absolutely abysmal when compared to the NVIDIA docs and SDKs, sadly).
Re:Alternative? (Score:5, Informative)
it lets CUDA code run on x86, but still doesn't do anything for AMD graphics cards
Actually, it does. It lets CUDA code run on any processor that has an LLVM back end. The open source Radeon drivers have an experimental LLVM back end and use LLVM for optimising shader code.
Re:Doesn't sound like a compiler (Score:4, Informative)
There is no LLVM backend for AMD/ATI cards. Of the few of us that actually understand ATI hardware, most of us are working on other things besides GPGPU. Sorry.
Re:Alternative? (Score:2, Informative)
OpenCL isn't ALL that close either to CUDA or to anything from AMD (CAL, Brook+). The status quo with AMD is that their OpenCL implementation is very immature, e.g. it doesn't support a lot of fairly basic and highly desirable OpenCL "extensions" (actually it didn't support ANY until about 2 days ago, and now they're just beta testing a few of the most rudimentary ones). Additionally there are still lots of issues with missing / unclear documentation, missing features, bugs, development / runtime platform portability issues, et al. Most significantly, OpenCL performance is still a fraction of the performance commonly achievable with Brook+ or CAL in many common scenarios on the AMD platform. This is sometimes / often true for their 58xx series boards, and pervasively so for their older 4xxx series cards (which, because of architectural limitations as well as the lack of planned OpenCL development toolchain support / optimization, will never really perform well with OpenCL).
On the NVIDIA side, CUDA performance and usage flexibility is still typically and substantially higher than is achievable via OpenCL, since obviously CUDA exists to fairly optimally exploit their GPU architectural capabilities whereas OpenCL is a generic GPU-vendor / architecture "neutral" platform that doesn't give as much card specific control as CUDA (or CAL in AMD's case).
Development tools and platform portability are still poor in both the NVIDIA and AMD cases. NVIDIA, for instance, lacks CUDA/OpenCL support on platforms like Solaris and FreeBSD. AMD AFAIK doesn't even have graphics driver support (much less OpenCL/Stream/CAL/Brook+) on BSD, Solaris, Mac(?), and the support is still pretty rocky on Linux.
Linux open source drivers for AMD hardware are still barely at the stage of providing high quality basic 2D functionality for R600/R500 GPUs, R700 isn't there yet, and R800 is farther out still. In none of these cases does anything like Stream / Brook+ / OpenCL work with the open source driver. It seems as if it may take the better part of 2010 to go by before we see even the first good previews of OpenCL and decently useful 3D graphics running on R600/R700/R800 GPUs with Gallium, X.org, Mesa, et al. all coming together with the open source radeon drivers.
Basically if you want high performance within the next few months, plan on writing GPU model specific code in CUDA for NVIDIA, and deal with platform / software / card portability issues that will come up frequently. If you're targeting AMD, either target R800 generation cards only, or assume that you'll be getting only a fraction of the performance from R700/R600 cards using OpenCL, and even in the case of R800, don't assume there will be production quality comprehensive high performance driver/toolchain support before mid to late 2010.
If you just want stuff to be "portable" across GPU vendors and do graphics-like computations with the GPUs, use either OpenCL or DX11 (on Windows Vista/7 platforms), or just stick to shaders in DX9/DX10 for even better portability.
Don't expect OpenCL to be "write once run anywhere" with minimal developer issues or end user runtime configuration / linking issues for at least a few more months in the case of AMD/NVIDIA on Windows. As of now even a lot of developers have issues with DLL compatibility / versioning / paths / capability detection etc.
I think 18 months from now maybe it will be really a more streamlined experience to use OpenCL across OS platforms and GPU cards, but still probably mostly for GPU generations that are DX11 and beyond only, not really so much the legacy models (which are still 95% of the deployed market).
Re:Alternative? (Score:3, Informative)
On top of that, the CUDA tools are still much better than OpenCL. OpenCL is basically equivalent to CUDA's low-level "driver" interface, but it has no equivalent to the high-level interface that lets you combine host/device code in a single source, etc. CUDA also supports a subset of C++ for device code (e.g. templates), which I don't believe is the case for OpenCL. CUDA also has a debugger (of sorts), profiler, and in version 3 apparently a memory checker. But I haven't been following OpenCL that closely lately -- it may be catching up on the tool front.
If you're developing an in-house project where you have control over the hardware you're going to run on, or you know that most of your customers have Nvidia cards anyway, there are still good reasons to go with CUDA.
Re:Alternative? (Score:4, Informative)
The greatest challenges lie in accommodating arbitrary control flow among threads within a cooperative thread array. NVIDIA GPUs are SIMD multiprocessors, but they include a thread activity stack that enables serialization of threads when they reach diverging branches. Without hardware support, this kind of thing becomes difficult on SIMD processors, which is why Ocelot doesn't include support for SSE yet. It is also one of the obstacles for supporting AMD/ATI IL at the moment, though solutions are in order.
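To make the divergence problem concrete, here is a minimal Python sketch of mask-based serialization on a lockstep SIMD machine. It only illustrates the general idea of an activity-mask stack; the function name, the example branch, and the stack layout are assumptions made up for this example and are not taken from NVIDIA hardware documentation or from Ocelot's source.

```python
def run_divergent_branch(values):
    """Emulate SIMD lanes executing, in lockstep:
         if v % 2 == 0: v = v // 2
         else:          v = 3 * v + 1
    Hypothetical illustration only; not NVIDIA's or Ocelot's actual mechanism.
    Returns the per-lane results and a trace of the active masks used.
    """
    lanes = len(values)
    out = list(values)
    active = [True] * lanes      # every lane starts out active
    mask_stack = []              # the "thread activity stack"
    trace = []

    # Evaluate the branch predicate on every active lane.
    taken = [active[i] and out[i] % 2 == 0 for i in range(lanes)]
    not_taken = [active[i] and not taken[i] for i in range(lanes)]

    # Divergence: remember the mask to restore at the reconvergence point,
    # then the mask of the path that has to wait its turn.
    mask_stack.append(("reconverge", active))
    mask_stack.append(("else", not_taken))

    # Run the "then" path with only the taken lanes enabled.
    active = taken
    trace.append(("then", list(active)))
    for i in range(lanes):
        if active[i]:
            out[i] = out[i] // 2

    # Pop the deferred path and run it with the complementary lanes.
    label, active = mask_stack.pop()
    trace.append((label, list(active)))
    for i in range(lanes):
        if active[i]:
            out[i] = 3 * out[i] + 1

    # Reconverge: restore the mask from before the branch.
    label, active = mask_stack.pop()
    trace.append((label, list(active)))
    return out, trace
```

Both sides of the branch are executed one after the other, each with part of the lanes switched off. Without hardware that tracks these masks, the compiler has to generate all of this bookkeeping itself, which is the difficulty described above for plain SIMD instruction sets.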
Translation from PTX to LLVM to multicore x86 does not necessarily throw away information concerning the PTX thread hierarchy initially. The first step is to express a PTX kernel using LLVM instructions and intrinsic function calls. This phase is [theoretically] invertible and no information concerning correctness or parallelism is lost.
To get to multicore from here, a second phase of transformations inserts loops around blocks of code within the kernel to implement fine-grain multithreading. This is the part that isn't necessarily invertible or easy to translate back to GPU architectures and is what is referenced in the note you are citing.
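The loop-insertion idea can be sketched in a few lines of Python. This is a hand-written illustration under assumed names (the toy kernel and both functions are hypothetical); the real transformation works on LLVM IR and is not reproduced here. The toy kernel is: each thread writes shared[tid] = data[tid], waits at a barrier, then reads out[tid] = shared[n - 1 - tid].

```python
def reverse_kernel_serialized(data):
    """Run the toy kernel on one CPU thread by looping over thread IDs.
    The kernel is split at the barrier and each region gets its own loop,
    so every "thread" finishes region 1 before any thread starts region 2.
    """
    n = len(data)
    shared = [None] * n   # stands in for per-CTA shared memory
    out = [None] * n

    # Region 1: the code before the barrier, for every thread in the CTA.
    for tid in range(n):
        shared[tid] = data[tid]

    # The end of the loop above plays the role of the barrier.

    # Region 2: the code after the barrier, for every thread in the CTA.
    for tid in range(n):
        out[tid] = shared[n - 1 - tid]
    return out


def reverse_kernel_naive(data):
    """Incorrect variant: runs each thread to completion, ignoring the barrier."""
    n = len(data)
    shared = [None] * n
    out = [None] * n
    for tid in range(n):
        shared[tid] = data[tid]
        out[tid] = shared[n - 1 - tid]   # may read a slot not written yet
    return out
```

Once the kernel has been cut into per-region loops like this, the original "one thread per element with a barrier" structure is no longer explicit in the code, which gives a feel for why going back to a GPU form from this point is hard.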
Disclosure: I'm one of the core contributors to the Ocelot project.
Re:OpenCL not a magic bullet (Score:3, Informative)
Not saying that portability isn't a good thing, but a lot of people seem to be thinking that OpenCL will solve all your portability problems. It won't. It will only let code run on multiple architectures. You'll still have to more or less hand optimize to the architecture.
Like the argument of assembler vs C, I think as time goes on we will find ourselves with code that can do a better job than the programmer of optimising a given block of OpenCL code for a specific processing core. Sure, there will always be specific cases where the programmer can do a better job, but most programmers IMHO would rather write portable code and leave the optimisation to code which does a better job than they can, for lack of intrinsic knowledge and time.
Why? (Score:3, Informative)
So there seem to be several questions as to why people would want to use CUDA when an open standard exists for the same thing (OpenCL).
Well, honestly, the reason why I wrote this was because when I started, OpenCL did not exist.
I have heard the following reasons why some people prefer CUDA over OpenCL:
Additionally I would like to see a programming model like CUDA or OpenCL replace the most widespread models in industry (threads, OpenMP, MPI, etc.). CUDA and OpenCL are each examples of Bulk Synchronous Parallel [wikipedia.org] models, which are explicitly designed with the idea that communication latency and core count will increase over time. Although I think that it is a long shot, I would like to see more applications written in these languages so there is a migration path for developers who do not want to write specialized applications for GPUs, but can instead write an application for a CPU that can take advantage of future CPUs with multiple cores, or GPUs with a large degree of fine-grained parallelism.
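For readers who have not met the Bulk Synchronous Parallel model, here is a small Python sketch of its shape: independent per-worker computation in a superstep, followed by a barrier at which everyone's results become visible. The example uses a well-known step-doubling prefix sum (the Hillis-Steele scan); the function name and the sequential emulation of the workers are assumptions for illustration, not anything taken from CUDA, OpenCL, or Ocelot.

```python
def bsp_prefix_sum(values):
    """Inclusive prefix sum written as a sequence of BSP supersteps.
    Workers are emulated sequentially; in each superstep every worker reads
    only the state published at the previous barrier.
    Returns (result, number_of_supersteps).
    """
    current = list(values)
    n = len(current)
    stride = 1
    supersteps = 0
    while stride < n:
        # Compute phase: worker i combines its value with the value held by
        # worker i - stride as of the previous superstep.
        nxt = [
            current[i] + current[i - stride] if i >= stride else current[i]
            for i in range(n)
        ]
        # Barrier: all workers publish their new values at once.
        current = nxt
        stride *= 2
        supersteps += 1
    return current, supersteps
```

The worker count (the list length) can grow without changing the program, and communication happens only at the barriers, which is the property the paragraph above is pointing at.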
Most of the codebase for Ocelot could be re-used for OpenCL. The intermediate representation for each language is very similar, with the main differences being in the runtime.
Please try to tear down these arguments; it really does help.