Grand Unified Theory of SIMD
Glen Low writes "All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness."
Altivec (Score:5, Informative)
For those who want a little background on Altivec, Wikipedia has a description here [wikipedia.org]. Apple, who now ships Altivec in every system they make, has a pretty good page here [apple.com], and Motorola née Freescale has one here [freescale.com].
The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all, putting these kinds of calculations in hardware beats software computation hands down. It reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems, simply because you had four hardware DSPs running your image math.
Re:Altivec (Score:5, Informative)
Apple provides source code for some of their vector libraries [apple.com]
Re:Altivec (Score:3, Interesting)
I managed to pick up a ThunderIV last year with the DSP card, and had a run around with Photoshop on it. It's impressive stuff. I have an iMac 350 here
Other way around (Score:2)
Re:Altivec (Score:2)
For more, read http://en.wikipedia.org/wiki/Wiki [wikipedia.org].
Re:Altivec (Score:2)
Re:Altivec (Score:4, Informative)
And is part of every G4
More AltiVec Goodness (Score:4, Informative)
Re:More AltiVec Goodness (Score:2)
Re:More AltiVec Goodness (Score:3, Insightful)
Umm (Score:2, Informative)
Re:Umm (Score:3, Informative)
Re:Umm (Score:2)
Yes. (Score:3, Informative)
Re:Yes. (Score:3, Informative)
The page you link to shows how to code vector-based programs. What the parent is asking is whether the standard "Hello World" program can be auto-vectorized with one command-line argument, and that won't work currently.
The next version of Xcode (2.0), with GCC 3.4, will support partial auto-vectorization, as another comment noted.
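For readers wondering what "auto-vectorizable" means in practice, here is a minimal sketch (the function name and shape are my own, purely illustrative): a unit-stride loop with no dependencies between iterations, which is exactly the pattern a vectorizing compiler such as GCC with -ftree-vectorize can turn into SIMD code.

```cpp
#include <cstddef>

// The loop shape auto-vectorizers prefer: unit-stride array accesses
// and no iteration depending on the result of a previous one. Each
// group of 4 floats can then be done with one SSE/AltiVec operation.
void saxpy(float a, const float* x, const float* y,
           float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}
```

By contrast, a loop like `out[i] = out[i-1] + x[i]` carries a dependency from one iteration to the next and cannot be vectorized this way.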
A little background (Score:5, Informative)
Re:A little background (Score:2)
Re:A little background (Score:4, Informative)
Based on personal recollections reinforced by a quick Wiki'ing, MMX's problem wasn't the concept itself, but the braindead constraints Intel placed on x86 vector support. MMX recycled the same registers used for floating-point math, causing expensive context switches between the two modes, and it only allowed integer math to be vectorized. Intel eventually developed SSE to work around some of the bottlenecks, but the eventual dominance of GPUs on the PC platform reduced the development priority for vector math in the CPU.
Re:A little background (Score:2)
Re:A little background (Score:5, Informative)
MMX (x86): 8-byte registers, only integer operations
SSE (x86): 16-byte registers, single-precision float ops
AltiVec (PPC): 16-byte registers, integer and single-precision float ops
SSE2 (x86): 16-byte registers, double-precision float ops
In order to implement many complex algorithms on x86, you need to use a motley combination of MMX and SSE. There are many flaws in both; lots of very useful instructions are missing, and MMX can't be used in conjunction with non-SIMD floating-point operations without a huge expensive context switch. One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register. The only advantage on a modern x86 CPU is SSE2, which is the only SIMD unit with double-precision floats. But you can only work with two doubles at a time, so the speedup is not that great.
AltiVec, on the other hand, included both floats and integers right from the start, with no penalty for switching between them, and it includes a very detailed and useful set of instructions, including an awesome shuffle instruction. My personal experience, coding for both, is that AltiVec is about twice as useful as MMX/SSE/SSE2 combined.
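To make the shuffle point concrete, here is a portable sketch of what AltiVec's permute instruction (vperm) does — the real instruction performs all 16 byte lookups in a single operation; this plain-C++ loop is only an illustration of its semantics:

```cpp
#include <array>
#include <cstdint>

using Vec16 = std::array<std::uint8_t, 16>;

// vperm semantics: build a 16-byte result by picking arbitrary bytes
// out of a 32-byte pool (two source registers a and b), steered by a
// 16-byte selector vector. Only the low 5 bits of each selector byte
// matter: values 0-15 pick from a, 16-31 pick from b.
Vec16 perm(const Vec16& a, const Vec16& b, const Vec16& sel) {
    Vec16 out{};
    for (int i = 0; i < 16; ++i) {
        std::uint8_t s = sel[i] & 0x1F;
        out[i] = (s < 16) ? a[s] : b[s - 16];
    }
    return out;
}
```

MMX/SSE of that era had nothing this general, which is why rearranging data within a register took several instructions there.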
Also, note that in Mac OS X, many of the standard libraries and system calls are already AltiVec-optimized for you, and Apple also provides a great Vector library with lots of common DSP operations.
Re:A little background (Score:3, Informative)
Long thread about using Altivec (Score:5, Informative)
Read the Altivec mailing list (Score:5, Informative)
I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
License issues (Score:5, Informative)
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
Re:License issues (Score:2, Interesting)
Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.
--#voxlator
Re:License issues (Score:4, Informative)
True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.
One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.
The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing [slashdot.org], but he chose instead to do something a whole lot more obnoxious.
Re:License issues-Smells funny. (Score:2)
Re:License issues (Score:2)
IANAL, but I read the intent as "if you improve macstl you have to publish your changes to macstl" not "if you link macstl you have to publish source to the entire project".
Obviously I can't say which one matches the legalese.
Re:License issues (Score:3, Informative)
Look, a troll! The GPL doesn't require you to release your code, unless you distribute it. This RPL thing requires you to release your code, even if you don't distribute it. I've discussed the linking issue elsewhere.
Re:License issues (Score:2)
With C++ templates this is a very thorny issue. When your code instantiates the template, the library code is very inextricably an integral part of your code, and not easily (if at all) separable. This might be a different issue if it were a C library you could just call through an API.
Currently under the GPL/LGPL th
About the RPL (Score:5, Informative)
The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements the license places on you, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or existing BSD-style licensed code which you dual-license as BSD/RPL (since BSD can be included in anything).
So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.
Re:About the RPL (Score:3, Informative)
And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, since they can't notify him of changes?
Re:About the RPL (Score:2)
Yes, unless he has an identifiable successor-in-interest.
Re:About the RPL (Score:2)
It's no more incompatible than is a class that overrides a method of a superclass "incompatible" with that superclass. In this instance, the release "method" is more strict.
Pedantic Pissing Contests Aside (Score:2)
Re:About the RPL (Score:2)
Re:About the RPL (Score:2)
Re:About the RPL (Score:2)
Black Art? Uh... (Score:4, Interesting)
The nice thing about altivec is that it has a C interface. You don't have to use assembly!
Take a look at this Apple tutorial [apple.com] to see how easy it is.
Re:Black Art? Uh... (Score:4, Funny)
Re:Black Art? Uh... (Score:2)
But one could counter that even with the C interface, unless you know what you're doing, you may not get as dramatic a speedup as you wanted. Until I looked at several of Apple's examples, I couldn't write altivec-aware code properly (i.e. for maximum performance benefit).
Once I knew what I was doing I went back and redid the code, and it ran much faster. So it is still tricky to maximize your bang-for-buck.
More source-distro goodness to follow? (Score:2)
I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!
Yay!
Too expensive? (Score:2)
Re:Too expensive? (Score:2, Insightful)
Once you've paid a $30/hour developer for 10 days' work, you've forked out ~$2,500...
--#voxlator
Re:Too expensive? (Score:2)
But, given this is an optimization and replacement for STL then the question is "Do I just live with STL, or buy this technology?"
In other words, it isn't an essential development cost, it's an extra (I imagine most interested parties already have shipping apps that use STL).
And at this price point, IMHO, I think the answer may be "if it ain't broke, don't fix it."
Slides about SIMD (Score:2, Informative)
Assembly or C++? (Score:2)
TWW
Re:Assembly or C++? (Score:2)
TWW
Autovectorization being added in GCC 4.0 (Score:5, Interesting)
GCC vectorization project [gnu.org] (the site seems offline at the moment), but the abstract from a recent GCC summit [gccsummit.org] is up.
Autovectorization Talk (google html view of pdf) [216.239.57.104]
Re:Autovectorization being added in GCC 4.0 (Score:2)
Re:Autovectorization being added in GCC 4.0 (Score:2)
So yes, you might see some performance improvements due to vectorization in 4.0, but you'll have to wait until 4.1 or maybe even 4.2 before you'll see the full potential of it.
-joib, occasional GCC contributor (although I have absolutely zilch to d
It's in the compiler (Score:3, Informative)
The problem is bringing this functionality to open-source compilers, which as far as I know don't even have an OpenMP (threading) implementation, let alone internal vectorization.
Re:It's in the compiler (Score:2)
Intel has a great book on performance tuning that has been extremely helpful, as has Intel's VTune.
Re:It's in the compiler (Score:2)
Re:It's in the compiler (Score:2)
Actually, you DO get automagical compiler speedup. In some cases it can identify vector-izable (is that word?) loops and promote them to SIMD operations.
But yes, otherwise, you need to re-code if the compiler doesn't take the hint, especially in structures/classes. The only objection I have to the Intel intrinsics is they don't look pretty!
I haven't used VTune since circa 1998, and it had this awesome feature that would point out boneheaded things in your code. One interesting suggestion it made: i
Re:It's in the compiler (Score:2)
already exists (Score:3, Informative)
Re:already exists (Score:2)
The future (Score:4, Insightful)
The way forward is turning the CPU of a traditional architecture into a nanny for a range of dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect their job was much easier: the software that will run on the platform they are designing is fairly specific - games and multimedia, which usually lend themselves well to vectorization.
The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.
AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, it will hit a roadblock when memory bandwidth and caching schemes just have too much work to do, with 4 or 8 processing cores hacking at them all the time.
Re:The future (Score:2)
Isn't it what std::valarray is for? (Score:2)
Re:Isn't it what std::valarray is for? (Score:3, Insightful)
That's exactly what this is. If you read the part on his website about valarray [pixelglow.com] then you'll see that it does extensive SIMD optimizations for valarray for both Altivec and MMX/SSE/SSE2/SSE3 platforms. He's even added "parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic" which you'd have to code yourself in either ass
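For anyone who hasn't used it, here is what plain std::valarray code looks like (this sketch is mine, not from macstl's docs): the interface expresses whole-array math without explicit loops, which is precisely what macstl claims to re-implement on top of AltiVec/SSE registers.

```cpp
#include <valarray>

// std::valarray lets you write elementwise math as one expression.
// A stock libstdc++ evaluates this element by element; macstl's
// claim is that the same source compiles down to SIMD operations.
std::valarray<float> axpy(const std::valarray<float>& x,
                          const std::valarray<float>& y) {
    return 2.0f * x + y;   // no hand-written loop
}
```

The appeal is that code already written against the standard valarray interface could, in principle, pick up the speedup with no source changes.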
Re:Isn't it what std::valarray is for? (Score:2)
OS X Tiger will do it for you (Score:2, Interesting)
Re:OS X Tiger will do it for you (Score:2)
For the record, this has been in Intel's C compiler for years now. It's also in the current release of the Microsoft Visual C++ compiler, including the free download version.
Re:OS X Tiger will do it for you (Score:5, Informative)
Re:OS X Tiger will do it for you (Score:2)
From the limewire... (Score:3, Interesting)
This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/SSE2/SSE3 provides only a marginal speed increase, while the massively parallel computation done on graphics cards dwarfs anything SIMD claims to produce.
Perhaps we will see GFX manufacturers selling their technology to the CPU makers.
I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4Ghz P4(executing SIMD code) can only produce around 5-6GFLOPS at best.
With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.
Algorithms (Score:2)
You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
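This "SIMD within a register" trick extends beyond pure bit logic. A classic sketch (the function here is illustrative, not from any library): one 32-bit addition can update four packed byte lanes at once, provided the carry between lanes is masked off.

```cpp
#include <cstdint>

// SIMD-within-a-register: add four packed 8-bit lanes with one
// 32-bit add. The low 7 bits of each lane are added directly; the
// top bit of each lane is recomputed with XOR so that no carry
// leaks from one lane into the next. Each lane wraps mod 256.
std::uint32_t add4x8(std::uint32_t a, std::uint32_t b) {
    std::uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}
```

Dedicated SIMD units generalize exactly this idea, with wider registers and hardware lane isolation, so no masking gymnastics are needed.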
Ignorant submitter, or smart marketing? (Score:3, Interesting)
Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?
Maybe it's just Ignorant criticism... (Score:4, Informative)
SIMD programming becomes as easy as a short valarray-style expression. He claims that his example code is 17.4x faster than CodeWarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
Re:Maybe it's just Ignorant criticism... (Score:2)
My point exactly. Does the story say cross-platform anywhere? No, it says :
programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians
er... so, instead of saying something like "here's a product which allows you to use the same API for both PPC and Intel SIMD", the submitter puts in th
liboil (Score:3, Interesting)
However, in the future I can see things changing in the structure of the standard PC.
At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GpGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.
Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.
I would say that in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general-purpose vector processor.
PCs will ship with two processors - one scalar, one vector. And everyone will be happy.
Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.
Why? Altivec-optimized libraries supplied by Apple (Score:4, Interesting)
Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/ve
Why limit yourself to Altivec when you have NVidia (Score:4, Insightful)
Re:Why limit yourself to Altivec when you have NVi (Score:3, Insightful)
OSI-approved RPL goodness. Admit it.... (Score:3, Funny)
Content Addressable Parallel Processors (Score:3, Interesting)
Fortunately there is at least a little ongoing research [mit.edu].
The beauty of these processors is that they integrate memory with computation, so that the massive economies of scale we witness in memory fabrication apply to computation speed as well, so long as we can move toward relational rather than functional computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, though those computers may never see the light of day for more fundamental reasons.
Re:16X increase? (Score:2, Interesting)
What is curious is that if you are using a pre-Altivec processor (G3), it'll burn more CPU time, while the same enhancement is natively supported by Altivec-enabled units: a 400MHz G4 PowerBook enhances these synths more efficiently than an 800MHz G3.
I guess this was like the simultaneous operations that the ARM assembly language supports (e.g. both storing and rotating values in an operation)...
Re:16X increase? (Score:5, Informative)
Re:16X increase? (Score:3, Interesting)
There are 32 of these registers (independent, not shared with the FPU) which means you can chain together a pretty complex series of calculations without intermediate load/store sequences. The unit has multiple independent computation units with their own dispatch queues (details vary between specific processor models). Some AltiVec opcodes are designed to common s
Re:16X increase? (Score:2, Informative)
A good example is what happens when you let the compiler decide how to do arithmetic with vectors and matrices.
Matrix a,b,c,x;
x = a + b + c;
The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads to do something like this:
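Something like the following, in fact. A hypothetical Matrix class (mine, just for illustration) with a naive operator+ makes the hidden temporaries visible:

```cpp
#include <cstddef>
#include <vector>

// A hypothetical Matrix with a naive operator+, instrumented to
// count the temporaries a plain compiler creates for x = a + b + c.
struct Matrix {
    static int temporaries;        // incremented per operator+ call
    std::vector<double> data;
    explicit Matrix(std::size_t n) : data(n, 0.0) {}
    Matrix operator+(const Matrix& rhs) const {
        Matrix t(data.size());     // one full temporary per '+' ...
        ++temporaries;
        for (std::size_t i = 0; i < data.size(); ++i)
            t.data[i] = data[i] + rhs.data[i];
        return t;                  // ... plus a full sweep over memory
    }
};
int Matrix::temporaries = 0;
```

So x = a + b + c builds a+b into one temporary, then adds c into a second: two allocations and two passes over the data. Expression-template libraries (the technique behind optimized valarray implementations like macstl's) fuse this into a single loop with no temporaries.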
Moore's Law has nothing to do with assembly (Score:2, Insightful)
Moore's Law has eroded the need for assembly
Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia [wikipedia.org]:
Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.
I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.
Re:Moore's Law has nothing to do with assembly (Score:2)
>> Moore's Law has eroded the need for assembly
> Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:...
The grandparent was saying that because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.
Re:Moore's Law has eroded the need for assembly (Score:3, Funny)
90% of the world's people do not own cars. Therefore, there is no need for gas stations. If you pick a living human completely at random from the earth, chances are they don't drive one of these "car" things.
Re:Moore's Law has eroded the need for assembly (Score:2)
I don't consider Doom 3 to be a niche.
Re:Moore's Law has eroded the need for assembly (Score:2)
Just because you use it, doesn't mean you engineer it.
You use a TV... when was the last time you even thought of any of the eletronics inside of it?
-M
Assembly (Score:3, Insightful)
Even in embedded systems, assembly isn't used as much as it used to be. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory-mapped, most of the driver is written in C, and asm() calls are made when appropriate (e.g., asm("eieio");), especially since gcc's asm() syntax lets you access variables directly.
Re:Assembly-DSPs (Score:2)
It is when programming DSP's (and related devices).
From my experience, yes and no. Fixed-point DSP tends to be done in assembly, mainly because fixed-point techniques don't translate well to C. The compilers also tend to suck. A fair-to-large amount of floating-point DSP is done in C when the compiler support is good. I have done a lot of floating-point DSP, and we found that the write-in-C, refine-in-ASM workflow was best.
Don't forget that microcontrollers outnumber microprocessors by a large margin.
Re:Moore's Law has eroded the need for assembly (Score:2, Insightful)
Sure, you could probably get it to work even faster with hand-tuned assembly than simply using this library. But programmer time is expensive, and customizing code adds complexity. By reusing optimized code, you can enjoy some of the benefits of SIMD without having to devote the same amount of resources.
Let's be honest, this isn't a silver bullet - this isn't going to speed up code that doesn't use lots of floating-point vectors anyway. But if it
Depends on what you are doing (Score:5, Insightful)
Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.
I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.
It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.
Re:Depends on what you are doing (Score:2)
Honestly: to be able to get a 1000x boost, your original code must have been beyond bullshit.
And of course using SIMD is better than not using it, but I would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation whether the results are real or you just botched some line isn't worth it.
Re:Depends on what you are doing (Score:2)
Baseless FUD. Why would a few dozen lines of hand coded assembly suddenly invalidate the results?
Re:Depends on what you are doing (Score:3, Insightful)
Nope. Technically, there are two constants buried in here. The definition is g(x) = O(f(x)) => g(x) <= k*f(x) where x > a, for some arbitrary a. If you don't change algorithms, all you can do is manipulate the k. For a given k and a given level of improvement, I can give you a new k that hits that level of improvement.
Honestly: TO be able to get a 1000 times boost, you
Re:Depends on what you are doing (Score:3)
Moore's Law is OVER (Score:2)
See Herb Sutter's article in the Feb C/C++ Users Journal or the (expanded) one in the March Dr. Dobb's Journal.
But times are changing, this is becoming valuable (Score:2)
While memory speed improvements will continue for a while, processor speeds are already falling off. Check out this graph from the article [www.gotw.ca], where he clearly shows what's happening.
This brings an interesting dilemma to modern pro
Re:Moore's Law has eroded the need for assembly (Score:4, Insightful)
Moore's Law has eroded the need for such knowledge
Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.
It would be like concerning myself on how to design circuits...
No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings made every day at institutions all over the world. You may not need that knowledge, but what you need isn't what's under discussion.
Yes, some people who write games are still concerned with assembly, as are people in embedded markets. But those jobs, situations and skills are niche
By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.
Those situations are about the last place you will find anyone caring about something called "assembly language."
Again, completely irrelevant.
The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.
These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.
Re:Moore's Law has eroded the need for assembly (Score:2)
Obviously, you arent a PS2 graphics programmer.. (Score:2)
On win32 / mac platforms, the need to know how to do this is pretty low. DirectX wraps most of it, as well as the processes needed for GPU programming. I am sure the Mac libs that do the same job as DirectX accomplish much the same.
But low-level graphics programming is alive and well for game programming. I do what I can to stay well clear of that, since I don't like graphics programming much (just personal preference