Grand Unified Theory of SIMD
Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "
Altivec (Score:5, Informative)
For those who want a little background on Altivec, Wikipedia of course has a description here [wikipedia.org]. Apple, which now ships Altivec in every system it makes, has a pretty good page here [apple.com], and Motorola nee Freescale has one here [freescale.com].
The benefits of Altivec can be truly astounding for those processes that can be "vectorized"; putting these kinds of calculations in hardware beats software computation hands down. It reminds me of when I got one of those Photoshop accelerator cards (a Radius PhotoEngine with 4 DSPs on a daughter card, linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems, simply because you now had four hardware DSPs running your image math.
More AltiVec Goodness (Score:4, Informative)
Umm (Score:2, Informative)
A little background (Score:5, Informative)
Long thread about using Altivec (Score:5, Informative)
License issues (Score:5, Informative)
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
Re:Altivec (Score:5, Informative)
Apple provides source code for some of their vector libraries [apple.com]
Re:Altivec (Score:4, Informative)
And is part of every G4
oops (Score:1, Informative)
Propellerheads.SE [propellerheads.se]
Re:16X increase? (Score:5, Informative)
About the RPL (Score:5, Informative)
The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).
So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.
Slides about SIMD (Score:2, Informative)
It's in the compiler (Score:3, Informative)
The problem is bringing this functionality to open-source compilers, which, as far as I know, don't even have an OpenMP (threading) implementation, let alone internal vectorization.
Re:About the RPL (Score:3, Informative)
And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, since they can't notify him of changes?
already exists (Score:3, Informative)
Re:License issues (Score:4, Informative)
True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.
One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.
The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing [slashdot.org], but he chose instead to do something a whole lot more obnoxious.
Read the Altivec mailing list (Score:5, Informative)
I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
Re:Other way around (Score:1, Informative)
No need for anyone to whip out the online dictionary and tell me "formerly known as" is an acceptable alternative.
Re:Umm (Score:3, Informative)
Re:License issues (Score:3, Informative)
Look, a troll! The GPL doesn't require you to release your code, unless you distribute it. This RPL thing requires you to release your code, even if you don't distribute it. I've discussed the linking issue elsewhere.
Re:A little background (Score:4, Informative)
Based on personal recollections reinforced by a quick Wiki'ing, MMX's problem wasn't the concept itself, but the braindead constraints Intel placed on x86 support for vectors. MMX recycled the same registers used for floating-point math, causing expensive context switches between the two modes, and it only allowed integer math to be vectorized. Intel eventually developed SSE to work around some of the bottlenecks, but the eventual dominance of GPUs on the PC platform reduced the development priority for vector math in the CPU.
Re:OS X Tiger will do it for you (Score:5, Informative)
Yes. (Score:3, Informative)
Maybe it's just Ignorant criticism... (Score:4, Informative)
SIMD programming becomes as easy as writing an ordinary C++ array expression. He claims that his sample code is 17.4x faster than CodeWarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
Re:16X increase? (Score:2, Informative)
A good example is what happens when you let the compiler decide how to do arithmetic with vectors and matrices.
Matrix a,b,c,x;
x = a + b + c;
The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads into a chain of temporaries: one temporary Matrix for a + b, another for adding c to that, each with its own copy and its own loop. All those temporary copies and inlined loops really kill performance.
Now, with an expression library, each arithmetic expression is handled discretely by type. By treating the expressions, as well as the types involved, you can do more sophisticated things: the Expression Template Library builds the whole sum as a single compile-time expression type and evaluates it in one pass. Here the library has carnal knowledge of the data structures involved, as well as the order of operations, which is how it arrives at such a succinct solution.
In the case of MACSTL, it's still using these principles of "vectorizing" the expressions, as well as unrolling and other traditional optimization techniques. It's also going the extra mile and using processor-specific code and/or C code that maps *extremely* well to PPC. For example, the above example would optimize well using Altivec, due to the platform's built-in vector type; you wouldn't even need a loop for adding several 'vec' instances.
I wish I knew enough about MACSTL and Altivec to give a hard example of a 16X speedup. I hope this gets you closer to seeing at least *where* the reducible overhead is coming from.
Check out Blitz++'s papers listing for more info:
http://www.oonumerics.org/blitz/papers/ [oonumerics.org]
Re:Yes. (Score:3, Informative)
The page you link to is a page that shows how to code vector-based programs. What the parent is asking is if the standard "Hello World" program can be auto-vectorized with one command-line argument, and that won't work currently.
The next version of Xcode (2.0), with GCC 3.4, will support partial auto-vectorization, as another comment noted as well.
Re:A little background (Score:5, Informative)
MMX (x86): 8-byte registers, only integer operations
SSE (x86): 16-byte registers, single-precision float ops
AltiVec (PPC): 16-byte registers, integer and single-precision float ops
SSE2 (x86): 16-byte registers, double-precision float ops
In order to implement many complex algorithms on x86, you need to use a motley combination of MMX and SSE. There are many flaws in both: lots of very useful instructions are missing, and MMX can't be used in conjunction with non-SIMD floating-point operations without a huge, expensive context switch. One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within an (8-byte or 16-byte) register. The one advantage on a modern x86 CPU is SSE2, which is the only SIMD unit of the bunch with double-precision floats. But you can only work with two doubles at a time, so the speedup is not that great.
AltiVec, on the other hand, included both floats and integers right from the start, with no penalty for switching between them, and it includes a very detailed and useful set of instructions, including an awesome shuffle instruction. My personal experience, coding for both, is that AltiVec is about twice as useful as MMX/SSE/SSE2 combined.
Also, note that in Mac OS X, many of the standard libraries and system calls are already AltiVec-optimized for you, and Apple also provides a great Vector library with lots of common DSP operations.
Re:A little background (Score:3, Informative)