Software Technology

Grand Unified Theory of SIMD

Glen Low writes " All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness. "
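
macstl's hook is that it keeps the standard C++ valarray interface and swaps the scalar loops underneath for SIMD ones. As a point of reference, here is the whole-array style of code involved; a minimal sketch using plain std::valarray, which executes as ordinary scalar code (the function name is illustrative):

    #include <valarray>

    // Whole-array expressions instead of an explicit element loop.
    // With std::valarray this runs as scalar code; a SIMD-backed
    // valarray such as macstl's evaluates the same expression several
    // elements at a time.
    std::valarray<float> scale_and_offset(const std::valarray<float>& x,
                                          float a, float b)
    {
        return a * x + b;   // applied across every element of x
    }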
This discussion has been archived. No new comments can be posted.

  • by betelgeuse68 ( 230611 ) on Monday February 07, 2005 @12:38PM (#11597396)
    Moore's Law has eroded the need for such knowledge. It would be like concerning myself with how to design circuits that convert DC to AC just because I happen to use devices that run on electricity, e.g., my toaster (as in bread).

    I learned assembly long ago and still retain a fair amount of it (80x86). There have been a few occasions where I've called upon it: yeah, twice in the last eight years... and that's about it.

    Yes, some people who write games are still concerned with assembly, as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early '90s.

    90% of IT jobs are with non-tech companies. Those situations are about the last place you will find anyone caring about something called "assembly language."

    -M
  • Re:16X increase? (Score:2, Interesting)

    by mirko ( 198274 ) on Monday February 07, 2005 @12:39PM (#11597412) Journal
    When using Reason 3 [propelerheads.se], some virtual synths have the option to produce an enhanced sound.
    What is curious is that if you are using a pre-AltiVec processor (a G3), it'll burn more CPU time, while the same enhancement is handled natively by AltiVec-enabled units: a 400MHz G4 PowerBook enhances these synths more efficiently than an 800MHz G3.
    I guess this is like the simultaneous operations that ARM assembly language supports (e.g. both storing and rotating a value in a single instruction)...
  • Re:Altivec (Score:3, Interesting)

    by baryon351 ( 626717 ) on Monday February 07, 2005 @12:44PM (#11597463)
    It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (a Radius Photoengine with 4 DSPs on a daughter card, linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems, simply because you had four hardware DSPs running your image math.


    I managed to pick up a ThunderIV last year with the DSP card, and had a run around with Photoshop on it. It's impressive stuff. I also ran Photoshop on an iMac 350 I have here, and while the 350 beat the Thunder-in-a-Quadra on many unaccelerated things, on those operations where the DSPs kicked in (and the card has those cool little LEDs to show just when it's happening) it stayed nearly neck and neck with the iMac.

    That's a 25MHz 68040 from 1992 with a Thunder IVGX versus a 350MHz G3 from 2000. Very cool.
  • Black Art? Uh... (Score:4, Interesting)

    by arekusu ( 159916 ) on Monday February 07, 2005 @12:47PM (#11597496) Homepage
    "...the black art of assembly language magicians."

    The nice thing about altivec is that it has a C interface. You don't have to use assembly!

    Take a look at this Apple tutorial [apple.com] to see how easy it is.
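
    For a flavor of that C interface, here's a minimal multiply-add loop using the AltiVec intrinsics from altivec.h; a sketch assuming GCC with -maltivec, 16-byte-aligned arrays, and a length that's a multiple of 4 (the function itself is mine, not from the tutorial):

        #include <altivec.h>

        // y[i] = a * x[i] + y[i], computed four floats per iteration.
        // Assumes x and y are 16-byte aligned and n is a multiple of 4.
        void saxpy(float a, const float* x, float* y, int n)
        {
            // broadcast the scalar into all four lanes via an aligned temp
            float splat[4] __attribute__((aligned(16))) = { a, a, a, a };
            vector float va = vec_ld(0, splat);

            for (int i = 0; i < n; i += 4) {
                vector float vx = vec_ld(0, x + i);   // load 4 floats
                vector float vy = vec_ld(0, y + i);
                vy = vec_madd(va, vx, vy);            // fused multiply-add
                vec_st(vy, 0, y + i);                 // store 4 floats
            }
        }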
  • by shawnce ( 146129 ) on Monday February 07, 2005 @12:50PM (#11597543) Homepage
    For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.

    GCC vectorization project [gnu.org] (site seems offline at the moment), but the abstract from a recent GCC summit [gccsummit.org] is up.

    Autovectorization Talk (google html view of pdf) [216.239.57.104]
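
    Roughly speaking, the autovectorizer targets loops like the one below: countable trip count, contiguous accesses, and no aliasing. A sketch, assuming GCC's -ftree-vectorize option (the function is illustrative):

        // Hypothetical build: g++ -O2 -ftree-vectorize -maltivec add.cpp
        // The __restrict qualifiers promise the arrays don't alias, which
        // is what lets the compiler vectorize without runtime checks.
        void add_arrays(float* __restrict a, const float* __restrict b,
                        const float* __restrict c, int n)
        {
            for (int i = 0; i < n; ++i)
                a[i] = b[i] + c[i];   // independent iterations: vectorizable
        }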
  • Re:License issues (Score:2, Interesting)

    by voxlator ( 531625 ) on Monday February 07, 2005 @12:51PM (#11597555)
    True, but only if you don't purchase a license.

    It's simple to understand: if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

    --#voxlator
  • by jilbert ( 520628 ) on Monday February 07, 2005 @01:06PM (#11597722)
    Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.
  • From the limewire... (Score:3, Interesting)

    by WilyCoder ( 736280 ) on Monday February 07, 2005 @01:07PM (#11597731)
    As two of my professors have stated in class, SIMD and, even more so, parallel processing will require programmers to think in a fundamentally different way for multi-core/multi-processor to really take off.

    This project may be a step in the right direction. Benchmarks show that SIMD extensions such as SSE/SSE2/SSE3 provide only a marginal speed increase. Meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.

    Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

    I forget the specifics, but a new graphics card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.

    With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.
  • by javaxman ( 705658 ) on Monday February 07, 2005 @01:08PM (#11597746) Journal
    Sorry, I can't read a story submitted by someone who doesn't even know about C [apple.com] libraries [intel.com] that have been around for years.

    Or is this just another advertisement pretending to be a story, with the submitter playing ignorant about alternative Altivec and MMX libraries?

  • liboil (Score:3, Interesting)

    by labratuk ( 204918 ) on Monday February 07, 2005 @01:23PM (#11597901)
    Another project trying to do something similar is liboil [schleef.org], the Library of Optimised Inner Loops.

    However, in the future I can see the structure of the standard PC changing.

    At the moment in a high-end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GPGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.

    Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.

    I would say that in the next 5-10 years we will see the GPU merge with the SIMD extensions to provide a separate general-purpose vector processor.

    PCs will ship with two processors - one scalar, one vector. And everyone will be happy.

    Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto-vectorisation already good enough? I don't know.
  • by coult ( 200316 ) on Monday February 07, 2005 @01:38PM (#11598089)
    You really don't need macstl unless you have a strong desire to use valarray in C++... for example, the ATLAS project http://math-atlas.sourceforge.net/ [sourceforge.net] already uses Altivec (and SSE/SSE2, etc.) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ [fftw.org] include Altivec/SSE/SSE2 optimizations as well. ATLAS also includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know whether macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak on the G4/G5 for things like matrix-matrix multiplication.

    Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html [apple.com] includes ATLAS, so you don't even have to download or install anything - it comes with OS X.
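
    To make that concrete, here's roughly what using ATLAS (or vecLib) looks like through the standard CBLAS interface; a sketch, with the header path varying by platform and the function name my own:

        #include <cblas.h>   // with Apple's vecLib: #include <vecLib/cblas.h>

        // C = A * B for n x n row-major single-precision matrices. The
        // library dispatches to its tuned AltiVec/SSE kernels internally;
        // no vector code appears at this level.
        void matmul(int n, const float* A, const float* B, float* C)
        {
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n,
                        1.0f, A, n,    // alpha, A, leading dimension of A
                        B, n,          // B, leading dimension of B
                        0.0f, C, n);   // beta, C, leading dimension of C
        }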
  • by Baldrson ( 78598 ) * on Monday February 07, 2005 @02:10PM (#11598517) Homepage Journal
    The real "grand unified theory" of SIMD is CAPP or content addressable parallel processors. I read a book [amazon.com] on this topic back in the 1970s and it was pretty clear to me that it:
    1. Was a great way of dealing with relational data
    2. Would have to await much larger scales of integration before becoming practical.
    Since then the computer world has become much more relational thanks to relational databases, and levels of integration have skyrocketed, but no major manufacturer of silicon has bothered to revisit this very simple and powerful route to high-powered computing.

    Fortunately there is at least a little ongoing research [mit.edu].

    The beauty of these processors is that they integrate memory with computation, so the massive economies of scale we see in memory fabrication apply to computation speed as well, so long as we can move toward relational rather than functional computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, though those may never see the light of day for more fundamental reasons.
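
    To make the idea concrete, here is a toy software model of an associative (content-addressable) select; in a real CAPP every cell performs its comparison simultaneously, so the loop below collapses into a single step (the types and names are mine):

        #include <cstdint>
        #include <vector>

        // One "cell" of associative memory: a tag to match on plus a payload.
        struct Cell {
            std::uint32_t tag;
            std::uint32_t data;
        };

        // Relational-style select: return the payload of every cell whose
        // tag equals the key. This loop only models the semantics; CAPP
        // hardware runs all the comparisons in parallel.
        std::vector<std::uint32_t> select_eq(const std::vector<Cell>& memory,
                                             std::uint32_t key)
        {
            std::vector<std::uint32_t> hits;
            for (const Cell& c : memory)
                if (c.tag == key)
                    hits.push_back(c.data);
            return hits;
        }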

  • macstl vs. Blitz++ (Score:1, Interesting)

    by ljubom ( 147499 ) on Monday February 07, 2005 @02:48PM (#11599017)
    It will be interesting to compare the performance of the macstl library to that of other "high speed" template libraries like Blitz++ (see http://www.oonumerics.org/blitz/).
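
    For anyone attempting that comparison, the kernels look almost identical in the two libraries; a sketch of the Blitz++ side, assuming its stock Array class (benchmark harness omitted):

        #include <blitz/array.h>

        // The same whole-array style as macstl's valarray, in Blitz++.
        // Expression templates fuse the right-hand side into one loop.
        void scale_add(blitz::Array<float,1>& y,
                       const blitz::Array<float,1>& x, float a)
        {
            y = a * x + y;   // a single fused pass, no temporary arrays
        }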
  • Re:16X increase? (Score:3, Interesting)

    by sribe ( 304414 ) on Monday February 07, 2005 @03:02PM (#11599187)
    So this in essence is adding 16 values at once, and in theory that's good enough for marketing to claim a 16X speedup, but this is rarely the case.

    There are 32 of these registers (independent, not shared with the FPU), which means you can chain together a pretty complex series of calculations without intermediate load/store sequences. The unit has multiple independent computation units with their own dispatch queues (details vary between specific processor models). And some AltiVec opcodes replace what would otherwise be a series of several scalar instructions.

    The result is that speed ups of more than 16x are not at all rare. 30x is not uncommon in graphics manipulations; I would venture to say that 100x is "rarely the case." ;-)
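
    One example of the effect: a saturating add like vec_adds does the add, the compare and the clamp for 16 pixels in one opcode. A minimal sketch, assuming 16-byte-aligned data and a length that's a multiple of 16 (the function is mine):

        #include <altivec.h>

        // Brighten 16 pixels per iteration. vec_adds saturates at 255, so
        // each opcode replaces 16 scalar add/compare/clamp sequences; this
        // is one source of those greater-than-16x speedups.
        void brighten(unsigned char* pixels, int n, unsigned char amount)
        {
            unsigned char splat[16] __attribute__((aligned(16)));
            for (int i = 0; i < 16; ++i)
                splat[i] = amount;                        // broadcast the scalar
            vector unsigned char va = vec_ld(0, splat);

            for (int i = 0; i < n; i += 16) {
                vector unsigned char vp = vec_ld(0, pixels + i);
                vec_st(vec_adds(vp, va), 0, pixels + i);  // 16 saturating adds
            }
        }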
  • Assembly lives! (Score:2, Interesting)

    by Omigod ( 857189 ) on Monday February 07, 2005 @04:45PM (#11600289)
    The more complex the architecture, the greater the need to keep low-level coding skills around. Compilers just can't keep up. During the early days of the PS2 we commonly got 300x performance improvements when switching from high-level code to carefully architected and coded assembly.

    Programmers have gotten lazy and have lost the skills required to maximize performance on current architectures. If you code carefully you can make sure that you are executing the maximum number of instructions per cycle. When you use a compiler, it abstracts you from seeing that changing your instruction pairing, or splitting some of the instructions off into another pipeline, might get better performance.

    In school they teach you that the algorithm is the most important thing to look at and that implementation doesn't matter much, but with today's complex bus architectures, and with everything fighting for control of the bus, if you aren't careful you can end up wasting most of your time waiting for access to data or stalling the instruction pipeline waiting for the results of calculations.
