Grand Unified Theory of SIMD

Glen Low writes "All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction, multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness."
  • by Anonymous Coward on Monday February 07, 2005 @12:42PM (#11597448)

    Moore's Law has eroded the need for assembly

    Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia [wikipedia.org]:

    Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.

    I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.

  • Assembly (Score:3, Insightful)

    by bsd4me ( 759597 ) on Monday February 07, 2005 @12:55PM (#11597605)

    Even in embedded systems, assembly isn't used as much as it used to be. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory-mapped, so most of a driver is written in C, with asm() calls made where appropriate (e.g., asm("eieio");), especially since gcc's asm() syntax lets you access C variables directly.
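
    A minimal sketch of the gcc extended asm() style the parent describes - the eieio barrier is PowerPC-specific, and the helper names here are invented for illustration:

        #include <stdint.h>

        // Order memory-mapped I/O accesses on PowerPC.
        static inline void io_barrier(void)
        {
            asm volatile ("eieio" ::: "memory");
        }

        // Extended asm syntax binding C variables to registers.
        static inline uint32_t add_asm(uint32_t a, uint32_t b)
        {
            uint32_t r;
            asm ("add %0, %1, %2" : "=r" (r) : "r" (a), "r" (b));
            return r;
        }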

  • The future (Score:4, Insightful)

    by johnhennessy ( 94737 ) on Monday February 07, 2005 @12:55PM (#11597612)
    Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm [www.gotw.ca]).

    The way forward is turning the CPU of a traditional architecture into a nanny for a range of dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect their job was much easier. The software that would run on the platform they are designing is fairly specific - games and multimedia, which usually lend themselves well to vectorization.

    The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.

    AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, it will hit a roadblock when memory bandwidth and caching schemes just have too much work to do, with 4 or 8 processing cores hammering at them all the time.
  • Re:Too expensive? (Score:2, Insightful)

    by voxlator ( 531625 ) on Monday February 07, 2005 @12:57PM (#11597632)
    In the corporate world, is it more expensive than paying a developer to design, code, test, and maintain a home-grown version?

    Once you've paid a $30/hour developer for 10 days' work, you've forked out ~$2,500...

    --#voxlator
  • by lowe0 ( 136140 ) on Monday February 07, 2005 @01:00PM (#11597661) Homepage
    Which is exactly why this sort of thing is so important.

    Sure, you could probably get it to work even faster with hand-tuned assembly than simply using this library. But programmer time is expensive, and customizing code adds complexity. By reusing optimized code, you can enjoy some of the benefits of SIMD without having to devote the same amount of resources.

    Let's be honest, this isn't a silver bullet - it isn't going to speed up code that doesn't use lots of floating-point vectors anyway. But where it applies, (nearly) free performance is always a good thing.
  • by dsci ( 658278 ) on Monday February 07, 2005 @01:09PM (#11597755) Homepage
    We write code for hardcore chemical simulations. The limits on what can be studied, i.e. the number of atoms/molecules or the timescales of the simulations, depend on one thing: speed.

    Faster computers mean better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they could be, at least within a given time allotment.

    I've recently applied some code optimizations to a Monte Carlo simulation and saw speedups of over 1000x. That's significant.

    It's naive to think that faster computers mean we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between getting work done in one week and getting it done in two or three, I'll take the one-week sim.
  • by kuwan ( 443684 ) on Monday February 07, 2005 @01:21PM (#11597870) Homepage
    So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?

    That's exactly what this is. If you read the part on his website about valarray [pixelglow.com], you'll see that it does extensive SIMD optimizations of valarray for both Altivec and MMX/SSE/SSE2/SSE3 platforms. He's even added "parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic", which you'd otherwise have to code yourself in assembly or with the C-based intrinsics if you wanted to do the SIMD programming by hand.

    So basically, this allows you to code using std::valarray using normal C++ and then plug this in under the hood to get a nice speed boost.
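
    For a feel of what that looks like from the caller's side, here's a plain-C++ sketch - ordinary std::valarray code, with the SIMD specialization doing its work invisibly underneath (macstl's actual headers and namespaces aren't shown here):

        #include <valarray>
        #include <cstdio>

        int main()
        {
            std::valarray<float> a(1.0f, 1024), b(2.0f, 1024);

            // Element-wise expressions like these are what a SIMD-aware
            // valarray implementation can map onto Altivec or SSE code.
            std::valarray<float> c = a * b + 0.5f;
            std::valarray<float> s = std::sin(c);  // parallelized transcendental

            std::printf("%f\n", s[0]);
        }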

  • by kompiluj ( 677438 ) on Monday February 07, 2005 @01:45PM (#11598166)
    Well, the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of the newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it, then? Get the Brook compiler [stanford.edu]! Furthermore, I see they [pixelglow.com] introduce classes like vec, etc. Such classes have already been designed successfully for C++. Why not try porting Blitz [oonumerics.org] to Altivec and/or to the GPU?
  • by groomed ( 202061 ) on Monday February 07, 2005 @02:01PM (#11598405)
    Sorry, but yours is an utterly kneejerk boilerplate response which has nothing to do with the topic at hand and only serves to establish your credentials as a hard-nosed realist who has been there and done it.

    Moore's Law has eroded the need for such knowledge

    Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.

    It would be like concerning myself on how to design circuits...

    No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings that are made every day at institutions all over the world. You may not need this knowledge, but what you personally need is not what's under discussion.

    Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche

    By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.

    Those situations are about the last place you will find anyone caring about something called "assembly language."

    Again, completely irrelevant.

    The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.

    These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.
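
    To illustrate the kind of "few dozen lines" in question, here is a hypothetical SSE sketch of a scalar loop and its four-wide equivalent (the function names are made up; assumes an x86 compiler shipping <xmmintrin.h>):

        #include <xmmintrin.h>

        // Scalar baseline: one multiply-add per iteration.
        void madd_scalar(const float* a, const float* b, float* out, int n)
        {
            for (int i = 0; i < n; ++i)
                out[i] = a[i] * b[i] + 1.0f;
        }

        // SSE version: four floats per instruction, plus a scalar tail.
        void madd_sse(const float* a, const float* b, float* out, int n)
        {
            const __m128 one = _mm_set1_ps(1.0f);
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), one));
            }
            for (; i < n; ++i)
                out[i] = a[i] * b[i] + 1.0f;
        }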
  • by Anonymous Coward on Monday February 07, 2005 @02:27PM (#11598733)
    The idea that assembler programmers can write better code than a compiler can generate is one of those urban myths that refuses to die. Compilers can and do undertake code analysis that no assembler programmer ever could - like tracing control flow back through every single branch point to find instances where data has already been precalculated, or hoisting temporaries out of loops in a way that maximises register use over memory hits. Undertaking such analysis before coding in assembler would be extremely high-risk. And would you, as an assembler programmer, go about inlining all your assembler functions? The code would be unmanageable. How many assembler programmers would know how to reorder their instructions to avoid pipeline stalls? All the knowledge about optimising assembly programs has been incorporated into compiler backends over the years - why wouldn't it have been?

    It's been tested - take a program that converts assembler to C and recompile with optimisation: it *will* run faster.

    The only exceptions are where the compiler lacks an algebraic or RTL awareness of an instruction on a specific architecture.
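
    A made-up illustration of the loop hoisting described above - the optimizer lifts the recomputed subexpression out of the loop automatically, where an assembly programmer would have to spot it by hand:

        // Naive form: (gain * offset) is recomputed on every iteration.
        float weighted_sum(const float* data, int n, float gain, float offset)
        {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i)
                sum += data[i] * (gain * offset);
            return sum;
        }

        // What the compiler effectively emits after hoisting: the loop
        // invariant is computed once and held in a register.
        float weighted_sum_hoisted(const float* data, int n, float gain, float offset)
        {
            const float k = gain * offset;
            float sum = 0.0f;
            for (int i = 0; i < n; ++i)
                sum += data[i] * k;
            return sum;
        }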

    jxxx
  • by bryanzak ( 598580 ) on Monday February 07, 2005 @02:41PM (#11598930)
    One of the problems of using libraries, though, is that the overhead of a function call often negates any gain from vectorization. The library call disrupts all kinds of things, including instruction flow, caching, etc.
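
    This is presumably part of why macstl is a template library: operations defined inline in headers are visible at the call site, so the compiler can eliminate the call altogether. A hypothetical before/after sketch:

        // Imagine this lives out-of-line in a prebuilt library, opaque
        // to the optimizer at the call site:
        float scale(float x, float k) { return x * k; }

        // Calling it per element pays call overhead and blocks
        // vectorization of the loop.
        void scale_all_calls(const float* in, float* out, int n, float k)
        {
            for (int i = 0; i < n; ++i)
                out[i] = scale(in[i], k);
        }

        // A header-defined template instead: the body is inlined, so the
        // whole loop can be optimized and vectorized as a unit.
        template <typename T>
        inline T scale_inline(T x, T k) { return x * k; }

        void scale_all_inline(const float* in, float* out, int n, float k)
        {
            for (int i = 0; i < n; ++i)
                out[i] = scale_inline(in[i], k);
        }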
  • by TheRaven64 ( 641858 ) on Monday February 07, 2005 @04:27PM (#11600116) Journal
    The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.
  • by Dasein ( 6110 ) <tedc@codebig. c o m> on Monday February 07, 2005 @05:17PM (#11600587) Homepage Journal
    Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).

    Nope. Technically, there are two constants buried in here. The definition is: g(x) = O(f(x)) means g(x) <= k*f(x) for all x > a, for some constants k and a. If you don't change algorithms, all you can do is manipulate the k. For a given k and a given level of improvement, I can give you a new k that hits that level of improvement.
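
    Spelled out (the 1000 below is just the grandparent's number), a constant-factor speedup rescales k but never changes the asymptotic class:

        \[ g(x) = O(f(x)) \iff \exists\, k, a : g(x) \le k\, f(x) \text{ for all } x > a \]
        \[ T(n) = k\, f(n) \;\Rightarrow\; T'(n) = \tfrac{k}{1000}\, f(n) = O(f(n)) \]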

    Honestly: To be able to get a 1000 times boost, your original code must have been beyond bullshit.

    Also, his original code may have been "bullshit", but it may not have been. It depends a lot on the algorithm in question. The higher the exponent on an algorithm's running time, the more sensitive it is to an optimization in an inner loop.

    And of course using SIMD is better than not using it, but I would rather stay at a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation whether the results are real or you just botched some line isn't worth it.

    This is a simple matter of economics. There's a cost/benefit to expending the effort to optimize in assembly. If the compiler generates good code, then obviously the cost of recoding in assembly is high relative to the benefit. However, without specific knowledge of *HIS* economics, I would suggest that you not spout off.
