
Grand Unified Theory of SIMD

Posted by Hemos
from the the-string-theory-of-SIMD dept.
Glen Low writes "All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness."
This discussion has been archived. No new comments can be posted.

  • Altivec (Score:5, Informative)

    by BWJones (18351) * on Monday February 07, 2005 @12:31PM (#11597314) Homepage Journal

    For those who want a little background on Altivec, Wikipedia of course has a description here [wikipedia.org]. Apple, which now ships Altivec in every system it makes, has a pretty good page here [apple.com], and Freescale, née Motorola, has one here [freescale.com].

    The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all, putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.

  • by LordRPI (583454) on Monday February 07, 2005 @12:33PM (#11597342)
    Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.
    • How is parent flamebait? It's a fact, and it's not flamebait considering Apple is one of the only companies currently shipping Altivec systems.
    • One of the problems of using libraries though is that the overhead of a function call usually negates any gain in vectorization. The lib call messes all kinds of things up, including instruction flow and caching, etc.
  • Umm (Score:2, Informative)

    by TheKidWho (705796)
    Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?
    • Re:Umm (Score:3, Informative)

      The next version of Xcode will support autovectorisation, but I don't think it does it atm.
    • by HeghmoH (13204)
      No.
    • Yes. (Score:3, Informative)

      by Trillan (597339)
      Yes it does [apple.com].
      • Re:Yes. (Score:3, Informative)

        by homb (82455)
        No, the current version of XCode uses GCC 3.3 and does NOT support autovectorization.
        The page you link to is a page that shows how to code vector-based programs. What the parent is asking is if the standard "Hello World" program can be auto-vectorized with one command-line argument, and that won't work currently.
        The next version of XCode (2.0) with GCC 3.4 will support partial auto-vectorization, as another comment said as well.
  • A little background (Score:5, Informative)

    by xXunderdogXx (315464) on Monday February 07, 2005 @12:35PM (#11597359) Homepage Journal
    From the Wikipedia article on SIMD:
    An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three 8-bit values for the brightness of the red, green and blue portions of the color. To change the brightness, the R G and B values are read from memory, a value is added (or subtracted) from it, and the resulting value is written back out to memory.


    With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get this pixel", a SIMD processor will have a single instruction that effectively says "get all of these pixels" ("all" is a number that varies from design to design). For a variety of reasons, this can take much less time than it would to load each one by one as in a traditional CPU design.
    But of course I'm sure everyone here knew that..
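
    A minimal sketch of the brightness example in C++, under assumptions that are not in the quoted article: pixels are a flat array of 8-bit channel values, results are clamped to 0..255, and the block size of 16 is picked to match one 128-bit AltiVec or SSE2 register. The function names are made up for illustration. The second function does the same arithmetic as the first, only grouped so that one block corresponds to what a single SIMD load/add/store would cover; whether a compiler actually emits SIMD instructions for it depends on the compiler and its flags.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an int to the 0..255 range of one colour channel.
inline std::uint8_t clamp_u8(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Scalar version: "get this value, now get this value", one at a time.
void brighten_scalar(std::vector<std::uint8_t>& rgb, int delta) {
    for (std::size_t i = 0; i < rgb.size(); ++i) {
        rgb[i] = clamp_u8(rgb[i] + delta);
    }
}

// Block version: identical results, but the work is grouped into 16-value
// blocks, the amount one 128-bit vector register holds (an assumption here).
void brighten_blocks(std::vector<std::uint8_t>& rgb, int delta) {
    const std::size_t kBlock = 16;
    const std::size_t n = rgb.size();
    std::size_t i = 0;
    for (; i + kBlock <= n; i += kBlock) {
        // A SIMD unit could do this inner loop as one saturating add.
        for (std::size_t j = 0; j < kBlock; ++j) {
            rgb[i + j] = clamp_u8(rgb[i + j] + delta);
        }
    }
    // Tail: leftover values that do not fill a whole block.
    for (; i < n; ++i) {
        rgb[i] = clamp_u8(rgb[i] + delta);
    }
}
```

    The tail loop is the "head/tail case" that hand-written SIMD code always has to deal with when the data length is not a multiple of the register width.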
  • by ThousandStars (556222) on Monday February 07, 2005 @12:37PM (#11597380) Homepage
    The Mac forum at Ars Technica has a long, continuing post [arstechnica.com] about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.
    • by kuwan (443684) on Monday February 07, 2005 @01:05PM (#11597703) Homepage
      A better resource for Altivec and SIMD in general is the SIMDtech.org [simdtech.org] website and Altivec [simdtech.org] mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.

      I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.

      --
      Join the Pyramid - Free Mini Mac [freeminimacs.com]
  • License issues (Score:5, Informative)

    by IO ERROR (128968) * <error@iCHICAGOoerror.us minus city> on Monday February 07, 2005 @12:39PM (#11597404) Homepage Journal
    Be careful; the "open source" license [pixelglow.com] (PDF) is not GPL-compatible. I don't even think it's BSD-compatible on first reading.

    The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

    • Re:License issues (Score:2, Interesting)

      by voxlator (531625)
      True, but only if you don't purchase a license.

      Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

      --#voxlator
      • Re:License issues (Score:4, Informative)

        by IO ERROR (128968) * <error@iCHICAGOoerror.us minus city> on Monday February 07, 2005 @01:04PM (#11597702) Homepage Journal
        Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.

        True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.

        One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.

        The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing [slashdot.org], but he chose instead to do something a whole lot more obnoxious.

    • The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

      IANAL, but I read the intent as "if you improve macstl you have to publish your changes to macstl" not "if you link macstl you have to publish source to the entire project".

      Obviously I can't say which one matches the legalese.
  • About the RPL (Score:5, Informative)

    by pavon (30274) on Monday February 07, 2005 @12:47PM (#11597495)
    The RPL (Reciprocal Public License [pixelglow.com]) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If you create a derivative work you are required to 1) Notify the original author, and 2) Publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.

    The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).

    So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.
    • Re:About the RPL (Score:3, Informative)

      by geoffspear (692508) *
      Clearly, we need to get everyone in the world to download the source, make one superficial change, and email the entire thing back to the original developer.

      And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?

      • And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?

        Yes, unless he has an identifiable successor-in-interest.

    • The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL.

      It's no more incompatible than is a class that overrides a method of a superclass "incompatible" with that superclass. In this instance, the release "method" is more strict.

    • #1 is understandable, if odd, but #2 is just ridiculous. In-house use doesn't fall under copyright protection to begin with, so how can the RPL regulate it?
      • In-house use doesn't fall under copyright protection to begin with
        False. You may be confusing in-house use with the doctrine of fair use.
        • You're right, I wasn't thinking. Wide-scale internal use would in fact be governed by the RPL. Small-scale use that fell under fair use would not.
  • Black Art? Uh... (Score:4, Interesting)

    by arekusu (159916) on Monday February 07, 2005 @12:47PM (#11597496) Homepage
    "...the black art of assembly language magicians."

    The nice thing about altivec is that it has a C interface. You don't have to use assembly!

    Take a look at this Apple tutorial [apple.com] to see how easy it is.
    • by Leo McGarry (843676) on Monday February 07, 2005 @01:03PM (#11597688)
      Yes, I think the person who wrote the summary revealed a little more of his own ignorance than he meant to. I don't consider calling "vec_add" inside a loop to be a black art.
    • Yeah, the C library is out there, and it's not too hard to use. :)

      But one could counter that even in the C library, unless you know what you're doing, you may not get as dramatic a speedup as you wanted. Until I looked at several of Apple's examples, I couldn't write altivec-aware code properly (i.e. maximum performance benefit).

      Once I knew what I was doing I went back and redid the code, and it ran much faster. So it is still tricky to maximize your bang-for-buck.
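
      For reference, "calling vec_add inside a loop" looks roughly like the sketch below. The function name add_arrays is made up, and the AltiVec branch assumes all three pointers are 16-byte aligned, since vec_ld and vec_st ignore the low four address bits. An SSE branch is included for comparison so the two intrinsic sets can be seen side by side; on any other target only the plain scalar loop is compiled.

```cpp
#include <cstddef>

#if defined(__ALTIVEC__)
#include <altivec.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// out[i] = a[i] + b[i], four floats per step where a SIMD unit is available.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ALTIVEC__)
    // Assumption: a, b and out are 16-byte aligned.
    for (; i + 4 <= n; i += 4) {
        __vector float va = vec_ld(0, a + i);
        __vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, out + i);
    }
#elif defined(__SSE__)
    // Unaligned loads/stores, so no alignment assumption on this path.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail, and the whole loop on targets without either unit.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

      The loop itself is the easy part. The tricky part mentioned above is everything around it: alignment, the leftover elements at the end, and arranging data so that four useful values sit next to each other in memory.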
  • Does this mean we can expect source Linux distros to start taking advantage of this?

    I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!

    Yay!
  • Sounds great, but $2499 for a redistributable binary? Ouch.
    • Re:Too expensive? (Score:2, Insightful)

      by voxlator (531625)
      In the corporate world, is it more expensive than paying a developer to design, code, test, and maintain a home-grown version?

      Once you've paid a $30/hour developer for 10 days' work, you've forked out ~ $2,500...

      --#voxlator
      • If the question was "Do I hire my own programmer or buy this technology?" then you would be correct.

        But, given this is an optimization and replacement for STL then the question is "Do I just live with STL, or buy this technology?"

        In other words, it isn't an essential development cost, it's an extra (I imagine most interested parties already have shipping apps that use STL).

        And at this price point, IMHO, I think the answer may be "if it ain't broke, don't fix it."
  • Slides about SIMD (Score:2, Informative)

    by quigonn (80360)
    A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf [www.fefe.de]
  • I'll take the Assembly Language, thanks. Especially on such a nice processor.

    TWW

  • by shawnce (146129) on Monday February 07, 2005 @12:50PM (#11597543) Homepage
    For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.

    GCC vectorization project [gnu.org] (site seems offline atm), but the abstract from a recent GCC summit [gccsummit.org] is up.

    Autovectorization Talk (google html view of pdf) [216.239.57.104]
    • If you're serious about performance, use XLC. GCC is great if you're cheap, but it's kind of like putting monster truck tires on a Ferrari.
    • Yes, the new ssa architecture in GCC 4.0 allows for autovectorization, but at the moment the focus is on getting GCC 4.0 sufficiently stable for release in a few months. Because of this, IIRC, some of the fancier vectorization passes were deferred until GCC 4.1.

      So yes, you might see some performance improvements due to vectorization in 4.0, but you'll have to wait until 4.1 or maybe even 4.2 before you'll see the full potential of it.

      -joib, occasional GCC contributor (although I have absolutely zilch to d
  • It's in the compiler (Score:3, Informative)

    by Mad Hughagi (193374) on Monday February 07, 2005 @12:51PM (#11597557) Homepage Journal
    Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.

    The problem is bringing this functionality to open source compilers, where as far as I know there is not even an OpenMP (threading) implementation, let alone internal vectorization.
    • It is built in but you don't automagically get full benefit unless you design your data structures and algorithms appropriately. In my case, I got no measurable benefit until I did a fairly extensive redesign.

      Intel has a great book on performance tuning that has been extremely helpful, as has Intel's VTune.
      • With no changes to our code, but turning on most of the switches to the Linux Intel compiler, I got a huge number of "loop was vectorized" messages, and the resulting code was sped up almost 20% (versus only 5% for the Intel compiler with no switches other than -O5). Now it is quite likely that more speedup is possible, but it appears the Intel compiler was quite able to recognize and vectorize code that was not designed for it. (ps the code is floating-point image processing, with repetitive operations don

      • Actually, you DO get automagical compiler speedup. In some cases it can identify vector-izable (is that a word?) loops and promote them to SIMD operations.

        But yes, otherwise, you need to re-code if the compiler doesn't take the hint, especially in structures/classes. The only objection I have to the Intel intrinsics is they don't look pretty! ;-)

        I haven't used VTune since circa 1998, and it had this awesome feature that would point out boneheaded things in your code. One interesting suggestion it made: i
        • Automagical only if it can make the identification; there are several things that can prevent it from doing so, and I managed to do several of them. VTune helps a lot with code like this - I've spent many happy hours tracking down hotspots with it.
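
          One common example of data structure design that helps or hinders the vectorizer is array-of-structures versus structure-of-arrays. The sketch below is illustrative only: the type and function names are invented, and whether a particular compiler vectorizes either loop is not guaranteed, so check the compiler's own vectorization report.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: consecutive x values are 12 bytes apart, so a
// 16-byte vector load would also drag in y and z.
struct PointAoS {
    float x, y, z;
};

void scale_x_aos(std::vector<PointAoS>& pts, float s) {
    for (std::size_t i = 0; i < pts.size(); ++i) {
        pts[i].x *= s;  // strided access
    }
}

// Structure-of-arrays: every field is contiguous, so the loop below is a
// plain unit-stride pass over floats, the pattern vectorizers handle best.
struct PointsSoA {
    std::vector<float> x, y, z;
};

void scale_x_soa(PointsSoA& pts, float s) {
    float* x = pts.x.data();
    const std::size_t n = pts.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= s;  // unit-stride access
    }
}
```

          Both functions compute the same thing; only the memory layout differs, which is why a redesign of the data structures can matter more than any compiler switch.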
  • already exists (Score:3, Informative)

    by jeif1k (809151) on Monday February 07, 2005 @12:55PM (#11597603)
    SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.
  • The future (Score:4, Insightful)

    by johnhennessy (94737) on Monday February 07, 2005 @12:55PM (#11597612)
    Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm/ [www.gotw.ca]).

    The way forward is turning the CPU (of a traditional architecture) into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.

    The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.

    AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.
    • Isn't that pretty much what the Amiga was doing a couple decades ago? The CPU was merely a traffic cop, directing other specialized units to actually do the real work? If so, they're a bit late to the party, eh?
  • Reading this reminded me about that portion of the standard C++ library which is all about operations on vector data. So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?
  • Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.
    • Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.

      For the record, this has been in Intel's C compiler for years now. It's also in the current release of the Microsoft Visual C++ compiler, including the free download version.
    • by be-fan (61476) on Monday February 07, 2005 @01:33PM (#11598016)
      Actually, Apple's Tiger will get an auto-vectorizing compiler courtesy of the public GCC 4.0 release. The auto-vectorizer wasn't developed in Apple's version of GCC. IBM's GCC team at the Haifa Research Lab developed the vectorizer in the public LNO (loop nest optimization) branch of GCC 4.0. I'm not trying to minimize Apple's contribution here, one of their developers did work on the team, but let's give credit where credit is due.
  • From the limewire... (Score:3, Interesting)

    by WilyCoder (736280) on Monday February 07, 2005 @01:07PM (#11597731)
    As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.

    This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provides a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.

    Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

    I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.

    With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.
    • You often need radically different algorithms to get the full benefit of SIMD. The processing power is there, figuring out how to exploit it can be very difficult.

      You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
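
      A small sketch of that idea: treat each bit position of a 32-bit word as a separate instance of the problem. Here the "problem" is a one-bit full adder, and the struct and function names are invented for the example. A handful of ordinary logic instructions then evaluate all 32 adders at once.

```cpp
#include <cstdint>

struct SumCarry {
    std::uint32_t sum;    // bit k = sum output of adder k
    std::uint32_t carry;  // bit k = carry output of adder k
};

// 32 independent one-bit full adders. Bit k of a, b and c are the three
// inputs of adder k; no bit ever interacts with its neighbours.
SumCarry full_add32(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    SumCarry r;
    r.sum = a ^ b ^ c;                      // odd number of inputs set
    r.carry = (a & b) | (a & c) | (b & c);  // at least two inputs set
    return r;
}
```

      The catch is the "properly structured problem" part: the data has to be laid out so that bit k of every word belongs to instance k, which usually means transposing the input first.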

  • by javaxman (705658) on Monday February 07, 2005 @01:08PM (#11597746) Journal
    Sorry, I can't read a story submitted by someone who doesn't even know about C [apple.com] libraries [intel.com] that have been around for years.

    Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?

    • by kuwan (443684) on Monday February 07, 2005 @01:48PM (#11598226) Homepage
      If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming. You program in standard C++ using std::valarray and you get code optimized for Altivec and MMX/SSE/SSE2/SSE3 without having to do anything else. You don't need to worry about coding to two different libraries on two different platforms nor do you have to worry about learning the platform-specific C intrinsics, alignment issues, head/tail cases, etc.

      SIMD programming becomes as easy as this:
      float af1 [] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
      stdext::valarray <float> v1 (af1, 10); // construct from first 10 elements of af1
      stdext::valarray <float> v2 (3.0f, 10); // construct with 10 repeats of 3.0f (value first, then count, assuming stdext::valarray mirrors std::valarray)
      stdext::valarray <float> v3 (10); // construct with 10 repeats of 0.0f

      v3 = sin (v1) * cos (v2) + sin (v2) * cos (v1);
      He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
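
      For comparison, here is the same computation written against plain std::valarray from the standard library, as a sketch rather than anything taken from macstl. The function names sin_sum and demo are made up. Note that the standard fill constructor takes the value first and the count second. Since sin(a)cos(b) + sin(b)cos(a) equals sin(a + b), the result is easy to sanity-check.

```cpp
#include <cmath>
#include <valarray>

// Element-wise sin(a)cos(b) + sin(b)cos(a); no explicit loop is written,
// the library is free to evaluate the whole expression however it likes.
std::valarray<float> sin_sum(const std::valarray<float>& a,
                             const std::valarray<float>& b) {
    std::valarray<float> result =
        std::sin(a) * std::cos(b) + std::sin(b) * std::cos(a);
    return result;
}

std::valarray<float> demo() {
    const float af1[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::valarray<float> v1(af1, 10);   // first 10 elements of af1
    std::valarray<float> v2(3.0f, 10);  // 10 repeats of 3.0f (value, count)
    return sin_sum(v1, v2);
}
```

      Because the expression says nothing about loop order, a valarray implementation has room to use SIMD underneath, which is the hook a library like this relies on.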

      Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).

      --
      Join the Pyramid - Free Mini Mac [freeminimacs.com]
      • If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming.

        My point exactly. Does the story say cross-platform anywhere? No, it says :
        programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians

        er... so, instead of saying something like "here's a product which allows you to use the same API for both PPC and Intel SIMD", the submitter puts in th

  • liboil (Score:3, Interesting)

    by labratuk (204918) on Monday February 07, 2005 @01:23PM (#11597901)
    Another project trying to do something similar is liboil [schleef.org], the Library of Optimised Inner Loops.

    However, in the future I can see things changing for the structure of the standard PC.

    At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GPGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.

    Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.

    I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general purpose vector processor.

    PCs will ship with two processors - one scalar, one vector. And everyone will be happy.

    Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.
  • by coult (200316) on Monday February 07, 2005 @01:38PM (#11598089)
    You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ [sourceforge.net] already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ [fftw.org] include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.

    Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html [apple.com] includes ATLAS so you don't even have to download or install anything - it comes with OS X.
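
    To give an idea of what cache-blocking means, here is a bare-bones tiled matrix multiply. It is only a sketch of the general technique, not ATLAS code: the function name, the row-major layout and the caller-supplied tile size are all assumptions, and ATLAS tunes such parameters per machine instead of taking them as arguments.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B for square n x n row-major matrices, processed tile by tile
// so that the pieces of A, B and C being worked on stay in cache.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n, std::size_t tile) {
    if (tile == 0) tile = 1;  // guard against a zero step
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Multiply one tile of A by one tile of B.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        // Unit-stride inner loop over a row of B and C.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

    The innermost loop is also exactly the kind of unit-stride loop that maps onto Altivec or SSE, which is why blocking and vectorization tend to show up together.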
  • by kompiluj (677438) on Monday February 07, 2005 @01:45PM (#11598166)
    Well, the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of the newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler [stanford.edu]! Furthermore, I see they [pixelglow.com] introduce classes like vec, etc. Such classes have already been designed successfully for C++. Why not try porting Blitz [oonumerics.org] to the Altivec and/or to the GPU?
  • by Pyrosophy (259529) on Monday February 07, 2005 @02:03PM (#11598427)
    This story doesn't really mean anything and people are just making up comments.
  • by Baldrson (78598) * on Monday February 07, 2005 @02:10PM (#11598517) Homepage Journal
    The real "grand unified theory" of SIMD is CAPP or content addressable parallel processors. I read a book [amazon.com] on this topic back in the 1970s and it was pretty clear to me that it:
    1. Was a great way of dealing with relational data
    2. Would have to await much larger scales of integration before becoming practical.
    Since then the computer world has become much more relational due to relational databases, and the levels of integration have skyrocketed, but no major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing.

    Fortunately there is at least a little ongoing research [mit.edu].

    The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, however those computers may never see the light of day for more fundamental reasons.
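
    A rough software simulation of the idea, for anyone who has not seen it. In a content addressable memory every word compares itself against a key under a mask at the same time, and all the words that respond can then be rewritten in one step. The loops below only imitate that, and the function names and the 32-bit word layout are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Associative search: word i responds if it matches key on every bit
// selected by mask. In hardware all comparisons happen simultaneously.
std::vector<bool> associative_search(const std::vector<std::uint32_t>& words,
                                     std::uint32_t key, std::uint32_t mask) {
    std::vector<bool> hit(words.size());
    for (std::size_t i = 0; i < words.size(); ++i) {
        hit[i] = ((words[i] ^ key) & mask) == 0;
    }
    return hit;
}

// Multi-write: every responder gets its masked bits replaced by value.
void write_responders(std::vector<std::uint32_t>& words,
                      const std::vector<bool>& hit,
                      std::uint32_t value, std::uint32_t mask) {
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (hit[i]) {
            words[i] = (words[i] & ~mask) | (value & mask);
        }
    }
}
```

    With a record packed as, say, a department id in the high 16 bits and a status field in the low 16 bits, a search on the high bits followed by a multi-write to the low bits behaves like a relational UPDATE ... WHERE, with a cost that does not depend on the number of records when the memory does the comparing.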
