Apple Businesses

AltiVec Unwrapped (38 comments)

paradesign writes "O'Reilly is running a nice article on AltiVec in the G4 chip. The article includes examples, with code, showing its effectiveness. For everyone who is uneducated as to exactly what Altivec is, this is a must read."
This discussion has been archived. No new comments can be posted.

  • Good (Score:2, Informative)

    by morbid ( 4258 )
    This is a good article giving a basic overview of SIMD coding using AltiVec. However, when Apple claims that MHz don't matter, they're only telling part of the story, because SSE (on PIII and Athlon4/XP) and 3DNow! (on K6-2, K6-3 and Athlon) all do much the same thing. I hate to say it, but the Pentium IV even has double-precision SIMD in the form of SSE2; it is currently the only consumer-grade processor with double-precision SIMD. The AMD Hammer will have SSE2 as well when it comes out.
    • Re:Good (Score:3, Funny)

      by QuietRiot ( 16908 )
      I was curious about what kind of hardware in the x86 arena had the same capabilities. Does anyone know where one could find a rundown of the "extras" found on the various x86-based processors with capabilities similar to those described above?

      How do they compare to the AltiVec in terms of speed, precision, cache in/out, etc.?

      Oh! http://www.processor-emporium.co.uk [processor-emporium.co.uk] seems to be a good reference site....
    • Gosh, you are right!

      Well, actually you are not, but that shouldn't keep you from trying. The 2nd example of using AltiVec relies on the FP vector multiply-add instruction, a no-show on SSE(2) and 3DNow!. The 3rd example relies on the fact that the x[i] and y[i] vectors stay the same, which they don't on the x86 SIMD extensions. So in those examples we already have some of the differences between AltiVec and the lesser SIMDs; others are more registers and better instructions for shuffling data. IOW, again, MHz isn't everything, as shown by e.g. dnet rc5 scores.
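
To make the multiply-add point concrete: a vector multiply-add computes a*b + c across all four 32-bit float lanes of a 128-bit register in one instruction (AltiVec exposes this to C as `vec_madd`). The portable C sketch below only emulates those semantics lane by lane. The `vec4f` type and the `madd4` name are made up for illustration, and no real SIMD instructions are used.

```c
/* Hypothetical stand-in for one 128-bit SIMD register: four 32-bit floats. */
typedef struct {
    float lane[4];
} vec4f;

/* Emulates a vector multiply-add: result = a * b + c, lane by lane.
 * A real SIMD unit would handle all four lanes with a single instruction. */
vec4f madd4(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    int i;
    for (i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    }
    return r;
}
```

On an instruction set without a combined multiply-add, the same result takes a vector multiply followed by a separate vector add.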

      • And the purpose of this is?

        Let's review. Implementing AltiVec requires a code rewrite. If your application lends itself to parallel processing, why rely on a single processor that executes 4 instructions at a time when you could use 6 processors that are clocked 50% faster, execute 4 instructions in parallel most of the time, and are only sometimes reduced to two in comparison? You can still execute 6, 100% faster by clock speed, at a given price. As long as you are going to have to rewrite your code, you might as well rewrite it for a cluster.

        So, in our example, we pit 3 dual processor 1533mhz athlon XPs against 1 800mhz G4. Price point is $1600

        In one corner, you have a single bottom end apple G4 tower at 800 mhz.

        800MHz PowerPC G4
        256K L2 cache
        256MB SDRAM memory
        40GB Ultra ATA drive
        CD-RW drive
        ATI Radeon 7500
        56K internal modem

        In the other corner we have 3u of Dual processor athlon goodness.

        3 tyan tiger AMD 760mp chipset motherboards @ $522.
        6 1800XP Athlons @ $624 (yes they work).
        3 256mb PC2100 registered ecc DDR ram @ $195.
        3 1u cases w/300w power supplies @ $120.
        3 40gb hard drives @ $162.

        Price point is $1623.

        Now rewrite your code.

        Which takes 3 weeks, by which time Apple raises the price of the G4 another hundred dollars while the price of the cluster drops a hundred dollars.

        Ok, that was a flame, let's stick to matters at hand.

        Referencing this article, the ars technica article and the c't article (you know which one I'm talking about, that place where you dare not look, you'll find x86 there staring back at you) we can draw these assumptions:

        The G4 with AltiVec performs equally clock for clock with x86 w/SSE, with some rare exceptions where it performs 100% faster clock for clock.

        best case scenario for our similar priced systems using your best case for the G4 benchmark, rc5:

        Single G4 800mhz 8,243,188 keys per second
        6 AMD 1800XP 32,987,538 keys per second

        Same price, x86 is 4 times as productive.

        Seti@home using Ars Lambchop benching wu: Identical!

        3.35 per work unit.

        x86 is 6 times as productive for the same price.

        CINT2000: base 648 - XP1800
        CINT2000: base 242 - G4 800mhz

        648 vs 242... and that is a single processor comparison!

        If we can optimise to scale, x86 is 16 times as fast for the same price

        If you know of any benchmarks where Mac can compare favorably for the price, please let us all know. You are right, Mhz is not everything. But you have to get some numbers to back the claim that the G4 is even marginally close in performance to machines with well over twice the clockspeed. I'm sure that will convince us all to run out and buy Macs for number crunching :)
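
The arithmetic behind the figures in this comment can be checked with a few lines of C. This is only a sketch: the numbers are copied from the post as quoted (not independently verified), the function names are invented here, and the 16x figure assumes perfect scaling across six processors, as the post itself does.

```c
#include <stddef.h>

/* Line-item totals in US dollars, as quoted in the post (unverified). */
static const int cluster_parts_usd[] = { 522, 624, 195, 120, 162 };

/* Sum of the quoted cluster line items. */
int cluster_total_usd(void)
{
    int total = 0;
    size_t i;
    for (i = 0; i < sizeof cluster_parts_usd / sizeof cluster_parts_usd[0]; i++) {
        total += cluster_parts_usd[i];
    }
    return total;
}

/* How many times more work the cluster does than the single machine. */
double throughput_ratio(double cluster_rate, double single_rate)
{
    return cluster_rate / single_rate;
}
```

With the quoted numbers, `cluster_total_usd()` gives 1623, `throughput_ratio(32987538.0, 8243188.0)` comes out just over 4 for rc5, and `throughput_ratio(6 * 648.0, 242.0)` is about 16 for CINT2000.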
  • Hearing about the wonderful performance, my lab picked up a bunch of these babies for our molecular interface simulations. As spec'd they are wonderful; trouble is, you can't use all that power without burning the case off.

    We did a simple run of elastic polymer equilibria (for nitrogen, of course) and the RAM sub-bus gave out on us after registering a temperature of 87 farads. So we backed off to a simple newtonian extrapolation using quadrature-integrated gaussian kinetics, and while it worked, the results are no more accurate than we would have gotten from DOS 5 on a 386.

    In short, unless you are planning to run it above the Antarctic circle, don't buy one.

    • Explain the "farad" temperature measurement please....
      • Gladly! (Score:1, Informative)

        In a molecular interface you have opposing charges facing each other, which is isomorphic to a capacitive situation, which is measured in farads. Because computing power is linearly related to both heat and simulation speed, it is easiest to measure PC case temperature in a standard way across the industry using farads rather than degrees.

        You will also find faradic temperature measurement in such fields as physical proton bombardment, torque pressurization and shotput.

    • You ARE above the Antarctic circle. It is in the south. The only way you could be "below" the Antarctic circle would be if you were in or around Antarctica, which is in the SOUTH!!!
  • And why doesn't anyone besides Apple sell this stuff?? Is it possible to get a G4-enabled, AltiVec-enabled board somewhere without paying the Apple Tax?
    • Re:OpenApple (Score:1, Informative)

      by morbid ( 4258 )
      Just buy an Athlon XP. It runs at 1.67GHz and does SSE (128-bit SIMD registers holding 4 32-bit floats) and 3DNow! (64-bit MMX registers used to hold 2 32bit floats).
      • You have to write in assembler or get a newer compiler, like gcc 3.1. Whoops, that is not out yet.

        gcc 3.1 will have both SSE and AltiVec support in code.
        • Re:OpenApple (Score:1, Informative)

          by Anonymous Coward
          You don't have to write in assembler or get a newer compiler. Just get libsse [purdue.edu]. It provides a similar programming interface to apple's hack of GCC. The same author also wrote a libmmx, but that is fairly useless since MMX is so poor.
    • Many people would be happy to sell you a board with a G4 on it. Maybe even 2 G4s!

      Marvell [marvell.com] makes ATX boards with 1 or 2 7450s.

      Motorola [motorola.com] makes a very nice ATX board with 2 7450s on it. They also have the Sandpoint platform, which you can use with many different PPC chips.

      Merlancia [merlancia.com] seems to have some good stuff.

      There's a bunch more too, Tundra, GMS, Force, just do a search on google. You'll likely find though that Apple has the best prices. If you want to play with a PPC (I'm assuming you want to do some low level stuff for fun or profit) you'll end up spending $1500 on just a board from somewhere else, or $1500 on a complete system from Apple. The Apple systems retain their value for a long time too.
    • Re:OpenApple (Score:2, Informative)

      google [google.com] is your friend.

  • Mostly out of curiosity (as I don't have a G4 on my desk anymore - it died), what does anyone know about the status of AltiVec support under LinuxPPC (as opposed to OSX, as discussed in the article)? A quick Google search indicates that Motorola made some patches for gcc a couple years ago, but that it wasn't exactly production quality.

    There's a website [altivec.org] that supposedly has tools, but you have to register for their mailing list to see what they've got (and I get enough mail as it is).

    -"Zow"

    • What do you mean by "support under LinuxPPC"?
      If the kernel knows about the registers, it can preserve them during context switches. I'd imagine this trivial kernel mod was done years ago.
      As for general programming, you're right about gcc. There isn't much vectorisation in gcc (cf. intel's cc, which vectorises for SSE2 on PIV), so I (with unrealistic self-confidence as usual) set about writing a C library of vector, matrix, complex etc. functions to use the SIMD features of K6-2/3, Athlon, PIII, PIV and PPC a while ago, and to provide a plain C implementation for folks without SIMD. If you want to help, have a look here [sourceforge.net].
      I've only done 3DNow and C so far for a small number of functions, but one or two people are already interested.
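
A library like the one described, with SIMD code paths plus a plain C fallback, is often structured around compile-time checks. The sketch below is not taken from the linked project: `simd_add_f32` is a hypothetical name, and only an SSE path is shown (selected when the compiler defines `__SSE__`). A 3DNow! or AltiVec path would be added as a further branch in the same way.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128 holds four 32-bit floats */
#endif

/* Hypothetical library entry point: out[i] = a[i] + b[i] for i in [0, n). */
void simd_add_f32(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* SSE path: add four floats per iteration using unaligned loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Plain C path: handles the leftover elements, or the whole array
     * when no SIMD path was compiled in. */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Callers see one function regardless of the processor, which is the point of providing a plain C implementation alongside the SIMD ones.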
    • Apparently gcc 3.1 supports altivec.

      Important stuff like MPEG2 decoders have supported it for a while, either with hand-written assembly or using output from the Apple compilers.
    • Altivec support has been in the Linux kernel for some time, and altivec assembler code has been supported in binutils for a while also.

      GCC 3.1 has support for AltiVec extensions in C/C++ code; however, the syntax is a little different from Motorola's AltiVec extensions, which are used in MacOS. Apple is apparently going to support both the old and new AltiVec syntax in their GCC 3.1 based compiler. This means that AltiVec code written with the new syntax should work unmodified on both Linux and Darwin/MacOS X.

  • Ars Technica did an article [arstechnica.com] comparing the AltiVec and SSE/MMX2/3DNow! architectures. Written a while back, but still valid as the architectures have not changed.

    --Paul
  • in each of their vector code examples, they divide n by 4 which seems to be because there are 4 altivec units on the powerpc chip. what happens when there are more units/chip? i think i may be missing something, though because this seems highly illogical. can someone please clear this up for me? i'm thinking maybe it has more to do with the fact that the 128bit unit can handle four floating point words per cycle (as stated toward the beginning of the article). in this case, would you divide n by 8 for 16-bit integers (and thus experience a ~8x performance increase)? can someone help me to get out of the dark?
    thanks...
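
On the question above: the division by 4 comes from the register width, not from the number of execution units. An AltiVec register is 128 bits wide, so it holds four 32-bit floats, eight 16-bit integers, or sixteen 8-bit integers, and a loop over n elements needs roughly n divided by that lane count iterations. A small C sketch of the arithmetic follows; the function names are illustrative, and the drop in iteration count is only a best-case estimate of the speedup, since memory bandwidth and loop overhead are ignored.

```c
#define VECTOR_BITS 128u  /* width of one AltiVec (or SSE) vector register */

/* Number of elements of the given width that fit in one vector register. */
unsigned lanes_per_vector(unsigned element_bits)
{
    return VECTOR_BITS / element_bits;
}

/* Vector-loop iterations needed to cover n elements, rounding up so a
 * partial final vector still gets processed. */
unsigned long vector_iterations(unsigned long n, unsigned element_bits)
{
    unsigned long lanes = lanes_per_vector(element_bits);
    return (n + lanes - 1) / lanes;
}
```

So for 1024 elements, 32-bit floats take 256 iterations and 16-bit integers take 128, which is where an ideal ~8x over scalar code for 16-bit data would come from.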
