IBM Full-System Simulator Team Speaks Out
Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM as codeword Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."
Re:PS3? (Score:5, Informative)
While this "simulator" is basically an emulation of the Cell hardware, it won't allow people to run games at full speed. It's more of a developer tool that allows programmers to start coding for the PS3 when they don't actually have the hardware yet. Still, it is reasonable to believe that emulation of the PS3 will be viable in the future (although not for a long time).
Re:mambo? (Score:2, Informative)
Re:PS3? (Score:0, Informative)
First, neither Intel nor AMD will be shipping anything that comes even close to the ~256 GFLOPS (and whatever the integer performance number is) of the latest version of the Broadband Engine.
Second, x86 chips will never be able to emulate the internal ring bus in Cell chips. The killer ring bus inside the chip is really the key to the crazy performance people are getting out of Cell systems.
Intel and AMD have pretty much nothing but slapping additional cores together on their roadmaps for the next decade. And even if, years from now, they finally manage to get enough x86 cores onto one chip to match that computational performance, they will still have nothing like the internal ring bus.
In other words, don't hold your breath waiting to emulate PS3 games on any x86 system...ever.
Re:You WANT A Cell System... (Score:5, Informative)
I almost feel drunk from the power I have at my hands
Here's some advice from someone who has access to a REAL CELL chip. I hate to disappoint you, but aside from custom libraries specifically optimized for CELL, Linux ain't going to run fast on this machine. All the generic open-source code targeted at a general-purpose CPU is going to run faster on a dual-core Intel or dual-proc/dual-core Mac. The actual CPU cores in this machine are simple in-order pipelines (think Pentium I level of optimization) versus current-gen CPUs (the P4 has out-of-order execution, speculative execution, register renaming, branch prediction, etc.). While simple C code runs at roughly the same speed, complicated C++ constructs run 2-10X slower on CELL's simplified PowerPC core than on the G5s you'll find in a Mac.
Code needs to be rewritten specifically to take advantage of the actual SPEs/SPUs (Synergistic Processing Elements/Units - I prefer SPE, since Sony already uses SPU for the PS1/PS2 sound chip). Until those Linux libraries appear, CELL isn't going to run anything faster. Not to mention that they will have to be custom code libraries that DON'T run on the main CPU, since the SPEs execute different machine code.
Re:Where is my workstation! (Score:2, Informative)
Praise for Cell (Score:5, Informative)
Some technical details: the SPE's instruction set could be thought of as `Altivec plus'. It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec.

Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful for implementing techniques like software pipelining which help mask these latencies.

The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about it, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to a traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.
Re:Praise for Cell (Score:3, Informative)
I'll try, but take my opinion with a grain of salt as I didn't do anything beyond coding an RC5-72 core, which doesn't involve external memory accesses.
Sure, it'd be impossible to keep this thing completely fed, but I hear the RAM specs are pretty impressive, using the new-fangled XDR DRAM technology from Rambus. Still, the computational power of the SPEs is huge, and it's sure to be RAM-starved unless the programmers take a lot of care.
Do realize though that this thing has a monster 100 GB/s interconnect. I would gather sending reasonable amounts of data back and forth between the SPUs is feasible, so perhaps operating on 8*256 KB = 2 MB datasets might be possible.
Beyond this, I think programmers would look at the Cell like they do at a NUMA box or clusters -- assume fetching remote data is costly and program to that paradigm. Not as costly as it is for clusters, even those with fancy interconnects; more like NUMA boxes. Hence, lots of blocking algorithms and stuff like 4-step FFTs. IBM is suggesting techniques using double-buffering which seem to be working well.
That depends on your workloads, in particular your access patterns. Sequential and blocking access patterns should do just fine.
What makes me pretty hopeful about the potential performance of Cell is that we're currently getting by pretty well with CPUs that have a fast L2 cache of similar size (256 KB was pretty common 3 or 4 years ago) and slow memory accesses. The situation is pretty similar with Cell, save that the local store is directly addressable as opposed to transparent like caches are, and I see that as a big win actually -- being able to manage the local store and make only explicit memory accesses should help spot and fix bottlenecks, without the need to worry whether the target CPU will have 512 KB or 1 MB or 2 MB of cache. Of course, having 8 high-clocked SPEs processing 128-bit vectors will impose a much higher burden on memory than your run-of-the-mill Pentium 4 currently does, but I'm hoping that XDR DRAM will be up to the challenge.
You may be mixing things up. What I said was that local store accesses had a fixed latency of 6 cycles.
I don't think a couple of afternoons writing code qualifies as real-world experience, but there you go.
Re:x86, x86_64, or PPC best for mambo simulator? (Score:2, Informative)
Re:Amazing Cell Demo (Score:1, Informative)
"First, the applications capture a user's face with a camera and detect the position of key features of the face, including the eyes, nose and mouth, using image recognition technology."
this can be done real time quite effectively right now:
http://citeseer.ist.psu.edu/rd/95418640%2C476373%
"By matching the 2D positions of these key features to a computer graphic image using a 3D face model, the applications estimate what direction the user is facing and the 3D positions of the face's 500 features."
Having seen a real-time morphable-model demo from Toshiba at ICCV 2003, this is probably a similar approach to this:
http://gravis.cs.unibas.ch/Sigg99.html [unibas.ch]
(my PhD thesis includes this area - not on my site yet, but I have a paper on MM fitting at
http://www.robots.ox.ac.uk/~jamie/paterson03.html [ox.ac.uk] )
Cheers.
Re:You WANT A Cell System... (Score:3, Informative)
The Pentium Pro ran Windows NT much faster than an equivalent-speed Pentium. A lot of the old 16-bit instructions, however, were microcoded rather than natively executed, and took a few clocks longer. Since much legacy code at the time (games, and anything with Win16 roots, including Windows 95) made use of 16-bit instructions, it ran slower. Comparing Windows NT 4 on a 200MHz Pentium Pro and a 200MHz Pentium (which wasn't available for a few years), the Pentium Pro won hands down. By the time the Pentium II (i.e. Pentium Pro MMX) was released, everyone was running 32-bit apps - the only 16-bit apps left were so old that people didn't mind that they were slower than native ones, since they were still much faster than they had been on any CPU designed to run them.
The only differences between the Pentium Pro and the Pentium II were the addition of MMX, and the move of the cache from a separate die in the same package to separate chips on the same board, which allowed the cache and CPU core to be tested independently, improving yields.