IBM Full-System Simulator Team Speaks Out
Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM as codeword Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."
PS3? (Score:3, Funny)
Re:PS3? (Score:5, Informative)
While this "simulator" is basically an emulation of the Cell hardware, it won't allow people to run games at full speed. It's more of a developer tool that lets programmers start coding for the PS3 before they actually have the hardware. Still, it is reasonable to believe that emulation of the PS3 will be viable in the future (although not for a long time).
Re:PS3? (Score:1, Troll)
Re:PS3? (Score:1, Troll)
Ouch, does that count for innovative these days?
Re:PS3? (Score:1)
Re:PS3? (Score:2)
Yeah, remember, the cell is just one component, you've got the GPU to worry about too, and make sure you can match the other system component performance (RAM bus and the like). Not impossible, but consider it took/takes a 100-200MHz intel system to emulate a 3MHz SNES. While other techniques are available (like dynamic recompilation and the like), these only go so far. If you cou
You WANT A Cell System... (Score:1, Interesting)
I have been through a lot of chip transitions over the years and been impressed with the leaps each new generation has made.
But Cell is something entirely different. It is such a HUGE leap in performance beyond x86 systems that to go back to using an x86 machine is unthinkable now for me. I almost feel drunk from the power I have at my hands...
Read up all the Cell info you can at IBM's site and read the various patents IBM, Toshiba, and Sony have out ther
Re:You WANT A Cell System... (Score:4, Insightful)
I have been through a lot of chip transitions over the years and been impressed with the leaps each new generation has made.
But Cell is something entirely different. It is such a HUGE leap in performance beyond x86 systems that to go back to using an x86 machine is unthinkable now for me. I almost feel drunk from the power I have at my hands...
Read up all the Cell info you can at IBM's site and read the various patents IBM, Toshiba, and Sony have out there. And find some way to get your hands on one of these...
I can now see why the PS3 stuff we are seeing is so amazing...
Sure, the cell is amazing, IF you are doing the right things. You say that you simply want to leave the old x86 architecture behind but the truth of the matter is that the two do not even begin to compare.
It is not simply a matter of saying "OMG my cell has 8 cores at 4GHz". The main Power Processing Element is crippled at best for simple single threaded applications -- roughly equivalent to a PowerPC of the G3 era, but specifically in-order execution. The SPEs (the other 8 cores) are essentially mini vector computers. They can perform a massive amount of floating point calculations in parallel, however they do not enjoy an innate ability to deal well with all sorts of code as a standard x86 CPU could.
The Cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications. -- Don't throw out your x86 just yet.
True, however (Score:4, Insightful)
If Cell is what it claims to be, developers will create new multi-threaded applications. Compared to 15 years ago, multi-threading is a snap.
I dunno... (Score:1)
Re:True, however (Score:2)
Re:True, however (Score:2)
If performance doesn't matter, it doesn't matter. The discussion is moot. Go and buy a cheap 386.
If Cell is what it claims to be, developers will create new multi-threaded applications. Compared to 15 years ago, multi-threading is a snap.
There seems to be a difference between what the Cell claims to be and what you perceive it to be. The SPUs of the
Re:True, however (Score:2)
Personally I think that anyone who does similar operations on a large set of data will LOVE the cell. If you can get a pipeline going where each SPU does one step of a larger algorithm, you can stream the data right throug
Re:True, however (Score:2)
Re:True, however (Score:2)
Numerical computing can deal with the 32-bit floating point issue pretty easily.
I am no numerical analyst. But it seems to me that when you really need 8x4GHz, you are doing some spiffy stuff. The more spiffy stuff you do, the higher precision you are going to need. While I agree that there is plenty of literature on "stable" numerical algorithms, after enough iterations, anything will become less accurate. Add to that the extra cost of developing the program for the Cell (lots of workarounds for weak
Re:True, however (Score:1)
Actually, the new Sun T1 processors (1 floating point, 8 integer core CPUs) are what you'd want to serve more webpages for instance. Certainly not that 10GHz Intel processor coming out any day now [Tm]
Supercomputing folks can use it, but only for 32bit operations. Depends upon the need, not solely the bits. No
Re:True, however (Score:2)
The "problem" is that the Cell architecture is highly specialized; it may take them much more code to do more generic stuff, enough to render it useless. Otherwise, why did they require a PowerPC core on the die as well?
Cell is certainly interesting, and I expect a lot of the performance of it, b
Re:You WANT A Cell System... (Score:5, Insightful)
This analysis is incorrect, because it fails to recognize the fixed point. By sacrificing the out-of-order (OOO) mechanisms (which are brutal for heat production) they gained enough thermal headroom to effectively double the clock rate. In the same thermal envelope, you either get an OOO processor running at 2GHz with three or four issue pathways (three has been the rule under x86) and a very deep pipeline, or you get a processor running at 4GHz with two issue pathways and a relatively short pipeline.
A deep pipeline grants (partial) immunity from stalls and bubbles. A short pipeline grants (partial) immunity from branch misprediction effects. To make the deep pipelines work well, huge investments are required in the branch-prediction unit, which is also infamous for throwing off a lot of heat.
The main Power Processing Element is crippled at best for simple single threaded applications
Fortunately for Cell, this is also the wrong denominator for use in this discussion. Applications might be single threaded, but systems are hardly ever single threaded. While the SPU processors handle audio, video, encryption, block I/O and other compute/bandwidth intensive primitives that most systems engage, they also off-load cache pollution from the main Cell processor threads, both in the data space and in the task scheduling space.
Nothing will ever best the Pentium IV for single thread peak performance with no calorie spared. News flash: Intel has already given up on this flawed approach. The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots.
OOO served its purpose in the era where cycle time was paramount and the processor to cache cycle time ratios were in closer balance. Now that heat has become the limiting factor, we'll be seeing a lot less of that from all parties.
The reality in silicon is that we need to start rethinking those portions of the code base which only perform well under an OOO execution regime.
This can be accomplished at so many different levels. The entire OpenSSL library can be recoded for SPU coprocessors with massive speed gains. Existing code can be recompiled with modern compilers which exploit large register sets to offset lack of hardware-level OOO. Key algorithms in system libraries can be recoded using better algorithms or memory access patterns.
Those of you who insist on putting all your eggs into one 100W single threaded basket, it's time to step off the Moore's law express train. Hope you enjoy the milk run.
Re:You WANT A Cell System... (Score:3, Interesting)
"The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots."
Not the case. Among other things, modern code is highly dependent on memory latency. P4 as of late hasn't even been getting 60% of clock; Opteron gets nearly 95%.
Your whole argument is why Intel developed the Itanium. The idea of producing a simpler CPU that is thermally more efficient is a novel one, but time and again we find that you can't erase th
Re:You WANT A Cell System... (Score:1)
OOO isn't going away... (Score:3, Insightful)
But there's one thing OOO does that these processors will never do. That is efficiently run code that was not properly scheduled.
Now, why would you generate code with the wrong scheduling? Well, you wouldn't do so on purpose. But in the field PCs frequently encounter it. This code is code that was scheduled for a different processor. As instruction latencies, CPU clocks and memory latencies change the optimal instruction order changes.
So on
Re:OOO isn't going away... (Score:1)
And you could make the exact same argument about AMD m
you could make that argument... (Score:2)
P IV was designed to run at 6GHz or something. And gate-delay wise, they could probably do it with minimal changes. Except then it produces so much heat due to transistor switching that it can't be cooled properly.
AMD's chips however, were designed to run at the speeds they are running at. To make them go 4.4GHz would require redesigning them. But yes, they would also be much faster at those speeds.
So, the argument could be made for AMD, but it's not a
Re:You WANT A Cell System... (Score:1)
Note that this dual issue PPE core is a 21 stage pipeline (similar to PIV Northwood), while AMD's K8 is a 12 stage integer and 17 stage FP combo. PPE is not PPC 7447A, nor is it PPC 750FX.
"you either get an OOO processor running at 2GHz with three or four issues pathways (three has been the rule under x86)"
Not quite, since a K8's macro-op instruction (fixed length) is fused from two instructions (one of the instructions must be an address type instruction). K8 issues three macro-op i
Re:You WANT A Cell System... (Score:2)
A minor point, but this probably isn't a sensible thing to do. OpenSSL already supports crypto accelerators, so it would be better to write a kernel module that provided /dev/crypto using an SPU or two (or more, in very high load situations, like an eCommerce server).
Re:You WANT A Cell System... (Score:2)
I saw this quote, and wondered why CPU manufacturers don't create a chip that is flexible. So instead of 8 registers, or 32, or 64, it would allow the programmer to address L1 cache as "registers" and to set aside a variable portion of L1 cache for the program's needs.
Re:You WANT A Cell System... (Score:1)
Re:You WANT A Cell System... (Score:2)
Interestingly enough, it might very well lower performance, rather than improve it, since that makes much, much more state to save during thread switches. It would also just about kill any chance of older programs (even for the same arch) running as well as new ones on each new iteration.
Re:You WANT A Cell System... (Score:1)
Re:You WANT A Cell System... (Score:2)
(offtopic: To whoever has made post #14142136 (sqrt(2)*10000000-rounded-to-nearest integer): please reply, and congratulations. Hopefully you'll get the 31415927th too.)
Re:You WANT A Cell System... (Score:1)
Re:You WANT A Cell System... (Score:5, Informative)
I almost feel drunk from the power I have at my hands
Here's some advice from someone who has access to a REAL CELL chip. I hate to disappoint you but aside from custom libraries specifically optimized for CELL, Linux ain't going to run fast on this machine. All the generic open source code targeted towards the general CPU is going to run faster on a Dual-Core Intel or Dual-Proc/Dual-Core Mac. The actual CPU's in this machine are simple pipelined (think Pentium I level of optimizations) vs current gen CPUs (P4 has out-of-order execution, speculative execution, register renaming, branch prediction, etc). While simple C code runs roughly the same speed, complicated C++ constructs are running 2-10X slower on CELL's simplified PowerPC core versus the G5's you'll find in a Mac.
Code needs to be rewritten specifically to take advantage of the actual SPE/SPU's (Synergistic Processing Engines/Units - I prefer SPE since Sony calls their PS1/PS2 sound chip the SPE). Until those Linux libraries appear, CELL isn't going to run anything faster. Not to mention that it will have to be custom code libraries that DON'T run on the MAIN CPU since the SPE's execute different machine code.
Re:You WANT A Cell System... (Score:2)
For gaming, specifically games with a 3D engine, will the CELL be better than a top of the line P4 or Athlon 64? Let us assume that the entire code has been engineered for every chip. I believe the question that a lot of people have is if the XBOX chip is less powerful than the cell chip in the PS3. Again they want to know if someone wrote World of Warcraft or EQII for both platforms, and optimized both to
Re:You WANT A Cell System... (Score:1)
Re:You WANT A Cell System... (Score:2)
Nintendo has already said that while the Revolution will definitely be an improvement over the GameCube, it won't have the kind of qua
Re:You WANT A Cell System... (Score:2)
Re:You WANT A Cell System... (Score:2)
All of those features were introduced with the Pentium Pro, which was savaged at the time relative to the Pentium (which is far more like the Cell) because the pre-NT Windows codebase ran like crap in that regime (one factor was partial register stalls, but there were many issues). A decade later the compilers and general codebase have become extremely tweaked in the other direction.
After the new code optimization fram
Re:You WANT A Cell System... (Score:3, Informative)
The Pentium Pro ran Windows NT much faster than an equivalent speed Pentium. A lot of the old 16-bit instructions, however, were microcoded rather than being natively executed, and took a few clocks longer. Since much legacy code at the time (games, anything with win16 roots including Windows 95) made use of 16-bit instructions, they ran slower. Comparing Windows NT 4 on a 200MHz Pentium Pro
Re:You WANT A Cell System... (Score:2)
To see loss in the 2-10 range suggests to me that the Cell is blocking on memory loads far more often than it should be, which could be a compiler fault.
Here is a sequence that's hard to handle at the compiler level lacking OOO in hard
Re:You WANT A Cell System... (Score:1)
As far as I know, Sony calls its PS1/PS2 sound chips the SPU and SPU2.
Re:You WANT A Cell System... (Score:2)
Re:You WANT A Cell System... (Score:1)
Re:You WANT A Cell System... (Score:2)
Because it's prerendered on Cell processors, naturally
Re:You WANT A Cell System... (Score:1)
Re:You WANT A Cell System... (Score:2)
Re:You WANT A Cell System... (Score:2)
Apple made an exclusive agreement with Intel.
Half of the stuff Steve Jobs says are lies.
If Apple hadn't become a white box Intel builder, there would be a desktop variant of the Cell processor. They have chosen to bitch about PowerPC and remove from their site legit benchmark results embarrassing to Intel CISC stuff.
They trusted the "cult like" zealotry behind them. They were proven right.
mambo? (Score:1)
Re:mambo? (Score:2, Informative)
Re:mambo? (Score:2)
Where is my workstation! (Score:1)
Re:Where is my workstation! (Score:1)
Re:Where is my workstation! (Score:1)
Re:Where is my workstation! (Score:2)
Re:Where is my workstation! (Score:2, Informative)
Re:Where is my workstation! (Score:1)
Mambo - LOL (Score:1)
Re:Mambo - LOL (Score:1)
It had to be called something. Before, it was based on a previous product called SIM OS for PowerPC®, and we had to have a new name for it when we made it an IBM-only, proprietary tool. So, it was just a name that didn't have the word SIM in it, since there are so many simulators that have 'SIM' in their name. Then, for alphaWorks, we were forced to give it a more docile name. So, on alphaWorks I guess there is a reference that internally we call it Mambo, but it's called the IBM Full-System
Re:Mambo - LOL (Score:3, Funny)
Yes, all of IBM's products are named like that. I mean, every now and again they try to go for something neat and spiffy sounding like "WebSphere", but then they have to munge it all up with "Websphere Application Server" (WAS) and "Websphere Client Technologies Mobile Edition" (WCTME) and so on and so forth. This is normal for IBM, and this is why they really need code-names.
A related s
Re:Mambo - LOL (Score:1)
Only the very best get that designation.
Re:Mambo - LOL (Score:2)
Re:Mambo - LOL (Score:1)
Has anyone else been to www.zombo.com [zombo.com]? The infinite is possible at zombocom! The unattainable is unknown at zombocom! Welcome to ZOMBOCOM!!
LOL it's the most pointless site on the web outside of a good laugh, but the funniest thing is that it's been up for years, I wonder who pays for the hosting?
Re:Mambo - LOL (Score:1)
Praise for Cell (Score:5, Informative)
Some technical details: the SPE's instruction set could be thought of as `Altivec plus'. It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec. Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful to implement techniques like software pipelining which help mask these latencies. The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about it, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.
Re:Praise for Cell (Score:1)
Could you speak more to performance issues when dealing with code/data that exceeds the 256K SPU local store? It looks to me like fetches from RAM are a real bottleneck, so if you want performance you need to keep code/data within each SPU. If you can chain a series of algorithms and move data down the chain this is a win. But if you need to manipulate a huge data block you're SOL. I can see the Cell being a huge win for say a series of Monte Carlo sims running in each SPU, but it looks like a lose on
Re:Praise for Cell (Score:3, Informative)
I'll try, but take my opinion with a grain of salt as I didn't do anything beyond coding an RC5-72 core, which doesn't involve external memory accesses.
Re:Praise for Cell (Score:2)
(I still remember when distributed.net was running RC-56 or something for 8 months on 100k machines, and some people just made some ASICs that had 100+ parallel key pipelines and built a machine that could exhaust the keyspace in 3 days or so...)
So I wouldn't be too optimistic because of that little performance point....
Re:Praise for Cell (Score:2)
Not as much as it seems at first glance. But that's a discussion for another day.
The chips were built by the EFF, and actually they cracked not RC5-56 but DES, which is also a 56-bit-key cipher but far more widespread. Also, by the time of the
Re:Praise for Cell (Score:2)
Using an SPE initialised with an AES decoder and encoder would mean that every single block loaded from or stored to the disk (including the swap file) could be AES encrypted with very little performance penalty. This would be a very nice feature in a laptop, since anyone who stole it would have no way of accessing the original user's files.
Re:Praise for Cell (Score:2)
Mambo? (Score:2)
Amazing Cell Demo (Score:5, Interesting)
http://techon.nikkeibp.co.jp/lsi/images/toshiba_c
http://techon.nikkeibp.co.jp/english/NEWS_EN/2005
Re: (Score:3, Funny)
Re:Amazing Cell Demo (Score:1, Informative)
"First, the applications capture a user's face with a camera and detect the position of key features of the face, including the eyes, nose and mouth, using image recognition technology."
this can be done real time quite effectively right now:
http://citeseer.ist.psu.edu/rd/95418640%2C476373%2 C1%2C0.25%2CDownload/http%3AqSqqSqwww [psu.edu]
Sex Games (Score:2)
x86, x86_64, or PPC best for mambo simulator? (Score:1)
Re:x86, x86_64, or PPC best for mambo simulator? (Score:1)
Re:x86, x86_64, or PPC best for mambo simulator? (Score:1)
As for my plans for the Cell, I was thinking of writing either:
(Ex-IBMer here! Good job guys!).
2.8GHz Athlon 64 (Score:2)
Re:x86, x86_64, or PPC best for mambo simulator? (Score:2, Informative)
Re:x86, x86_64, or PPC best for mambo simulator? (Score:2)
A 1600 MHz G5 can easily count as a 2 GHz P4, for example.
(Don't tell Mr. Jobs about it)
connotations (Score:2)
Obligatory (Score:1)
/sorry
Yeah, but... (Score:1)
Half of Mac community repeats in mind (Score:2)
When an IBM and SORT OF (read: zealots) PowerPC story like this happens, you have to concentrate hard not to think about Mactel.
It is my personal point of view, and I am kind of embarrassed that the whole Mac community became Intel zealots overnight.
Any chance of seeing Cell on a PCI-X card? (Score:3)
As everyone seems to agree that running general-purpose code (e.g. Linux) on a Cell is going to be unpleasant thanks to the dumbing down of the PowerPC at the core, I was wondering what the odds are of seeing this as an add-on for doing vector-friendly operations. While I don't see people rushing out to install a Cell just for the hell of it, what are the chances that e.g. future crypto-offload accelerators or even 3D video cards might use one of these puppies?
Re:Buy where??? (Score:1)
Whats their timeline on this one , 1 light year?
Actually, 1 light year is exactly the same amount of time as 1 tortoise year.
Re:Buy where??? (Score:1)
P.S. Are you the barrapunto.com's McPolu? Nice to see you here.
Re:Buy where??? (Score:1)
Hello, yes, my name is Conner McPolu of the clan McPolu and I was born on the shores of barrapunto in the year of our lord 1524 and I cannot die :P
It looks like barrapunto.com is not enough fun for me while I am waiting for my unit tests to complete ;-)