Panic in Multicore Land 367
MOBE2001 writes "There is widespread disagreement among experts on how best to design and program multicore processors, according to the EE Times. Some, like senior AMD fellow Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks. Others disagree on the grounds that heterogeneous processors would be too hard to program. The only emerging consensus seems to be that multicore computing is facing a major crisis. In a recent EE Times article titled 'Multicore puts screws to parallel-programming models', AMD's Chuck Moore is reported to have said that 'the industry is in a little bit of a panic about how to program multicore processors, especially heterogeneous ones.'"
Should Mimick The Brain (Score:5, Interesting)
Well, the most recent research into how the cortex works has some interesting leads on this. If we assume that the human brain has a pretty interesting organization, then we should try to emulate it.
Recall that the human brain receives a series of pattern streams from each of the senses. These pattern streams are processed first in the most global sense--discovering outlines, for example--in the V1 area of the cortex, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the V1 layer passes its assessment of what it saw the outline of to the next higher cortex layer, V2. Notice that V1 does not pass the raw pattern it receives up to V2. Rather, it passes its interpretation of that pattern to V2. Then, V2 makes a slightly more global assessment, saying that the outline it received from V1 is not only a face but the face of a man it recognizes. That information is then sent up to V4 and ultimately to the IT cortex layer.
The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.
The takeaway is this: each cortex layer does not just do more of the same thing. Instead, it refines the output of the level below it. This kind of hierarchical processing is how multicore processors should be built.
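A minimal sketch of the layered refinement described above, with each stage consuming only the interpretation produced by the stage below it; the stage names and string-based "interpretations" are invented for illustration:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of hierarchical refinement: each layer sees only the
// interpretation produced by the layer below, never the raw input stream.
using Interpretation = std::string;
using Layer = std::function<Interpretation(const Interpretation&)>;

Interpretation run_hierarchy(const std::vector<Layer>& layers,
                             Interpretation raw_input) {
    Interpretation current = std::move(raw_input);
    for (const auto& layer : layers) {
        current = layer(current);  // refine, don't reprocess the raw stream
    }
    return current;
}
```

A V1-like stage might map raw pixels to "outline(...)", and a V2-like stage map that to "face(...)"; only the interpretation crosses each boundary.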
Let's see the menu (Score:4, Interesting)
My heterogeneous experience with Cell processor (Score:5, Interesting)
The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.
The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.
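To make the 256k constraint concrete, here is a hypothetical host-side sketch of the chunking it forces; `LOCAL_STORE` and `make_chunks` are invented names, not part of any Cell SDK, and real code would also reserve local store for code and stack:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: an accelerator core can only see LOCAL_STORE bytes
// at a time, so the host must slice work to fit before dispatching it.
constexpr std::size_t LOCAL_STORE = 256 * 1024;  // 256k per SPE

std::size_t chunk_count(std::size_t total_bytes) {
    return (total_bytes + LOCAL_STORE - 1) / LOCAL_STORE;  // ceiling division
}

// Returns (offset, length) pairs covering [0, total) in local-store-sized pieces.
std::vector<std::pair<std::size_t, std::size_t>> make_chunks(std::size_t total) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t off = 0; off < total; off += LOCAL_STORE)
        chunks.emplace_back(off, std::min(LOCAL_STORE, total - off));
    return chunks;
}
```

On a homogeneous cached architecture none of this bookkeeping exists in the program at all, which is the learning-curve difference the comment is pointing at.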
Well, I'm panicked... (Score:5, Interesting)
Re:Panic? (Score:0, Interesting)
User experience is not a useful metric for performance, unless you consider media encoding, decoding and rendering. 10 years ago I was running a P166; what kind of framerates would I get with a modern game using a software renderer? What kind of framerates would I get decoding an HD video stream?
Do you seriously think a 12-year-old P166 will provide a comparable user experience to a modern 8-core 3GHz machine? You're putting it down to "the OS's that run on them", which is interesting, since user-mode x86 emulation with QEMU runs W2K faster on my laptop than on the hardware I ran it on back in 1999.
he is right, but it depends on the application (Score:5, Interesting)
There is an advantage to a symmetrical platform: you cannot mis-schedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm performance).
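The hazard can be sketched as a toy cost model; the core types, task classes, and the 10x penalty are all invented numbers, purely illustrative:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy cost model of the mis-scheduling hazard on a heterogeneous system.
enum class CoreType { General, Video, Comms };

struct Task {
    std::string name;
    CoreType preferred;  // the core this task runs well on
};

// Relative cost: 1 on the preferred core, an (invented) 10x penalty elsewhere.
// On a homogeneous machine every placement costs the same, which is exactly
// the simplicity the symmetric design buys.
int placement_cost(const Task& t, CoreType core) {
    return (t.preferred == core) ? 1 : 10;
}

// Greedy scheduler: put each task on the cheapest available core type.
CoreType place(const Task& t, const std::vector<CoreType>& cores) {
    CoreType best = cores.front();
    for (CoreType c : cores)
        if (placement_cost(t, c) < placement_cost(t, best)) best = c;
    return best;
}
```

A real scheduler also has to handle the knock-on effect the comment mentions: a mis-placed video task doesn't just run slowly, it blocks the communications work queued behind it.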
Re:My heterogeneous experience with Cell processor (Score:5, Interesting)
And while the Cell architecture is a fairly stationary target because it was incorporated into a commercial gaming console, if these types of architectures were to find their way into general-purpose computing, it would be a real nightmare. Every year or so a new variant of the architecture would come out, introducing a faster interconnect here, more cache memory there, etc., so that one might have to reorganize the division of labor in one's application to take advantage. (Again, a properly parameterized library/framework can sometimes handle this, but only post facto--after the variation in features is known, not before the new features have even been introduced.)
Multithreading is not easy but it's doable (Score:5, Interesting)
When we wrote the OpenAMQ messaging software [openamq.org] in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project [amqp.org].
Today, we're making a new design - ØMQ [zeromq.org], aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.
From one of the ØMQ whitepapers [zeromq.org]:
We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.
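One common way to reduce the cost of inter-thread synchronization, in the spirit of the architecture described above but not taken from ØMQ's actual code, is to batch messages so the shared lock is taken once per batch rather than once per message:

```cpp
#include <cassert>
#include <iterator>
#include <mutex>
#include <utility>
#include <vector>

// Sketch: the sender batches messages privately and takes the shared lock
// once per batch, not once per message; the receiver swaps the whole shared
// queue out under the lock and drains it without holding the lock.
template <typename Msg>
class BatchingQueue {
public:
    // Sender side: no lock taken here.
    void push_local(Msg m) { local_.push_back(std::move(m)); }

    // Sender side: one lock acquisition flushes the whole batch.
    void flush() {
        std::lock_guard<std::mutex> g(lock_);
        shared_.insert(shared_.end(),
                       std::make_move_iterator(local_.begin()),
                       std::make_move_iterator(local_.end()));
        local_.clear();
    }

    // Receiver side: swap out under the lock, then drain lock-free.
    std::vector<Msg> drain() {
        std::vector<Msg> out;
        std::lock_guard<std::mutex> g(lock_);
        out.swap(shared_);
        return out;
    }

private:
    std::vector<Msg> local_;   // sender-private, unsynchronized
    std::vector<Msg> shared_;  // crosses the thread boundary
    std::mutex lock_;
};
```

The point is architectural, not language-level: the number of synchronization events per message, not the choice of language, dominates throughput at these message rates.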
Heterogenous is a natural thing to do (Score:4, Interesting)
This also means that programs will need to be written not just with threads ("which makes it okay for multi-core"), but with CPU cache issues and locality in mind. I think VMs like the JVM, Parrot and
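One concrete cache issue of the kind mentioned above is false sharing between per-thread counters; a minimal sketch, where the 64-byte line size is a common value but not guaranteed by the standard:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Per-thread counters packed next to each other share a cache line, so
// increments from different cores ping-pong that line between caches
// ("false sharing"). Padding each counter out to its own 64-byte line
// avoids the ping-pong.
constexpr std::size_t kCacheLine = 64;

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
    // Explicit padding keeps the next counter off this cache line.
    char pad[kCacheLine - sizeof(std::atomic<long>)];
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter occupies exactly one cache line");
```

An array of `PaddedCounter`, one per thread, lets each core increment its own counter without invalidating its neighbours' cache lines.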
Multicores, but not on a chip (Score:5, Interesting)
because it oversaturates the memory bus, which is easy to remedy by putting the cores on the memory chips, of which there are a number comparable to the number of cores.
In other words, the CPUs will disappear, and there will be lots of smaller core/memory chips, connected in a network. And they will be cheaper as well, because they do not need so high a yield.
Kim0
Help me understand the distinction (Score:3, Interesting)
Re:My heterogeneous experience with Cell processor (Score:5, Interesting)
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version.
Your points are valid as things stand, but isn't it a bit premature to make this judgment? Cell was a fairly radical design departure. If IBM continues to refine Cell, and as more experience is gained, the challenge will likely diminish.
For one thing, IBM will likely add double precision floating point support. But note that SIMD in general poses problems in the traditional handling of floating point exceptions, so it still won't be quite the same as double precision on the PPU.
The local-memory SPE design alleviates a lot of pressure on the memory-coherence front. Enforcing coherence in silicon generates a lot of heat, and heat determines your ultimate performance envelope.
For decades, we programmers have been fortunate: we made our own lives simpler by foisting tough problems onto the silicon. It wasn't a problem until the hardware ran into the thermal wall. No more free lunch. Someone has to pay on one side or the other. IBM recognized this new reality when they designed Cell.
The reason why x86 never died the thousand deaths predicted by the RISC camp is that heat never much mattered. Not enough registers? Just add OOO. Generates a bit more heat to track all the instructions in flight, but no real loss in performance. Bizarre instruction encoding? Just add big complicated decoders and pre-decoding caches. Generates more heat, but again performance can be maintained.
Probably with a software architecture combining the hairy parts of the Postgres query execution planner with the recent improvements in the FreeBSD affinity-centric ULE scheduler, you could make the nastier aspects of SPE coordination disappear. It might help if the SPUs had 512KB instead of 256KB to alleviate code pressure on data space.
I think the big problem is the culture of software development. Most code functions the same way most programmers begin their careers: just dive into the code, specify requirements later. What I mean is that programs don't typically announce the structure of the full computation ahead of time. Usually the code just tells the CPU "do this, now do that, now do this again", and so on. I imagine modern graphics pipelines spell out longer sequences of operations ahead of time, by necessity, but I've never looked into this.
Database programmers wanting good performance from SQL *are* forced to spell things out more fully in advance of firing off the computation, but it doesn't go nearly far enough. Instead of figuring out the best SQL statement, the programmer should send a list of *all* logically equivalent queries and let the database execute the one it finds least troublesome. Problem: sometimes the database engine doesn't know that you have written the query to do things the hard way precisely to avoid hitting a contentious resource on the performance-limiting path.
These are all problems in the area of making OSes and applications more introspective, so that resource scheduling can be better automated behind the scenes, by all those extra cores with nothing better to do.
Instead, we make the architecture homogeneous, so that resource planning makes no real difference, and we can thereby sidestep the introspection problem altogether.
I've always wondered why no one has ever designed a file system where all the unused space is used to duplicate other disk sectors/blocks, to create the option of vastly faster seek plans. Probably because it would take a full-time SPU to constantly recompute the seek plan as old requests complete and new requests enter the queue. Plus, if two supposedly identical copies managed to diverge, it would be a nightmare to debug, because the copy you get back would be non-deterministic. Hybrid MRAM/Flash/spindle storage systems could get very interesting.
I guess I've been looking forward to the end of artificial scaling for a long time (clock freq. as the
Re:Languages (Score:4, Interesting)
Every object (in the general sense, not necessarily the OO sense) may be either aliased or mutable, but not both.
Erlang does this by making sure no objects are mutable. This route favours the compiler writer (since it's easy) and not the programmer. I am a huge fan of the CSP model for large projects, but I'd rather keep something closer to the OO model in the local scope and use something like CSP in the global scope (which is exactly what I am doing with my current research).
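The aliased-xor-mutable discipline can be approximated in C++ (the discipline itself, not any particular language's enforcement of it): build through a unique mutable handle, then freeze into a shared immutable one. `Config` and `freeze` are invented for illustration:

```cpp
#include <cassert>
#include <memory>
#include <string>

// A value is built through a unique, mutable handle (unaliased + mutable),
// then frozen into a shared, immutable one (aliased + immutable). At no
// point is it both aliased and mutable.
struct Config {
    std::string host;
    int port = 0;
};

// Freezing consumes the unique handle, so no mutable alias survives;
// from here on the object may be freely aliased but never mutated.
std::shared_ptr<const Config> freeze(std::unique_ptr<Config> cfg) {
    return std::shared_ptr<const Config>(std::move(cfg));
}
```

Unlike Erlang's everything-immutable route, this keeps full in-place mutation available in the local scope and only gives it up at the point where the object becomes shared.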
One Fast Core, Multiple Commodity ones (Score:3, Interesting)
This helps get around the yield issues of getting all cores to work at a very high frequency, and the related thermal issues. This could be a boon to general-purpose computers that have a mix of hard-to-multi-thread and easy-to-multi-thread programs--assuming the OS could be intelligent about which cores the tasks are scheduled on. The cores might or might not have the same instruction set, but having the same instruction set would be the easy first step.
Re:My heterogeneous experience with Cell processor (Score:0, Interesting)
Next year we might even get the same processor with 32 SPUs--still hard to program, but it means 8MB total of data on the chip, which opens some opportunities (unfortunately the memory size per SPU seems to be set in stone at 256kB).
I'm interested in programming the Cell for doing some signal processing, most of which will be single precision FFT, an application where it seems to rock. I think that the data flow between SPU is relatively easy to organize for my purpose. OTOH, it seems nobody wants to sell bare Cell chips, which is sad, since I would love to try to interface it to high speed (1-2Gsamples/s) ADCs.
Re:My heterogeneous experience with Cell processor (Score:1, Interesting)
The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.
A lot of this is just due to the lack of a good platform: there is nothing that prevents demand paging of the data SPEs need, and the missing C++ features are just down to the current implementation of the runtime. I will agree that the Cell is aimed a little too much at the video and game markets in its current implementation, though. Think of it as a first step: if they can make it successful, and some better platforms and tools materialize, then imagine having 64 SPEs with different groups of specialized functions--perhaps some aimed at linear algebra, some at more analytical work, etc. Some kind of parallel multiprocessing is the future, that much is a given; it's just a matter of figuring out the right model.
Re:Should Mimick The Brain (Score:3, Interesting)
So while there's nothing wrong in looking at our radically imperfect understanding of the brain, which is in no better state than pre-flight understanding of bird aerodynamics, it is optimistic to expect that it will provide much guidance in building programmes for multi-core processors, or for building those processors themselves. Neural networks, the most famously brain-like system architecture, are famously hard to "programme" (train) and essentially impossible to debug (interpret).
The article suggests that heterogeneous multi-core architectures may be best represented to the programmer as a set of heterogeneous APIs, much as graphics-specific APIs are now. While this is vaguely consistent with the idea that "different parts of the brain do different things", I don't think the brain analogy brings anything useful to the table, and past experience should make us very wary of trying to draw any deeper inferences from it. Aeroplanes do look vaguely like birds, but that doesn't mean we should dispense with vertical stabilisers...
One could equally well argue that neurons in the brain are fairly homogeneous, and each core could be considered a neuron. We know that different parts of the brain are remarkably adaptable: stroke patients often regain function because other parts of the brain take over from the bits that were destroyed. So on this analogy, a homogeneous set of processors that can be adapted to multiple tasks is the way to go.
Demonstrating fairly conclusively that the brain analogy is pretty much useless, as it can be manipulated to appear to support whichever side of the debate you've already decided is the right one.
Re:My heterogeneous experience with Cell processor (Score:3, Interesting)
OK, I have to ask - why on earth can't you use C++ exceptions on them?
After all, what is an exception? It's basically syntactic sugar around setjmp()/longjmp(), but with a bit more code to make sure the stack unwinds properly and destructors are called, instead of longjmp() being a plain non-local goto.
What else is there that makes C++ exceptions unimplementable?
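For what it's worth, the difference from setjmp()/longjmp() is observable: a thrown exception runs destructors during stack unwinding, while longjmp() just restores registers and skips them (in C++ it is undefined behaviour to longjmp past an object with a non-trivial destructor). A small sketch of the unwinding side:

```cpp
#include <cassert>

// A throw unwinds the stack and runs destructors on the way out;
// the counter below observes exactly that.
int destructor_runs = 0;

struct Tracked {
    ~Tracked() { ++destructor_runs; }
};

void throws_through_object() {
    Tracked t;  // must be destroyed during stack unwinding
    throw 42;
}

int catch_and_count() {
    try {
        throws_through_object();
    } catch (int e) {
        return e;  // by the time we get here, ~Tracked() has run exactly once
    }
    return 0;
}
```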
Re:Panic? (Score:-1, Interesting)
Functional programming makes a lot of common problems mindblowingly easy, especially when you can switch to an imperative style where it makes sense with no loss of clarity (such as the web app I'm currently deploying, where obviously some user-state would need to be managed). The only close-to-legit argument against Scheme I've heard is its lack of OO, but as Graham said, "Object-oriented programming is exciting if you have a statically-typed language without lexical closures or macros. To some degree, it offers a way around these limitations."
Then again, it is of course only my *opinion* that most applications make more sense in a functional style, with a bit of imperative code where you're actually dealing with the state of something the application relates to. If you've spent your life in C++/Java/etc. land and it's just too big a change, there's also Common Lisp, which is basically the opposite: primarily imperative, with a fucking brilliant OO system and a package system for the namespace separation that imperative code often needs to keep developers from clobbering each other's code--but functional style is also right there when you need it. (Of course, if that were my thing and I knew the project wouldn't be complicated enough that their simplistic OO could cleanly substitute for macros, I'd probably look at Python or Ruby...)
Anyway, many would argue that a "simple procedural language" is more often than not a huge pain in the ass leading to silly things like "design patterns" and rampant misuse of OO where all you needed was a couple first class procedures or a closure or a macro, just as a "pure" functional language makes you deal with silly bullshit like monads and streams when all you needed was to set a variable somewhere. Such is the nature of extremes. Neither have much place in the future.
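The "closure instead of a strategy class" point above can be shown in a few lines; the pricing example is invented:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A first-class procedure doing the work of a ceremonial OO "strategy"
// class: the discount policy is just a closure capturing its parameter,
// with no interface, subclass, or factory in sight.
using Pricer = std::function<double(double)>;

Pricer percent_off(double pct) {
    return [pct](double price) { return price * (1.0 - pct / 100.0); };
}

double total(const std::vector<double>& prices, const Pricer& p) {
    double sum = 0.0;
    for (double x : prices) sum += p(x);
    return sum;
}
```

The imperative accumulation inside `total` and the functional policy passed into it coexist without ceremony, which is the middle ground the comment is arguing for.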
Re:Panic? (Score:2, Interesting)
Re:Multithreading is not easy but it's doable (Score:4, Interesting)
- eJabberd latency is in the 10-50msec range. 0MQ gets latencies of around 25 microseconds.
- eJabberd supports more than 10k users. 0MQ will support more than 10k users.
- eJabberd scales transparently thanks to Erlang. 0MQ squeezes so much out of one box that scaling is less important.
- eJabberd has high availability thanks to Erlang; 0MQ will have to build its own HA model (as OpenAMQ did).
- eJabberd can process (unknown?) messages per second. 0MQ can handle 100k per second on one core.
Sorry if I got some things wrong, ideally we'd run side-by-side tests to get figures that we can properly compare.
Note that protocols like AMQP can be elegantly scaled at the semantic level, by building federations that route messages usefully between centers of activity. This cannot be done in the language or framework; it depends on the protocol semantics. This is how very large deployments of OpenAMQ work--much as SMTP networks do, I guess.
0MQ will, BTW, speak XMPP one day. It's more a framework for arbitrary messaging engines and clients, than a specific protocol implementation.
I've seen Erlang used for AMQP as well - RabbitMQ - and by all accounts it's an impressive language for this kind of work.
Re:Panic? (Score:3, Interesting)
This also comes down to bloat: programmers have largely stopped optimizing code. Thus there are more lines of code in delivered software, often with more and more abstraction layers, which doesn't help either. So the overall effect is that the software takes longer to do the same function.
In the end, despite the increase in processing power, the programs run as slow as or slower than before. There are numerous reasons for it. The GP of my original post in this thread is still correct.
Heterogeneous is the key word! (Score:3, Interesting)
As long as all these cores share the same basic architecture (i.e. x86, Power, ARM), it would be possible to allow all general-purpose code to run on any core, while some tasks would be able to ask for a core with special capabilities. Or the OS could simply detect (by trapping) that a given task was using a non-uniform resource like vector fp, mark it for the scheduler, and restart it on a core with the required resource.
An OS interrupt handler could run better on a short-pipeline in-order core, and a graphics driver could use something like Larrabee, while SPECfp (or anything else that needs maximum performance from a single thread) would run best on an out-of-order core like the current Core 2.
The first requirement is that Intel/AMD must develop the capability to test & verify multiple different cores on the same chip; the second is that Microsoft must improve their OS scheduler to the point where it actually understands NUMA principles, not just for memory but also for cpu cores. (I have no doubt at all that Linux and *BSD will have such a scheduler available well before the time you & I can buy a computer with such a cpu in it!)
So why do I believe that such cpus are inevitable?
Power efficiency!
A relatively simple in-order core like the one that Intel just announced as Atom delivers maybe an order of magnitude better performance/watt than a high-end Core 2 Duo. With 16 or 80 or 256 cores on a single chip, this will become really crucial.
Terje
PS As other posters have noted, keeping tomorrow's multi-core chips fed will require a lot of bandwidth, and this is neither free nor low-power.
Re:Specialisation is inevitable (Score:3, Interesting)
Trouble is, filling four cores is quite a bit more iffy.
Re:Panic? (Score:3, Interesting)
1) Is there a programming language that tries to make programming for multiple cores easier?
2) Is programming for parallel cores the same as parallel programming?
3) Is anybody aware of anything in this direction on the C++ front that does not rely on OS APIs?