Panic in Multicore Land
Posted by
Zonk
on Tue Mar 11, 2008 06:30 AM
from the multi-cores-no-waiting dept.
MOBE2001 writes "There is widespread disagreement among experts on how best to design and program multicore processors, according to the EE Times. Some, like senior AMD fellow, Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks. Others disagree on the ground that heterogeneous processors would be too hard to program. The only emerging consensus seems to be that multicore computing is facing a major crisis. In a recent EE Times article titled 'Multicore puts screws to parallel-programming models', AMD's Chuck Moore is reported to have said that 'the industry is in a little bit of a panic about how to program multicore processors, especially heterogeneous ones.'"
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Panic? (Score:4, Insightful)
Re:Panic? (Score:4, Insightful)
Parent
Re:Panic? (Score:5, Insightful)
"...the speed the user experiences has not improved much [in the last 5-7 years]."
This may almost be true if you stay on the cutting edge, but not even close for the average user (or the power-user on a budget, like myself). 5 years ago I was running a 1.2 GHz Duron. Today I have a 2.3 GHz Athlon 64 in my notebook (which is a little over a year old, I think), and an Athlon 64 X2 5600+ (that's a dual-core 2.8 GHz, for those who don't know) in my desktop. I'd be lying if I said I didn't notice much difference between the three.
Parent
Re:Panic? (Score:4, Insightful)
"...the speed the user experiences has not improved much [in the last 5-7 years]."
This may almost be true if you stay on the cutting edge, but not even close for the average user (or the power-user on a budget, like myself). 5 years ago I was running a 1.2 GHz Duron. Today I have a 2.3 GHz Athlon 64 in my notebook (which is a little over a year old, I think), and an Athlon 64 X2 5600+ (that's a dual-core 2.8 GHz, for those who don't know) in my desktop. I'd be lying if I said I didn't notice much difference between the three.
Do notice that multi-cores don't increase the overall clock frequency, just divide the work up among a set of lower clock frequency cores - yet most programs don't take advantage of that.
Do notice that despite clock frequencies going from 33 MHz to 2.3 GHz, the user's perceived performance of the computer has either stayed the same (most likely) or diminished over that same time period.
Do notice that programs are more bloated than ever, and programmers are lazier than ever.
In the end the GP is right.
Parent
Re:Panic? (Score:4, Insightful)
And yes, the OS can, and has been able to for years since SMP first came about, spread loads across multiple processors and cores. But that cannot change how a single program functions in and of itself - it cannot make that single program work at any given moment on more than one single core if it was not designed to do so (i.e. if the program is not designed to use multiple threads or processes).
All-in-all, the OP is correct.
Parent
Re:Panic? (Score:5, Informative)
Parent
Multicores, but not on a chip (Score:5, Interesting)
because it over saturates the memory bus, which is easy to remedy by
putting the cores on the memory chips, of which there are a number
comparable to the number of cores.
In other words, the CPUs will disappear, and there will be lots of smaller
core/memory chips, connected in a network. And they will be cheaper as well,
because they do not need so high a yield.
Kim0
Parent
Re:Multicores, but not on a chip (Score:5, Informative)
Parent
Re:Panic? (Score:5, Insightful)
I write large and complex engineering applications. I have a few threads around, mostly for the purpose of doing calculation and dealing with slow devices. But I'm not going to add in more threads just because there are more cores for me to use. I'll add threads when performance issues require that I add threads, and not before.
Most software today runs fine as a single thread anyway. The specialized software that requires maximum CPU performance (and is not already bottle-necked by HD or GPU access) will be harder to write, but for everything else the current model is just fine.
If anything, Intel should worry about 99% of all people simply not needing 80 cores to begin with...
Parent
Re:Panic? (Score:4, Insightful)
Parent
Re:Panic? (Score:4, Funny)
Parent
Re:Panic? (Score:5, Insightful)
Parent
Re:Panic? (Score:5, Insightful)
It is in general, an impossible problem.
Most existing code is imperative. Most programmers write in imperative programming languages. Object orientation does not change this. Imperative code is not suited for multiple CPU implementation. Stapling things together with threads and messaging does not change this.
You could say that we should move to other programming "paradigms". However, in my opinion, the reason we use imperative programs so much is that most of the tasks we want accomplished are inherently imperative in nature. Outside of intensive numerical work, most tasks people want done on a computer are done sequentially. The availability of multiple cores is not going to change the need for these tasks to be done in that way.
However, what multiple cores might do is enable previously impractical tasks to be done on modest PCs. Things like NP problems, optimizations, simulations. Of course these things are already being done, but not on the same scale as things like, say, spreadsheets, video/sound/picture editing, gaming, blogging, etc. I'm talking about relatively ordinary people being able to do things that now require supercomputers, experimenting and creating on their own laptops. Multi core programs can be written to make this feasible.
Considering I'm beginning to sound like an evangelist, I'll stop now. Safe money says PCs stay at 8 CPUs or below for the next 15 years.
Parent
Re:Panic? (Score:5, Insightful)
Odd selection of examples. The processing of cells can almost trivially be allocated across 80 cores. Media work can almost trivially be split into chunks across 80 cores. Games are usually relatively easy to split, either by splitting the graphics into chunks or by parallelizing physics or other simulation aspects.
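To make the "split into chunks" claim concrete, here is a minimal sketch, assuming the per-cell work is independent (no cell reads another cell's result). The names `split_into_chunks`, `recalc_chunk` and `recalc_all` are made up for illustration, and squaring a number stands in for whatever a real cell or frame computation would be.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal slices."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def recalc_chunk(cells):
    """Stand-in for per-cell work: each result depends only on its own input."""
    return [value * value for value in cells]


def recalc_all(cells, workers, executor_cls=ProcessPoolExecutor):
    """Hand one chunk to each worker and stitch the results back in order."""
    chunks = split_into_chunks(cells, workers)
    with executor_cls(max_workers=workers) as pool:
        results = pool.map(recalc_chunk, chunks)
        return [value for chunk in results for value in chunk]
```

With the default `ProcessPoolExecutor` the chunks can run on separate cores; passing `ThreadPoolExecutor` keeps the same structure but, in CPython, will not speed up CPU-bound work. The easy part is exactly what this sketch assumes away: as soon as cells depend on each other, the chunks are no longer independent.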
Oh, and blogging.
My optical mouse has enough processing horsepower inside for blogging.
OPTICAL MOUSE CIRCUITRY:
Has the user pressed a key?
No.
Has the user pressed a key?
No.
Has the user pressed a key?
No.
(repeat 1000 times)
Has the user pressed a key?
No.
Has the user pressed a key?
No.
Has the user pressed a key?
Yes.
OOOO! YES!
QUICK QUICK QUICK! HURRY HURRY HURRY! PROCESS A KEYPRESS! YIPEE!
-
Parent
Re:Panic? (Score:4, Insightful)
Parent
Re:Panic? (Score:5, Funny)
Parent
Re:Panic? (Score:5, Insightful)
Parent
No problems for servers (Score:5, Insightful)
For most typical workloads most servers don't have enough I/O to keep 80 cores busy.
If there's enough I/O there's no problem keeping all 80 cores busy.
Imagine a slashdotted webserver with a database backend. If you have enough bandwidth and disk I/O, you'll have enough concurrent connections that those 80 cores will be more than busy enough.
If you still have spare cores and mem, you can run a few virtual machines.
As for desktops - you could just use Firefox without noscript; after a few days the machine will be using all 80 CPUs and memory just to show flash ads and other junk.
Parent
Re:Panic? (Score:5, Informative)
Parent
Re:Panic? (Score:5, Informative)
And frankly, it helps a lot to write code that is microprocessor-friendly to begin with:
If the node-code is bad enough, it can make any parallelism look good to the user. But writing good node-code is hard;-( As a reviewer, I have recommended rejection for a parallel-processing paper that claimed 80% parallel efficiency on 16 processors for the author's air-quality model. But I knew of a well-coded equivalent model that outperformed the paper's 16-processor model-result on a single processor -- and still got 75% efficiency on 16 processors (better than 10x the paper-author's timing).
fwiw.
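The numbers in the comment above can be checked with a few lines of arithmetic. This is a sketch that assumes the usual definition (parallel efficiency = speedup / processor count) and takes the worst case for the well-coded model, namely that its single-processor run is only just as fast as the paper's 16-processor run. Times are normalised so the paper's 16-processor run takes 1.0 units; all variable names are made up here.

```python
def speedup(efficiency, processors):
    """Efficiency is speedup divided by processor count, so invert that."""
    return efficiency * processors


PROCESSORS = 16

# Normalise: the paper's model on 16 processors takes 1.0 time units.
paper_time_16 = 1.0
paper_time_1 = paper_time_16 * speedup(0.80, PROCESSORS)

# Worst case for the well-coded model: on ONE processor it merely ties
# the paper's 16-processor run (the comment says it actually beat it).
tuned_time_1 = paper_time_16
tuned_time_16 = tuned_time_1 / speedup(0.75, PROCESSORS)

# How much faster the well-coded model is, 16 processors against 16.
advantage_16 = paper_time_16 / tuned_time_16

# How much faster the well-coded model is, 1 processor against 1.
advantage_1 = paper_time_1 / tuned_time_1
```

Under those assumptions the well-coded model is about 12 times faster on 16 processors, which fits the "better than 10x" figure, and the paper's code is roughly 12.8 times slower on a single processor. A high efficiency number says nothing about how fast the serial baseline was.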
Parent
Self Interest (Score:3, Informative)
So take it all with a grain of salt
--Q
Re:Self Interest (Score:5, Informative)
Parent
Re: (Score:3, Insightful)
If he's saying that his multicore processors are going to be hard to program, then self-interest suggests he be very very quiet (;-))
Seriously, though, adding what used to be a video board to the CPU doesn't change the programming model. I suspect he's more interested in debating future issues with more tightly coupled processors.
--dave
Should Mimick The Brain (Score:5, Interesting)
Well, the most recent research into how the cortex works has some interesting leads on this. If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.
Recall that the human brain receives a series of pattern streams from each of the senses. These pattern streams are in turn processed in the most global sense--discovering outlines, for example--in the v1 area of the cortex, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the v1 cortex layer passes its assessment of the outline it saw to the next higher cortex layer, v2. Notice that v1 does not pass the raw pattern it receives up to v2. Rather, it passes its interpretation of that pattern to v2. Then, v2 makes a slightly more global assessment, saying that the outline it received from v1 is not only a face but a face of a man it recognizes. Then, that information is sent up to v4 and ultimately to the IT cortex layer.
The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.
The takeaway is this: each cortex layer does not just do more of the same thing. Instead, it does a refinement of the level below it. This type of hierarchical processing is how multicore processors should be built.
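As a rough software analogy for the layering described above, here is a sketch of a pipeline in which every stage runs in its own thread and passes only its interpretation, never the raw input, to the next stage. It is an illustration of the structure, not a model of the cortex; `find_outline` and `recognise` are invented stand-ins for the v1 and v2 layers, and in CPython the threads show the shape of the design rather than real CPU parallelism.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_stage(refine, inbox, outbox):
    """Apply refine() to each item from inbox and forward only the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(refine(item))


def run_hierarchy(stages, inputs):
    """Chain stages with queues, one thread per stage; return final outputs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in inputs:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


def find_outline(pixels):
    """Hypothetical 'v1': raw pattern in, coarse outline out."""
    return {"outline": "face" if sum(pixels) > 10 else "unknown"}


def recognise(outline):
    """Hypothetical 'v2': sees only the outline, never the raw pixels."""
    return {"object": outline["outline"], "known": outline["outline"] == "face"}
```

Calling `run_hierarchy([find_outline, recognise], frames)` keeps every stage busy on a different item at the same time, which is the property that would map onto specialised cores.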
Re:Should Mimick The Brain (Score:5, Funny)
I think it's pretty obvious there are serious design flaws in the human brain. And I'm not only talking about stability, but also reliability and accuracy.
Just look at the world.
Parent
Let's see the menu (Score:4, Interesting)
Re:Let's see the menu (Score:5, Funny)
Parent
Not *that* Chuck Moore (Score:5, Informative)
Re:Not *that* Chuck Moore (Score:5, Funny)
Parent
The future is here (Score:5, Insightful)
My heterogeneous experience with Cell processor (Score:5, Interesting)
The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.
The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.
Re:My heterogeneous experience with Cell processor (Score:5, Interesting)
And while the Cell architecture is a fairly stationary target because it was incorporated into a commercial gaming console, if these types of architectures were to find their way into general purpose computing, it would be a real nightmare, since every year or so a new variant of the architecture would come out that would introduce a faster interconnect here, more cache memory there, etc., so that one might have to reorganize the division of labor in one's application to take advantage (again a properly parameterized library/framework can handle this sometimes, but only post facto--after the variation in features is known, not before the new features have even been introduced).
Parent
Re:My heterogeneous experience with Cell processor (Score:4, Insightful)
Parent
Re:My heterogeneous experience with Cell processor (Score:5, Interesting)
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version.
Your points are valid as things stand, but isn't it a bit premature to make this judgment? Cell was a fairly radical design departure. If IBM continues to refine Cell, and as more experience is gained, the challenge will likely diminish.
For one thing, IBM will likely add double precision floating point support. But note that SIMD in general poses problems in the traditional handling of floating point exceptions, so it still won't be quite the same as double precision on the PPU.
The local-memory SPE design alleviates a lot of pressure on the memory coherence front. Enforcing coherence in silicon generates a lot of heat, and heat determines your ultimate performance envelope.
For decades, programmers have been fortunate in making our own lives simpler by foisting tough problems onto the silicon. It wasn't a problem until the hardware ran into the thermal wall. No more free lunch. Someone has to pay on one side or the other. IBM recognized this new reality when they designed Cell.
The reason why x86 never died the thousand deaths predicted by the RISC camp is that heat never much mattered. Not enough registers? Just add OOO. Generates a bit more heat to track all the instructions in flight, but no real loss in performance. Bizarre instruction encoding? Just add big complicated decoders and pre-decoding caches. Generates more heat, but again performance can be maintained.
Probably with a software architecture combining the hairy parts of the Postgres query execution planner with the recent improvements in the FreeBSD affinity-centric ULE scheduler, you could make the nastier aspects of SPE coordination disappear. It might help if the SPUs had 512KB instead of 256KB to alleviate code pressure on data space.
I think the big problem is the culture of software development. Most code functions the same way most programmers begin their careers: just dive into the code, specify requirements later. What I mean here is that programs don't typically announce the structure of the full computation ahead of time. Usually the code goes to the CPU "do this, now do that, now do this again, etc." I imagine the modern graphics pipelines spell out longer sequences of operations ahead of time, by necessity, but I've never looked into this.
Database programmers wanting good performance from SQL *are* forced to spell things out more fully in advance of firing off the computation. It doesn't go nearly far enough. Instead of figuring out the best SQL statement, the programmer should send a list of *all* logically equivalent queries and just let the database execute the one it finds least troublesome. Problem: sometimes the database engine doesn't know that you have written the query to do things the hard way to avoid hitting a contentious resource that would greatly impact the performance limiting path.
These are all problems in the area of making OSes and applications more introspective, so that resource scheduling can be better automated behind the scenes, by all those extra cores with nothing better to do.
Instead, we make the architecture homogeneous, so that resource planning makes no real difference, and we can thereby sidestep the introspection problem altogether.
I've always wondered why no-one has ever designed a file system where all the unused space is used to duplicate other disk sectors/blocks, to create the option of vastly faster seek plans. Probably because it would take a full-time SPU to constantly recompute the seek plan as old requests are completed and new requests enter the queue. Plus if two supposedly identical copies managed to diverge, it would be a nightmare to debug, because the copy you get back would be non-deterministic. Hybrid MRAM/Flash/spindle storage systems could get very interesting.
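A toy version of the duplicated-block idea, to show where the gain would come from. This is a sketch under loud assumptions: disk positions are plain integers, seek cost is the distance the head travels, and the planner is greedy (it serves requests in order and just picks the nearest copy). `plan_seeks` and the sample layout are invented for illustration; a real planner would also reorder the queue.

```python
def plan_seeks(head, requests, replicas):
    """Serve requests in order, reading whichever copy is closest to the head.

    replicas maps a block name to the list of positions holding a copy.
    Returns the positions visited and the total distance travelled.
    """
    plan, travelled = [], 0
    for block in requests:
        target = min(replicas[block], key=lambda pos: abs(pos - head))
        travelled += abs(target - head)
        plan.append(target)
        head = target
    return plan, travelled


# Invented layout: block "a" has a spare copy near block "b".
with_copies = {"a": [100, 900], "b": [880]}
without_copies = {"a": [100], "b": [880]}
```

Starting with the head at position 850 and requesting "a" then "b", the duplicated layout travels 70 units against 1530 without the spare copy. The cost the comment points out is real too: the plan has to be recomputed every time the queue changes, and every copy has to stay identical.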
I guess I've been looking forward to the end of artificial scaling for a long time (clock freq. as the
Parent
Re:My heterogeneous experience with Cell processor (Score:4, Insightful)
You're not the only person using heterogeneous cores, however. In fact, the Cell is a minority. Most people have a general purpose core, a parallel stream processing core that they use for graphics and an increasing number have another core for cryptographic functions. If you've ever done any programming for mobile devices, you'll know that they have been using even more heterogeneous cores for a long time because they give better power usage.
Parent
Well, I'm panicked... (Score:5, Interesting)
he is right, but it depends on the application (Score:5, Interesting)
There is an advantage to a symmetrical platform: you cannot misschedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm. performance).
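The scheduling point can be put in numbers with a small sketch. The cost tables below are made up (relative run times, lower is better, not measurements), and the model is deliberately crude: it only adds up run times and ignores the knock-on waiting the comment mentions. `spread` reports the gap between the worst and the best placement, i.e. the price of a scheduling mistake.

```python
from itertools import permutations

# Made-up relative run times (lower is better); illustrative, not measured.
HETEROGENEOUS = {
    ("video", "video_core"): 1, ("video", "comm_core"): 8,
    ("comm", "video_core"): 4, ("comm", "comm_core"): 1,
}
SYMMETRIC = {
    ("video", "core_a"): 3, ("video", "core_b"): 3,
    ("comm", "core_a"): 2, ("comm", "core_b"): 2,
}


def total_time(cost, assignment):
    """Sum the cost of running each job on the core it was given."""
    return sum(cost[(job, core)] for job, core in assignment.items())


def all_assignments(jobs, cores):
    """Every way to give each job its own core."""
    return [dict(zip(jobs, order)) for order in permutations(cores, len(jobs))]


def spread(cost, jobs, cores):
    """Worst placement minus best placement for this cost table."""
    times = [total_time(cost, a) for a in all_assignments(jobs, cores)]
    return max(times) - min(times)
```

On the symmetric table the spread is zero, so the scheduler cannot get it wrong. On the heterogeneous table the best placement is faster than anything the symmetric machine offers, but the worst one is far slower, so the scheduler now has to know what each job is.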
+1 Optimistic (Score:4, Funny)
I don't think we'll be Slashdotting your server any time soon, CBravo
Parent
Multithreading is not easy but it's doable (Score:5, Interesting)
When we wrote the OpenAMQ messaging software [openamq.org] in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project [amqp.org].
Today, we're making a new design - ØMQ [zeromq.org], aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.
From one of the ØMQ whitepapers [zeromq.org]:
We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.
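One generic way to cut the cost of inter-thread synchronization is to hand messages over in batches, so that one queue operation moves many messages. The sketch below illustrates only that general idea; it is not taken from the ØMQ code and makes no claim about how ØMQ does it. The function names and the batch sizes are assumptions chosen for the example.

```python
import queue
import threading


def produce(messages, outbox, batch_size):
    """Send messages in batches; return how many queue hand-offs were needed."""
    handoffs = 0
    for start in range(0, len(messages), batch_size):
        outbox.put(messages[start:start + batch_size])
        handoffs += 1
    outbox.put(None)  # end-of-stream marker
    return handoffs


def consume(inbox, received):
    """Drain batches until the end-of-stream marker arrives."""
    while True:
        batch = inbox.get()
        if batch is None:
            return
        received.extend(batch)


def transfer(messages, batch_size):
    """Move messages from a producer to a consumer thread through one queue."""
    channel = queue.Queue()
    received = []
    consumer = threading.Thread(target=consume, args=(channel, received))
    consumer.start()
    handoffs = produce(messages, channel, batch_size)
    consumer.join()
    return received, handoffs
```

Moving 1000 messages one at a time takes 1000 synchronized hand-offs; with a batch size of 100 it takes 10, at the price of some added latency for the first message in each batch.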
Re:Multithreading is not easy but it's doable (Score:4, Interesting)
- eJabberd latency is in the 10-50msec range. 0MQ gets latencies of around 25 microseconds.
- eJabberd supports more than 10k users. 0MQ will support more than 10k users.
- eJabberd scales transparently thanks to Erlang. 0MQ squeezes so much out of one box that scaling is less important.
- eJabberd has high-availability thanks to Erlang. 0MQ will have to build its own HA model (as OpenAMQ did).
- eJabberd can process (unknown?) messages per second. 0MQ can handle 100k per second on one core.
Sorry if I got some things wrong, ideally we'd run side-by-side tests to get figures that we can properly compare.
Note that protocols like AMQP can be elegantly scaled at the semantic level, by building federations that route messages usefully between centers of activity. This cannot be done in the language or framework, it is dependent on the protocol semantics. This is how very large deployments of OpenAMQ work. I guess the same as SMTP networks.
0MQ will, BTW, speak XMPP one day. It's more a framework for arbitrary messaging engines and clients, than a specific protocol implementation.
I've seen Erlang used for AMQP as well - RabbitMQ - and by all accounts it's an impressive language for this kind of work.
Parent
Heterogenous is a natural thing to do (Score:4, Interesting)
This also means that programs will need to be written not just by using threads, "which makes it okay for multi-core", but with cpu cache issues and locality in mind. I think VMs like JVM, Parrot and
Specialisation is inevitable (Score:3, Insightful)
Flick the CPU monitor to aggregate usage rate mode, and I rarely clear 35% usage, and I've never seen it higher than about 55% (and even that for only a second or two once an hour). A normal PC, even fairly heavily loaded up with apps, just can't use the extra power.
And since cores aren't going to get much faster, there's no real chance of getting big wins there either.
Unless you have a specialized workload (heavy number crunching, kernel compilation, etc) there's going to simply be no point having more parallelism.
So as far as I can tell, for general loads it seems to be inevitable that if we want more straight line speed, we'll need to start making hardware more attuned for specific tasks.
So in my 16-core workstation of the future, if my Photoshop needs to apply some relatively intensive transform that has to be applied linearly, it can run off to the vector core, while I'm playing Supreme Commander on one generic core (the game) two GPU cores (the two screens) and three integer-heavy cores (for the 3 enemy AIs), and the generic System Reserved Core (for interrupts, and low-level IO stuff) hums away underneath with no pressure.
Heterogeneity also has economics on its side.
There's very little point having specialized cores when you've only got two.
Once there's no longer scarcity in quantity, you can achieve higher productivity by specialization.
Really, any specialized core that you can keep the CPU usage rates running higher than the overall system usage rate, is a net win in productivity for the overall computer. And over time, anything that increases productivity wins.
Occam and Beyond (Score:3, Insightful)
[Although, personally, I prefer Occam's syntax over that of C's.]
http://en.wikipedia.org/wiki/Occam_programming_language [wikipedia.org]
I think that a thread-aware programming language would be good in our multi-core world.
Help me understand the distinction (Score:3, Interesting)
better idea (Score:3, Funny)
P.S.: I know why this is impossible, so please don't flame me.
Current state of software development (Score:5, Funny)
Ugg can program a CPU.
Two Uggs can program two CPUs.
Two Uggs working on the same task program two CPUs.
Uggs' program has a race condition.
Ugg1 thinks, it's Ugg2's fault.
Ugg2 thinks, it's Ugg1's fault.
Ugg1 hits Ugg2 on the head with a rock.
Ugg2 hits Ugg1 on the head with an axe.
Ugg1 is half as smart as he was before working with Ugg2.
Ugg2 is half as smart as he was before working with Ugg1.
Both Uggs now write broken code.
Uggs' program is now slow, wrong half the time, and crashes on that race condition once in a while.
Ugg does not like parallel computing.
Ugg will bang two rocks together really fast.
Ugg will reach 4GHz.
Ugg will teach everyone how to reach 4GHz.
How to use so many CPUs (Score:4, Insightful)
I had been working with a 100 PC cluster of P4 based systems to do H.264 HDTV compression in realtime. I spread the compression function across the cluster, using each system to work on a small part of the problem and flowing the data across the CPUs.
Based on this I wanted to build an array of processors on one chip, but I am not a silicon person, just software, drivers and some basic electronics. So I looked at various FPGA cores, ARM, MIPS, etc. Then I went to a talk given by Chuck Moore, author of the language FORTH. He had been building his own CPUs for many years using his own custom tools.
I worked with Chuck Moore for about a year in 2001/2002 on creating a massive multi-core processor based on Chuck's stack processor.
The idea was, instead of having 1, 2 or 4 large processors, to have 49 (7 * 7) small, light but fast processors in one chip. This would be for tackling a different set of problems than your classic CPUs. It wouldn't be for running an OS or word processing, but for multimedia, cryptography, and other mathematical problems.
The idea was to flow data across the array of processors.
Each processor would run at 6 GHz, with 64K words of RAM each.
Words and bus would be 21 bits wide (based on the F21 processor);
this allows for 4x 5-bit instructions per word on a stack processor that only has 32 instructions.
Since it's a stack processor, the cores run more efficiently. In 16K transistors (4,000 gates),
the F21 at 500 MHz performed about the same as a 500 MHz 486 on JPEG compression and decompression.
With the parallel core design, instead of a common bus or network between the processors there would only be 4 connections into and out of each processor. These would be 4 registers that are shared with its 4 neighboring processors, which are laid out in a grid. So each chip would have a north, south, east and west register.
Data would be processed in what's called a systolic array, where each core would pick up some data, perform operations on it and pass it along to the next core.
The chips with a 7x7 grid of processors would expose the 28(4x7) bus lines off the edge processors, so that these could be tiled into a much larger grid of processors.
Each chip could perform around 117 Billion instructions per second at 1 Watt of power.
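For readers who have not met a systolic array, here is a minimal simulation of one row of such a grid, plus a check of the figures quoted above using plain arithmetic. It is a sketch, not the chip's design: `systolic_row` is an invented name, each "core" is just a Python function, and only the west-to-east link is modelled.

```python
def systolic_row(operations, stream):
    """Simulate one row of cores (at least one operation is required).

    Each tick, every core applies its operation to the value in its input
    register and hands the result to its east neighbour. Values must not be
    None, because None marks an empty register.
    """
    n = len(operations)
    registers = [None] * n
    pending = list(stream)
    finished, ticks = [], 0
    while pending or any(value is not None for value in registers):
        next_registers = [None] * n
        for i, value in enumerate(registers):
            if value is None:
                continue
            result = operations[i](value)
            if i == n - 1:
                finished.append(result)        # leaves the east edge
            else:
                next_registers[i + 1] = result  # shared register to the east
        if pending:
            next_registers[0] = pending.pop(0)  # new data enters from the west
        registers = next_registers
        ticks += 1
    return finished, ticks


# Figures quoted in the comment, checked with plain arithmetic.
CORES_PER_CHIP = 7 * 7                    # 49 cores
EDGE_LINKS = 4 * 7                        # 28 bus lines off the edge cores
INSTRUCTION_BITS = 4 * 5                  # four 5-bit slots fit a 21-bit word
PER_CORE_RATE = 117e9 / CORES_PER_CHIP    # instructions per second per core
```

In this simulation a stream of m items passes through n cores in m + n ticks, so once the row is full one result comes out every tick. Dividing the quoted 117 billion instructions per second by 49 cores gives roughly 2.4 billion per core.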
Unfortunately I was unable to raise money, partly because I couldn't get any commitment from Chuck.
Below are some links and other misc information on this project. Sorry it's not better organized.
This was my project.
---------
http://www.enumera.com/chip/ [enumera.com]
http://www.enumera.com/doc/Enumeradraft061003.htm [enumera.com]
http://www.enumera.com/doc/analysis_of_Music_Copyright.html [enumera.com]
http://www.enumera.com/doc/emtalk.ppt [enumera.com]
--------
This was Jeff Fox's independent web site; he worked on the F21 with Chuck.
http://www.ultratechnology.com/ml0.htm [ultratechnology.com]
http://www.ultratechnology.com/f21.html#f21 [ultratechnology.com]
http://www.ultratechnology.com/store.htm#stamp [ultratechnology.com]
http://www.ultratechnology.com/cowboys.html#cm [ultratechnology.com]
------
http://www.colorforth.com/ [colorforth.com] 25x Multicomputer Chip
Chuck's site. 25x has been pulled down, but it's accessible on archive.org.
http://web.archive.org/web/*/www.colorfo [archive.org]
Re:Languages (Score:5, Informative)
I think the wailing we're about to hear is the sound of thousands of imperative-language programmers being dragged, kicking and screaming, into functional programming land. Even the functional languages not specifically designed for concurrency do it much more naturally than their imperative counterparts.
Parent
Re:Languages (Score:4, Interesting)
Every object (in the general sense, not necessarily the OO sense) may be either aliased or mutable, but not both.
Erlang does this by making sure no objects are mutable. This route favours the compiler writer (since it's easy) and not the programmer. I am a huge fan of the CSP model for large projects, but I'd rather keep something closer to the OO model in the local scope and use something like CSP in the global scope (which is exactly what I am doing with my current research).
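The "aliased or mutable, but not both" rule can be sketched in a few lines. This is an illustration only: Python cannot enforce the second half of the rule, so the ownership hand-off below is a convention, and `Reading`, `with_value` and `hand_off` are names invented for the example.

```python
import queue
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Reading:
    """Immutable, so it can be aliased by any number of threads safely."""
    sensor: str
    value: float


def with_value(reading, value):
    """'Updating' builds a new object; existing aliases still see the old one."""
    return replace(reading, value=value)


def hand_off(buffer, channel):
    """Mutable data has a single owner: send it and give up the reference.

    Returns None so the caller can write `buf = hand_off(buf, channel)` and
    no longer hold an alias. This is a convention, not something Python checks.
    """
    channel.put(buffer)
    return None
```

Immutable values cover the "aliased" half the way Erlang does; the hand-off covers the "mutable" half in the CSP style, where a buffer is owned by whichever side of the channel currently holds it.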
Parent
Re:Languages (Score:5, Informative)
What it *doesn't* do is make it easy to write verifiably immutable types, and code in a functional way where appropriate. As another respondent has mentioned, functional languages have great advantages when it comes to concurrency. However, I think the languages of the future will be a hybrid - making imperative-style code easy where that's appropriate, and functional-style code easy where that's appropriate.
C# 3 goes some of the way towards this, but leaves something to be desired when it comes to assistance with immutability. It also doesn't help that that
APIs are important too - the ParallelExtensions framework should help
I don't think C# 3 (or even 4) is going to be the last word in bringing understandable and reliable concurrency, but I think it points to a potential way forward.
The trouble is that concurrency is hard, unless you live in a completely side-effect free world. We can make it simpler to some extent by providing better primitives. We can encourage side-effect free programming in frameworks, and provide language smarts to help too. I'd be surprised if we ever manage to make it genuinely easy though.
Parent