Forgot your password?
Software Sun Microsystems

SW Weenies: Ready for CMT? 378

Posted by Hemos
from the step-on-up dept.
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
This discussion has been archived. No new comments can be posted.

SW Weenies: Ready for CMT?

Comments Filter:
  • Argh! (Score:3, Informative)

    by turgid (580780) on Monday June 13, 2005 @09:29AM (#12801942) Journal
    Today I have diarhea in the guts as well as the mind. I should have previewed that before I posted it.

    On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second, I meant per clock cycle, of course, not per second.

    The thing is, all of the other CPU vendors with have

    I meant "will have" not "with have".

    /me LARTS himself with a big stick.

  • Don't worry (Score:2, Informative)

    by StupidKatz (467476) on Monday June 13, 2005 @09:42AM (#12802062)
    You can have your parallel processors and still play DOOM III at insane fps. At worst, it will just take a bit for folks to start writing programs to take advantage of the additional processors/cores.

    BTW, your "average" user hasn't even played DOOM I, let alone DOOM III. Surfing the web and using e-mail doesn't usually put a lot of strain on a PC.
  • OLTP systems (Score:3, Informative)

    by bunyip (17018) on Monday June 13, 2005 @09:43AM (#12802070)
    Now of course, the room was full of Sun infrastructure weenies, so if there's something terribly obvious in records management or airline reservations or payroll processing that doesn't parallelize, we might not know about it.

    Well, since I work in airline reservations systems, I'll add my $0.02 worth...

    Most OLTP systems will benefit from CMT and multi-core processors. We had a test server from AMD about a month before the dual-core Opteron was announced, we did some initial testing and then put it in the production cluster and fired it up. No code changes, no recompile, no drama.

    IMHO, the single-user applications, such as games and word processors, will be harder to parallelize.

  • by timford (828049) on Monday June 13, 2005 @10:28AM (#12802418)
    You high-and-mighty scientific code snobs looking down on us game programmers! =)

    Actually there is a whole lot to games like DoomIII and HL2 than what can be run on the GPU. First of all, a lot of the graphics-related code is never run on the GPU, it's run on the CPU (for example shadow-processing code), which then passes on the info to the GPU to do the actual rendering.

    Secondly multiple core GPUs doesn't make that much sense to me. The nature of graphics processing is completely SIMD (like much of your scientific code probably). Graphics needs parallelism, but it doesn't need different code being run in parallel. It needs so much parallelism because there are millions of vertices and pixel fragments, each of which needs to be handled very much the same way (that is, with the same shader code). The main reason SLI exists is that there is a limit to how much parallelism we can put on one chip because of transistor limits and all that mumbo jumbo. When there comes a point that we could put multiple cores on one GPU... then we might as well have one core with twice the number of pipelines.

    Finally, games like D3 and HL2 do a lot more than just graphics and level-loading. Physics are getting more and more realistic and therefore computationally intensive (HL2 has particularly good physics). Also I think we're on the brink of game AI becoming much more advanced than the simple state machines present in current games. Then there are more eccentric tasks like UnrealEngine3's "Seamless World Support" which constantly shuffles in and out resources so you can create huge worlds without loading times.

  • by arkanes (521690) <[moc.liamg] [ta] [senakra]> on Monday June 13, 2005 @10:28AM (#12802423) Homepage
    You cannot hide the gory details and also thread for (pure) performance, at least not to any signifigant degree, and not with our current ability to analyze programs. Some current compilers/languages can squeeze out some parallelism via analysis, but to prevent bugs they must be conservative, so you rarely get signifigant performance boosts. The key to parallelizing performance is minimizing information sharing, and thats a design/archiectural issue that can't really be addressed automatically. It's not simply a matter of higher level languages or cleverer code - the inherent complexities and dangers of multi-threaded programming are quite large, to the point where it's almost impossible to prove the correctness of any signifigantly multithreaded application while still gaining a performance boost.

    Note that I am talking about pure performance gain here, not percieved performance, such as keeping a GUI responsive during long actions - that kind of MT is generally slower than the single threaded alternative, and is fairly easy to keep correct.

    Gaining performance via multithreading requires you to seperate out multiple calculations, with minimal dependencies between them. The number of applications that can benefit from this is much smaller than you might think. I doubt very much that we'll see very many applications get a boost from dual/many core processers, and it's not just a matter of "re-writing legacy apps". What we will see is over all system speed increases on multi-threaded OSes.

  • by babble123 (863258) on Monday June 13, 2005 @10:50AM (#12802609)
    Can I use OpenMP []? I

    void AccumulateLoopCount(int N) {
    int accumulator = 0;
    #pragma openmp parallel for reduction(+:accumulator)
    for (int i = 1; i < N; ++i) {
    accumulator += i;
    return accumulator;
    (I'm not actually an OpenMP programmer, so this syntax might be wrong...)
  • by putaro (235078) on Monday June 13, 2005 @10:51AM (#12802624) Journal
    From the article:

    The standard APIs that came with the first few versions of Java were thread safe; some might say fanatically, obsessively, thread-safe. Stories abound of I/O calls that plunge down through six layers of stack, with each layer posting a mutex on the way; and venerable standbys like StringBuffer and Vector are mutexed-to-the-max. That means if your app is running on next year's hot chip with a couple of dozen threads, if you've got a routine that's doing a lot of string-appending or vector-loading, only one thread is gonna be in there at a time.

    Classes such as StringBuffer and Vector are locked (synchronized) on a per-object basis. As long as you aren't trying to access the same object from different threads you won't block. And if you are trying to access the same object from different threads you will be happy that they were thread-safe!

    The performance problems of having these classes being obsessive about thread safety do not result from the locking forcing singlethreadedness. The performance problem stem from the cost of locking objects.
  • by InvalidError (771317) on Monday June 13, 2005 @10:51AM (#12802633)
    Hardware threading has been mainstream for more than two years in the form of HyperThreading.

    Simultaneous Multi-Threading is a CPU's ability to concurrently execute mixed instructions from multiple threads. Intel's HT simply 2-ways SMT.

    Chip Multi-Threading is a CPU's ability to hold execution states for multiple threads, executing instructions from only one of them at a time unless the chip is also SMT.

    In Sun's case, the mid-term plan is to eventually offer 8-ways SMT with 32-ways CMT: the CPU can hold states for up to 32 threads and have in-flight instructions from as many as eight of them.
  • by furry_wookie (8361) on Monday June 13, 2005 @11:44AM (#12803106)
    >You can only make a Steam engine so big but you cannot connect them together to get more power.

    Who told you that? That's just plain wrong, and was obviously made up by someone who knows little railroad history.

    It's was not uncommon at all to connect 2 or more steam engines together. In fact Union Pacific used to do it for nearly all of their freight trains, especailly the express produce delivery trains.

    There are also rail lines in Africa, China and Russia where coal is plentyful and old is not, and which **still use steam engines** and often use multiple steam engines(heads).

    Search google for "double headed steam" or "triple headed steam" for examples.

  • by Anonymous Coward on Monday June 13, 2005 @12:04PM (#12803290)
    Sure they are. Just don't use Boehm style GC which usually requires a "stop the world" to perform GC. See my project atomic-ptr-plus [] for various forms of SMP/CMT friendly GC. I'm currently sporadically working on a RCU-SMR [] hybrid that obsoletes everything there. It would be less sporadic but I don't have as good funding as Sun, Intel, IBM et all have.
  • by convolvatron (176505) on Monday June 13, 2005 @12:20PM (#12803450)
    the technology and architecture were beautiful. the execution and business planning were poor. because it was such a huge and underfunded effort to get the whole thing (os, compiler, processor, network) brought up from scratch, they lagged current technology at both of the two introductions. stability was a problem.

    still, a terrible shame. a testament to the failings of the short-term investment model.

    the compiler did automatic parallelization, but only really well for HPC-style loop nests. if you weren't running parallel code, you really suffered, because the individual thread execution rates were so poor, and they ran uncached (one of the nice things about the model is that they used concurrency to hide memory latency, but if you didn't have it to exploit...)
  • by johnhennessy (94737) on Monday June 13, 2005 @12:39PM (#12803612)
    Could be wrong here, but I thought that the main reason for implementing a CMT chip with "hardware threads" was to make the context switch less painful.

    On single processor systems, when it wants to switch between two threads, it usually executes a context switch - it needs to dump one set of registers to memory, load the other set from memory and change the instruction pointer.

    That usually adds up to two seperate memory accesses to different parts of memory. What's more, is that it is not always possible to accurately predict (by the processor) which two sets of memory addresses will be involved - it all depends on what new thread has been choosen by the scheduler.

    This wouldn't have been a huge problem on Intels architecture - given the relatively small number of registers it gives to applications.

    In Sparc architectures - you have a lot more registers (and wider 64 bit registers) to worry about.

    What Sun would like to do is remove this overhead by implementing a set of registers for each of processing unit. This makes them independant in their own right.

    At the end of the day, the bottle neck in most systems is going to be the RAM-CPU bus, if this can reduce the number of hits that bus takes, then overall system performance will improve - by what margin is usually up to the system architects (i.e. why pick 8 cores instead of 12, why pick 3Mbit of cache instead of 2MBits, etc, etc)
  • by flithm (756019) on Monday June 13, 2005 @12:44PM (#12803659) Homepage
    Actually that's not necessarily true. It's definitely true right now though. Most developers haven't really been tought to think in terms of parallelism when designing software, but that's starting to change.

    It's all about the algorithms. Once multi-core chips have been mainstream for a while, all the algorithms out there will start to get converted to take advantage of parallel processing. And there are already algorithms out there that do this... this page [] has a small repository of parallel implementations of common algorithms including QuickSort, hashing techniques (for super fast searching), string operations (which every application in existence uses), and more.

    Now I know this isn't always possible, but in many cases it is. Almost every program out there uses search and sort algorithms. Your address book does it, your web browser does it. These algorithms can be implemented to take advantage of having multiple processors.

    A lot of operations can actually be modified to take advantage of this stuff. See the pbzip2 [] project that achieves a near linear speed up per processor!

    Almost every algorithm out there can be modified to take advantage of muliple cores. Things like video/audio decoding are prime candidates (a lot of research is currently happening in this area).

    It may take a generation of programmers and then another generation or two of applications to start really taking advantage of parallelism, but mark my works: once this stuff is mainstream, you'll start to really see some performance like never before.
  • by spockvariant (881611) on Monday June 13, 2005 @12:54PM (#12803750)
    I'm a researcher working on high performance computing and have used various configurations of Simultaneous Multithreading (aka Hyperthreading aka CMT) (Intel Xeon, IBM POWER5). The result is always the same - at the end, memory latencies and OS overheads kill most of the gains of instruction level parallelism coming from SMT. Look at it this way - the typical latencies of operations on most modern processors are of the order of 1 nanosecond, whereas DRAM latencies are of the order of 200ns. As long as you can't do anything about this latency, there's no point in cutting down on processing times. There's a very nice paper in this year's ACM SIGMETRICS that gives real experimental data to illustrate this fact - [] The paper shows that the speedups obtained using SMT in practice are meagre. The reason that the simulation results coming from the original UWashington research on the subject - [] - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT - which is non-negligeable - and is a thing that has been pointed out often on the Linux Kernel mailing list as well.
  • by cosinezero (833532) on Monday June 13, 2005 @01:33PM (#12804132)
    It's very simple, actually, I do it quite frequently. Let's say you need to populate a drop-down based on user input in another drop-down.

    At the start of the page you collect user input and fire off the data access code for the original drop and the parameterized drop, each in a seperate thread.

    This executes while you're performing other formatting actions, like include headers, menu formatting, and outputting strings to your response (like client scripts, etc).

    All the while, the other threads are formatting the first & second dropdown with the returned data, while your main thread is doing more menial UI tasks like formatting the tables and such that hold your page.

    This is simply a basic example, but anyone who uses data access code even for a single databound table or drop should always be running it in a seperate thread and letting the main thread handle the non-data related rendering. There is a TON of work on web pages that you can be doing that is not data related, nor is it required to be peformed before or after the data is available.
  • by be-fan (61476) on Monday June 13, 2005 @02:02PM (#12804433)
    Multithreading is dominant because it's the only way to wring parallelism out of legacy languages like C. And nobody claims multithreading is easy, natural, or anything but error-prone. The future is really in languages that have formal abstractions for concurrency, so programmers can specify at a high level what tasks can be concurrent and let the compiler do the low-level locking. Basically, you want languages based on a concurrent calculus of computation (eg: Pi-calculus), instead of languages based on lambda calculus, which lacks an formal notion of concurrency.
  • by be-fan (61476) on Monday June 13, 2005 @05:01PM (#12806456)
    Eh? Locks suck for debugging, nobody in their right mind likes debugging multithreaded code. If anything, this will make it easier to debug parallel code. Once you have a formal model for expressing concurrency, it makes it much easier to reason about the code and figure out where something went wrong. Further, since the compiler can understand a formal concurrency model in a way it cannot understand an ad-hoc concurrency model, the compiler can offer tools to aid in debugging concurrent applications.

    Now, if you're talking about debugging things when the compiler breaks, then yes, it will make debugging harder. On the other hand, the compiler already does some pretty heroic transformations during optimization, and things seem to work pretty smoothly. Certainly, the possibility of hard-to-debug errors caused by a broken optimizer doesn't seem to have prevented people from putting -O2 in their build scripts.
  • by be-fan (61476) on Monday June 13, 2005 @05:44PM (#12806891)
    The slowest possible way to find a problem is to "reason about the code"!

    It's the only Right Way (TM) to find a problems. Now, I don't recommend reasoning about the code to find typos, but then again, you shouldn't be making typos anyway.

    In my experience, problems with multi-threaded code almost never come from a lack of understanding of how to write multi-threaded code.

    Most people would disagree. Almost invariably, problems with multithreaded code are the fault of the programmer. Race conditions, deadlocks, etc, are all the result of the programmer not locking something he should, locking something he shouldn't, or locking things in the wrong order. You can throw scalability problems on top of there too, the programmer only locked one thing when he should have locked several. Just look through the changelogs of a highly-multithreaded system like the Linux/BSD kernels. Notice how often races and deadlocks get fixed relative to typos.

    is only a good thing, but you don't need a new language for that, just appropriate attention to the debugger.

    You do need a new language for that. Current languages have no formal model for concurrency, so anything the compiler offers you is an ad-hoc solution. Clever debuggers can offer hacks, but nothing complete and reliable. The only way for the compiler to understand concurrency systematically, like the programmer does, is to specify it systematically. To specify it systematically, you need a formal model of concurrency, and a language that allows you to specify concurrency in your application according to that model.

    Let me use an analogy. Current languages treat concurrency the way assembler treats functions. You can do functions in assembler (even very complex ones with closures and everything), but it's all ad-hoc. The assembler doesn't really know anything about functions. It can't tell you if you put your arguments in an incorrect register or if you don't adjust the stack-pointer correctly. Now, an asm debugger can use hacks to try to divine the functions in an asm program, it can read the stack pointer and the base pointer and try to figure out what the functions are, but it'll never be as good as a C debugger that systematically understands what a function is and how it is used. Concurrent calculi do for concurrency what lambda calculus does for functions. They offers a formal way of understanding and specifying concurrency in a program.
  • Re:Schism Growing (Score:3, Informative)

    by CTho9305 (264265) on Monday June 13, 2005 @08:48PM (#12808496) Homepage
    However when we move to a 4GHz machine that requires 400 cycles to access main memory, 25 cycles to access L2 cache and 4 cycles to access L1 cache, the difference between OOP and in-order starts to fall away.
    Actually, a major point of OO execution is to hide small delays - with the window sizes of modern processors, you can easily hide L1 latencies and possibly hide the latencies of L2, and only pay severely for accesses to main memory. An in-order core is the one that really loses performance as L1 and L2 take a few extra cycles.

    Also just because the processor is in order doesn't mean a memory/fp/int instruction can't all be run in parallel depending on how its designed (however they must be retired in order).
    They're retired in order in out-of-order processors... they just execute out of order. In order to run n instructions in parallel in an in-order processor, you need n consecutive instructions in the program that all have their dependencies met at the start of the cycle. I'd doubt that happens often. Plus, you said yourself L1 is 4 cycles away... that means at every memory access you can't issue any new instructions for 4 cycles. Read some assembly code - memory instructions make up a HUGE portion of the instructions. In order execution is a big sacrifice, and you need to be able to find a lot more parallelism to make up for its loss.

"Stupidity, like virtue, is its own reward" -- William E. Davidsen