
SW Weenies: Ready for CMT?

Posted by Hemos
from the step-on-up dept.
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware: 32 this year, more next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Monday June 13, 2005 @09:28AM (#12801936)
    from TFA:
    "Problem: Legacy Apps You'd be surprised how many cycles the world's Sun boxes spend running decades-old FORTRAN, COBOL, C, and C++ code in monster legacy apps that work just fine and aren't getting thrown away any time soon. There aren't enough people and time in the world to re-write these suckers, plus it took person-centuries in the first place to make them correct.

    Obviously it's not just Sun, I bet every kind of computer you can think of carries its share of this kind of good old code. I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff. So if we're going to get some real CMT juice out of these things, it's going to have to be done automatically down in the infrastructure. I'd think the legacy-language compiler teams have lots of opportunities for innovation in an area where you might not have expected it."
  • by dostert (761476) on Monday June 13, 2005 @09:34AM (#12801987)
    As a scientific programmer, all I know is that this will eventually be a huge benefit to all my MPI and OpenMP codes.

    I really only know the "scientific" programming languages, but almost all math-specific routines are already written for parallel machines. I'm a bit curious: what else really needs multiple threads? Isn't the benefit of dual-core procs the ability to avoid a slowdown when you run two or three apps at a time? Don't games like DOOM III and Half-Life II depend mostly on the GPU (and I'm guessing they could handle multi-core GPUs, since the programming should be fairly similar to SLI)? What is the benefit in games? Just faster level-loading times?

    I don't want to sound like I'm whining or anything here... I'm not saying that multiple cores suck. On the contrary they're fantastic for what I do, but I just was hoping you guys could help me understand how common apps and non-mathematical operations can use them.
  • by Toby The Economist (811138) on Monday June 13, 2005 @09:38AM (#12802026)
    32 threads in hardware on one chip is the same as 32 slow CPUs.

    Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code.

    Accordingly, multi-threading is currently handled by the programmer, which by and large doesn't happen, because programmers are not used to it.

    A lot of applications these days are weakly multi-threaded - Windows apps for example often have one thread for the GUI, another for their main processing work.

    This is *weak* multi-threading, because the main work still occurs within a single thread. Strong multi-threading is when the main work is somehow partitioned so that it is processed by several threads. This is difficult, because a lot of tasks are inherently serial: stage A must complete before stage B, which must complete before stage C.

    The main technique I'm aware of for making good use of multi-threading support is that of worker-thread farms. A main thread receives requests for work and farms them out to worker threads. This approach is useful only for a certain subset of problem types, however, and within the processing of *each* worker thread, the work done itself remains essentially serial.

    In other words, clock speeds have hit the wall, transistor counts are still rising, the only way to improve performance is to have more CPUs/threads, but programming models don't yet know how to actually *use* multiple CPU/threads.

    El problemo!

    --
    Toby
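
    The worker-thread farm pattern described above can be sketched with Python's standard library; the function name and the per-item "work" (squaring a number) are purely illustrative:

    ```python
    import queue
    import threading

    def farm(requests, num_workers=4):
        """Farm work items out to a pool of worker threads and collect results."""
        work_q = queue.Queue()
        results = []
        lock = threading.Lock()

        def worker():
            while True:
                item = work_q.get()
                if item is None:  # sentinel: no more work for this worker
                    break
                result = item * item  # per-item work itself is still serial
                with lock:
                    results.append(result)

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for r in requests:
            work_q.put(r)
        for _ in threads:
            work_q.put(None)  # one sentinel per worker
        for t in threads:
            t.join()
        return results
    ```

    Note that, as the comment says, this only helps when the requests are independent of each other; results arrive in whatever order the workers finish.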
  • Re:Schism Growing (Score:5, Interesting)

    by philipgar (595691) <pcg2&lehigh,edu> on Monday June 13, 2005 @09:42AM (#12802060) Homepage

    Actually, from what I've heard, the entire industry is moving in this direction. The whole idea of out-of-order processors (OOP) has become outdated. OOP was great: it enabled massive single-threaded performance. However, the costs (in terms of area and heat dissipation) are enormous.

    I just came back from the DaMoN [cmu.edu] workshop, where the keynote was delivered by one of the lead P4 developers. He explained the future of microprocessors and said that the 10-15% extra performance that OOP enables just isn't worth it. The Pentium 4 has 3 issue units, but the way things are, it rarely issues more than 1 instruction per cycle.

    We can squeeze more performance out of them, but not much. The easiest method is to go dual core. However, if an application must be multithreaded to get the best performance, what would you rather have: 2 highly advanced cores, or 8-10 simple cores that each issue half as many instructions per cycle as the dual-core design? Then consider the fact that each core runs 4 threads (switching on a cache miss/access). It doesn't take a rocket scientist to see that overall performance is improved with the latter.

    The other option is the hybrid core: a single really fast x86 core combined with multiple simpler x86 cores. That way single-threaded apps can run fast (until they're converted), and you can get overall throughput from the system without blowing your power budget on OOP optimizations.

    Granted, most of this is in the future (within the next 5 years), but IBM's going that way (a la Cell), it's within Intel's roadmap, Sun is pushing that route, etc. I assume AMD has plans to create a supercomputer on a chip... unless they wish to be obsoleted.

    Phil

  • I would be far more interested in taking advantage of all the idle CPU cycles sitting around at businesses.

    Condor [wisc.edu].
  • Re:Schism Growing (Score:4, Interesting)

    by timford (828049) on Monday June 13, 2005 @09:52AM (#12802138)
    You're right that the latest generation console CPU architectures reflect the trend of concurrent thread execution. That said, however, there seems to be a parallel trend developing that involves separating the general purpose CPU into independent single-purpose processors.

    The most obvious example of this is the GPU, which has been around for a long time. The latest moves toward this trend rumored to be in development are PPUs, Physics Processing Units. How long until game AI evolves enough that we have the need for AIPUs also?

    This approach obviously doesn't make too much sense in a general purpose computer because the space of possible applications and types of code to be run are just too large. It makes perfect sense in computers that are built especially to run games though, because we have a very good idea of the different kinds of code most games will have to run. This approach allows each type of code to be run on a processor that is most efficient at that type of code, e.g. graphics code being run on processors that provide a ton of parallel pipelines.
  • Shame (Score:4, Interesting)

    by gr8_phk (621180) on Monday June 13, 2005 @09:53AM (#12802146)
    That's really a shame about the FP performance. My hobby project is ray tracing, and my code is just waiting to be run on parallel hardware. The preferred system would have multiple cores sharing cache, but separate caches would be fine too. Memory is not the bottleneck, so higher GHz and more cores/threads will be very welcome, so long as they each have good performance. The code scales well with multiple CPUs, as pixels can be rendered in parallel with zero effort: the code was designed for that. As it sits, I'm hoping my Shuttle (SN95G5v2) will support an AMD64 X2 shortly. We're still not up for RT Quake, but interactive (read: very jerky 1-2 fps) high-poly scenes are possible today.
  • by James McP (3700) on Monday June 13, 2005 @09:57AM (#12802170)
    The simplest example is OS runs on one, the game another. But it's really not that simple. Let's take a typical Windows box since it's the bulk of the market.

    Thread 1: OS kernel
    Thread 2: firewall
    Thread 3: GUI
    Thread 4: print server
    Thread 5-7: various services (update, power, etc)
    Thread 8: antivirus
    Thread 9: antivirus manager/keep-alive
    Thread 10-16: spyware (I said a typical Windows box)
    Thread 17+: applications

    Yeah, CMT will be handy out of the box as long as the OS is aware of it. I expect it will be wasteful for the first couple of iterations, but I can't count the number of times I've had to disable antivirus and yank the ethernet while running computationally intense applications.
  • The bottlenecks (Score:4, Interesting)

    by davecb (6526) * <davec-b@rogers.com> on Monday June 13, 2005 @09:58AM (#12802184) Homepage Journal
    CMT is a good approach for dealing with the speed mismatch between CPUs and memory, our current Big Problem.

    I'll misquote Fred Weigel and suggest that the next problem is branching: Samba code seems to generate 5 instructions between branches, so suspending the process and running something else until the branch target is in I-cache seems like A Good Thing (;-)).

    Methinks Samba would really enjoy a CMT processor.

    --dave

  • by Apreche (239272) on Monday June 13, 2005 @10:06AM (#12802234) Homepage Journal
    Easy. These days there are some assembly instructions that can be executed simultaneously. With a chip like this, however, all bets would be off. Instead of just a meager few instructions that could execute simultaneously, you would be able to execute any number of instructions simultaneously.

    So if you have a function that, say, does 10 additions and 10 moves, you would first figure out which of them need to be done before or after each other, then see which ones don't matter, then write the function to do as many at once as possible.

    It really doesn't matter for anyone other than the compiler writers. Those guys will write the compiler to do this kind of assembly level optimization for you. The trick is writing a high level language, or modifying an existing one, so the compiler can tell which things must be executed in order and which can be executed side by side.
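
    The dependency analysis this comment describes can be sketched as follows; this is a toy illustration (the function name and wave grouping are my own framing, not anything from the article) that groups operations into "waves" such that everything within a wave is mutually independent and could execute simultaneously:

    ```python
    def schedule(ops, deps):
        """Group ops into waves of mutually independent operations.

        deps maps an op to the set of ops that must complete before it.
        Ops in the same wave have no ordering constraints between them.
        """
        done, waves = set(), []
        remaining = set(ops)
        while remaining:
            # an op is ready once everything it depends on is done
            wave = {op for op in remaining if deps.get(op, set()) <= done}
            if not wave:
                raise ValueError("cyclic dependency")
            waves.append(sorted(wave))
            done |= wave
            remaining -= wave
        return waves
    ```

    For the 10-additions-and-10-moves example, any instructions with no dependency path between them land in the same wave and could be issued side by side.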
  • by flaming-opus (8186) on Monday June 13, 2005 @10:17AM (#12802323)
    You are absolutely incorrect.
    multi-threaded programming is the predominant programming model on servers. Some tasks, such as web serving, mail serving, and to some degree data-base machines scale almost linearly with the number of processors. All of the first tier, and some of the second tier server manufacturers have been selling 32+-way SMP boxes for years. They work pretty damn well.

    Sun is not trying to create a chip to supplant Pentiums in desktops. They are not going for the best Doom3 performance. They want to handle SQL transactions and IMAP requests, and most likely are targeting this at JSP in a big way.

    As a user of a slightly aged Sun SMP box, I'd rather have those many slow CPUs and the accompanying I/O capability than a pair of cores that can spin like crazy waiting for memory.
  • Can use, not needs! (Score:2, Interesting)

    by try_anything (880404) on Monday June 13, 2005 @10:23AM (#12802379)
    If single-threaded performance improvements slow down, and the available computing power is spread out among multiple cores, anyone persisting in writing single-threaded code will fall behind in performance.

    Remember the old days when people used fancy tricks to implement naturally concurrent solutions as single-threaded programs? The future is going to be just the opposite. Any day now we'll see a rush toward languages with special support for quick, clear, safe parallelism, just like we've seen scripting languages catch on for web programming.
  • by Dark Fire (14267) <(moc.liamg) (ta) (cmsalc)> on Monday June 13, 2005 @10:36AM (#12802491)
    "Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code."

    I agree.

    However, I believe functional programming languages have the best chance of successfully taking advantage of multiple threads of execution. Google has 100,000+ computers doing this now using functional programming ideas.

    As pointed out in other posts, not every problem will benefit from parallelism. With research and time, this might change. Many problems can be represented with both procedural constructs and recursive constructs. The procedural has been considered the most comprehensible and implementable for the past three decades. This may have to change in light of the direction hardware technology is going.
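
    The functional-programming idea alluded to above is popularly associated with the map/reduce style of distributing work. A toy sketch (function names and the word-count task are illustrative, not Google's actual code): each chunk can be processed independently in the map phase, and the partial results are combined in the reduce phase.

    ```python
    from collections import Counter
    from functools import reduce

    def map_phase(chunks):
        """Count words per chunk; each chunk could run on a separate
        thread or machine, since the chunks don't interact."""
        return [Counter(chunk.split()) for chunk in chunks]

    def reduce_phase(partials):
        """Merge the independent partial counts into one result."""
        return reduce(lambda a, b: a + b, partials, Counter())
    ```

    Because the map step has no shared mutable state, the parallelism falls out of the problem's structure rather than out of explicit lock management.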
  • by Anonymous Coward on Monday June 13, 2005 @10:42AM (#12802530)
    I have never worked on an embedded product that was not implemented as a collection of threads. Setting priorities properly and dealing with issues of priority inversion and deadlock have been part and parcel for embedded systems for decades. A multi-thread core that allowed you to lock critical threads to a slice of the core would be a hoot if the silicon was affordable.
  • by The Mad Duke (222354) on Monday June 13, 2005 @10:42AM (#12802532)
    IBM started SHIPPING Power5 with SMT capability August 31 of last year; IBM has SMT running on 1.9 GHz processors today. Sun is getting farther and farther behind.
  • Re:EPIC? (Score:3, Interesting)

    by HidingMyName (669183) on Monday June 13, 2005 @10:48AM (#12802588)
    That is hard to say. EPIC is a very long instruction word (VLIW) architecture that supports up to 3 concurrent non-interfering instructions per bundle and requires static (compile-time) scheduling, since the instructions must be contiguous in memory. Getting efficient scheduling is hard, because the complexity is pushed back onto the compiler, which may need to do some serious code reordering. Additionally, EPIC was designed to support speculative execution, which has efficiency issues when the wrong prediction is made. EPIC was also a new instruction set/core, so Intel may not have gotten as much reuse of existing designs as multithreaded (register-bank-switching) or multi-core designs might have been able to exploit. Modern fabrication and design are so complex that widely used designs get development resources, and new interesting directions often don't get fabricated.
  • by alispguru (72689) <(bane) (at) (gst.com)> on Monday June 13, 2005 @11:16AM (#12802849) Journal
    Those of you who are up on the current state of the art here, please help me out. I was under the impression that multiple threads and automatic storage management were still not on good terms with each other, and that this was a big unsolved problem.
  • Re:Schism Growing (Score:5, Interesting)

    by swillden (191260) * <shawn-ds@willden.org> on Monday June 13, 2005 @11:29AM (#12802949) Homepage Journal

    Exploiting parallelism is a hard issue for many problems. For instance, most of my time I'm compiling C++ code. Usually I just need to compile one file (the one I changed and want to test), and this is not a parallel process.

    You'll still benefit from parallelism in two ways. First, a modern computer is rarely doing just one thing. The OS has some threads managing I/O and performing housekeeping operations, and you're probably also listening to some music, and you probably have some other apps running that occasionally need a little computation. So none of that stuff will impede your compile.

    Second, even a compiler can benefit from multiple threads, though current compilers don't do it. There are multiple stages in compilation, like pre-processing, lexical analysis, syntax analysis, semantic analysis, intermediate code generation, optimization and code generation. The stages don't need to wait until the previous stage has completed its work on the entire file, so the stages can be parallelized to a large extent. It might even make sense to have multiple threads working on different chunks of the code for more computation-intensive stages, like optimization (which becomes even more important without out-of-order execution).

    It seems to me that linking could also be done in parallel with compilation, to some degree. To a very large degree, if you can guarantee that you don't have any symbols that override library symbols (else a use of a symbol could be linked against a library definition of that symbol before the compiler got around to noticing that you'd supplied your own definition).

    Perhaps the biggest problem with parallelizing compilation and linking to that degree will be I/O. On second thought, probably not, because modern machines have huge amounts of RAM for caching disk files.

    In an 8+ core machine, it may make sense to dedicate a core to memory management, also. Even with manual memory management (malloc/free), allocating and releasing memory consumes significant CPU cycles, so I could see value in offloading that to another thread. A "free" operation, from a compute thread's point of view, would be nothing more than notifying the memory manager thread that this block is now available for re-use. The memory manager thread would then take care of all of the bookkeeping needed. The manager could also arrange to have a list of blocks of commonly-needed sizes ready for instant allocation, and could even spend some CPU cycles on analyzing the allocation patterns of the compute threads to try to ensure that blocks are always available when needed. Obviously, pushing that idea further leads naturally to full-blown garbage collection, with fewer concerns about GC pauses.

    Although it's true that not all computations can be sped up by multi-threading, lots of them can, including lots that we're used to thinking of as inherently serial processes.
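
    The staged-compilation idea above amounts to a classic thread pipeline: each stage runs in its own thread and starts on an item as soon as the previous stage emits it. A minimal sketch with Python's standard library (the stage functions here are trivial stand-ins for real compiler passes):

    ```python
    import queue
    import threading

    def pipeline(items, stages):
        """Run each stage in its own thread, connected by queues.

        A stage begins work on an item as soon as the previous stage
        hands it over, so all stages can be busy at once.
        """
        qs = [queue.Queue() for _ in range(len(stages) + 1)]
        SENTINEL = object()  # marks end of the stream

        def run(stage, q_in, q_out):
            while True:
                item = q_in.get()
                if item is SENTINEL:
                    q_out.put(SENTINEL)  # propagate shutdown downstream
                    break
                q_out.put(stage(item))

        threads = [threading.Thread(target=run, args=(s, qs[i], qs[i + 1]))
                   for i, s in enumerate(stages)]
        for t in threads:
            t.start()
        for item in items:
            qs[0].put(item)
        qs[0].put(SENTINEL)

        out = []
        while True:
            item = qs[-1].get()
            if item is SENTINEL:
                break
            out.append(item)
        for t in threads:
            t.join()
        return out
    ```

    Because each stage has exactly one input queue, item order is preserved end to end, which matters if the stages correspond to passes over the same translation unit.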

  • Re:Schism Growing (Score:5, Interesting)

    by philipgar (595691) <pcg2&lehigh,edu> on Monday June 13, 2005 @01:09PM (#12803880) Homepage
    This is true. On a 500MHz machine OOP makes a huge difference. However, when we move to a 4GHz machine that requires 400 cycles to access main memory, 25 cycles to access L2 cache, and 4 cycles to access L1 cache, the difference between OOP and in-order starts to fall away. Even the best code on the best processors of today isn't getting a huge speedup from OOP. Also, just because the processor is in-order doesn't mean memory/fp/int instructions can't run in parallel, depending on how it's designed (though they must be retired in order). The primary factor, however, is the memory hierarchy. If an application spends half its time waiting on main memory or cache, even infinitely fast processing can at most double overall performance (Amdahl's law).

    Phil
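
    Amdahl's law, referenced above, can be written as speedup = 1 / ((1 - p) + p/s), where p is the fraction of runtime that benefits and s is the factor by which that fraction improves. With half the runtime stuck on memory (p = 0.5 improvable), even an arbitrarily faster core yields at most a 2x overall speedup. A one-line sketch:

    ```python
    def amdahl_speedup(p, s):
        """Overall speedup when a fraction p of runtime is sped up by factor s."""
        return 1.0 / ((1.0 - p) + p / s)
    ```

    For example, doubling the speed of the compute half (p = 0.5, s = 2) gives only about a 1.33x overall speedup, which is why the memory hierarchy dominates.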
