Software Sun Microsystems

SW Weenies: Ready for CMT? 378

Posted by Hemos
from the step-on-up dept.
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
This discussion has been archived. No new comments can be posted.

  • Schism Growing (Score:2, Insightful)

    by SirCyn (694031) on Monday June 13, 2005 @09:25AM (#12801909) Journal
    I see a deep schism growing in the processor industry. There are two main camps: the parallel processors and the screaming single processors.

    The parallel processors are used for intense processing: research, servers, clusters, databases; anything that can be divided into many little jobs and run in parallel.

    The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps.
  • Niagara Myths (Score:5, Insightful)

    by turgid (580780) on Monday June 13, 2005 @09:27AM (#12801920) Journal
    I am totally not privy to clock-rate numbers, but I see that Paul Murphy is claiming over on ZDNet that it runs at 1.4GHz.
    Whatever the clock rate, multiply it by eight and it's pretty obvious that this puppy is going to be able to pump through a whole lot of instructions in aggregate.

    Ho hum.

    On a good day, with a following wind, Niagara might be able to do 8 integer instructions per clock cycle, provided it has 8 independent threads not blocking on I/O to execute.

    It only has one floating-point execution unit attached to one of those 8 cores, so if you have a thread that needs to do some FP, it has to make its way over to that core and then has to be scheduled to be executed, and then it can only do one floating-point instruction.

    Superb.

    The thing is, all of the other CPU vendors have super-scalar, out-of-order 2- and 4-core 64-bit processors running at two to three times the clock frequency.

    You do the mathematics.
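
The mathematics invited above can be sketched roughly as follows. This is a back-of-the-envelope peak estimate, not a benchmark. The 1.4GHz Niagara clock is the unconfirmed figure quoted in the comment, and the rival chip (2 cores, a sustained 2 instructions per clock, 3.0GHz) is purely an illustrative assumption, as is the function name `peak_gips`.

```python
def peak_gips(cores, ipc_per_core, clock_ghz):
    """Peak billions of instructions per second, assuming every core stays busy."""
    return cores * ipc_per_core * clock_ghz


# Niagara as described above: 8 cores, one integer instruction per clock each.
# The 1.4 GHz clock is the unconfirmed figure quoted in the comment.
niagara = peak_gips(cores=8, ipc_per_core=1, clock_ghz=1.4)

# Hypothetical superscalar rival: all three numbers are assumptions.
rival = peak_gips(cores=2, ipc_per_core=2, clock_ghz=3.0)

print(f"Niagara peak: {niagara:.1f} GIPS")
print(f"Rival peak:   {rival:.1f} GIPS")
```

Under these assumptions the two peaks land close together, so the comparison hinges on how often each design actually sustains its peak, which a formula like this cannot tell you.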

  • by kpp_kpp (695411) on Monday June 13, 2005 @09:27AM (#12801921)
    Some people have predicted this move for quite some time. I remember hearing about it back in the late '80s and early '90s, and I'm sure it goes way back before then. The analogy was to steam engines and why they lost out to diesels. You can only make a steam engine so big, and you cannot connect them together to get more power. With diesels you can hook many of them together for more power. Chips are finally getting to the same point: it is more cost-efficient to chain them together than to create a monstrous one. I'm surprised it has taken this long to get to this point.
  • EPIC? (Score:1, Insightful)

    by Anonymous Coward on Monday June 13, 2005 @09:27AM (#12801924)
    So does this mean that Intel's gamble with the Itanium was a good one? Or does this mean that we are going to try to teach students a totally new development style for more threads and parallelism?
  • Re:Schism Growing (Score:2, Insightful)

    by GoatMonkey2112 (875417) on Monday June 13, 2005 @09:30AM (#12801952)
    This will go away once there are games that take advantage of multiple processors. Eventually the game user will start to see the advantage of multiple processors. It's already starting to become clear when you look at the architectures of the next generation consoles.
  • by MemoryDragon (544441) on Monday June 13, 2005 @09:30AM (#12801954)
    given the fact that I haven't programmed a single-threaded program in years.
  • by turgid (580780) on Monday June 13, 2005 @09:40AM (#12802040) Journal
    The problem has been the cost of software development. It's almost always cheaper to throw more hardware at a problem than invest in cleverer code. Highly parallel designs require very clever code. The Pentium 4 debacle has finally shown that we're now at the stage where we're going to have to bite the bullet and develop that cleverer code. With ubiquitous high-level languages running on virtual machines (e.g. Java) this is becoming more feasible, since a lot of the gory details and dangers can be hidden from the average programmer.
  • Missing the point (Score:1, Insightful)

    by Anonymous Coward on Monday June 13, 2005 @09:43AM (#12802067)
    All of these recent articles about multi-cores and multiple pipelines of execution seem to miss the real value of this technology: the provisioning of multiple virtual machines in real time on the same system. While most software will never use the multi-thread, multi-CPU capabilities of even the quad-core AMD, products like VMware now allow you to dynamically provision systems on demand to deal with load. Another great use is server consolidation; instead of ten 1U racks to handle web farming, try a 16-way box that can provide a single point of reliability, management and execution for those services. This is about horizontal scaling in a vertical fashion.
  • by Aspasia13 (700702) on Monday June 13, 2005 @09:43AM (#12802068)
    I almost thought this was going to be about Star Wars nerds being forced to watch something on Country Music Television.

    Look out! It's Garth Vader!
  • First off, performance + java != good idea. Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.

    So the observations that there is too much locking in Java's standard api is informative but not on-topic. the fact that the standard solution is to use a completely new class [e.g. StringBuilder] is why I laughed at my college profs when they were trying to sell their Java courses by saying "and Java is well supported with over 9000 classes!".

    In the C and C++ world things get extended but also fixed at the same time. We can still use the strncat function which has been around for a while EVEN IN threaded environments...

    Also, he totally fails to point out that extra threads [e.g. register sets] only pay off when the pipeline is empty. So it's a catch-22. You either have a very efficient pipeline that you can cram full of a single thread's instructions or you have a shoddy one where your only hope is to mix in other threads.

    Think about it. If you only have one ALU and 32 threads that means each individual thread works at 1/32 the normal speed. Even if they're a lower/higher priority!

    That then gets into two camps. Are you threading because the performance of the pipeline sucks [e.g. dependencies in the P4] or because you want to interleave instructions [e.g. twice the clock rate but half the performance]? If it's the latter, then even if you turn off 31 of 32 threads you still end up with one weak ALU.

    Consider the AMD64 for instance. It usually gets an IPC that is pretty high [usually in the 1.5-2.5 range] which means that it's retiring instructions from a single thread at pretty much the entire capacity of the chip. Adding extra threads doesn't help.

    Consider then the P4. It usually gets an IPC of 0.5 to 1 [for ALU code, which is observable by the fact it's about as fast as a half-clockrate Pentium-M]. This means its two ALUs are not always busy and an additional thread could bump the IPC up to the 1-1.5 range.

    I know [for instance] that with HT turned on my 3.2Ghz Prescott compiles LibTomCrypt in close to the same time as my 2.2Ghz AMD64 [the P4 takes 5 seconds longer, without HT it takes about 15 seconds longer].

    So the only saving grace is an efficient ALU so that you can run single tasks at least somewhat efficiently. Then tacking on the extra threads doesn't help as an efficient ALU won't have many bubbles where other threads could live.

    So you end up with essentially a hardware register file but still 1/2 the performance. Remember that the goal of multi-processing is closer to 'n' times faster with n processors.

    The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...

    Whoopy...

    Multi-threading is NOT the future. Multi-cell is. Where you have dedicated special purpose [re: space optimized] side-cores that do things like "I can do MULACC/load/store REALLY REALLY QUICK!!!".

    In other words, "yet another press release on /.".

    Tom
  • by Frit Mock (708952) on Monday June 13, 2005 @09:52AM (#12802141)

    In games the AI of non-player-characters (-objects) can profit a lot from threading.

    But for common apps ... I don't expect a big gain from multiple threads. I guess typical apps like browsers, word processors and so on have a hard time utilizing more than 3-4 threads for the most common operations a user does.

  • by TheKidWho (705796) on Monday June 13, 2005 @09:52AM (#12802144)
    umm, better physics and AI for games is what I can think of off the top of my head =)
  • by spotvt01 (626574) on Monday June 13, 2005 @09:52AM (#12802145)
    It's all about the scalability in processor architecture. And unfortunately, your analogy about diesel engines only goes so far. You can only chain so many pistons together before you have to worry about how efficiently you can transfer the energy to the drive train. There is an upper bound on effectiveness. Concentrating on the number of pistons and ignoring each piston's capabilities will leave you with a lot of horsepower but little torque. The same problem exists in multiple-core designs, namely: only so many things can be done in parallel. This is because most programs are sequential in nature and benefit very little from executing their code in parallel. And eventually, you'll get down to something sequential like the bus or access to memory or paging to the hard disk (which is where the real bottleneck is anyway). About the only thing this will help with is if you're doing some sort of mathematical computing (using MPI or something like that, as was previously mentioned) or you're playing Doom 3 while you're rendering the special effects for Star Wars III. In which case you need to get out more ;)
  • Every time someone exposes concurrency at some layer as a way of improving performance, rather than because you're implementing a process that's inherently concurrent, it's a huge clusterfuck. Doesn't matter whether it's asynchronous I/O, out-of-order execution, multithreaded code, or whatever. Even when you're dealing with a concurrent environment like a graphical user interface the most successful approaches involve breaking the problem down into chunks small enough you can ignore concurrency.

    One of UNIX's most important features is the pipe-and-filter model, and one of the really great things about it is that it lets you build scripts that can automatically take advantage of coarse-grained concurrency. Even on a single-CPU system, a pipeline lets you stream computation and I/O where otherwise you'd be running in lockstep alternating I/O and code.

    That's where the big breakthroughs are needed: mechanisms to let you hide concurrency in a lower layer. Pipelines are great for coarse-grained parallelism, for example, but the kind of fine grain you need for Niagara demands a better design, or the parallelism needs to be shoved down to a deeper level. Intel's IA64 is kind of a lower level approach to the same thing where the compiler and CPU are supposed to find parallelism that the programmer doesn't explicitly specify, but it suffers from the typical Intel kitchen-sink approach to instruction set design.
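
The pipe-and-filter idea described above can be sketched in a few lines. This is a minimal illustration, assuming each filter is an ordinary one-argument function; the names `stage`, `run_pipeline` and `_DONE` are made up for this example. Each stage runs in its own thread and talks to its neighbours only through bounded queues, so none of the filter functions contains any locking. Note that in CPython the interpreter lock limits CPU-bound speedup from threads, so this shows the structure of the model rather than a performance result.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream (hypothetical helper)


def stage(func, inbox, outbox):
    """Start a thread that applies func to each item from inbox and forwards it."""
    def run():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the end-of-stream marker downstream
                return
            outbox.put(func(item))

    worker = threading.Thread(target=run)
    worker.start()
    return worker


def run_pipeline(items, funcs):
    """Stream items through funcs, one thread per filter, preserving order."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(funcs) + 1)]
    workers = [stage(f, queues[i], queues[i + 1]) for i, f in enumerate(funcs)]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    feeder.join()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    lines = ["  alpha ", " beta", "gamma  "]
    print(run_pipeline(lines, [str.strip, str.upper]))
```

The filters here (`str.strip`, `str.upper`) know nothing about threads; the concurrency lives entirely in the plumbing, which is the property the comment is praising.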
  • Re:Niagara Myths (Score:4, Insightful)

    by Shalda (560388) on Monday June 13, 2005 @09:58AM (#12802185) Homepage Journal
    Well, as you might expect, Sun has only a server mentality. The typical server runs few floating point instructions. In a lot of ways, Niagara would be very good at crunching through a database or serving up web pages. On the other hand, such a processor would be worthless on a desktop or a research cluster. I'd like to see actual real-world performance on these processors. I'd also like to see what Oracle charges them for a license. :)
  • dead end (Score:3, Insightful)

    by cahiha (873942) on Monday June 13, 2005 @10:02AM (#12802205)
    Threads are actually one of the simplest forms of parallelism to deal with, and we have had decades of experience with them. That's why Sun loves them: it fits in well with their big-iron philosophy and hardware and makes it easy for their customers to migrate to the next generation.

    But the future of high-end computing, both in business and in science, will not look like that. Networks of cheap computing nodes scale better and more cost-effectively. Many manufacturers have already gone over to that for their high-end designs. That's where the real software challenges are, but they are being addressed.

    Processors with lots of thread parallelism will probably be useful in some niche applications, but they will not become a staple of high-end computing.
  • by Anonymous Coward on Monday June 13, 2005 @10:27AM (#12802410)
    and other such languages will become more popular as this new multithreaded world takes hold because they embed the multithreaded concepts into the language without explicit programmer interaction. C, C++, Java style threading and mutex constructs are error-prone and awkward to use.
  • Re:Niagara Myths (Score:1, Insightful)

    by Anonymous Coward on Monday June 13, 2005 @10:47AM (#12802578)
    Sun has stated that the Niagara CMT chips are aimed at web servers and such that do not need a lot of FP. Follow-on chips, late next year I believe, will have the FP stuff.
  • Re:Shame (Score:4, Insightful)

    by Knetzar (698216) on Monday June 13, 2005 @11:03AM (#12802744)
    It sounds like you want a cell.
  • by putaro (235078) on Monday June 13, 2005 @11:05AM (#12802762) Journal
    and take some advanced architecture courses.

    The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...

    I'm sorry but that turns out not to be the case.

    When you have a system that is running lots of different threads simultaneously the amount of time that it takes to do a context switch from one thread to another becomes an issue. In the real world, threads often do things like I/O which cause them to block or they wait on a lock. If you can do a fast context switch you get back the time that you would have wasted saving registers off to RAM and pulling back another set. Faster thread switching means that your multi-thread single core now runs its total load (all of the threads) faster than a single core single thread design. Also, things like microkernels become a lot more feasible (microkernels are notorious for being slow because context switches are slow).

    When you have looked beyond your desktop machine maybe you'll have earned the right to sneer at your professors. I don't think you're there yet.
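
The latency-hiding argument above can be put into a toy model. This is a simplified sketch under loud assumptions: every thread alternates a fixed burst of compute cycles with a fixed stall (I/O or a lock), switching between hardware threads costs nothing, and stalls overlap perfectly with other threads' work. The function name and the cycle counts are illustrative only.

```python
def core_utilization(hw_threads, compute_cycles, stall_cycles):
    """Fraction of cycles a single core spends doing useful work.

    Assumes zero-cost switching between hardware threads and that each
    thread repeats: compute_cycles of work, then stall_cycles of waiting.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, hw_threads * per_thread)


# Hypothetical workload: 20 cycles of work followed by an 80-cycle stall.
for n in (1, 2, 4, 8):
    print(n, "threads ->", core_utilization(n, 20, 80))
```

With one thread the core idles whenever that thread stalls; adding hardware threads fills those gaps until the core saturates. For code that rarely stalls the model predicts almost no gain from extra threads, which is the case the parent comment had in mind.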
  • by Anonymous Coward on Monday June 13, 2005 @12:08PM (#12803331)
    Sparc playing catch-up? It's x86 that's playing catch-up to the proprietary RISC vendors. UltraSparc IV processors have had multiple cores, like the new AMD and Pentium chips, for the past year or two. POWER4 from IBM started shipping with two cores per chip when it came out several years ago. HP's PA-RISC has been dual-cored for a while. I think POWER4 has SMT, and I know POWER5 does. Even before HP and Compaq merged, the next Alpha chip, the EV8, was going to have some impressive SMT as well.

    The only way that x86 is ahead is clockspeed, due to aggressive production technology.

    How can a true Slashdot geek not be looking forward to this? It's something new and different. I'll never own one and possibly never work with one, but I'm curious to see exactly how such a design performs, because it's a lot different from a single 3.6GHz Pentium 4. Don't you want to at least see how it does before dismissing it? Unless you have stock in Sun or a bizarre emotional investment in processors, what's the harm in Sun spending their money on this product?
  • by TopSpin (753) * on Monday June 13, 2005 @04:34PM (#12806162) Journal
    I doubt very much that we'll see very many applications get a boost from dual/many core processors, and it's not just a matter of "re-writing legacy apps".

    I think this is a foolish thing to doubt. As supercomputing evolved into parallelism the same thing was said; it's too hard, some things can't be done in parallel. Yet solutions have been found for most cases and there is no lack of desire for more parallel capacity today.

    Put enough cores in front of a twenty-something Carmack wannabe and he'll figure out how to parallelize so many spinning triangles we'll all be breathlessly waiting to pay for even more cores. Put eight cores in the hands of a video encoding programmer and he'll refactor, tune and rethink the whole process until those cores stay 99% full and he invents an entirely new paradigm for the practice.

    Perhaps there is something deeper here; isn't the universe fundamentally parallel? So it isn't possible to parallelize the calculation of the next digit of pi; the universe has a way of ultimately requiring you to perform the damn calculation with thousands of pi simultaneously. Determinism gets lost somewhere and parallel computation becomes viable.
  • by Dark Fire (14267) <clasmc@NoSPaM.gmail.com> on Monday June 13, 2005 @05:59PM (#12807009)
    From the parent post:

    "Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code."

    The portion of importance is:

    "insufficiently descriptive"

    In C, C++, and Java, you must program with concurrency in mind to obtain any benefit from multiple threads of execution. In a functional programming language, the restrictions placed on the behavior of functions often imply concurrency without the programmer necessarily intending that as the result. If you write a C program without concurrency in mind and want to adapt your solution later to take advantage of multiple threads, you may need to code a completely different solution and also locate a compiler that knows how to take advantage of concurrency. In a functional language, you may only need to get an updated version of your compiler/interpreter. This is why C, C++, and Java are in the "insufficiently descriptive" category and functional programming languages are not.
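
The point about restrictions on functions implying concurrency can be illustrated even outside a functional language. The sketch below is in Python, not a functional language, and only demonstrates the property: because `checksum` is pure (its result depends only on its argument and it touches no shared state), the serial map can be swapped for a concurrent one without changing the answer. `checksum`, `serial_map` and `parallel_map` are hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(word):
    """Pure function: the result depends only on the argument, no shared state."""
    return sum(ord(ch) for ch in word) % 251


def serial_map(func, items):
    """Apply func to each item, one after another."""
    return [func(item) for item in items]


def parallel_map(func, items, workers=4):
    """Apply func to each item on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    words = ["niagara", "opteron", "prescott", "power5"]
    print(serial_map(checksum, words) == parallel_map(checksum, words))
```

If `checksum` updated a global counter instead, the two versions could diverge, and the programmer would be back to adding locks by hand. That is the sense in which a language that guarantees purity gives the compiler or runtime something to work with.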
  • by CTho9305 (264265) on Monday June 13, 2005 @08:39PM (#12808424) Homepage
    The reason that the simulation results coming from the original UWashington research on the subject - http://www.cs.washington.edu/research/smt/ [washington.edu] - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT - which is non-negligible - and is a thing that has been pointed out often on the Linux Kernel mailing list as well.

    I didn't read most of the Princeton paper... but you're arguing that caches need to be big to get any gains, and that Intel's HT chips show SMT doesn't offer anything. The Intel chips have ridiculously small L1 caches - only 8KB. A quick sampling of Washington papers shows they simulate machines with 64-128KB L1 caches, which are entirely reasonable - all AMD processors since the Athlon have had 64KB L1 caches. Both companies are increasing L2 sizes, and 1-2MB is not unreasonable either.

    I don't know anything about OS overhead, but section 2.3 of the Princeton paper argues SMP kernels (which SMT requires) are slower, and thus you pay for extra overhead when using SMT vs a non-multithreaded single processor. However, they themselves don't make the same claim for multiprocessors (because you have to pay the OS overhead anyway), and with the introduction of dual core processors at the consumer level, everybody will soon be using the SMP kernels anyway. This point is [rapidly becoming, if not already] moot.

    Their analysis in section 3.3 implied that the memory subsystem becomes the bottleneck in multiprocessor systems with SMT enabled, but before you take that and argue SMT offers nothing, I again point out problems with the Intel implementation: their memory bus is shared among all CPUs, so the per-CPU bandwidth drops with an increase in CPUs, and per-thread bandwidth is halved again. AMD's Opterons don't suffer from this same problem due to their NUMA configuration, so a 2-CPU 2-thread SMT with an Opteron-like memory system would get the same per-thread memory bandwidth as a 2-CPU non-SMT Xeon system, while supporting twice as many threads.
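
The per-thread bandwidth comparison above works out as follows. This is a rough sketch that assumes bandwidth divides evenly among threads and that one memory controller in the NUMA case delivers the same bandwidth as the whole shared bus; the 6.4 GB/s figure and the function name are illustrative assumptions, not measurements from either vendor.

```python
def per_thread_bandwidth(bus_gb_per_s, cpus, threads_per_cpu, shared_bus):
    """Memory bandwidth available to each hardware thread, split evenly.

    shared_bus=True:  one bus serves every CPU (the Xeon case described above).
    shared_bus=False: each CPU has its own memory link (the NUMA case).
    """
    total_bandwidth = bus_gb_per_s if shared_bus else bus_gb_per_s * cpus
    return total_bandwidth / (cpus * threads_per_cpu)


BUS = 6.4  # GB/s, hypothetical figure used for every link

xeon_no_smt = per_thread_bandwidth(BUS, cpus=2, threads_per_cpu=1, shared_bus=True)
xeon_smt = per_thread_bandwidth(BUS, cpus=2, threads_per_cpu=2, shared_bus=True)
numa_smt = per_thread_bandwidth(BUS, cpus=2, threads_per_cpu=2, shared_bus=False)

print(xeon_no_smt, xeon_smt, numa_smt)
```

Under these assumptions the hypothetical NUMA machine with SMT matches the shared-bus machine without SMT per thread, while running twice as many threads, which is the claim made in the comment.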
