SW Weenies: Ready for CMT? 378
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
Argh! (Score:3, Informative)
On a good day, with a following wind, Niagara might be able to do 8 integer instructions per clock cycle.
The thing is, all of the other CPU vendors will have...
Don't worry (Score:2, Informative)
BTW, your "average" user hasn't even played DOOM I, let alone DOOM III. Surfing the web and using e-mail doesn't usually put a lot of strain on a PC.
OLTP systems (Score:3, Informative)
Well, since I work in airline reservations systems, I'll add my $0.02 worth...
Most OLTP systems will benefit from CMT and multi-core processors. We had a test server from AMD about a month before the dual-core Opteron was announced; we did some initial testing and then put it in the production cluster and fired it up. No code changes, no recompile, no drama.
IMHO, the single-user applications, such as games and word processors, will be harder to parallelize.
Alan.
Re:Question: What needs multiple threads? (Score:2, Informative)
Actually there is a whole lot more to games like Doom III and HL2 than what can be run on the GPU. First of all, a lot of the graphics-related code is never run on the GPU; it's run on the CPU (for example shadow-processing code), which then passes the info on to the GPU to do the actual rendering.
Secondly, multi-core GPUs don't make that much sense to me. The nature of graphics processing is completely SIMD (like much of your scientific code, probably). Graphics needs parallelism, but it doesn't need different code being run in parallel. It needs so much parallelism because there are millions of vertices and pixel fragments, each of which needs to be handled in very much the same way (that is, with the same shader code). The main reason SLI exists is that there is a limit to how much parallelism we can put on one chip because of transistor limits and all that mumbo jumbo. When there comes a point where we could put multiple cores on one GPU... then we might as well have one core with twice the number of pipelines.
Finally, games like D3 and HL2 do a lot more than just graphics and level-loading. Physics are getting more and more realistic and therefore computationally intensive (HL2 has particularly good physics). Also I think we're on the brink of game AI becoming much more advanced than the simple state machines present in current games. Then there are more eccentric tasks like UnrealEngine3's "Seamless World Support" which constantly shuffles in and out resources so you can create huge worlds without loading times.
Re:Steam Engine - Diesel (Score:4, Informative)
Note that I am talking about pure performance gain here, not perceived performance, such as keeping a GUI responsive during long actions - that kind of MT is generally slower than the single-threaded alternative, and is fairly easy to keep correct.
Gaining performance via multithreading requires you to separate out multiple calculations, with minimal dependencies between them. The number of applications that can benefit from this is much smaller than you might think. I doubt very much that we'll see very many applications get a boost from dual/many-core processors, and it's not just a matter of "re-writing legacy apps". What we will see is overall system speed increases on multi-threaded OSes.
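To make the point concrete, here's a minimal Java sketch of the only kind of code that really does gain: two calculations over disjoint data, with a single dependency at the end. All names here are illustrative, not from any post above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Sum an array by giving each thread an independent half.
    // No locks are needed because the halves don't overlap.
    public static long sum(long[] data) throws Exception {
        int mid = data.length / 2;
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Long> lo = pool.submit(() -> partial(data, 0, mid));
        Future<Long> hi = pool.submit(() -> partial(data, mid, data.length));
        long total = lo.get() + hi.get(); // the only synchronization point
        pool.shutdown();
        return total;
    }

    private static long partial(long[] data, int from, int to) {
        long s = 0;
        for (int i = from; i < to; i++) s += data[i];
        return s;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // 500500
    }
}
```

Note how rare this shape is in ordinary applications: most code has a chain of dependencies, and you can't split a chain.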
Re:well at least he seems to understand the proble (Score:2, Informative)
The author doesn't understand Java class locking (Score:3, Informative)
The standard APIs that came with the first few versions of Java were thread safe; some might say fanatically, obsessively, thread-safe. Stories abound of I/O calls that plunge down through six layers of stack, with each layer posting a mutex on the way; and venerable standbys like StringBuffer and Vector are mutexed-to-the-max. That means if your app is running on next year's hot chip with a couple of dozen threads, if you've got a routine that's doing a lot of string-appending or vector-loading, only one thread is gonna be in there at a time.
Classes such as StringBuffer and Vector are locked (synchronized) on a per-object basis. As long as you aren't trying to access the same object from different threads you won't block. And if you are trying to access the same object from different threads you will be happy that they were thread-safe!
The performance problems of having these classes be obsessive about thread safety do not result from the locking forcing single-threadedness. The performance problems stem from the cost of locking objects.
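A quick sketch of the per-object locking being described: StringBuffer.append is synchronized on the receiver, so concurrent appends to the *same* object serialize (and stay correct), while appends to different objects never contend. The class and method names are mine, for illustration only.

```java
public class PerObjectLocking {
    public static int appendFromTwoThreads(int perThread) throws InterruptedException {
        StringBuffer shared = new StringBuffer();
        Runnable task = () -> {
            // Each call takes the monitor on 'shared', not a global lock.
            for (int i = 0; i < perThread; i++) shared.append('x');
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start();
        a.join(); b.join();
        // Because append is synchronized, no characters are lost.
        return shared.length();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(appendFromTwoThreads(10000)); // 20000
    }
}
```

If each thread used its own StringBuffer instead, the monitors would never be contended at all; the cost that remains is the uncontended lock acquire/release on every call.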
Re:how much for the best of both worlds? (Score:5, Informative)
Simultaneous Multi-Threading is a CPU's ability to concurrently execute mixed instructions from multiple threads. Intel's HT is simply 2-way SMT.
Chip Multi-Threading is a CPU's ability to hold execution states for multiple threads, executing instructions from only one of them at a time unless the chip is also SMT.
In Sun's case, the mid-term plan is to eventually offer 8-way SMT with 32-way CMT: the CPU can hold states for up to 32 threads and have in-flight instructions from as many as eight of them.
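From software's point of view, each of those hardware thread contexts just shows up as another logical CPU, which is how a program would size its thread pools. A tiny Java sketch (method name is mine):

```java
public class HardwareThreads {
    // Each hardware thread context is presented to the OS, and hence to
    // the JVM, as one logical processor. On a 32-way CMT part this
    // would report 32.
    public static int logicalCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println(logicalCpus());
    }
}
```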
Re:Steam Engine - Diesel (Score:1, Informative)
Who told you that? That's just plain wrong, and was obviously made up by someone who knows little railroad history.
It was not uncommon at all to connect 2 or more steam engines together. In fact Union Pacific used to do it for nearly all of their freight trains, especially the express produce delivery trains.
There are also rail lines in Africa, China and Russia where coal is plentiful and oil is not, and which **still use steam engines** and often use multiple steam engines (heads).
Search google for "double headed steam" or "triple headed steam" for examples.
Re:I didn't see garbage collection in his list (Score:1, Informative)
Re:Anybody else remember Tera? (Score:2, Informative)
still, a terrible shame. a testament to the failings of the short-term investment model.
the compiler did automatic parallelization, but only really well for HPC-style loop nests. if you weren't running parallel code, you really suffered, because the individual thread execution rates were so poor, and they ran uncached (one of the nice things about the model is that they used concurrency to hide memory latency, but if you didn't have it to exploit...)
Re:Programming isn't up to it (Score:3, Informative)
On single processor systems, when it wants to switch between two threads, it usually executes a context switch - it needs to dump one set of registers to memory, load the other set from memory and change the instruction pointer.
That usually adds up to two separate memory accesses to different parts of memory. What's more, it is not always possible for the processor to accurately predict which two sets of memory addresses will be involved - it all depends on which new thread has been chosen by the scheduler.
This wouldn't be a huge problem on Intel's architecture, given the relatively small number of registers it gives to applications.
On SPARC architectures you have a lot more registers (and wider, 64-bit registers) to worry about.
What Sun would like to do is remove this overhead by implementing a set of registers for each processing unit. This makes them independent in their own right.
At the end of the day, the bottleneck in most systems is going to be the RAM-CPU bus; if this can reduce the number of hits that bus takes, then overall system performance will improve - by what margin is usually up to the system architects (i.e. why pick 8 cores instead of 12, why pick 3Mbit of cache instead of 2Mbit, etc.).
Re:Question: What needs multiple threads? (Score:3, Informative)
It's all about the algorithms. Once multi-core chips have been mainstream for a while, all the algorithms out there will start to get converted to take advantage of parallel processing. And there are already algorithms out there that do this... this page [cmu.edu] has a small repository of parallel implementations of common algorithms including QuickSort, hashing techniques (for super fast searching), string operations (which every application in existence uses), and more.
Now I know this isn't always possible, but in many cases it is. Almost every program out there uses search and sort algorithms. Your address book does it, your web browser does it. These algorithms can be implemented to take advantage of having multiple processors.
A lot of operations can actually be modified to take advantage of this stuff. See the pbzip2 [compression.ca] project, which achieves a near-linear speedup per processor!
Almost every algorithm out there can be modified to take advantage of multiple cores. Things like video/audio decoding are prime candidates (a lot of research is currently happening in this area).
It may take a generation of programmers and then another generation or two of applications to start really taking advantage of parallelism, but mark my words: once this stuff is mainstream, you'll start to really see some performance like never before.
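As a minimal sketch of the "convert a common algorithm" idea: sort the two halves of an array on separate threads, then merge sequentially. This is my own toy version, not code from the page linked above.

```java
import java.util.Arrays;

public class ParallelSort {
    // Two-thread merge sort: independent sorts, one dependent merge.
    public static int[] sort(int[] data) throws InterruptedException {
        int mid = data.length / 2;
        int[] lo = Arrays.copyOfRange(data, 0, mid);
        int[] hi = Arrays.copyOfRange(data, mid, data.length);
        Thread a = new Thread(() -> Arrays.sort(lo));
        Thread b = new Thread(() -> Arrays.sort(hi));
        a.start(); b.start();
        a.join(); b.join();
        // Sequential merge of the two sorted halves.
        int[] out = new int[data.length];
        int i = 0, j = 0, k = 0;
        while (i < lo.length && j < hi.length)
            out[k++] = lo[i] <= hi[j] ? lo[i++] : hi[j++];
        while (i < lo.length) out[k++] = lo[i++];
        while (j < hi.length) out[k++] = hi[j++];
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(sort(new int[]{5, 3, 8, 1, 9, 2})));
        // [1, 2, 3, 5, 8, 9]
    }
}
```

The sorts scale with core count; the merge stays serial, which is exactly the Amdahl's-law limit the skeptics in this thread are pointing at.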
Why the future of SMT is bleak (Score:5, Informative)
Re:PHP and multi-threading (Score:2, Informative)
At the start of the page you collect user input and fire off the data access code for the original drop and the parameterized drop, each in a separate thread.
This executes while you're performing other formatting actions, like including headers, formatting menus, and outputting strings to your response (like client scripts, etc).
All the while, the other threads are formatting the first & second dropdown with the returned data, while your main thread is doing more menial UI tasks like formatting the tables and such that hold your page.
This is simply a basic example, but anyone who uses data access code even for a single databound table or drop should always be running it in a separate thread and letting the main thread handle the non-data-related rendering. There is a TON of work on web pages that you can be doing that is not data related, nor is it required to be performed before or after the data is available.
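The pattern above can be sketched in a few lines of Java: kick off the slow data access on a worker thread, render the non-data parts of the page on the main thread, and only block when the data is actually needed. The fetch and render methods here are made-up stand-ins, not any real framework's API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OverlappedPage {
    // Stand-in for a database round-trip for the dropdown's contents.
    static String fetchDropdownData() {
        try { Thread.sleep(50); } catch (InterruptedException e) { }
        return "<option>first</option>";
    }

    // Stand-in for the non-data parts of the page.
    static String renderStaticChrome() {
        return "<html><head>...</head><body>";
    }

    public static String renderPage() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> data = pool.submit(OverlappedPage::fetchDropdownData); // fetch starts now
        String page = renderStaticChrome();            // runs while the fetch is in flight
        page += "<select>" + data.get() + "</select>"; // block only here, when data is needed
        pool.shutdown();
        return page;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(renderPage());
    }
}
```

The win is the overlap: the chrome rendering hides some or all of the data-access latency.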
Re:Programming isn't up to it (Score:3, Informative)
Re:Programming isn't up to it (Score:3, Informative)
Now, if you're talking about debugging things when the compiler breaks, then yes, it will make debugging harder. On the other hand, the compiler already does some pretty heroic transformations during optimization, and things seem to work pretty smoothly. Certainly, the possibility of hard-to-debug errors caused by a broken optimizer doesn't seem to have prevented people from putting -O2 in their build scripts.
Re:Programming isn't up to it (Score:3, Informative)
It's the only Right Way (TM) to find problems. Now, I don't recommend reasoning about the code to find typos, but then again, you shouldn't be making typos anyway.
In my experience, problems with multi-threaded code almost never come from a lack of understanding of how to write multi-threaded code.
Most people would disagree. Almost invariably, problems with multithreaded code are the fault of the programmer. Race conditions, deadlocks, etc. are all the result of the programmer not locking something he should, locking something he shouldn't, or locking things in the wrong order. You can throw scalability problems on top of that too: the programmer only locked one thing when he should have locked several. Just look through the changelogs of a highly multithreaded system like the Linux/BSD kernels. Notice how often races and deadlocks get fixed relative to typos.
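The "locking things in the wrong order" bug has a standard fix worth showing: always acquire locks in one global order, so two opposing operations can never each hold the lock the other needs. A minimal Java sketch, with illustrative names (ordering here is by a made-up account id):

```java
public class LockOrdering {
    static class Account {
        final int id;
        long balance;
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Deadlock-free transfer: both directions lock the lower-id account
    // first. Without this ordering, transfer(a,b) racing transfer(b,a)
    // can deadlock, each thread holding one lock and waiting on the other.
    static void transfer(Account from, Account to, long amount) {
        Account first  = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Account a = new Account(1, 100), b = new Account(2, 100);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(b, a, 1); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(a.balance + " " + b.balance); // 100 100
    }
}
```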
is only a good thing, but you don't need a new language for that, just appropriate attention to the debugger.
You do need a new language for that. Current languages have no formal model for concurrency, so anything the compiler offers you is an ad-hoc solution. Clever debuggers can offer hacks, but nothing complete and reliable. The only way for the compiler to understand concurrency systematically, like the programmer does, is to specify it systematically. To specify it systematically, you need a formal model of concurrency, and a language that allows you to specify concurrency in your application according to that model.
Let me use an analogy. Current languages treat concurrency the way assembler treats functions. You can do functions in assembler (even very complex ones with closures and everything), but it's all ad-hoc. The assembler doesn't really know anything about functions. It can't tell you if you put your arguments in an incorrect register or if you don't adjust the stack pointer correctly. Now, an asm debugger can use hacks to try to divine the functions in an asm program, it can read the stack pointer and the base pointer and try to figure out what the functions are, but it'll never be as good as a C debugger that systematically understands what a function is and how it is used. Concurrent calculi do for concurrency what the lambda calculus does for functions. They offer a formal way of understanding and specifying concurrency in a program.
Re:Schism Growing (Score:3, Informative)
Actually, a major point of OoO (out-of-order) execution is to hide small delays - with the window sizes of modern processors, you can easily hide L1 latencies and possibly hide the latencies of L2, and only pay severely for accesses to main memory. An in-order core is the one that really loses performance as L1 and L2 take a few extra cycles.
Also, just because the processor is in-order doesn't mean a memory/fp/int instruction can't all be run in parallel, depending on how it's designed (however they must be retired in order).
They're retired in order in out-of-order processors... they just execute out of order. To run n instructions in parallel on an in-order processor, you need n consecutive instructions in the program that all have their dependencies met at the start of the cycle. I doubt that happens often. Plus, you said yourself L1 is 4 cycles away... that means at every memory access you can't issue any new instructions for 4 cycles. Read some assembly code - memory instructions make up a HUGE portion of the instructions. In-order execution is a big sacrifice, and you need to be able to find a lot more parallelism to make up for its loss.