Faster Chips Are Leaving Programmers in Their Dust
mlimber writes "The New York Times is running a story about multicore computing and the efforts of Microsoft et al. to try to switch to the new paradigm: "The challenges [of parallel programming] have not dented the enthusiasm for the potential of the new parallel chips at Microsoft, where executives are betting that the arrival of manycore chips — processors with more than eight cores, possible as soon as 2010 — will transform the world of personal computing.... Engineers and computer scientists acknowledge that despite advances in recent decades, the computer industry is still lagging in its ability to write parallel programs." It mirrors what C++ guru and now Microsoft architect Herb Sutter has been saying in articles such as his "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Sutter is part of the C++ standards committee that is working hard to make multithreading standard in C++."
The basic problem (Score:5, Insightful)
So far, multiple cores have boosted performance mostly because the typical user has multiple applications running at a time. But as the number of cores increases, the beneficial effects diminish dramatically.
In addition, most applications these days are not CPU bound. Having eight cores doesn't help you much when three are waiting on socket calls, four are waiting on disk access calls and the last is waiting for the graphics card.
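One common way to put numbers on these diminishing returns is Amdahl's law. The comment does not name it, so treat the formula and the 50% parallel fraction below as illustrative assumptions rather than measurements.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: half of the run time can be parallelized.
    for cores in (1, 2, 4, 8, 64):
        print(cores, round(amdahl_speedup(0.5, cores), 2))
```

With half of the work serial, the bound stays below 2x no matter how many cores are added.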
Re:2005 Called (Score:5, Insightful)
Re:OS/2? (Score:4, Insightful)
Re:concurrency - the developer's responsibility? (Score:3, Insightful)
Huhhh?
My guess is that you never wrote any code.
Linux doesn't do any more heavy lifting for you than Windows does, and I doubt that OS X does either.
So what are you talking about?
An OS will never figure out which part of your program needs to be in which thread. A compiler MAY do it at some point, but compilers are only just now doing a good job with vectorization.
Re:2005 Called (Score:1, Insightful)
Re:2005 Called (Score:3, Insightful)
And so it goes...... (Score:5, Insightful)
Translation:
Code will get even more inefficient / bloated and require faster hardware to do the same thing you are doing now. While I'm all for better / faster computer hardware, most if not all Jane and Joe Sixpack users never need Super Computer power to surf the net, read e-mail and watch videos.
Re:Oh, wow (Score:5, Insightful)
Actually, according to the latest Dr. Dobb's, Herb is the *chair* of the ISO C++ Standards committee. (He had an article on using lock hierarchies to avoid deadlock.)
He's really going to know what he's talking about, then.
As chair of the committee, I'd say there's a pretty fair chance that he *does*.
I really love people who bash things just because Microsoft is involved. Contrary to what seems to be a popular belief here, they have some incredibly intelligent people who are very good at what they do there.
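For anyone who hasn't seen the lock hierarchy idea mentioned above, here is a minimal sketch in Python. It is not the code from the article; the `HierarchicalLock` class and the level numbers are made up for illustration. The rule is that a thread holding a lock at some level may only take locks at strictly lower levels, so two threads can never wait on each other in a cycle.

```python
import threading


class HierarchicalLock:
    """A lock with a numeric level; a thread may only acquire downward."""

    _held = threading.local()  # per-thread stack of levels currently held

    def __init__(self, level: int) -> None:
        self.level = level
        self._lock = threading.Lock()

    def __enter__(self) -> "HierarchicalLock":
        stack = getattr(self._held, "levels", None)
        if stack is None:
            stack = self._held.levels = []
        # Refuse any acquisition that would go up or sideways in the hierarchy.
        if stack and self.level >= stack[-1]:
            raise RuntimeError(
                f"lock order violation: level {self.level} requested "
                f"while holding level {stack[-1]}"
            )
        self._lock.acquire()
        stack.append(self.level)
        return self

    def __exit__(self, *exc_info) -> None:
        self._held.levels.pop()
        self._lock.release()


accounts = HierarchicalLock(level=20)   # hypothetical resources
audit_log = HierarchicalLock(level=10)

with accounts:
    with audit_log:   # 10 < 20, so this order is allowed
        pass
```

Taking the two locks in the opposite order raises immediately instead of deadlocking some day in production, which is the whole point of the technique.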
Re:Personal computing? (Score:3, Insightful)
AOL 10.0 will say "You got mail!"
It's the OS Stupid !!! (Score:2, Insightful)
Just because the latest and greatest release of a New OS by a certain vendor is dog slow doesn't mean it's time to start blaming Programmers and calling them LAME.
There are several good Operating Systems out there that handle multiple threads on multi core machines just fine. They even do this in the basic scripting languages native to those Operating Systems, and many have been doing so since the 1970s.
There are techniques out there that work just fine in Parallel Program/Core Environments. On a side note, Data Encapsulated Object Oriented techniques are not always the best way to handle performance issues. A look back in time has several answers to this question and more. (Lest We Forget)
--- Old engineers never die, they just build away. (By deweycheetham) ---
There's not much hope for the C++ committee (Score:4, Insightful)
I have little hope for the C++ standards committee. It's dominated by people who think really l33t templates are really cool. Everything has to be a template feature. They're fooling around with a proposal for declaring variables atomic through something like atomic<int> n; This allows really l33t programmers to write really l33t code using really l33t lockless programming. But without the proofs of correctness needed to make that actually work reliably.
It's also long been Stroustrup's position that concurrency is a library problem. As long as the OS provides threads and locking, it's not a language problem. This isn't good enough.
The fundamental problem is that, as currently defined, a C++ compiler has no idea which variables are shared between threads, and which are never shared. The compiler has no notion of critical sections. Fixing this requires some fundamental changes to the language. It's known what to do; Modula, Ada, and Java all have synchronization and isolation built into the language. But there's nothing like that in C++, and the designers of C++ don't want to admit their mistakes.
It's not just a C++ problem. Python has a similar issue. Python as a language doesn't deal with concurrency adequately. The main implementation, CPython, has a "global interpreter lock" that slows the thing down to single-CPU speed.
Re:And so it goes...... (Score:1, Insightful)
Re:Personal computing? (Score:3, Insightful)
There are plenty of tasks that people do routinely on computers that are not "instantaneously" fast (spreadsheets, photo-editing, etc.). Furthermore there are many aspects of modern user interfaces that would be better if they were faster (generating thumbnail previews, sorting entries, rescanning music collections, searching, etc.). Also, it's important to realize that the commonplace desktop elements of tomorrow may not have been imagined today. Many things that we don't even consider (and certainly don't consider as "necessary") may become possible (and thus "necessary") with greater computer power (complex graphs/images/previews that update in realtime as a user slides a control, instantaneous re-encoding of video when you drag-and-drop to an external device, etc.).
My only point is that it is tempting to say that computers are "fast enough" and yet in my own computer-use (and watching the computer use of others) there are definitely times when the user must wait for the computer to finish a task (whether it is a split-second page render or a many-seconds refresh of a spreadsheet or a many-minute generation of a complex image). Until all of these tasks are "instantaneous" (shorter than human reaction time), then there is definite room for improvement in computer speed; and moreover improvements that the end-user will appreciate and come to rely on.
You'll notice that of the examples I've mentioned, many of them could in principle be parallelized (and thus benefit from multi-core systems).
Imagine that your core overheats while idling (Score:1, Insightful)
Re:Wow, this is a great idea! (Score:3, Insightful)
Thank God for that.
I'm glad that coders today can use high-level tools and languages without having to spend half their time on performance tweaking.
Take as an example a game like Halo (or Guitar Hero, or World of Warcraft, or whatever your favorite modern game is). If the developers of these titles had to exercise the same amount of care in optimization as developers did on the Atari 2600 -- where often, the author had to unroll simple countdown loops because they could not afford the overhead of DEC and BEQ instructions -- yes, the game kernel would probably run twice as fast. But on the other hand, each game would take a decade to complete!
I'd happily trade some (but not all) efficiency in program execution for an increase in efficiency in program authoring. And that's exactly what we've done.
Fine grained vs Coarse grained parallelism (Score:2, Insightful)
Fine grained (spread your for loops across processors) and coarse grained parallelism (different independent actors exchanging messages and working on tasks separately) are two completely different approaches, though they generally use the same mechanisms. Everybody always focuses on the fine grained and how that affects algorithms, but I personally believe that personal computing yields more benefit from coarse grained parallelism, where nothing in your program blocks because every task that it's performing is independent. Having modal, sequential operations that you have to wait for your computer to perform before you get control back for an unrelated task in the same program is absolutely absurd in this day and age.
The few instances where a personal application does spend significant time in a single task (media manipulation, mostly) could use fine grained parallelism, but that is not the common case. Stop whining about algorithm parallelism and get your system/application design broken out into independent components and tasks properly.
Besides, as others have said, neither is particularly difficult to do properly. It's when you try to hack in threaded shared access without having properly contained the mutable data that you shoot yourself in the foot.
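A minimal sketch of the coarse grained style described above, in Python. The worker name and the "thumbnail" job are hypothetical stand-ins. The point is that the worker owns no shared mutable state and talks to the rest of the program only through queues.

```python
import queue
import threading


def thumbnail_worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Independent task: owns no shared mutable state, only talks via queues."""
    while True:
        message = inbox.get()
        if message is None:              # sentinel: shut down
            break
        name, size = message
        outbox.put(f"{name}@{size}px")   # stand-in for real work


inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()
worker = threading.Thread(target=thumbnail_worker, args=(inbox, outbox))
worker.start()

# The "UI" side posts requests and stays free to do other things.
inbox.put(("photo.jpg", 128))
inbox.put(("scan.png", 64))
inbox.put(None)
worker.join()

results = [outbox.get() for _ in range(2)]
print(results)
```

Because the only shared objects are the two queues, there is no hand-rolled locking anywhere in the application code.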
Re:Personal computing? (Score:3, Insightful)
Video, audio, gaming, emulators, and VMs are starters. But I think you're missing some of the picture. Most computer users have one or two programs open at a time and end up quitting everything when they want to run something processor intensive like a game or Photoshop. With the move towards multi-core and with a little work from developers, people might be able to leave 90% of the apps they use running, all the time. Multiple cores also provide something of a buffer. When a thread goes rogue, the machine does not grind to a halt. Heck, just yesterday my girlfriend was complaining because she tried to open a page in Firefox and it locked up the whole application including the other 8 tabs she had open. That means she had to kill it (which took a while itself) and then try to decide if she wanted to reopen all those tabs and risk it locking up again, or just try to remember what she had open and reopen them all by hand. If each tab, however, is running in its own thread and there are enough cores to handle it, this could easily have been a much better experience for her. She could have just closed the unresponsive tab.
Basically, I'd argue that if you provide the resources, smart developers will find a way to make clever use of those resources. Dual core has already sparked a revolution for virtualization and led to some other, really cool OS changes to increase speed. Many cores will provide diminishing returns (we have 2 eyes for a reason), but I bet 8 cores will be well utilized within a few years.
Er... What drugs are you taking? (Score:2, Insightful)
What uttermost and complete crap.
We are nowhere near multi-core programming being a no-brainer.
Here's what we know right now:
1. We know how to manually create threads to perform specialized tasks. This comes nowhere near the ideal, which is loading all the CPUs roughly the same while taking into account CPU affinity for some tasks in order to keep the caches warm and work well on NUMA architectures.
2. We know how to exploit data parallelism in those cases where we have large quantities of data.
Other than that we are still trying to find any paradigm that would make arbitrary systems scale well on a massive number of cores. Some of them are based on pi calculus, some on join calculus, some on more practical foundations.
At this point some things are obvious:
1. CPU threads are useless except as part of the foundation on which other abstractions are built. All really scalable systems use either lightweight threads/processes or smaller tasks which are scheduled in user space.
2. Native stacks are evil.
3. Thread affinity, as implemented by Windows USER and GDI modules and STAs is evil. Don't know how this works under Linux as I never did any GUI work there but I assume many components have similar limitations.
4. Any solution that exposes locks to the user instead of hiding them in the infrastructure is evil. Locks are not composable and are very error-prone in real-world scenarios.
Dejan
Re:2005 Called (Score:4, Insightful)
Re:Threads Are Not the Answer (Score:3, Insightful)
The problem with OS threads is that the things that benefit the most from parallel processing are the finest grained, but OS threads are only usable for the coarsest grained problems. So OS threads are generally only useful for concurrency and not for parallel execution. That is, OS threads can let you do two mostly different 'tasks' at the same time (repainting the GUI while the data is being processed), but are really bad at actually making a single task run faster.
You can, sometimes, with incredible effort make OS threads run one task faster. But that doesn't change the fact that they are a really really bad solution for this.
Re:Threads Are Not the Answer (Score:1, Insightful)
You can say that threads are over-used by programmers who don't understand the reasons why you'd use a separate process instead, but I don't think you can say that threads don't have areas in programming where they're almost essential.
Re:2005 Called (Score:4, Insightful)
On modern systems, threads are themselves first-class constructs, and it runs somewhat like this:
A process has things like memory-tables for virtual memory, handles for objects, files, socket connections, etc. A process always contains at least one thread (this isn't always true while a process is being set up or torn down, but it's true when most anyone's code is running).
A thread generally has a stack (in the host process's virtual address space, so every thread in that process can read it), some thread-local storage to make life easier for some APIs (you don't need to care about this in most cases), and lives in a process. This means that threads can use virtual addresses for memory interchangeably with other threads in the same process.
Additionally, some operating systems support fibers. A fiber is like a thread except that it has to be explicitly or cooperatively (not quite the same thing) multi-tasked. Fibers use even less memory than threads, and you really don't have to care about them.
When you're in, say, Visual Studio, there's a "threads" window for all of the threads of the process that you are debugging. You can end up stepping through code on one thread while other threads are running.
The modern hardware designs lead to interesting performance side-effects from cache location and memory location. It's not quite as hard as systems that have asymmetric access to resources (e.g. Playstation 2), but it makes for fun work.
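To make the cooperative scheduling of fibers described above concrete, here is a small sketch using Python generators as stand-ins. These are not OS fibers, and `fiber` and `run_round_robin` are names invented for this example. Each task runs until it explicitly yields, and a tiny user-space scheduler decides who goes next.

```python
from collections import deque


def fiber(name, steps, log):
    """A cooperative task: runs until it explicitly yields control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler


def run_round_robin(fibers):
    """Minimal user-space scheduler: resume each fiber in turn until all finish."""
    ready = deque(fibers)
    while ready:
        current = ready.popleft()
        try:
            next(current)
            ready.append(current)   # not finished: back of the queue
        except StopIteration:
            pass                    # finished: drop it


log = []
run_round_robin([fiber("a", 2, log), fiber("b", 3, log)])
print(log)
```

Nothing is preempted here: a task that never yields would starve every other task, which is the trade-off of cooperative multitasking.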
Re:Diaspora (Score:4, Insightful)
Making code easy to read and maintain is critical to maximizing the efficiency of the programmer. The efficiency of the code is generally a secondary issue, and is only a factor if the code in question is found to be a bottleneck.
Brian Kernighan once said,
"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"
Re:HPC (Score:3, Insightful)
It's often quite difficult to wrap your head around that unpredictability, especially since so much of beginning computer science education teaches programmers to evaluate each instruction in their programs in source order, as the computer is likely to end up doing when the program is run. This is made even worse by the fact that some languages (I know Java does, but there may be others too) allow a compiler to reorder instructions to improve performance provided it doesn't alter that thread's behavior. This is fine for a single-threaded application, but can be quite confusing for a multi-threaded application when you can no longer assume source ordering of instructions from other threads.
It took a while before I got comfortable with essentially asking myself "What am I assuming and do I actually know that at this point or do I just think I know it at this point" with every line of code that I write that might execute in a multi-threaded environment. Even with that, I still run into occasions where it takes over an hour to debug a race condition when that error only happens a small percentage of the time.
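A small Python sketch of the kind of assumption that bites here. The counter and thread counts are arbitrary. `counter += 1` looks like one step but is really a read, an add, and a write, so the lock is what makes the final total predictable.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times: int) -> None:
    global counter
    for _ in range(times):
        # `counter += 1` is a read-modify-write; without the lock, two threads
        # can read the same old value and one increment is lost.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Remove the `with counter_lock:` line and the program may still print the right number on most runs, which is exactly why this class of bug takes so long to find.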
Re:The basic problem (Score:4, Insightful)
They diminish, but they never disappear. Even in algorithms where you have to wait for the results of a previous computation before going on, you can still get a speedup with branch prediction. In essence, while one core is crunching the numbers, the other cores do the what-if work. Even if you mispredict in lots of cases, you can still get speedups with large datasets, because in some cases, when your first core comes up with a result, you will discover that the what-if computation started out with a right guess.
Hey, I hear they are doing essentially the same stuff with all those newfangled multiscalar processors and branch prediction anyway.
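A rough sketch of that what-if idea at the software level, in Python. All three functions are hypothetical placeholders. Both branches are started before the condition is known, and the wrong one is simply thrown away.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_condition(x: int) -> bool:
    """Stand-in for the computation everything else has to wait on."""
    return sum(range(x)) % 2 == 0


def if_true(x: int) -> int:
    return x * 2


def if_false(x: int) -> int:
    return x + 1


def speculative(x: int) -> int:
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(slow_condition, x)
        # Start both "what if" branches before the condition is known.
        guess_true = pool.submit(if_true, x)
        guess_false = pool.submit(if_false, x)
        # Keep the branch that turned out to be right; the other is wasted work.
        return guess_true.result() if cond.result() else guess_false.result()
```

This only shows the shape of the technique. In CPython the global interpreter lock keeps CPU-bound threads from actually running side by side, so a real speedup would need processes or a runtime without that restriction.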
Re:OS/2? (Score:4, Insightful)
I'm of the opposite opinion; it's a shame that so many people equate parallel processing with threads. When there's not much shared data, using multiple processes keeps memory protection between your parallel "things", decreasing coupling, increasing isolation, and generally resulting in a more stable system (and for certain things where you can avoid some cache coherency problems, a faster system). Your example is perfect; there's really no good reason to use a thread for such lookups. Another process would do, or even better just use select() and avoid all the pain (and bugs) of a multithreaded solution.
OS developers spent a lot of engineering time implementing protected memory. Threads throw out a huge portion of that; a good programmer won't do that without very good reasons. Some tasks, where there really are tons of complicated data structures to be shared, are good candidates for threading. More commonly, though, threads are used either because the programmer doesn't know any better or because they allow you to be a slacker about defining exactly what is shared and mediating access to it. The latter is especially dangerous; defining exactly what (and how) things are shared goes most of the way toward eliminating multiprocessing bugs, and threads make it easy to slack off on that and get a "mostly working" solution that occasionally deadlocks, fails to scale, etc.
Use processes or state machines when you can, and threads when you must.
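For the select() approach mentioned above, a minimal single-threaded sketch in Python. It uses the standard `selectors` module, which wraps select() or a platform equivalent, and local socket pairs stand in for real network lookups.

```python
import selectors
import socket

selector = selectors.DefaultSelector()

# Three connected socket pairs stand in for three pending lookups.
pairs = [socket.socketpair() for _ in range(3)]
for index, (reader, writer) in enumerate(pairs):
    reader.setblocking(False)
    selector.register(reader, selectors.EVENT_READ, data=index)
    writer.sendall(f"reply {index}".encode())

replies = {}
while len(replies) < len(pairs):
    # One thread waits on all sockets at once; no locks, no shared-state races.
    for key, _events in selector.select(timeout=1.0):
        replies[key.data] = key.fileobj.recv(1024).decode()
        selector.unregister(key.fileobj)

for reader, writer in pairs:
    reader.close()
    writer.close()
selector.close()

print(replies)
```

Everything runs on one thread, so there is nothing to lock and nothing to deadlock.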
Re:Thank god (Score:3, Insightful)
http://www.microsoft.com/downloads/details.aspx?FamilyID=e848dc1d-5be3-4941-8705-024bc7f180ba&displaylang=en [microsoft.com]
Essentially, they turn
    for (int i = 0; i < 100; i++) {
        a[i] = a[i]*a[i];
    }
into
    Parallel.For(0, 100, delegate(int i) {
        a[i] = a[i]*a[i];
    });
and the hint tells the
http://msdn.microsoft.com/msdnmag/issues/07/10/Futures/default.aspx [microsoft.com]
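A rough Python analogue of the loop rewrite above, for readers who don't use .NET. The `parallel_for` helper is invented for this sketch and is not part of any library; it just hands each iteration to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(start: int, stop: int, body) -> None:
    """Run body(i) for each i in [start, stop) on a pool of worker threads."""
    with ThreadPoolExecutor() as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(body, range(start, stop)))


a = list(range(100))


def square_in_place(i: int) -> None:
    a[i] = a[i] * a[i]   # each iteration touches only its own element


parallel_for(0, 100, square_in_place)
print(a[:5])
```

The rewrite is only safe because no iteration reads or writes another iteration's element. In CPython this particular loop will not get faster, since the global interpreter lock serializes CPU-bound threads.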
Re:There's not much hope for the C++ committee (Score:3, Insightful)
Critical sections are a high level feature which must be in a library.
The problem is that a C++ compiler doesn't know what data is locked, and which data items are locked by which lock, because the language has no way to talk about that subject. OS-level primitives lock everything. The compiler has a hard time telling which data needs concurrency protection. Thus, the compiler can't diagnose race conditions.
If the language understood locking, one could do more checking at compile time. One could take a hard-nosed approach. Every variable has to be locked by something. Either it's locked by the object of which it is a member (like Java's "Synchronized"), or the thread to which it is local, or by some other object which owns the variable. This last is something for which a language needs descriptive syntax.
One approach would be syntax where the programmer declares a critical section, and lists everything that can be referenced within the critical section. But that might not be necessary. A system more like the way an SQL database decides transaction locking issues might be easier on the programmer.
The big memory headache in C and C++ is always "who owns what", something with which the language provides no assistance. That's the cause of dangling pointers and memory leaks, but it's also the cause of much locking trouble.
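A sketch of the "every variable is locked by the object that owns it" idea, in Python. The `Guarded` class is hypothetical, and unlike the compile-time checking argued for above, this is only a runtime convention that the language does nothing to enforce.

```python
import threading


class Guarded:
    """Owns a value together with the one lock that protects it."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def update(self, fn):
        """Apply fn to the value while holding the lock; return the new value."""
        with self._lock:
            self._value = fn(self._value)
            return self._value

    def read(self):
        with self._lock:
            return self._value


balance = Guarded(100)             # hypothetical shared variable
balance.update(lambda v: v - 30)
print(balance.read())
```

The value can only be reached through methods that take the lock, so "which lock protects this data" is answered by the type instead of by a comment somewhere.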
Re:2005 Called (Score:2, Insightful)
As for sorting things like drop-down boxes, you will not have enough data to justify using multiple cores on it, unless you have millions of items in it, but then you have other problems.
Re:2005 Called (Score:3, Insightful)
The algorithms programmers have to deal with here involve concurrency, and have been in use for decades by anyone writing an OS or device driver. Dining Philosopher problem, readers and writers synchronization, etc. These are used on what most people think of as single processor computers and are essential. So I don't really think of these as "parallel programming", but as "parallel-light".
Parallel programming to me means dealing with SIMD or MIMD machines. MIMD has multiple processors each with its own memory and data, not multiple processors all sharing the same memory like SMP does. They may have high speed connections to a subset of other processors, such as being arranged in a grid or cube. SIMD has multiple processors all with their own data space but executing the same instruction sequences; the simplest form of which might be vector processors. The algorithms for these machines have very little in common with multithreading types of algorithms.
The parallel algorithms that require lots of sharing between processors will hit a bottleneck on the RAM with these multicore CPUs.
Re:2005 Called (Score:3, Insightful)
Doing things with digital video and photoshopping still images will use as much CPU as you can feed them. These are now mainstream uses for home computers.
Re:Thank god (Score:3, Insightful)
Garbage collection is a one size fits all solution, that is not appropriate for all the applications in the C++ problem space. Further there is a lot of C++ code already out there that does its own memory management. It would be difficult to retrofit this code to garbage collection.
Furthermore, many garbage collected languages lack proper destructors. At best they have a finalize method. This interferes with the C++ idiom "object creation is resource allocation; object destruction is resource release". This is the way C++ manages all resources. There are other resources besides memory, like open files, descriptors, network connections and many others. Because the garbage collected languages lack proper destructors, they actually make the management of these other resources more difficult. This can make garbage collected languages more complex and buggy. What the garbage collected languages give with one hand, they take away with the other!
I wish someone would develop a language with optional garbage collection and with proper destructors!
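As one data point, some garbage collected languages offer scoped release as a separate mechanism. A minimal Python sketch using the `with` statement follows. The `Connection` class is hypothetical, and this is not the same thing as a C++ destructor, since release is tied to the block rather than to the object's lifetime.

```python
class Connection:
    """Hypothetical resource whose release must not wait for the collector."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.open = False

    def __enter__(self) -> "Connection":
        self.open = True          # acquire
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.open = False         # release, even if the block raised
        return False              # do not swallow the exception


conn = Connection("db")
try:
    with conn:
        raise ValueError("work failed")
except ValueError:
    pass

print(conn.open)
```

The resource is released when the block exits, not whenever the collector gets around to it, but the caller has to remember to use `with`, which a destructor would not require.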
Memory matters, too (Score:3, Insightful)
Re:2005 Called (Score:3, Insightful)
The massively multicore processors are exactly where they need to be: in servers and workstations, and on the desks of hardware queens who absorb the cost of product development so I don't have to.
People run the vast majority of their applications concurrently with other applications. The only significant exception is gamers. When you're dealing with a sluggish app on a single-core machine, what are the odds it's unresponsive because of another application vs. being unresponsive because of its own problems? Now, same question, on a dual-core machine? The odds drop quite a bit. It's nice to have a spare core so when one app gets fussy the rest of your applications keep responding normally.
All the more reason to have multiple cores. In my experience, having multiple processors actually compensates for application-level and OS-level multiprocessing deficiencies, because let's face it, one hoggish app can make it very annoying to use a single-core machine. OSes are supposed to mitigate that, but since they don't do a perfect job, multiple cores help keep the system usable. Granted, there are other resources besides CPU that can suffer from contention, but every little bit helps.
threads are too high level (Score:3, Insightful)
I would much rather the operating system switch 4 or 16 synchronized cores completely over to me. Add prefixes to the assembly instructions so that I can explicitly execute instructions on processor 1, 2, 3, etc, in a shared memory model. Add logic similar to simultaneous multithreading to keep unused cores saturated with instructions from other threads when possible. This would help the programmer extract parallelism from tightly coupled algorithms. There seems to be no real multithreaded analogue to assembly language, and I think that is a big part of the problem. If we had such a thing it would be much easier to write tightly coupled parallel code, and higher level parallelization (from compilers) would follow inevitably.
Of course I'm not saying this is some sort of magic bullet. We would still need to split up computations and use threads as best as possible, but I think this is an obvious tool that we are missing.
Re:Oh, wow (Score:3, Insightful)
In doing so, you prove yourself a fool. It is a childish action that only hurts your cause, and Microsoft (as well as most people with any business or social sense) knows it.
You see Microsoft as some great evil to be overcome without seeing that a large part of your problem is yourself.
Companies see people like you bash anything that isn't open source or "free" and they quite rightly think that you haven't really thought things out or lack the business acumen to realize why all of the world can't work that way. (Not to mention the extreme lack of social skills that it shows)
I like open source, I use it, I occasionally write it, and I've championed the cause in a sane way.
What you are missing is that Microsoft is giving a lot of people and companies what they want - software that is relatively easy to use and which everyone else is already using ("best" doesn't matter most of the time, which a lot of you have problems understanding).
At the same time, they treat their employees well, paying them well with good benefits (from what I've heard from people I know who work there), and maintain well-respected research labs.
You do not draw good people from a good environment by telling them it's not a good environment because they don't make everything open source. You draw good people by being a better environment in terms of pay, benefits, culture, work-life balance etc *and* appealing to their sensibilities.
If you can't do that, and instead simply bash anyone for associating with "the enemy", you are doomed to fail because, at best, people will work on it as a hobby. The lion's share of good open source software is done by people being paid to do it. Bashing the company of people you want to work for you does not help.
Not all of the world cares about open source, and many of us who do are not fanatical about it and realize that, while it is good for some things, it is absolutely horrible for other things from a business standpoint. We like working on things that we see as important, but we also like being able to pay our bills and having a life outside of work.
Re:Shameless Plug: Qt 4.4 (Score:3, Insightful)
While you did say 'almost', I'm still going to take exception with that statement.
That is a very dangerous thing to say without reams of qualifications.
Programming (of any non-trivial nature) is not currently, nor is it likely to be any time soon, a 'no-brainer'. No library, no framework, no toolset, no abstraction takes away from the core fact that programming is hard. Sure, you can take away the boring/trivial stuff and give the programmers more time to work on the hard/interesting stuff, but that doesn't make it a 'no-brainer'.
Abstracting away mapReduce just means you don't have to know how to write your own mapReduce implementation. It doesn't automatically make the user of Qt (or whatever) an expert in designing parallel algorithms, nor parallel debugging, nor the performance benefits and tradeoffs and gotchas of parallel programming.
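To show how little the abstraction itself covers, here is a minimal map/reduce word count in Python. This is not Qt's API; the function names and input lines are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(line: str) -> Counter:
    """Map step: independent per-line work."""
    return Counter(line.split())


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results."""
    return left + right


lines = ["a b a", "b c", "a"]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_words, lines))
totals = reduce(merge, partials, Counter())
print(dict(totals))
```

The library handles the plumbing. Deciding that the map step really is independent, and that the reduce step really can be applied to the partial results in any grouping, is still the programmer's problem.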
Chip makers are at least 2 decades behind (Score:3, Insightful)
So after a decade of poor adoption on the part of software developers, the chip makers have ignored the fact that the wisdom of the (programming) mob indicates that multi-processing is not an attractive solution. Chip makers have known for more than two decades that they were going to run into physical limits eventually using the current technology, but opted for milking the 1970's model as long as possible rather than developing new technologies that might lead to much better single-core performance.