An Overview of Parallelism

Mortimer.CA writes with a recently released report from Berkeley entitled "The Landscape of Parallel Computing Research: A View from Berkeley": "Generally they conclude that the 'evolutionary approach to parallel hardware and software may work from 2- or 8-processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism.' This assumes things stay 'evolutionary' and that programming stays more or less as it has in previous years (though languages like Erlang can probably help to change this)." Read on for Mortimer.CA's summary, from the paper, of some "conventional wisdoms" and their replacements.

Old and new conventional wisdoms:
  • Old CW: Power is free, but transistors are expensive.
  • New CW is the "Power wall": Power is expensive, but transistors are "free." That is, we can put more transistors on a chip than we have the power to turn on.

  • Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
  • New CW: As chips drop below 65-nm feature sizes, they will have high soft and hard error rates.

  • Old CW: Multiply is slow, but load and store is fast.
  • New CW is the "Memory wall" [Wulf and McKee 1995]: Load and store is slow, but multiply is fast.

  • Old CW: Don't bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
  • New CW: It will be a very long wait for a faster sequential computer (see above).

Comments Filter:
  • Erlang (Score:5, Insightful)

    by Anonymous Coward on Saturday February 10, 2007 @08:09PM (#17967176)
    Erlang only provides a way of proving parallel correctness, a la CSP. This means avoiding deadlocks and such. The primary difficulty of crafting algorithms to run efficiently over multiple CPUs still remains. Erlang does not do any automatic parallelization, and expects the programmer to write the code with multiple CPUs in mind.


    I'm waiting for a language that will parallelize stuff for you. This is most likely to be a functional language, or an extension to an existing functional language. Maybe even Erlang.

  • It's not hard (Score:5, Insightful)

    by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Saturday February 10, 2007 @08:12PM (#17967204) Homepage

    I think the main reason people say "don't use threads" is that while single-threaded apps are easy to debug, multi-threaded ones will crash and burn at seemingly random places if the programmer didn't plan ahead and use proper locking. This is probably good advice for a noob programmer, but otherwise I can't stand people who are of the "absolutely, never, ever use threads" mindset.

    Some applications have no need to be multithreaded, but when they do it is a lot easier than people make it out to be. Taking advantage of lock-free algorithms and NUMA for maximum scalability *can* be hard, but the people who need these will have the proper experience to tackle it.

    Language extensions for threading would be great, and I'm sure somebody is working on them. But until that magical threading language (maybe C++1x) comes along, the current ones work just fine.
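
    In the meantime, here is a minimal Java sketch of the kind of "proper locking" mentioned above; the shared counter, thread count, and iteration count are invented purely for illustration:

        import java.util.concurrent.locks.ReentrantLock;

        public class SafeCounterDemo {
            // Hypothetical shared counter; the lock is the "proper locking" the
            // parent post refers to -- without it, concurrent increments get lost.
            static class SafeCounter {
                private final ReentrantLock lock = new ReentrantLock();
                private long count = 0;

                void increment() {
                    lock.lock();
                    try {
                        count++;           // read-modify-write is atomic only under the lock
                    } finally {
                        lock.unlock();     // always release, even if an exception is thrown
                    }
                }

                long get() {
                    lock.lock();
                    try {
                        return count;
                    } finally {
                        lock.unlock();
                    }
                }
            }

            public static void main(String[] args) throws InterruptedException {
                SafeCounter c = new SafeCounter();
                Runnable work = () -> { for (int i = 0; i < 100_000; i++) c.increment(); };
                Thread t1 = new Thread(work);
                Thread t2 = new Thread(work);
                t1.start(); t2.start();
                t1.join(); t2.join();
                System.out.println(c.get());  // always 200000 with the lock in place
            }
        }

    Without the lock, the two threads race on the read-modify-write and the final count usually comes up short; with it, the result is deterministic.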

  • by SparhawkA ( 608965 ) on Saturday February 10, 2007 @08:16PM (#17967236)
    Take a look at LabVIEW, a compiled graphical programming language from National Instruments. It natively supports SMP / multicore / multithreading. Essentially, independent pieces of code you write (computations, hardware I/O, etc.) are automatically scheduled in separate threads of execution in order to maximize efficiency. It's an interesting idea: here's a technical article from their website that does a better job of describing it (some marketing included as well): http://zone.ni.com/devzone/cda/tut/p/id/4233 [ni.com]
  • by ardor ( 673957 ) on Saturday February 10, 2007 @08:37PM (#17967386)
    Functional languages are no silver bullet, however. Things like I/O do not fit well in there. Yes, there are solutions for this, but they tend to be overly complicated. A hybrid functional/imperative language with safeguards for side-effects of the imperative parts seems to be the way to go.
  • by Animats ( 122034 ) on Saturday February 10, 2007 @08:59PM (#17967546) Homepage

    I just heard that talk; he gave it at EE380 at Stanford a few weeks ago.

    First, this is a supercomputer guy talking. He's talking about number-crunching. His "13 dwarfs" are mostly number-crunching inner loops. Second, what he's really pushing is getting everybody in academia to do research his way - on FPGA-based rackmount emulators.

    Basic truth about supercomputers - the commercial market is zilch. You have to go down to #60 on the list of the top 500 supercomputers [top500.org] before you find the first real commercial customer. It's BMW, and the system is a cluster of 1024 Intel x86 1U servers, running Red Hat Linux. Nothing exotic; just a big server farm set up for computation.

    More CPUs will help in server farms, but there we're I/O bound to the outside world, not talking much to neighboring CPUs. If you have hundreds of CPUs on a chip, how do you get data in and out? But we know the answer to that - put 100Gb/s Ethernet controllers on the chip. No major software changes needed.

    This brings up one of the other major architectural truths: shared memory multiprocessors are useful, and clusters are useful. Everything in between is a huge pain. Supercomputer guys fuss endlessly over elaborate interconnection schemes, but none of them are worth the trouble. The author of this paper thinks that all the programming headaches of supercomputers will have to be brought down to desktop level, but that's probably not going to happen. What problem would it solve?

    What we do get from the latest rounds of shrinkage are better mobile devices. The big wins commercially are in phones, not desktops or laptops. Desktops have been mostly empty space inside for years now. In fact, that's true of most non-mobile consumer electronics. We're getting lower cost and smaller size, rather than more power.

    Consider cars. For the first half of the 20th century, the big thing was making engines more powerful. By the 1960s, engine power was a solved problem (the 1967 turbine-powered Indy car finally settled that issue), and cars really haven't become significantly more powerful since then. (Brakes and suspensions, though, are far better.)

    It will be very interesting to see what happens with the Cell. That's the first non-shared memory multiprocessor to be produced in volume. If it turns out to be a dead end, like the Itanium, it may kill off interest in that sort of thing for years.

    There are some interesting potential applications for massive parallelism for vision and robotics applications. I expect to see interesting work in that direction. The more successful vision algorithms do much computation, most of which is discarded. That's a proper application for many-CPU machines, though not the Cell, unless it gets more memory per CPU. Tomorrow's robots may have a thousand CPUs. Tomorrow's laptops, probably not.

  • Re:Erlang (Score:3, Insightful)

    by CastrTroy ( 595695 ) on Saturday February 10, 2007 @09:33PM (#17967740)
    I wrote some parallel code using MPI [lam-mpi.org] in university. It takes a lot of work to get the hang of at first, and many people I know who were good at programming had lots of trouble in this course, because programming for parallelism is very different from programming for a single processor. On the other hand, you can get much better performance from parallel algorithms. However, I think that we could do just as well sticking with the regular algorithms, and having a lot of threads each running on a different core. If you look at an RDBMS, it would be nice if you could sort in less than n log(n) time, but it's even better if you just sort in n log(n) and can run 128 sorts simultaneously. I seem to remember some news about Intel saying they would have 128-core chips available in the near future.
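
    As a rough illustration of that approach (many independent, ordinary sequential sorts running side by side rather than one parallelized sort), here is a sketch using Java's ExecutorService; the array sizes, the job count of 128, and the pool size are made up for the example:

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import java.util.Random;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        public class ManySorts {
            public static void main(String[] args) throws Exception {
                int cores = Runtime.getRuntime().availableProcessors();
                ExecutorService pool = Executors.newFixedThreadPool(cores);
                Random rng = new Random(42);

                // 128 independent jobs, each a plain sequential n log n sort; the
                // speedup comes from running them side by side, not from
                // parallelizing any single sort.
                List<Future<int[]>> results = new ArrayList<>();
                for (int i = 0; i < 128; i++) {
                    int[] data = rng.ints(100_000).toArray();
                    results.add(pool.submit(() -> {
                        Arrays.sort(data);    // ordinary sequential sort
                        return data;
                    }));
                }
                for (Future<int[]> f : results) {
                    f.get();                  // wait for every sort to finish
                }
                pool.shutdown();
                System.out.println("sorted " + results.size() + " arrays on " + cores + " cores");
            }
        }

    Each task is a plain sequential sort; the only concurrency is in how the tasks are scheduled across the pool.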
  • by jd ( 1658 ) <imipak@ y a hoo.com> on Saturday February 10, 2007 @09:54PM (#17967912) Homepage Journal
    This problem was "solved" (on paper) in the mid 1970s. Instead of writing a highly complex parallel program that you can't readily debug, you write a program that the computer can generate the parallel code for. Provided the compiler is correct, the sequential source and the parallel binary will be functionally the same, even though (at the instruction level) they might actually be quite different. What's more, if you compile the sequential source into a sequential binary, the sequential binary will behave exactly the same as the parallel version (only much slower).

    Any reproducible bug in the parallel binary will be reproducible given the same set of inputs on the sequential binary, which you can then debug as you have the corresponding sequential source code.

    So why isn't this done? Automagically parallelizing compilers (as opposed to compilers that merely parallelize what you tell them to parallelize) are extremely hard to write. Until the advent of Beowulf clusters, low-cost SMP and low-cost multi-core CPUs, there simply haven't been enough machines out there capable of sufficiently complex parallelism to make it worth the cost. The cheap alternative is to simply make a complex-enough inter-process communication system, with a million ways to signal and a billion types of events; any programmer who complains they can't use that mess can then be burned at the stake for their obvious lack of appreciation for all these fine tools.

    Have you ever run GCC with maximum profiling over a program, tested the program, then re-run GCC using the profiling output as input to the optimizer? It's painful. Now, to parallelize, the compiler must automatically not just do one trivial run but get as much coverage as possible, and then not just tweak some optimizer flags but run some fairly hefty heuristics to guess what a parallel form might look like. And it will need to do this not just once, but many times over to find a form that is faster than the sequential version and does not result in any timing bugs that can be picked up by automatic tools.

    Spending a small fortune on building a compiler that can actually do all that reliably, effectively, portably and quickly, when the total number of purchasers will be in the double or treble digits at most, is a non-starter - say what you like about the blatant stupidity rife in commercial software, but they know a bad bet when they see one. You will never see something with that degree of intelligence come out of PCG or Green Hills - if they didn't go bankrupt making it, they'd go bankrupt from the unsold stock, and they know it.

    What about a free/open source version? GCC already has some of the key ingredients needed, after all. Aside from the fact that the GCC developers are not known for their speed or responsiveness - particularly to arcane problems - it would take many days to compile even SuperTuxKart and probably months when it came to X11, glibc or even the Linux kernel. This is far longer than the lifetime of most of the source packages - they've usually been patched on that sort of timeframe at least once. The resulting binaries might even be truly perfectly parallel, but they'd still be obsolete. You'd have to do some very heavy research into compiler theory to get GCC fast enough and powerful enough to tackle such problems within the lifetime of the product being compiled. Hey, I'm not saying GCC is bad - as a sequential, single-pass compiler, it's pretty damn good. At the supercomputer shows, GCC is used as the benchmark to beat, in terms of code produced. The people at such shows aren't easily impressed and wouldn't be swayed by boasts of producing binaries a few percent faster than GCC unless that meant a hell of a lot. But I'm not convinced it'll be the launchpad for a new generation of automatic parallelizing compilers. I think that's going to require someone writing such a compiler from scratch.

    Automatic parallelization is unlikely to happen in my lifetime, even though the early research was taking place at about the time I first started primary school. It's a hard problem that isn't being made easier by having been largely avoided.

  • by zestyping ( 928433 ) on Saturday February 10, 2007 @09:59PM (#17967950) Homepage
    Reliably achieving even simple goals using concurrent threads that share state is extremely difficult. For example, try this task:

    Implement the Observer [wikipedia.org] (aka Listener) pattern (specifically the thing called "Subject" on the Wikipedia page). Your object should provide two methods, publish and subscribe. Clients can call subscribe to indicate their interest in being notified. When a client calls publish with a value, your object should pass on that value by calling the notify method on everyone who has previously subscribed for updates.

    Sounds simple, right? But wait:
    • What if one of your subscribers throws an exception? That should not prevent other subscribers from being notified.
    • What if notifying a subscriber triggers another value to be published? All the subscribers must be kept up to date on the latest published value.
    • What if notifying a subscriber triggers another subscription? Whether or not the newly added subscriber receives this in-progress notification is up to you, but it must be well defined and predictable.
    • Oh, and by the way, don't deadlock.
    Can you achieve all these things in a multithreaded programming model (e.g. Java)? Try it. Don't feel bad if you can't; it's fiendishly complicated to get right, and I doubt I could do it.

    Or, download this paper [erights.org] and start reading from section 3, "The Sequential StatusHolder."

    Once you see how hard it is to do something this simple, think about the complexity of what people regularly try to achieve in multithreaded systems; that pretty much explains why computer programs freeze up so often.
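
    To make the difficulty concrete, here is a Java sketch of the Subject described above. It is only a sketch of how far the obvious approach gets, not a complete solution: an exception in one subscriber does not starve the others, subscribing during a notification is well defined (the copy-on-write list means the newcomer misses the in-progress round), and re-entrant publishes are handled by coalescing them into a "latest value" loop, but avoiding deadlock still depends on subscribers not holding their own locks when they call back in:

        import java.util.List;
        import java.util.concurrent.CopyOnWriteArrayList;
        import java.util.function.Consumer;

        public class Subject<T> {
            private final List<Consumer<T>> subscribers = new CopyOnWriteArrayList<>();
            private final Object lock = new Object();
            private T pending;            // latest value not yet delivered
            private boolean notifying;    // true while a delivery loop is running

            public void subscribe(Consumer<T> s) {
                subscribers.add(s);       // safe even during an in-progress publish
            }

            public void publish(T value) {
                synchronized (lock) {
                    pending = value;
                    if (notifying) return;    // the running loop will deliver it
                    notifying = true;
                }
                while (true) {
                    T toSend;
                    synchronized (lock) {
                        if (pending == null) { notifying = false; return; }
                        toSend = pending;
                        pending = null;
                    }
                    for (Consumer<T> s : subscribers) {
                        try {
                            s.accept(toSend);             // deliver outside the lock
                        } catch (RuntimeException e) {
                            // one misbehaving subscriber must not block the rest
                        }
                    }
                }
            }

            public static void main(String[] args) {
                Subject<String> subject = new Subject<>();
                subject.subscribe(v -> System.out.println("A got " + v));
                subject.subscribe(v -> { throw new RuntimeException("B misbehaves"); });
                subject.subscribe(v -> System.out.println("C got " + v));
                subject.publish("hello");   // A and C are still notified despite B
            }
        }

    Even this much requires care about what happens inside versus outside the lock, which is exactly the parent's point.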
  • Re:It's not hard (Score:4, Insightful)

    by mikael ( 484 ) on Saturday February 10, 2007 @10:07PM (#17967998)
    Not all algorithms can be parallelized that easily. Imagine e.g. a parser: You cannot parse text by having a million processors looking at one character each.

    You could have the first thread split the text by whitespace. Then each block of characters is assigned to any number of processors to find the matching token. I've seen some parsers where the entire document was read in and converted into an array of tokens before returning to the calling routine.
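
    Here is a toy Java sketch of that two-stage idea (one sequential pass splits on whitespace, then the token matching for each word is spread across cores); the Token type, the keyword set, and the sample input are invented for illustration:

        import java.util.Arrays;
        import java.util.List;
        import java.util.Set;
        import java.util.stream.Collectors;

        public class ParallelTokenizer {
            // Invented token kinds, purely for the example.
            enum Kind { KEYWORD, NUMBER, IDENTIFIER }
            record Token(Kind kind, String text) {}

            private static final Set<String> KEYWORDS = Set.of("if", "else", "while", "return");

            public static void main(String[] args) {
                String source = "if x then return 42 else while y do z";

                // Stage 1 (sequential): split the text on whitespace.
                String[] words = source.trim().split("\\s+");

                // Stage 2 (parallel): classifying each word is independent of the
                // others, so the matching can be spread across cores.
                List<Token> tokens = Arrays.stream(words)
                        .parallel()
                        .map(ParallelTokenizer::classify)
                        .collect(Collectors.toList());

                tokens.forEach(System.out::println);
            }

            private static Token classify(String word) {
                if (KEYWORDS.contains(word)) return new Token(Kind.KEYWORD, word);
                if (word.chars().allMatch(Character::isDigit)) return new Token(Kind.NUMBER, word);
                return new Token(Kind.IDENTIFIER, word);
            }
        }

    The token order is preserved because the parallel stream collects in encounter order; only the per-word classification runs concurrently.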
  • Re:It is hard (Score:1, Insightful)

    by Anonymous Coward on Saturday February 10, 2007 @10:48PM (#17968260)
    "don't use threads unless you have to" should read: "don't use ANY arbitrary feature unless you have to" i.e. KISS.
  • by antifoidulus ( 807088 ) on Saturday February 10, 2007 @10:53PM (#17968284) Homepage Journal
    Also keep in mind that many companies aren't interested in Linpack performance per se, at least not to the extent that they will spend a lot of time and effort tweaking their computers to get really high Linpack scores, which is all that matters when it comes to the Top500.
  • by Doctor Memory ( 6336 ) on Sunday February 11, 2007 @12:06AM (#17968850)

    Most of the time a computer isn't churning away on a single problem that needs to be parallelized. In that respect, the solution rests more with the operating system.
    True, and in the short term I'd imagine that's where most of the improvements will appear, but context switches are expensive. It's more efficient if you can switch execution to another thread within the same context, so you can (hopefully) still use the data in your I & D caches, although you do still have a register spill/reload.

    Speaking of architecture changes, it sounds like Intel is going down the same road the Alpha team did: more caching. I remember reading an article about one system DEC made (ISTR this was about the time the 21264 came out) that had 1M of L1 cache, 2M of L2, and 8M of L3. I wonder how much cache they could squeeze onto a chip, given current power handling.
  • by kwahoo ( 899178 ) on Sunday February 11, 2007 @12:12AM (#17968902)

    ...if there were, the language wars of the 80s and 90s would have produced an answer. And what new language caught on? Not Sisal, or C*, or Multilisp, etc. It was Java. And C, C++, and Fortran are still going strong.

    Part of the problem, as previous posts have observed, is that most people didn't have much incentive to change, since parallel systems were expensive, and bloated, inefficient code would inevitably get faster thanks to the rapid improvement in single-thread performance that we enjoyed until recently. So outside of HPC and cluster apps, most parallelism consisted of decoupling obviously asynchronous tasks.

    I don't think there ever will be one language to rule them all.... The right programming model is too dependent on the application, and unless you are designing a domain-specific system, you will never get people to agree. Depending on your needs, you want different language features and you make different tradeoffs on performance vs. programmability. For some applications, functional programming languages will be perfect, for others Co-Array Fortran will be, for others an OO derivative like Mentat will be, etc. And as new applications come to the fore, new languages will continue to spawn.

    I think the key is to:

    • Do your best to estimate what range of applications your chip will need to support (so you could imagine that chips for desktops/workstations might diverge from those for e-commerce datacenters--which we already see to a mild extent)
    • Deduce what range of programming language features you need to support
    • Do your best to design an architecture that is flexible enough to support that, and hopefully not close off future ideas that would make programming easier.

    If one programming model does triumph, I would predict that it will be APIs that can be used equally well from C, Fortran, Java, etc., thus allowing different application domains to use their preferred languages. And even that model is probably not compelling enough to bring them all and in the dark bind them....

  • 640k, anyone? (Score:3, Insightful)

    by try_anything ( 880404 ) on Sunday February 11, 2007 @04:27AM (#17970394)

    Something like a server could relatively easily make use of almost any number of processors; one per client maybe.

    What happens when a 1024-core server is too slow to handle 700 concurrent connections, and the only upgrade option is a 2048-core server? Then it matters whether each of those 700 requests is a parallelizable problem. Imagine a server that solves difficult computations like routing delivery traffic, designing tailored clothes from customer snapshots, monitoring security camera feeds at a casino, or analyzing a twenty-second voice recording to decide where to route a call. ("To help us best serve you, briefly state why you are calling.") Saying that one core per client will always be sufficient, even when cores stop getting faster, is tantamount to saying that nobody will ever figure out how to use all that power -- historically, a poor prediction.

  • Use a Database (Score:2, Insightful)

    by Tablizer ( 95088 ) on Sunday February 11, 2007 @08:19PM (#17976858) Journal
    Databases already allow a kind of parallel processing. A.C.I.D.-based techniques allow multiple users (processors) to send results to the same database and thereby communicate them between each user/client. Each "client" may be single-threaded, but together a client/server system is essentially a multi-threaded application, all without odd code or odd programming languages.
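
    As a rough sketch of that idea, assuming the H2 in-memory database as the shared store (any ACID database would do; the table layout and the "work" are made up for illustration), several single-threaded workers below coordinate purely through transactions:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class DbCoordination {
            // Assumes the H2 driver is on the classpath; the in-memory database
            // plays the role of the shared server from the parent post.
            private static final String URL = "jdbc:h2:mem:results;DB_CLOSE_DELAY=-1";

            public static void main(String[] args) throws Exception {
                try (Connection setup = DriverManager.getConnection(URL);
                     Statement st = setup.createStatement()) {
                    st.execute("CREATE TABLE results(worker INT, value BIGINT)");
                }

                // Each "client" is single-threaded; the database's ACID guarantees
                // keep their concurrent writes consistent without explicit locking.
                Thread[] workers = new Thread[4];
                for (int w = 0; w < workers.length; w++) {
                    final int id = w;
                    workers[w] = new Thread(() -> {
                        try (Connection conn = DriverManager.getConnection(URL);
                             PreparedStatement ins =
                                 conn.prepareStatement("INSERT INTO results VALUES (?, ?)")) {
                            conn.setAutoCommit(false);
                            for (long i = 0; i < 1000; i++) {
                                ins.setInt(1, id);
                                ins.setLong(2, i * i);   // stand-in for real computation
                                ins.executeUpdate();
                            }
                            conn.commit();               // results become visible atomically
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    });
                    workers[w].start();
                }
                for (Thread t : workers) t.join();

                try (Connection conn = DriverManager.getConnection(URL);
                     Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM results")) {
                    rs.next();
                    System.out.println("collected " + rs.getLong(1) + " results");  // 4000
                }
            }
        }

    Each worker only talks to the database; the atomicity and isolation of the commits stand in for explicit locking between the clients.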
  • Re:Erlang (Score:2, Insightful)

    by middlemen ( 765373 ) on Sunday February 11, 2007 @10:11PM (#17977694)
    However, I think that we could do just as well sticking with the regular algorithms, and having a lot of threads each running on a different core.

    This is exactly the problem with the widespread acceptance of parallel programming. Software programmers don't want to think about new parallel algorithms. I have been, and still am, working in the parallel programming industry, for 2.5 years now (of the total 3 years of work experience that I have), and I find it ridiculous that many programmers with 5-10 years of experience or more don't want to use their god-damn brains. Agreed, parallel programming is a different paradigm and makes you think more, but what the hell is wrong with that!? You already know sequential-style programming; now you can know both and leverage that knowledge to make the best use of the available hardware, to your satisfaction. Isn't that what a programmer should do? Leverage his knowledge to gain maximum use of the available computer hardware?

    They should just teach parallel programming (not multi-threading, but parallel programming - using MPI, for example) to everyone learning programming in college. It is very useful, especially for engineers, since they mostly work with programs that use huge amounts of memory or do intensive computations for long periods of time.
