Forgot your password?
typodupeerror
Programming

Auto-threading Compiler Could Restore Moore's Law Gains 404

Posted by Unknown Lamer
from the sufficiently-smart-compiler dept.
New submitter Nemo the Magnificent writes "Develop in the Cloud has news about what might be a breakthrough out of Microsoft Research. A team there wrote a paper (PDF), now accepted for publication at OOPSLA, that describes how to teach a compiler to auto-thread a program that was written single-threaded in a conventional language like C#. This is the holy grail to take advantage of multiple cores — to get Moore's Law improvements back on track, after they essentially ran aground in the last decade. (Functional programming, the other great hope, just isn't happening.) About 2004 was when Intel et al. ran into a wall and started packing multiple cores into chips instead of cranking the clock speed. The Microsoft team modified a C# compiler to use the new technique, and claim a 'large project at Microsoft' have written 'several million lines of code' testing out the resulting 'safe parallelism.'" The paper is a good read if you're into compilers and functional programming. The key to operation is adding permissions to reference types allowing you to declare normal references, read-only references to mutable objects, references to globally immutable objects, and references to isolated clusters of objects. With that information, the compiler is able to prove that chunks of code can safely be run in parallel. Unlike many other approaches, it doesn't require that your program be purely functional either.
This discussion has been archived. No new comments can be posted.

Auto-threading Compiler Could Restore Moore's Law Gains

Comments Filter:
  • by HuguesT (84078) on Monday December 03, 2012 @08:24PM (#42174651)

    OK, so now we shall have another way to produce parallel programs.

    Running things safely in parallel is a very good thing, but it does not by itself guarantee significant improvements in speed. The hard bit is actually to optimize the parallel code [futurechips.org].

  • by lennier (44736) on Monday December 03, 2012 @08:42PM (#42174781) Homepage

    Or, gawd forbid.. we could teach programmers how to use threading?

    That's easy: "Don't."

    From everything I've read about threading, there's no general way a hand-coded multithreaded program can ever be proven to be correct. Threads introduce all sorts of extremely subtle timing-based logic and security bugs which even the smartest programmers in the world think they can handle but in practice don't. And most programmers are not the smartest programmers in the world (they only think they are).

    The correct solution is to switch from C-based imperative languages to pure-functional implicitly parallelised languages, but that's not likely to happen before the heat death of the universe.

  • by timeOday (582209) on Monday December 03, 2012 @08:49PM (#42174835)
    I hate it when all the potentially interesting discussion on a slashdot story is headed off by a pedantic nitpick.
  • by Jonner (189691) on Monday December 03, 2012 @08:56PM (#42174893)

    Moore's law has nothing to do with performance, numbnuts. Apparently the real geeks left Slashdot ages ago.

    Try reading the headline again. Usually, clueless posters have to be reminded to read TFA, but this is ridiculous. As Moore's "law" continues to be mostly true, the added transistors are being used for extra more cores rather than to make one core faster. Most of the cores sit idle most of the time because few programs can use more than one of them at once.

  • Re:Great potential (Score:5, Insightful)

    by allcoolnameswheretak (1102727) on Monday December 03, 2012 @09:00PM (#42174935)

    Imagine you have a for-loop that calls the method 'hop' on every object 'bunny' in the list:

    for every bunny in list {
            bunny.hop()
    }

    This is a simple, sequential process - bunny.hop() is called in order, for every bunny in the list, one after the other.
    Now suppose you have defined the data in your program in such a way that the compiler knows that method 'bunny.hop()' only ever accesses read-only data, or modifies local data that is unaccessible from anywhere else. The compiler now knows that the order of execution of the bunny hops doesn't really matter, as every call of bunny.hop() is independent from anything else. This frees the compiler to spawn threads or processes to call as many bunny.hop()'s as he likes at the same time, thereby processing through the list alot faster.

    Another method, bunny.eat() actually performs write access on a shared object 'carrot' when called. If two bunnies eat the same carrot, the compiler can not perform automatic parallelization, as running two bunny.eat() methods could lead to invalid state (only one piece of carrot remaining, two bunnies eating 1 piece of carrot at the same time, results in -1 pieces of carrot). In this case, the compiler will take care to run two bunnies eating away at the same carrot sequentially. However if there are 2 bunnies eating the green carrot and another 2 bunnies eating the yellow carrot, these are again independent from each other and can again be paralleled.

    The requirement to make this possible is to provide the compiler with information on what kind of data something is - is it an isolated hopping? Or a shared carrot? or a global Bunnygod that affects all bunnies?

  • by Jonner (189691) on Monday December 03, 2012 @09:08PM (#42175001)

    An interesting development, and much needed I fear, but yet another layer of abstraction to allow lazy developers to not have to really bother about knowing what their code is actually doing (that's for the poor SoB who has to maintain it is for...)

    Developing software is all about managing complexity. Abstractions are the primary tool used to do so. They are neither inherently good or bad. A large part of writing good software is finding the appropriate abstractions and eliminating inappropriate ones. If an abstraction allows a program to be written more easily with an acceptible level of performance and correctness, it is an appropriate abstraction.

    To know what code is "actually doing" is relative. No programmer knows what his code is doing at the level of gates on the chip. It's rarely necessary or even helpful to know what the code is doing at the level of CPU instructions or microinstructions. This is especially true if the code runs on multiple CPUs.

  • by fyngyrz (762201) on Monday December 03, 2012 @09:18PM (#42175075) Homepage Journal

    Honestly, you know what this says? This just says some programmers still need to go practice in the threading space.

    Threading is not that hard at all for a VERY large class of problems. There are issues -- like, the memory bandwidth will eventually choke the ability of processors to get at nearby memory regions -- and you need some kind of sane plan to deal with cacheing in one core as compared to the next core when you share flags or data coming in during execution from elsewhere -- and you need to figure out when thread overhead starts to chew into your efficiencies -- but really, it's entirely doable, and isn't that difficult a problem space as compared to the actual problems *serious* programmers have to solve. It may just be that your "thing" isn't really all that threadable. Some things aren't. But it should also be very obvious when they aren't.

    Then again, some problem spaces lend themselves very well to multiple core approaches -- I just finished a library that slices up images and then applies various separable algorithms to the slices in a variable number of threads (the user of the library can specify.) I wrote it in pure C, used posix threads, no sweat at all, in my 8-core machine, you get just about exactly the gains you'd think (x the number of cores) when the process is CPU cycle intensive, less as the ops are simpler and memory I/O rates rise.

    Seriously. If threading seems that hard, you need to go do some studying. It's not the threading. It's you. You really should get on this bus.

  • by Spiridios (2406474) on Monday December 03, 2012 @09:26PM (#42175127) Journal

    Maybe you need to read?

    Here's the headline (emphasis mine): Auto-threading Compiler Could Restore Moore's Law Gains

    In the past, adding more transistors to a core meant the core worked faster (simplification), so Moore's law indirectly lead to better performance. Now a core is close to as fast/efficient as its going to get, so we throw the extra transistors at new cores (still Moore's law). The problem is, there's only so many single-threaded processes a person can run before there will literally be one core per process. In order to see benefits from the extra transistors again, "gains" in the summary's terminology, then we need a way to take advantage of it (this new compiler technique, functional programming, programmers who can actually grok threads). In the end, if we're not seeing some kind of gain from Moore's law, then the chip manufacturers will stop adding new transistors because no one will pay money for a chip that's just as good as their current chip, and Moore's law will fail.

  • by SplashMyBandit (1543257) on Monday December 03, 2012 @09:39PM (#42175215)

    Actually you don't need functional. Java works fine with multi-threading (well, it does for my daily use). In fact, that is the real strength of the garbage collection model of Java, it can track and manage efficient release of resources in a multi-threaded system. Also Java has direct language support for synchronization mechanisms and extensive library support for multi-thread friendly data structures (Doug Leah is a legend!). Lots and lots of multi-threaded programs get written in Java, without needing functional languages, it's just that we we do write them they do work (after we test for various possible thread-related problems), work well, and we don't have to bitch about needing to change paradigms to get things going (eg. needing to change to functional languages when they are not always the best fit for many classes of problem).

    Of course you'll hear those still stuck on legacy languages designed without multi-threading in mind whinging. That doesn't mean there isn't a straightforward, easy to learn language and development platform out there that isn't strong on multi-threading. There is - and it is cross-platform and general purpose too.

  • Re:Great potential (Score:5, Insightful)

    by Your.Master (1088569) on Monday December 03, 2012 @09:51PM (#42175289)

    If it has a visual effect, then it's not read-only.

  • by betterunixthanunix (980855) on Monday December 03, 2012 @09:52PM (#42175293)
    Let me repeat myself: except for very small programs or very short inner loop bodies. Most of what you are describing is pretty small in terms of the amount of code and its complexity; hand-optimizing assembly code does not scale up.

    Even the best of C or C++ compilers are terrible at vectorization of code.

    Yeah, and the best humans are terrible at allocating registers -- so bad, in fact, that the best C compilers ignore the register keyword. What do you think is more relevant to the general case: vectorizing multimedia operations, or allocating registers? Compilers are also better than humans at:

    • Finding redundant or unreachable code
    • Finding dead or otherwise unneeded variables
    • Finding algebraic reductions
    • Strength reducing transformations
    • Boolean optimizations
    • Reducing program size

    ...and numerous other optimizations that, like register allocation, are more generally applicable than SIMD. There has also been work on compiler optimizations that are utterly out of reach for even the most skilled humans, like probabilistic optimizations that have a negligible (with respect to some tunable parameter) probability of being unsound.

    To put it another way, look at the Orbitz story, which is over a decade old now:

    http://www.paulgraham.com/carl.html [paulgraham.com]

    On the one hand, you have hand-tuned assembly language. On the other, you have a program written in Lisp (a high level, compiled language) with some C++ mixed in (for managing memory). Orbitz was able to compete on speed, but more importantly, it was returning better results. It's not that the people who wrote that mainframe assembly language were idiots -- they were taking advantage of all sorts of special hardware features, they knew how to hack their machines better than anyone else -- it is just that the Orbitz algorithm was far too complex for efficient hand-rolled assembly code, at which point compilers are really the only choice. The mainframe guys were busy thinking about how to make use of special machine features in assembly language; the ITA team was busy solving the higher-level problem, and relying on their compiler to generate good assembly language code. This is a particularly telling line:

    We disassemble most every Lisp function looking for inefficiencies and have had both CMUCL and Franz enhanced to compile our code better.

    [emphasis mine]. They disassembled their code...and then improved their compiler when they saw problems. They did not hand-roll the code, they made the compiler do a better job of generating code. These are not lazy programmers, nor are they programmers who do not know how to use assembly language; they are programmers who understand that they have a tool that is far better at generating assembly language than they are, and that they have more important things to do with their time.

    I deal with quite a bit of crypto code in my work. I have seen lots of hand-tuned assembly language, I dealt with code that took advantage of the AESNI instructions to perform very fast encryption. I am well aware that in small, highly specialized functions (like AES), humans are better able to utilize special instructions to improve performance. Those are niche cases, and the techniques used in those cases have very limited applicability (even SSE is fairly limited in its applicability, by comparison with the sort of code programmers write and maintain every day), and the techniques scale very poorly.

  • Re:Great potential (Score:5, Insightful)

    by paskie (539112) <pasky@nosPAm.ucw.cz> on Monday December 03, 2012 @09:55PM (#42175309) Homepage

    One would assume that "world modification" (monad; in this case, visual I/O) does not fit the restriction "only ever accesses read-only data, or modifies local data that is unaccessible from anywhere else" and as soon as you are showing something to the user, that blocks the auto-threading.

    Then again, if you are drawing map tiles, the order you draw them in doesn't matter - so this goes to show that in some situations, there's no way around the programmer actually having to stop, think and declare whether the program's contract allows it to do some user-visible operation in parallel.

  • Re:Great potential (Score:4, Insightful)

    by elfprince13 (1521333) on Monday December 03, 2012 @10:11PM (#42175407) Homepage
    One problem, and the great attraction of this way of doing things, is that most software companies don't have too many programmers worth $120k. Lots of $50k-$80k, but not so much on the higher end of the pay scale until you get into research divisions at big companies like Microsoft or Google or IBM.
  • by betterunixthanunix (980855) on Monday December 03, 2012 @10:19PM (#42175437)

    Just like quicksort(3) is far faster than bubblesort so too is a highly threadable code faster than non-threadble code

    First, just to be pedantic, I'll point out that quicksort is as bad as bubblesort in the worst case, to a constant factor (you should have picked heapsort or mergesort). That aside, it is worth noting (and I am somewhat bothered by this when it comes to TFA) that we still do not know if it is even possible to optimize any program by parallelizing it; see the NC-vs-P question:

    https://en.wikipedia.org/wiki/P-complete [wikipedia.org]

    Multithreading is not a magic bullet, and in all likelihood it is not generally applicable.

    Languages do not, contrary to belief, express intent, the provide a strict set of instructions that the computer MUST respect

    Wrong on all counts. Imperative languages are a way to convey instructions to a computer; declarative languages do not convey instructions, and programming in a declarative language requires an entirely different mode of thinking (it is closer to asking a question that giving instructions). It is also not strictly necessary for the computer to do exactly what a program expresses; there has been some work on compiler optimizations that have a (tunable) chance of not maintaining soundness.

    In the end a good algorithm with no compiler help will beat optimized "dumb" code in all cases larger than "toy" (say, a few dozen "n" in Big-O notation)

    If you are convinced of this, try implementing something more complex than the algorithms you see in Knuth's books; say, this:

    http://eurocrypt2010rump.cr.yp.to/9854ad3cab48983f7c2c5a2258e27717.pdf [cr.yp.to]

    Then ask yourself this: could the constant factors in your implementation be better? At the end of the day, big constant factors will hurt your performance so badly that you might as well have used an asymptotically worse algorithm; indeed, consider fast integer multiplication:

    https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm [wikipedia.org]

    10000 digits are needed before that algorithm actually outperforms the asymptotically worse Toom-Cook family of algorithms. Here is an even more extreme example:

    https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm [wikipedia.org]

    Sure, that's a better matrix multiplication algorithm in the asymptotic sense...but only for matrices that are so large that you cannot even store them on today's computers.

    So really, while you are right that asymptotic improvements will always beat constant factor improvements (which is what compilers are mostly going to do for you), you are wrong to ignore the importance of constant factor improvements. In the real world, constant factors matter. In the real world, quicksort and mergesort will use asymptotically worse algorithms below a certain problem size because of constant factors. In the real world, large integer multiplication is done using Karatsuba or Toom-Cook methods, not FFT methods, because of constant factors. In the real world, if you are not using a compiler to optimize your code, your code is going to be slower than it needs to be, even if you spent hours hand-tuning it, unless you are only dealing with toy problems.

  • by Anonymous Coward on Monday December 03, 2012 @11:04PM (#42175717)

    L4 isn't exactly trivial, but it's proof exceeds the length of it's code by whole factors.

  • by shmageggy (2682297) on Tuesday December 04, 2012 @12:18AM (#42176091)
    Yeah but "Moore's Heuristic" just aint as snappy.
  • by Anonymous Brave Guy (457657) on Tuesday December 04, 2012 @03:00AM (#42176743)

    It may just be that your "thing" isn't really all that threadable. Some things aren't. But it should also be very obvious when they aren't.

    Should it? Do you really want to bet that you understand a modern CPU, the related hardware architecture, and a production quality compiler for that platform, all well enough to predict reliably when it's worth running a few instructions in parallel and when it isn't, taking into account the inevitable overheads of fork/join operations vs. the potential performance gains of concurrency? Can you really analyse the global structure of a 1,000,000+ line project and reliably identify algorithms that are expressed sequentially but in fact could be transformed losslessly into parallelised equivalents?

    Implementing efficient algorithms for embarrassingly parallel computations is (relatively) easy. Implementing efficient algorithms for basically independent parallel operations, such as a web server, is (relatively) easy. Implementing efficient algorithms for problems that might admit some asymmetric parallelism as long as all interactions are coordinated safely and where it might be worth implicit generation of some parallel code paths as long as the benefits outweigh the overheads... Well, if that doesn't seem hard to you, I look forward to reading your papers, because you just outsmarted a lot of researchers who have been investigating this problem for years.

  • by subreality (157447) on Tuesday December 04, 2012 @07:16AM (#42177641)

    There is no metaphysical part of the program that is not parallellizable


    require 'digest/sha2'
    string = 'Parallelize this!'
    123456789.times { string = Digest::SHA2.hexdigest(string) }
    puts string

    Any suggestions?

    Most programs have parts that can be run in parallel, but Amdahl's point is that as you increase the number of processors you'll find that the parts that are inherently serial - even if they're quite small - will increasingly become the bottleneck.

Whenever people agree with me, I always think I must be wrong. - Oscar Wilde

Working...