Programming

How We'll Program 1000 Cores - and Get Linus Ranting, Again 449

vikingpower writes: For developers, 2015 got kick-started mentally by a Linus Torvalds rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for their political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who is thinking in the direction of lock-less computation of intermediate, possibly faulty results that are updated often. By the end of this year, we may be thinking differently about parallel server-side computing than we do today.
  • Mutex lock (Score:5, Funny)

    by Anonymous Coward on Friday January 02, 2015 @02:05AM (#48715081)

    All the others ended up in a mutex lock situation, so I had a chance to do the first post

    • by NoNonAlphaCharsHere ( 2201864 ) on Friday January 02, 2015 @02:19AM (#48715105)
      Thanks a lot, asshole, a lot of us were busy-waiting while you were typing.
    • by Z00L00K ( 682162 )

      In any case - a multi-core machine can also handle multiple different tasks simultaneously; it's not always necessary to break a single task down into sub-problems.

      The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

  • Pullin' a Gates? (Score:4, Interesting)

    by Tablizer ( 95088 ) on Friday January 02, 2015 @02:11AM (#48715091) Journal

    "4 cores should be enough for any workstation"

    Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.

    • by cb88 ( 1410145 )
      It already is wrong...

      Linux Workstation: 16cores = way faster builds than 4 cores.
      CAD workstation: I imagine a lot of geometry processing is parallelized... the less waiting the better (format conversion, generating demo videos, etc. eat up a lot of CPU)
      Video workstation: That's just a blatantly obvious use for multiple cores...
      Linux HTPC: I wanna transcode stuff fast... more cores
      Linux Gaming: These days using at least 4 cores is getting more common...

      Things that I've often seen that are *broken* for in
      • Re:Pullin' a Gates? (Score:5, Interesting)

        by bruce_the_loon ( 856617 ) on Friday January 02, 2015 @02:42AM (#48715155) Homepage

        If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that, except for a handful of edge use-cases, there will be no demand for massively parallel processing in end-user usage, and that we shouldn't waste time that could be better spent optimizing the low-core case.

        The CAD, video and HTPC use-cases are already solved by the GPU architecture and don't need to be re-solved by inefficient CPU algorithms.

        Your Linux workstation would be a good example, but is a very low user count requirement and can be done at the compiler level and not the core OS level anyway.

        Your Linux gaming machine shouldn't be using more than 3 or 4 CPU cores, handing the heavy grunt work off to the GPU anyway. No need for a 64-core CPU for that one.

        Redesigning what we're already doing successfully, with a low number of controller/data-shifting CPU cores managing a large bank of dedicated rendering/physics GPU cores and task-specific ASICs for things like 10Gb networking and 6Gb/s IO interfaces, is pretty pointless. That is what Linus is talking about, not that we only need 4 cores and nothing else.

        • by gnupun ( 752725 )

          If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that, except for a handful of edge use-cases, there will be no demand for massively parallel processing in end-user usage, and that we shouldn't waste time that could be better spent optimizing the low-core case.

          So, if someone wants to optimize a critical app-specific operation "foo()" in their app and make it go 4 times faster using 4 cores, they are crazy?
          Your argument implies that other than these so-called

          • by itzly ( 3699663 )
            No, outside of the edge cases, using 4 smaller cores instead of a single big one will not make foo() go faster.
            • by gnupun ( 752725 )

              What if the cores don't become much smaller as cores are added to your PC? Your general desktop/workstation can have up to 16 cores, each of which is more powerful than the previous generation's core. Should we still do single-threaded programming for any time-critical foo() and run roughly 10 times slower?

              There's plenty of code that would benefit from the speedup of multi-core programming, not just some niche code.

              • by itzly ( 3699663 )
                Obviously, adding more big cores is better, but that's not what Linus was talking about.
              • by itzly ( 3699663 )
                For a fair comparison, you shouldn't compare a single core with 16 cores, each the size of the single core. Instead, you should keep the number of transistors fixed, and decide whether you want to divide them into 4 big cores, or 16 smaller ones (with smaller caches). And since we're talking about general purpose PCs, you should consider a typical mix of user applications.
      • by Urkki ( 668283 )

        It already is wrong...

        Linux Workstation: 16cores = way faster builds than 4 cores.

        Did the 4 core CPU have 1/4th of the transistor count of the 16 core CPU? Then I'd expect it to be much slower, of course. Linus's point was that a 4 core CPU with the same transistor count as the 16 core CPU (the extra transistors spent on more cache, better out-of-order execution logic, more virtual registers, and so on) will be faster on almost every task. So cores beyond 4 (the ballpark count Linus threw out) make sense only if you really cannot spend any more transistors making those 4 cores faster, but still have die s

    • by davmoo ( 63521 )

      Except that Bill Gates never actually said the so-called "quote" that is attributed to him.

  • by MichaelSmith ( 789609 ) on Friday January 02, 2015 @02:14AM (#48715095) Homepage Journal

    ...a tool which he may have heard of. It does connectionless, distributed data management, totally without locks.

    • by phantomfive ( 622387 ) on Friday January 02, 2015 @02:45AM (#48715165) Journal
      In his post, Linus was talking about single, desktop computers, not distributed servers. He specifically said that he could imagine a 1000 core computer might be useful in the server room, but not for a typical user. So if you're going to criticize him, at least criticize what he said.

      Also, git is not totally without locks. Try seeing if you can commit at the same time as someone else. It can't be done; the commits are atomic.
      • My point is that git knows how to merge. It knows when a merge is required, when it is not, and when it can be done automatically. If you design your data structures properly, the same behaviour can be used in massively parallel systems.

  • The article makes the point (which is not correct*) that to have high scalability we need lockless designs, because locking has too much overhead. If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I. And neither can they: they've decided to give up on reliability. They've decided we need to give up on the idea that the computer always gives the correct answer, and accept that it instead gives the correct answer most of the time (correct meaning, of course, doing
    • by imgod2u ( 812837 ) on Friday January 02, 2015 @03:25AM (#48715247) Homepage

      The idea isn't that the computer ends up with an incorrect result. The idea is that the computer is designed to be fast at doing things in parallel with the occasional hiccup that will flag an error and re-run in the traditional slow method. How much of a window you can have for "screwing up" will determine how much performance you gain.

      This is essentially the idea behind transactional memory: optimize for the common case where threads that would use a lock don't actually access the same byte (or page, or cacheline) of memory. Elide the lock (pretend it isn't there), have the two threads run in parallel and if they do happen to collide, roll back and re-run in the slow way.

      We see this concept play out in many parts of hardware and software algorithms, actually. Hell, TCP/IP is built on having packets flow freely and possibly collide or drop, with the idea that you can resend them. It ends up speeding up the common case: that packets make it to their destination along one path.
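
      As a rough illustration of that optimistic pattern (my own sketch, not code from the article or any real transactional-memory implementation), here it is expressed with a plain C11 compare-and-swap retry loop instead of hardware lock elision: run without the lock, detect the collision, roll back and re-run.

      #include <stdatomic.h>

      static atomic_long total;            /* shared state the threads contend on */

      void add_sample(long sample)
      {
          long old = atomic_load(&total);
          long updated;
          do {
              updated = old + sample;      /* the speculative "transaction" */
              /* A failed CAS means another thread collided with us: 'old' is
               * refreshed with the current value and we simply re-run.  A real
               * lock-elision scheme would fall back to actually taking the lock
               * after repeated aborts. */
          } while (!atomic_compare_exchange_weak(&total, &old, updated));
      }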

      • by Rei ( 128717 )

        I'm wondering about what he is thinking for real-world details. For example, a common use case is one thread does searches through a data structure to find an element (as, say, a pointer or an iterator), but before it can dereference it and try to access the memory, some other thread comes along and removes it from the list and frees it. Then your program tries to dereference a pointer or iterator that's no longer valid and it crashes.

        The problem isn't that it's no longer in the list. Clearly the other thre

    • by Rei ( 128717 )

      There are cases where getting exactly the right answer doesn't matter - real-time graphics is a good example. It's amazing the level of error you can have on an object if it's flying quickly past your field of view and lots of things are moving around. In "The Empire Strikes Back" they used a bloody potato and a shoe as asteroids and even Lucas didn't notice.

      That said, it's not the general case in computing that one can tolerate random errors. Nor is the concept of tolerating errors anything new. Programme

  • by Nutria ( 679911 ) on Friday January 02, 2015 @02:42AM (#48715159)

    Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
    Email programs?
    Chat?
    Web browsers get a big win from multi-processing, but not parallel algorithms.

    Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

    • Emacs can always use more cores [flame suit on]
    • Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
      Email programs?
      Chat?
      Web browsers get a big win from multi-processing, but not parallel algorithms.

      Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

      This is kind of silly. Rendering, indexing and searching get pretty easy boosts from parallelization. That applies to all three cases you've listed above. Web browsers especially love tiled parallel rendering (very rarely these days does your web browser output get rendered into one giant buffer), and that can apply to spreadsheets too.

      A better question is how much parallelization we need for the average user. While the software algorithms should nicely scale to any reasonable processor/thread count, on the

      • by Nutria ( 679911 )

        indexing and searching get pretty easy boosts from parallelization.

        How much indexing and searching does Joe User do? And what percent is already done on a high-core-count server where parallel algorithms have already been implemented in the programs running on that kit?

        Web browsers especially love tiled parallel rendering

        Presuming that just a single tab on a single page is open, how CPU bound are web browsers running on modern 3GHz kit? Or are they really IO (disk and network) bound?

        • by Rei ( 128717 )

          The key question is: what are the common example cases (ordered by frequency times severity) where users' computers have lagged by a perceptible amount, in any way reducing the user experience or forcing the user to forgo features that would otherwise have been desirable? Then you need to look at the cause.

          In the overwhelming majority of cases, you'll find that "more parallelism with more cores" would be a solution. So why not just bloody do it?

          Not everybody suffers performan

    • I'll give you an example. I use iBooks to read eBooks. I downloaded two eBooks which are actually each a collection of fifty full-size books. On my MacBook, the one I'm currently reading displays that it's around page 8,000. The total is about ten thousand pages.

      If I change the font size, it recalculates the pages and page breaks for the whole book. One CPU running at 100% for a very long time. For a five hundred page book, no problem. For a ten thousand page book, big problem. I'd love it if the re-pagi
      • by Nutria ( 679911 )

        First thought: Why the hell aren't those two 10,000-page (in total) "books" split into their constituent 50 "actual" books?

        That's the kind of parallelization and work optimization that needs to take place before algorithm changes.

      • That's actually a good example of Linus' point. You can not easily parallelise pagination, because you need to know what fits on one page before you can paginate the next page. Sure, you can do some heuristics (every page contains exactly 1000 words), which is dangerous and at best lowers the quality of the result. You can also try to be clever and for example paginate chapters in parallel and then do the exact numbering afterwards in a separate pass, but before you know it you have turned a fairly simple
      • by swilver ( 617741 )

        Yes, and it re-renders all the pages as bitmaps at 400% zoom, scales them back down to get proper anti-aliased results, then compresses them with JPEG and stores them in main memory... ...or how about just recalculating the page that you need to display?

        Parallel processing is not gonna solve stupidity.

  • by Urkki ( 668283 ) on Friday January 02, 2015 @03:06AM (#48715213)

    Linus doesn't so much say that parallelism is useless; he's saying that more cache and bigger, more efficient cores are much better. Therefore, increasing the number of cores at the cost of single-core efficiency is just stupid for general-purpose computing. Better to just stick more cache on the die instead of adding a core. Or that is how I read what he says.

    I'd say, number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation be CPU bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?

    • by Z00L00K ( 682162 )

      Some I/O won't catch up that easily; you can't speed up a keyboard much, and even though we have SSDs we have a limit there too.

      But if you break up the I/O as well into sectors, so that I/O contention in one area doesn't impact the I/O in another, by using a NUMA architecture for I/O as well as RAM, then it's theoretically possible to redistribute some processing.

      It won't be a perfect solution, but it will be less sensitive.

  • by ShakaUVM ( 157947 ) on Friday January 02, 2015 @03:41AM (#48715281) Homepage Journal

    Ungar's idea (http://highscalability.com/blog/2012/3/6/ask-for-forgiveness-programming-or-how-well-program-1000-cor.html) is a good one, but it's also not new. My Master's is in CS/high performance computing, and I wrote about it back around the turn of the millennium. It's often much better to have asymptotically or probabilistically correct code rather than perfectly correct code when perfectly correct code requires barriers or other synchronizing mechanisms, which are the bane of all things parallel.

    In a lot of solvers that iterate over a massive array, only small changes are made at any one time. So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway. The next few iterations will smooth out those errors, and you'll be able to get far more work done, in a far more scalable fashion, than if you maintain rigor where it is not exactly needed.
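
    As a rough sketch of what that looks like in code (my own illustration, with a made-up grid size and OpenMP layout, not anything from Ungar's post): each thread sweeps its own rows of a temperature field and skips the end-of-sweep barrier, so it may occasionally read a neighboring row that is one sweep stale, and later sweeps smooth the small error out.

    #include <omp.h>

    #define N 1024
    static double t[N][N];                 /* temperature field, filled in elsewhere */

    void relax(int sweeps)
    {
        #pragma omp parallel
        for (int s = 0; s < sweeps; s++) {
            /* 'nowait' drops the barrier at the end of each sweep: a thread can
             * race ahead using possibly stale neighbor values, trading strict
             * ordering for scalability exactly as described above. */
            #pragma omp for schedule(static) nowait
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    t[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] +
                                      t[i][j-1] + t[i][j+1]);
        }
    }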

  • I'll see your cores and raise you your boss strangling all your cores by forcing you to get all the data you were planning to process from NFS shares on 100 megabit LAN connections. Because your developers and IT department, with all the competence of a 14-year-old who just got his hands on a copy of Ruby on Rails, can't figure out how to utilize disk that every fucking machine in the company doesn't have read access to.
  • How does Linus compile his kernel? Certainly I use a parallel make across as many cores as possible (well, up to the point where there's a core for every compilation unit).
    • by gweihir ( 88907 )

      Wrong question. C compilation has linear speedup, as each file can be compiled without knowing the others. The question is how he links his kernel, and the answer is: on a single core, as there is no other sane way to do it. Fortunately, this problem is almost linear in the input size (assuming good hash tables), or we would not have any software of the size of the kernel.
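
      As a quick worked example of the limit being described (the 10% figure is purely hypothetical, just to illustrate the Amdahl-style bound on a parallel compile with a serial link): if the link step takes 10% of a one-core build, then with c cores

      \[
        S(c) = \frac{1}{0.1 + \frac{0.9}{c}}, \qquad
        S(16) = \frac{1}{0.1 + 0.05625} \approx 6.4,
      \]

      % i.e. even with 16 cores the single-threaded link caps the build at roughly
      % a 6.4x speedup, and S(c) can never exceed 10x no matter how many cores you add.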

      • by itzly ( 3699663 )

        C compilation has linear speedup as each file can be compiled without knowing the others

        As long as I/O bandwidth is infinite.

        • by gweihir ( 88907 )

          Or you do it on separate machines. But yes. Ideally, it has linear speed-up, if I/O is not a bottleneck. In practice, things are not as nice, although with 4...8 cores and an SSD to feed them you do not notice much.

        • by Rei ( 128717 )

          No, as long as I/O bandwidth is not the limiting factor. The sort of thing you're compiling can have radically different CPU vs. I/O requirements. Some simple but verbose C code with little optimization might be almost entirely IO limited while some heavy templated C++ and full optimization might be almost entirely CPU limited.

          The thing is, there's no way to know what is going to cause a particular person to think "I wish my computer was performing faster". It all depends on the individual and what they use

    • by itzly ( 3699663 )
      But would you rather do a parallel make on 4 cores with big caches, or 64 cores, each with 1/16th of the cache ?
    • Comment removed based on user account deletion
  • The central claim of Linus seems to be that there are many people out there who claim an efficiency increase from parallelism. While I agree that many people claim (IMHO correctly) an increase in performance (reduction of execution time), within the constraints given by a specific technology level, by doing symmetric multiprocessing, I have not heard many people claim that efficiency (in terms of power, chip area, component count) is improved by symmetric, general parallelization; and nobody with a good un

    • The central claim of Linus seems to be that there are many people out there who claim an efficiency increase from parallelism.

      They do, and to an extent they are correct.

      On CPUs that have high single-thread performance, there is a lot of silicon devoted to that. There's the large, power-hungry, expensive out-of-order unit, with its large hidden register files and reorder buffers.

      There are the huge, expensive multipliers which need to complete in a single cycle at the top clock speed, and so on.

      If you dispense with

  • by amplesand ( 3864419 ) on Friday January 02, 2015 @04:34AM (#48715423)
    Shi's Law

    http://developers.slashdot.org... [slashdot.org]

    http://spartan.cis.temple.edu/... [temple.edu]

    http://slashdot.org/comments.p... [slashdot.org]

    "Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

    This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

    We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation."
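
    For reference, here are the two laws the abstract compares, written out in the usual notation (this formulation is mine, not quoted from the paper): N is the processor count, s the serial fraction measured on the serial run (Amdahl), and s' the serial fraction observed on the parallel run (Gustafson).

    % Amdahl's and Gustafson's speedups in the usual notation
    \[
      S_{\text{Amdahl}}(N) = \frac{1}{s + \frac{1-s}{N}},
      \qquad
      S_{\text{Gustafson}}(N) = s' + N\,(1 - s').
    \]
    % The two serial fractions describe different runs and are related by
    \[
      s = \frac{s'}{s' + N\,(1 - s')},
    \]
    % and substituting this into Amdahl's formula gives exactly s' + N(1 - s'),
    % which is the sense in which the two laws are mathematically equivalent.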



  • Poor slashdot... (Score:3, Insightful)

    by Anonymous Coward on Friday January 02, 2015 @05:32AM (#48715523)

    Few are actually people with a real engineering background anymore.

    What Linus means is:
    - Moore's law is ending (go read about mask costs and feature sizes)
    - If you can't geometrically scale transistor counts, you will be transistor count bound (Duh)
    - therefore you have to choose what to use the transistors for
    - anyone with a little experience with how machines actually perform (as one would have to admit Linus does) will know that keeping execution units running is hard.
    - since memory bandwidth has nowhere near scaled with CPU appetite for instructions and data, cache is already a bottleneck

    Therefore, do instruction and register scheduling well, have the biggest on-die cache you can, and enough CPUs to deal with common threaded workflows. And this, in his opinion, is about 4 CPUs in common cases. I think we may find that his opinion is informed by looking at real data of CPU usage on common workloads, seeing as performance benchmarks might be something he is interested in. In other words, it is based on some (perhaps ad hoc) statistics.

    • by gweihir ( 88907 )

      Good summary, and I completely agree with Linus. The limit may go a bit higher, up to, say, 8 cores, but not many more. And there is the little problem that for about 2 decades, chips have been interconnect-limited, which is a far harder limit to overcome than the transistor one, so the problem is actually worse.

      All that wishful thinking going on here is just ignorant of the technological facts. The time when your code could be arbitrarily stupid, because CPUs got faster in no time, is over. There may also be ot

  • Mostly writing code for MacOS X and iOS. All current devices have two or more cores. Writing multi-threaded code is made rather easy through GCD (Grand Central Dispatch), and anything receiving data from a server _must_ be multithreaded, because you never know how long it takes to get a response. So there is an awful lot of multi-threaded code around.

    But the fact that work is distributed to several cores is just secondary for that kind of work. It is also easy to make most work-intensive code use multipl
  • Linus is right (Score:4, Insightful)

    by gweihir ( 88907 ) on Friday January 02, 2015 @06:15AM (#48715637)

    Nothing significant will change this year or in the next 10 years in parallel computing. The subject is very hard, and that may very well be a fundamental limit, not one requiring some kind of special "magic" idea. The other problem is that most programmers have severe trouble handling even classical, fully-locked code in cases where the way to parallelize is rather clear. These "magic" new ways will turn out just like the hundreds of other "magic" ideas for finally getting parallel computing to take off: as duds that either do not work at all, or that almost nobody can write code for.

    Really, stop grasping for straws. There is nothing to be gained in that direction, except for a few special problems where the problem can be partitioned exceptionally well. CPUs have reached a limit in speed, and this is a limit that will be with us for a very long time, and possibly permanently. There is nothing wrong with that, technology has countless other hard limits, some of them centuries old. Life goes on.

    • Nothing significant will change this year or in the next 10 years in parallel computing.

      You might be right but I'm far less certain of it. The problem we have is that further shrinking of silicon makes it easier to add more cores than to make a single core faster so there is a strong push towards parallelism on the hardware side. At the same time the languages we have are not at all designed to cope with parallel programming.

      The result is that we are using our computing resources less and less efficiently. I'm a physicist on an LHC experiment at CERN and we are acutely aware of how ineffic

  • 1.) Linus' wording is pretty moderate.
    2.) He's right. Again.

  • Lots of moving parts (Score:5, Informative)

    by m.dillon ( 147925 ) on Friday January 02, 2015 @02:05PM (#48719219) Homepage

    There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.

    The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems, just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4-core) single-chip Intel CPU, having all 8 threads contend on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core Opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.

    Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.

    The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.

    Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.

    The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.

    Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
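
    As a toy userspace sketch of that per-CPU indexing idea (the names and sizes are my own, not DragonFly's actual route/ARP code): each CPU gets a small, cache-line-aligned lookup cache whose entries point into the one shared table, so the hot lookup path never touches a cache line another CPU is writing.

    #include <stdint.h>
    #include <stddef.h>

    #define NCPU        64
    #define CACHE_SLOTS 16

    struct route {                          /* the shared resource */
        uint32_t dst;
        uint32_t gateway;
    };

    /* Slow path into the shared table; its locking is assumed to live in there. */
    struct route *route_lookup_shared(uint32_t dst);

    struct percpu_cache {
        struct route *slot[CACHE_SLOTS];    /* per-CPU index into the shared data */
    } __attribute__((aligned(64)));         /* own cache line(s) per CPU, no false sharing */

    static struct percpu_cache rcache[NCPU];

    /* Lookup on the current CPU: a hit touches only this CPU's own cache lines. */
    struct route *route_lookup(unsigned cpu, uint32_t dst)
    {
        struct percpu_cache *c = &rcache[cpu];
        unsigned h = dst % CACHE_SLOTS;

        if (c->slot[h] != NULL && c->slot[h]->dst == dst)
            return c->slot[h];              /* contention-free fast path */

        struct route *r = route_lookup_shared(dst);
        c->slot[h] = r;                     /* remember it locally for next time */
        return r;
    }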

    Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. such as /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.

    So when one starts working with a lot of cores for general-purpose computing, pretty much the ENTIRE operating system core has to be reworked; what worked well with only 12 cores will fall on its face with more.

    -Matt
