Java Programming IT Technology

Scalable Nonblocking Data Structures 216

An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
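In rough Java terms, the style the summary describes might look like the following hypothetical sketch. This is an illustration of the idea only, not Dr. Click's actual code: the class name, the EMPTY state, and the methods are all assumptions. It uses one large `AtomicLongArray` so each word gets scalable parallel access, CAS as the atomic per-word update, and an EMPTY-to-value transition standing in for one edge of the per-word state machine.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of the coding style in the summary: one big array,
// CAS on individual words, and a tiny per-word state machine.
public class CasArraySketch {
    static final long EMPTY = 0L; // illustrative sentinel: slot not yet written

    private final AtomicLongArray slots;

    public CasArraySketch(int size) {
        slots = new AtomicLongArray(size); // all words start EMPTY (0)
    }

    // Try to claim slot i and publish a value in one atomic step.
    // The CAS is the atomic update; EMPTY -> value is one FSM transition.
    // Returns true if this thread won the race for the slot.
    public boolean tryPublish(int i, long value) {
        return slots.compareAndSet(i, EMPTY, value);
    }

    public long read(int i) {
        return slots.get(i); // volatile read; no lock needed
    }

    public static void main(String[] args) {
        CasArraySketch a = new CasArraySketch(8);
        System.out.println(a.tryPublish(3, 42L)); // true: slot was EMPTY
        System.out.println(a.tryPublish(3, 99L)); // false: slot already taken
        System.out.println(a.read(3));            // 42
    }
}
```

Because every slot is updated independently, threads working on different words never contend, which is what lets this style scale across hundreds of hardware threads.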
  • why (Score:5, Interesting)

    by damn_registrars ( 1103043 ) <damn.registrars@gmail.com> on Tuesday May 27, 2008 @03:15PM (#23561441) Homepage Journal
    Why is there less than one thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?
    • Re:why (Score:5, Informative)

      by Chris Burke ( 6130 ) on Tuesday May 27, 2008 @03:29PM (#23561655) Homepage
      Because one is a general statement ("supports 700+ threads"), and the other is a statement about a specific hardware setup ("in production with 768 cores").

      It was not meant to imply that the 768 processor system will use exactly 700 worker threads. It was meant to imply that the system breaks through the traditional scalability limits of 50-100 threads, thus the 700+.
      • Surely, 640 threads ought to be enough for anybody.

        But seriously, have you seen the price of this beast? HUGE price tag. Why not build your own cluster?

        There are technologies today (like, say, Terracotta Server [markturansky.com]) that allow for easy distribution of work across a large number of JVMs.

        I suppose the companies that need all those cores and threads in one machine can afford the Honkin' Big Iron. For the rest of us, clustering is getting cheaper and cheaper these days.

        • Or it's a development system with limited production.
          Chances are it is being used for R&D and not actually crunching numbers.

          When systems like that hit mass production, we'll already have the data structures for them.
          Can't do that with a cluster.
        • by Ed Avis ( 5917 )
          Have you seriously investigated the hardware costs of building your own cluster? With equivalent specs to this one?

          Hint: it's not about how many CPUs you have or how fast they are, it's how fast the interlinks are between processors.
          • it's not about how many CPUs you have or how fast they are, it's how fast the interlinks are between processors.
            That depends how well your project scales to a cluster, then. SETI or Folding, for example, won't really care how fast the interconnect is.

            Of course, the same programming techniques used to build this are absolutely not going to scale to a cluster. Probably vice versa, but I'm not convinced of that yet.
        • Re:why (Score:5, Informative)

          by maraist ( 68387 ) * <michael.maraistN ... gmail.n0spam.com> on Tuesday May 27, 2008 @06:55PM (#23564457) Homepage
          Message passing systems and MT systems solve different problems. Consider that Message Passing is a subclass of Multi-Processing; in general the amount of work is much larger than the data-set. But Multi-Threading often involves many micro-changes to a large message (the entire state of the process).

          Consider an in-memory database (MySQL Cluster (NDB), for example). You wouldn't want to pass the entire database around (or even portions of it) for each 'job'. Instead, you'd partition the data so that each partition holds a massive working set and performs inter-data operations locally. Then your message passing is limited to interactions that aren't held in the local memory space (i.e. NUMA).

          With Terracotta you are breaking a sequential application into a series of behind-the-scenes messages which go from clustered node to clustered node as necessary (I'm not very well versed on this product, but I've reviewed it a couple times).

          Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware. And thus, the more CPUs (leveraging the MESI cache-coherence protocol) the more efficient the overall architecture.

          Now, I can't imagine that a 768-CPU monster is that cost-effective - your problem space is probably pretty limited. But a simultaneous 700-thread application is NOT hard to write in Java at all. I regularly create systems that have between 1,000 and 2,000 semi-active threads. But I try to keep a CPU-intensive pool down to near the number of physical CPUs (4, 8 or 16 as the case may be). Java has tons of tools for execution pools of configurable size.
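For instance, capping CPU-bound work at the number of physical CPUs is a few lines with `java.util.concurrent` (a generic sketch, not the poster's code; the `6 * 7` task is just a placeholder workload):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a CPU-bound pool sized to the machine, while the application
// may hold thousands of mostly-idle threads elsewhere.
public class PoolSizing {
    public static int runOnCpuPool() throws Exception {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService cpuPool = Executors.newFixedThreadPool(cpus);

        Future<Integer> result = cpuPool.submit(() -> 6 * 7); // placeholder task
        int v = result.get(); // blocks until the task completes
        cpuPool.shutdown();
        return v;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runOnCpuPool()); // prints 42
    }
}
```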
          • Consider an in-memory database.

            OK. [apache.org]

            Instead, you'd like at most only partitions of the data where massive working-sets reside on each partition and do inter-data operations.

            Got it. [danga.com] Can't find a link, but I'm thinking specifically the hashing mechanism. Given a key, I can find which node should be caching that key.

            Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware.

            I'm not sure I've seen such a problem. For example, the CPU cache alone is an example of what happens when you break a problem down into smaller chunks.

            I can see where a single memory space might do better, though.

            a simultaneous 700 thread application is NOT hard to write in java at all.

            Once you know how, I suppose. Consider that most programmers who use threads find ways to deadlock on one or two cores.

            The reason I'm drawn to mes

        • Surely, 640 threads ought to be enough for anybody.
          I have a slightly sickening image of Bill Gates as mill overseer, using his whip to drive the children back under the looms in some sort of dark, satanic mill. 640 threads at 20 threads/inch would give you about 32-inch wide cloth, which is adequate for anyone, I'm sure.

          Will someone please take the expected jokes about "patching" or "darning" over the 640 thread limit?

    • Re: (Score:2, Informative)

      by Anonymous Coward
      His implementation is in Java, and the JVM adds some of its own threads (garbage collection threads, compiler threads, etc.), so some of the compute goes toward those.
    • Each Vega2 processor has 48 cores, so 768 cores in just 16 processors is pretty good, and you can be certain a number of those are reserved for system use on such a large-scale machine. These are already fairly lightweight hardware threads, and I can only presume that with more hardware threads per core you'd get some serious I/O starvation issues.

      How I'd love to have one of these boxes :)
  • Inspiration... (Score:5, Informative)

    by green-alien ( 1296909 ) on Tuesday May 27, 2008 @03:26PM (#23561591)
    The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [cam.ac.uk] [Practical Lock Freedom]
    • by Monkius ( 3888 )
      Indeed, there is a lot of published research (by Herlihy, Michael, Fraser, Sundell & Tsigas, and others) going back to 2001. It's a hot topic for everyone trying to scale on modern hardware, though this custom Java processor Click works on seems way out in left field to me.
  • Google Talk (Score:5, Informative)

    by jrivar59 ( 146428 ) on Tuesday May 27, 2008 @03:28PM (#23561621)
    Google Talk [google.com] by the author.
  • 768 Cores? (Score:3, Funny)

    by Shadow Wrought ( 586631 ) * <.moc.liamg. .ta. .thguorw.wodahs.> on Tuesday May 27, 2008 @03:32PM (#23561697) Homepage Journal
    Call me a CPU luddite, but does this mean that I can lose 768 games of Solitaire simultaneously?

  • Google Tech Talk (Score:4, Informative)

    by Bou ( 630753 ) on Tuesday May 27, 2008 @03:34PM (#23561733)
    Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 [google.com] .
  • Congrats, Slashdot, you've managed to produce a story that is guaranteed to totally baffle the non-techie sector.

    KeyWords:

    concurrent data structures, hardware threads, java, large array, scalable parallel access, atomic update, words, finite-state machine, lock-free, data structures ...
  • false sharing? (Score:2, Interesting)

    by shrimppoboy ( 853235 )
    The brief description in the article sounds suspicious and incompetent.
    1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.

    2. Updating a shared variable, especially a non-scalar, in an inner loop is naive. One should ref
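The padding this comment calls for can be sketched as follows. This is a hypothetical illustration (the class, the counters example, and the 64-byte STRIDE assumption are mine, not from the article): instead of giving thread t element t, give it element t * STRIDE so neighboring counters land on different cache lines.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of cache-line padding to avoid false sharing: per-thread counters
// are spaced a full cache line apart instead of being adjacent.
public class PaddedCounters {
    // 8 longs * 8 bytes = 64 bytes, a typical cache-line size (an assumption;
    // real code would query or annotate this, e.g. with @Contended on JDK 8+).
    static final int STRIDE = 8;

    private final AtomicLongArray counters;

    public PaddedCounters(int threads) {
        counters = new AtomicLongArray(threads * STRIDE);
    }

    // Each thread touches only its own cache line, so increments never
    // ping-pong a line between cores.
    public void increment(int thread) {
        counters.incrementAndGet(thread * STRIDE);
    }

    public long get(int thread) {
        return counters.get(thread * STRIDE);
    }
}
```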

    • by Chirs ( 87576 )
      If the data set is much larger than the number of cpus, then it may be possible to arrange things such that the likelihood of two cpus hitting the same cacheline is pretty small.

      As for Java, in the article Dr. Click says it has a well-understood and well-implemented memory model.
    • Re: (Score:3, Informative)

      by julesh ( 229690 )
      The brief description in the article sounds suspicious and incompetent.
      1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.


      The keyword in your statement is "typically". Click is working on the Azul processor, which is desig
  • by ThreeGigs ( 239452 ) on Tuesday May 27, 2008 @03:51PM (#23561983)

    and a finite-state machine built from the atomic update and logically replicated per array word.


    Now *that* is what I call geek speak.
    • and a finite-state machine built from the atomic update and logically replicated per array word.

      Now *that* is what I call geek speak.
      Atomic? Damn geeks! ZOMG, we're all going to die from radiation!!!11!eleven!
  • It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.

    This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more impo

    • No it is Lock Free (Score:2, Interesting)

      by tbcpp ( 797625 )
      I used to think this too until I saw the video by the article's author. By lock free we mean that if the thread that has the "lock" were to die, it would not stall out the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exists
    • by Kupek ( 75469 )
      It is lock-free, but it is not wait-free. He explains the difference in his slides, and there's plenty of literature around for those who have access to Google.
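The distinction shows up in the classic CAS retry loop (a generic sketch, not taken from Click's slides): the loop spins, but a failed CAS means some other thread's CAS succeeded, so the system as a whole always makes progress. That is lock-freedom; wait-freedom would additionally bound each individual thread's retries.

```java
import java.util.concurrent.atomic.AtomicLong;

// A lock-free counter built from a CAS retry loop. Unlike a spin lock,
// a retry only happens because *another thread succeeded*, so system-wide
// progress is guaranteed even if any one thread stalls or dies.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        while (true) {
            long cur = value.get();
            long next = cur + 1;
            if (value.compareAndSet(cur, next)) {
                return next; // our CAS won; progress made
            }
            // CAS failed => someone else's increment landed; retry with
            // the fresh value. No thread ever holds a lock to leak.
        }
    }

    public long get() {
        return value.get();
    }
}
```

It is not wait-free because an unlucky thread could in principle retry indefinitely while others keep winning; lock-free only promises that *somebody* finishes each round.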
  • by Kingrames ( 858416 ) on Tuesday May 27, 2008 @04:05PM (#23562195)
    "# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."

    Clearly, the data structures have been touched by his noodly appendage.
  • one per (Score:3, Funny)

    by spoonist ( 32012 ) on Tuesday May 27, 2008 @04:10PM (#23562293) Journal

    Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java

    Hmmm... one core per Java thread?

    That sounds about right for Java apps...

    • by afidel ( 530433 )
      They are probably lightweight cores, much like those on the Sun coolthreads processors. Plus if you are paying for a system with 700+ cores you probably have an app that can keep 700+ threads busy =)
  • This is interesting indeed. When reading the summary, it made me think about BSPML, although the slides make it clear that there are a number of differences. Essentially
    • BSPML doesn't limit itself to FSM but has full expressive power, including exceptions -- some implementations of BSPML use monads to solve things that this work solves by scaling down to a FSM
    • BSPML doesn't support dynamic changes to the number of threads
    • many BSPML algorithms are provable
    • BSPML is typically compiled to fully native co
  • WTF? (Score:2, Informative)

    Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads". Thousands of threads are typical of java applications even running on Pentium IIIs. Every J2EE Application server spawns a new thread for every request. It's part of the specification. The real issue with Java is the hard thread limit in most JVMs where even calling -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads o
    • by julesh ( 229690 )
      Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads".

      Quoth the article: "On Azul's hardware it obtains linear scaling to 768 CPUs"

      That kind-of implies 768 hardware threads are in use.
  • From the video... (Score:3, Insightful)

    by Zarf ( 5735 ) on Tuesday May 27, 2008 @05:10PM (#23563127) Journal
    Someone posted the video [google.com] and that was great. In particular I really like the use of a finite state machine as a proof of correctness. That might be a novel approach in this day and age when everyone is in love with UML. It makes you wonder if many of these things aren't made too complex by adding too much cognitive overhead. To hear Dr. Cliff Click talk, it seems so trivial in retrospect. I suppose this is how you know his solution is elegant... I seriously doubt I'd have thought of it myself, but when you see something elegant that seems natural afterward, it's probably right.

    The other thing is that his algorithm shows a remarkable departure from traditional concurrent programming (as I learned it a decade ago): he's not getting bogged down with locking and synchronized... instead he has a very simple "think about it" approach that uses the state machine as a thinking aid. Whoever posted the video... thank you, that was very enlightening. Perhaps many of these concurrency problems just need some creativity after all?
  • This is how we've been teaching computer science to share memory since there was more than one thread. Anyone over 40 will immediately recognize this as just "how we do that". To all you younger viewers, welcome to multi-core/SMP circa 1980.

    And to all you industry people, if you'd stop firing everyone when they turn 30, you'd know this too!
