Scalable Nonblocking Data Structures

An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
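
To make the described coding style concrete, here is a minimal sketch in Java, assuming nothing about Click's actual code: each word of a large shared array carries a tiny finite-state machine that is advanced only by compare-and-swap on that word, so threads working on different words never contend. The class and state names are illustrative, not Azul's API.

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Illustrative sketch: every array word holds the state of its own small
    // state machine (EMPTY -> CLAIMED -> PUBLISHED), and every transition is a
    // CAS on that single word, so updates to different words run in parallel.
    public final class PerWordStateMachine {
        enum State { EMPTY, CLAIMED, PUBLISHED }

        private final AtomicReferenceArray<State> slots;

        public PerWordStateMachine(int size) {
            slots = new AtomicReferenceArray<>(size);
            for (int i = 0; i < size; i++) {
                slots.set(i, State.EMPTY);
            }
        }

        // Exactly one thread wins the EMPTY -> CLAIMED transition for slot i.
        public boolean claim(int i) {
            return slots.compareAndSet(i, State.EMPTY, State.CLAIMED);
        }

        // The slot becomes visible to readers once CLAIMED -> PUBLISHED succeeds.
        public boolean publish(int i) {
            return slots.compareAndSet(i, State.CLAIMED, State.PUBLISHED);
        }
    }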
Comments Filter:
  • Re:Sounds great! (Score:1, Informative)

    by Anonymous Coward on Tuesday May 27, 2008 @04:25PM (#23561567)
    It means that the atomic operations required for lock-free data structures are now available for Java from this vendor. Welcome to 1988!
  • Inspiration... (Score:5, Informative)

    by green-alien ( 1296909 ) on Tuesday May 27, 2008 @04:26PM (#23561591)
    The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [cam.ac.uk] [Practical Lock Freedom]
  • Google Talk (Score:5, Informative)

    by jrivar59 ( 146428 ) on Tuesday May 27, 2008 @04:28PM (#23561621)
    Google Talk [google.com] by the author.
  • Re:why (Score:5, Informative)

    by Chris Burke ( 6130 ) on Tuesday May 27, 2008 @04:29PM (#23561655) Homepage
    Because one is a general statement ("supports 700+ threads"), and the other is a statement about a specific hardware setup ("in production with 768 cores").

    It was not meant to imply that the 768 processor system will use exactly 700 worker threads. It was meant to imply that the system breaks through the traditional scalability limits of 50-100 threads, thus the 700+.
  • Google Tech Talk (Score:4, Informative)

    by Bou ( 630753 ) on Tuesday May 27, 2008 @04:34PM (#23561733)
    Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one-hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 [google.com].
  • Re:why (Score:2, Informative)

    by Anonymous Coward on Tuesday May 27, 2008 @04:36PM (#23561753)
    His implementation is in Java, and the JVM adds some threads of its own (garbage-collection threads, compiler threads, etc.), so some of the compute goes toward those threads.
  • Re:Java???? (Score:4, Informative)

    by Kupek ( 75469 ) on Tuesday May 27, 2008 @04:42PM (#23561843)
    Java has a well-defined memory model. C++ (and C) do not; behavior depends on the hardware it is run on.
  • by Animats ( 122034 ) on Tuesday May 27, 2008 @04:58PM (#23562103) Homepage

    It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.

    This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more important, especially in Java; stalls due to array resizing are a headache in real time systems with dynamic memory. It works about the way you'd expect; there are flags indicating the location (old or new) of each array element as the array is copied. Making all this work correctly is touchy.

    Concurrent algorithms like this are incredibly hard to write, and you need formal methods to get them right. The author of the paper still hasn't figured out how to code a FIFO under these rules.
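
    A minimal Java sketch of the loop structure being described here (a generic illustration, not code from the slides):

        import java.util.concurrent.atomic.AtomicLong;

        // "Lock-free" in practice: a retry loop around compareAndSet. The loop
        // only repeats when some other thread succeeded in the meantime, so the
        // system as a whole always makes progress.
        final class CasRetry {
            static long incrementAndGet(AtomicLong counter) {
                while (true) {
                    long old = counter.get();
                    if (counter.compareAndSet(old, old + 1)) {
                        return old + 1;   // our CAS won
                    }
                    // another thread changed the value first; read again and retry
                }
            }
        }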

  • Re:Sounds great! (Score:5, Informative)

    by mikael ( 484 ) on Tuesday May 27, 2008 @05:16PM (#23562393)
    The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).

    With such applications you will often have an array of variables that are accessible by all threads (e.g. the current processing modes of the application).

    To preserve the integrity of the system, you need to allow only one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with a large number of threads queuing up in a suspended state waiting to read a variable while one thread writes.

    The author uses the Load-Link/Store-Conditional [wikipedia.org] pair of instructions to guarantee that a new value is written atomically. Load-Link loads the value from memory. Store-Conditional writes the value back only if no other write has been performed on that location in the meantime; otherwise it fails.

    Compare-And-Swap [wikipedia.org] only replaces the variable with a new value if the current value of the variable matches a previously read old value.

    Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.
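
    As a rough illustration of the difference (hypothetical names, not from the article): instead of one read/write lock guarding the whole array, each word gets its own compare-and-swap, so threads touching different variables never queue behind each other.

        import java.util.concurrent.atomic.AtomicLongArray;

        // Each shared mode flag lives in its own array word. Writers CAS just
        // that word and readers just read it; no thread is ever suspended on a
        // global lock for the whole array.
        final class SharedModes {
            private final AtomicLongArray modes = new AtomicLongArray(64);

            long read(int i) {
                return modes.get(i);
            }

            // True if word i was changed from 'expected' to 'newMode'.
            boolean update(int i, long expected, long newMode) {
                return modes.compareAndSet(i, expected, newMode);
            }
        }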
  • Re:Uh, no. (Score:3, Informative)

    by quanticle ( 843097 ) on Tuesday May 27, 2008 @05:16PM (#23562397) Homepage

    He never said that Java had surpassed C in speed; he said that Java had surpassed C++. C++ library classes are not the same as C library classes, and many C++ libraries (especially the ones outside the STL and Boost) are woefully under-optimized. Java has many more optimized libraries "packaged in" with the language itself.

    Second, neither the Doom nor the Unreal engine is multi-threaded. Java has threading support built into the language. To get the same with C you'd have to use POSIX threads (killing Windows compatibility) or the Windows threading API (killing POSIX compatibility). With Java, you don't have to make that choice.

  • WTF? (Score:2, Informative)

    by neuromancer23 ( 1122449 ) on Tuesday May 27, 2008 @05:18PM (#23562429)
    Nowhere in the article is it mentioned that they are running "700 hardware threads". Thousands of threads are typical of Java applications even running on Pentium IIIs. Every J2EE application server spawns a new thread for every request; it's part of the specification. The real issue with Java is the hard thread limit in most JVMs, where even setting -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads on a single-core host, and we all have multi-core machines these days. The JVM ought to scale to millions of threads; otherwise, if a JVM can only support 20k concurrent threads, there really isn't any reason to ever pay more than a few hundred dollars for a web server, since you quickly reach the maximum thread count while 99% of your Opteron system is sitting there doing nothing.

    Furthermore, the need to synchronize a collection is an indication of poor design (i.e. lack of encapsulation) that is better solved by writing better code that doesn't use disgusting global data structures. If you really wanted to use a shared collection, all you would have to do is create an implementation that makes each object in the data structure lockable individually, like a linked list where any index could be locked independent of the list itself, and where new capacity could be created and then added in bulk. But in the end, updating a collection is a simple constant-time operation (just update the reference to point to a new location, as sketched below), so you should never have performance problems even when locking an entire collection to do the update. If you are having performance problems with that, then chances are it is the way that you are interacting with the collection.

    Anyway, here's to hoping that people can learn basic programming concepts so that the world can move on to solving real problems.
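
    The "just update the reference" approach mentioned above looks roughly like this (a generic copy-on-write sketch, not anything from the article; java.util.concurrent ships the same idea as CopyOnWriteArrayList):

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;
        import java.util.concurrent.atomic.AtomicReference;

        // Readers always see an immutable snapshot; writers build a new list
        // and publish it with a single atomic reference swap.
        final class CowList<T> {
            private final AtomicReference<List<T>> ref =
                    new AtomicReference<List<T>>(Collections.<T>emptyList());

            List<T> snapshot() {
                return ref.get();
            }

            void add(T item) {
                while (true) {
                    List<T> old = ref.get();
                    List<T> next = new ArrayList<T>(old);
                    next.add(item);
                    if (ref.compareAndSet(old, next)) {
                        return;       // published; readers now see the new list
                    }
                }
            }
        }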

  • Re:"2.5"? WTF? (Score:5, Informative)

    by badboy_tw2002 ( 524611 ) on Tuesday May 27, 2008 @05:22PM (#23562477)
    He's built two working data structures and is working on a third (had to read the slides to figure that one out).
  • Re:Java???? (Score:2, Informative)

    by fishbowl ( 7759 ) on Tuesday May 27, 2008 @05:38PM (#23562691)

    In the problems that TFA addresses, I'd wager that most of the time is spent in dissemination barriers of some sort, since invariably the problems in parallel computing move into issues within the problem domain (which is ideally what we want, after all).

    As for a JIT being inefficient compared to a static optimizing compiler, it depends so much on the code in question and on the platform that you can't make blanket statements about it.

    Let's hear from some HPC researchers on this. Get some real numbers.
  • Re:false sharing? (Score:3, Informative)

    by julesh ( 229690 ) on Tuesday May 27, 2008 @05:43PM (#23562755)
    The brief description in the article sounds suspicious and incompetent.
    1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.


    The keyword in your statement is "typically". Click is working on the Azul processor, which is designed from the ground up for this kind of task. While I haven't been able to find details of its cache organisation, I'd say it's pretty safe money that it uses smaller than average cache lines. The hashtable structure described uses the contents of the array in pairs, i.e. 64 bits at a time. If each cache line were 64 bits wide, this would be efficient, no?

    2. Updating a shared variable, especially a non-scalar, in an inner loop is naive. One should reference local scalars in inner loops when parallelizing. The thread should update the shared variable just once. Don't reference non-scalars or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.

    What if the bottleneck of your application is referencing shared data that needs to be updated in real time? There are many applications for which this is the case, and this kind of work is useful for anyone who is working on such an application. Just throwing an example out there: the scheduler for a massively parallel operating system would typically need to have many cores referring to a single queue of waiting threads with a high chance for collisions between read and write operations. I'm sure there are plenty more, too.

    3. Java, really, Java?

    Yes. Java. Really. Java is probably the most important language at the moment for enterprise application development, which is the environment where this kind of issue most frequently occurs, so developing solutions to these problems that work in Java is particularly important. Azul, Click's employer, specialises in high performance computers optimized for running Java software.
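
    For what it's worth, the padding the quoted post asks for is easy to sketch in Java (the 64-byte line size here is an assumption about commodity hardware, not a claim about Azul's cache organisation):

        import java.util.concurrent.atomic.AtomicLongArray;

        // Assuming 64-byte cache lines and 8-byte longs, keep each hot counter
        // eight slots apart so two threads never write to the same cache line.
        final class PaddedCounters {
            private static final int STRIDE = 8;   // assumed longs per cache line
            private final AtomicLongArray slots;

            PaddedCounters(int counters) {
                slots = new AtomicLongArray(counters * STRIDE);
            }

            void increment(int counter) {
                slots.incrementAndGet(counter * STRIDE);
            }

            long get(int counter) {
                return slots.get(counter * STRIDE);
            }
        }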
  • Re:Sounds bogus? (Score:5, Informative)

    by Kupek ( 75469 ) on Tuesday May 27, 2008 @06:22PM (#23563247)
    Locking in software has implications that locking at the hardware level does not.

    If a thread takes a lock in software, any subsequent thread that needs that lock must block, waiting for the first thread to finish. If the lock-holding thread is preempted, the waiting threads wait needlessly. If it dies, the waiting threads are hosed.

    Lock-free techniques prevent this problem, at the expense of more complicated algorithms and data structures. The basic structure of most lock-free algorithms is read a value, do something to it, and then attempt to commit the changed value back to memory. The attempt fails if another thread has changed the value from underneath you, and you must try again. (This is detected through operations like compare-and-swap.) This allows greater concurrency and guarantees that the system as a whole will make progress, even if a thread is preempted or dies.

    Lock-free algorithms and data structures are a well-established area. What Click has done here is provide a Java implementation of some data structures that yield good performance on the manycore systems his company makes.
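
    That read-modify-commit skeleton is easiest to see in the textbook example of a lock-free structure, a Treiber stack (shown here as a generic illustration; it is not one of Click's data structures):

        import java.util.concurrent.atomic.AtomicReference;

        // Lock-free stack: read the head, prepare a new head, and try to CAS it
        // in. A failed CAS means another thread changed the head first; retry.
        final class LockFreeStack<T> {
            private static final class Node<T> {
                final T value;
                Node<T> next;
                Node(T value) { this.value = value; }
            }

            private final AtomicReference<Node<T>> head = new AtomicReference<Node<T>>();

            void push(T value) {
                Node<T> node = new Node<T>(value);
                while (true) {
                    Node<T> current = head.get();
                    node.next = current;
                    if (head.compareAndSet(current, node)) {
                        return;
                    }
                }
            }

            T pop() {
                while (true) {
                    Node<T> current = head.get();
                    if (current == null) {
                        return null;                      // stack is empty
                    }
                    if (head.compareAndSet(current, current.next)) {
                        return current.value;
                    }
                }
            }
        }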
  • Re:why (Score:5, Informative)

    by maraist ( 68387 ) * <{michael.maraist ... mail.n0spam.com}> on Tuesday May 27, 2008 @07:55PM (#23564457) Homepage
    Message passing systems and MT systems solve different problems. Consider that Message Passing is a subclass of Multi-Processing; in general the amount of work is much larger than the data-set. But Multi-Threading often involves many micro-changes to a large message (the entire state of the process).

    Consider an in-memory database (MySQL Cluster (NDB), for example). You wouldn't want to pass the entire database around (or even portions of it) for each 'job'. Instead, you'd want at most partitions of the data, where massive working sets reside on each partition and inter-data operations happen locally. Then your message passing is limited to only the interactions that aren't held in the local memory space (i.e. NUMA).

    With Terracotta you are breaking a sequential application into a series of behind-the-scenes messages which go from clustered node to clustered node as necessary (I'm not very well versed on this product, but I've reviewed it a couple times).

    Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware. And thus, the more CPUs (that leverage MESI (sp?) CPU cache) the more efficient the overall architecture.

    Now, I can't imagine that a 768-CPU monster is that cost-effective - your problem space is probably pretty limited. But a simultaneous 700-thread application is NOT hard to write in Java at all. I regularly create systems that have between 1,000 and 2,000 semi-active threads, but I try to keep a CPU-intensive pool down to near the number of physical CPUs (4, 8 or 16 as the case may be). Java has tons of tools to allow execution pools of configurable size.
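
    The "CPU-intensive pool near the number of physical CPUs" is a one-liner with the standard executors (a generic sketch, not the poster's code):

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        // Size the CPU-bound pool to the hardware, no matter how many other
        // lightweight threads the application spawns.
        final class CpuBoundPool {
            public static void main(String[] args) {
                int cpus = Runtime.getRuntime().availableProcessors();
                ExecutorService cpuPool = Executors.newFixedThreadPool(cpus);

                for (int task = 0; task < 100; task++) {
                    final int id = task;
                    cpuPool.submit(() -> System.out.println("running task " + id));
                }
                cpuPool.shutdown();
            }
        }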
  • Re:Sounds bogus? (Score:5, Informative)

    by Chris Burke ( 6130 ) on Tuesday May 27, 2008 @08:07PM (#23564597) Homepage
    These operations may need no OS system call and may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee that all CPUs see the same updated value, it has to do a write-through and cannot just update the values in the cache local to the CPU. And when you have a large number of CPU cores running, the memory bus becomes the bottleneck by itself.

    That's not strictly true.

    First, most lock operations do not require a full bus lock. All you have to do is ensure atomicity of the load and store, which effectively means you have to 1) acquire the cache line in the modified state (you're the only one who has it then), and 2) prevent system probes from invalidating the line before you can write to it by NACKing those probes until the LOCK is done. Practically this means the locked op has to be the oldest on that CPU before it can start, which ultimately delays its retirement, but not by as much as a full bus lock. It also has minimal effect on the memory system; the LOCK does not fundamentally add any additional traffic.

    Second, the way the value is propagated to other CPUs is the same as any other store. When the cache line is in the modified state, only one CPU can have a copy. All other CPUs that want it will send probes, and the CPU with the M copy will send its data to all the CPUs requesting it, either invalidating or changing to Shared its own copy depending on the types of requests, coherence protocol, etc. If nobody else wants it, and it is eventually evicted from the CPU cache, it will be written to memory. This is the same, LOCK or no.

    Third, an explicit mutex requires at least two separate memory requests, possibly three: One to acquire the lock, and the other to modify the protected data. This is going to result in two cache misses for the other CPUs, one for the mutex and one for the data, which are both going to be in the modified state and thus only present in the cache of the cpu that held the mutex. In some consistency models, a final memory barrier is required to let go of the mutex to ensure all writes done inside the lock are seen (x86 not being one of them).

    Fourth, with fine enough granularity, most mutexes are uncontested. This means the overhead of locking the mutex is really just that, overhead. Getting maximal granularity/concurrency with mutexes would mean having a separate mutex variable for every element of your data array. This is wasteful of memory and bandwidth. Building your assumptions of atomicity into the structure itself means you use the minimal amount of memory (and thus mem bw), and have the maximal amount of concurrency.

    So basically, while it isn't necessarily "radical" (practical improvements often aren't), it is definitely more than bogus marketing. There's a lot more to it than that.
