Scalable Nonblocking Data Structures

Posted by kdawson on Tuesday May 27, 2008 @04:10PM from the don't-fear-the-multi-core dept.

An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
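
The three ingredients named in the summary (a large array, atomic updates on array words, and a small state machine per word) can be sketched roughly as below. This is an illustration only, not Click's implementation: the class `SlotArray`, the `TOMBSTONE` sentinel, and the three-state lifecycle are assumptions made for the example, and the real structures also handle resizing.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative only: a fixed-size slot array where each slot moves through
// EMPTY (null) -> value -> TOMBSTONE, and every transition is a single CAS.
final class SlotArray {
    // Sentinel marking a deleted slot; hypothetical name.
    private static final Object TOMBSTONE = new Object();
    private final AtomicReferenceArray<Object> slots;

    SlotArray(int size) {
        slots = new AtomicReferenceArray<>(size);
    }

    // EMPTY -> value. Fails if another thread already claimed the slot.
    boolean claim(int i, Object value) {
        return slots.compareAndSet(i, null, value);
    }

    // value -> TOMBSTONE. Fails if the slot no longer holds `expected`
    // (compared by reference, as compareAndSet does).
    boolean remove(int i, Object expected) {
        return slots.compareAndSet(i, expected, TOMBSTONE);
    }

    // Returns the live value, or null for EMPTY and TOMBSTONE slots.
    Object get(int i) {
        Object v = slots.get(i);
        return v == TOMBSTONE ? null : v;
    }

    public static void main(String[] args) {
        SlotArray a = new SlotArray(8);
        String x = "x";
        System.out.println(a.claim(3, x));    // first writer claims the slot
        System.out.println(a.claim(3, "y"));  // slot is already taken
        System.out.println(a.remove(3, x));   // live value -> tombstone
        System.out.println(a.get(3));         // tombstones read as absent
    }
}
```

Because every legal transition is one compare-and-set on one array word, threads working on different slots never wait on each other, and a thread that stalls mid-operation cannot leave a slot in a half-updated state.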

why (Score:5, Interesting)

by damn_registrars ( 1103043 ) writes: <damn.registrars@gmail.com> on Tuesday May 27, 2008 @04:15PM (#23561441) Homepage Journal

why are there fewer than 1 thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?

- Re:why (Score:5, Informative)
  
  by Chris Burke ( 6130 ) writes: on Tuesday May 27, 2008 @04:29PM (#23561655) Homepage
  
  Because one is a general statement ("supports 700+ threads"), and the other is a statement about a specific hardware setup ("in production with 768 cores").
  
  It was not meant to imply that the 768 processor system will use exactly 700 worker threads. It was meant to imply that the system breaks through the traditional scalability limits of 50-100 threads, thus the 700+.
  
  - Re: (Score:2)
    
    by MarkEst1973 ( 769601 ) writes:
    
    Surely, 640 threads ought to be enough for anybody.
    
    But seriously, have you seen the price of this beast? HUGE price tag. Why not build your own cluster?
    There are technologies today (like, say, Terracotta Server [markturansky.com]) that allow for easy distribution of work across a large number of JVMs.
    I suppose the companies that need all those cores and threads in one machine can afford the Honkin' Big Iron. For the rest of us, clustering is getting cheaper and cheaper these days.
    - Re: (Score:2)
      
      by cheater512 ( 783349 ) writes:
      
      Or it's a development system with limited production.
      Chances are it is being used for R&D and not actually crunching numbers.
      
      When systems like that hit mass production then we have the data structures for them.
      Can't do that with a cluster.
    - Re: (Score:2)
      
      by Ed Avis ( 5917 ) writes:
      
      Have you seriously investigated the hardware costs of building your own cluster? With equivalent specs to this one?
      
      Hint: it's not about how many CPUs you have or how fast they are, it's how fast the interlinks are between processors.
      - Re: (Score:2)
        
        by SanityInAnarchy ( 655584 ) writes:
        
        it's not about how many CPUs you have or how fast they are, it's how fast the interlinks are between processors.
        That depends how well your project scales to a cluster, then. SETI or Folding, for example, won't really care how fast the interconnect is.
        
        Of course, the same programming techniques used to build this are absolutely not going to scale to a cluster. Probably vice versa, but I'm not convinced of that yet.
    - Re:why (Score:5, Informative)
      
      by maraist ( 68387 ) * writes: <michael DOT mara ... T n0spam DOT com> on Tuesday May 27, 2008 @07:55PM (#23564457) Homepage
      
      Message passing systems and MT systems solve different problems. Consider that Message Passing is a subclass of Multi-Processing; in general the amount of work is much larger than the data-set. But Multi-Threading often involves many micro-changes to a large message (the entire state of the process).
      
      Consider an in-memory database. (Mysql-cluster (NDB), for example). You wouldn't want to pass the entire database around (or even portions of it around) for each 'job'. Instead, you'd like at most only partitions of the data where massive working-sets reside on each partition and do inter-data operations. Then your message passing is limited to only interactions that aren't held in the local memory space (i.e. NUMA).
      
      With Terracotta you are breaking a sequential application into a series of behind-the-scenes messages which go from clustered node to clustered node as necessary (I'm not very well versed on this product, but I've reviewed it a couple times).
      
      Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware. And thus, the more CPUs (that leverage MESI (sp?) CPU cache) the more efficient the overall architecture.
      
      Now, I can't imagine that a 768CPU monster is that cost effective - your problem space is probably pretty limited. But a simultaneous 700 thread application is NOT hard to write in Java at all. I regularly create systems that have between 1,000 and 2,000 semi-active threads. But I try to keep a CPU-intensive pool down to near the number of physical CPUs (4, 8 or 16 as the case may be). Java has tons of tools to allow execution-pools of configurable size.
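
      A minimal sketch of the "pool sized to the physical CPUs" idea, using the standard `java.util.concurrent` executors. The class name `CpuBoundPool` and the sum-of-squares workload are made up for illustration; the sizing rule (one worker per available processor) is the assumption being shown.

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.Callable;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      final class CpuBoundPool {
          // Runs `tasks` small CPU-bound jobs on a pool with one worker per
          // available processor and returns the sum of their results.
          static long sumOfSquares(int tasks) {
              int cores = Runtime.getRuntime().availableProcessors();
              ExecutorService pool = Executors.newFixedThreadPool(cores);
              try {
                  List<Future<Long>> results = new ArrayList<>();
                  for (int i = 1; i <= tasks; i++) {
                      final long n = i;
                      Callable<Long> job = () -> n * n;
                      results.add(pool.submit(job));
                  }
                  long total = 0;
                  for (Future<Long> f : results) {
                      total += f.get();  // blocks until that job has finished
                  }
                  return total;
              } catch (InterruptedException | ExecutionException e) {
                  throw new IllegalStateException(e);
              } finally {
                  pool.shutdown();
              }
          }

          public static void main(String[] args) {
              System.out.println(sumOfSquares(1000));
          }
      }
      ```

      The thousand jobs are queued, but only as many run at once as there are processors, which keeps the CPU-intensive part from oversubscribing the machine.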
      
      - Re: (Score:2)
        
        by SanityInAnarchy ( 655584 ) writes:
        
        Consider an in-memory database.
        OK. [apache.org]
        Instead, you'd like at most only partitions of the data where massive working-sets reside on each partition and do inter-data operations.
        Got it. [danga.com] Can't find a link, but I'm thinking specifically the hashing mechanism. Given a key, I can find which node should be caching that key.
        Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware.
        I'm not sure I've seen such a problem. For example, the CPU cache alone is an example of what happens when you break a problem down into smaller chunks.
        
        I can see where a single memory space might do better, though.
        a simultaneous 700 thread application is NOT hard to write in java at all.
        Once you know how, I suppose. Consider that most programmers who use threads find ways to deadlock on one or two cores.
        
        The reason I'm drawn to mes
    - Re: (Score:2)
      
      by RockDoctor ( 15477 ) writes:
      
      Surely, 640 threads ought to be enough for anybody.
      I have a slightly sickening image of Bill Gates as mill overseer, using his whip to drive the children back under the looms in some sort of dark, satanic mill. 640 threads at 20 threads/inch would give you about 32inch wide cloth, which is adequate for anyone I'm sure.
      Will someone please take the expected jokes about "patching" or "darning" over the 640 thread limit?
- Re: (Score:2, Informative)
  
  by Anonymous Coward writes:
  
  His implementation is in Java, and the JVM adds some of its own threads, such as garbage collection threads and compiler threads, so some of the compute goes towards those threads.
- Re: (Score:2)
  
  by drspliff ( 652992 ) writes:
  
  Each Vega2 processor has 48 cores, 768 cores in just 16 processors is pretty good and you can be certain a number of those are reserved for system use on such a large-scale machine; these are already fairly lightweight hardware threads and I can only presume more hardware threads per-core and you'd get some serious I/O starvation issues.
  
  How I'd love to have one of these boxes :)
- - Re: (Score:2, Funny)
    
    by tsalaroth ( 798327 ) writes:
    
    I, for one, welcome our fake Anonymous Coward Thread-count Overlords!
Inspiration... (Score:5, Informative)

by green-alien ( 1296909 ) writes: on Tuesday May 27, 2008 @04:26PM (#23561591)

The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [cam.ac.uk] [Practical Lock Freedom]

- Re: (Score:2)
  
  by Monkius ( 3888 ) writes:
  
  Indeed there is a lot of published research--by Herlihy, Michael, Fraser, Sundell & Tsigas, and others, going back to 2001. It's a hot topic for everyone trying to scale on modern hardware. Certainly this custom Java processor thing Click works on seems way out in left field, for me.
Google Talk (Score:5, Informative)

by jrivar59 ( 146428 ) writes: on Tuesday May 27, 2008 @04:28PM (#23561621)

Google Talk [google.com] by the author.

- Re: (Score:3, Funny)
  
  by kestasjk ( 933987 ) writes:
  
  I wish I could call myself "Dr. Click" :-(
768 Cores? (Score:3, Funny)

by Shadow Wrought ( 586631 ) * writes: <{moc.liamg} {ta} {thguorw.wodahs}> on Tuesday May 27, 2008 @04:32PM (#23561697) Homepage Journal

Call me a CPU luddite, but does this mean that I can lose 768 games of Solitaire simultaneously?

- Re:768 Cores? (Score:4, Funny)
  
  by jlechem ( 613317 ) writes: on Tuesday May 27, 2008 @04:34PM (#23561731) Homepage Journal
  
  Or better yet actually run Windows Vista. Zing!
  
Google Tech Talk (Score:4, Informative)

by Bou ( 630753 ) writes: on Tuesday May 27, 2008 @04:34PM (#23561733)

Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 [google.com] .

scalable noNBLocking data sTRructures .. :) (Score:2, Insightful)

by rs232 ( 849320 ) writes:

Congrats Slashdot, you've managed to produce a story that is guaranteed to totally baffle the non-techie sector.

KeyWords:

concurrent data structures, hardware threads, java, large array, scalable parallel access, atomic update, words, finite-state machine, lock-free, data structures ...
- Re:scalable noNBLocking data sTRructures .. :) (Score:5, Interesting)
  
  by Seferino ( 837142 ) writes: on Tuesday May 27, 2008 @05:13PM (#23562343) Homepage
  
  Good for us. Get the rabble away from Slashdot. Only true nerds should understand the contents. Let me add a few keywords to get rid of the softies: monads, higher-order type systems, return type, genericity.
  Your turn.
  
  - Theory Pong (Score:3, Funny)
    
    by nuzak ( 959558 ) writes:
    
    > Your turn.
    
    Catamorphisms. Linear Logic.
    
    Back to you :)
false sharing? (Score:2, Interesting)

by shrimppoboy ( 853235 ) writes:

The brief description in the article sounds suspicious and incompetent.
1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.
2. Updating a shared variable, especially a non-scalar, in an inner loop is naive. One should ref
- Re: (Score:2)
  
  by Chirs ( 87576 ) writes:
  
  If the data set is much larger than the number of cpus, then it may be possible to arrange things such that the likelihood of two cpus hitting the same cacheline is pretty small.
  
  As for Java, in the article Dr. Click says it has a well-understood and well-implemented memory model.
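
  The false-sharing concern raised in the parent comment, and the padding it asks for, can be sketched as follows. This is a rough illustration, not anything from the article: `PaddedCounters` is a made-up class, and the stride of 8 longs assumes 64-byte cache lines, which is a property of the hardware and not something the JVM guarantees.

  ```java
  import java.util.concurrent.atomic.AtomicLongArray;

  // Per-thread counters spaced STRIDE array words apart, so that (assuming
  // 64-byte cache lines) no two counters can land on the same line.
  final class PaddedCounters {
      private static final int STRIDE = 8;  // 8 longs = 64 bytes (assumed line size)
      private final AtomicLongArray cells;

      PaddedCounters(int counters) {
          cells = new AtomicLongArray(counters * STRIDE);
      }

      void increment(int counter) {
          cells.incrementAndGet(counter * STRIDE);
      }

      long get(int counter) {
          return cells.get(counter * STRIDE);
      }

      public static void main(String[] args) throws InterruptedException {
          int threads = 4;
          PaddedCounters c = new PaddedCounters(threads);
          Thread[] workers = new Thread[threads];
          for (int t = 0; t < threads; t++) {
              final int id = t;
              workers[t] = new Thread(() -> {
                  for (int i = 0; i < 100_000; i++) c.increment(id);
              });
              workers[t].start();
          }
          for (Thread w : workers) w.join();
          for (int t = 0; t < threads; t++) System.out.println(c.get(t));
      }
  }
  ```

  The cost is memory: seven of every eight words are unused. With a data set much larger than the CPU count, spreading accesses out may make that padding unnecessary, which is the trade-off being argued here.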
- Re: (Score:3, Informative)
  
  by julesh ( 229690 ) writes:
  
  The brief description in the article sounds suspicious and incompetent.
  1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.
  
  The keyword in your statement is "typically". Click is working on the Azul processor, which is desig
Geek serendipity in a summary (Score:4, Funny)

by ThreeGigs ( 239452 ) writes: on Tuesday May 27, 2008 @04:51PM (#23561983)

and a finite-state machine built from the atomic update and logically replicated per array word.

Now *that* is what I call geek speak.

- Re: (Score:2)
  
  by rrohbeck ( 944847 ) writes:
  
  and a finite-state machine built from the atomic update and logically replicated per array word.
  
  Now *that* is what I call geek speak.
  Atomic? Damn geeks! ZOMG, we're all going to die from radiation!!!11!eleven!
Well, sort of lock-free. (Score:2, Informative)

by Animats ( 122034 ) writes:

It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.
This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more impo
- No it is Lock Free (Score:2, Interesting)
  
  by tbcpp ( 797625 ) writes:
  
  I used to think this too until I saw the video by the article's author. By lock free we mean that if the thread that has the "lock" were to die, it would not stall out the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exists
- Re: (Score:2)
  
  by Kupek ( 75469 ) writes:
  
  It is lock-free, but it is not wait-free. He explains the difference in his slides, and there's plenty of literature around for those who have access to Google.
From the article: (Score:4, Funny)

by Kingrames ( 858416 ) writes: on Tuesday May 27, 2008 @05:05PM (#23562195)

"# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."

Clearly, the data structures have been touched by his noodly appendage.

one per (Score:3, Funny)

by spoonist ( 32012 ) writes: on Tuesday May 27, 2008 @05:10PM (#23562293) Journal

Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java

Hmmm... one core per Java thread?

That sounds about right for Java apps...

- Re: (Score:2)
  
  by afidel ( 530433 ) writes:
  
  They are probably lightweight cores, much like those on the Sun coolthreads processors. Plus if you are paying for a system with 700+ cores you probably have an app that can keep 700+ threads busy =)
Bulk-Synchronous Parallel model, anyone ? (Score:2, Insightful)

by Seferino ( 837142 ) writes:
This is interesting indeed. When reading the summary, it made me think about BSPML, although the slides make it clear that there are a number of differences. Essentially
- BSPML doesn't limit itself to FSM but has full expressive power, including exceptions -- some implementations of BSPML use monads to solve things that this work solves by scaling down to a FSM
- BSPML doesn't support dynamic changes to the number of threads
- many BSPML algorithms are provable
- BSPML is typically compiled to fully native co
- Re: (Score:2)
  
  by mritunjai ( 518932 ) writes:
  
  So where's the code ?
  
  (Yeah, I know about it, played with it... lots of noise, not enough code!)
WTF? (Score:2, Informative)

by neuromancer23 ( 1122449 ) writes:

Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads". Thousands of threads are typical of java applications even running on Pentium IIIs. Every J2EE Application server spawns a new thread for every request. It's part of the specification. The real issue with Java is the hard thread limit in most JVMs where even calling -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads o
- Re: (Score:2)
  
  by julesh ( 229690 ) writes:
  
  Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads".
  
  Quoth the article: "On Azul's hardware it obtains linear scaling to 768 CPUs"
  
  That kind-of implies 768 hardware threads are in use.
  - - Re: (Score:2)
      
      by TheLink ( 130905 ) writes:
      
      Linear scaling does not necessarily mean 700 threads though.
      
      The factor does not have to be 1.
      
      If it's 4, then it would be 4 threads for one core, and 2800 threads for 700 cores. And that would still be linear scaling.
From the video... (Score:3, Insightful)

by Zarf ( 5735 ) writes: on Tuesday May 27, 2008 @06:10PM (#23563127) Journal

Someone posted the video [google.com] and that was great. In particular I really like the use of a finite state machine as a proof of correctness. That might be a novel approach in this day and age when everyone is in love with UML. It makes you wonder if many of these things aren't made too complex by adding too much cognitive over-head. To hear Dr. Cliff Click talk it seems so trivial in retrospect. I suppose this is how you know his solution is elegant... I seriously doubt I'd have thought of it myself but when you see something elegant that seems natural afterward it's probably right.

The other thing is that his algorithm shows a remarkable departure from traditional concurrent programming (as I learned it a decade ago) and he's not getting bogged down with locking and synchronize... instead he has a very simple "think about it" approach that uses the state machine as a thinking aid. Whoever posted the video... thank you, that was very enlightening. Perhaps many of these concurrency problems just need some creativity after all?

Welcome back (Score:2)

by Duncan3 ( 10537 ) writes:

This is how we've been teaching computer science to share memory since there was more than one thread. Anyone over 40 will immediately recognize this as just "how we do that". To all you younger viewers, welcome to multi-core/SMP circa 1980.

And to all you industry people, if you'd stop firing everyone when they turn 30, you'd know this too!
- Re:Sounds great! (Score:5, Funny)
  
  by moderatorrater ( 1095745 ) writes: on Tuesday May 27, 2008 @04:39PM (#23561807)
  
  Before, data structures would only perform well in 50-100 threads. With this work, he has it up to over 700 threads, but it hasn't been load tested yet. There's a good chance that he's on the forefront of the next generation of data structures, and a good chance that his work will be included in the Java core (although that's not saying much considering).
  
  - Re: (Score:2, Insightful)
    
    by hasdikarlsam ( 414514 ) writes:
    
    This still has limited applicability. Making a single data structure useful across hundreds of CPUs is impressive, but many problems can be more easily solved by using multiple structures - pipelining it, or using divide and conquer, or any of many other approaches.
    - Re: (Score:3, Insightful)
      
      by Pseudonym ( 62607 ) writes:
      
      Researchers don't work at the fringes of what can be done "more easily". They work at the fringes of what is currently possible.
      
      Think about multi-tasking for a moment. If your problem is inherently sequential (e.g. everything is effectively sequentialised by a shared resource that can't be split, such as a piece of hardware), then you can use a single-process event loop. If your problem is inherently parallel (e.g. a web server), then you can use multiple forked processes. In other words: If your proble
- Re:Sounds great! (Score:5, Informative)
  
  by mikael ( 484 ) writes: on Tuesday May 27, 2008 @05:16PM (#23562393)
  
  The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).
  
  With such applications you will often have an array of variables that are accessible by all threads (e.g. current processing modes of the application).
  
  To preserve the integrity of the system, you need to only allow one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with large number of threads queuing up in a suspended state waiting to read a variable, while one thread writes.
  
  The author uses the Load-Link/Store Conditional [wikipedia.org] pair of instructions to guarantee that the new value is written to all locations. Load-Link loads the value from memory. Store-Conditional only writes the value back if no other write requests have been performed on that location, otherwise it fails.
  
  Check-And-Set [wikipedia.org] only replaces the variable with a new value if the value of the variable matches a previously read old value.
  
  Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.
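
  The "only replace the value if it still matches what was previously read" rule described above is what `compareAndSet` on Java's atomic classes provides. A small sketch, with the class name `CasDemo` and the "mode" variable invented for the example:

  ```java
  import java.util.concurrent.atomic.AtomicInteger;

  final class CasDemo {
      // Tries to switch `mode` from the value the caller last read to `next`.
      // Returns false if some other writer changed it in between.
      static boolean trySwitch(AtomicInteger mode, int lastSeen, int next) {
          return mode.compareAndSet(lastSeen, next);
      }

      public static void main(String[] args) {
          AtomicInteger mode = new AtomicInteger(0);
          int seen = mode.get();  // this thread snapshots the current value
          mode.set(7);            // stands in for another writer getting in first
          System.out.println(trySwitch(mode, seen, 1));        // stale snapshot
          System.out.println(trySwitch(mode, mode.get(), 1));  // fresh snapshot
      }
  }
  ```

  Readers never block here: `get()` is a plain atomic read, and only a writer whose snapshot has gone stale has to notice and try again.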
  
  - Re:Sounds great! (Score:4, Funny)
    
    by Linker3000 ( 626634 ) writes: on Tuesday May 27, 2008 @05:26PM (#23562541) Journal
    
    I hereby appoint you official summary explainer!
    
    Thanks
    
    - Re:Sounds great! (Score:5, Funny)
      
      by hey! ( 33014 ) writes: on Tuesday May 27, 2008 @05:34PM (#23562639) Homepage Journal
      
      Which is great and all, but what we usually need is more of a summary executioner.
      
      - Re: (Score:2)
        
        by TheLink ( 130905 ) writes:
        
        We outsourced that to China.
        
        They said they'd provide nonblocking executions in a scalable manner.
  - Sounds bogus? (Score:2)
    
    by hackingbear ( 988354 ) writes:
    
    Well... if I remember what I learned from OS and hardware classes right. LL/SC and CAS operations do involve locks at the hardware level. These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottlenec
    - Re:Sounds bogus? (Score:5, Informative)
      
      by Kupek ( 75469 ) writes: on Tuesday May 27, 2008 @06:22PM (#23563247)
      
      Locking in software has implications that locking at the hardware level does not.
      
      If a thread locks in software, any subsequent thread must block, waiting for the first thread to finish. If the thread is preempted, then the waiting threads wait needlessly. If the thread dies, then the waiting threads are hosed.
      
      Lock-free techniques prevent this problem, at the expense of more complicated algorithms and data structures. The basic structure of most lock-free algorithms is read a value, do something to it, and then attempt to commit the changed value back to memory. The attempt fails if another thread has changed the value from underneath you, and you must try again. (This is detected through operations like compare-and-swap.) This allows greater concurrency and guarantees that the system as a whole will make progress, even if a thread is preempted or dies.
      
      Lock-free algorithms and data structures are a well-established area. What Click has done here is provide a Java implementation of some data structures that yield good performance on the manycore systems his company makes.
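
      The read, modify, attempt-to-commit, retry pattern described above looks roughly like this in Java. It is a generic sketch of the pattern, not code from Click's library; `LockFreeMax` is a made-up example that tracks the largest value any thread has offered.

      ```java
      import java.util.concurrent.atomic.AtomicLong;

      final class LockFreeMax {
          private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

          // Read the current value, decide on a new one, and try to commit it.
          // Retry only if another thread changed the value in between.
          void offer(long candidate) {
              while (true) {
                  long current = max.get();
                  if (candidate <= current) return;                   // nothing to commit
                  if (max.compareAndSet(current, candidate)) return;  // committed
                  // CAS failed: another thread made progress, so re-read and retry.
              }
          }

          long get() {
              return max.get();
          }

          public static void main(String[] args) throws InterruptedException {
              LockFreeMax m = new LockFreeMax();
              Thread[] workers = new Thread[8];
              for (int t = 0; t < workers.length; t++) {
                  final int base = t * 1000;
                  workers[t] = new Thread(() -> {
                      for (int i = 0; i < 1000; i++) m.offer(base + i);
                  });
                  workers[t].start();
              }
              for (Thread w : workers) w.join();
              System.out.println(m.get());
          }
      }
      ```

      A failed `compareAndSet` always means some other thread succeeded, which is why the system as a whole keeps making progress even though an individual thread may loop. No thread ever holds anything that a preempted or dead thread could leave locked.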
      
      - Re: (Score:2)
        
        by hackingbear ( 988354 ) writes:
        
        That's what I understood. Maybe you can be very creative and use a very complicated way to avoid contention except for a few independent shared variable updates -- but when you scale your algorithm to hundreds or thousands of CPUs, you will eventually hit the memory bus / communication overhead as usual. MPP supercomputers have been trying all tricks to minimize this overhead but it is still a bottleneck. I just don't see what this project has achieved but to create maybe a faster concurrent hash map (or the likes) --
    - Re: (Score:2)
      
      by Fourier ( 60719 ) writes:
      
      By convention, "lock-free" is sort of an ill-defined term that roughly means "no software-based locking operations, except maybe we'll ignore spin-locking, provided the algorithm always makes global progress while spinning occurs." In any case, there is no common definition for "lock-free" which implies the absence of hardware locking.
      
      I'm not sure there is anything radically new in this project. However, given the exceptional difficulty of writing correct lock-free algorithms, just about any project explo
    - Re:Sounds bogus? (Score:5, Informative)
      
      by Chris Burke ( 6130 ) writes: on Tuesday May 27, 2008 @08:07PM (#23564597) Homepage
      
      These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.
      
      That's not strictly true.
      
      First, most lock operations do not require a full bus lock. All you have to do is to ensure atomicity of the load and store. Which effectively means you have to 1) acquire the cache line in the modified state (you're the only one who has it here), and 2) prevent system probes from invalidating the line before you can write to it by NACKing those probes until the LOCK is done. Practically this means the locked op has to be the oldest on that cpu before it can start, which ultimately delays its retirement, but not by as much as a full bus lock. Also it has minimal effect on the memory system. The LOCK does not fundamentally add any additional traffic.
      
      Second, the way the value is propagated to other CPUs is the same as any other store. When the cache line is in the modified state, only one CPU can have a copy. All other CPUs that want it will send probes, and the CPU with the M copy will send its data to all the CPUs requesting it, either invalidating or changing to Shared its own copy depending on the types of requests, coherence protocol, etc. If nobody else wants it, and it is eventually evicted from the CPU cache, it will be written to memory. This is the same, LOCK or no.
      
      Third, an explicit mutex requires at least two separate memory requests, possibly three: One to acquire the lock, and the other to modify the protected data. This is going to result in two cache misses for the other CPUs, one for the mutex and one for the data, which are both going to be in the modified state and thus only present in the cache of the cpu that held the mutex. In some consistency models, a final memory barrier is required to let go of the mutex to ensure all writes done inside the lock are seen (x86 not being one of them).
      
      Fourth, with fine enough granularity, most mutexes are uncontested. This means the overhead of locking the mutex is really just that, overhead. Getting maximal granularity/concurrency with mutexes would mean having a separate mutex variable for every element of your data array. This is wasteful of memory and bandwidth. Building your assumptions of atomicity into the structure itself means you use the minimal amount of memory (and thus mem bw), and have the maximal amount of concurrency.
      
      So basically, while it isn't necessarily "radical" (practical improvements often aren't), it is definitely more than bogus marketing. There's a lot more to it than that.
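
      The fourth point above, a separate mutex per element versus atomicity built into the array itself, can be compared side by side. Both variants below are illustrative sketches with invented names; neither is taken from Click's code.

      ```java
      import java.util.concurrent.atomic.AtomicLongArray;
      import java.util.concurrent.locks.ReentrantLock;

      // Two ways to get per-element concurrency on an array of counters.
      final class PerElementConcurrency {
          // Variant A: one lock object per element, stored alongside the data.
          static final class Locked {
              private final long[] data;
              private final ReentrantLock[] locks;

              Locked(int n) {
                  data = new long[n];
                  locks = new ReentrantLock[n];
                  for (int i = 0; i < n; i++) locks[i] = new ReentrantLock();
              }

              void add(int i, long delta) {
                  locks[i].lock();
                  try {
                      data[i] += delta;
                  } finally {
                      locks[i].unlock();
                  }
              }

              long get(int i) {
                  locks[i].lock();
                  try {
                      return data[i];
                  } finally {
                      locks[i].unlock();
                  }
              }
          }

          // Variant B: atomicity is a property of the array word itself.
          static final class Atomic {
              private final AtomicLongArray data;

              Atomic(int n) {
                  data = new AtomicLongArray(n);
              }

              void add(int i, long delta) {
                  data.addAndGet(i, delta);
              }

              long get(int i) {
                  return data.get(i);
              }
          }

          public static void main(String[] args) {
              Locked a = new Locked(4);
              Atomic b = new Atomic(4);
              a.add(2, 5);
              b.add(2, 5);
              System.out.println(a.get(2) + " " + b.get(2));
          }
      }
      ```

      Variant A touches two memory locations per update (the lock and the data) and allocates a lock object for every element. Variant B touches one word and carries no extra per-element state.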
      
      - Can also be done with a clean cache (Score:3, Interesting)
        
        by Gazzonyx ( 982402 ) writes:
        
        I don't necessarily take issue with what you've said (I've got you on my friend list, so I'm fairly sure that you're one of the professionals on this site whose insight I love to read... I'm a software development major in college, so I usually add programmers to my friends list when I genuinely value their insights...), but I would like to point out that I've seen another approach to your first point about locking busses. The Linux kernel, IIRC, does a sideways and backwards cache flush to RAM before maskin
        
        - Re: (Score:2)
          
          by Chris Burke ( 6130 ) writes:
          
          But, as you said, the lock doesn't require a bus lock. Please correct me if I'm wrong on this, it's been a few months since I've read that piece of code and I'm a little hazy on the exact details.
          
          Ooh, don't know anything about that code actually. A cache flush is a lot heavier weight than a normal lock, if you're only accessing one cache line that is, but I could see it being a good idea in a number of situations in OSes. Interesting.
      - Re: (Score:2)
        
        by dedazo ( 737510 ) writes:
        
        I worked on a high-concurrency application a few years ago (not multi-CPU). The problems inherent to today's x86 hardware vis-a-vis concurrency are very interesting indeed. We had a few thousand threads at any given point trying to access something that would pass as a hashtable with millions of key-value pairs. We ran into all sorts of problems until I created a lightweight structure that looked like a cross between a mutex and a Win32 critical section, which I derived from an old MSDN code sample written
        
        - Re: (Score:2)
          
          by dedazo ( 737510 ) writes:
          
          Gah, I hate replying to myself but I forgot my main point, which is to incorporate the blocking mechanism in the data structure itself. I would simply never have thought of that, and I think that's pretty much the novel thing about this whole idea.
          Talk about thinking outside of the box... probably the reason I don't work for Google =)
- - Re:Sounds great! (Score:5, Funny)
    
    by Linker3000 ( 626634 ) writes: on Tuesday May 27, 2008 @04:34PM (#23561729) Journal
    
    1988? Atomic?
    
    Is this something to do with a Blondie tour?
    
- Re:Java???? (Score:4, Insightful)
  
  by Anonymous Coward writes: on Tuesday May 27, 2008 @04:35PM (#23561739)
  
  700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
  
  Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?
  
  - Re:Java???? (Score:5, Funny)
    
    by Anonymous Coward writes: on Tuesday May 27, 2008 @04:42PM (#23561847)
    
    700 threads in assembler? Why not use JAVA, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
    
    Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?
    
    - - Re: (Score:2)
        
        by SanityInAnarchy ( 655584 ) writes:
        
        Except that the whole reason you'd use one of these, instead of a much cheaper cluster of commodity hardware, is because you want to use shared memory and threads. There are problems which scale much better to shared memory than to shared-nothing -- or so I'm told.
        
        Oh, and Erlang is wannabe-functional. Go play with Haskell if you want a purely-functional language. No side effects means, among other things, the ability to have the parallelism done for you, instead of having to explicitly spawn a thread -- or
        
        Re: (Score:2)
        
        by SanityInAnarchy ( 655584 ) writes:
        
        Not entirely sure what that has to do with my post, but thanks, that's interesting.
        
        I'm also not entirely sure I like transactions yet. Haven't thought hard enough about it, yet.
  - Re: (Score:2)
    
    by raddan ( 519638 ) writes:
    
    "700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread."
    Good idea, but-- assembly language is way too high-level. Why not take it a step further and just give programmers front panel switches? Bonus: you save money on keyboards!
  - Re: (Score:2)
    
    by Jherek Carnelian ( 831679 ) writes:
    
    "700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread."
    Because the hardware platform his company sells actually has more than 700 cores - that means 700 simultaneous threads - not sequential, simultaneous.
    
    When you have a highly parallel system, it is counter-productive to try to make the workload sequential.
  - Re: (Score:2)
    
    by DarkOx ( 621550 ) writes:
    
    Because the cost of developing your code in ASM would be a gigantic leap in terms of time and provide little in terms of gain over the output of a good compiler. It would also destroy any potential for reusability on other platforms.
    
    The parent, on the other hand, I can agree with: Java development is not much faster than C++ provided you are working on more esoteric things and can't take advantage of the huge library selection. The syntax and formal patterns are mostly the same after all. Developing a mass
    - Re: (Score:3, Insightful)
      
      by samkass ( 174571 ) writes:
      
      So your argument is that with C/C++ you can make something almost as portable, almost as re-usable, almost as fast to write, almost as easy to debug, using a library selection that's almost as complete as Java. And in return you might gain a tiny bit of speed increase?
      
      And you're also postulating that there's more likely to be a fully standards-compliant C++ compiler with a standard thread interface on all those esoteric machines of which you speak than a basic JVM?
      
      I understand it's cool to hate Java on Sla
  - Stoopid (Score:2)
    
    by Nicolas MONNET ( 4727 ) writes:
    
    Because coding a bad, poorly parallelizable algorithm in ASM vs C will give you at best a 2x speedup.
    And coding a good, hyper parallelizable algorithm in Java vs C will give you at worst a 1/2x speed down, and at the same time an nx speed up, where n=768.
    And I'm not even talking about the mess ASM would be.
- Re: (Score:3, Insightful)
  
  by AKAImBatman ( 238306 ) writes:
  
  "700 threads in JAVA? Why not use C++"
  Hmm... lemme think about that. Maybe because Java has decent threading support built into the language? Maybe because the platform is portable to any architecture? Maybe because the JVM can "optimize the hell" out of the running Java code far better than you could "optimize the hell" out of your C++ by hand?
  
  "Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers runni
  - Re: (Score:2)
    
    by Brian Gordon ( 987471 ) writes:
    
    I don't think architecture portability is a concern when you're writing for a 768-core supercomputer :)
    - Re: (Score:2)
      
      by phasm42 ( 588479 ) writes:
      
      "I don't think architecture portability is a concern when you're writing for a 768-core supercomputer :)"
      You don't actually write code specific to a 768-core computer. The code doesn't need more than 1 core; the idea is to make it scale well to 768 cores.
  - Re: (Score:2)
    
    by Daniel Dvorkin ( 106857 ) * writes:
    
    "Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore."
    
    Okay, I'll agree that well-written Java code is generally performance-competitive with compiled code, but this is a pretty sweeping assertion. Do you have any evidence for it -- or is it just a little too convenient that "no one even bothers" with benchmarks?
    - Benchmark from many years ago (Score:3, Insightful)
      
      by CustomDesigned ( 250089 ) writes:
      
      Yes, Java didn't surpass, but was competitive with C++ many years ago - at least as far as CPU goes. In fact, if you use gcj, you'll find that Java performance is *very* similar to C++ :-). For the IBM JVM, the turning point was JDK 1.1.6. I think having each thread allocate blocks of memory to dish out without lock contention was a big part of it. One interesting benchmark was this [bmsi.com].
      
      Now, before you C/C++ or Java fanboys get too excited, the absolute hands down fastest language on most of the benchmarks
  - Re:Java???? (Score:5, Insightful)
    
    by JMZero ( 449047 ) writes: on Tuesday May 27, 2008 @05:22PM (#23562475) Homepage
    
    Java is perfectly fast for real world applications, and I'd agree that the "Java is Slow" idea is outdated.
    
    But it's not conclusively faster than C++, at least not in a general sense. In terms of a small task involving lots of simple operations, you'll still often see a significant speed increase using C++. This [multicon.pl] is a good example. Now I'm sure there are more optimizations available on both sides - and plenty of stuff to tweak - but C++ is going to come out ahead by a significant margin on a lot of these tasks.
    
    A good example where the participants on both sides have some motivation is on TopCoder (where I spend a fair bit of time). Performance isn't usually the driving factor in language choice there - but sometimes it is, and when it is the answer is pretty much always C++ (unless it's a comparison between Java BigInteger and a naive implementation of the same in C++).
    
    Reasonably often you'll see people write an initial solution in Java, find it runs a bit too slow, and quickly port it to C++ (or pre-emptively switch to C++ if they think they'll be near the time limit). It's not uncommon to see a factor of two difference in performance.
    
    To be clear - these are not usually "real world" tasks. As more memory and objects come into play, Java is going to do better and better. But these kinds of tasks still exist - there are still plenty of places where C++ is going to be the choice because of performance.
    
    In any case, your contention that Java is so much faster that nobody does benchmarks anymore is unsupported and wrong.
    
    - Re: (Score:2, Interesting)
      
      by AKAImBatman ( 238306 ) writes:
      
      Your link says: "Unfortunately, there's also a third conclusion. It seems that it's much, much easier to create a well performing program in Java. So, please consider it for a moment before you start recoding your Java program in C++ just to make it faster..."
      
      In any case, the author of that test failed. The test includes start-up times in the results. Java is always going to lose micro-benchmarks where start-up time is included. Why? Because the JVM load gets included in the benchmarks. As anyone who knows
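
      A minimal sketch of the point about start-up time, not taken from the linked benchmark: timing inside the process with `System.nanoTime()` leaves JVM load out of the measurement, and a few untimed warm-up runs give the JIT a chance to compile the hot path first. `WarmBench`, `fib` and `timeAfterWarmup` are hypothetical names, and `fib` only stands in for whatever workload is being measured.

      ```java
      // Hypothetical sketch: time a workload inside the JVM, after warm-up runs.
      public class WarmBench {
          // Small CPU-bound workload; a stand-in for the code under test.
          public static long fib(int n) {
              return n < 2 ? n : fib(n - 1) + fib(n - 2);
          }

          // Runs the workload `warmups` times untimed, then returns the
          // elapsed nanoseconds of one timed run.
          public static long timeAfterWarmup(int n, int warmups) {
              long sink = 0;
              for (int i = 0; i < warmups; i++) {
                  sink += fib(n); // untimed: lets the JIT see the hot path
              }
              long t0 = System.nanoTime();
              sink += fib(n);
              long elapsed = System.nanoTime() - t0;
              // Use `sink` so the calls above cannot be discarded as dead code.
              if (sink == 42) System.out.println();
              return elapsed;
          }

          public static void main(String[] args) {
              System.out.println("no warm-up, ns: " + timeAfterWarmup(25, 0));
              System.out.println("20 warm-ups, ns: " + timeAfterWarmup(25, 20));
          }
      }
      ```

      The numbers vary from run to run and machine to machine, so this is only an illustration; a dedicated harness such as JMH is the safer choice for anything that will be quoted as a result.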
      - Re: (Score:2)
        
        by JMZero ( 449047 ) writes:
        
        The test includes start-up times in the results.
        
        This may be a significant factor for some of his smaller tests, but I don't think it jeopardizes the overall conclusion. For example, the Ackerman test was 15s vs. 60s. There isn't going to be more than a second of startup, and most of the tests [multicon.pl] are showing differences on the order of many seconds.
        
        In terms of "is it harder to write fast code in C++", I guess it depends on what you're good at. Looking at the changes he did to the C++ code, none of it is esot
      - Re: (Score:2)
        
        by Jerry Coffin ( 824726 ) writes:
        
        The long and short of it is that Java was proven to be more than fast enough, with the average case easily keeping pace or beating C++ code.
        
        More accurately, Java is usually more than fast enough, even though it's nearly never as fast as equivalent C++. When you get down to it, performance rarely matters much.
        
        Re: (Score:3, Interesting)
        
        by TheLink ( 130905 ) writes:
        
        When performance doesn't matter much I'd use perl instead of Java, or some other higher level language (e.g. Python).
        
        Of course when your boss/customer requires Java, use Java :).
        
        I'm mixed on Lisp- Lisp is kind of high level and kind of not... I guess I'm just too lazy - perl has CPAN = lots of wheels I don't have to reinvent - and document ;).
    - It's not the Java language that is the problem (Score:2)
      
      by gatkinso ( 15975 ) writes:
      
      it is the platform (i.e. JRE). Slow. As molasses. But then again, so is managed C++ on the CLR.
      
      So it goes, the circular argument....
    - Re: (Score:2)
      
      by AnyoneEB ( 574727 ) writes:
      
      If speed is your concern, why would you switch from Java to C++ instead of just switching from JRE to gcj native? Is there really a significant difference between gcj and g++ in terms of speed? I guess Java has bounds checks which are probably not completely optimized out, but they could be disabled with --no-bounds-check.
    - Re: (Score:2)
      
      by RAMMS+EIN ( 578166 ) writes:
      
      I think the deal is that, ultimately, lower-level languages (assembly is lower than C, which is lower than C++, which is lower than Java) allow you to gain better performance than you could achieve with higher-level languages.
      
      Whether you will actually achieve the highest possible performance depends on whether you actually write the most efficient code possible. This, in turn, depends on your skills and the time you have to complete the project.
      
      Thus, what you will see in practice is that there is a lot of v
  - - Re: (Score:3, Informative)
      
      by quanticle ( 843097 ) writes:
      
      He never said that Java had surpassed C in speed; he said that Java had surpassed C++. C++ library classes are not the same as C library classes, and many C++ libraries (especially the ones outside STL and Boost) are woefully under-optimized. Java has many more optimized libraries "packaged in" with the language itself.
      
      Second, neither the Doom nor the Unreal engine is multi-threaded. Java has threading support built into the language. To get the same with C you'd have to use POSIX threads (killing Windows
    - Re: (Score:2)
      
      by DarkOx ( 621550 ) writes:
      
      Disclaimer: I am a C/C++ fanboi
      
      Java is not faster than C++ and it never will be. You can't fundamentally add the overhead of a byte code interpreter and somehome come out ahead. What Java does and does very well is save programmers from themselves.
      
      Like most programmers I am not God's gift to software development. Chances are pretty good that as any given software project I am working on becomes larger and starts to involve more objects and/or data structures, the memory management in Java is going to be bette
      - Re: (Score:2)
        
        by Jerry Coffin ( 824726 ) writes:
        
        Java is not faster then C++ and it never will be. You can't fundamentally add the overhead of a byte code interpreter and somehome [sic] come out ahead.
        
        Java doesn't necessarily involve a byte-code interpreter. JVMs have supplied Just In Time compilation for years now.
        That's not a panacea though: optimization is generally a difficult task, and good optimization can be quite difficult indeed -- many (perhaps most) NP-complete problems are basically ones of optimization. While optimizing code isn't nec
        
        Re: (Score:2)
        
        by afidel ( 530433 ) writes:
        
        Java GC sucks in general. I use quite a number of different JVMs, and the tuning advice I see most often, and have experienced myself, is: don't allow any given JVM to go above ~1.5GB of RAM or GC performance will bring down overall system performance. That's just stupid on modern machines with up to hundreds of GB of RAM. Sure, there are frontends that allow you to spawn additional JVMs automatically, but that just means there are more processes and more overhead.
  - - Re: (Score:2)
      
      by cheater512 ( 783349 ) writes:
      
      Well the AVR32 architecture does support some Java byte codes.
      
      Mind you the manual does specifically state that you shouldn't use them inside interrupts. :)
    - Re: (Score:2)
      
      by shish ( 588640 ) writes:
      
      I'm not even going to bother with the absurdity of the idea of JIT bytecode outperforming a compiled language on any architecture.
      
      What's absurd about one bit of native binary code outperforming a different bit of native binary code? o_O
      - Re: (Score:2)
        
        by smellotron ( 1039250 ) writes:
        
        I'm not even going to bother with the absurdity of the idea of JIT bytecode outperforming a compiled language on any architecture.
        What's absurd about one bit of native binary code outperforming a different bit of native binary code? o_O
        Well, nothing... but based on the GP's usage of the phrase "JIT bytecode", my best guess is that he's talking about Python (compiles source code to bytecode "JIT") as opposed to Java (compiles bytecode to machine code JIT).
- Re:Java???? (Score:4, Informative)
  
  by Kupek ( 75469 ) writes: on Tuesday May 27, 2008 @04:42PM (#23561843)
  
  Java has a well-defined memory model. C++ (and C) do not; behavior depends on the hardware it is run on.
  
  - - Re: (Score:2)
      
      by Kupek ( 75469 ) writes:
      
      "Even if the read in Thread B occurs after the write in Thread A chronologically, Thread B may be looking at stale data."
      If Thread B has stale data, then its commit (through a compare-and-swap or similar atomic operation) should fail, and the thread should try the operation again.
      
      Take a look at the algorithms in the presentation. I've done a fair amount of lock-free programming [vt.edu], but all of it was in C. His algorithms look like standard lock-free practices, but since I'm not familiar with what correct lock-free Java code looks like, I can't say with confidence.
      
      I will, however, say that I would be surprised if Cliff Click mad
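
      A minimal sketch of the retry pattern described above, and not code from the presentation: a thread takes a snapshot, computes an update from it, and commits with compare-and-swap. If another thread changed the value in between, the snapshot was stale, the commit fails, and the loop tries again. `CasCounter` and `addCapped` are hypothetical names; `AtomicLong.compareAndSet` is the standard `java.util.concurrent.atomic` call.

      ```java
      import java.util.concurrent.atomic.AtomicLong;

      // Hypothetical illustration of a lock-free read-modify-write loop.
      public class CasCounter {
          private final AtomicLong value = new AtomicLong(0);

          // Adds delta but never lets the value exceed cap.
          // Returns the value that this call committed.
          public long addCapped(long delta, long cap) {
              while (true) {
                  long seen = value.get();                 // snapshot; may go stale
                  long next = Math.min(seen + delta, cap); // update computed from the snapshot
                  if (value.compareAndSet(seen, next)) {   // succeeds only if value is still `seen`
                      return next;
                  }
                  // Another thread committed first: retry with a fresh read.
              }
          }

          public long get() {
              return value.get();
          }

          public static void main(String[] args) throws InterruptedException {
              CasCounter c = new CasCounter();
              Thread[] workers = new Thread[8];
              for (int i = 0; i < workers.length; i++) {
                  workers[i] = new Thread(() -> {
                      for (int j = 0; j < 1000; j++) c.addCapped(1, 5000);
                  });
                  workers[i].start();
              }
              for (Thread t : workers) t.join();
              System.out.println(c.get()); // prints 5000: 8000 increments, capped at 5000
          }
      }
      ```

      No thread ever waits on a lock here; a thread that loses a race simply repeats its own small computation.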
    - java.util.concurrent.atomic (Score:2)
      
      by CustomDesigned ( 250089 ) writes:
      
      They are really non-blocking, and use a standard Java API - the java.util.concurrent.atomic package. The description of the package indicates it was intended as the low level basis for non-blocking data structures. While most JVMs are written largely in C, the Java code gets compiled to machine language. Standard packages do not have to be implemented via JNI, and can be implemented directly by the JVM - just like GNU C implements many standard C library functions directly in the compiler for performance
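
      As a rough illustration of building on that package, here is a small sketch in the spirit of the summary's "large array plus atomic update per array word" idea. It is not Click's implementation: `SlotSet` is a hypothetical, fixed-capacity, insert-only set with no resizing or removal, and each slot goes through just two states (empty, then claimed). `AtomicReferenceArray` and its `compareAndSet` are the standard API.

      ```java
      import java.util.concurrent.atomic.AtomicReferenceArray;

      // Hypothetical sketch: a fixed-capacity, insert-only set of strings.
      // Each slot is a tiny state machine: null (empty) -> key (claimed), never back.
      public class SlotSet {
          private final AtomicReferenceArray<String> slots;

          public SlotSet(int capacity) {
              slots = new AtomicReferenceArray<>(capacity);
          }

          // Returns true if key was added; false if already present or the table is full.
          public boolean add(String key) {
              int n = slots.length();
              int start = Math.floorMod(key.hashCode(), n);
              for (int i = 0; i < n; i++) {
                  int idx = (start + i) % n;
                  String cur = slots.get(idx);
                  if (cur == null) {
                      // Try to claim the empty slot; only one thread's CAS can succeed.
                      if (slots.compareAndSet(idx, null, key)) return true;
                      cur = slots.get(idx); // lost the race: see what the winner stored
                  }
                  if (cur.equals(key)) return false; // already present
                  // Slot holds a different key: probe the next one.
              }
              return false; // every slot is taken by other keys
          }

          public boolean contains(String key) {
              int n = slots.length();
              int start = Math.floorMod(key.hashCode(), n);
              for (int i = 0; i < n; i++) {
                  String cur = slots.get((start + i) % n);
                  if (cur == null) return false;    // empty slot: key was never added
                  if (cur.equals(key)) return true;
              }
              return false;
          }
      }
      ```

      Because a slot never returns to the empty state, a thread that loses the compare-and-set can safely read the slot again and carry on, which is what keeps the probe loop free of locks.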
- Re:Java???? (Score:5, Insightful)
  
  by famebait ( 450028 ) writes: on Tuesday May 27, 2008 @04:53PM (#23562027)
  
  There is no way anything less than _really_ good coders would get something like this to work with any semblance of efficiency. If you still evaluate coders by which language they use, chances are you're not really that good a programmer.
  
- Re: (Score:3, Interesting)
  
  by tppublic ( 899574 ) writes:
  
  "Why not use C++"
  Umm, because Azul runs the Java in hardware. It *is* optimized by being in Java.
- Re: (Score:2)
  
  by famebait ( 450028 ) writes:
  
  "700 threads in JAVA? Why not use C++,"
  
  To get the work done and working correctly before the hardware is obsolete.
- Re: (Score:2)
  
  by mikael ( 484 ) writes:
  
  They develop Java hardware for internet applications. Presumably this is for commercial organisations where it is more important to get a functional application designed and assembled than to design, test and implement a completely optimised C++ application.
- - Re: (Score:2)
    
    by Viol8 ( 599362 ) writes:
    
    "Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job; plenty-o-examples upon request)"
    
    Sorry, remind me what language the JIT compiler is written in. Java, is it? No, thought not.
    - - Re: (Score:2, Informative)
        
        by fishbowl ( 7759 ) writes:
        
        In the problems that TFA addresses, I'd wager that most of the time is spent in dissemination barriers of some sort, since invariably the problems in parallel computing move into issues within the problem domain (which is ideally what we want, after all).
        
        As for a JIT being inefficient compared to a static optimizing compiler, it depends so much on the code in question and on the platform, as to not be something you can make blanket statements about.
        
        Let's hear from some HPC researchers on this. Get some rea
  - Re:Java???? (Score:4, Funny)
    
    by Reverend528 ( 585549 ) * writes: on Tuesday May 27, 2008 @05:05PM (#23562213) Homepage
    
    "Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job"
    Since when has writing code quickly ever been considered one of Java's strong points? Personally I'd take stdio over Java's alternative (file wrapped in a stream buffer wrapped in a buffered reader wrapped in an enigma) any day of the week.
    Sure, Java manages memory for you, but it's generally much easier to incorporate a garbage collector into C than it is to write Java without file I/O.
    
    - - Re: (Score:2)
        
        by FlyingGuy ( 989135 ) writes:
        
        Call me old-fashioned, but ummmmm, you malloc() it, you free() it.
        
        I mean, when did this whole notion of malloc() it and forget it come to pass? Did Ron Popeil suddenly start writing programming languages or something? I don't care how complex your code is: at some point that bit of memory is no longer required, and at that point it is your job as the programmer to clean up after yourself and free() it, no?
- - Re: (Score:2)
    
    by arthurpaliden ( 939626 ) writes:
    
    Actually, I like being a COBOL consultant/programmer: I get to work when I want and I get paid really big bucks. Lots of COBOL still out there in mission-critical processes.
- Re:"2.5"? WTF? (Score:5, Informative)
  
  by badboy_tw2002 ( 524611 ) writes: on Tuesday May 27, 2008 @05:22PM (#23562477)
  
  He's built two working data structures and is working on a third (had to read the slides to figure that one out).
  
- I doubt that very much (Score:2)
  
  by EmbeddedJanitor ( 597831 ) writes:
  
  Sure, embedded folk have been using FSMs for years. Sure, embedded folk have been using arrays for many years.
  But FSM-controlled data access for multi-threading is a bit different. The closest I've seen is the implementation of dual-port RAM, which uses hardware state machines to control access to the underlying shared RAM.
  Oh, and I've been doing embedded for over 20 years too.
