Memory Management Technique Speeds Apps By 20% 252
Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
Wait-free lock-free data structures (Score:1, Informative)
You can use a wait-free lock-free data structure to add deallocated memory to a free list to reduce contention on the critical section in a heap. You can also use it to implement the idea presented in the paper.
On Windows, you can use the Interlocked Singly Linked List data structure (see InitializeSListHead).
You can also partition memory allocations over several heaps to reduce contention further.
Re:Wow, this is pretty clever (Score:4, Informative)
If you want faster AES, just upgrade your CPU [wikipedia.org].
Re:Might be particularly applicable to Java (Score:3, Informative)
Actually, Java already does something very similar to this: http://en.wikipedia.org/wiki/Java_Memory_Model [wikipedia.org]
Re:Might be particularly applicable to Java (Score:1, Informative)
Actually, Java does the complete opposite. Java allocates a large chunk of heap, and then internally allocates its own objects on that heap. because the majority of objects are short-lived, a generational garbage collector can be used. Also, some allocations are done on the stack. See here [ibm.com]
Re:Nothing to see here.... (Score:5, Informative)
Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.
With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if its not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient, there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place. Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.
And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.
Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...
Re:Wow, this is pretty clever (Score:5, Informative)
That's already implemented in the high performance ssh patch, available here [psc.edu]. Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.
I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.
Re:free() is probably more parallizable than mallo (Score:5, Informative)
Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf [ncsu.edu], they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
Re:Beware the key term there: (Score:3, Informative)
The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.
It would be nice if Unix/Linux had easier and better support for asynchronous I/O.
Operating systems like VMS made all I/O asynchronous by default, with optional flags or function wrappers that would wait for an I/O to complete before returning if you really wanted synchronous I/O. You could even designate a user function to run asynchronously (without the drawbacks and limitations of Unix signals) whenever any specific I/O completed.
Much simpler than Linux, where you have to use completely different and complex mechanisms if you want to use something better than synchronous read()/write() calls.
http://en.wikipedia.org/wiki/QIO [wikipedia.org]
http://en.wikipedia.org/wiki/Asynchronous_System_Trap [wikipedia.org]
Re:20%?! (Score:5, Informative)
I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free [umass.edu].
Re:You can malloc it but you can't use it (Score:4, Informative)
The PDF of the paper has all the details. The article is just fluff.
Re:Just remember to be aware of multi PROCESSOR (Score:5, Informative)
The type of system you're talking about is NUMA (Non-Uniform Memory Architecture), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now, whether it works is another question.
Re:Beware the key term there: (Score:3, Informative)
Operating systems like VMS made all I/O asynchronous by default...
This is mostly true in Windows too actually, given NT's strong VMS inspirations. From what I understand, drivers implement (only) asynchronous I/O calls, and the read/write system calls (NtReadFile and NtWriteFile) contain a specification of whether it should be asynchronous or synchronous. If synchronous, a higher level of the kernel handle that aspect itself, without the driver knowing anything about what's going on.
(I think this is more-or-less correct anyway.)
Re:Beware the key term there: (Score:3, Informative)
Clearly you didn't read the paper.
One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.
The MM thread preallocates blocks the application is likely to ask for, so that when the application asks for it's 300th small block for storing window coordinates or whatever, the memory manager thread can instantly say "here, I had that". It also batches deallocations and does them out-of-band, while the application continues running.
Re:Beware the key term there: (Score:3, Informative)
Actually, I can see how malloc would help, if you assume that they're always allocating small-ish amounts -- just keep a certain amount reserved, or don't free stuff instantly (in case it's about to be reallocated).
However, all of this seems very much like it could be done either inside libc (as proposed) or in the kernel, without having to touch existing apps, at least on platforms like Linux where libc is shared.
Re:Beware the key term there: (Score:3, Informative)
What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?
They keys are speculative allocation, and batch freeing. They decouple the actual allocation/deallocation that the system's memory management library performs (which may involve slow system calls into the kernel, even), from the malloc and free calls that the program makes. By decoupling the rest of the program thread from the memory allocation thread, the application then doesn't always have to wait for all the accounting and data structure manipulation that malloc and free do. Of course there are times when it does have to wait, but with intelligent speculative mallocs, and batched frees, they try to minimize those.
It's actually very similar to threaded I/O in that it decouples two aspects of the program, and allows for forward progress on things that aren't dependent on the thing that gets shoved off to another thread.
Re:Beware the key term there: (Score:3, Informative)
It's true that in this case we're looking at a max of about 20%, but we're also looking at an average of about 16-18% (I'm eye-balling from the graphs). There's one test in the benchmark suite which is almost entirely CPU-bound and it is only a few percentage points faster. This, of course, makes perfect sense as some real processes are CPU-bound and so should be included in the benchmark suite. But realistically, no allocation scheme can speed up a process which does almost no allocating by very much. The rest of the tests in the benchmark show a pretty uniform increase in speed in the 15-22% range (again, eye-balling from the charts).
Re:Beware the key term there: (Score:3, Informative)
Maybe at the API level. The API for asynchronous IO is there for every system call in Win32. What isn't there, however, is the use. I've read through the specs for doing "overlapped IO", and the conclusion is always the same - getting the semantics right is a pain. It's so much easier to create a GUI thread and a working thread, that no one I have ever seen bothered with it.
Linux does have asynchronous IO, and they suffer from, pretty much, the same problem - getting the semantics right is difficult.
Shachar
Re:Nothing to see here.... (Score:4, Informative)
They block the thread (by spinning, not sleeping) that calls malloc() while the allocation request is serviced by the server thread. The server thread is not busy looping it is signalled when a request is issued. The combination of pre-allocation and the lockless server protocol means that the probability of a thread needing to be blocked in the first place is very low, and if it is the lock will be held for a very short time, And for short periods of time spinning is more efficient than the whole signal/goto sleep/wakeup dance.
It's not a cute trick, its a way of reducing latency (used in thousands of places in fast code paths in most operating systems), and your claims about CPU utilisation is mostly false. It's true that it incurs some penalty for the worst case scenario.
Also the paper says "...especially for sequential applications which cannot easily benefit from the multicore architecture otherwise"
In sequential apps the overhead of the spin locks is much less anyway, because there is less internal concurrency in the process.
Re:Nothing to see here.... (Score:3, Informative)
When used for locking it is called spinning and not busy-looping, and stop your silly doomsday speak and grow a brain. The linux kernel itself more often use spinning than locking, because it is much faster and uses less cpu-cycles. You have busy-looping thousands of time each second when the kernel synchronizes threads and hardware, this is a no-go in application design, but a really common and efficient trick in low-level libraries and routines, and it will save you cpu-cycles and energy compared to semaphores, not use more.
Re:Nothing to see here.... (Score:3, Informative)
But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.
Average is about 15%. Many programs spend nearly 50% of time in memory allocation.
Re:Beware the key term there: (Score:2, Informative)
Are you sure that zero is the baseline? They may have suckered you in.
There is undoubtedly some trivial utilization of memory for which the overhead exceeds any gains.
Re:Nothing to see here.... (Score:4, Informative)
I should perhaps note that I do implement low-level libraries for an extremely reputable company as a day job; I'm familiar with low-level lock implementations both in kernel and in userlevel on Linux, Windows, and MacOS, and exactly how those implementations balance spinning versus blocking.
The Linux kernel preference for spinlocks dates from years ago, when the whole kernel ran under the BKL and was non-premptable anyways so you couldn't use blocking locks. When the BKL came out, all locks were made spinlocks to maintain correctness (and the -rt patchset started up, doing a conversion). The default implementation (still in use today by anything except the -rt patchset!) disables interrupts while any spinlock is held, and thus assumes the only thing holding the lock is another core.
In contrast, Solaris and Windows (and I think MacOSX, though I would have to check my references) use a mix of spinlocks and adaptive locks - spinlocks for use within interrupt handlers, and adaptive locks for everywhere else. Good pthread implementations (glibc included) use adaptive locks - which means the pthread implementation this paper declared too slow to use ALREADY spins ~1000 cycles before blocking. The canonical rule here is that an adaptive lock spins for the same amount of time it would take for a block/wakeup cycle, then blocks; this is guaranteed to be within a factor of 2 of optimal in all cases, which is the best overall lower bound you can possibly get. (Yes, Linux kernel is behind the times; they are slowly getting better, and when eventually the -rt patchset gets merged, Linux will have finally caught up. Sorry, Linux fanboys.)
Spinning versus blocking is a tradeoff. The research paper manages to extract all the gains from the "spin forever" side of the tradeoff without ever admitting the drawbacks (one full CPU core wasted).
As usual Smalltalk's been there done that (Score:3, Informative)
Not much to see in the article. Move along.
IBM (not Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).
One problem is that when you run out of memory space the application native threads (many in Smalltalk/MT) are blocked waiting for the one garbage collection thread to catch up. It all depends upon how much new memory is allocated depending on how much is freed up. They have a solution and are working to implement it. It's likely to use multiple native threads for the gc balancing out the freeing with the consumption. Another solution is to have each worker thread also switch into a gc thread in cases when it's starved for memory.
Another solution is to use memory structures that don't require garbage collection. In other words, REUSE rather than RECYCLE.