
Memory Management Technique Speeds Apps By 20%

Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
  • by Omnifarious ( 11933 ) * <eric-slashNO@SPAMomnifarious.org> on Monday April 05, 2010 @08:20PM (#31743356) Homepage Journal

    I wish I'd thought of it.

    Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish were widely implemented: offloading most of AES CTR mode onto a separate thread (or several), since each block has a computation step that can be done in advance, in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, even on my fastest computers, multi-gigabyte transfers over ssh are very CPU-bound, with the ssh process pegged at 100% (and just 100%) CPU.

  • by Anonymous Coward on Monday April 05, 2010 @08:24PM (#31743400)

    Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.

  • by Anonymous Coward on Monday April 05, 2010 @08:28PM (#31743448)

    Exactly, and they are even comparing it to the old and relatively slow Doug Lea allocator.

    If you want to test a new memory allocator, the benchmarks these days are the Hoard allocator, and the TCmalloc allocator from google. These alone will give you more than the 20% speedup mentioned in the article.

    However, that isn't the end of the story. There are proprietary allocators, like the Lockless [locklessinc.com] memory allocator, that are about twice as fast as the older allocators which aren't optimized for multi-core machines.

  • by SpazmodeusG ( 1334705 ) on Monday April 05, 2010 @08:44PM (#31743584)
    Actually I've found the opposite. Java tends to be really good at transparent memory re-use. From experience, if I have ~1,000,000 objects of the same type, with some constantly being deleted and replaced, Java will run that program faster than a non-memory-pooled C implementation (of course the memory-pooled C implementation will be faster again).

    In fact, many of the benchmarks around that you see claiming Java is faster than C use an example program that creates and destroys objects in a tight loop. The C program is written with tons of calls to malloc/free, while the Java program simply reuses the same memory again and again without any system calls. These benchmarks are a bit misleading, as the C program isn't optimised with memory re-use, whereas the Java garbage collector tends to do that implicitly.
  • by Idiomatick ( 976696 ) on Monday April 05, 2010 @09:04PM (#31743734)
    Why couldn't it be applied at the compiler option level? A checkbox and recompile isn't so terrible. It could probably be done at the OS level but it'd be more of a pain.
  • by martin-boundary ( 547041 ) on Monday April 05, 2010 @09:31PM (#31743928)

    And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup.

    There's theoretical good and then there's practical good. A good rule of thumb these days is that RAM is the new disk, and most current and legacy software results in huge numbers of CPU stalls. If those stalls can be converted to useful work, even at 2:1 conversion, that's better than having the stalls.

  • by b4dc0d3r ( 1268512 ) on Monday April 05, 2010 @10:04PM (#31744100)

    Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted, or turn off all interrupts and switching for a few hundred cycles so it can do the allocation. Usually the runtime requests reserved pages far in excess of what it needs, and then doles out pieces, committing them as needed. You need 2k, so the runtime gets 4MB and commits 32k page(s). Next time you need more, then the runtime just returns more of the 32k block.

    The operating system has to lock its list for the runtime, and/or the runtime does the same for the program. Someone's locking something somewhere.

  • by kscguru ( 551278 ) on Monday April 05, 2010 @10:16PM (#31744154)

    And now I've read their paper [ncsu.edu]. Quick summary: (1) they do indeed speculatively pre-allocate heap blocks, and cache pre-allocated blocks per client thread. (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing. (3) They busy-loop the malloc()-thread because pthread_semaphore wakeups are too slow for them to see a performance gain (section 2.E.2-3).

    In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.

  • by Yvan256 ( 722131 ) on Monday April 05, 2010 @10:33PM (#31744272) Homepage Journal

    The Fivefold Mother [penny-arcade.com]

  • by mdf356 ( 774923 ) <mdf356@NOsPAM.gmail.com> on Monday April 05, 2010 @11:30PM (#31744512) Homepage

    but this approach requires a rewrite to make it usable

    A rewrite of part of libc, yes. Change the implementation of malloc(3) and link with -lpthread and you're pretty much done.

    I don't see how spinning off malloc(3) calls would help anything, but if there's unused CPUs then clearly free(3) can be done by another thread.

  • by Tenareth ( 17013 ) on Tuesday April 06, 2010 @12:18AM (#31744770) Homepage

    There are very few developers left in the US that even know what memory management is.

    People don't even try anymore.

  • by Chirs ( 87576 ) on Tuesday April 06, 2010 @12:37AM (#31744848)

    The thing that got my attention was the fact that once you offload the memory management onto the other core you can then do tracing, profiling, debugging and security analysis of the memory management portion (pre-zeroing, range analysis, double-free, etc.) with very little impact to the main thread because the additional work is done on the (otherwise mostly unused) memory management thread.

  • by Darinbob ( 1142669 ) on Tuesday April 06, 2010 @12:52AM (#31744896)
    I had a prof who referred to benchmarks and predictions as "guaranteed not to exceed" numbers.
  • by Anonymous Coward on Tuesday April 06, 2010 @02:19AM (#31745272)

    I don't see how spinning off malloc(3) calls would help anything, but if there's unused CPUs then clearly free(3) can be done by another thread.

    With compiler support there could be a significant benefit here. Suppose you have a function that looks like this:

    void fn(int array[], const unsigned size)
    {
            for (int i = 0; i < size; i++)
                    array[i] = array[i] * 3;
            int *newarray = calloc(size, sizeof(int));
            for (int i = 0; i < size; i++)
                    newarray[i] = array[i] + 5; // etc.
    }

    If you notice, the memory allocation is independent of the first loop. Suppose the compiler knows what malloc/calloc is and can rewrite this function internally like this:
    void fn(int array[], const unsigned size)
    {
            calloc_in_a_different_thread(size, sizeof(int));
            for (int i = 0; i < size; i++)
                    array[i] = array[i] * 3;
            int *newarray = get_result_of_the_async_calloc();
            for (int i = 0; i < size; i++)
                    newarray[i] = array[i] + 5; // etc.
    }

    That way the memory allocation is happening at the same time as the first loop is executing. The memory allocation becomes "free" (no pun intended) as long as there is enough independent code to execute concurrently with the malloc.

  • by Nagilum23 ( 656991 ) on Tuesday April 06, 2010 @02:52AM (#31745376) Homepage
    If I may quote: "Phkmalloc was written by Poul-Henning Kamp for FreeBSD in 1995-1996 and subsequently adapted by a number of operating systems, including NetBSD, OpenBSD, and several Linux distributions." Phkmalloc is the example implementation they used for the benchmarks.
  • by KiloByte ( 825081 ) on Tuesday April 06, 2010 @05:03AM (#31745744)

    Except that pre-allocating small chunks of expected size can be done much faster in-thread if you first allocate large chunks and then sub-allocate constant sized parts. That's what g_slice() from glib does.

    Replacing malloc() with g_slice() tends to improve allocation speed by insane factors -- I've seen cases where the speedup of allocations was 10x, and since the program in question was quite malloc-heavy, the overall speed was increased by over 100%.

  • by taniwha ( 70410 ) on Tuesday April 06, 2010 @05:45AM (#31745938) Homepage Journal
    I've worked with real multi-threaded apps that turned out to spend more than 50% of their time in malloc/new, free/delete, and the associated locks, largely due to the use of C++ string routines by people who didn't understand the single-threaded machinery going on behind the scenes. The most important thing to take away from this is that malloc/free are not cheap: they involve synchronization in multithreaded code (like stdio, and most people don't know that either), and they should be avoided like the plague in real-time code because of the risk of priority inversion.
  • by spacey ( 741 ) <spacey-slashdot. ... m ['ssr' in gap]> on Tuesday April 06, 2010 @10:44AM (#31748078) Homepage

    One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.

    This is interesting stuff, but if the goal is to not have to change source, isn't this sub-par? Hasn't the Boehm collector been tested as faster than using malloc/free forever? See http://www.drdobbs.com/cpp/184401632;jsessionid=IRGXEUGCDWGBJQE1GHOSKH4ATMY32JVN [drdobbs.com] for a trivial example (a paper at ftp.cs.boulder.edu is offline, I guess with the server for now).

