
Memory Management Technique Speeds Apps By 20%

Posted by kdawson on Monday April 05, 2010 @08:04PM from the rememberance-of-data-past dept.

Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.

This discussion has been archived. No new comments can be posted.

Beware the key term there: (Score:5, Insightful)

by Estanislao Martínez ( 203477 ) writes: on Monday April 05, 2010 @08:11PM (#31743272) Homepage

Beware the key term there: "up to."

- Re: (Score:2)
  
  by Jurily ( 900488 ) writes:
  
  OK, so they run memory allocation in a separate thread. What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?
  The idea sounds fun, but this approach requires a rewrite to make it usable, like most everything else out there.
  The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.
  - Re:Beware the key term there: (Score:4, Interesting)
    
    by Idiomatick ( 976696 ) writes: on Monday April 05, 2010 @09:04PM (#31743734)
    
    Why couldn't it be applied at the compiler option level? A checkbox and recompile isn't so terrible. It could probably be done at the OS level but it'd be more of a pain.
    
      - Re: (Score:3, Funny)
        
        by Idiomatick ( 976696 ) writes:
        
        I just assumed Windows needed the help more, my bad!
  - Re: (Score:3, Informative)
    
    by Anonymous Coward writes:
    
    The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.
    It would be nice if Unix/Linux had easier and better support for asynchronous I/O.
    Operating systems like VMS made all I/O asynchronous by default, with optional flags or function wrappers that would wait for an I/O to complete before returning if you really wanted synchronous I/O. You could even designate a user function to run asynchronously (without the drawbacks and limitations of Unix signals) whenever any specific I/O completed.
    Much simpler than Linux, where you have to use completely different and co
    - Re: (Score:3, Informative)
      
      by EvanED ( 569694 ) writes:
      
      Operating systems like VMS made all I/O asynchronous by default...
      This is mostly true in Windows too actually, given NT's strong VMS inspirations. From what I understand, drivers implement (only) asynchronous I/O calls, and the read/write system calls (NtReadFile and NtWriteFile) contain a specification of whether it should be asynchronous or synchronous. If synchronous, a higher level of the kernel handles that aspect itself, without the driver knowing anything about what's going on.
      (I think this is more-or
      - Re: (Score:3)
        
        by Jurily ( 900488 ) writes:
        
        Whatever the case may be, the idea certainly didn't spread into either (cross-platform?) APIs or application design. Qt, for example, offers no asynchronous file operations, and most applications I've seen do disk I/O in the GUI thread.
        It's easy to notice because everything grinds to a halt when the disk is thrashing. One would think we can do better in the age of supercomputers.
        
      - Re: (Score:3)
        
        by weicco ( 645927 ) writes:
        
        IIRC NT 5.1 and earlier didn't have asynchronous I/O canceling. That caused some problems even when I/O stuff was run asynchronously. But my memory is a bit flaky so I just can't remember what those problems were... Vista and 7 have asynchronous canceling as well.
      - Re: (Score:3, Informative)
        
        by Sun ( 104778 ) writes:
        
        Maybe at the API level. The API for asynchronous IO is there for every system call in Win32. What isn't there, however, is the use. I've read through the specs for doing "overlapped IO", and the conclusion is always the same - getting the semantics right is a pain. It's so much easier to create a GUI thread and a working thread, that no one I have ever seen bothered with it.
        Linux does have asynchronous IO, and it suffers from pretty much the same problem - getting the semantics right is difficult.
        Shachar
  - Re: (Score:3, Informative)
    
    by Anonymous Coward writes:
    
    Clearly you didn't read the paper.
    One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.
    The MM thread preallocates blocks the application is likely to ask for, so that when the application asks for its 300th small block for storing window coordinates or whatever, the memory manager thread can instantly say "here, I had that". It also batches deallocations and does them out-of-band, while the application continues running.
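
The preallocation idea described in this comment can be sketched as a small model. This is only an illustration under stated assumptions, not the paper's library: the class name `PreallocatingPool`, its `refill` step and the stock size of 4 are all hypothetical, and `refill` is called inline here where the described design would run it on the helper thread.

```python
from collections import defaultdict, deque


class PreallocatingPool:
    """Toy model: hand out blocks that were prepared before they were asked for."""

    def __init__(self, stock_per_size=4):
        self.stock_per_size = stock_per_size   # hypothetical tuning knob
        self.stock = defaultdict(deque)        # size -> blocks prepared in advance
        self.seen_sizes = set()                # sizes the application has asked for
        self.hits = 0
        self.misses = 0

    def refill(self):
        """Work the helper thread would do: top up stock for sizes seen so far."""
        for size in self.seen_sizes:
            while len(self.stock[size]) < self.stock_per_size:
                self.stock[size].append(bytearray(size))

    def alloc(self, size):
        """Fast path pops a prepared block; slow path allocates on the spot."""
        self.seen_sizes.add(size)
        if self.stock[size]:
            self.hits += 1
            return self.stock[size].popleft()
        self.misses += 1
        return bytearray(size)


pool = PreallocatingPool()
first = pool.alloc(16)                        # nothing prepared yet: slow path
pool.refill()                                 # helper predicts more 16-byte requests
later = [pool.alloc(16) for _ in range(3)]    # served from prepared stock
```

Only the first request misses; the following three are answered from stock, which is the "here, I had that" behaviour the comment describes.
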
    - Re: (Score:3, Interesting)
      
      by spacey ( 741 ) writes:
      
      One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.
      This is interesting stuff, but if the goal is to not have to change source, isn't this sub-par? Hasn't the Boehm collector been tested as faster than using malloc/free forever? See http://www.drdobbs.com/cpp/184401632;jsessionid=IRGXEUGCDWGBJQE1GHOSKH4ATMY32JVN [drdobbs.com] for a trivial example (a paper at ftp.cs.boulder.edu is offline, I guess with the server for now).
      -Peter
  - Re: (Score:3, Interesting)
    
    by mdf356 ( 774923 ) writes:
    
    but this approach requires a rewrite to make it usable
    A rewrite of part of libc, yes. Change the implementation of malloc(3) and link with -lpthread and you're pretty much done.
    I don't see how spinning off malloc(3) calls would help anything, but if there's unused CPUs then clearly free(3) can be done by another thread.
    - Re: (Score:3, Informative)
      
      by SanityInAnarchy ( 655584 ) writes:
      
      Actually, I can see how malloc would help, if you assume that they're always allocating small-ish amounts -- just keep a certain amount reserved, or don't free stuff instantly (in case it's about to be reallocated).
      However, all of this seems very much like it could be done either inside libc (as proposed) or in the kernel, without having to touch existing apps, at least on platforms like Linux where libc is shared.
      - Re: (Score:3, Interesting)
        
        by KiloByte ( 825081 ) writes:
        
        Except that pre-allocating small chunks of expected size can be done much faster in-thread if you first allocate large chunks and then sub-allocate constant sized parts. That's what g_slice() from glib does.
        Replacing malloc() with g_slice() tends to improve allocation speed by insane factors -- I've seen cases where the speedup of allocations was 10x, and since the program in question was quite malloc-heavy, the overall speed was increased by over 100%.
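
The "allocate large chunks, then sub-allocate constant sized parts" approach can be sketched roughly as follows. This is not GLib's implementation; the `Slab` class and its method names are hypothetical, and a `bytearray` stands in for the large chunk.

```python
class Slab:
    """Fixed-size slots carved from one large buffer, with a free list of offsets."""

    def __init__(self, slot_size, slot_count):
        self.slot_size = slot_size
        self.buffer = bytearray(slot_size * slot_count)   # the one big allocation
        # Offsets of free slots; popping from the end hands out low offsets first.
        self.free_offsets = list(range((slot_count - 1) * slot_size, -1, -slot_size))

    def alloc(self):
        """O(1): take a free slot and return its offset and a writable view."""
        if not self.free_offsets:
            raise MemoryError("slab exhausted")
        offset = self.free_offsets.pop()
        return offset, memoryview(self.buffer)[offset:offset + self.slot_size]

    def free(self, offset):
        """O(1): put the slot back on the free list for reuse."""
        self.free_offsets.append(offset)


slab = Slab(slot_size=32, slot_count=4)
a_off, a = slab.alloc()
b_off, b = slab.alloc()
a[:5] = b"hello"
slab.free(a_off)
c_off, c = slab.alloc()    # reuses the slot that was just freed
```

Because every slot has the same size, both operations are a single list push or pop with no searching or coalescing, which is where the speed of this style of allocator comes from.
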
  - Re:Beware the key term there: (Score:4, Insightful)
    
    by mswhippingboy ( 754599 ) writes: on Tuesday April 06, 2010 @12:21AM (#31744796)
    
    What you are missing (as are most of the posters so far) is that there is considerable overhead involved in the actual management of the memory in terms of keeping track of what memory is free or allocated. This is outside the issue of maintaining locks. Moving this management overhead to a separate thread allows the otherwise single threaded app to take advantage of additional cores without any code changes. This does not appear all that novel however as modern garbage collectors do this today.
    
  - Re: (Score:3, Informative)
    
    by AcidPenguin9873 ( 911493 ) writes:
    
    What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?
    The keys are speculative allocation and batch freeing. They decouple the actual allocation/deallocation that the system's memory management library performs (which may involve slow system calls into the kernel, even), from the malloc and free calls that the program makes. By decoupling the rest of the program thread from the memory allocation thread, the application then doesn't always have to wait for all the accounting and data structure manipulation that malloc and free do. Of course there are times
- Re:Beware the key term there: (Score:5, Insightful)
  
  by Spatial ( 1235392 ) writes: on Monday April 05, 2010 @10:02PM (#31744092)
  
  I like to mentally replace that with the actual meaning: "between 0 and".
  
  It could allow software applications to run between 0 and 20% faster!
  
  - Re:Beware the key term there: (Score:4, Insightful)
    
    by Georules ( 655379 ) writes: on Monday April 05, 2010 @10:05PM (#31744104)
    
    You might consider mentally replacing it with the sad reality that it might be between 0 and x faster AND it could also be infinitely slower.
    
  - Re: (Score:2)
    
    by gmhowell ( 26755 ) writes:
    
    Why limit yourself? With enough overhead from the memory management, you could get it to go some negative amount faster.
- Obligatory Penny Arcade (Score:3, Interesting)
  
  by Yvan256 ( 722131 ) writes:
  
  The Fivefold Mother [penny-arcade.com]
- Re: (Score:3, Interesting)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:3, Informative)
  
  by KeithIrwin ( 243301 ) writes:
  
  It's true that in this case we're looking at a max of about 20%, but we're also looking at an average of about 16-18% (I'm eye-balling from the graphs). There's one test in the benchmark suite which is almost entirely CPU-bound and it is only a few percentage points faster. This, of course, makes perfect sense as some real processes are CPU-bound and so should be included in the benchmark suite. But realistically, no allocation scheme can speed up a process which does almost no allocating by very much.
Just remember to be aware of multi PROCESSOR (Score:2)

by TheSunborn ( 68004 ) writes:

Sounds nice, but I hope the library also handles multiprocessor systems, where each processor has its own RAM block. You don't want one CPU to allocate memory that is then used by another CPU.
Do operating systems even have support for saying "These 2 threads should run on the same processor, but I don't care about which one"?
- Re: (Score:2)
  
  by WrongSizeGlass ( 838941 ) writes:
  
  What platform is this library available for? I didn't see it in the article.
  
  How much does this slow down an application that's running on a single CPU with a single core? Splitting memory allocation off into its own thread will negatively impact performance when running on many of the existing desktops out there.
  - Re: (Score:2, Troll)
    
    by sys.stdout.write ( 1551563 ) writes:
    
    wrapped in a [sic] ugly brown robe and a poorly draped orange sarong
    Ah, the Ubuntu color scheme!
    - Re: (Score:2)
      
      by Idiomatick ( 976696 ) writes:
      
      It is purple now you insensitive clod!
- Re: (Score:2)
  
  by jisatsusha ( 755173 ) writes:
  
  Windows does - SetThreadIdealProcessor() [microsoft.com].
    - Re: (Score:2)
      
      by GigaplexNZ ( 1233886 ) writes:
      
      "I want this thread to be run on whatever CPU that other thread runs without caring which core it is."
      I assume you meant something more along these lines?
    - processor affinity (Score:2)
      
      by SpaceLifeForm ( 228190 ) writes:
      
      Link [ibm.com]
- Re:Just remember to be aware of multi PROCESSOR (Score:5, Informative)
  
  by Mad Merlin ( 837387 ) writes: on Monday April 05, 2010 @10:20PM (#31744188) Homepage
  
  The type of system you're talking about is NUMA (Non-Uniform Memory Access), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now; whether it works is another question.
  
Nothing to see here.... (Score:5, Insightful)

by Ancient_Hacker ( 751168 ) writes: on Monday April 05, 2010 @08:11PM (#31743284)

Nothing to see here...
Moving malloc() to a separate thread does not do a thing for the putative word processor.
They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.
But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.

- Re:Nothing to see here.... (Score:5, Funny)
  
  by Anonymous Coward writes: on Monday April 05, 2010 @08:20PM (#31743354)
  
  new malloc()
  I see what you did there.
  
- Re:Nothing to see here.... (Score:5, Interesting)
  
  by Anonymous Coward writes: on Monday April 05, 2010 @08:28PM (#31743448)
  
  Exactly, and they are even comparing it to the old and relatively slow Doug Lea allocator.
  If you want to test a new memory allocator, the benchmarks these days are the Hoard allocator and the TCMalloc allocator from Google. These alone will give you more than the 20% speedup mentioned in the article.
  However, that isn't the end of the story. There are proprietary allocators, like the Lockless [locklessinc.com] memory allocator, that are about twice as fast as the older allocators which aren't optimized for multi-core machines.
  
  - Re:Nothing to see here.... (Score:4, Insightful)
    
    by wealthychef ( 584778 ) writes: on Monday April 05, 2010 @11:28PM (#31744484)
    
    But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.
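
The arithmetic in this comment is Amdahl's law applied to the allocator. A minimal sketch, with hypothetical function names, that reproduces the 5% case:

```python
def time_saved(fraction_in_malloc, malloc_speedup):
    """Fraction of total run time saved when only the malloc part gets faster."""
    return fraction_in_malloc * (1 - 1 / malloc_speedup)


def overall_speedup(fraction_in_malloc, malloc_speedup):
    """Amdahl's law: whole-program speedup from accelerating one part."""
    return 1 / (1 - time_saved(fraction_in_malloc, malloc_speedup))


# The case from the comment: 5% of run time in malloc(), malloc() made 2x faster.
saved = time_saved(0.05, 2)
speedup = overall_speedup(0.05, 2)
print(f"time saved: {saved:.1%}, overall speedup: {speedup:.3f}x")
```

Plugging in a larger allocation share shows why the benefit depends so heavily on the workload: the same 2x allocator helps a program that spends half its time allocating ten times as much as one that spends 5%.
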
    
    - Re: (Score:3, Insightful)
      
      by headLITE ( 171240 ) writes:
      
      A large amount of malloc()/free() calls is something very typical of server applications that handle many concurrent requests. In this scenario, the problem is made worse by the locking used in many traditional implementations. Don't underestimate that.
      This is becoming more and more of a problem in client applications as well. Thanks to object orientation, many modern applications are little more than endless streams of created and subsequently destroyed objects; and in many modern languages this happens im
    - Re: (Score:3, Informative)
      
      by julesh ( 229690 ) writes:
      
      But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.
      Average is about 15%. Many programs spend nearly 50% of time in memory allocation.
    - Re: (Score:3, Interesting)
      
      by taniwha ( 70410 ) writes:
      
      I've worked with real multi-threaded apps that turned out to use more than 50% of their time in malloc/new free/delete and the associated locks - largely due to the use of C++ string routines by people who didn't understand the single threadedness that was going on behind the scenes. The most important thing to take away from this is that malloc/free are not cheap, they involve synchronization in multithreaded code (like stdio and most people don't know that either). (and should be avoided like the plague
- Re:Nothing to see here.... (Score:5, Informative)
  
  by kscguru ( 551278 ) writes: on Monday April 05, 2010 @08:53PM (#31743670)
  
  Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.
  With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if it's not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient, there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place. Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.
  And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.
  Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...
  
  - Re: (Score:3, Insightful)
    
    by AuMatar ( 183847 ) writes:
    
    Wouldn't it be rather trivial to write a lockless malloc? Just have every thread allocate its own memory and maintain its own free list- problem solved.
    - Re:Nothing to see here.... (Score:5, Interesting)
      
      by b4dc0d3r ( 1268512 ) writes: on Monday April 05, 2010 @10:04PM (#31744100)
      
      Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted, or turn off all interrupts and switching for a few hundred cycles so it can do the allocation. Usually the runtime requests reserved pages far in excess of what it needs, and then doles out pieces, committing them as needed. You need 2k, so the runtime gets 4MB and commits 32k page(s). Next time you need more, then the runtime just returns more of the 32k block.
      The operating system has to lock its list for the runtime, and/or the runtime does the same for the program. Someone's locking something somewhere.
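
The two positions in this exchange can be reconciled in a sketch: each thread keeps its own free list, and a lock is taken only when that list runs dry and has to be refilled from a shared source. The names `Arena` and `ThreadCachedAllocator` and the refill size of 32 are assumptions made for illustration, not any real allocator's design.

```python
import threading


class Arena:
    """Shared source of blocks; every refill takes the lock once."""

    def __init__(self, block_size=64, refill_count=32):
        self.block_size = block_size
        self.refill_count = refill_count
        self.lock = threading.Lock()
        self.lock_acquisitions = 0

    def refill(self):
        with self.lock:
            self.lock_acquisitions += 1
            return [bytearray(self.block_size) for _ in range(self.refill_count)]


class ThreadCachedAllocator:
    """Each thread allocates from its own free list and only locks to refill."""

    def __init__(self, arena):
        self.arena = arena
        self.local = threading.local()

    def _free_list(self):
        if not hasattr(self.local, "free_list"):
            self.local.free_list = []
        return self.local.free_list

    def alloc(self):
        free_list = self._free_list()
        if not free_list:
            free_list.extend(self.arena.refill())   # the only locked step
        return free_list.pop()

    def free(self, block):
        self._free_list().append(block)             # no lock: list is per-thread


arena = Arena()
allocator = ThreadCachedAllocator(arena)


def worker():
    for _ in range(100):
        allocator.free(allocator.alloc())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Four threads perform 400 allocate/free pairs between them, yet the shared lock is taken only once per thread. Someone is still locking something somewhere, as the comment says, but rarely.
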
      
      - Re: (Score:2)
        
        by Tenareth ( 17013 ) writes:
        
        If each thread allocates its own memory then it is just returning to LWP, which was good a while back, but threading should try to avoid allocating its own memory except in a few specific instances.
        However, based on this thread most people don't know how CPUs work.
  - Re: (Score:3, Interesting)
    
    by martin-boundary ( 547041 ) writes:
    
    And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup
    
    There's theoretical good and then there's practical good. A good rule of thumb these days is that RAM is the new disk, and most current and legacy software results in huge numbers of CPU stalls. If those stalls can be converted to useful work, even at 2:1 conversion, that's better than having the stalls.
  - Re:Nothing to see here.... (Score:5, Interesting)
    
    by kscguru ( 551278 ) writes: on Monday April 05, 2010 @10:16PM (#31744154)
    
    And now I've read their paper [ncsu.edu]. Quick summary: (1) they do indeed speculatively pre-allocate heap blocks, and cache pre-allocated blocks per client thread. (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing. (3) They busy-loop the malloc()-thread because pthread_semaphore wakeups are too slow for them to see a performance gain (section 2.E.2-3).
    In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.
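
Point (2) of the summary, asynchronous free() processed in batches, can be sketched with a worker thread. Several things here are assumptions rather than the paper's design: `queue.Queue` uses locks internally and the worker blocks while waiting instead of spinning, whereas the summary describes a busy-looping helper. The names `free_worker` and `_STOP` are hypothetical.

```python
import queue
import threading

BATCH_SIZE = 200          # roughly the batch size mentioned in the summary
_STOP = object()          # sentinel telling the worker to finish


def free_worker(pending, batch_sizes):
    """Drain queued blocks and 'release' them one batch at a time."""
    batch = []
    while True:
        item = pending.get()
        if item is _STOP:
            break
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            batch_sizes.append(len(batch))   # stand-in for the real bulk free
            batch = []
    if batch:
        batch_sizes.append(len(batch))       # flush the final partial batch


pending = queue.Queue()
batch_sizes = []
worker = threading.Thread(target=free_worker, args=(pending, batch_sizes))
worker.start()

for _ in range(450):
    pending.put(bytearray(16))   # the application's free() just enqueues
pending.put(_STOP)
worker.join()
```

From the application's side, free() costs one enqueue; the bookkeeping happens later on the other thread, 200 blocks at a time.
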
    
    - Re:Nothing to see here.... (Score:4, Funny)
      
      by Angst Badger ( 8636 ) writes: on Tuesday April 06, 2010 @01:16AM (#31744982)
      
      This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers.
      On the contrary, this opens up whole new possibilities for MS-DOS!
      
    - Re:Nothing to see here.... (Score:4, Informative)
      
      by Anonymous Coward writes: on Tuesday April 06, 2010 @03:28AM (#31745482)
      
      They block the thread (by spinning, not sleeping) that calls malloc() while the allocation request is serviced by the server thread. The server thread is not busy-looping; it is signalled when a request is issued. The combination of pre-allocation and the lockless server protocol means that the probability of a thread needing to be blocked in the first place is very low, and if it is, the lock will be held for a very short time. And for short periods of time spinning is more efficient than the whole signal/goto sleep/wakeup dance.
      It's not a cute trick, it's a way of reducing latency (used in thousands of places in fast code paths in most operating systems), and your claims about CPU utilisation are mostly false. It's true that it incurs some penalty for the worst case scenario.
      Also the paper says "...especially for sequential applications which cannot easily benefit from the multicore architecture otherwise"
      In sequential apps the overhead of the spin locks is much less anyway, because there is less internal concurrency in the process.
      
    - Re: (Score:3, Informative)
      
      by Carewolf ( 581105 ) writes:
      
      In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not i
      - Re: (Score:3, Insightful)
        
        by julesh ( 229690 ) writes:
        
        When used for locking it is called spinning, not busy-looping, so stop your silly doomsday speak and grow a brain. The Linux kernel itself uses spinning more often than locking, because it is much faster and uses fewer CPU cycles. You have busy-looping thousands of times each second when the kernel synchronizes threads and hardware. This is a no-go in application design, but a really common and efficient trick in low-level libraries and routines, and it will save you CPU cycles and energy compared to semap
      - Re:Nothing to see here.... (Score:4, Informative)
        
        by kscguru ( 551278 ) writes: on Tuesday April 06, 2010 @01:12PM (#31750238)
        
        I should perhaps note that I do implement low-level libraries for an extremely reputable company as a day job; I'm familiar with low-level lock implementations both in kernel and in userlevel on Linux, Windows, and MacOS, and exactly how those implementations balance spinning versus blocking.
        The Linux kernel preference for spinlocks dates from years ago, when the whole kernel ran under the BKL and was non-preemptible anyway, so you couldn't use blocking locks. When the BKL came out, all locks were made spinlocks to maintain correctness (and the -rt patchset started up, doing a conversion). The default implementation (still in use today by anything except the -rt patchset!) disables interrupts while any spinlock is held, and thus assumes the only thing holding the lock is another core.
        In contrast, Solaris and Windows (and I think MacOSX, though I would have to check my references) use a mix of spinlocks and adaptive locks - spinlocks for use within interrupt handlers, and adaptive locks for everywhere else. Good pthread implementations (glibc included) use adaptive locks - which means the pthread implementation this paper declared too slow to use ALREADY spins ~1000 cycles before blocking. The canonical rule here is that an adaptive lock spins for the same amount of time it would take for a block/wakeup cycle, then blocks; this is guaranteed to be within a factor of 2 of optimal in all cases, which is the best overall lower bound you can possibly get. (Yes, Linux kernel is behind the times; they are slowly getting better, and when eventually the -rt patchset gets merged, Linux will have finally caught up. Sorry, Linux fanboys.)
        Spinning versus blocking is a tradeoff. The research paper manages to extract all the gains from the "spin forever" side of the tradeoff without ever admitting the drawbacks (one full CPU core wasted).
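
The "within a factor of 2 of optimal" rule for adaptive locks can be checked with a toy cost model. This is a simplification that assumes a single cost number per strategy (cycles spent before the lock is obtained) and a hypothetical block/wakeup cost of 1000 cycles; it does not model any real pthread implementation.

```python
def spin_forever_cost(wait, block_cost):
    return wait                      # burns cycles for the whole wait


def block_immediately_cost(wait, block_cost):
    return block_cost                # always pays the sleep/wakeup price


def adaptive_cost(wait, block_cost):
    """Spin for up to block_cost cycles, then give up and block."""
    if wait <= block_cost:
        return wait                  # lock arrived while still spinning
    return block_cost + block_cost   # wasted spin plus the block/wakeup


def optimal_cost(wait, block_cost):
    return min(wait, block_cost)     # what a clairvoyant lock would pay


BLOCK_COST = 1000   # hypothetical cycles for one block/wakeup round trip
worst_ratio = max(
    adaptive_cost(w, BLOCK_COST) / optimal_cost(w, BLOCK_COST)
    for w in range(1, 100_000)
)
```

In this model the adaptive strategy never pays more than twice the clairvoyant optimum, while spinning forever has no such bound: its cost grows with the wait.
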
        
Wow, this is pretty clever (Score:4, Interesting)

by Omnifarious ( 11933 ) * writes: <eric-slash@omnifa r i o u s . o rg> on Monday April 05, 2010 @08:20PM (#31743356) Homepage Journal

I wish I'd thought of it.
Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
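
The property this comment relies on is that in CTR mode each keystream block depends only on the key, the nonce and a counter, so blocks can be computed ahead of time and in any order. The sketch below illustrates that structure only. It is NOT AES and not secure: a SHA-256 hash stands in for the block cipher because Python's standard library has no AES, and all function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per keystream block, as in AES


def keystream_block(key, nonce, counter):
    """Stand-in for AES_encrypt(key, nonce || counter); NOT real AES, not secure."""
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()[:BLOCK]


def keystream(key, nonce, n_blocks, workers=1):
    """Each block depends only on its counter, so any order of computation works."""
    counters = range(n_blocks)
    if workers == 1:
        blocks = [keystream_block(key, nonce, c) for c in counters]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            blocks = list(pool.map(lambda c: keystream_block(key, nonce, c), counters))
    return b"".join(blocks)


def ctr_xor(key, nonce, data, workers=1):
    """Encrypt or decrypt: XOR the data with the precomputed keystream."""
    n_blocks = (len(data) + BLOCK - 1) // BLOCK
    stream = keystream(key, nonce, n_blocks, workers)
    return bytes(d ^ s for d, s in zip(data, stream))


key, nonce = b"k" * 16, b"n" * 8
message = b"multi-gigabyte transfers over ssh, in miniature"
ciphertext = ctr_xor(key, nonce, message, workers=4)
```

The example shows that the parallel and sequential keystreams are identical, not that threads make it faster: in CPython, small hash calls like these will not run in parallel.
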

- Re:Wow, this is pretty clever (Score:4, Informative)
  
  by nxtw ( 866177 ) writes: on Monday April 05, 2010 @08:24PM (#31743402)
  
  If you want faster AES, just upgrade your CPU [wikipedia.org].
  
  - Re: (Score:2)
    
    by Omnifarious ( 11933 ) * writes:
    
    Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)
    - Re:Wow, this is pretty clever (Score:4, Insightful)
      
      by nxtw ( 866177 ) writes: on Monday April 05, 2010 @09:09PM (#31743770)
      
      Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)
      Are your storage and network devices that fast?
      
      - Re: (Score:2)
        
        by x2A ( 858210 ) writes:
        
        "Are your storage and network devices that fast?"
        Depends what else the system is doing, surely. Having several clients connected to a server, you want to free up the processor as much as possible for servery duties, any savings you make go to those when running full pelt, or convert to energy savings when you're not.
        
      - Re: (Score:2)
        
        by nxtw ( 866177 ) writes:
        
        Depends what else the system is doing, surely. Having several clients connected to a server, you want to free up the processor as much as possible for servery duties, any savings you make go to those when running full pelt, or convert to energy savings when you're not.
        What does that have to do with anything? AES-NI enables AES implementations to encrypt and decrypt much faster than any network and storage system I've ever used can provide data. Parallelizing AES-NI would save energy only in the case when
      - Re: (Score:2)
        
        by Omnifarious ( 11933 ) * writes:
        
        *sigh* And... (I forgot to put this in) Intel's new instructions only make things twice as fast with CBC mode because CBC mode can't be pipelined. CBC mode requires the results of the previous operation before doing the next.
        That also implies that if you're going to be using them to increase the speed of CTR mode you are best doing several blocks before you switch to doing something non-AES related. That also argues for pre-computing blocks in CTR mode. So really, the whole pre-computation thing shou
      - Re: (Score:2)
        
        by nxtw ( 866177 ) writes:
        
        Sucks for you.
  - Re: (Score:2)
    
    by mirix ( 1649853 ) writes:
    
    Of course, you need software that uses the instructions as well.
    That said, I've got one of the Via chips with hardware RNG on it, and once I loaded the module for it, /dev/random just spews data. It's an insane improvement over normal (software) /dev/random. I believe it has some other sorts of encryption friendly features, but I haven't played with it much, yet.
- Re:Wow, this is pretty clever (Score:5, Informative)
  
  by macemoneta ( 154740 ) writes: on Monday April 05, 2010 @09:07PM (#31743760) Homepage
  
  Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
  That's already implemented in the high performance ssh patch, available here [psc.edu]. Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.
  I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.
  
- Re: (Score:2)
  
  by sowth ( 748135 ) * writes:
  
  I was thinking Soekris Engineering's vpn accelerator card [soekris.com] would help, but it appears to only be able to do 250 Mbps. (You wanted 1 gigabit/s, right?)
  That card is really old too. I first read about it probably 10 years ago. I don't think it has changed in that time... I wonder if someone makes a faster accelerator? Then again, what about the GPU? Has anyone tried encryption with GPUs before? They've done other supercomputing tasks. A quick search says they have. [nvidia.com]
20%?! (Score:5, Insightful)

by temojen ( 678985 ) writes: on Monday April 05, 2010 @08:21PM (#31743372) Journal

If most programs are spending 20% of their time on memory management, something is wrong.

- Re: (Score:2)
  
  by kisielk ( 467327 ) writes:
  
  From the paper:
  Previous studies show that some C programs spend up to one third of their execution time in dynamic memory management routines such as malloc and free
  You can check the PDF for the cited studies.
- Re: (Score:3, Insightful)
  
  by naasking ( 94116 ) writes:
  
  Not at all. 20% is a very typical overhead for dynamic memory management. Did you think malloc/free costs nothing?
  - Re: (Score:2)
    
    by SpazmodeusG ( 1334705 ) writes:
    
    I don't think that's what he was getting at. I think he means you can avoid that much malloc/free-ing.
    Memory pooling and allocating outside of tight loops comes to mind.
    - Re:20%?! (Score:5, Informative)
      
      by naasking ( 94116 ) writes: <naasking@[ ]il.com ['gma' in gap]> on Monday April 05, 2010 @09:30PM (#31743926) Homepage
      
      I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free [umass.edu].
      
  - Re: (Score:2)
    
    by AuMatar ( 183847 ) writes:
    
    And OOP makes things worse in this area- it tends to have a lot of small, short life objects that need to be constructed and destructed. Java is particularly bad at this due to decisions like immutable strings. Those extra object allocations add up quickly.
    - Re: (Score:2)
      
      by naasking ( 94116 ) writes:
      
      Memory management overhead in GC'd languages is typically around 30%, so not much worse than malloc/free which averages around 15-20%. You gain a lot of productivity for that 10-15% overhead tradeoff though.
      - Re: (Score:2)
        
        by AuMatar ( 183847 ) writes:
        
        I've personally never seen much if any productivity gain from GC. If anything I've seen a loss- I find that managing memory helps me make better designs, and memory ownership problems are almost always the first sign that there's a major design flaw. Quite frankly memory bugs are rare among decent programmers- I find one a year or so in most places I've worked, and those usually come about due to someone trying to be a bit too clever minimizing memcpy calls. It doesn't even make the top 20 list for bug
        
        Re: (Score:2)
        
        by naasking ( 94116 ) writes:
        
        You significantly understate the complexities of manual memory management. Any two programs with compatible interfaces written in a GC'd language will compose without leaks, where any two programs with compatible interfaces written with manual memory management will not necessarily compose. This requires diverting valuable resources from development to analyzing the safety of any composition of non-composable programs.
        
        Re: (Score:2)
        
        by AuMatar ( 183847 ) writes:
        
        No, I actually write programs in C++ as a career. Memory management is trivial. It just doesn't cause bugs, unless you hire really bad programmers. The number of bugs it does cause are trivial- under 1% of bugs are due to it. And I say this with over a decade of experience coding.
      - Re: (Score:2)
        
        by Nadaka ( 224565 ) writes:
        
        Besides that, Java running with optimized options under HotSpot can beat C++ at creating/destroying objects (vs. C++'s alloc/malloc) by up to 4 times. In some cases that can push Java to execute faster than C++ (well-crafted C with structs would still beat them both, though).
      - Re: (Score:2)
        
        by dgatwood ( 11270 ) writes:
        
        Double the overhead is not much worse?
        
        Re: (Score:2)
        
        by naasking ( 94116 ) writes:
        
        By doubling the memory management overhead in the general case, you now get to work with pervasive immutable data, generally not worry about memory leaks, and dramatically simplify the semantics of interface boundaries.
  - Re: (Score:2)
    
    by Tenareth ( 17013 ) writes:
    
    Yes, most modern programmers do think that way based on most of the code I've seen in the past 10 years.
  - Re: (Score:3, Insightful)
    
    by RAMMS+EIN ( 578166 ) writes:
    
    ``20% is a very typical overhead for dynamic memory management. Did you think malloc/free costs nothing?''
    Many people actually seem to think that, and that only automatic memory management is costly. Out in the real world, of course, managing memory costs resources no matter how you do it, and you can sometimes make huge performance gains by optimizing it. I've seen percentages of time spent on memory management anywhere from under 1% to over 99% in real programs. As always: measure, don't guess.
  - Re:20%?! (Score:5, Funny)
    
    by guyminuslife ( 1349809 ) writes: on Tuesday April 06, 2010 @02:52AM (#31745372)
    
    I was aware that malloc() had a price tag attached, but free()? That's misleading advertising.
    
Might be particularly applicable to Java (Score:2, Interesting)

by Anonymous Coward writes:

Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.
- Re: (Score:3, Informative)
  
  by binarylarry ( 1338699 ) writes:
  
  Actually, Java already does something very similar to this: http://en.wikipedia.org/wiki/Java_Memory_Model [wikipedia.org]
- Re:Might be particularly applicable to Java (Score:5, Interesting)
  
  by SpazmodeusG ( 1334705 ) writes: on Monday April 05, 2010 @08:44PM (#31743584)
  
  Actually I've found the opposite. Java tends to be really good at transparent memory re-use. From experience, if I have ~1,000,000 objects of the same type with some constantly being deleted and replaced, Java will run that program faster than a non-memory-pooled C implementation (of course the memory-pooled C implementation will be faster again).
  
  In fact many of the benchmarks around that you see claiming Java is faster than C will use an example of a program that creates and destroys objects in a tight loop. The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.
  
  - Re: (Score:2)
    
    by Nadaka ( 224565 ) writes:
    
    I would assume you are running hotspot with the server parameter? Anything else you do to bump java performance?
  - Re: (Score:2)
    
    by buchner.johannes ( 1139593 ) writes:
    
    The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.
    What would the results be when using boehmgc?
    - Re: (Score:2)
      
      by SpazmodeusG ( 1334705 ) writes:
      
      I'll get around to trying Boehm GC one day. I'm pretty sure I'd see a speedup over a naive implementation (bulk allocations and frees are always faster than allocating in a loop), but it wouldn't be as good as an implementation that uses a memory pool (millions of the same type of object for a DMC compression algorithm; it's essentially made to be memory pooled).
      Similar to the Java scenario.
- Re: (Score:2)
  
  by glwtta ( 532858 ) writes:
  
  Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules.
  
  That's not my understanding of how Java works. In fact, I can't see any benefit from this approach in Java at all: new object allocation is extremely fast (trivial, really) and heap compacting / garbage collection is already parallelized, and much more optimized than what malloc/free can do.
  
  I may be wrong about this, but the on
- Re: (Score:2)
  
  by radish ( 98371 ) writes:
  
  Java does basically no mallocs. Obviously there's an initial malloc to allocate the heap to the JVM but after that it's all managed by the JVM itself, and it's been demonstrated as being much faster than a traditional malloc/free approach [ibm.com]. Assuming you set your Xms and Xmx sizes correctly the system malloc implementation is basically irrelevant to Java execution speed.
  - Re: (Score:2)
    
    by Azarael ( 896715 ) writes:
    
    And how does this situation differ other than the fact that the alloc/free operations are done local to the JVM instead of making system calls? The fact that the JVM is doing the work doesn't magically make memory management easier.
    The other thing that I'm skeptical about is that the article seems to be contradicted by a more recent paper by the author that they are referencing (see Zorn http://portal.acm.org/citation.cfm?id=582419.582421 [acm.org]). In the newer paper, Zorn et al. say that custom allocators are le
Uhm, isn't this just garbage collection? (Score:2)

by GWBasic ( 900357 ) writes:

Uhm, so what's the big deal? .Net's garbage collector runs in its own thread.
You can malloc it but you can't use it (Score:2, Insightful)

by lordlod ( 458156 ) writes:

The article(s) are very scarce on details but it seems like the gains will be limited in most applications. Fundamentally you have to block until the malloc has finished before you can use it. So it helps if you malloc well ahead of time, but not if you malloc as you need it.
A common simplified structure is:
malloc memory
use memory
free memory
With these new innovations you get:
async malloc memory
block until malloc finishes
use memory
async free memory
And free shouldn't take a noticeable amount of ti
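To make the structure above concrete, here is a minimal sketch of the one case where the async version can win: the request for the next buffer is issued early, so it overlaps with work on the current one. All names (`helper`, `alloc`, `release`, `process`) are hypothetical, a `bytearray` stands in for malloc'd memory, and this is not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "memory management thread": one worker that serves requests in order.
helper = ThreadPoolExecutor(max_workers=1)


def alloc(nbytes: int) -> bytearray:
    """Stands in for malloc; runs on the helper thread."""
    return bytearray(nbytes)


def release(buf: bytearray) -> None:
    """Stands in for free; runs on the helper thread."""
    buf.clear()


def process(n_items: int, size: int) -> int:
    """Use n_items buffers of `size` bytes (size >= 1), prefetching the next one."""
    total = 0
    pending = helper.submit(alloc, size)            # async malloc for item 0
    for i in range(n_items):
        buf = pending.result()                      # block until malloc finishes
        if i + 1 < n_items:
            pending = helper.submit(alloc, size)    # ask for the next buffer early
        buf[0] = i % 256                            # use memory
        total += len(buf) + buf[0]
        helper.submit(release, buf)                 # async free, do not wait for it
    return total
```

If the allocation is requested only at the moment it is needed, `pending.result()` blocks for the full cost and nothing is gained, which is exactly the objection raised here.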
- Re:You can malloc it but you can't use it (Score:4, Informative)
  
  by kisielk ( 467327 ) writes: on Monday April 05, 2010 @09:43PM (#31744000)
  
  The PDF of the paper has all the details. The article is just fluff.
  
It's programmers that need parallelization (Score:5, Insightful)

by w0mprat ( 1317953 ) writes: on Monday April 05, 2010 @08:46PM (#31743596)

Because we learnt to program for a single-threaded core with its single processing pipeline since way back, using high-level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh and we all know how many programmers are 'get off my lawn' types, myself included.

If I still coded much anymore it would drive me to drink.

- Re: (Score:3, Funny)
  
  by __aasqbs9791 ( 1402899 ) writes:
  
  ...If I still coded much anymore it would drive me to drink.
  Maybe that's my problem? If I started drinking maybe I could handle it [programming for other people] again.
Does it matter anymore? (Score:3, Interesting)

by Tenareth ( 17013 ) writes: on Tuesday April 06, 2010 @12:18AM (#31744770) Homepage

There are very few developers left in the US that even know what memory management is.
People don't even try anymore.

- Re: (Score:3, Insightful)
  
  by jasmusic ( 786052 ) writes:
  
  Those developers can hold the rest of the software industry hostage for mad income. OS kernels don't write themselves.
More useful is the inline protection algorithms (Score:3, Interesting)

by Chirs ( 87576 ) writes: on Tuesday April 06, 2010 @12:37AM (#31744848)

The thing that got my attention was the fact that once you offload the memory management onto the other core you can then do tracing, profiling, debugging and security analysis of the memory management portion (pre-zeroing, range analysis, double-free, etc.) with very little impact to the main thread because the additional work is done on the (otherwise mostly unused) memory management thread.
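A small sketch of that idea, under my own assumptions: the main thread only drops (kind, handle) events onto a queue, and a helper thread does the bookkeeping needed to spot a double free. `CheckerThread` and its methods are made-up names, and integer handles stand in for pointers.

```python
import queue
import threading


class CheckerThread:
    """Helper thread that audits alloc/free events away from the main thread."""

    def __init__(self):
        self.events = queue.Queue()
        self.live = set()    # handles currently considered allocated
        self.errors = []     # problems found by the checker
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def record(self, kind, handle):
        """Called by the main thread; just an enqueue, so it stays cheap."""
        self.events.put((kind, handle))

    def _run(self):
        while True:
            kind, handle = self.events.get()
            if kind == "stop":
                break
            if kind == "alloc":
                self.live.add(handle)
            elif handle in self.live:       # a valid "free"
                self.live.remove(handle)
            else:                           # freed twice, or never allocated
                self.errors.append(f"double or invalid free of {handle}")

    def close(self):
        """Drain the queue, stop the helper, and return what it found."""
        self.events.put(("stop", None))
        self._thread.join()
        return self.errors
```

The catch with this design is that errors are reported after the fact rather than at the faulting call, which is fine for tracing and profiling but weaker as a security check.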

As usual Smalltalk's been there done that (Score:3, Informative)

by itsybitsy ( 149808 ) * writes: on Tuesday April 06, 2010 @01:50PM (#31750984)

Not much to see in the article. Move along.
IBM (not Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).
One problem is that when you run out of memory space the application native threads (many in Smalltalk/MT) are blocked waiting for the one garbage collection thread to catch up. It all depends on how much new memory is allocated relative to how much is freed up. They have a solution and are working to implement it. It's likely to use multiple native threads for the gc, balancing out the freeing with the consumption. Another solution is to have each worker thread also switch into a gc thread in cases when it's starved for memory.
Another solution is to use memory structures that don't require garbage collection. In other words, REUSE rather than RECYCLE.
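For what "reuse rather than recycle" can look like in practice, here is a minimal free-list pool, assuming fixed-size buffers. `BufferPool` is a hypothetical name; it is a sketch of the general pooling idea, not of anything in Smalltalk/MT or the paper.

```python
class BufferPool:
    """Fixed-size buffers kept on a free list and handed out again."""

    def __init__(self, size: int):
        self.size = size
        self._free = []      # buffers waiting to be reused
        self.created = 0     # how many real allocations were needed

    def acquire(self) -> bytearray:
        if self._free:
            return self._free.pop()      # reuse: no new allocation
        self.created += 1
        return bytearray(self.size)      # pool empty: allocate for real

    def release(self, buf: bytearray) -> None:
        # Contents are left as they are; the next user is expected to overwrite them.
        self._free.append(buf)
```

A loop that acquires and releases one buffer a million times then costs a single real allocation; the trade-off is that the pool holds on to its peak memory until it is dropped.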

- Re: (Score:3, Funny)
  
  by 14erCleaner ( 745600 ) writes:
  
  The speedup comes from using a memory-allocation library (PHKMalloc) that does extensive and expensive checking to avoid programmer errors, then basically hides most of its overhead in the second thread (so that the allocation thread would be mostly doing sanity checking). For most programs, this probably won't help performance any. It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.
  I once submi
  - Re: (Score:2)
    
    by beakerMeep ( 716990 ) writes:
    
    He didn't reject it, he just moved it to another processor.
  - Re: (Score:2)
    
    by droopycom ( 470921 ) writes:
    
    It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.
    Actually it's a very interesting trick.
    There are so many slow algorithms in the wild that having a simple method to speed them all up would be very useful.
    Ah, yes, this would not be a theoretical breakthrough, but a very practical one indeed... if their claims can be substantiated, of course.
- Re: (Score:3, Insightful)
  
  by Zironic ( 1112127 ) writes:
  
  It's a performance gain because it's extremely rare that all your cores are maxed out at once, if you can distribute the computing power more evenly it's a performance gain in most circumstances even if the net computing power required increases.
- Re:free() is probably more parallizable than mallo (Score:5, Informative)
  
  by JessGras ( 953965 ) writes: on Monday April 05, 2010 @09:11PM (#31743782)
  
  Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf [ncsu.edu], they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
  Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
  
  - Re: (Score:2)
    
    by droopycom ( 470921 ) writes:
    
    So basically, they apply garbage collection techniques to regular malloc/free program
    Not bad, all things considered...
    - Re: (Score:2)
      
      by SirSlud ( 67381 ) writes:
      
      It's clear this is widely available - the devil is in the details. You can always abstract over a blocking system when you have threads/cores available. Those who work on fixed platforms like me kinda have to look to making the threading/locking more efficient. If you have the option of spending money on more potential parallel computing power for less, and can move up the chain, then solutions like this are more valuable.
      It's interesting watching people talk about this kinda stuff, because ultimately the v
