Forgot your password?
typodupeerror
Software

Knuth Got It Wrong 298

Posted by kdawson
from the to-be-or-not-to-be-heap dept.
davecb writes "Think you've mastered the art of server performance? Think again. Poul-Henning Kamp, in an article at ACM Queue, finds an off-by-ten error in btrees, because they fail to take virtual memory into account. And he solves the problem in the open source 'Varnish' HTTP accelerator, for all of us to see and use."
This discussion has been archived. No new comments can be posted.

Knuth Got It Wrong

Comments Filter:
  • Nope, he didn't (Score:5, Informative)

    by siddesu (698447) on Monday June 14, 2010 @07:29PM (#32572714)
    Knuth's analysis is valid in the framework of his assumptions, and what is described in the linked article has been known as "cache oblivious b-tree" for not so short time.
  • by Anonymous Coward on Monday June 14, 2010 @07:34PM (#32572762)

    link [xkcd.org]

  • by DragonWriter (970822) on Monday June 14, 2010 @07:47PM (#32572904)

    Yes it's true. In some real-world applications an algorithm encounters it's worst case running time more than the predicted theoretical average case running time.

    That's actually not the issue.

    The issue is that in the real world, the assumptions underlying the calculation of algorithmic complexity may not hold. For instance, Knuth's analysis that the author of the article here holds to be misleading (not, as the Slashdot title suggests, "wrong") calculates the complexity based on the assumption of ideal random access memory, that is, memory for which all accesses are equal cost.

    In real-world computers using caches (as pretty much all real-world computers do, often at many different levels) that assumption does not hold -- access to some parts of the memory address space are much more expensive than accesses to other parts of the address space, and which parts of the address space are expensive changes over time (and how it changes over time potentially depends on the caching strategy used at every level lower than the level at which the algorithm is implemented.)

    This means that algorithms in the real world can scale worse than their theoretical "worst-case", if that theoretical worst-case scaling is based on the assumption of constant memory access cost, since that assumption does not hold in the real world.

  • by Anonymous Coward on Monday June 14, 2010 @07:50PM (#32572936)

    Poul-Henning is definitely not a "doof". He's single-handedly responsible for a huge amount of the FreeBSD kernel. Yes, this is the same FreeBSD that powers Yahoo!, that provided a significant portion of Mac OS X, and runs hundreds of thousands of businesses world-wide.

    To suggest that phk doesn't know what he's talking about is absurd. He's one of the top three UNIX gurus in the entire world. In fact, the Internet today is what it is thanks to his hard work and dedication.

  • well (Score:2, Informative)

    by larry bagina (561269) on Monday June 14, 2010 @07:53PM (#32572968) Journal
    I don't have time to read through the article and verify the numbers (at least right now) but anyone who's even paged through TAOCP knows it was written for computers where "swap" was what the operator did when the tape was full. (Ok, they also had drum memory).
  • by SashaMan (263632) on Monday June 14, 2010 @07:59PM (#32573040)

    The summary is wrong when it talks about "an off by ten error in btrees". In fact, the article talks about how normal binary heap implementations are slow when virtual memory is taken into account.

    In fact, b-trees ARE cache aware and ARE optimized to limit paging on disk. PHK's algorithm is essentially a cache-aware version of a binary heap.

    That is, binary tree is to b-tree as binary heap is to PHK's b-heap.

  • by MadAndy (122592) on Monday June 14, 2010 @08:04PM (#32573074)
    Knuth's stuff assumes everything is RAM-resident - as long as you don't violate that what he wrote is as valid as ever. I'm quite certain that he'll have different suggestions for tasks involving storing data on disk. Even though the disk writes are implicit because the OS is doing them, they're still there, just as they would be had he coded them explicitly. So of course you're going to get poor performance using a RAM-resident algorithm for a disk-resident application.

    The RAM resident stuff is still useful, both at a lower level, but also for those of us creating applications that can live entirely in memory. A web site I did recently is careful to ensure that the entire primary data set can fit in memory, and for that site everything he wrote is still perfectly valid.

    In fact, for very high performing websites you try to ensure that at least most of your requests come from memory rather than disk, which makes Knuth's stuff more important than ever. If you can't do it in RAM then you'd better have a lot of spindles!

  • by ImABanker (1439821) on Monday June 14, 2010 @08:28PM (#32573298)
    How does one get from, "I have not found a systematic flaw in the quality of Knuth et al.'s reasoning" to "Knuth Got it Wrong"?
  • Re:Timer wheels (Score:3, Informative)

    by chhamilton (264664) on Monday June 14, 2010 @08:56PM (#32573486)

    The main reason they are using the buckets is to delay sorting costs as far as possible into the future so that there is less cost for most timers (as most timers are apparently deleted far before expiring). I'd suggest that the major performance gain is due to this lazy sorting, and not because their data structure avoids page faults. (Well, it does avoid page faults that the old linked list algorithm would have caused, but these page faults are due to the sorting going on in the original code which is avoided in the second. If timers did not expire, the two approaches would be quite similar, both generating page faults when sorting the linked lists, which likely have bad locality, and neither being as good as an IO-efficient algorithm.)

    Using timer wheels as a heap structure wouldn't be appropriate unless many of the objects placed in the heap are removed prior to making it to the top of the heap. If this is not the case the sorting of the items from one bucket to the next bucket (sorting a linked list) would cause many page faults if the list didn't fit in internal memory. Timer wheels do nothing to address data locality which is the main problem faced by extremely large heaps. Your mention of in-order access is only true if the lists at each bucket as indeed stored sequentially in memory. This is hard to guarantee unless that space is reserved ahead of time or some dynamic reallocation scheme is used. I read the linked article as implying that simple linked lists were used, which generally have very bad data locality. Even if if a linear list was guaranteed, however, the sorting step when pushing things from one bucket down to the next bucket would incur page faults (assuming the list was too big to fit in memory) unless an I/O-efficient or cache-oblivious sort were used. (Which could easily be done, making an IO-efficient timer wheel structure.)

    The algorithm discussed in the article is for a general purpose heap. In most heaps the main cost is in removing elements from the root of the heap as they bubble up through it, rather than deleting them prematurely (as is the case with timers). Different approaches for fundamentally different problems.

  • Careful there... (Score:3, Informative)

    by Estanislao Martínez (203477) on Monday June 14, 2010 @09:24PM (#32573672) Homepage

    This means that algorithms in the real world can scale worse than their theoretical "worst-case", if that theoretical worst-case scaling is based on the assumption of constant memory access cost, since that assumption does not hold in the real world.

    I agree with all of your post except this part. Algorithmic complexity theory is about orders of magnitude, not about precise numbers. That's why we have O(n) as a complexity class but not O(2n), O(3n) as separate classes; saying that an algorithm has O(n) worst-case time performance is saying that the time it takes to run it is approximated by some linear function of the problem size. We don't particularly care which linear function it is, because that depends too closely on your assumptions about the computational model and/or hardware. What we really care about when we call it O(n) is that that's better than something like O(n^2) (a.k.a. polynomial time [wikipedia.org] or just "P") or O(log n), no matter what computational model you assume.

    As long as the slow memory accesses in the physical hardware still respect some bound, you can treat that upper bound as the constant worst-case memory access cost, and use the algorithmic complexity analysis to calculate an upper bound on the algorithmic complexity. If we turn your argument on its head, we'd say instead that the actual physical cost of running algorithms is often faster than the complexity class indicates because many memory accesses are much faster than the constant upper bound, but that doesn't make the linear upper bound invalid. The best you might be able to do is to assume a finer-grained computational model with variable cost memory access, and prove that you can get a lower upper bound there, but the original higher upper bound is still an upper bound for the computational model chosen, and can still be useful when reasoning about a system.

  • Many servers (Score:5, Informative)

    by tepples (727027) <[moc.liamg] [ta] [selppet]> on Monday June 14, 2010 @09:36PM (#32573746) Homepage Journal

    Unless you have many servers

    The article does in fact mention many servers, specifically replacing twelve Squid servers with three Varnish servers.

    it is cheaper to throw money at the problem in the form of physical RAM before you start thinking about problems like this.

    No matter how much physical RAM you have, you're still limited by the speed of the interface between RAM and the CPU. If a set of closely related nodes can fit in a cache line, this improves locality so that the CPU can do more work instead of twiddling its thumbs.

  • by borgboy (218060) on Monday June 14, 2010 @10:57PM (#32574240)

    In no way whatsoever did he say 'remember swap is slow; try not to use it.'

    That's as wrong as the idiotic summary.

    Here's a relevant quote:

    A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical. The user paid for 64 bits of address space, and I am not afraid to use it.

    The article is about redesigning binary heaps to account for non-linear access times between nodes due to swap. This point is critical. He's NOT avoiding swap, he's planning for it.

  • Re:Careful there... (Score:5, Informative)

    by Rakishi (759894) on Monday June 14, 2010 @10:59PM (#32574252)

    No, you're missing the point. And wrong. And also missing a lot knowledge about what O() means.

    It doesn't matter how many orders of magnitude slower an operation is. O() is about SCALING for input size n. That's all. Constant factors do not matter. The GP already said this, please pay bloody attention next time. The running time of each operation is just a constant and has no impact on the O() performance. An order of magnitude or fifty is all the same. You talk about 1/100 speed and compare n to n^2. Hahahaha. Take n = 1000000. You know what the difference between n and n^2 is then? 1000000. Your puny factor of 100 is irrelevant against six orders of magnitude performance difference.

    The worst case performance does not change. It's still O(n) or O (n^2). In fact it's quite possible for an O(n^2) algorithm to be faster than an O(log(n)) algorithm for small n under certain conditions. O is all about n being so bloody large that constant factors don't matter anymore.

    This is all basic introductory algorithms stuff, please read up on it before trying to chime in on any arguments related it it in the future, okay?

  • by Anonymous Coward on Monday June 14, 2010 @11:48PM (#32574552)

    You start with k, and follow with dawson.

  • by cwills (200262) on Tuesday June 15, 2010 @12:12AM (#32574674)

    This really isn't anything new. Knuth didn't get it "wrong". He based his analysis of the algorithms assuming a system that had dedicated memory and where each instruction of code ran uninterrupted and in a consistent fashion.

    Certain memory access patterns are "bad" under a system that uses virtual memory, especially when the base system is memory constrained. This has been a well known fact for decades. In fact one of the maybe lost arts of programming was ensuring reference locality, not only of data, but also of code. It was a common practice to ensure that often called subroutines or functions where either located in same page of memory as the calling code, or to group all the often called functions into as few pages of memory as possible.

    Basically, every address space has what is sometimes called a working set, a set of pages that have been recently referenced. There are three things that can happen with a working set. It can remain the same size, it can grow and it can shrink. If it remains the same, there is no additional load to the operating system. If it shrinks, there is no additional load to the operating system, in fact this can help a memory constrained system. A growing working set however an lead to a thrashing system. Some operating systems will monitor working set sizes and can adjust dispatch priorities and execution classes depending on what the recent working set size history is. An application with a growing working set may very will find itself at the end of the queue way behind applications that have a static working set size.

    Take for an example the following very simple program

    static string buffer[256][4096]
    while not infile.eof() do
    infile.readinto(buffer[0],256)
    outfile.writefrom(buffer[0],256)
    end

    Here the working set of this program will be very small. Ignoring the file i/o routines, all the code and data references will be limited to basically a fixed section of memory. From a virtual memory stand point, this is a "well behaved" application.

    Now take the following

    static string buffer[256][4096]
    while not infile.eof() do
    bindex = random(0,4095)
    infile.readinto(buffer[ bindex ], 256)
    outfile.wwritefrom(buffer[ bindex ], 256)
    end

    Functionally the same program, however the data reference pattern here is all over the place. The working set will be large, since many of the buffer pages will be referenced. The program never stays long on the same memory location.

    Finally take the following example

    static string buffer[256][4096]
    infile.readinto(buffer[0], 256* 4096) // fill the entire buffer
    for i = 0 to 4095 do
    numbercrunch( buffer[i] )
    end

    Here there will be an initially huge working set as the data is read in. However, the working set will shrink to a reasonable size once the numbercrunching phase starts since the data references will all be localized to a small block of memory.

  • Re:Careful there... (Score:3, Informative)

    by Rakishi (759894) on Tuesday June 15, 2010 @01:36AM (#32574956)

    Sigh. No, the GP doesn't know what he's talking about. And apparently neither do you. If he did he would not mention something as idiotic as cache making the algorithm go from O(n) to O(n^2). It makes it go from O(x*n) to O(100*x*n), both of which are O(n). But if n is 50, O(x*n^2) would come out faster.

    You're both confusing the very real issue of the constant running time of operations and O algorithm scaling. The algorithm is still O(n) however for the data sets in question it's slower than an O(n^2) algorithm. The worst case O performance is identical, that is simply a question of scaling to data sizes. However the practical constant factors are now much larger than one would normally expect and they vary based on other factors.

    It's a simple distinction, really, which is why I'm confused at people's inability to grasp it.

    I said nothing about the article in question or anything related to it on purpose. That is because none of the stuff you mentioned had really anything to do with that. Not specifically and the article makes no claims that agree with you. This is simply an argument about what worst case and O notation refers to.

    The article essentially is about finding a way to reduce the constant factor in an algorithm. The O notation makes no sense in this case. It's still O(n) or whatnot. Except it's now 10 times faster.

    Honestly, slashdot keeps letting me down these days. Sigh.

  • Re:Crank it to 11 (Score:3, Informative)

    by deniable (76198) on Tuesday June 15, 2010 @02:27AM (#32575122)
    No, but they do understand octal.
  • by phkamp (524380) on Tuesday June 15, 2010 @04:26AM (#32575568) Homepage

    What a misleading title, it is not even in the same continent as the article.

    A large number of people obviously didn't read the actual article.

    And I guess Knuth has quite a fanboi community on slashdot. I wonder if he really appreciates that ?

    Some of those who did read the article, does not seem to know the difference between a binary heap and a binary tree, and even the pretty strong clue to the difference in the text, did not make them go check wikipedia. 10 out of 10 for selfesteem, but 0 out of 10 for clue.

    Those who think CS should be unsullied by actual computers should make sure to note this belief on their resume. (Trust me, everybody you send your resume to will appreciate that.)

    Those who advocate getting rid of Virtual Memory must have much more money for RAM than is sensible. I wish I could afford that.

    About five comments tries, in vain it seems, to explain the point of my article to the unwashed masses (kudos!, but really, what are you doing here ?)

    Not one comment rises to a level where I feel a need to answer it specifically. On Reddit over the weekend there were about a handful.

    Is that really par for the course on slashdot these days ?

    Sic transit gloria mundi...

    Poul-Henning

  • Re:Careful there... (Score:3, Informative)

    by Nitage (1010087) on Tuesday June 15, 2010 @05:31AM (#32575780)
    O(1) *is* O(100000). The point of the article is that memory access is not O(1) - it's O(n) or worse.
  • by Anonymous Coward on Tuesday June 15, 2010 @06:10AM (#32575908)

    http://research.microsoft.com/en-us/um/people/trishulc/papers/cache-conscious.pdf

  • by Anonymous Coward on Tuesday June 15, 2010 @08:06AM (#32576380)

    3x16 = 48, not 300.

    But I understand what you want to say.

    But if the performance of your app is based upon having a lot of swap activity, you should do it wisely.

    Allocating too much and resorting to 'let the OS sort it out' is not the best approach, which leaves space for optimisation. This is such an optimisation.

    But better programming would be to ensure the system doesn't swap at all or at a very low swapin swapout rate.

Any given program, when running, is obsolete.

Working...