Knuth Got It Wrong
davecb writes "Think you've mastered the art of server performance? Think again. Poul-Henning Kamp, in an article at ACM Queue, finds an off-by-ten error in the classic binary heap, because it fails to take virtual memory into account. And he solves the problem in the open source 'Varnish' HTTP accelerator, for all of us to see and use."
Re:Journaling Filesystems (Score:2, Interesting)
Why trust the OS? (Score:5, Interesting)
Re:as Knuth told me when I was at his house (Score:3, Interesting)
I wrote Knuth an email once. He never wrote me back.
As with the xkcd.com reference in the grandparent, this is something of an in-joke. Don doesn't do email. B-)
Re:Knuth didn't get anything wrong (Score:3, Interesting)
Squid is his use case to beat? Has he been in any high performance app stack in the past five years?
Squid... really?
Considering Varnish was written due primarily to the lacking performance of Squid, I don't see why this is so bad?
Re:Knuth didn't get anything wrong (Score:3, Interesting)
Other than Varnish and Squid, what caching reverse-proxy software is there? It looks like Nginx recently added caching to its core, although I'm not exactly sure if it's intended to act in the same capacity as Squid and Varnish or to be more like mod_cache for lighttpd. I guess you could use Apache and mod_proxy, but that's not exactly high performance. I know my employer looked at the various offerings and we ended up writing our own on top of a webserver called OKWS.
Don Knuth pays people who find errors (Score:3, Interesting)
Re:Nope, he didn't (Score:3, Interesting)
Instead I had to read a bunch of text and timing charts before he simply showed what his improvement was. Yes, the cache-oblivious layout is worse, but that wasn't the problem Knuth was trying to solve. You could argue further that the paging system should group links by location AND popularity: move the more popular links toward the top of the tree so you don't have to traverse past the first page to find them. Also, different applications would have different potential improvements. An algorithm for hosting links to news articles (where newer articles are more popular) might not work well with this scheme, since every newly inserted link ends up at the bottom of the tree.
On the flip side, most people don't care. Unless you have many servers, it is cheaper to throw money at the problem in the form of physical RAM before you start thinking about problems like this.
Re:Nope, he didn't (Score:3, Interesting)
For instance, if the b-tree node is roughly the size of a page, the entire node must be read into memory from disk at once; even reading a single byte of the record may necessitate reading the entire page. That size, however, will likely be much larger than a cache line, so you can see quite a lot of variance in performance depending on how much of the structure you access: say, scanning through the whole node versus just reading a few words at the top.
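The page-versus-cache-line gap the parent describes is easy to put numbers on. A rough arithmetic sketch, assuming typical sizes (4 KiB pages, 64-byte cache lines, 8-byte keys -- your hardware may differ): a node tuned to the page size spans dozens of cache lines, so the cost of touching it varies a lot with how much of it you actually read.

```python
# Assumed, typical sizes -- adjust for your platform.
PAGE_SIZE = 4096     # VM page / disk block, bytes
CACHE_LINE = 64      # CPU cache line, bytes
KEY_SIZE = 8         # one 64-bit key

keys_per_page = PAGE_SIZE // KEY_SIZE      # keys a page-sized node can hold
lines_per_page = PAGE_SIZE // CACHE_LINE   # cache lines one node spans

print(keys_per_page)    # 512
print(lines_per_page)   # 64
```

So reading "a few words at the top" of the node touches one or two cache lines, while scanning the whole node touches 64 of them, even though either way costs the same single page fault.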
Re:Why trust the OS? (Score:2, Interesting)
Why is this funny? Almost all DBMS kernels do it -- they manage physical memory and disk blocks themselves. More to the point, Dr. Kamp is glossing over the fact that one doesn't use in-memory data structures where data may reside on disk.
In this case, the suitable data structure is some variation of the B+tree, which keeps the children of a node together in one memory page and splits the page when there is no more room in it. He used the wrong data structure to begin with, concludes Knuth was wrong, and finally ends up implementing Knuth's B+tree.
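The page split mentioned above can be sketched in a few lines. This is a minimal toy, not anyone's real implementation: it handles only a single leaf, and the ORDER of 4 keys per "page" is an assumption for illustration (a real B+tree sizes the node to the page, as discussed).

```python
import bisect

ORDER = 4  # max keys per node; a real implementation sizes this to a page

def insert_leaf(keys, key):
    """Insert into a sorted leaf; on overflow, split the "page" in half
    and return (left, right, separator). The separator -- the first key
    of the new right sibling -- is what gets pushed up to the parent."""
    bisect.insort(keys, key)
    if len(keys) <= ORDER:
        return keys, None, None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

leaf = [10, 20, 30, 40]               # a full leaf
left, right, sep = insert_leaf(leaf, 25)
print(left, right, sep)               # [10, 20] [25, 30, 40] 25
```

Each resulting node again fits in its page, which is exactly the locality property that makes the structure friendly to virtual memory.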
Alas, no more (Score:5, Interesting)
Unfortunately, he no longer gives out reward checks [wikipedia.org] for finding bugs in his texts. This seems to be mostly because proud bug-finders inevitably post images of the checks online, which of course, contain Knuth's bank account numbers. More discussion here [stanford.edu].
Re:as Knuth told me when I was at his house (Score:4, Interesting)
He uses email; his secretary prints them out (after some selection) and forwards them. And he reacts to them. Well, at least, to some, if he knows you. Don is the most humble genius that I know, and I have always found him approachable in a way that is rare for other famous people.
He also uses email (typically via his secretary, again) to organize trips. I also have direct emails from him when some new TeX development irks him.
Re:Agreed (Score:3, Interesting)
I mean, just for example, Knuth spends whole chapters talking about sorting algorithms. Almost all modern programmers don't have any use for that stuff, because re-implementing sort in your application code is pointless and counterproductive. The built-in sort provided by any modern language is optimized at a lower level and will handily beat the pants off anything you can do. Heck, even the Schwartzian Transform is built in to most modern languages these days.
Okay, sure, the guys who build low-level tools (compilers and system libraries and such) still need to know about sorting routines. What percentage of the programmer population is that? 0.1%? And even there, Knuth is somewhat obsolete, because (among other things) the stuff you're sorting is almost certainly stored in high-level cross-platform data structures, not some flat array of machine integers. I mean, it's still true that you probably don't want to code a bubble sort, but that's not exactly a profound revelation.
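For what it's worth, the Schwartzian Transform mentioned above (decorate, sort, undecorate) really is built in: in Python, for example, it is the key= parameter of sorted(), which calls the key function once per element rather than once per comparison. A trivial illustration, assuming nothing beyond the standard library:

```python
words = ["banana", "fig", "apple", "kiwi"]

# Sort by length: sorted() decorates each element with len(element),
# sorts on that, and returns the undecorated elements -- the
# decorate-sort-undecorate idiom done for you.
by_length = sorted(words, key=len)
print(by_length)   # ['fig', 'kiwi', 'apple', 'banana']
```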
It's a BINARY heap (not std. heap rakslice) (Score:1, Interesting)
See this:
http://en.wikipedia.org/wiki/Binary_heap [wikipedia.org]
"It's not a very useful heap if it has the middle value at the top." -- rakslice, Tuesday June 15, @01:09PM
See that URL at the top, because what's noted in this article is NOT a "std. heap", but instead, a BINARY heap... & it IS built up from a BINARY TREE structure!
Additionally, those ARE basically constructed in the manner I noted: doing comparisons of each "potential parent node" to its "potential children node(s)" beneath it, usually using "greater than" or "less than" type comparisons & putting items to the left & right of said "potential parent node(s)". In fact, see this from that same URL on BINARY HEAPS:
"The heap property: each node is greater than or equal to each of its children according to some comparison predicate which is fixed for the entire data structure."
(Just like a Binary Tree is constructed & as I noted in my orig. post)
The "pertinent quote" from said Wiki is this:
A binary heap is a heap data structure created using a binary tree.
(And, what do BINARY trees need? A pivot, like many sorts do, and that is usually the "mid-most value", because "binary search patterns", depend on that...)
Hence, why I noted to use my "puny trick" IF NEEDED (not always needed, as for instance in DB work, rowcounts in datasets ARE available), to find said PIVOT midpoint value IF you do not know the total # of elements in your dataset beforehand (or, if it changes, such as values being eliminated during say, deduplications of elements/records of said dataset & say, DB style indexing + reindexing is NOT in place in your app)...
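For anyone following along, the heap property quoted above can be checked in a few lines. This is a minimal, hypothetical sketch assuming the usual flat-array layout of a binary heap, where the children of node i sit at indices 2*i + 1 and 2*i + 2 (this is also the layout the ACM Queue article starts from):

```python
def is_max_heap(a):
    """Return True if every parent in the array is >= each of its
    children -- the heap property quoted from Wikipedia above."""
    n = len(a)
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and a[i] < a[child]:
                return False
    return True

print(is_max_heap([9, 5, 8, 1, 3, 7]))   # True: every parent >= children
print(is_max_heap([5, 9, 8]))            # False: 5 is smaller than child 9
```

Note that in this layout the largest value always ends up at index 0, which is rakslice's point about what sits at the top.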
APK
P.S.=> However, I did state "correct me if I am 'off'" in my first post, so "have @ it" guys (as rakslice has attempted to do, but, he used a std. heap NOT what this article's about in BINARY heaps usage)!
(Here? I am PRETTY SURE I am correct, but, I can stand new ideas & corrections as much as the next guy can @ times & though I have worked with Binary Trees more than a few times? Binary Heaps I have not, so... keep those "critiques" coming if you wish!)
So again: "have @ it" with the critiques, especially in material of this nature & the fact I have not "visited it" in QUITE SOME TIME (it's the FUN stuff imo!)... apk