Is It Possible to Implement Faster Binary Searches? (github.com)
Last week Slashdot reader scandum described the search for the most efficient sorting algorithm.
Now he's back, touting a new implementation for binary searches (using the same GitHub repo, and written in 15 to 30 lines of C code) that he says may be "up to 40%" faster for 32-bit integers. ("Keep in mind performance will vary depending on hardware and compiler optimizations.") The most commonly used binary search variant was first published by Hermann Bottenbruch in 1962 and hasn't notably changed since. Binary searches are one of the cornerstones of computer science...
The reason the algorithms are faster appears to be a combination of simpler calculations, branch prediction, and a reduction in cache misses.
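For flavor, here is a minimal sketch of a branch-reduced ("monobound"-style) binary search over sorted 32-bit integers; this illustrates the kind of simplification described, and is not the author's exact code:

#include <stddef.h>
#include <stdint.h>

/* Halve the search range with a comparison that compilers typically
   turn into a conditional move instead of an unpredictable branch. */
size_t monobound_search(const int32_t *array, size_t size, int32_t key)
{
    size_t low = 0;

    while (size > 1) {
        size_t half = size / 2;
        if (array[low + half] <= key)   /* usually becomes a cmov */
            low += half;
        size -= half;
    }
    if (size && array[low] == key)
        return low;                     /* index of a match */
    return (size_t)-1;                  /* not found */
}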
Better than Binary? (Score:2)
Re: (Score:2)
Re: (Score:1)
There are no "quantum computers".
No love for Radix sort? (Score:1)
Why no love for Radix sort? Yeah, it's pretty much restricted to sorting fixed-length values, but if that's what you're sorting then its O(n) performance scaling is as good as you could possibly ask for.
Granted, most sample implementations sort by decimal digits for clarity, which introduces a lot of computational overhead and a large constant multiplier, but it's just as effective (and a _lot_ faster) if you use one-byte digits. Then it only takes 4 passes to completely sort any length list of 32-bit values.
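A minimal sketch of that byte-at-a-time approach (LSD radix sort on unsigned 32-bit keys; illustrative, error handling kept minimal):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on unsigned 32-bit keys: one byte per pass, 4 passes. */
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)       /* histogram this byte */
            count[(a[i] >> shift) & 0xFF]++;
        for (size_t j = 1; j < 256; j++)     /* prefix sums -> bucket ends */
            count[j] += count[j - 1];
        for (size_t i = n; i-- > 0; )        /* stable scatter, in reverse */
            tmp[--count[(a[i] >> shift) & 0xFF]] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}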
Re: (Score:2)
Wait, never mind. They said search, not sort. Need more coffee.
Re: (Score:2)
Re: (Score:2)
How would you do that? A radix sort doesn't have a structure that would be terribly helpful for lookup.
Are you thinking of something like a 256-ary tree, where each node has (up to) 256 children based on the value of the next byte of the key? (sort of a tree of nested radix bins?)
That would indeed give you O(k) lookup (technically O(log_base_256 n) = O(log n), since constants are ignored for Big-O notation), but the memory overhead would be immense, since you need an array linking to up to 256 sub-bins for every node.
Re: (Score:2)
Yup... you pretty much called it.
Memory overhead is high, but is still an asymptotically linear function of n where n is the total # elements you have stored, as there is no need to store the nodes which do not have children. Since you have to store the elements anyways, if this structure becomes how you store the data in the first place, then the overhead is generally inconsequential.
Insertion, deletion, and finding are all O(k), where k is the length of the key to be inserted, and the data is always kept in sorted order.
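A sketch of the node structure and O(k) insert being described (illustrative names, error handling omitted):

#include <stdint.h>
#include <stdlib.h>

/* 256-ary trie node: one child slot per byte value; absent children
   are simply left NULL, so sparse nodes cost nothing extra. */
typedef struct trie_node {
    struct trie_node *child[256];
    int is_terminal;                 /* a complete key ends here */
} trie_node;

/* Insert a k-byte key: O(k), independent of how many keys are stored.
   (A real version must check for calloc failure.) */
void trie_insert(trie_node *root, const uint8_t *key, size_t k)
{
    trie_node *node = root;
    for (size_t i = 0; i < k; i++) {
        if (!node->child[key[i]])
            node->child[key[i]] = calloc(1, sizeof *node);
        node = node->child[key[i]];
    }
    node->is_terminal = 1;
}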
Re: (Score:2)
As Andrei Alexandrescu points out [youtube.com] (something he noticed while inventing a new sorting algorithm), everyone has been measuring sorting algorithms with an incomplete formula. The standard cost is:
O(n) = C(n) + S(n)
C(n) = time of Compares
S(n) = time of Swaps
The complete formula is:
Blended Cost = (C(n) + M(n) + k*D(n))/n
C(n) = time of Compares
M(n) = time of Moves
D(n) = average distance between subsequent array accesses
Radix sort makes 3 passes -- this can potentially blow your D$ cache. E.g. I've implemented Radix sort for Float32 on the PS2 and found it performed worse, while on the Wii it was OK if I'm remembering correctly.
Re: No love for Radix sort? (Score:3)
"orientated" is not a word
Re: (Score:3)
It's probably an Americanism, you know how they like to make words look more complicated, like turning "burgled" into "burglarized".
Re: (Score:2)
Thanks for the catch!
The "official" term is: Data-oriented design [wikipedia.org]
Re: (Score:2)
Good.
Apparently I'm "wrong" about it not being a word. The normal verb is orient [wiktionary.org], past participle is "oriented". But wiktionary does list orientated as a backformation from "orientation" [wiktionary.org]. Then it notes "usually considered incorrect in American English". Since Wiktionary has no citations anywhere I'm unsure how much to trust it.
It's one of those things that just grinds my gears, like people saying "I hate when I loose my keys".
Re:No love for Radix sort? (Score:5, Funny)
This is why you should implement everything on a 6502, which doesn't suffer from things like cache misses etc. Everything runs at optimum speed at all times.
Re: (Score:2)
I use an emulated virtual 6502 for just this reason
Re: (Score:3)
Except that some instructions ran faster when addressing "zero page", the first 256 bytes. On a stock C-64 a lot of that was used by the kernel, but some of it was available for your programs. Of course cart games and stuff dumped the kernel and optimized to their heart's content. I bet the Defender cart did something like that. Defender rocked on the C-64. It saved me a lot of quarters.
Re: (Score:2)
> I've implemented Radix sort for Float32 on the PS2 and found it performed worse, while on the Wii it was OK if I'm remembering correctly.
Yet as I understand it, some sort of bucket or radix sort was standard for sorting things on the original PlayStation, whose GPU lacked any sort of depth buffer.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Informative)
>there is very little reason to actually use binary tree for anything.
Except any time you are doing range lookups and want sorted results. Those are very good reasons to use tree structures.
Re:No change to Order of complexity (Score:5, Insightful)
>> there is very little reason to actually use binary tree for anything.
> Except any time you are doing range lookups and want sorted results. Those are very good reasons to use tree structures.
The best algorithm is
Algorithms and data structures are different (Score:5, Insightful)
This article is about binary search, an algorithm. It's not about binary tree, which is a data structure.
An algorithm is a sequence of actions, here used to find an element in a sorted array. A sorted array is a data structure. A structure is a thing, a noun. Algorithms are essentially verbs - things you can do TO a data structure.
> Anyway there is a much faster algorithm, the hash table.
A hash table is a data structure, it is a thing. It is not an algorithm.
There are algorithms (sequences of steps) to build a hash table, to insert elements in a hash table, to delete elements, to find elements. These are algorithms.
Re: (Score:2)
A whole lot depends on the size and distribution of the data set.
A (sparse) hash-table lookup is an O(1) operation, but that one constant-time hashing operation can take much, much longer than visiting a node. If you're only storing thousands of values, a tree could potentially be much faster.
And of course a hash table is only as good as your hashing function, which is highly context dependent. A hashing function that's poorly optimized for the actual values being stored will create frequent hash collisions, degrading performance.
Re: (Score:2)
Depends on what you're doing. Suppose you have one item in the structure and wish to find the next item in sequence? Hash starts looking like a very bad choice. In fact, binary tree doesn't look so good. What you want is a sorted table. But if you're also going to be doing insertions, then that's a bad choice.
Of course, this is irrelevant to what was claimed. But I suspect his sort mechanism is sensitive to the size of the data being sorted. (E.g., for fewer than 12 items a bubble sort variant is generally faster.)
Re: (Score:2)
Re: (Score:3)
Re: No change to Order of complexity (Score:2)
Does the inline core of the quaternary search being a linear search stop the algorithm from even qualifying as a binary search? I'm not sure there are cache or branch prediction advantages either way, but it may be worth trying an actual inline binary search there.
Only if you make it data-dependent (Score:3)
Caches are limited in what they can do here, because binary search does mostly non-local memory access. The real improvement on large arrays is the interpolation used, but this one works only if the data is distributed in a way that makes interpolation effective. In a real-world scenario, the effect may be anywhere from worse than regular binary search to the improvements described.
Also take into account that in a real-world scenario, you will have longer search keys. That reduces the potential gains even more.
Re: (Score:2)
Indeed, which is where optimized data structures can take up the slack - depending on context. E.g. sorting array data as a binary tree rather than a linear sequence will result in much higher locality, since the first several nodes to be traversed are all at the beginning of the array. With 32-bit values you can store 1000 nodes in a single 4K memory page, which is almost the entire top 10 levels of a binary tree.
Of course such a structure tends to be expensive to update, since tree position is determined by array index.
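A minimal sketch of building such a layout from a sorted array (often called an Eytzinger layout; 0-based here, with children at 2k+1 and 2k+2; function names are illustrative):

#include <stddef.h>
#include <stdint.h>

/* Recursively copy a sorted array into implicit-tree order: an in-order
   walk of the implicit tree consumes the sorted input, so the top levels
   of the tree end up packed together at the front of the output array. */
static size_t eytzinger(const int32_t *sorted, int32_t *out,
                        size_t i, size_t k, size_t n)
{
    if (k < n) {
        i = eytzinger(sorted, out, i, 2 * k + 1, n);  /* left subtree  */
        out[k] = sorted[i++];                         /* this node     */
        i = eytzinger(sorted, out, i, 2 * k + 2, n);  /* right subtree */
    }
    return i;
}
/* usage: eytzinger(sorted, out, 0, 0, n); */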
Re: (Score:3)
Exactly. But that basic limitation remains: For something as fundamentally optimal and general as binary search, you can only get improvements by making it less general and hence in the general case, less optimal.
There are also other concerns. For example, a typical case for searching is an existence check. That is of course much better served by a hash-table. Another case is an interval search (all elements falling into an interval). A balanced binary search tree is much better than a general binary search
Re: (Score:1)
Could you use Benford's law ( https://en.wikipedia.org/wiki/... [wikipedia.org] ) to generate a realistic test set, and also use that principle to optimize the algorithm? In real-world scenarios it would then most likely be optimal, perhaps?
Re: (Score:2)
No.
Excellent improvement on the implementation (Score:3, Interesting)
A binary search is a binary search, but the implementations the author uses make "educated guesses" where limiting the dataset to integers allows optimization. The end result is a search that can go through a 1,000,000-element dataset and find the index in around half the time of the next best algorithm, both of which outperform "classic" binary search.
If you're working with small datasets or unknown data types (i.e. not everything is a 32-bit integer), it's [IMHO] not worth changing your code for the small advantage. For large datasets of known types... yes, this will save you compute cycles and memory, and make everything from SQL servers to authentication servers faster.
This is well-written. So, by the way, is the C code.
Ehud
Re: (Score:2)
> So, by the way, is the C code.
> Ehud
Apart from all those single-line if statements without curly brackets.
(nb. some of them have them, some don't...)
Re: (Score:1)
> Apart from all those single-line if statements without curly brackets.
The "Harvard comma" of C-coding :) This has been discussed TO DEATH ("ad nauseum").
I was going to post links... but hey, free Google search:
https://www.google.com/search?... [google.com]
I think you weren't serious but I didn't see /s so I'm trying to be helpful to others who might wonder what the heck we're discussing ;)
Ehud
Re: (Score:2)
The "Harvard comma" of C-coding :)
No, the "Harvard comma" would be the curly braces on a separate line, making the code much much longer than it ought to be. I didn't mention that though.
https://jeremybytes.blogspot.c... [blogspot.com]
If statements without curly braces are much less open to debate: they're bad.
Re: (Score:2)
Personally, I prefer to have braces on all if statements, in line with the Zephyr coding style.
Re: (Score:2)
Am I missing something? I only skimmed the article, but it sounded like they're just doing a classic interpolation search, along with the sort of optimizations you'd normally apply when making a real-world algorithm rather than a high-clarity reference implementation.
Re: (Score:1)
"I only skimmed the article, ..."
Ssssss, apostate!
Re: (Score:2)
Improvement in binary search is pointless...
We already have ways to do this that can make use of cache better than binary trees (say B-Trees, for instance), but why does shaving a few percent off a linear problem make the news? Give me a way to run some previously geometric problems in linear time; now that's something useful.
If you want to optimize your binary tree code for your hardware, do it in assembly and forget what compiler optimizations you have or don't have. Do you want a performance boost? SKIP
Is this code fast??? (Score:2)
I'm always cautious of code optimization exercises like this. In application, the code tends to be more complicated and many of the simplifying assumptions do not apply. The classic example is sorting a list of integers (like quick-sort). It's a great academic exercise. However, in practice, I'm always sorting objects, structures, or records (similar to a database.) This changes the minute optimizations subtly, with the result being the integer quick-sort might not be the best choice. I often prefer s
Re: (Score:3)
I don't know how you think array indices are implemented, but they are defined in terms of pointers and result in identical code generation, despite your belief that the less readable form of the source is somehow faster...
Re: (Score:2)
He's probably thinking of languages that allow arrays to be used as hash map lookups. Though now that I think about it, I'm not sure which language would allow you to do that, but then turn around and let you do pointer math (but I wouldn't be surprised if it were perl).
Re: (Score:2)
AWK, Perl, Groovy, and obviously C++ containers allow array syntax to access hash tables.
Probably Python, too.
Re: (Score:2)
Yeah, but do any of those then let you take that object and do pointer math on it instead? They're kind of two incompatible concepts. Obviously C++ is gonna allow you to do pointer arithmetic on any damned thing you want, but on something hashmap-backed it's probably gonna be pointless and stands a good chance of being undefined behavior. But yeah, I suppose you could have an object that is internally array-based but overloads [] in such a way as to give a hash appearance, and you could do the pointer math directly if you knew the internal layout.
Re: (Score:2)
[] usually just calls put(X,Y) and get(X), deferring to their implementations. More precisely, [] returns a helper object with a cast operator and operator=() defined.
Re: (Score:2)
Actually, they usually do NOT result in the same code; array lookups are almost always slower by at least one pointer addition:
Pointer dereference:
*location => one pointer dereference
Array lookup:
a[location_index] => *(a + index) => one pointer addition + one pointer dereference
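For reference, the C standard defines a[i] as *(a + i), so on mainstream optimizing compilers the two forms below typically compile to identical machine code, with the index scaling folded into the addressing mode:

/* On gcc/clang at -O1 and up these two functions generally produce
   the same instructions; a[i] is *(a + i) by definition, and x86/ARM
   addressing modes perform the scaled add for free. */
int by_index(const int *a, long i)   { return a[i]; }
int by_pointer(const int *a, long i) { return *(a + i); }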
Re: (Score:2)
At which optimization level? I'd expect an optimized compile to reduce everything to as simple as possible, and an unoptimized compile to leave the code as close to the original as possible.
Re: (Score:2)
I believe that, in general, it's true over a wide range of optimization levels.
When manipulating data by index, you're incrementing, comparing, and otherwise manipulating indexes, which are stored in memory as integers and quite likely used in comparisons that require their integer values to be known, so they can't be converted to pointer arithmetic behind the scenes. And then you use them to access an array. There's not really a whole lot of ways to avoid the associated pointer addition.
There's also a whole lot
Re: (Score:3)
Every compiler I've worked with in the last 30-35 years optimizes that away!
Re: (Score:2)
Unfortunately, there are an ample number of embedded C compilers that can't optimize a repeated array reference away, and most interpreters never optimize an array reference. Thus:
elt = array[i];
/* other small unrelated calculations as applicable - this gives the CPU something to do while the load completes */
if (elt == 3) { /* something */ }
else if (elt == 2) { /* something different */ }
else { /* something */ }
is almost always faster than:
if (array[i] == 3) { /* something */ }
else if (array[i] == 2) { /* something different */ }
else { /* something */ }
Re: (Score:2)
Could you name some? I have wondered about this in the past, but all the platforms I have used where I tested it used GCC as the compiler and performed this optimization. I guess maybe automotive or other safety critical industries where the compiler used would be heavily regulated?
Re: (Score:2)
I'm thinking about some of the Microchip compilers in particular. In comparison, GCC and Microsoft's C compiler are considered very good compilers in the embedded world.
This is the assembly code for some random snippets from a hardware bug that I found in 2014. The following flips a bit. On an Intel x86 processor, it should be a single XOR instruction. On this processor (PIC18), it should be no more than three instructions. /* Compliment a b
191: LATAbits.LATA9 = !LATAbits.LATA9;
Re: (Score:2)
Really? High School computer class? (Score:2)
That's interesting, because I studied computer science in the 1990s and "binary searches" of the type they mention were just abstract curiosities to understand the general idea. Nobody in their right mind in a real non-trivial application would use a flat area of memory with sorted data and then perform a binary search algorithm on it.
Re:Really? High School computer class? (Score:5, Informative)
> Nobody in their right mind in a real non-trivial application would use a flat area of memory with sorted data and then perform a binary search algorithm on it.
I've done just that many times in the past. It's especially useful for managing data where you want to quickly categorize or access by ranges of the value of one of its fields, and you don't want the overhead of building a big tree structure.
It's common enough that the C stdlib.h header includes a bsearch() function, and Python has a whole "bisect" module supporting binary search variants. Binary search can also be the main component of an insertion sort algorithm.
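For example, a typical use of the standard library version - bsearch() over a sorted array, with the usual qsort()-style comparator:

#include <stdio.h>
#include <stdlib.h>

/* Comparator usable with both qsort() and bsearch(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int a[] = { 2, 3, 5, 7, 11, 13 };   /* must already be sorted */
    int key = 7;
    int *hit = bsearch(&key, a, sizeof a / sizeof *a, sizeof *a, cmp_int);

    if (hit)
        printf("found at index %td\n", hit - a);
    else
        printf("not found\n");
    return 0;
}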
Re: (Score:2)
Building a tree might be more costly than sorting then searching. It has to do a lot of malloc/free, etc.
A flat memory structure might cache better, too.
Re: (Score:2)
As I'm sure you know, the modern CPU and memory architecture is far removed from the constant-cost memory read assumption that underlies a lot of algorithmic analysis. Sure, it applies asymptotically, but many applications do not live in an asymptotic world, and the coefficients matter.
Binary search on a sorted sequence that fits in cache can be way faster than tree traversal over a structure that doesn't.
Re: (Score:2)
We must have had different teachers.
It depends heavily on what you're doing - if you're working on a "live" data set that's being frequently updated, then yes, a tree is the obvious choice (or perhaps a hash table).
On the other hand, if you're dealing with something like a lookup table that's rarely modified, a flat array will be much smaller and faster.
And if lookup performance is *really* important, you can do even better than that - a position-encoded binary tree stored in a flat array, so that all the early accesses land in the same region of memory.
Re: (Score:2)
I just took a modern algorithms course as part of my Master's degree program, and I can assure you that we have not progressed beyond AVL trees, Red-Black trees, and B-Trees when it comes to fast ways to sort, store, and look up values in a list.
By the way, I don't share your views of what is trivial. Database applications do a LOT of dumping and sorting data in flat memory areas. In fact, that's the majority of what they are doing under the covers.
However, I seriously doubt any serious programmer would go out and reinvent the wheel by implementing any of the algorithms and data structures we studied. Modern libraries already exist for ALL of these and it is a waste of time coding yet another one.
Re: (Score:2)
> However, I seriously doubt any serious programmer would go out and reinvent the wheel by implementing any of the algorithms and data structures we studied. Modern libraries already exist for ALL of these and it is a waste of time coding yet another one.
It wouldn't be a serious problem, since most of these algorithms/structures are simple enough to code up in an afternoon. The reason to do so would be if there's something specific about your data set that allows a more efficient customization.
Re: (Score:2)
>> However, I seriously doubt any serious programmer would go out and reinvent the wheel by implementing any of the algorithms and data structures we studied. Modern libraries already exist for ALL of these and it is a waste of time coding yet another one.
> It wouldn't be a serious problem, since most of these algorithms/structures are simple enough to code up in an afternoon. The reason to do so would be if there's something specific about your data set that allows a more efficient customization.
That situation would be rare indeed. These days, all this is implementation detail hidden behind an interface or template (depending on the language), and if you really are coding up, say, a binary tree to sort and store some data, you really are wasting your time. Also, coding AVL trees or Red-Black trees isn't exactly something you do in an afternoon, especially the insert and delete parts. Yeah, I can whack up some tree traversal function in an hour for any of those, but unless you are *really* space-constrained
Maybe (Score:2)
Check all the algorithms one by one to see if there's a faster way.
Re: (Score:2)
The time burned testing all the distinct algorithms is usually wasted and better spent elsewhere. It can be useful for education, but the often very slight performance benefits do not normally justify the project time spent searching among them.
In some cases on real-life systems, yes! (Score:2)
Essentially, RAM is now considerably slower than CPUs, so a naively implemented binary search will mostly stall on memory accesses.
One way to get around that is to interleave multiple searches at once. Essentially, you read the value you base your decision on, then tell the cache to fetch the next one while you work on another search's value that you previously told the cache to fetch. This way you do several searches at once, using the time the data needs to arrive to do the address calculations.
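A rough sketch of that interleaving idea, assuming GCC/Clang's __builtin_prefetch, a sorted ascending array with n > 0, and a fixed number of simultaneous lookups (all names illustrative):

#include <stddef.h>
#include <stdint.h>

#define WAYS 8  /* number of interleaved searches; worth tuning per machine */

/* Advance WAYS binary searches one level per round, issuing a software
   prefetch for each search's next midpoint so the loads overlap instead
   of stalling one after another. */
void multi_search(const int32_t *a, size_t n,
                  const int32_t keys[WAYS], size_t found[WAYS])
{
    size_t lo[WAYS], size[WAYS];

    for (int i = 0; i < WAYS; i++) {
        lo[i] = 0;
        size[i] = n;
    }
    __builtin_prefetch(&a[n / 2]);          /* shared first midpoint */

    for (int busy = 1; busy; ) {
        busy = 0;
        for (int i = 0; i < WAYS; i++) {
            if (size[i] > 1) {
                size_t half = size[i] / 2;
                if (a[lo[i] + half] <= keys[i])   /* probe this level */
                    lo[i] += half;
                size[i] -= half;
                /* request the next midpoint before moving on */
                __builtin_prefetch(&a[lo[i] + size[i] / 2]);
                busy = 1;
            }
        }
    }
    for (int i = 0; i < WAYS; i++)
        found[i] = (a[lo[i]] == keys[i]) ? lo[i] : (size_t)-1;
}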
Re: (Score:2)
You would probably need hundreds of concurrent searches to gain an advantage this way, though, and there are not enough cache lines for that; but sure, it might be slightly more efficient.
Re: (Score:2)
Some years ago I heard a talk that did this with the (back then) new C++ coroutines feature. The speaker measured it, and the optimum was around a dozen or so concurrent searches. In fact, with more than that it gets slower again, as the data needed to maintain the searches becomes bigger.
Re: (Score:2)
Of course, that's an architecture-dependent result, so the machine you're operating on now might have very different characteristics. And it probably also depends on various things, like "What else is the system doing?", "Will I get swapped out in the middle of execution?", etc.
Cache Is King (Score:2)
You use the binary-search-tree-inside-an-array approach.
When you need to take the left child: index = index*2.
When you need to take the right child: index = index*2 + 1.
This puts all the early memory accesses into around the same memory region, and only later memory accesses could be out of cache, which is the opposite of how the normal Binary Search algorithm works.
Massive improvement in cache hits, and no pointers used for the structure of the tree itself. Cache is king.
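A sketch of the lookup under that 1-based layout (assuming tree[1..n] already holds a valid search tree in level order):

#include <stddef.h>
#include <stdint.h>

/* Search a binary tree stored implicitly in an array, root at index 1:
   left child = 2*i, right child = 2*i + 1, as described above. The
   first ~10 levels of a 32-bit tree fit in a single 4K page. */
size_t tree_search(const int32_t *tree, size_t n, int32_t key)
{
    size_t i = 1;                    /* tree[1..n], 1-based */
    while (i <= n) {
        if (tree[i] == key)
            return i;
        i = 2 * i + (key > tree[i]); /* left: 2i, right: 2i+1 */
    }
    return 0;                        /* 0 = not found */
}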
Important and Obvious (Score:2)
This is both important and obvious. When you need more speed, you can code to take advantage of the cache and CPU features, and it can make a huge difference. That should be obvious, but people forget or assume it's too difficult. I remember at work seeing structures with comments indicating where the structure broke across cache lines so that developers could better keep fields that were often used together on the same cache lines.
In the real world, you rarely do a binary search on an array of 32-bit integers.
For timing sensitive programs (Score:2)
The rest appear to exploit branch lookahead by making the forward-moving branch the "expected" one and putting the backward-moving branch in a conditional jump.
Re: (Score:2)
If that's true, that's likely due to optimization. Doing so adds log_2(n) comparisons that are more likely to fail than succeed, to the point where each level of depth adds about 3/4 of an extra comparison.
How about auto-tuning like FFTW? (Score:5, Interesting)
In the FFT world, the race to faster implementations was taken by storm a while ago by a couple of students at MIT who invented FFTW (the Fastest Fourier Transform in the West), an autotuning version that took into account just about everything about the current CPU, memory system, and problem size to produce the most streamlined version possible that runs on general-purpose hardware. The world is a better place because of their work.
Has anyone done the same thing for quicksort or binary search?
Re: How about auto-tuning like FFTW? (Score:1)
Yes, it's called the C library
Re: (Score:2)
> Yes, it's called the C library
AFAIK nothing in the C library looks at the specific hardware that the code is currently executing on and optimizes the algorithm to run faster on that hardware. Rather, the C library authors aim to find a single algorithm that can be expected to run reasonably well on most commonly-used hardware, which is not the same thing at all.
Re: (Score:2)
Re: (Score:2)
That's not necessarily the same as what FFTW does: they do empirical testing of different formulations of FFT to find the one that runs fastest on the current hardware and problem size, and caches that information so that the next time FFTW gets called on the same problem size on the same hardware, it's super fast. From memory, they average about a 2x speedup over generic algorithms.
Re: (Score:1)
As does the glibc team; a lot of the code is written in straight assembler with "IFDEF _arch_". I'm not sure they try to optimize indefinitely, but most 'basic' code (sorting, FFT, etc.) has been hand-written to give the best performance on most common architectures.
For most code, the compiler has many tricks built in, so it is sometimes faster to write in plain C so that it can create branchless if statements and optimize out loops. I've been astounded sometimes to find out my binary for eg. At
Re: (Score:2)
The difference is that FFTW does the tuning on-line, immediately prior to executing the FFT that's been requested (or through a separate tuning call). It is all empirically determined based on the particular hardware, beyond just a CPU architecture, that is being used at that moment. It's a run-time determination, not a compile-time one.
Re: (Score:2)
Binary search (Score:1)
Regula falsi (Score:2)
Re: (Score:2)
Interpolated search can be _much_ faster (Score:4, Interesting)
I made a web page many years ago which lets you search for any given decimal digit sequence within the first 1E9 digits of Pi (3.1415926...). The algorithm I came up with is exactly the same: I knew that Pi contains an effectively random sequence of digits, so I could guess that in a sorted array of all possible starting positions, a sequence like 18790314 (Albert Einstein's birth date) should occur approximately 18.79% into the index array:
https://tmsw.no/pi-search/ [tmsw.no]
Find 18790314
Found at 18,830,019: 636518371058586 18790314 193020374710719
Total time = 0.001965 seconds (9 suffix lookups)
With a naive binary search I would have needed ~30 iterations instead of the 9 taken by the interpolating algorithm.
When you consider that both the digit array (1GB) and the 4-times-larger index array (4GB, split into 10 separate 400MB files) reside on disk at my web host, the response time is quite good.
Terje
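A generic sketch of that interpolation idea (not the actual pi-search code; assumes sorted, roughly uniformly distributed 64-bit keys):

#include <stddef.h>
#include <stdint.h>

/* Instead of always probing the middle, guess the position from the
   key's value relative to the endpoints. On uniformly distributed keys
   this averages O(log log n) probes vs ~log2(n) for binary search. */
size_t interpolation_search(const uint64_t *a, size_t n, uint64_t key)
{
    if (n == 0)
        return (size_t)-1;

    size_t lo = 0, hi = n - 1;

    while (lo < hi && key > a[lo] && key < a[hi]) {
        /* proportional guess; floating point sidesteps overflow */
        size_t mid = lo + (size_t)((double)(key - a[lo]) * (hi - lo)
                                   / (double)(a[hi] - a[lo]));
        if (a[mid] < key)
            lo = mid + 1;
        else if (a[mid] > key)
            hi = mid;
        else
            return mid;
    }
    if (a[lo] == key) return lo;
    if (a[hi] == key) return hi;
    return (size_t)-1;
}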
Comparison sort cannot be way better than N*log(N) (Score:1)