Programming

Is It Possible to Implement Faster Binary Searches? (github.com)

Last week Slashdot reader scandum described the search for the most efficient sorting algorithm.

Now he's back, touting a new implementation for binary searches (using the same GitHub repo, and written in 15 to 30 lines of C code) that he says may be "up to 40%" faster for 32-bit integers. ("Keep in mind performance will vary depending on hardware and compiler optimizations.") The most commonly used binary search variant was first published by Hermann Bottenbruch in 1962 and hasn't notably changed since. Binary searches are one of the cornerstones of computer science...

The reason the algorithms are faster appears to be a combination of simpler calculations, branch prediction, and a reduction in cache misses.
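For reference, the commonly used variant is only a few lines of C. Here is a minimal sketch of a textbook binary search (illustrative only, not the code from the linked repository):

    #include <stddef.h>

    /* Classic binary search over a sorted int array.
       Returns the index of key, or -1 if it is not present. */
    static long binary_search(const int *array, size_t count, int key)
    {
        size_t low = 0, high = count;

        while (low < high) {
            size_t mid = low + (high - low) / 2;   /* avoids overflow of low + high */

            if (array[mid] < key)
                low = mid + 1;
            else if (array[mid] > key)
                high = mid;
            else
                return (long)mid;
        }
        return -1;
    }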

    • Analog.
    • Why no love for Radix sort? Yeah, it's pretty much restricted to sorting fixed-length values, but if that's what you're sorting then its O(n) performance scaling is as good as you could possibly ask for.

      Granted, most sample implementations sort by decimal digits for clarity, which introduces a lot of computational overhead and a large constant multiplier, but it's just as effective (and a _lot_ faster) if you use one-byte digits. Then it only takes 4 passes to completely sort any length list of 32-bit values.
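      A sketch of that byte-at-a-time (LSD) version for unsigned 32-bit values, using four counting-sort passes, one per byte (illustrative only, not anyone's production code):

        #include <stdint.h>
        #include <stdlib.h>
        #include <string.h>

        /* LSD radix sort of unsigned 32-bit values, one byte per pass. */
        static void radix_sort_u32(uint32_t *a, size_t n)
        {
            uint32_t *tmp = malloc(n * sizeof *tmp);
            if (tmp == NULL)
                return;                       /* allocation failed; leave input as-is */

            for (int shift = 0; shift < 32; shift += 8) {
                size_t count[256] = { 0 };

                /* Histogram of the current byte. */
                for (size_t i = 0; i < n; i++)
                    count[(a[i] >> shift) & 0xFF]++;

                /* Prefix sums turn counts into output positions. */
                size_t pos = 0;
                for (int b = 0; b < 256; b++) {
                    size_t c = count[b];
                    count[b] = pos;
                    pos += c;
                }

                /* Stable scatter into the scratch buffer, then copy back. */
                for (size_t i = 0; i < n; i++)
                    tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
                memcpy(a, tmp, n * sizeof *a);
            }
            free(tmp);
        }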

        • Wait, never mind. They said search, not sort. Need more coffee.

        • by mark-t ( 151149 )
          You can take the same underlying structure used in a multibyte radix sort and implement an O(k) lookup, where k is the length of the key being looked up, which is the theoretical asymptotic minimum for any search.
          • How would you do that? A radix sort doesn't have a structure that would be terribly helpful for lookup.

            Are you thinking of something like a 256-ary tree, where each node has (up to) 256 children based on the value of the next byte of the key? (sort of a tree of nested radix bins?)

            That would indeed give you O(k) lookup (technically O(log_256 n) = O(log n), since constant factors are ignored in Big-O notation), but the memory overhead would be immense since you need an array linking to up to 256 sub-bins for every node

            • by mark-t ( 151149 )

              Yup... you pretty much called it.

              Memory overhead is high, but is still an asymptotically linear function of n where n is the total # elements you have stored, as there is no need to store the nodes which do not have children. Since you have to store the elements anyways, if this structure becomes how you store the data in the first place, then the overhead is generally inconsequential.

              Insertion, deletion, and finding are all O(k) where k is the length of the key to be inserted, and the data is always
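              For concreteness, one byte-per-level node might look like this (a rough sketch with made-up names, not anyone's actual implementation):

                #include <stddef.h>

                /* One level of a byte-wise trie: up to 256 children, one per
                   possible value of the next key byte.  Absent children stay
                   NULL, so only populated paths are ever allocated. */
                struct trie_node {
                    struct trie_node *child[256];
                    void             *value;   /* payload stored for a complete key, or NULL */
                };

                /* O(k) lookup, where k is the key length in bytes. */
                static void *trie_find(const struct trie_node *root,
                                       const unsigned char *key, size_t key_len)
                {
                    const struct trie_node *node = root;

                    for (size_t i = 0; i < key_len && node != NULL; i++)
                        node = node->child[key[i]];

                    return node ? node->value : NULL;
                }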

      • As Andrei Alexandrescu points out [youtube.com] (in the talk where he presents a new sorting algorithm he invented), everyone has been measuring sorting algorithms with an incomplete formula. The standard cost is:

        Cost(n) = C(n) + S(n)

        C(n) = time of Compares
        S(n) = time of Swaps

        The complete formula is:

        Blended Cost = (C(n) + M(n) + k*D(n))/n

        C(n) = time of Compares
        M(n) = time of Moves
        D(n) = average distance between subsequent array accesses

        Radix sort makes 3 passes -- this can potentially blow your D$ cache. i.e. I've implemented Radix sort for Float32

        • "orientated" is not a word

          • by _merlin ( 160982 )

            It's probably an Americanism, you know how they like to make words look more complicated, like turning "burgled" into "burglarized".

          • Thanks for the catch!

            The "official" term is: Data-oriented design [wikipedia.org]

            • Good.

              Apparently I'm "wrong" about it not being a word. The normal verb is orient [wiktionary.org], past participle is "oriented". But wiktionary does list orientated as a backformation from "orientation" [wiktionary.org]. Then it notes "usually considered incorrect in American English". Since Wiktionary has no citations anywhere I'm unsure how much to trust it.

              It's one of those things that just grinds my gears, like people saying "I hate when I loose my keys".

        • by raynet ( 51803 ) on Saturday August 01, 2020 @03:51PM (#60355933) Homepage

          This is why you should implement everything on 6502 which doesn't suffer from things like cache misses etc. Everything runs at optimum speed at all times.

          • I use an emulated virtual 6502 for just this reason

          • Except that some instructions ran faster when addressing "zero page", the first 256 bytes. On a stock C-64 a lot of that was used by the kernel, but some of it was available for your programs. Of course cart games and stuff dumped the kernel and optimized to their heart's content. I bet the Defender cart did something like that. Defender rocked on the C-64. It saved me a lot of quarters.

        • by tepples ( 727027 )

          I've implemented Radix sort for Float32 on the PS2 and found it performed worse, while on the Wii it was OK if I'm remembering correctly.

          Yet as I understand it, some sort of bucket or radix sort was standard for sorting things on the original PlayStation, whose GPU lacked any sort of depth buffer.

      • Radix (MSB) sort can handle arbitrary values ... and there are two different ways to do it. For sorting on a single (variable-length) memcmp-able key, a (linear) prescan can generate a dense set of integers representing key prefixes in order. The radix sort is applied to the integers. If links are allowed in comments: https://tinyurl.com/stradix [tinyurl.com].
    • by mark-t ( 151149 )
      That's asymptotically the same complexity as binary
  • by gweihir ( 88907 ) on Saturday August 01, 2020 @11:02AM (#60355281)

    Caches are limited in what they can do here, because binary search does mostly non-local memory access. The real improvement on large arrays is the interpolation used, but this works only if the data is distributed in a way that makes interpolation effective. In a real-world scenario, the effect may be anywhere from worse than a regular binary search to the improvements described.

    Also take into account that in a real-world scenario, you will have longer search keys. That reduces the potential gains even more.
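    For what it's worth, the plain interpolation idea looks roughly like this (a sketch for uniformly distributed keys; the repo combines it with binary-search style fallbacks, so this is illustrative only):

      #include <stddef.h>

      /* Interpolation search over a sorted int array.  Returns the index of
         key, or -1 if it is not present.  Degrades badly when the values are
         far from uniformly distributed, which is the caveat above. */
      static long interpolation_search(const int *a, size_t n, int key)
      {
          if (n == 0)
              return -1;

          size_t low = 0, high = n - 1;

          while (low <= high && key >= a[low] && key <= a[high]) {
              if (a[low] == a[high])               /* avoid dividing by zero */
                  break;

              /* Guess the position, assuming values grow linearly with the index. */
              double frac = ((double)key - (double)a[low])
                          / ((double)a[high] - (double)a[low]);
              size_t mid = low + (size_t)(frac * (double)(high - low));

              if (a[mid] < key)
                  low = mid + 1;
              else if (a[mid] > key)
                  high = mid - 1;
              else
                  return (long)mid;
          }
          return (a[low] == key) ? (long)low : -1;
      }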

    • Indeed, which is where optimized data structures can take up the slack - depending on context. E.g. laying out sorted array data as a binary tree rather than a linear sequence will result in much higher locality, since the first several nodes to be traversed are all at the beginning of the tree. With 32-bit values you can store 1000 nodes in a single 4k memory page, which is almost the entire top 10 levels of a binary tree.

      Of course such a structure tends to be expensive to update since tree position is determine

      • by gweihir ( 88907 )

        Exactly. But that basic limitation remains: For something as fundamentally optimal and general as binary search, you can only get improvements by making it less general and hence in the general case, less optimal.

        There are also other concerns. For example, a typical case for searching is an existence check. That is of course much better served by a hash-table. Another case is an interval search (all elements falling into an interval). A balanced binary search tree is much better than a general binary search

  • by gavron ( 1300111 ) on Saturday August 01, 2020 @11:03AM (#60355285)

    A binary search is a binary search, but the implementations the author uses make "educated guesses" that, when the dataset is limited to integers, allow optimization. The end result is a search that can go through a 1,000,000-element dataset and find the index in around half the time of the next-best algorithm, both of which outperform "classic" binary search.

    If you're working with small datasets or unknown data types (i.e. not everything is a 32-bit integer) it's [IMHO] not worth changing your code for the small advantage. For large datasets and known dataset types... yes, this will save you compute cycles and memory, and make everything from SQL servers to authentication servers faster.

    This is well-written. So, by the way, is the C code.

    Ehud

    • So, by the way, is the C code.

      Ehud

      Apart from all those single-line if statements without curly brackets.

      (nb. some of them have them, some don't...)

      • by gavron ( 1300111 )

        > Apart from all those single-line if statements without curly brackets.

        The "Harvard comma" of C-coding :) This has been discussed TO DEATH ("ad nauseum").
        I was going to post links... but hey, free Google search:

        https://www.google.com/search?... [google.com]

        I think you weren't serious but I didn't see /s so I'm trying to be helpful to others who might wonder what the heck we're discussing ;)

        Ehud

        • The "Harvard comma" of C-coding :)

          No, the "Harvard comma" would be the curly braces on a separate line, making the code much much longer than it ought to be. I didn't mention that though.

          https://jeremybytes.blogspot.c... [blogspot.com]

          If statements without curly braces are much less open to debate: they're bad.

          • I'm not sure it's that much less open to debate. The Linux kernel coding style specifically says not to use braces for single-statement branches, unless one of the other branches has multiple lines, in which case all branches should have braces.
            Personally I prefer to just put braces on all if statements, in line with the Zephyr coding style.
    • Am I missing something? I only skimmed the article, but it sounded like they're just doing a classic interpolation search, along with the sort of optimizations you'd normally apply when making a real-world algorithm rather than a high-clarity reference implementation.

      • by Anonymous Coward

        "I only skimmed the article, ..."

        Ssssss, apostate!

    • Improvement in binary search is pointless...

      We already have ways to do this that make better use of cache than binary trees (say, B-trees for instance), but why does shaving a few percent off a linear problem make the news? Give me a way to run some previously geometric problems in linear time, now that's something useful.

      If you want to optimize your binary tree code for your hardware, do it in assembly and forget what compiler optimizations you have or don't have. Do you want a performance boost? SKIP

  • I'm always cautious of code optimization exercises like this. In application, the code tends to be more complicated and many of the simplifying assumptions do not apply. The classic example is sorting a list of integers (like quick-sort). It's a great academic exercise. However, in practice, I'm always sorting objects, structures, or records (similar to a database.) This changes the minute optimizations subtly, with the result being the integer quick-sort might not be the best choice. I often prefer s

    • by Shaiku ( 1045292 )

      I don't know how you think array indices are implemented, but they are defined in terms of pointers and result in identical code generation, despite your belief that the less readable form of the source is somehow faster...

      • He's probably thinking of languages that allow arrays to be used as hash map lookups. Though now that I think about it, I'm not sure which language would allow you to do that, but then turn around and let you do pointer math (but I wouldn't be surprised if it were perl).

        • AWK, Perl, Groovy and obviously C++ containers allow array syntax to access hash tables.
          Python probably, too.

          • Yeah, but do any of those then let you take that object and do pointer math on it instead? They're kinda 2 incompatible concepts. Obviously C++ is gonna allow you to do pointer arithmetic on any damned thing you want, but on something hashmap backed it's probably gonna be pointless and probably stands a good chance of being. But yeah, I suppose you could have an object that is internally array based but overloads [] in such a way to give a hash appearance and you could do the pointer math directly if you kn

            • [] usually just calls put(X,Y) and get(X), deferring to their implementations. More precisely, [] returns a helper object with a cast operator and an assignment operator defined.

      • Actually, they usually do NOT result in the same code; array lookups are almost always slower by at least one pointer addition:
        Pointer dereference:
        *location => one pointer dereference
        Array lookup:
        a[location_index] => *(a + index) => one pointer addition + one pointer dereference
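        A tiny illustration of the two forms being compared (illustrative only; as the replies note, an optimizing compiler will typically strength-reduce the indexed loop into the pointer-walking one):

          #include <stddef.h>

          /* Sum written with an array index: conceptually *(a + i) each time. */
          long sum_indexed(const int *a, size_t n)
          {
              long s = 0;
              for (size_t i = 0; i < n; i++)
                  s += a[i];
              return s;
          }

          /* Sum written with a walking pointer: one dereference per element. */
          long sum_pointer(const int *a, size_t n)
          {
              long s = 0;
              for (const int *p = a; p != a + n; p++)
                  s += *p;
              return s;
          }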

        • by HiThere ( 15173 )

          At which optimization level? I'd expect an optimized compile to reduce everything to be as simple as possible, and I'd expect an unoptimized compile to leave the code as close to the original as possible.

          • I believe that, in general, it's true over a wide range of optimization levels.

            When manipulating data by index you're incrementing, manipulating, etc. indexes which are stored in memory as integers, and quite likely used in comparisons that require their integer values to be known, so they can't be converted to pointer arithmetic behind the scenes. And then you use them to access an array. There's not really a whole lot of ways to avoid the associated pointer addition.

            There's also a whole lot

    • Every compiler I've worked with over the last 30-35 years optimizes that away!

      • Unfortunately, there are an ample number of embedded C compilers that can't optimize a repeat array reference away, and most interpreters never optimize an array reference. Thus:

        elt = array[i];
        /* other small unrelated calculations as applicable - this gives the CPU something to do while the load completes */
        if (elt == 3) { /* something */ }
        else if (elt == 2) { /* something different */ }
        else { /* something */ }

        is almost always faster than:

        if (array[i] == 3) { /* something */ }
        else if (array[i] ==

        • > Unfortunately, there are an ample number of embedded C compilers that can't optimize a repeat array reference

          Could you name some? I have wondered about this in the past, but all the platforms I have used where I tested it used GCC as the compiler and performed this optimization. I guess maybe automotive or other safety critical industries where the compiler used would be heavily regulated?
          • I'm thinking about some of the Microchip compilers in particular. In comparison, GCC and Microsoft's C compiler are considered very good compilers in the embedded world.

            This is the assembly code for some random snippets from a hardware bug that I found in 2014. The following flips a bit. On an Intel x86 processor, it should be a single XOR instruction. On this processor (PIC18), it should be no more than three instructions.
            191: LATAbits.LATA9 = !LATAbits.LATA9; /* Compliment a b

  • The most commonly used binary search variant was first published by Hermann Bottenbruch in 1962 and hasn't notably changed since. Binary searches are one of the cornerstones of computer science...

    That's interesting, because I studied computer science in the 1990s and "binary searches" of the type they mention were just abstract curiosities to understand the general idea. Nobody in their right mind in a real non-trivial application would use a flat area of memory with sorted data and then perform a binary search algorithm on it.
    • by Waffle Iron ( 339739 ) on Saturday August 01, 2020 @11:50AM (#60355407)

      Nobody in their right mind in a real non-trivial application would use a flat area of memory with sorted data and then perform a binary search algorithm on it.

      I've done just that many times in the past. It's especially useful for managing data where you want to quickly categorize or access by ranges of the value of one of its fields, and you don't want the overhead of building a big tree structure.

      It's common enough that the C stdlib.h header includes a bsearch() function, and Python has a whole "bisect" module supporting binary search variants. Binary search can also be the main component of an insertion sort algorithm.
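      For anyone who hasn't used it, bsearch() is exactly that flat-array approach. A minimal usage example:

        #include <stdio.h>
        #include <stdlib.h>

        /* Comparator for ints; written this way to avoid the overflow of x - y. */
        static int cmp_int(const void *a, const void *b)
        {
            int x = *(const int *)a, y = *(const int *)b;
            return (x > y) - (x < y);
        }

        int main(void)
        {
            int sorted[] = { 2, 3, 5, 7, 11, 13, 17, 19 };
            int key = 11;

            int *hit = bsearch(&key, sorted, sizeof sorted / sizeof *sorted,
                               sizeof *sorted, cmp_int);

            if (hit != NULL)
                printf("found %d at index %td\n", key, hit - sorted);
            else
                printf("%d not found\n", key);
            return 0;
        }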

    • Building a tree might be more costly than sorting then searching. It has to do a lot of malloc/free, etc.

      A flat memory structure might cache better, too.

    • As I'm sure you know, the modern CPU and memory architecture is far removed from the constant-cost memory read assumption that underlies a lot of algorithmic analysis. Sure, it applies asymptotically, but many applications do not live in an asymptotic world, and the coefficients matter.

      Binary search on a sorted sequence that fits in cache can be way faster than tree traversal over a structure that doesn't.

    • We must have had different teachers.

      It depends heavily on what you're doing - if you're working on a "live" data set that's being frequently updated, then yes, a tree is the obvious choice (or perhaps a hash table).

      On the other hand, if you're dealing with something like a lookup table that's rarely modified, a flat array will be much smaller and faster.

      And if lookup performance is *really* important you can even do one better than that - a position-encoded binary tree stored in a flat array, so that all the

    • I just took a modern algorithms course as part of my Master's degree program and I can assure you that we have not progressed beyond AVL trees, Red-Black trees, and B-trees when it comes to fast ways to sort, store, and look up values in a list.

      By the way, I don't share your views of what is trivial. Database applications do a LOT of dumping and sorting data in flat memory areas. In fact, that's the majority of what they are doing under the covers.

      However, I seriously doubt any serious programmer would go out and reinvent the wheel by implementing any of the algorithms and data structures we studied. Modern libraries already exist for ALL of these and it is a waste of time coding yet another one.

      • However, I seriously doubt any serious programmer would go out and reinvent the wheel by implementing any of the algorithms and data structures we studied. Modern libraries already exist for ALL of these and it is a waste of time coding yet another one.

        It wouldn't be a serious problem, since most of these algorithms/structures are simple enough to code up in an afternoon. The reason to do so would be if there's something specific about your data set that allows a more efficient customization.

        • However, I seriously doubt any serious programmer would go out and reinvent the wheel by implementing any of the algorithms and data structures we studied. Modern libraries already exist for ALL of these and it is a waste of time coding yet another one.

          It wouldn't be a serious problem, since most of these algorithms/structures are simple enough to code up in an afternoon. The reason to do so would be if there's something specific about your data set that allows a more efficient customization.

          That situation would be rare indeed. These days, all this is implementation details hidden behind an interface or template (depending on the language) and if you really are coding up, say, a binary tree to sort and store some data, you really are wasting your time. Also, coding AVL trees or Red-Black trees isn't exactly something you do in an afternoon, especially the insert and delete parts. Yea, I can whack up some tree traversal function in an hour for any of those, but unless you are *really* space-const

  • Check all the algorithms one by one to see if there's a faster way.

    • The time burned testing all the distinct algorithms is usually wasted; it would be better spent elsewhere. It can be useful for education, but the often very slight performance benefits do not normally justify the project time spent searching among them.

  • Essentially RAM is now considerably slower than CPUs, so a naively implemented binary search will mostly stall for memory accesses.

    One way to get around that is to interleave multiple searches at once. Essentially you compute the address of the next value one search needs and tell the cache to start fetching it, then go work on another search whose value you asked the cache to fetch earlier. This way you do several searches at once, using the time the data needs to arrive to do the address calculations for the other searches.
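    A rough sketch of that interleaving, assuming the GCC/Clang __builtin_prefetch intrinsic (the batch size that actually pays off is very hardware-dependent, as the replies note):

      #include <stddef.h>

      /* Run up to 16 lower-bound binary searches over the same sorted array,
         issuing prefetches for every search's next probe before doing any of
         the comparisons, so the cache misses overlap instead of serializing.
         out[i] receives the lowest index whose value is >= keys[i]. */
      static void interleaved_lower_bound(const int *a, size_t n,
                                          const int *keys, size_t *out, size_t k)
      {
          enum { MAX_BATCH = 16 };
          size_t lo[MAX_BATCH], hi[MAX_BATCH], mid[MAX_BATCH];

          if (k > MAX_BATCH)
              k = MAX_BATCH;                      /* keep the sketch simple */

          for (size_t i = 0; i < k; i++) {
              lo[i] = 0;
              hi[i] = n;
          }

          int active = 1;
          while (active) {
              active = 0;

              /* Pass 1: compute every probe address and ask the cache for it. */
              for (size_t i = 0; i < k; i++) {
                  if (lo[i] < hi[i]) {
                      mid[i] = lo[i] + (hi[i] - lo[i]) / 2;
                      __builtin_prefetch(&a[mid[i]]);
                      active = 1;
                  }
              }

              /* Pass 2: by now the loads are (hopefully) in flight; compare. */
              for (size_t i = 0; i < k; i++) {
                  if (lo[i] < hi[i]) {
                      if (a[mid[i]] < keys[i])
                          lo[i] = mid[i] + 1;
                      else
                          hi[i] = mid[i];
                  }
              }
          }

          for (size_t i = 0; i < k; i++)
              out[i] = lo[i];
      }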

    • by raynet ( 51803 )

      You would probably need hundreds of concurrent searches to gain an advantage this way, and there aren't enough cache lines for that, but sure, it might be slightly more efficient.

      • Some years ago I heard a talk that did this with the (back then) new C++ coroutines feature. The speaker measured it, and the optimum was around a dozen or so concurrent searches. In fact, with more than that it gets slower again, as the data needed to maintain the searches grows.

        • by HiThere ( 15173 )

          Of course, that's an architecture-dependent result. So the machine you're operating on now might have very different characteristics. And it's probably also dependent on various things, like "what else is the system doing?", "will I get swapped out in the middle of execution?", etc.

  • You use the binary-search-tree-inside-an-array approach.
    When you need to take the left child, index = index * 2.
    When you need to take the right child, index = index * 2 + 1.

    This puts all the early memory accesses into around the same memory region, and only later memory accesses could be out of cache, which is the opposite of how the normal Binary Search algorithm works.

    Massive improvement in cache hits, and no pointers used for the structure of the tree itself. Cache is king.
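    A minimal sketch of that layout (often called an Eytzinger or implicit-tree layout), assuming the data has already been arranged in that order and the root sits at index 1 so the child arithmetic stays simple:

      #include <stddef.h>

      /* Search an array laid out as an implicit binary search tree:
         the node at index i has its left child at 2*i and its right child
         at 2*i + 1.  Index 0 is unused; the count elements occupy
         tree[1] .. tree[count].  Returns the matching index, or 0. */
      static size_t implicit_tree_search(const int *tree, size_t count, int key)
      {
          size_t i = 1;

          while (i <= count) {
              if (tree[i] == key)
                  return i;
              i = 2 * i + (tree[i] < key);   /* left for smaller keys, right for larger */
          }
          return 0;                          /* 0 is never a valid node, so it means "not found" */
      }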

  • This is both important and obvious. When you need more speed, you can code to take advantage of the cache and CPU features, and it can make a huge difference. That should be obvious, but people forget or assume it's too difficult. I remember at work seeing structures with comments indicating where the structure broke across cache lines so that developers could better keep fields that were often used together on the same cache lines.

    In the real world, you rarely do a binary search on an array of 32-bit in

  • None of the algorithms, including the "standard", utilize an early exit on match.
    The rest appear to exploit branch lookahead by having the forward moving branch "expected" and the backward moving branch in a conditional jump.
    • by Sigma 7 ( 266129 )

      None of the algorithms, including the "standard", utilize an early exit on match.

      If that's true, that's likely due to optimization. Doing so adds log_2 n comparisons that are more likely to fail than succeed to a degree where each depth grants about 3/4ths of an extra comparison.

  • by pz ( 113803 ) on Saturday August 01, 2020 @12:18PM (#60355471) Journal

    In the FFT world, the race to faster implementations was taken by storm a while ago by a couple of students at MIT who invented FFTW (the Fastest Fourier Transform in the West), an autotuning version that took into account just about everything about the current CPU, memory system, and problem size to produce the most streamlined version possible that runs on general-purpose hardware. The world is a better place because of their work.

    Has anyone done the same thing for quicksort or binary sort?

    • Yes, it's called the C library

      • by Jeremi ( 14640 )

        Yes, it's called the C library

        AFAIK nothing in the C library looks at the specific hardware that the code is currently executing on and optimizes the algorithm to run faster on that hardware. Rather, the C library authors aim to find a single algorithm that can be expected to run reasonably well on most commonly-used hardware, which is not the same thing at all.

        • Then you should take a look at the glibc source code; they have tons of architecture-dependent versions of various functions.
          • by pz ( 113803 )

            That's not necessarily the same as what FFTW does: they do empirical testing of different formulations of FFT to find the one that runs fastest on the current hardware and problem size, and caches that information so that the next time FFTW gets called on the same problem size on the same hardware, it's super fast. From memory, they average about a 2x speedup over generic algorithms.

            • by guruevi ( 827432 )

              As does the glibc team, a lot of the code is written in straight assembler with "IFDEF _arch_". I'm not sure they go and try to optimize indefinitely but most 'basic' code (sorting, fft etc) has been hand-written to give the best performance on most common architectures.

              For most code, in many cases the compiler has many tricks built-in so it is sometimes faster to write in plain C so it can create branchless if statements and optimize out loops. I've been astounded sometimes to find out my binary for eg. At

              • by pz ( 113803 )

                The difference is that FFTW does the tuning on-line, immediately prior to executing the FFT that's been requested (or through a separate tuning call). It is all empirically determined based on the particular hardware, beyond just a CPU architecture, that is being used at that moment. It's a run-time determination, not a compile-time one.

            • Didn't say that it was a 1:1 equivalent. Doing what FFTW does in something like libc is most likely not possible, since most libc calls are far less computationally intensive than a full FFT, so you would just decrease performance (which is also why "JIT like Java can apply performance tuning during execution" is a pure pipe dream).
  • Switching from, say, linear search to binary search provides an exponential speedup. Any further improvements on this are likely to be worth the effort only on exceedingly large data sets, or in critical inner loops. But it's all good.
  • It's not clear from the description in the README: is the interpolated binary search distinct from a regula falsi root-finding variant?
  • by Terje Mathisen ( 128806 ) on Saturday August 01, 2020 @02:34PM (#60355779)

    I made a web page many years ago which allows you to search for any given decimal digit sequence within the first 1E9 digits of Pi (3.1415926...). The algorithm I came up with is exactly the same: I knew that Pi contains an effectively random sequence of digits, so I could guess that, in a sorted array of all possible starting positions, a sequence like 18790314 (Albert Einstein's birth date) should occur approximately 18.79% into the index array:

    https://tmsw.no/pi-search/ [tmsw.no]

    Find 18790314
    Found at 18,830,019: 636518371058586 18790314 193020374710719

    Total time = 0.001965 seconds (9 suffix lookups)

    With a naive binary search I would have needed ~30 iterations instead of the 9 taken by the interpolating algorithm.

    When you consider that both the digit array (1GB) and the 4 times larger index array (4GB, split into 10 separate 400MB files) reside on disk at my web host, the response time is quite good.

    Terje

  • It can be proven that sorting by comparison takes, in the worst case, around N*log(N) comparisons. The actual lower limit is log2(N!), like in sorting by insertion. However, this is close to N*log(N) for big N. One thing that you can optimize is the cache access, which can really make a huge time difference (like 10x) at the same number of operations. Another thing is to optimize common cases. Often you need to sort arrays that contain many parts that are already monotonic. The sort used in PHP
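    For reference, Stirling's approximation shows how close those two bounds are:

      \log_2(N!) = N \log_2 N - N \log_2 e + O(\log N) \approx N \log_2 N - 1.44\,N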
