Genetic Algorithm Improves Shellsort 71
gstover writes "As a personal programming project recently, I used a genetic algorithm to find a better sequence of increments for Shellsort. It found at least one sequence that out-performs the best previously known by about 5 percent. Here
is an article about it (also here).
I believe this is the first time a
genetic algorithm has actually improved Shellsort."
Impressive (Score:2, Interesting)
I could imagine your shell sort could be applied to database technologies by creating a separate low priority thread that would generate the "gene pool" to improve the indexing of the tables in the background.
On a side note I am learning about genetic algorithms with hopes that I may work on it when I take my Masters in Computer Science.
Re:Nice and all (Score:4, Interesting)
There may be some cases where shellsort is more desirable for the exact data being sorted, I don't really know for sure. The importance of this is that he has used a GA to better the optimization work of humans on shellsort. He has laid the groundwork and circumstantial proof out for others to do the same with other algorithms. Of course he evolved a set of constants more than an algorithm itself.
The next logical place to go with this work, IMHO:
1) Invent a concise fake machine language for sorting algorithms (a convenience to make the rest of this easier). It should have basic instructions used in manipulating and sorting arrays (move, compare, copy, branching, etc...).
2) Write a "sort engine" that takes as input algorithms written in your fake language and uses them to sort things (outputting some performance numbers).
3) Implement every known array sorting algorithm you can find in your little fake machine sort language.
4) Let a GA evolve the whole algorithm by arbitrarily replacing bytes with other valid bytes from your specialized assembler language, starting with all the known sort algs as parents. Let it run until it stabilizes, using a relatively high mutation rate.
Of course, the big problem is that if your language implements any kind of looping construct, or any other way that code can be re-executed (and it will almost have to), then you face a "halting problem" when testing each new child. The pratical workaround is of course to know that any reasonable algorithm must finish the sort in a certain bounded amount of cpu cycles, and terminate any children who take longer.
5) Translate the winning candidate(s) custom machien source back into a generic language like C, and puzzle over exactly why it works so damn well.
Re:Nice and all (Score:4, Interesting)
The big problem with GA though, IIRC, is that the resulting solution is often incomprehensible to a human. I believe Bill Joy did some work with GAs and had comments along those lines (sorry, I couldn't find the quote). Consider for a moment though trying to troubleshoot code generated by a computer. Bad variable names would be just the start of your problems. The logic patterns employed would be essentially random to a human. Many of the patterns would be vestigial and wouldn't even be relevant, but you wouldn't even know that. Identifying the primary execution paths would be a huge chore... never mind actually understanding the basis for why the generated solution actually works.
How comfortable would you be deploying a solution (hardware or software) where the fundamental design isn't even understood? How the heck do you fix such a thing once it's deployed?
Re:Impressive (Score:4, Interesting)
An obvious extension to generic lengths is to use these precomputed networks as recursion base cases for quicksort, instead of switching to selection sort for lengths x (x ~ 5 typically).
IIRC shellsort works similarly: recursively sorting subsequences and merging results.
Limited number of test inputs (Score:1, Interesting)
Also, randomly ordered inputs are not common in practical applications. Did you test (or consider testing) partially sorted and partially reverse-sorted numbers? Presumably the contrived sequences (such as Sedgewick's) are balanced to provide good results for both random and non-random data.
Also see comb sort (Score:2, Interesting)
http://cs.clackamas.cc.or.us/molatore/cs260Spr0
http://world.std.com/~jdveale/combsort.htm
Re:Actually this HAS beed done before (Score:2, Interesting)
The only GA-originated increments for Shellsort that I could find online was the one in Crew fall 2001 report [allegheny.edu], but that sequence didn't perform very well, & they didn't document their technique except to say that it wasn't a full-blown GA. (They placed some limitations on it. They also mention that Simpson & Yachavaram [allegheny.edu] used only a limited GA.)
I wondered if a GA that fully manipulated bit-strings could produce a sequence as good as Sedgewick's, & I also wanted to document the technique so someone else interested in GAs could implement their own or at least be a little entertained with how it was done.
Re:Nice and all (Score:3, Interesting)
Re:Nice and all (Score:3, Interesting)
Shell Sort is proven, but good increments can't be (Score:3, Interesting)
Shell Sort works by roughly sorting, then sorting a little finer, and so on, until you finely sort every piece of data, and hopefully at this final and potentially most time-consuming step, there isn't a lot of work to be done.
The thing is, there's no platonic God-like perfect mathmatically-proven best choice of how fine or rough the early iterations should be. No matter what increments you use (within certain constraints) Shell Sort will still work every single time (the last step at increment 1 guarantees it) -- it's just a matter of performance. Nobody knows what increments work best, and you can't just mathmatically figure it out.
Re:Nice and all (Score:3, Interesting)
There are some inherit disadvantages of GP, while being much more powerful than GA. One of them being the problem of code growth, usually after some generations, the organisms gets to be too large and starting to cosume too much memory and cpu resources to be feasible. One of the papers I wrote while taking Evolutionary Computation class while in college was the investigation and a 'solution/improvement' to the code growth problem.