Collaborative Filtering and the Rise of Ensembles 58
igrigorik writes "First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."
Group Labor (Score:2, Informative)
Re:Management Summary ? (Score:3, Informative)
Here's a stab at 3 sentences:
Making a prediction by running multiple statistical prediction algorithms and combining their results often seems to work well. This is called an "ensemble method". Ensemble methods seem to work particularly well on collaborative filtering problems.
Re:Open source governance (Score:4, Informative)
"Fitness" doesn't always have to be related to the output; it can be related to the quality of a guessed input.
Consider the corollary of a poll test: a model in which "trusted" voters receive extra votes while everyone else still gets on vote. You can determine "trustworthiness" (or "karma", if you will) the same way Slashdot does - through moderation and meta-moderation, or you can use a more objective "de minimis research" criteria (like a poll test but without the punishment for failure.)
So someone voting on a school board bond election who can correctly answer questions about the stated usage of that bond, or the school district's financial bond rating, or who attends a school board meeting discussing the bond, could get 2 votes for the price of one.
This would a) allow "passionate" (albeit informed) voters to have more of a say than someone who is indifferent, and b) encourage people to do research and get involved in politics.
In a way, it's anti-democratic, but if you are going to insert any sort of elitism into the system, it might as well be a meritocracy.
Re:I'm sorry (Score:5, Informative)
Yeah, the term dates back at least to the 1990s. The classic survey paper (over 1000 citations!) on the subject is "Ensemble Methods in Machine Learning" [pdf] [wsu.edu] by Tom Dietterich (2000), for those who want to glance through a survey. Though be warned that some of its specific conclusions are now dated--- e.g. there's been a *lot* written in both statistics and machine learning since then on what boosting "really" is and why it works.
Dietterich presents the more machine-learning view of it, focused on algorithms, combination of predictions, iterative refinement, etc. The best survey from a statistical approach is probably Ch. 16 of this book [amazon.com] by three [stanford.edu] Stanford [stanford.edu] profs [stanford.edu], which you can probably read some of on Google Books [google.com].