## Collaborative Filtering and the Rise of Ensembles 58 58

igrigorik writes

*"First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."*
## Open source governance (Score:2, Interesting)

So can ensembles be used to create more sophisticated forms of direct democracy? That is, where everyone has input into decision-making, but where that input is vastly more complex than simple majority rule by the mob?

FWIW, this is called open source governance [metagovernment.org].

## a few different things going on (Score:5, Interesting)

There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems. But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable--- there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information.

For example, you can see the Netflix prize as a few different standard statistical problems. As a per-movie regression, predicting what Person A will rate Movie B, given ratings vector of Person A and the ratings vectors of everyone who's already rated Movie B [the per-person ratings vectors excluding B are the X's, and the ratings on B are the Y's]. Or you slice the movie-ratings matrix the other way, with per-movie ratings vectors as the X's. Add in some other views (those are the two most straightforward), aggregate all the info you get from them, and you do better than any one approach alone.

## Re:Group Labor (Score:5, Interesting)

Although that's true with humans, it's a bit curious why it'd be true with algorithms. After all, the aggregation of 3 algorithms is still just an algorithm. It's not even totally clear which algorithms are ensembles and which aren't--- some non-ensemble methods could be re-analyzed using ensemble terminology, and some ensemble methods could be rewritten as unified iterative loops that don't look very ensemble-y. The jury's still out on the whole subject, as far as I can tell (I'm not an ML person, but I'm an AI person whose research bleeds into ML).

An exception is when you're aggregating information from truly different statistical problems, in which case you inherently have an ensemble problem, until someone comes up with the theory (plus tractable implementation) to view the problem as one unified statistical problem. I think collaborative filtering is currently in that stage--- there's no canonical way to pose the problem in the terminology of statistical regression/etc. that captures all aspects of it.

## Re:a few different things going on (Score:5, Interesting)

"There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems."

My opinion:

1) It's a form of regularization & noise averaging. Different classes of estimators have different systematic errors and proper averaging almost always performs better than any single estimator. In more limited contexts, e.g. parametric estimators with variable numbers of free parameters (small compared to # of data points) this is well established in Bayesian contexts. It's like regularization in the sense that the averaging will exclude the howlers---occasional idiosyncratic screw-ups from any single estimator, phenomena that tend to happen with under-regularized estimators.

2) "But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable--- there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information."

If the thing to be predicted also includes the unknown parameter of "exactly how are the judges going to define the performance metric", then similarly averaging over different possibilities is a good risk-minimization strategy.

In practice (knowing how people watch movies and what netflix cares about)the statistical setting of Netflix seems to be this:

There is an unknown distribution of draw tuples of (movies, people, ratings). (in practice, "date-of-rating" and"date-of-movie" turn out to be additional useful data).

You have observed a large number of these tuples already.

Then, given a number of draws for (movies, person, ratings) where person is fixed, predict the the rating given (new_movie, same_person).

The asymmetry is natural because movies and people are not interchangable: movies do not have opinions about people.

I don't consider Netflix to be very ill-posed statistically.

## Re:Open source governance (Score:3, Interesting)

It deals with the problems of underrepresentation (right now if 51% of people voted for one person, and the other 49% voted for another, the 49 group is completely unrepresented), lack of knowledge among voters (most people know nothing about politics, but at least know someone who knows more than they do), and a lot of government incompetence problems (if someone in "your party" does something you don't like, it's not "the Republican" vs "the Democrat", you can just transfer your vote to another person in your party.. although hopefully this would just destroy political parties instead, since they're not helpful).

## Re:a few different things going on (Score:5, Interesting)

Not sure if there's a lot of argument, as the bias-variance decomposition does seem to give some critical insight in why ensemble techniques work. Couldn't find a good link, but loosely every statistical prediction method will make errors due to bias (being systematicaly wrong, always in the same direction), and errors due to variance (being highly dependent on the fitness data, overfitted). Error methods such as sum-of-squares can be readily decomposed into a term for bias and a term for variance.

Some methods will have errors mainly due to bias (linear methods). Others have error mainly due to variance (neural nets for instance). Whichever method has the lowest sum of the errors - bias + variance - wins. This holds for any prediction method, so also for an ensemble method. If you take for instance the ensemble method of

bagging, you will use the same method on differently sampled data (Bootstrap AGGregated). This has as an effect that you average out all the error due to variance, and are left with the error due to bias. If you happen to use a low bias method (such as a neural network or a decision tree), this works fine.For the recommender challenges, something even niftier is happening. Here many different methods are aggregated, each with different biases and different variances. So, in theory, when doing this, one should be able to average out both bias and variance error, and converge on the (Bayesian) optimal predictor. Note however that averaging is in itself a prediction method, so ensemble methods have their own bias and variance tradeoff. Fortunately, computing the mean is an unbiased and low variance method in its own right.

What I always found very interesting about ensemble methods is that they effectively contradict Occam's razor, in that the end result of an ensemble is not a single theory that predicts the data well, but a whole set of mutually contradicting theories that each hold some of the truth. The ensemble result might actually be huge, even when the system is simple.