## Collaborative Filtering and the Rise of Ensembles 58

Posted
by
timothy

from the soon-there-will-be-symphonies dept.

from the soon-there-will-be-symphonies dept.

igrigorik writes

*"First the Netflix challenge was won with the help of ensemble techniques, and now the GitHub challenge is over, and more than half of the top entries are also based on ensembles. Good knowledge of statistics, psychology and algorithms is still crucial, but the ensemble technique alone has the potential to make the collaborative filtering space a lot more, well, collaborative! Here's a look at the basic theory behind ensembles, how they shaped the results of the GitHub challenge, and how this pattern can be used in the future."*
## Open source governance (Score:2, Interesting)

So can ensembles be used to create more sophisticated forms of direct democracy? That is, where everyone has input into decision-making, but where that input is vastly more complex than simple majority rule by the mob?

FWIW, this is called open source governance [metagovernment.org].

## Re: (Score:1, Insightful)

Thing about gov is everyone insists on having equal voting power... and the whole reason ensembles work is due to unequal weights---maybe those who vote "better" than others are given more voting weight?

## Re: (Score:3, Insightful)

"better" would require a fitness function; and everyone thinks that they vote "best".

If it were possible to define such a perfect function, we wouldn't really need voting anyway. We could just get a computer to crunch all the parameters and spit out a utopia.

## Re: (Score:2)

Donald Fagen [wikipedia.org], is that you?

--

I.G.Y. [donaldfagen.com]## Re:Open source governance (Score:4, Informative)

"Fitness" doesn't always have to be related to the output; it can be related to the quality of a guessed input.

Consider the corollary of a poll test: a model in which "trusted" voters receive extra votes while everyone else still gets on vote. You can determine "trustworthiness" (or "karma", if you will) the same way Slashdot does - through moderation and meta-moderation, or you can use a more objective "de minimis research" criteria (like a poll test but without the punishment for failure.)

So someone voting on a school board bond election who can correctly answer questions about the stated usage of that bond, or the school district's financial bond rating, or who attends a school board meeting discussing the bond, could get 2 votes for the price of one.

This would a) allow "passionate" (albeit informed) voters to have more of a say than someone who is indifferent, and b) encourage people to do research and get involved in politics.

In a way, it's anti-democratic, but if you are going to insert any sort of elitism into the system, it might as well be a meritocracy.

## Re: (Score:3, Interesting)

It deals with the problems of underrepresentation (right now if 51% of peopl

## Re: (Score:2)

Instead of trying to come up with a way to decide who gets extra votes, why not just go with transferable votes, where you can either vote on issues or tranfer your vote to another person (reversible of course).

because a lot of people are lazy and/or greedy and would simply sell all thier votes to the highest bidder

It deals with the problems of underrepresentation (right now if 51% of people voted for one person, and the other 49% voted for another, the 49 group is completely unrepresented

Not really, if you consider a president normally has little power, most power is held in houses where everything is voted on, so the 49% get their representation, unfortunately people at some point decided strong governance is worth more than representation of individuals so most countries have retarded voting systems that take away quite a bit of this representation. currently most countries have fair

## Re: (Score:2)

How does it deal with that? The 49% aren't represented, whether they vote directly for the losing candidate or

## Re: (Score:3, Insightful)

Giving everyone one vote already does that, without the ability

## Re: (Score:2)

Not a chance. "The IQ of a mob is equal to that of its lowest member....

divided by the size of the mob."## Group Labor (Score:2, Informative)

## Re:Group Labor (Score:5, Interesting)

Although that's true with humans, it's a bit curious why it'd be true with algorithms. After all, the aggregation of 3 algorithms is still just an algorithm. It's not even totally clear which algorithms are ensembles and which aren't--- some non-ensemble methods could be re-analyzed using ensemble terminology, and some ensemble methods could be rewritten as unified iterative loops that don't look very ensemble-y. The jury's still out on the whole subject, as far as I can tell (I'm not an ML person, but I'm an AI person whose research bleeds into ML).

An exception is when you're aggregating information from truly different statistical problems, in which case you inherently have an ensemble problem, until someone comes up with the theory (plus tractable implementation) to view the problem as one unified statistical problem. I think collaborative filtering is currently in that stage--- there's no canonical way to pose the problem in the terminology of statistical regression/etc. that captures all aspects of it.

## Re: (Score:2)

## Re: (Score:1, Insightful)

No, I think his pond is just too deep for you.

## Re: (Score:2)

"For A's optimal conditions, use A"

"For B's optimal conditions, use B"

"For any other conditions, use C"

I'm s

## Re: (Score:1)

Well, it's firstly true with humans because you have always been able to do parallel processing with multiple humans.

But it's perfectly applicable with algorithms, especially now that we have better parallel machines.

But even so, you take a task where you have several heuristics, something like the ubiquitous quicksort but a little less reliable. If all of them have a 50/50 chance of working, you set all of them at the problem at once, and even on a single core, you end up with a fairly robust and accurate

## Re: (Score:2)

So in other words, it's Yet Another Buzzword?

## Re: (Score:2)

## Re: (Score:1)

## a few different things going on (Score:5, Interesting)

There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems. But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable--- there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information.

For example, you can see the Netflix prize as a few different standard statistical problems. As a per-movie regression, predicting what Person A will rate Movie B, given ratings vector of Person A and the ratings vectors of everyone who's already rated Movie B [the per-person ratings vectors excluding B are the X's, and the ratings on B are the Y's]. Or you slice the movie-ratings matrix the other way, with per-movie ratings vectors as the X's. Add in some other views (those are the two most straightforward), aggregate all the info you get from them, and you do better than any one approach alone.

## Re:a few different things going on (Score:5, Interesting)

"There's a lot of argument over why ensemble techniques work well in general, when using them on well-posed statistical problems."

My opinion:

1) It's a form of regularization & noise averaging. Different classes of estimators have different systematic errors and proper averaging almost always performs better than any single estimator. In more limited contexts, e.g. parametric estimators with variable numbers of free parameters (small compared to # of data points) this is well established in Bayesian contexts. It's like regularization in the sense that the averaging will exclude the howlers---occasional idiosyncratic screw-ups from any single estimator, phenomena that tend to happen with under-regularized estimators.

2) "But in the collaborative filtering case, they work well at least in part because there's not a canonical way of posing the problem statistically that's also tractable--- there are instead multiple ways to view the problem, which expose different information. Aggregating those views is a pretty straightforward way of getting more information."

If the thing to be predicted also includes the unknown parameter of "exactly how are the judges going to define the performance metric", then similarly averaging over different possibilities is a good risk-minimization strategy.

In practice (knowing how people watch movies and what netflix cares about)the statistical setting of Netflix seems to be this:

There is an unknown distribution of draw tuples of (movies, people, ratings). (in practice, "date-of-rating" and"date-of-movie" turn out to be additional useful data).

You have observed a large number of these tuples already.

Then, given a number of draws for (movies, person, ratings) where person is fixed, predict the the rating given (new_movie, same_person).

The asymmetry is natural because movies and people are not interchangable: movies do not have opinions about people.

I don't consider Netflix to be very ill-posed statistically.

## Re: (Score:2)

In fact, the actual problem definition was very well thought out, and the performance metric is not at all dependent on the judges, but is precisely the least squares criterion.

What makes Netflix ill-posed is that the solution is highly sensitive to small changes in the supplied data. This is empirically clear from the fact that people tried many, many different solution techniques (which use the data in different ways), and each of those doesn't get very close to

## Re:a few different things going on (Score:5, Interesting)

Not sure if there's a lot of argument, as the bias-variance decomposition does seem to give some critical insight in why ensemble techniques work. Couldn't find a good link, but loosely every statistical prediction method will make errors due to bias (being systematicaly wrong, always in the same direction), and errors due to variance (being highly dependent on the fitness data, overfitted). Error methods such as sum-of-squares can be readily decomposed into a term for bias and a term for variance.

Some methods will have errors mainly due to bias (linear methods). Others have error mainly due to variance (neural nets for instance). Whichever method has the lowest sum of the errors - bias + variance - wins. This holds for any prediction method, so also for an ensemble method. If you take for instance the ensemble method of

bagging, you will use the same method on differently sampled data (Bootstrap AGGregated). This has as an effect that you average out all the error due to variance, and are left with the error due to bias. If you happen to use a low bias method (such as a neural network or a decision tree), this works fine.For the recommender challenges, something even niftier is happening. Here many different methods are aggregated, each with different biases and different variances. So, in theory, when doing this, one should be able to average out both bias and variance error, and converge on the (Bayesian) optimal predictor. Note however that averaging is in itself a prediction method, so ensemble methods have their own bias and variance tradeoff. Fortunately, computing the mean is an unbiased and low variance method in its own right.

What I always found very interesting about ensemble methods is that they effectively contradict Occam's razor, in that the end result of an ensemble is not a single theory that predicts the data well, but a whole set of mutually contradicting theories that each hold some of the truth. The ensemble result might actually be huge, even when the system is simple.

## Re: (Score:2)

What I always found very interesting about ensemble methods is that they effectively contradict Occam's razor, in that the end result of an ensemble is not a single theory that predicts the data well, but a whole set of mutually contradicting theories that each hold some of the truth. The ensemble result might actually be huge, even when the system is simple.Well put. We have some software that does predictive modeling/data mining using ensemble techniques. The ensemble models dependably work better, with

## Re: (Score:2)

There's nothing special about ensemble methods that makes them better than a unified model. On the contrary, they are linear combinations of predictors, and the lack of unification makes this a rather inefficient idea.

Compare with a series representation (eg Fourier, wavelets, etc): if each of the component predictors of your ensemble method is de

## Weighting of ensembles (Score:3, Funny)

1.)Length of input compared to cleaned word.

2.)Number of nonpositional letter matches.

3.)Number of positional letter matches.

Depending on how rules are weighted determines what the answer will be (either sad or happy). I know at my job this weighting process requires very careful politicking.

## Re: (Score:2)

Serious question: do you guys have a software routine you use to account for fat-fingered typos? So do you weight the positional changes by how close a letter is to another on the keyboard?

So for example,

Leyboard

would be a closer to match to "keyboard" than

Reyboard

because L is geospatially closer to K on the keyboard than R is?

That would be an interesting program to see, anyway.

## Re: (Score:2)

"Can someone give me the management summary in laymen terms ?"

Do-Your-Maths (full stop).

## Re: (Score:3, Informative)

Here's a stab at 3 sentences:

Making a prediction by running multiple statistical prediction algorithms and combining their results often seems to work well. This is called an "ensemble method". Ensemble methods seem to work particularly well on collaborative filtering problems.

## Re: (Score:3, Insightful)

Algorithms produce better results working in committees, unlike you.

## I'm sorry (Score:1)

## Re: (Score:2)

No problem! Please inform us what year we should have stopped naming stuff and just stuck with the tried and true.

Sorry, progress doesn't stop because you want it to or are getting a little long in the tooth to learn new things.

## Re: (Score:2)

The "progress" in terminology you're talking about is mostly the result of you being too young to have bothered to learn the old things.

Which is fine. Way too much stuff has happened in the past for anybody to know it all; it's just unremarkable that everything old should be new again, under a new name, as ideas get born and reborn.

## Re: (Score:1)

But in this case "Ensemble" is not a new term, even by my old fart standards, it's been around since 1995 and describes the use of multiple modeling methods to

## Re: (Score:1)

First line of Chapter 16 Ensemble Learning: "The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models".

Yeah, that's what I meant to say

## Re:I'm sorry (Score:5, Informative)

Yeah, the term dates back at least to the 1990s. The classic survey paper (over 1000 citations!) on the subject is "Ensemble Methods in Machine Learning" [pdf] [wsu.edu] by Tom Dietterich (2000), for those who want to glance through a survey. Though be warned that some of its specific conclusions are now dated--- e.g. there's been a *lot* written in both statistics and machine learning since then on what boosting "really" is and why it works.

Dietterich presents the more machine-learning view of it, focused on algorithms, combination of predictions, iterative refinement, etc. The best survey from a statistical approach is probably Ch. 16 of this book [amazon.com] by three [stanford.edu] Stanford [stanford.edu] profs [stanford.edu], which you can probably read some of on Google Books [google.com].

## Re: (Score:1)

Current favorite tech book: Racing the Beam [bit.ly] , on the Atari VCSSo your the one that bought that book.

## Re: (Score:2)

What's that cryptic comment mean? I did indeed buy it, though, yes. =]

## Re: (Score:1)

Just struck me as something where I really couldn't imagine a whole lot of people would read something like this.

## Re: (Score:2)

Hmm, I could see that, though the Atari VCS (aka Atari 2600) was a pretty popular platform. There's fan forums [atariage.com] and such devoted to it. It seemed cool to me that someone finally wrote a good book on it, since it was pretty influential.

## Re: (Score:1)

As much as I share your loathing for marketing, I don't believe this is a case of that. There is no company pushing this term, it's a community (or research?) -driven word. I'm not sure there IS a word for it yet except for ensemble. If you know otherwise, please let us know.

I'll get off your lawn now.

## Re: (Score:1)

## Re: (Score:1)

A field is typically defined as a patch of grass with some kind of enclosure around it, often made of wood, stone or natural bushes.

Please while you are on your campaign, have math, science and computing rename the concepts they refer to as fields.

## Re: (Score:2)

As applied in statistical modeling it is a process which selects the model which fits the data the best. Thus it it makes me irate as my understanding is in opposition with its classical use.In fact it isn't. The classical practice is selecting the model which fits the data the best, known as "model selection".

Ensemble methods do little "yes/no" model selection and instead make predictions from weighted combinations of many base predictors. To make a prediction, you need to score many or all of the base m

## Sounds just like MCTS (Score:1)

Machine learning ensembles sounds just like monte-carlo tree search (MCTS) techniques (also called UCT), which are used in computer go (and more and more other AI problems) with great success.

The idea is that instead of trying to analyze a board position (which can be really, really difficult) using clever algorithms, you ask a random/simplistic algorithm to play out the rest of the game thousands upon thousands of times and see how many of those games it wins. The more it wins the better the positions.

Soun