Augmenting Data Beats Better Algorithms
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
Re:Is it just me that is surprised here? (Score:5, Informative)
Re:Heuristics?? (Score:5, Informative)
A bit like swap vs. real memory (Score:2, Informative)
So when I think of this recommendation system, a better algorithm is like having swap space enabled. It's a more sophisticated use of the data you have. Having more data is like having more RAM. And of course the best option is to have more reference data and a better algorithm. It's not an exclusive disjunction, and it's silly to think it has to be.
Re:attn computer scientists: stop renaming stuff (Score:3, Informative)
I know anonymous cowards like playing expert, but there's a reason why you're the butt of so many jokes here -- only thing you're usually expert in is misinformation and disingenuity.
This does not mean what I think you think it means (Score:4, Informative)
The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.
The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.
But that does not mean that the algorithm does not matter - for the problems I work on, using logistic regression or support vector machines outperforms Naive Bayes by 10-30%, which is huge. So if you want good performance, you try a few different algorithms and see what works.
Adding more training data does not always help either, particularly if its distribution differs significantly from the data you will see in practice. You are much better off using the data to design better features that represent or summarize it.
In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.
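The point about features can be made concrete with a toy sketch (hypothetical data, no libraries): a linear classifier with no bias term cannot separate XOR-style data on the raw features, but one engineered feature (the product x1*x2) makes the problem trivially separable.

```python
# XOR-style points: label is 1 when the two raw features disagree in sign.
data = [((+1, +1), 0), ((-1, -1), 0), ((+1, -1), 1), ((-1, +1), 1)]

def accuracy(weights, points):
    """Fraction of points a linear threshold unit (predict 1 if score < 0) gets right."""
    correct = 0
    for x, y in points:
        score = sum(w * xi for w, xi in zip(weights, x))
        predicted = 1 if score < 0 else 0
        correct += int(predicted == y)
    return correct / len(points)

# Search a small grid of weights on the raw features: nothing separates XOR.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_raw = max(accuracy((w1, w2), data) for w1 in grid for w2 in grid)

# Engineered feature: x3 = x1 * x2 is negative exactly for the positive class.
engineered = [((x1, x2, x1 * x2), y) for (x1, x2), y in data]

print(best_raw)                               # stuck at 0.75 on raw features
print(accuracy((0.0, 0.0, 1.0), engineered))  # 1.0 with the engineered feature
```

Same simple model family in both cases; only the feature representation changed.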
Re:Is it just me that is surprised here? (Score:3, Informative)
In linear regression models for forecasting there is what's known as the "variance inflation factor" (VIF). This factor helps a statistician know when a linear regression model is beginning to perform poorly because too many inter-related variables are in the equation: different variables carrying different, but overlapping, information will eventually begin to conflict with one another.
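For two predictors, the variance inflation factor reduces to VIF = 1 / (1 - r^2), where r is their Pearson correlation. A minimal sketch with made-up numbers (two strongly inter-related predictors, say "movies rated" and "movies watched"):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical, heavily correlated predictors:
x1 = [10, 20, 30, 40, 50]
x2 = [12, 19, 33, 38, 52]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)
print(round(vif, 1))  # a common rule of thumb flags VIF above 5-10
```

Here the VIF comes out far above the usual rule-of-thumb thresholds, signaling that the two predictors are largely redundant.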
For the Netflix thing, this could show up as a problem if the model is trying to recommend which movie you should rent next based on actors/actresses in previous movies you've watched, which movies you rated higher than others, which genres those highly rated movies were in, which actors/actresses you rated highly, and which movies those highly rated actors/actresses had been in that you hadn't seen yet. It's quite likely that someone like Kevin Bacon has been in some romantic comedy with another of your favorite actors or actresses, but you absolutely hate horror movies and he's in a "horror" film with that same actor or actress. The model would likely recommend that movie to you based on three positives (a favorite film and two separate favorite actors) because there's only one negative in the equation (your hatred of horror movies).
This is a very simplistic example, but that's the problem of too much data with too simplistic an algorithm. A linear regression might have this problem, but if one were to build in an additional bit of algorithmic magic that filtered out horror movies, or severely punished them for being in the horror genre, before looking at other factors like favorite actors/actresses, then the algorithm would perform better. But then, of course, additional types of data would be needed to adequately fill in the gaps for the new monster algorithm you've created.
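The parent's example can be sketched in a few lines (invented weights and feature names, purely illustrative): an additive score lets three weak positives outvote one strong negative, while applying a genre penalty first changes the outcome.

```python
# Hypothetical movie record: two liked actors, similar to a favorite film,
# but in a genre the user hates.
movie = {"actors_liked": 2, "similar_to_favorite": 1, "genre": "horror"}
hated_genres = {"horror"}

def additive_score(m):
    """Naive linear score: three positives, one -1 for the hated genre."""
    penalty = 1 if m["genre"] in hated_genres else 0
    return m["actors_liked"] + m["similar_to_favorite"] - penalty

def filtered_score(m):
    """Severely punish a hated genre before counting any other factors."""
    if m["genre"] in hated_genres:
        return -10
    return m["actors_liked"] + m["similar_to_favorite"]

print(additive_score(movie))   # positive, so it would be recommended anyway
print(filtered_score(movie))   # strongly negative, so it gets filtered out
```

The "algorithm magic" here is just ordering: apply the hard constraint before the soft preferences.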
Re:Heuristics?? (Score:3, Informative)
Let's not be overly pedantic: in casual speech, a heuristic is a type of algorithm.
Re:Heuristics?? (Score:2, Informative)
"In casual speech"? That's just wrong... a heuristic is a type of algorithm, period. (Assuming it meets the other requirements of being an algorithm, such as termination.) That it doesn't produce an optimal result doesn't enter into it. [In this post I say "doesn't produce" as a shorthand for "isn't guaranteed to produce".]
CS theorists talk about randomized algorithms [wikipedia.org]. They don't produce an optimal result. CS theorists talk about online algorithms [wikipedia.org]. They don't produce an optimal result. CS theorists talk about approximation algorithms [wikipedia.org]. They don't produce an optimal result.
Producing an optimal result isn't a requirement of being an algorithm. Heuristics are just algorithms that tend to produce useful results most of the time. In fact, the Wikipedia page [wikipedia.org] for the CS notion of a heuristic is called "heuristic algorithm."
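A textbook illustration of the point (standard example, not from the parent): greedy coin change is a perfectly good algorithm, since it always terminates, yet with denominations {1, 3, 4} it's only a heuristic, because it isn't guaranteed to produce an optimal result.

```python
def greedy_change(amount, denominations):
    """Repeatedly take the largest coin that fits; terminates whenever 1 is a denomination."""
    coins = []
    for d in sorted(denominations, reverse=True):
        while amount >= d:
            coins.append(d)
            amount -= d
    return coins

# For amount 6, greedy takes 4 + 1 + 1 (three coins);
# the optimal answer is 3 + 3 (two coins).
print(greedy_change(6, [1, 3, 4]))
```

Well-defined, always halts, usually fine, sometimes suboptimal: an algorithm and a heuristic at the same time.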