Augmenting Data Beats Better Algorithms
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
Re:Is it just me that is surprised here? (Score:5, Informative)
Re:Heuristics?? (Score:5, Informative)
A bit like swap vs. real memory (Score:2, Informative)
So when I think of this recommendation system, a better algorithm is like having swap space enabled. It's a more sophisticated use of the data you have. Having more data is like having more RAM. And of course the best option is to have more reference data and a better algorithm. It's not an exclusive disjunction, and it's silly to think it has to be.
Re:attn computer scientists: stop renaming stuff (Score:3, Informative)
I know anonymous cowards like playing expert, but there's a reason why you're the butt of so many jokes here -- only thing you're usually expert in is misinformation and disingenuity.
This does not mean what I think you think it means (Score:4, Informative)
The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.
The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.
But that does not mean that the algorithm does not matter - for the problems I work on, using logistic regression or support vector machines outperforms Naive Bayes by 10-30%, which is huge. So if you want good performance, you try a few different algorithms and see what works.
Adding more training data does not always help either, particularly if its distribution differs significantly from the data you will see in practice. You are much better off using the data to design better features that represent or summarize it.
In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.
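The point about features can be made concrete with a toy sketch (hypothetical data, no libraries): a linear classifier with no bias term cannot separate XOR-style data on the raw features, but one engineered feature (the product x1*x2) makes the problem trivially separable.

```python
# XOR-style points: label is 1 when the two raw features disagree in sign.
data = [((+1, +1), 0), ((-1, -1), 0), ((+1, -1), 1), ((-1, +1), 1)]

def accuracy(weights, points):
    """Fraction of points a linear threshold unit (predict 1 if score < 0) gets right."""
    correct = 0
    for x, y in points:
        score = sum(w * xi for w, xi in zip(weights, x))
        predicted = 1 if score < 0 else 0
        correct += int(predicted == y)
    return correct / len(points)

# Search a small grid of weights on the raw features: nothing separates XOR.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_raw = max(accuracy((w1, w2), data) for w1 in grid for w2 in grid)

# Engineered feature: x3 = x1 * x2 is negative exactly for the positive class.
engineered = [((x1, x2, x1 * x2), y) for (x1, x2), y in data]

print(best_raw)                               # stuck at 0.75 on raw features
print(accuracy((0.0, 0.0, 1.0), engineered))  # 1.0 with the engineered feature
```

Same simple model family in both cases; only the feature representation changed.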
Re:Is it just me that is surprised here? (Score:3, Informative)
In linear regression models for forecasting there is what's known as the "variance inflation factor" (VIF). This factor helps a statistician know when a linear regression model is beginning to perform poorly because too many inter-related variables are in the equation: different variables carrying different, but overlapping, information will eventually begin to conflict with one another.
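For two predictors, the variance inflation factor reduces to VIF = 1 / (1 - r^2), where r is their Pearson correlation. A minimal sketch with made-up numbers (two strongly inter-related predictors, say "movies rated" and "movies watched"):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical, heavily correlated predictors:
x1 = [10, 20, 30, 40, 50]
x2 = [12, 19, 33, 38, 52]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)
print(round(vif, 1))  # a common rule of thumb flags VIF above 5-10
```

Here the VIF comes out far above the usual rule-of-thumb thresholds, signaling that the two predictors are largely redundant.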
For the Netflix thing, this could show up as a problem if the model is trying to recommend which movie you should rent next based on actors/actresses in previous movies you've watched, which movies you rated higher than others, which genres those highly rated movies were in, which actors/actresses you rated highly, and which movies those highly rated actors/actresses had been in that you hadn't seen yet. It's quite likely that someone like Kevin Bacon has been in some romantic comedy with another of your favorite actors or actresses, but you absolutely hate horror movies and he's in a "horror" film with that same actor or actress. The model would likely recommend that movie to you based on three positives (a favorite film and two separate favorite actors) because there's only one negative in the equation (your hatred of horror movies).
This is a very simplistic example, but that's the problem of too much data with too simplistic an algorithm. A linear regression might have this problem, but if one were to build in an additional bit of algorithmic magic that filtered out horror movies, or severely punished them for being in the horror genre, before looking at other factors like favorite actors/actresses, then the algorithm would perform better. But then, of course, additional types of data would be needed to adequately fill in the gaps for the new monster algorithm you've created.
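The parent's example can be sketched in a few lines (invented weights and feature names, purely illustrative): an additive score lets three weak positives outvote one strong negative, while applying a genre penalty first changes the outcome.

```python
# Hypothetical movie record: two liked actors, similar to a favorite film,
# but in a genre the user hates.
movie = {"actors_liked": 2, "similar_to_favorite": 1, "genre": "horror"}
hated_genres = {"horror"}

def additive_score(m):
    """Naive linear score: three positives, one -1 for the hated genre."""
    penalty = 1 if m["genre"] in hated_genres else 0
    return m["actors_liked"] + m["similar_to_favorite"] - penalty

def filtered_score(m):
    """Severely punish a hated genre before counting any other factors."""
    if m["genre"] in hated_genres:
        return -10
    return m["actors_liked"] + m["similar_to_favorite"]

print(additive_score(movie))   # positive, so it would be recommended anyway
print(filtered_score(movie))   # strongly negative, so it gets filtered out
```

The "algorithm magic" here is just ordering: apply the hard constraint before the soft preferences.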
Re:Heuristics?? (Score:3, Informative)
Let's not be overly pedantic: in casual speech, a heuristic is a type of algorithm.
Re:Heuristics?? (Score:2, Informative)
"In casual speech"? That's just wrong... a heuristic is a type of algorithm, period. (Assuming it meets the other requirements of being an algorithm, such as termination.) That it doesn't produce an optimal result doesn't enter into it. [In this post I say "doesn't produce" as a shorthand for "isn't guaranteed to produce".]
CS theorists talk about randomized algorithms [wikipedia.org]. They don't produce an optimal result. CS theorists talk about online algorithms [wikipedia.org]. They don't produce an optimal result. CS theorists talk about approximation algorithms [wikipedia.org]. They don't produce an optimal result.
Producing an optimal result isn't a requirement of being an algorithm. Heuristics are just algorithms that tend to produce useful results most of the time. In fact, the Wikipedia page [wikipedia.org] for the CS notion of a heuristic is called "heuristic algorithm."
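A textbook illustration of the point (standard example, not from the parent): greedy coin change is a perfectly good algorithm, since it always terminates, yet with denominations {1, 3, 4} it's only a heuristic, because it isn't guaranteed to produce an optimal result.

```python
def greedy_change(amount, denominations):
    """Repeatedly take the largest coin that fits; terminates whenever 1 is a denomination."""
    coins = []
    for d in sorted(denominations, reverse=True):
        while amount >= d:
            coins.append(d)
            amount -= d
    return coins

# For amount 6, greedy takes 4 + 1 + 1 (three coins);
# the optimal answer is 3 + 3 (two coins).
print(greedy_change(6, [1, 3, 4]))
```

Well-defined, always halts, usually fine, sometimes suboptimal: an algorithm and a heuristic at the same time.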