Augmenting Data Beats Better Algorithms
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting your data beats improving your algorithm. He had teams in his class enter the Netflix challenge, and two teams took different approaches: one used a better algorithm, while the other harvested augmenting data on movies from the Internet Movie Database. The second team, despite its simpler algorithm, did much better, scoring nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page-ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
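The summary's claim can be sketched with a toy recommender: the same simple algorithm (predict a group mean) improves markedly once it is fed one augmented feature. The ratings, movie ids, and genre tags below are invented for illustration; the genre tag stands in for the harvested IMDb metadata, not for anything in the actual challenge.

```python
# Toy illustration: the same simple algorithm (predict a group mean)
# improves when it is fed augmented data (a genre tag per movie).
# All ratings, ids, and genres below are invented for illustration.
from collections import defaultdict

ratings = [  # (movie_id, rating)
    (0, 5), (0, 4), (1, 5), (1, 5),
    (2, 1), (2, 2), (3, 2), (3, 1),
]
genre = {0: "comedy", 1: "comedy", 2: "horror", 3: "horror"}  # augmented data

# Baseline: no side information, so predict the global mean for everything.
global_mean = sum(r for _, r in ratings) / len(ratings)

# Augmented: still just a mean, but grouped by the harvested genre tag.
by_genre = defaultdict(list)
for movie, rating in ratings:
    by_genre[genre[movie]].append(rating)
genre_mean = {g: sum(rs) / len(rs) for g, rs in by_genre.items()}

def rmse(predict):
    n = len(ratings)
    return (sum((predict(m) - r) ** 2 for m, r in ratings) / n) ** 0.5

base_err = rmse(lambda m: global_mean)
aug_err = rmse(lambda m: genre_mean[genre[m]])
print(f"baseline RMSE: {base_err:.2f}, augmented RMSE: {aug_err:.2f}")
```

The algorithm never changes; only the available data does, which is the point of the story.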
This reminds me (Score:3, Interesting)
Combine that with the speed at which computers are getting more efficient, and I see no reason not to just keep piling up this crap. More is always better. (More efficient might be better, but add the two together and you're unstoppable.)
To a large extent ... (Score:3, Interesting)
Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.
Obviously, there will be a big jump when going from a simplistic method (say, linear regression) to a more sophisticated one (say, SVMs). But going from one type of SVM to a slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.
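That first jump can be illustrated with a toy example on synthetic data, using 1-nearest-neighbor regression as a stand-in for the "more sophisticated method" (a full SVM would be beside the point here): a straight line simply cannot follow a curved relationship that even a crude non-linear method tracks.

```python
# Synthetic, deliberately non-linear data: y = x * x.
train = [(x, x * x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
test  = [(x, x * x) for x in (-1.8, -0.8, 0.3, 1.7)]

# Simplistic method: ordinary least-squares line y = a*x + b.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
b = my - a * mx

# Crude non-linear method: 1-nearest-neighbor regression.
def nearest(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def rmse(predict):
    return (sum((predict(x) - y) ** 2 for x, y in test) / len(test)) ** 0.5

lin_err = rmse(lambda x: a * x + b)
nn_err = rmse(nearest)
print(f"linear RMSE: {lin_err:.2f}, 1-NN RMSE: {nn_err:.2f}")
```

On this symmetric parabola the fitted line is flat (slope zero), so the linear model misses the curvature entirely, while even the crudest non-linear method does noticeably better.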
Depends on the problem. (Score:2, Interesting)
Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.
You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.
I wonder whether pulling data from outside web sites is even allowed in this competition, though, because that definitely changes the complexion of the problem.
To say Google succeeded by expanding its data pool is an oversimplification: not only did they select what they felt was most important, they ignored what they felt was not. Intelligent selection set the initial conditions for their algorithm. So it isn't just data augmentation; it is the ability to augment data relative to a goal, and that is much deeper than "more data" vs. "algorithm". In fact, there are situations where algorithms are used to make these intelligent selections, in which case the selection process can be as important as, or more important than, the sheer volume of available data alone.
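One minimal way to sketch that kind of intelligent selection is to rank candidate signals by how strongly they correlate with the target and keep only the best. The signal names and numbers below are invented purely for illustration; this is not how Google ranks pages.

```python
# Toy "intelligent selection": rank candidate signals by |correlation|
# with the target and keep the best one. All data here is synthetic
# and the signal names are invented for illustration.

target  = [1.0, 2.0, 3.0, 4.0, 5.0]
signals = {
    "inbound_links": [1.1, 2.0, 2.9, 4.2, 5.0],  # closely tracks the target
    "page_color":    [3.0, 1.0, 4.0, 1.0, 5.0],  # mostly noise
    "word_count":    [5.0, 4.0, 4.0, 2.0, 1.0],  # anti-correlated, still informative
}

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ranked = sorted(signals, key=lambda name: -abs(corr(signals[name], target)))
print("most informative signal:", ranked[0])
```

Adding more columns of "page_color"-style noise would not help at all; picking the right signal is what moves the needle, which is the commenter's point about selection versus sheer volume.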
Re:Depends on the Problem (Score:4, Interesting)
Ironically enough, you'd think they'd adopt the Wikipedia model, where customers can simply vote thumbs up or thumbs down on a small list of recommendations every time they visit the site.
All this convenience comes at a cost, though: you're basically giving people insight into your personality and who you are, and I'm sure many recommendation engines easily double as demographic data for advertisers and other companies.
Re:To a large extent ... (Score:3, Interesting)
It's like n-9s uptime. Each nine in your reliability score costs geometrically more than the last; the same sort of thing holds for the scores measured in ML training.
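The nines analogy can be made concrete: each added nine divides the permitted downtime by ten, so the budget shrinks geometrically even though the score only gains a digit. A quick back-of-the-envelope calculation (using a 365-day year):

```python
# Permitted downtime per year for n nines of availability:
# each extra nine divides the budget by ten.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in range(1, 6):
    downtime = MINUTES_PER_YEAR / 10 ** nines
    print(f"{nines} nines: {downtime:10.2f} minutes of downtime per year")
```

Going from two nines to three buys back about 4,700 minutes a year; going from four to five buys back fewer than 50, for what is typically far more engineering effort. The same diminishing-returns shape shows up in squeezing the last few points out of an ML leaderboard score.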
Re:Heuristics?? (Score:2, Interesting)