Augmenting Data Beats Better Algorithms
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting your data beats improving your algorithm. He had teams in his class enter the Netflix challenge, and two teams took different approaches: one used a better algorithm, while the other harvested augmenting data on movies from the Internet Movie Database. The second team, despite its simpler algorithm, did much better, scoring nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page-ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
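The summary's claim can be sketched with a toy recommender: the same simple algorithm (predict a group mean) improves markedly once it is fed one augmented feature. The ratings, movie ids, and genre tags below are invented for illustration; the genre tag stands in for the harvested IMDb metadata, not for anything in the actual challenge.

```python
# Toy illustration: the same simple algorithm (predict a group mean)
# improves when it is fed augmented data (a genre tag per movie).
# All ratings, ids, and genres below are invented for illustration.
from collections import defaultdict

ratings = [  # (movie_id, rating)
    (0, 5), (0, 4), (1, 5), (1, 5),
    (2, 1), (2, 2), (3, 2), (3, 1),
]
genre = {0: "comedy", 1: "comedy", 2: "horror", 3: "horror"}  # augmented data

# Baseline: no side information, so predict the global mean for everything.
global_mean = sum(r for _, r in ratings) / len(ratings)

# Augmented: still just a mean, but grouped by the harvested genre tag.
by_genre = defaultdict(list)
for movie, rating in ratings:
    by_genre[genre[movie]].append(rating)
genre_mean = {g: sum(rs) / len(rs) for g, rs in by_genre.items()}

def rmse(predict):
    n = len(ratings)
    return (sum((predict(m) - r) ** 2 for m, r in ratings) / n) ** 0.5

base_err = rmse(lambda m: global_mean)
aug_err = rmse(lambda m: genre_mean[genre[m]])
print(f"baseline RMSE: {base_err:.2f}, augmented RMSE: {aug_err:.2f}")
```

The algorithm never changes; only the available data does, which is the point of the story.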
This reminds me (Score:3, Interesting)
Combine that with the speed at which computers are getting more efficient, and I see no reason not to just keep piling up this crap. More is always better. (More efficient might be better, but add the two together and you're unstoppable.)
To a large extent ... (Score:3, Interesting)
Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.
Obviously, there will be a big jump when going from a simplistic method (say, linear regression) to a more sophisticated one (say, SVMs). But going from one type of SVM to a slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.
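That first jump can be illustrated with a toy example on synthetic data, using 1-nearest-neighbor regression as a stand-in for the "more sophisticated method" (a full SVM would be beside the point here): a straight line simply cannot follow a curved relationship that even a crude non-linear method tracks.

```python
# Synthetic, deliberately non-linear data: y = x * x.
train = [(x, x * x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
test  = [(x, x * x) for x in (-1.8, -0.8, 0.3, 1.7)]

# Simplistic method: ordinary least-squares line y = a*x + b.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
b = my - a * mx

# Crude non-linear method: 1-nearest-neighbor regression.
def nearest(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def rmse(predict):
    return (sum((predict(x) - y) ** 2 for x, y in test) / len(test)) ** 0.5

lin_err = rmse(lambda x: a * x + b)
nn_err = rmse(nearest)
print(f"linear RMSE: {lin_err:.2f}, 1-NN RMSE: {nn_err:.2f}")
```

On this symmetric parabola the fitted line is flat (slope zero), so the linear model misses the curvature entirely, while even the crudest non-linear method does noticeably better.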
Depends on the problem. (Score:2, Interesting)
Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.
You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.
I wonder whether pulling data from outside web sites is even allowed in this competition, though, because that definitely changes the complexion of the problem.
To say Google succeeded by expanding its data pool is an oversimplification: not only did they select what they felt was most important, they ignored what they felt was not. Intelligent selection set the initial conditions for their algorithm. So it isn't just data augmentation; it is the ability to augment data relative to a goal, and that is much deeper than "more data" vs. "algorithm". In fact, there are situations where algorithms are used to make these intelligent selections, in which case the selection process can be as important as, or more important than, the sheer volume of available data alone.
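One minimal way to sketch that kind of intelligent selection is to rank candidate signals by how strongly they correlate with the target and keep only the best. The signal names and numbers below are invented purely for illustration; this is not how Google ranks pages.

```python
# Toy "intelligent selection": rank candidate signals by |correlation|
# with the target and keep the best one. All data here is synthetic
# and the signal names are invented for illustration.

target  = [1.0, 2.0, 3.0, 4.0, 5.0]
signals = {
    "inbound_links": [1.1, 2.0, 2.9, 4.2, 5.0],  # closely tracks the target
    "page_color":    [3.0, 1.0, 4.0, 1.0, 5.0],  # mostly noise
    "word_count":    [5.0, 4.0, 4.0, 2.0, 1.0],  # anti-correlated, still informative
}

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ranked = sorted(signals, key=lambda name: -abs(corr(signals[name], target)))
print("most informative signal:", ranked[0])
```

Adding more columns of "page_color"-style noise would not help at all; picking the right signal is what moves the needle, which is the commenter's point about selection versus sheer volume.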
Re:Depends on the Problem (Score:4, Interesting)
Ironically enough, you'd think they'd adopt the Wikipedia model, where customers can simply vote thumbs up or thumbs down on a small list of recommendations every time they visit the site.
All this convenience comes at a cost, though: you're basically giving people insight into your personality and who you are, and I'm sure many recommendation engines easily double as demographic data for advertisers and other companies.
Re:To a large extent ... (Score:3, Interesting)
It's like n-9s uptime. Each nine in your reliability score costs geometrically more than the last; the same sort of thing holds for the scores measured in ML training.
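The nines analogy can be made concrete: each added nine divides the permitted downtime by ten, so the budget shrinks geometrically even though the score only gains a digit. A quick back-of-the-envelope calculation (using a 365-day year):

```python
# Permitted downtime per year for n nines of availability:
# each extra nine divides the budget by ten.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in range(1, 6):
    downtime = MINUTES_PER_YEAR / 10 ** nines
    print(f"{nines} nines: {downtime:10.2f} minutes of downtime per year")
```

Going from two nines to three buys back about 4,700 minutes a year; going from four to five buys back fewer than 50, for what is typically far more engineering effort. The same diminishing-returns shape shows up in squeezing the last few points out of an ML leaderboard score.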
Re:Heuristics?? (Score:2, Interesting)