Augmenting Data Beats Better Algorithms 179
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
Heuristics?? (Score:1, Insightful)
Re:Heuristics?? (Score:5, Informative)
Re: (Score:2)
Re: (Score:2)
More data will produce better results.
Re: (Score:2)
Re: (Score:3, Informative)
Let's not be overly pedantic: in casual speech, a heuristic is a type of algorithm.
Re: (Score:2, Interesting)
Re: (Score:3, Insightful)
Since we are obviously talking about the "goodness" of the results produced by the algorithm, I think it's pretty safe to assume that the broader definition of "algorithm" is being used.
Re: (Score:3, Insightful)
Not always. Approximation algorithms are often ranked on their accuracy. Online algorithms are often ranked on something called the competitive ratio. Randomized algorithms are usually ranked on their resource use. But none of these three needs to be optimal (in the context of an optimization problem) -- or to produce correct results (in the context of a decision problem).
By definition, algorithms for the same problem must produce the same correct results.
[citation needed]
Re: (Score:2, Informative)
"In casual speech"? That's just wrong... a heuristic is a type of algorithm, period. (Assuming it meets the other requirements of being an algorithm, such as termination.) That it doesn't produce an optimal result doesn't enter into it. [In this post I say "doesn't produce" as a shorthand for "isn't guaranteed to produce".]
CS theorists talk about randomized algorithms [wikipedia.org]. They don't produce an optimal result. CS theorists talk ab
Re: (Score:2)
If you are feeling overly pedantic (like the OP) it can be; i.e., an algorithm must provide a solution to a problem, and an approximation is not the same as a solution (in the CS sense).
But the whole thing is just the kind of nitpicking that only someone who is really proud of having taken Intro to Complexity and Computability Theory recently would engage in.
Re: (Score:2)
If you are feeling overly pedantic (like the OP) it can be; i.e., an algorithm must provide a solution to a problem, and an approximation is not the same as a solution (in the CS sense).
You have to be careful in defining "the problem". Deterministic computers only execute algorithms. However the problem that the algorithm solves may not be the actual problem you really care about. When those two problem definitions differ, but an algorithm for the former is an approximation or useful substitute for an algorithm solving the true problem, what you've got is a heuristic.
Say I want to choose the best 10 students out of 100 for competing in a math competition. Clearly there's no algorithm
Re: (Score:2)
And to make things interesting, many words have several meanings, especially scientific terms that also have far more common, and less precise, general meanings. It's not that hard a concept to grasp, really.
Re: (Score:2)
Welcome to the English language, where words mean what they are defined by the user to mean. This may or may not reveal the speaker to be an asshat, but trying to say that it's an incorrect usage is a sure way to reveal yourself as one.
Depends on the Problem (Score:5, Insightful)
I worked for a while on the Netflix prize, and if there's one thing I learned, it's that a recommender system almost always gets better the more data you put into it. So I'm not sure this one case study is enough to generalize the idea to all algorithms.
Though, in a way, this is sort of a "duh" result: data mining relies on lots of good data, and the more there is, the better a fit you can generally make with your algorithm.
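To make the "duh" concrete, here's a toy sketch (all numbers invented, nothing to do with the actual Netflix data): even the dumbest possible predictor, "guess the movie's observed mean rating", gets steadily better as you feed it more ratings per movie.

```python
import random

random.seed(0)

# Toy setup: each movie has a "true" average rating; observed ratings are noisy.
true_means = {m: random.uniform(1, 5) for m in range(50)}

def observe(n_per_movie):
    """Simulate n noisy 1-5 ratings per movie."""
    return {m: [min(5, max(1, mu + random.gauss(0, 1))) for _ in range(n_per_movie)]
            for m, mu in true_means.items()}

def rmse_of_mean_predictor(n_per_movie):
    """Predict each movie's rating by its observed mean; measure error vs truth."""
    obs = observe(n_per_movie)
    errs = [(sum(r) / len(r) - true_means[m]) ** 2 for m, r in obs.items()]
    return (sum(errs) / len(errs)) ** 0.5

# Same trivial "algorithm"; more data per movie means lower error.
for n in (2, 20, 200):
    print(n, round(rmse_of_mean_predictor(n), 3))
```

Same code, same algorithm, three very different error numbers; only the amount of data changed.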
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Insightful)
Re:Depends on the Problem (Score:5, Insightful)
Say we were looking at 100 million fields, suddenly we have 50% of the possible data, and our unknown field is the same size as the known field. Much more likely to get a result then.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
but of course you're only asked to predict a subset of ~3,000,000. (still a lot for the given data, but hey, it's $1,000,000
Re: (Score:3, Insightful)
Re:Depends on the Problem (Score:4, Interesting)
Ironically enough, you'd think they'd adopt the Wikipedia model, where customers can simply vote thumbs up vs. thumbs down on a small list of recommendations every time they visit the site.
All this convenience comes at a cost, though: you're basically giving people insight into your personality and who you are, and I'm sure many "recommendation engines" easily double as demographic data for advertisers and other companies.
Re:Depends on the Problem (Score:4, Insightful)
Anyway, if you're paranoid about data on you being used, there's a less well-known branch of recommender systems that uses implicit data gathering, which can be easily set up on any site. For example, it might infer that because you clicked on product X many times today, you're probably in the market for it, and use that. Of course, implicit data gathering is noisier than explicit data gathering, but it goes to show that if you spend time on the internet, websites can always use your data for their own ends.
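A minimal sketch of what that implicit gathering could look like; the click log, users, and products here are all made up:

```python
from collections import Counter

# Hypothetical click log gathered implicitly: (user, product) pairs.
clicks = [("alice", "camera"), ("alice", "camera"), ("alice", "tripod"),
          ("bob", "camera"), ("alice", "camera")]

def implicit_interest(user, log):
    """Rank products by how often the user clicked them: a crude stand-in
    for explicit ratings, and noisier (a click is not an endorsement)."""
    counts = Counter(p for u, p in log if u == user)
    return counts.most_common()

print(implicit_interest("alice", clicks))
```

Real systems weight clicks by dwell time, recency, and so on, but the principle is the same: behavior becomes data whether or not you fill in a rating form.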
Re: (Score:3, Insightful)
The Netflix data shouldn't be regarded as representative of anything. That data set has shockingly low dimensionality. So far as I know, they make no attempt to differentiate what kind of enjoyment the viewer obtained from the movie, or even to determine whether the movie was viewed
I think better is subjective... (Score:3, Insightful)
Here, "better" means different things to different people. More data gives you a larger set of people, and probably a more accurate definition of "better" for a larger set of people. I'm not sure you can really compare the two.
Re: (Score:2)
Um, Yes? (Score:5, Insightful)
I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.
Re:Um, Yes? (Score:2)
For the Sake of Discussion (Score:4, Insightful)
You could also make an elaborate algorithm that uses user age, sex & location
Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!
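For flavor, here's what that token analysis could look like in miniature: a tiny multinomial naive Bayes over review words. The training snippets are invented, not real IMDB text:

```python
import math
from collections import Counter

# Hypothetical labelled review snippets (invented, not real IMDB data).
train = [("thrilling superb acting", "good"),
         ("dull plot terrible pacing", "bad"),
         ("superb visuals thrilling ride", "good"),
         ("terrible dull mess", "bad")]

def fit(data):
    """Per-class token counts for a multinomial naive Bayes."""
    counts = {c: Counter() for _, c in data}
    for text, c in data:
        counts[c].update(text.split())
    return counts

def predict(text, counts):
    """Pick the class maximizing the sum of log P(token|class), Laplace-smoothed."""
    vocab = {t for ctr in counts.values() for t in ctr}
    def score(c):
        total = sum(counts[c].values())
        return sum(math.log((counts[c][t] + 1) / (total + len(vocab)))
                   for t in text.split())
    return max(counts, key=score)

model = fit(train)
print(predict("superb thrilling", model))
```

Scale the training set up to thousands of real reviews per movie and you start to see why the augmented-data team had so much to work with.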
Re: (Score:2)
You could also make an elaborate algorithm that uses user age, sex & location
That's just more data, IMHO, and nothing to do with the algorithm; you'd just be running the learner over more fields. What is a "better" algorithm? In formal terms, the "better" algorithm will classify with higher accuracy during performance (the phase after the learner has "learnt") than another one using the same data, and in a consistent manner (i.e. not just on some particular sample).
I am only vaguely familiar with the Netflix prize, but I think you are asking a rather open-ended question here. Relevant
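For what it's worth, "higher accuracy on the same data" is easy to make concrete. Both prediction lists here are hypothetical, just to show the comparison:

```python
# Toy illustration: "better algorithm" = higher accuracy on the same data.
labels  = [1, 0, 1, 1, 0, 1, 0, 0]
preds_a = [1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical learner A's output
preds_b = [1, 0, 1, 1, 0, 1, 0, 1]   # hypothetical learner B's output

def accuracy(preds, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# B classifies with higher accuracy than A on identical data.
print(accuracy(preds_a, labels), accuracy(preds_b, labels))
```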
Re: (Score:2)
Simply p
Re:Um, Yes? (Score:3, Funny)
This reminds me (Score:3, Interesting)
Combine that with the speed at which computers are getting more efficient, and I see no reason not to just keep piling up this crap. More is always better. (More efficient might be better, but add the two together and you're unstoppable.)
Is it just me that is surprised here? (Score:2, Insightful)
I read the article in question here and can say that I'm surprised that this is even a question.
Re:Is it just me that is surprised here? (Score:5, Informative)
Re: (Score:3, Informative)
In linear
Re: (Score:2)
Re: (Score:2)
What do you think? Will more data usually perform better than a better algorithm?"
Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance on this type of problem. The Netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to draw from. This is obvious, no?
I read the article in question here and can say that I'm surprised that this is even a question.
Good point. There doesn't appear to be any mention of the improvement of supplemented data AND an improved algorithm.
To a large extent ... (Score:3, Interesting)
Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.
Obviously, there will be a big jump when going from a simplistic method (say linear regression) to a more sophisticated method (say SVM's). But going from one type of SVM to another slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.
Re: (Score:3, Interesting)
I see many people publish papers on a new method that does 1% better than preexisting methods.
If that 1% is from 95% to 96% accuracy, it's actually a 20% improvement in error rates! I know this sounds like an example from "How to Lie With Statistics," but it is the correct way to look at this sort of problem.
It's like n-9s uptime. Each nine in your reliability score costs geometrically more than the last; the same sort of thing holds for the scores measured in ML training.
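The parent's arithmetic, spelled out as a one-liner (using the same 95% to 96% example):

```python
def error_rate_reduction(acc_old, acc_new):
    """Relative shrinkage of the error rate when accuracy improves."""
    return ((1 - acc_old) - (1 - acc_new)) / (1 - acc_old)

# Going from 95% to 96% accuracy cuts the error rate by 20%:
# errors drop from 5 per 100 to 4 per 100.
print(error_rate_reduction(0.95, 0.96))
```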
Assuming the algorithm isn't evil (Score:1)
More vs Better (Score:4, Insightful)
Re: (Score:2)
I agree here, though when humans are involved I think it can be difficult to get accurate data and the skill is in asking for information which has the least subjectiveness.
To give an example closer to the topic, I watched Beowulf last night. After watching the film I was left with a feeling that he wasn't the hero I assumed he was (never having known the real story, beyond a vague notion that he was some sort of kick-ass Old English hero).
I spent a while doing some research and discovered that the film is
All things being equal... (Score:4, Insightful)
And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.
Re: (Score:2)
Golf clap for almost (but not quite) getting the point of that.
Obvious? (Score:2)
Is it just me, or is it pretty obvious that this all just depends on the algorithm and the data?
Like I could "augment" the data with worthless or misleading data, and get the same or worse results. If I have a huge set of really good and useful data, I can get better results without making my algorithm more advanced. And no matter how advanced my algorithm is, it won't return good results if it doesn't have sufficient data.
When a challenge is put out to improve these algorithms, it's really because the
Hold on a sec... (Score:5, Funny)
I need more data.
Re: (Score:2)
This is the classic XOR thinking that permeates our society. "One or the other, not both" is rarely the correct option; that's for boolean operations, which this clearly is not. This is clearly an AND function: more data AND a better algorithm is the most correct answer. "Which helps more?" is a silly question, except for deciding how resources should be split between improving both, and how much easier one is than the other.
Re: (Score:2)
Re: (Score:2)
I was optimizing for humor.
Five stars (Score:5, Insightful)
How to translate the entire experience of watching a movie into a lone number is a separate issue.
Re: (Score:2)
I don't think this is a problem of it being a 1-5 scale instead of 1-10. It's not like there's really that much information gained by scoring a movie a 7 instead of an 8, since it's all subjective anyway; on any given day those scores could have been reversed.
I think it's more the extremely common situation where people don't want to give an "average" score, so you get score inflation such that only th
Re: (Score:2)
But that'
Anchoring effect! (Score:2)
Re: (Score:2)
-Rick
Will more data usually perform better than a bette (Score:2)
apparently... (Score:2)
To augment or algorithm is the question? (Score:2)
So, the upshot is to look at both approaches and take the best course of action for your needs.
Isn't an algorithm just data? (Score:2)
Um, yes? (Score:2)
Ultimately, everything is data... a Turing machine doesn't know "algorithms"; it's a state machine, and it's all data to it. So, yeah, although we like to pretend that code and data are separate things, the very bedrock of theory that computer science sits on says that it's all the same, and ultimately, when we choose a data-driven architecture or an algorithmically heavy one, we're really just choosing wher
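One small way to see the code-is-data point: the same behavior written as a lookup table of functions, where the "program" is literally a data structure you could load from a file. A toy sketch, everything invented:

```python
# "Algorithm as data": behavior expressed as a table of functions
# rather than hard-coded branches.
ops = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def run(program, x):
    """Interpret a 'program', which is just a list of op names, over a value."""
    for name in program:
        x = ops[name](x)
    return x

print(run(["double", "square"], 3))  # (3*2)^2 = 36
```

Swap the list and you've changed the algorithm without touching a line of "code"; that's the data-driven end of the spectrum the parent is describing.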
This is assuming... (Score:3, Insightful)
Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.
The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.
For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.
A bit like swap vs. real memory (Score:2, Informative)
So when I think of this recommendation system, a better algorithm is like having swap space enabled. It
Recommendations Systems and subjectivity (Score:4, Insightful)
The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.
First, the data is bogus. The preferences are "aggregates" of rental behavior; whole families are represented by single accounts. Little 16-year-old Tod likes different movies than his 40-year-old dad, not to mention his toddler sibling and his mother. A single account may have both Winnie the Pooh and Kill Bill. Obviously, you can't conclude that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed here; it could be that parents of young children want the thrill of vicarious killing, but I digress.)
The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.
Recommendation systems tend to like a lot of data, but not the kind you might think. People will say: if you need more data, why have only 1-5 and not 1-10? Well, that really isn't much added data; it's just greater granularity of the same data. Think of it like "color depth" vs. "resolution" on a video monitor.
My last point about recommendations is that people have moods and are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine, and the numbers are higher. The next day he could have had a crappy day and had to deal with it sober, and the numbers are different.
You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.
That being said, the data is fun to work with!!
more data in which dimension? (Score:2)
The fundamental idea is to be able to identify clusters of movies, or users (who like a certain type of movie), and the idea of clusters is built on some form of distance. When you add a new dimension to your feature vector, you get a chance to identify groups of entities better, using that dimension. You may d
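A toy sketch of the distance idea, with invented rating vectors (cosine similarity is just one common choice of "distance"):

```python
import math

# Hypothetical user rating vectors; each dimension is one movie.
ratings = {
    "u1": [5, 4, 0, 1],
    "u2": [4, 5, 1, 0],
    "u3": [0, 1, 5, 4],
}

def cosine(a, b):
    """Cosine similarity between two vectors: high means 'same cluster'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# u1 and u2 like the same movies; u3 belongs in a different cluster.
print(cosine(ratings["u1"], ratings["u2"]), cosine(ratings["u1"], ratings["u3"]))
```

Adding a new feature (say, an IMDB genre) just adds a dimension to these vectors, which can pull genuinely similar users closer together.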
One Trivial Result, One Big Assumption (Score:4, Insightful)
The second problem with the claim is the assumption that additional information is always actually available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business, additional data is often simply not available (physically or for cost reasons); consequently, improving your algorithms is all you can do. Similarly, in academia (say, a computer science department), the assumption is often that you are trying to improve your algorithms given all the data that is available.
Depends on the problem. (Score:2, Interesting)
Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.
You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.
I wonder if accessing foreign web sites is legal in this competition though, be
Augmented? (Score:2)
This is a rule of algorithms (Score:2)
against the terms of the prize (Score:2)
http://www.netflix.com/TermsOfUse [netflix.com]
see also:
http://www.netflixprize.com/community/viewtopic.php?id=98 [netflixprize.com]
http://www.netflixprize.com/community/viewtopic.php?id=20 [netflixprize.com]
http://www.netflixprize.com/community/viewtopic.php?id=14 [netflixprize.com]
Note that this makes sense: more/better data would help ANY decent algorithm. They want a better one, and they're judging you against a baseline, so they'd naturally limit your input options.
Algorithms help too (Score:2)
This article is a completely false dichotomy.
This does not mean what I think you think it means (Score:4, Informative)
The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.
The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.
But that does not mean that the algorithm does not matter: for the problems I work on, using logistic regression or support vector machines outperforms naive Bayes by 10%-30%, which is huge. So if you want good performance, you try a few different algorithms and see what works.
Adding more training data does not always help either, if the distributions of the data are significantly different. You are much better off using the data to design better features which represent/summarize the data.
In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.
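A sketch of "try a few different algorithms and see what works", using two deliberately dumb classifiers on invented data (nothing like a real logistic regression vs. naive Bayes comparison, but the same selection loop):

```python
# Invented 2-D points with binary labels, for illustration only.
train = [((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 1.0), 1),
         ((0.0, 0.1), 0), ((0.1, 0.0), 0)]
test  = [((1.1, 0.9), 1), ((0.0, 0.0), 0), ((0.2, 0.1), 0)]

def majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

def one_nn(train):
    """1-nearest-neighbour: predict the label of the closest training point."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return lambda x: min(train, key=lambda t: dist2(t[0], x))[1]

def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)

# Same data, two algorithms, very different accuracy.
for fit in (majority, one_nn):
    print(fit.__name__, accuracy(fit(train), test))
```

The point of the loop is the parent's point: you don't argue about which algorithm is better in the abstract, you measure it on held-out data.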
Does nobody know Shannon anymore? (Score:2, Insightful)
Hello? duh!! more data yields better algorithms (Score:2)
So using an old saw --which comes first, the chicken or the egg? Or is there a superior que
Told You So (Score:2)
Re:attn computer scientists: stop renaming stuff (Score:5, Funny)
Re: (Score:2)
Re:attn computer scientists: stop renaming stuff (Score:5, Funny)
Re:attn computer scientists: stop renaming stuff (Score:5, Funny)
Re:attn computer scientists: stop renaming stuff (Score:5, Funny)
Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics without tenure.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Insightful)
I think we all need each other, folks
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Informative)
I
Re:attn computer scientists: stop renaming stuff (Score:4, Insightful)
And nonlinear dimensionality reduction is just nonconvex trace optimization coupled with kernel principal component analysis (fine, call it "singular value decomposition") using Mercer's theorem to map the resulting dot product through a kernel function (usually represented as a Hermitian positive semidefinite Gram matrix), yielding an inner product space of higher (possibly infinite) dimensionality in which the original problem is linearly separable.
Now take this description and write an algorithm that performs it efficiently. And you use PageRank as an example, so let's call "efficient" "performs as well as Google on the entire web's worth of data".
If you can't do this, perhaps you should reconsider your view of computer scientists. There's no reason whatsoever to play up the boundaries between two very related fields. Arbitrary boundaries in knowledge are already bad enough; they need to be knocked down, not reinforced.
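For the curious, the kernel part of that mouthful is small in code. A minimal sketch: an RBF Gram matrix over three invented points, whose entries are inner products in an implicit higher-dimensional feature space:

```python
import math

# Three invented 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

def rbf(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2), a Mercer kernel."""
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * d2)

# The Gram matrix: symmetric, positive semidefinite, ones on the diagonal.
gram = [[rbf(a, b) for b in points] for a in points]
for row in gram:
    print([round(v, 3) for v in row])
```

Doing that efficiently at web scale is, of course, exactly the part that's hard, which is the parent's point.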
Re: (Score:2)
Why all the hate, people? Different disciplines have different terminology. Sure, there are probably some mathematical generalizations for common computer science problems. And there are probably some CS generalizati
Re: (Score:2)
Re: (Score:2)
What about lambda calculus ? (Score:2)
Re:attn computer scientists: stop renaming stuff (Score:5, Funny)
Riiight. And mathematical research is just finding a Hamiltonian cycle in a graph defined by the set of axioms used.
Re: (Score:3, Funny)
Adapted from a joke I saw on Jester the other day:
A physicist, a computer scientist and a mathematician are sharing a hotel room. It must have bad wiring or something.
Late at night when they're all asleep a small fire starts in the room. The smell of smoke wakes the physicist. He gets up, notices the fire and looking
Re: (Score:2)
Re: (Score:2)
*sigh* (Score:2)
Re: (Score:2)
*Woosh*