Augmenting Data Beats Better Algorithms 179

Posted by kdawson
from the tell-it-to-the-dhs dept.
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
This discussion has been archived. No new comments can be posted.

  • Heuristics?? (Score:1, Insightful)

    by nategoose (1004564)
    Aren't these heuristics and not algorithms?
    • Re:Heuristics?? (Score:5, Informative)

      by EvanED (569694) on Tuesday April 01, 2008 @03:33PM (#22933586)
      One would hope that the thing that calculates the heuristic is an algorithm. See Wikipedia.
      • by tuomoks (246421)
        Correct! I have to say I'm not amazed. We did the same kind of ratings a long time ago, guess where - right, in insurance. Part of the risk management. Is a red headed under 30 less risk than a blond at the same age? Better - what and how costly will be their next accident. Trying to predict human behavior, the cause and the results has been there a long time. The same was done for example ships world wide we insured but there wasn't just the information of the shipping company, we did background checking a
        • When I first read this article I was confused about what they meant. But after I thought about it for a while, it seemed self-evident. Using a consumer example: a 1920x1080 HD-DVD is going to look better than a 720x480 SD-DVD with a 1920x1080 upscaling algorithm applied to it. That's fairly self-evident.

          More data will produce better results.

          • Exactly. I want to tag this as 'duh'. The challenge was to mine the original data set though, so I'm not sure if this would even be a valid entry?

            Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated.
            [emphasis mine]
    • Re: (Score:3, Informative)

      by glwtta (532858)
      Aren't these heuristics and not algorithms?

      Let's not be overly pedantic: a heuristic is a type of algorithm, in casual speech.
      • Re: (Score:2, Interesting)

        by nategoose (1004564)
        In this particular case I think that the distinction is important. Saying that something is a better algorithm doesn't imply that it gives a better result(s) as all correct results are semantically the same. Algorithms are ranked on their resource usage. Heuristics are ranked on the perceived goodness of their results. Algorithms must have the same correct results by definition.
        • Re: (Score:3, Insightful)

          by glwtta (532858)
          Algorithms must have the same correct results by definition.

          Since we are obviously talking about the "goodness" of the results produced by the algorithm, I think it's pretty safe to assume that the broader definition of "algorithm" is being used.
        • Re: (Score:3, Insightful)

          by EvanED (569694)
          Algorithms are ranked on their resource usage.
          Not always. Approximation algorithms are often ranked on their accuracy. Online algorithms are often ranked on something called the competitive ratio. Randomized algorithms are usually ranked on their resource uses, but all three of these needn't be optimal (in the context of an optimization problem) -- or produce correct results (in the context of a decision problem).

          Algorithms must have the same correct results by definition.
          [citation needed]
      • Re: (Score:2, Informative)

        by EvanED (569694)
        Let's not be overly pedantic: a heuristic is a type of algorithm, in casual speech.

        "In casual speech"? That's just wrong... a heuristic is a type of algorithm, period. (Assuming it meets the other requirements of being an algorithm, such as termination.) That it doesn't produce an optimal result doesn't enter into it. [In this post I say "doesn't produce" as a shorthand for "isn't guaranteed to produce".]

        CS theorists talk about randomized algorithms. They don't produce an optimal result. CS theorists talk ab
        • by glwtta (532858)
          Producing an optimal result isn't a requirement of being an algorithm.

          If you are feeling overly pedantic (like the OP) it can be; ie an algorithm must provide a solution to a problem, and an approximation is not the same as a solution (in the CS sense).

          But the whole thing is just the kind of nitpicking that only someone who is really proud of having taken Intro to Complexity and Computability Theory recently would engage in.
          • by SnowZero (92219)

            If you are feeling overly pedantic (like the OP) it can be; ie an algorithm must provide a solution to a problem, and an approximation is not the same as a solution (in the CS sense).

            You have to be careful in defining "the problem". Deterministic computers only execute algorithms. However the problem that the algorithm solves may not be the actual problem you really care about. When those two problem definitions differ, but an algorithm for the former is an approximation or useful substitute for an algorithm solving the true problem, what you've got is a heuristic.

            Say I want to choose the best 10 students out of 100 for competing in a math competition. Clearly there's no algorithm

  • by roadkill_cr (1155149) on Tuesday April 01, 2008 @03:14PM (#22933376)
    I think it heavily depends on what kind of data you're mining.

    I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.

    Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.
    • Exactly. An algorithm can't see what isn't there, so the more data you have, the better your result will be. You can of course improve upon the algorithm, but the quality/quantity of data is always going to be more important.
      • by ubrgeek (679399)
        Isn't that similar to the posting about Berkeley's joke recommender from the other day? Rate jokes and it then suggests ones you should like. I tried it, and I don't know if the pool from which the jokes are pulled is shallow, but the ones it returned after I finished "calibrating" it were terrible and not along the lines of what I would have expected the system to think I'd find funny.
      • Re: (Score:3, Insightful)

        by Brian Gordon (987471)
        It's not always going to be more important. There's really no difference between a sample of 10 million and a sample of 100 million.. at that point it's obviously more effective to put work into improving the algorithm.. but that turning point (again obviously) would come way before 10 million samples of data. It's a balance.
        • by teh moges (875080) on Tuesday April 01, 2008 @06:01PM (#22935336) Homepage
          Think less in sheer numbers and more in density. If there are 200 million possible 'combinations' (say, 50,000 customers and 4000 movies in a Netflix-like situation), then with 10 million data samples, we only have 5% of the possible data. This means that if we are predicting inside the data scope, we are predicting into an unknown field that is 19 times larger than the known.
          Say we were looking at 100 million fields, suddenly we have 50% of the possible data, and our unknown field is the same size as the known field. Much more likely to get a result then.
          • by leenks (906881)
            Or we over-fit to the training data and end up performing badly in the real world when trends change (eg new style of film production appears)
          • 5% is a 20th of 100, not a 19th :)
            • by teh moges (875080)
              Take 5% - the 'known', leaves 95% - the 'unknown'. The unknown is 19 times larger than the known.
          • by CastrTroy (595695)
            I don't think it's about density or sheer numbers. It's about how much other data you have about the data you're looking at. In this case, they augmented the netflix data with data from IMDB. By having more data about the movies that were being rated, such as actors, producers, directors, year of filming, and other information, they were better able to recommend movies. If the only thing you know about a movie is who voted for it, but not why, then you are going to have a hard time recommending movies t
          • by deander2 (26173) *
            but the netflix prize has about 500,000 users, 20,000 movies and 100,000,000 ratings. that's 10,000,000,000 possible ratings, making the given about 1%.

            but of course you're only asked to predict a subset of ~3,000,000. (still a lot for the given data, but hey, it's $1,000,000 ;)
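
The density arithmetic in this subthread can be checked with a short sketch. The figures are the hypothetical ones from the comment above (50,000 customers, 4,000 movies, 10 or 100 million samples), and the function name `density_report` is made up for illustration.

```python
def density_report(n_users, n_items, n_ratings):
    """Return (known fraction, unknown-to-known ratio) for a ratings matrix."""
    possible = n_users * n_items                 # every user/item pair
    known = n_ratings / possible                 # fraction of cells we actually have
    ratio = (possible - n_ratings) / n_ratings   # unknown cells per known cell
    return known, ratio


for n in (10_000_000, 100_000_000):
    known, ratio = density_report(50_000, 4_000, n)
    print(f"{n:>11,} ratings: {known:.0%} known, unknown is {ratio:.0f}x the known")
```

With 10 million samples the unknown part is 19 times the known part (95% against 5%), which is the point of the "19 versus 20" exchange above.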
    • Re: (Score:3, Insightful)

      by RedHelix (882676)
      Well, yeah, augmenting data can produce more reliable results than better algorithms. If a legion of film buffs went through every single film record on Netflix's database and assigned "recommendable" films to it, then went and looked up the rental history of every Netflix user and assigned them individual recommendations, you would probably end up with a recommendation system that beats any algorithm. The dataset here would be ENORMOUS. But the reason algorithms exist is so that doesn't have to happen. i
    • by blahplusplus (757119) on Tuesday April 01, 2008 @04:23PM (#22934214)
      "I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, ...."

      Ironically enough, you'd think they'd adopt the wikipedia model where their customers can simply vote thumbs up vs thumbs down on a small list of recommendations every time they visit their site.

      All this convenience comes at a cost though, you're basically giving people insight into your personality and who you are and I'm sure many "Recommendation engines" easily double as demographic data for advertisers and other companies.
      • by roadkill_cr (1155149) on Tuesday April 01, 2008 @04:30PM (#22934300)
        It's true that you lose some anonymity, but there is so much to gain. To be perfectly honest, I'm completely fine with rating products on and Netflix - I only go to these sites to shop for products and movies, so why not take full advantage of their recommendation system? If I am in consumer mode, I want the salesman to be as competent as possible.

        Anyways, if you're paranoid about data on you being used - there's a less well-known field of recommender systems which uses implicit data gathering, which can be easily set up on any site. For example, it might say that because you clicked on product X many times today, you're probably in want of it and they can use that data. Of course, implicit data gathering is more faulty than explicit data gathering, but it just goes to show that if you spend time on the internet, websites can always use your data for their own means.
    • Re: (Score:3, Insightful)

      by epine (68316)
      It seems to be a bad day for science writing. The piece on rowing a galley was a joke. And now we're being told that one data mining problem with a dominant low-hanging return on augmenting data represents a general principle.

      The Netflix data shouldn't be regarded as representative of anything. That data set has shockingly low dimensionality. So far as I know, they make no attempt to differentiate what kind of enjoyment the viewer obtained from the movie, or even determine whether the movie was viewed
  • by 3p1ph4ny (835701) on Tuesday April 01, 2008 @03:14PM (#22933378) Homepage
    In problems like minimizing lateness et al., "better" can be simply defined as "closer to optimal" or "fewer time units late."

    Here, better means different things to different people. The more data you have gives you a larger set of people, and probably a more accurate definition of better for a larger set of people. I'm not sure you can really compare the two.
    • In this case, better is well defined. They're looking for a system that can take a certain data set and use it to predict another data set. Ultimately, the quality of picks is determined by the user. For this contest, they've got data sets that they can use to determine which is the best method.
  • Um, Yes? (Score:5, Insightful)

    by randyest (589159) on Tuesday April 01, 2008 @03:15PM (#22933390) Homepage
    Of course. Why wouldn't more (or better) relevant data that applies on a case-by-case basis improve results more than an "improved algorithm" (what does that mean, really?) that is applied generally and globally?

    I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.
    • It's a simple application of the Rao-Blackwell theorem at work. Making use of useful information (in this case, movie genre) makes the estimate more precise.
    • by eldavojohn (898314) * on Tuesday April 01, 2008 @03:29PM (#22933544) Journal
      Well, for the sake of discussion I will try to give you an example so that you might pick it apart.

      "more data"
      More data means that you understand directors and actors/actresses often do a lot of the same work. So for every movie that the user likes, you weight the stars they gave it with a name. Then you cross reference movies containing those people using a database (like IMDB). So if your user loved The Sting and Fight Club, they will also love Spy Game, which had both Redford & Pitt starring in it.

      "better algorithm"
      If you naively look at the data sets, you can imagine that each user represents a taste set and that high correlations between two movies in several users indicates that a user who has not seen the second movie will most likely enjoy it. So if 1,056 users who saw 12 Monkeys loved Donnie Darko but your user has only seen Donnie Darko, highly recommend them 12 Monkeys.

      You could also make an elaborate algorithm that uses user age, sex & location ... or even a novel 'distance' algorithm that determines how far away they are from liking 12 Monkeys based on their highly ranked other movies.

      Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!
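
A rough sketch of the two approaches described in this comment, side by side. The data is a toy set invented for illustration, and the names `recommend_by_cooccurrence` and `recommend_by_cast` are hypothetical; this is neither the class teams' code nor anything Netflix uses.

```python
from collections import Counter

# Toy data, invented for illustration only.
likes = {  # user -> movies they rated highly
    "u1": {"Donnie Darko", "12 Monkeys"},
    "u2": {"Donnie Darko", "12 Monkeys", "Fight Club"},
    "u3": {"Donnie Darko", "The Sting"},
}
cast = {  # movie -> people, as an IMDB-like lookup might return
    "The Sting": {"Robert Redford", "Paul Newman"},
    "Fight Club": {"Brad Pitt", "Edward Norton"},
    "Spy Game": {"Robert Redford", "Brad Pitt"},
    "12 Monkeys": {"Bruce Willis", "Brad Pitt"},
}


def recommend_by_cooccurrence(seen, likes):
    """'Better algorithm': rank unseen movies by how many users liked them
    together with at least one movie in `seen` (ratings data only)."""
    counts = Counter()
    for movies in likes.values():
        if movies & seen:
            counts.update(movies - seen)
    return [movie for movie, _ in counts.most_common()]


def recommend_by_cast(seen, cast):
    """'More data': rank unseen movies by cast members shared with `seen`."""
    liked_people = set().union(*(cast.get(m, set()) for m in seen))
    scores = {m: len(p & liked_people) for m, p in cast.items() if m not in seen}
    return sorted((m for m in scores if scores[m] > 0), key=lambda m: -scores[m])


print(recommend_by_cooccurrence({"Donnie Darko"}, likes))
print(recommend_by_cast({"The Sting", "Fight Club"}, cast))
```

The second function can only exist because of the extra `cast` table; no amount of cleverness in the first one recovers that information from ratings alone.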
      • by Plutonite (999141)

        You could also make an elaborate algorithm that uses user age, sex & location

        That's just more data, IMHO, and nothing to do with the algorithm - you'd just be running the learner over more fields. What is a "better" algorithm? In formal terms, the "better" algorithm will classify with a higher accuracy during performance (the phase after the learner has "learnt") than another one using the same data and in a consistent manner (i.e not for some particular sample).

        I am only vaguely familiar with the netflix prize but I think you are asking a rather open-ended question here. Relevant

        • by Plutonite (999141)
          I'd also like to say that the analogy with pagerank is a little off-base. I realize this is a Stanford professor, but trust me neither the machine learning people nor the information retrieval folks know what the other side is talking about at a deep level, mostly because IR is a hack-ish, "sciencified" topic (I'm quoting a very well-known man in the field) while statistical inference is a little more formal. They each have completely different goals and challenges, though they do overlap in places.

          Simply p
    • Re:Um, Yes? (Score:3, Funny)

      by canajin56 (660655)

      I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.
      So what you are saying is, to answer the question, we need more data?
  • This reminds me (Score:3, Interesting)

    by FredFredrickson (1177871) * on Tuesday April 01, 2008 @03:16PM (#22933396) Homepage Journal
    This reminds me of those articles that say the amount of data humanity has archived is so large that nobody could possibly use it in a lifetime. I think what people fail to remember is this: the point is to have available data just in case you need to reference it in the future. Nobody watches security tapes in full. They review the day or hour that the robbery occurred. Does that mean we should stop recording everything? No. Let's keep archiving.

    Combine that with the speed at which computers are getting more efficient - and I see no reason not to keep piling up this crap. More is always better. (More efficient might be better - but add the two together, and you're unstoppable.)
  • "What do you think? Will more data usually perform better than a better algorithm?"
    Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

    I read the article in question here and can say that I'm surprised that this is even a question.
    • by gnick (1211984) on Tuesday April 01, 2008 @03:27PM (#22933516) Homepage

      The netflix challenge is to arrive at a better algorithm with the supplied data.
      Actually, the rules explicitly allow supplementing the data set and Netflix points out that they explore external data sets as well.
      • Re: (Score:3, Informative)

        by cavemanf16 (303184)
        I tend to agree that augmenting data helps improve the model if the model is not yet overwhelmed with data, but you have to have a decent model to begin with or it won't work. Additionally, the payoff of additional data added to the model is a diminishing return as the amount of data available begins to overwhelm any given model. In other words, the more data you collect and put into your model, the more expensive, time consuming, and difficult it becomes to continue to rely on the original model.

        In linear
    • "What do you think? Will more data usually perform better than a better algorithm?"

      Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

      I read the article in question here and can say that I'm surprised that this is even a question.

      Good point. There doesn't appear to be any mention of the improvement of supplemented data AND an improved algorithm.

  • by haluness (219661) on Tuesday April 01, 2008 @03:17PM (#22933420)
    I can see that more data (especially more varied data) could be better than a tweaked algorithm. Especially in machine learning, I see many people publish papers on a new method that does 1% better than preexisting methods.

    Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.

    Obviously, there will be a big jump when going from a simplistic method (say linear regression) to a more sophisticated method (say SVM's). But going from one type of SVM to another slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.
    • Re: (Score:3, Interesting)

      by shorti9 (307602)

      I see many people publish papers on a new method that does 1% better than preexisting methods.

      If that 1% is from 95% to 96% accuracy, it's actually a 20% improvement in error rates! I know this sounds like an example from "How to Lie With Statistics," but it is the correct way to look at this sort of problem.

      It's like n-9s uptime. Each nine in your reliability score costs geometrically more than the last; the same sort of thing holds for the scores measured in ML training.
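
The "1% is really 20%" arithmetic above can be written out as a tiny sketch. The function name `relative_error_reduction` is made up for illustration.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction by which the error rate shrinks when accuracy rises."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# 95% -> 96% accuracy: error drops from 5% to 4%.
print(f"{relative_error_reduction(0.95, 0.96):.0%}")
# Adding a "nine" (99% -> 99.9%): error drops from 1% to 0.1%.
print(f"{relative_error_reduction(0.99, 0.999):.0%}")
```

Measured this way, each extra nine removes 90% of the remaining errors, which is one way to see why it costs so much more than the last.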

  • A piece of pertinent data is worth a thousand (code) lines of speculation.
  • More vs Better (Score:4, Insightful)

    by Mikkeles (698461) on Tuesday April 01, 2008 @03:19PM (#22933440)
    Better data is probably most important and having more data makes having better data more likely. It would probably make sense to analyse the impact of each datum on the accuracy of the result, then choose a better algorithm using the most influential data. That is, a simple algorithm on good data is better than a great algorithm on mediocre data.
    • by Mushdot (943219)

      I agree here, though when humans are involved I think it can be difficult to get accurate data and the skill is in asking for information which has the least subjectiveness.

      To give an example closer to the topic, I watched Beowulf last night. After watching the film I was left with a feeling that he wasn't the hero I assumed he was (never having known the real story except a vague knowledge he was some sort of kick-ass old English hero).

      I spent a while doing some research and discovered that the film is

  • by Just Some Guy (3352) on Tuesday April 01, 2008 @03:21PM (#22933454) Homepage Journal

    One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better, nearly as well as the best algorithm on the boards for the $1 million challenge.

    And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.

  • Is it just me, or is it pretty obvious that this all just depends on the algorithm and the data?

    Like I could "augment" the data with worthless or misleading data, and get the same or worse results. If I have a huge set of really good and useful data, I can get better results without making my algorithm more advanced. And no matter how advanced my algorithm is, it won't return good results if it doesn't have sufficient data.

    When a challenge is put out to improve these algorithms, it's really because the

  • by peacefinder (469349) <alan.dewitt@gm a i l . c om> on Tuesday April 01, 2008 @03:23PM (#22933476) Journal
    "What do you think? Will more data usually perform better than a better algorithm?"

    I need more data.
    • ... or a better algorithm

      This is classic XOR thinking that permeates our society. One or the other, not both is rarely a correct option. It is mostly for boolean operations, which this is clearly not. This is clearly an AND function. More Data AND a Better Algorithm is actually the most correct answer. "Which helps more?" is a silly question except for deciding on how much resources should be split in improving both, along with how much easier is one vs the other.

    • by aug24 (38229)
      Or a better algorithm.
  • Five stars (Score:5, Insightful)

    by CopaceticOpus (965603) on Tuesday April 01, 2008 @03:24PM (#22933488)
    If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.

    How to translate the entire experience of watching a movie into a lone number is a separate issue.
    • by Chris Burke (6130)
      I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low.

      I don't think this is a problem of it being a 1-5 scale instead of 1-10. It's not like there's really that much information given by scoring a movie a 7 instead of 8, since it's all subjective anyway on any given day those scores could have been reversed.

      I think it's more the extremely common situation where people don't want to give an "average" score, so you get score inflation such that only th
      • by pavon (30274)
        My interpretation is not that I don't like rating things average, but that selection bias means that I only watch things that I expect to like, and more often than not that turns out to be the case. Every now and then I'll end up disliking a movie that I had high hopes for, or watch a movie I know I won't like with someone else, but for the most part I enjoy the (few) movies I see. And since you only rate the films you've watched, the majority of ratings by the majority of people will be positive.

        But that'
    • A scale of five or ten should not make too much of a difference. The difficult part (according to a Wired article) is figuring out the anchoring effect. If you've seen a lot of good movies lately, something mediocre will rate 2 stars, but if you've seen a lot of bad movies lately (ditch that significant other!) then a mediocre movie will more likely receive a three-star rating from you.
  • I would suggest that one go for both better algorithms AND more/better data.
  • ...the algorithm wasn't 'better' enough.
  • It really depends on a number of factors. I don't think anyone can make a general claim for one over the other. A smart algorithm can beat data augmentation in some cases. Of course, creating the algorithm is the crux of the matter, one that is harder to put a definition on.

    So, the upshot is to look at both approaches and take the best course of action for your needs.

  • I mean, if we balloon up to 10,000 feet, the problem really is, where do you put the extra data? Do you encode it in an algorithm, or do you have less code but more dynamic data. Given that POV, then, it stands to reason the best place to put the extra data is outside of the code, so that it is easier and less costly to modify.
  • by jd (1658) <.imipak. .at.> on Tuesday April 01, 2008 @03:41PM (#22933690) Homepage Journal
    ...that algorithms and data are, in fact, different animals. Algorithms are simply mapping functions, which can in turn be entirely represented as data. A true algorithm represents a set of statements which, when taken as a collective whole, will always be true. In other words, it's something that is generic, across-the-board. Think object-oriented design - you do not write one class for every variable. Pure data will contain a mix of the generic and the specific, with no trivial way to always identify which is which, or to what degree.

    Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.

    The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.

    For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.

  • A machine with swap enabled will always have more throughput than a machine without. It's a better use of the resources available. However, replace that swap space with the same amount of RAM, and of course that will be even better. Some use this as an argument against swap space, but it's not a fair comparison, since you can enable swap space in the RAM increased machine and increase throughput even more.

    So when I think of this recommendation system, a better algorithm is like having swap space enabled. It
  • by mlwmohawk (801821) on Tuesday April 01, 2008 @03:52PM (#22933812)
    I have written two recommendations systems and have taken a crack at the Netflix prize (but have been hard pressed to make time for the serious work.)

    The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.

    First, the data is bogus. The preferences are "aggregates" of rental behaviors, whole families are represented by single accounts. Little 16 year old Tod, likes different movies than his 40 year old dad. Not to mention his toddler sibling and mother. A single account may have Winnie the Pooh and Kill Bill. Obviously, you can't say that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed by this, it could be that parents of young children want the thrill of vicarious killing, but I digress)

    The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.

    Recommendation systems tend to like a lot of data, but not what you think. People will say, if you need more data, why just have 1-5 and not 1-10? Well, that really isn't much more added data it is just greater granularity of the same data. Think of it like "color depth" vs "resolution" on a video monitor.

    My last point about recommendations is that people have moods and are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine, and the numbers are higher. The next day he/she could have had a crappy day and had to deal with it sober, and the numbers are different.

    You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.

    That being said, the data is fun to work with!!

  • The team with more data performed better, probably because their data allowed them to clearly differentiate between movies using a far more significant dimension than the given ratings-per-movie dimension.
    The fundamental idea is to be able to identify clusters of movies, or users (who like a certain type of movie), and the idea of clusters is built on some form of distance. When you add a new dimension to your feature vector, you get a chance to identify groups of entities better, using that dimension. You may d
  • by fygment (444210) on Tuesday April 01, 2008 @04:03PM (#22933964)
    Two things. The first is that it is tritely obvious that adding more data improves your results. But there are two possible mechanisms at work. On the one hand add more of the same data ie. just make your original database larger with more entries. That form of augmentation will hopefully give you more insight into the underlying distribution of the data. On the other hand you can augment the existing data. In the latter you are really adding extra dimensions/features/attributes to the data set. That's what seems to be alluded to in the article i.e. the students are adding extra features to the original data set. The success of the technique is a trivial result which depends very much on whether the features you add are discriminating or not. In this case, the IMDB presumably added discriminating features. However, if it had not, then "improved algorithms" would have had the upper hand.

    The second thing is that the claim seems to assume that additional information is always actually available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business, additional data is often just not available (physically or for cost reasons). Consequently, improving your algorithms is all you can do. Similarly in academia (say a computer science department) the assumption is often that you are trying to improve your algorithms while assuming that you have all the data available.
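
    The "discriminating or not" condition can be sketched with a toy nearest-centroid classifier. Everything below is invented for illustration: the four samples, the labels, and the two candidate extra features. Accuracy is measured on the training samples themselves, which is fine for showing the mechanism but not for evaluating a real model.

```python
def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_accuracy(features, labels):
    """Fit one centroid per class, then score on the same samples."""
    classes = sorted(set(labels))
    centroids = {
        c: centroid([f for f, y in zip(features, labels) if y == c])
        for c in classes
    }
    correct = 0
    for f, y in zip(features, labels):
        # Ties go to the lowest class label, since min() keeps the first.
        predicted = min(classes, key=lambda c: sq_dist(f, centroids[c]))
        correct += predicted == y
    return correct / len(labels)


labels = [1, 1, 0, 0]  # hypothetical: 1 = liked, 0 = disliked

base = [[3.0], [4.0], [3.0], [4.0]]                             # original feature only
with_noise = [[3.0, 0.5], [4.0, 0.2], [3.0, 0.5], [4.0, 0.2]]   # extra feature unrelated to label
with_signal = [[3.0, 1.0], [4.0, 1.0], [3.0, 0.0], [4.0, 0.0]]  # extra feature tracks label

acc_base = nearest_centroid_accuracy(base, labels)
acc_noise = nearest_centroid_accuracy(with_noise, labels)
acc_signal = nearest_centroid_accuracy(with_signal, labels)

print(acc_base, acc_noise, acc_signal)
```

    In this contrived setup the non-discriminating feature leaves the classifier at chance, and only the feature that tracks the label helps. With an uninformative augmentation, a better algorithm would be the only lever left.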
  • Would you rather know more or be smarter?

    Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.

    You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.

    I wonder if accessing foreign web sites is legal in this competition though, be
  • When I was in college "augmented data" was a tactful way of saying "faked results"
  • For every problem, there is an optimal solution (okay... there are many optimal solutions, depending on what you are trying to optimize for). If you want to do better than that algorithm, you must break the model. That means that you must either modify the inputs or modify the assumptions of the model. For example, the fastest way to sort arbitrary data that can only be compared pairwise takes O(n*log(n)) time. To do any better, you must break the model by making assumptions about the range and precision
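
    One standard way to "break the model" for sorting is counting sort, which assumes the keys are non-negative integers with a known upper bound. This is only a sketch of that general idea, and the parameter name `max_value` is an assumption for illustration:

```python
def counting_sort(values, max_value):
    """Sort non-negative integers in O(n + k) time, where k = max_value + 1.

    No element is ever compared with another; the speedup comes entirely
    from the extra assumption that every value lies in 0..max_value.
    """
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1  # raises IndexError if the range assumption is violated

    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result


print(counting_sort([3, 1, 2, 3, 0], max_value=3))
```

    The cost of the extra assumption is visible in the signature: the caller has to supply information (`max_value`) that a comparison sort never needs.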
  • yes this data is useful, but you can't use it in the contest: []

    note that this makes sense. more/better data would help ANY decent algorithm. they want a better one, and they're judging you on a baseline. so they'd naturally limit your input options.
  • I've seen a great many cases where developing better algorithms led to better performance (and better algorithms rather than better data, in fact, account for the vast majority of Computer Science research papers out there), so certainly it can't only be better data. Additionally, what about the times when you need a better algorithm to take advantage of the additional data? Or what about when you combine the better algorithm with the better data?

    This article is a completely false dichotomy.
  • by aibob (1035288) on Tuesday April 01, 2008 @05:09PM (#22934762)
    I am a graduate student in computer science, emphasizing the use of machine learning.

    The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.

    The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.

    But that does not mean that the algorithm does not matter - for the problems I work on, using logistic regression or support vector machines outperforms naive Bayes by 10% - 30%, which is huge. So if you want good performance, you try a few different algorithms to see what works.

    Adding more training data does not always help either, if the distributions of the data are significantly different. You are much better off using the data to design better features which represent/summarize the data.

    In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.
  • It's pretty simple: If you have random noise, your algorithm can be as good as you want - you still get no useful information out of it. On the other hand, if the "more data" actually contains additional information, your entropy goes up and with a given algorithm you get better results. Taken to the extreme, you just get the desired output as additional information and you can reduce your algorithm to just printing it (should be O(1)).
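
    The noise-versus-information argument can be made concrete with mutual information, computed here from raw counts on a tiny invented data set (`target`, `noise`, and `answer` are hypothetical names, and mutual information is swapped in as one possible way to quantify "contains additional information"):

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


target = [0, 0, 1, 1]  # what we want to predict
noise = [0, 1, 0, 1]   # extra data that is independent of the target
answer = [0, 0, 1, 1]  # extra data that is the desired output itself

mi_noise = mutual_information(noise, target)
mi_answer = mutual_information(answer, target)

print(mi_noise, mi_answer, entropy(target))
```

    In this toy case the noise column shares zero bits with the target, so no algorithm can extract anything from it, while the "answer" column carries the target's full entropy, which is the degenerate case where the algorithm only has to print its input.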
  • Most of the data shows that Newtonian physics explains much of the physical universe really well. So if we leave out Einstein's experiments we can usually get along just peachy. But include Einstein's rules in your algorithms and calculations and they will ALWAYS be superior to simple Newtonian physics in those areas where "more data" proves the calculations, and the calculations themselves yield more data.

    So using an old saw --which comes first, the chicken or the egg? Or is there a superior que

  • And I didn't even get modded up back when I said that. []

Entropy requires no maintenance. -- Markoff Chaney