Programming

Can Machine Learning Replace Focus Groups?

itwbennett writes "In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing. Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons. Why don't people use this method? Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win."
This discussion has been archived. No new comments can be posted.

  • OK, so... (Score:5, Insightful)

    by war4peace ( 1628283 ) on Thursday May 31, 2012 @05:44PM (#40173821)

    I have read the synopsis 4 (four) times and I didn't get shit.
    Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.

    • I have read the synopsis 4 (four) times and I didn't get shit.

      Read this AC submitted summary [slashdot.org] It may (or may not) enlighten you.

    • by WrongSizeGlass ( 838941 ) on Thursday May 31, 2012 @07:52PM (#40175091)

      I have read the synopsis 4 (four) times and I didn't get shit.
      Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.

      If you had just clicked the green button the machine would have understood it for you.

    • The article doesn't make any sense, either. Who, other than scammers, cares about trivial shit like one button being pressed by a random person that wandered to some web page? People write software that users use to accomplish some work. You can't recruit random people to perform random actions on a randomly changing user interface, and then collect statistics on what they accomplished.

      To think of it, if someone did that, the "best" interface would look just like GNOME3... Oh shit...

      • by mwvdlee ( 775178 )

        Anybody who wants their users to take a certain action?

        Think of websites (as stated in TFS) or focus group testing (also stated in TFS).

        A lot of user interface testing is basically looking at how a user interacts with a UI. Things like automated testing could show you that people more easily recognize the functionality of the [OK] button over a functionally identical [Well, might as well try and go ahead with doing what I wanted to do] button.

        As for websites; even on my open source project websites I pre

        • Anybody who wants their users to take a certain action?

          Think of websites (as stated in TFS) or focus group testing (also stated in TFS).

          My response to that is identical to the comment you are replying to.

    • Re:OK, so... (Score:5, Insightful)

      by Tarsir ( 1175373 ) on Thursday May 31, 2012 @09:42PM (#40175859)
      You know, I read the summary without understanding it, and just clicked through to read the article, but only after reading your comment did I realize just how little sense the summary really made.

      In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing.

      It starts off talking about a nobody who did something that is apparently so trivial that it can be outdone by 20 lines of code. You might think that the following sentence will answer at least one of the questions raised by this sentence: Who is Steve Hanov? What is A/B testing? What do Steve's 20 lines of code do? But you'd be wrong.

      Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons.

      Because the next sentence jumps to a topic whose banality and seeming irrelevance to the matter at hand defies belief. Three coloured buttons, one of which 'outperformed' the others, with nary a hint as to what these buttons do, or how one can outperform the others.

      Why don't people use this method?

      The third sentence appears to pick up where the first left off. Why don't people use the A/B testing method? Or are we talking about the three coloured buttons method?

      Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win.

      The final sentence is a tour-de-force of disjointed confusion. It skips from machine learning algorithms that haven't been discussed, to tools with unknown purpose, to the design of something which was never specified.

      It's like the summary is some kind of abstract art installation whose purpose is to be as uninformative as possible. It is literally the opposite of informative: Not only does it provide no information, it raises questions which you can't even be sure relate to the purported topic at hand, because you don't know what the topic at hand is.

      It is either a bizarrely confused summary or one of the most artful trolls ever to grace Slashdot's front page

      • In the time you took to complain you could have RTFA. I understood it the first time I read it yesterday. Today, listening to complaints, I read it again and still understand it. Maybe you're not bright enough to either a) read it carefully or b) understand it. No problema - there are plenty of jobs as janitors and car salesmen.
        • by Tarsir ( 1175373 )

          In the time you took to complain you could have RTFA.

          Reread my post. I clicked through and read the article before posting my comment.

          I understood it the first time I read it yesterday.

          No you didn't. As the summary contains no actual information, you filled it in with your own prejudices and preconceptions, no doubt because you are not in the habit of reading things carefully. cf My first point.

          Today, listening to complaints, I read it again and still understand it.

          What is this supposed to prove? Of course you "still" understand it after having read the full article, unless you think people habitually lose all knowledge of their previous experiences after sleeping for eight hours

      • That is probably the best post I have ever read here. Extremely insightful and hilariously written. I was in tears laughing through most of it.
    • RTFA

      I understood it the first time I read it yesterday.

      Today, listening to your complaint, I read it again and still understand it.

      Maybe you're not bright enough to either a) read it carefully or
      b) understand it. No problema - there are plenty of jobs as janitors and car salesmen.

      • Smart. Very smart. You should be proud of yourself, being part of an elite that has the inherent right of stomping less-gifted people. Gratz!

  • Translation (Score:5, Informative)

    by Anonymous Coward on Thursday May 31, 2012 @05:56PM (#40173949)

    So that you don't have to click through the slashvertisement, I have read TFA for you.

    Here is a summary: Let's say you have several different designs for a web interface that you want to test to find out which one works the best.

    One method is to have a "testing period" in which you show each person one of the designs at random and record how well it works for that person. Then, once you've shown each of the designs to 1,000 people, you figure out which one is the best on average. Now the "testing period" is over, and the best design is shown to everyone from that point forward. That is the "old" method.

    The "new" method is to dispense with the testing period. Instead, you show the first person one design at random. If it works (e.g. they click on the ad), it gets bonus points. If it doesn't work, it gets a penalty. At any time, you show the design with the most points; if it is bad, it will lose points over time and eventually stop being shown.

    The goal of the "new" method is to hopefully avoid showing bad designs to 2000 people just to figure out which one is the best.

    If you care about the details then you should probably read the article. This summary is just an approximation for those who can't be bothered or who object to slashvertisements on principle.

    • ...thank you for saving me the work of slogging through it on my own.
    • Re: (Score:2, Interesting)

      by mark-t ( 151149 )

      The "new" method has the problem of immediately favoring the first design to get a positive response.

      My own experience with focus groups is that they were more interested in _WHY_ you chose something the way you did, rather than in just what you chose. I'm not entirely sure how this algorithm will determine that.

      • Re:Translation (Score:4, Informative)

        by spazdor ( 902907 ) on Thursday May 31, 2012 @06:42PM (#40174409)

        The "new" method has the problem of immediately favoring the first design to get a positive response.

        No it doesn't. The designs are ranked according to what percentage of responses have been positive so far, not by the total number of positive responses. The first design to get a positive response will get shown more, and thus it will get more positive responses, and more negative responses.
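        The difference between ranking by raw count and ranking by percentage can be shown with a small sketch. The design names and numbers below are made up for illustration, and `click_rate` is a hypothetical helper, not something from the article.

```python
def click_rate(clicks, shows):
    """Fraction of impressions that led to a click (0.0 if never shown)."""
    return clicks / shows if shows else 0.0


# Hypothetical (clicks, shows) totals for two designs.
stats = {
    "design 1": (30, 1500),  # shown a lot: more clicks in total, but only 2%
    "design 2": (12, 300),   # shown less: fewer clicks in total, but 4%
}

# Ranking by total clicks rewards whichever design was shown most often.
by_count = max(stats, key=lambda d: stats[d][0])

# Ranking by click rate compares designs on equal terms.
by_rate = max(stats, key=lambda d: click_rate(*stats[d]))

print("most clicks:", by_count)
print("best rate:  ", by_rate)
```

        With these numbers the design that collected the most clicks is not the one with the best rate, which is the point being made above.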

      • The "new" method has the problem of immediately favoring the first design to get a positive response.

        Only if you're stupid enough to only show the design with the highest score. Something as simple as choosing randomly among the top .75n results (where n=number of designs under test) fixes that.
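        A rough sketch of that suggestion, assuming the designs are ranked by their observed click rate and that 0.75n is rounded up to a whole number (the comment does not say how to round). The function name `pick_from_top` and the sample rates are hypothetical.

```python
import math
import random


def pick_from_top(rates, fraction=0.75, rng=random):
    """Pick uniformly at random among the best-scoring share of designs.

    rates: dict mapping design name -> observed click rate.
    fraction: share of designs that stay eligible (0.75 in the comment above).
    """
    ranked = sorted(rates, key=rates.get, reverse=True)
    # Assumption: round up, and always keep at least one design eligible.
    keep = max(1, math.ceil(fraction * len(ranked)))
    return rng.choice(ranked[:keep])


# Made-up rates for four designs; only the worst one ("d") is excluded.
rates = {"a": 0.04, "b": 0.03, "c": 0.02, "d": 0.01}
print(pick_from_top(rates))
```

        One consequence of this variant is that an excluded design never gets another chance unless its rank changes, so it trades some exploration for simplicity.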

    • Is there any way they can apply this to summaries and stories on /.? I'd be willing to read that summary ... and maybe even that story.
  • Comment removed based on user account deletion
    • by mwvdlee ( 775178 )

      Imagine you have 3 buttons...

      First user sees button 1, clicks it.
      Next user sees button 1 (because repeat), doesn't click it.
      Next user sees button 2, doesn't click it.
      Next user sees button 3, doesn't click it.
      Next user sees button 1, clicks it.
      Next user sees button 1 (because repeat), doesn't click it.
      Next user sees button 2, doesn't click it.
      Next user sees button 3, doesn't click it. ...repeat...

      Even though button 1 has a 50% success rate and the other buttons 0% (and is thus infinitely better), it's only s

    • Of course not. The whole point of a focus group is for the facilitator to lead the group to the conclusion he or she wants. Management can't manipulate machine learning algorithms -- only developers can.
  • by Anonymous Coward on Thursday May 31, 2012 @06:04PM (#40174021)

    This is not "machine learning" substituting for human A/B testing. It's just changing the ratio of the number of visitors exposed to the "new" feature to be tested from 50% to 10%, while keeping the rest (90%) of the visitors using the "best so far" feature. There's also a bit of randomness thrown in when choosing which new feature the 10% of visitors get to test.

    In this scheme, the human visitors are still doing the A/B testing, it's just that determination of which human is testing which feature dynamically adapts over time.

    Now, if this guy had substituted human A/B testing completely with a machine learning technology that could somehow determine which feature is better without any input from humans, then I'd be impressed. That's kind of what the summary and article imply. But that's not what he's done. He's just being a bit more sophisticated regarding which humans get to test which feature.

    He's also made a big fat claim regarding the effectiveness of his method with zero evidence to back it up. Theoretical results regarding one-armed bandit problems are quite a far cry from real-world results regarding website feature selection. I'm looking forward to seeing some results of the proposed method on the latter.
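    The 10%/90% scheme described in this comment is usually called epsilon-greedy. Below is a minimal sketch in Python, assuming a 10% exploration share; the class name `EpsilonGreedy`, its methods, and the click probabilities in the simulation are all invented for illustration, and this is not Hanov's actual code.

```python
import random


class EpsilonGreedy:
    """Track shows and clicks per design and pick which design to show next."""

    def __init__(self, designs, epsilon=0.1, rng=None):
        self.epsilon = epsilon  # share of visitors used for exploration
        self.rng = rng if rng is not None else random.Random()
        self.shows = {d: 0 for d in designs}
        self.clicks = {d: 0 for d in designs}

    def rate(self, design):
        # Observed click-through rate; 0.0 until the design has been shown.
        shown = self.shows[design]
        return self.clicks[design] / shown if shown else 0.0

    def choose(self):
        designs = list(self.shows)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(designs)  # explore: any design at random
        return max(designs, key=self.rate)   # exploit: best rate so far

    def record(self, design, clicked):
        self.shows[design] += 1
        if clicked:
            self.clicks[design] += 1


if __name__ == "__main__":
    # Simulated visitors with made-up click probabilities (illustration only).
    true_rates = {"orange": 0.02, "green": 0.04, "white": 0.01}
    rng = random.Random(0)
    bandit = EpsilonGreedy(true_rates, epsilon=0.1, rng=rng)
    for _ in range(20000):
        design = bandit.choose()
        bandit.record(design, rng.random() < true_rates[design])
    for design in true_rates:
        print(design, bandit.shows[design], round(bandit.rate(design), 4))
```

    In this sketch every design starts with an observed rate of 0.0, so the early choices depend on the exploration step; starting each design with an optimistic score instead is one common variation.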

    • So you want to do A/B testing on whether this algorithm is better than A/B testing?

      It'd probably be better to use the epsilon-greedy method when deciding whether the A/B testing or epsilon-greedy algorithm is better.

      Or maybe not. We'll have to test that too.

      It's testing all the way down.

    • by tgv ( 254536 )

      Indeed, this has no relation to machine learning, whatsoever. The summary is once again ... deceptive.

      And I'm sure the proof, that the best one gets chosen, doesn't exist. I'm also sure that this way of choosing an interface has a high probability of choosing the preferred one, but there is also a big difference with A/B testing: you'll never know how big the difference between the two is. In straight-forward testing with two groups (which is not really A/B, by the way: that is alternating between A

      • Indeed, this has no relation to machine learning, whatsoever.

        Is there an algorithm? Does the machine use the algorithm to obtain the optimum result? Just because the machine uses humans as its test subjects doesn't stop it being machine learning.

        • by tgv ( 254536 )

          So ... sorting is machine learning? MS Word is machine learning? Don't think so.

          Nowhere did I nor the GP claim that machines have to be involved. And the machine doesn't use humans in this case, it just uses their choices as its data. So your rebuttal is somewhat unfounded.

          Machine learning is learning in the first place, through algorithm: a machine can learn to do a task on its own. Not: a machine assists in a task where someone else learns. In this case, the machine doesn't learn anything. It just acts as

          • So ... sorting is machine learning? MS Word is machine learning? Don't think so.

            Nowhere did I nor the GP claim that machines have to be involved. And the machine doesn't use humans in this case, it just uses their choices as its data. So your rebuttal is somewhat unfounded.

            Machine learning is learning in the first place, through algorithm: a machine can learn to do a task on its own. Not: a machine assists in a task where someone else learns. In this case, the machine doesn't learn anything. It just acts as a biased die. The outcome of the process might be called "learned", but the knowledge is in the head of the one that runs the experiment and oversees the outcome, not in the machine. And the "learning" doesn't generalize, so it doesn't help in improving performance on any other task than selecting between these two designs.

            That's why it's not machine learning.

            A hell of a lot of machine learning is based around giving the computer an equation and letting it work out the particular coefficients that give the best possible answer. There are very few machine learning tasks that don't have some sort of experimenter assumptions built in, and no machine learning algorithm is ever 100% generalisable (otherwise machine learning would be a pretty small field, as there would only be one machine learning algorithm!)

            The reason that this is classed as a machine learning problem and

    • by khipu ( 2511498 )

      Both Hanov and you are mixing up a couple of things. A/B testing is done with focus groups, not live visitors. When you test with focus groups, you don't run a live web server, and you're willing to pay for completion of some test design.

      Algorithms for use with the multiarmed bandit are already widely used in live testing. Those algorithms properly belong to the field of machine learning (reinforcement learning), but it turns out that very simple algorithms or strategies are hard to beat. You're right t

      • by Cederic ( 9623 )

        You can A/B test with live visitors. Works well too.

        I think his approach has merit, but it's really just an automatically applied implementation of the outcome of the test - at some point you'd want to switch off A or B completely anyway.

        Of course, far more interesting would be understanding why people chose A or B and offering the appropriate one based on what you know of the person involved. That's more sophisticated, but already done by people like Amazon: My amazon.co.uk web page will be very different

        • by khipu ( 2511498 )

          You can A/B test with live visitors. Works well too.

          It's still not a multi-armed bandit situation. The multi-armed bandit situation specifically means that you present either A or B, not an A/B choice. There are other machine learning techniques for optimizing A/B tests, just not the ones in the article.

  • by hondo77 ( 324058 ) on Thursday May 31, 2012 @06:09PM (#40174077) Homepage
    Throwing up banner ads with different color schemes and automatically re-weighting them based on click-through % is something I was doing well over ten years ago. This can't really be news, can it?
  • by RandCraw ( 1047302 ) on Thursday May 31, 2012 @06:15PM (#40174119)

    A/B focus testing is about observing how customers or users choose between two alternatives based on their qualitative sense of aesthetics. ML is about classifying data based on quantifying the data into defined classes or toward optimal values.

    Predicting the outcome of a focus group is a completely different problem than multi arm slot machines. In focus groups there is no objective metric, so focus group problems are not amenable to machine learning unless your machine can define, measure, and perhaps predict aesthetic criteria.

    Now THAT I'd like to see.

    • Neither the article nor the summary says anything about A/B focus testing, or mentions focus groups at all. It refers to A/B testing, where 2 different websites are offered to customers, and the better one is found according to how objectively successful it has been (by sales, clicks or whatever numerical measure).

      • You're right. My criticism was misdirected. The article is fine; it's not about ML or focus groups but minimizing trial size.

        It was the Slashdot summary that somehow saw it as 'ML Replaces Focus Groups'. Thee-a-culpa.

      • Somebody in the chain, probably the submitter, thinks "user trials" and "focus groups" are synonyms.

    • Predicting the outcome of a focus group is a completely different problem than multi arm slot machines.

      He isn't trying to use ML to predict the outcome of a focus group.

      ML is about classifying data based on quantifying the data into defined classes or toward optimal values.

      ML is about many things. One thing it is about is how a learner should explore an environment in order to maximize what he learns. It is one of those techniques that Hanov refers to, and it's a good idea in principle. But he picked the

  • by HalfFlat ( 121672 ) on Thursday May 31, 2012 @08:43PM (#40175533)

    It's a 'good-enough' approximation to an optimal selection process.

    The probability of someone clicking on option A, B or C is unknown, but is expected to be constant when averaged over the population. Given the ratio of clicks versus views on any given option, the posterior distribution of that probability can be modelled as a Beta distribution. The experimental question is then: given the current estimates, which option should be presented to maximise the utility of the test?

    For simply ranking the options, the utility may be the Shannon information [wikipedia.org]. In this case though, the utility also has to incorporate the expected benefit of a click-through. One could set up a utility function which is weighted between the two outcomes, possibly varying over time.

    In practice though, Beta distributions with different means tend to converge to separate peaks quite quickly, so taking a possible 10% hit on the current best estimate click-through outcome seems an entirely plausible approximation. Bayesian experimental design though could also tell you when to stop testing and stick with the winner.
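    As a rough illustration of the Beta posterior idea above, the sketch below estimates by simulation how likely one option's true click rate is to exceed another's. It assumes a uniform Beta(1, 1) prior, which the comment does not specify, and the counts are made up; `prob_a_beats_b` is a hypothetical helper name.

```python
import random


def prob_a_beats_b(clicks_a, views_a, clicks_b, views_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_a > rate_b).

    Each option's click rate gets a Beta(1 + clicks, 1 + views - clicks)
    posterior, i.e. a uniform prior updated with the observed counts.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        b = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        wins += a > b
    return wins / draws


# Hypothetical counts: the same 4% vs 2% split at two sample sizes.
print(prob_a_beats_b(4, 100, 2, 100))      # little data: posteriors overlap
print(prob_a_beats_b(40, 1000, 20, 1000))  # more data: peaks separate
```

    Comparing the two printed values gives a feel for how quickly the posteriors pull apart as views accumulate, which is the behaviour the comment relies on.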

    • If you used this type of algorithm to rotate a selection of different-but-good style sheets on a website, you'd be able to go past "which one is best at the time the test was devised" and actually build sites that pre-emptively and reactively stay "fashionable", "trendy" and "cool".

      • An algorithm like this isn't going to always pick a trendy and fashionable design. It's going to pick the least bad design you have. If you make 15 designs now, they will probably all be tired in 2 years. Sure the algorithm will say design 7 is the best 2 years from now, but it's probably not as good as whatever your designer would come up with at that time. It's probably better to plan on your designer making the 15 designs over the span of the 2 years. That way you know you are submitting designs made unde
        • You're not wrong... but, there are scenarios where, for example, a designer comes up with 4 proposed designs, all of which are good, and someone needs to make a decision as to which one to go with without any meaningful way to differentiate. This algorithm allows all 4 to be approved as "functional and not embarrassing" and put into place.

          And yes, 2 years later, you might decide it's a good idea to hire a designer to freshen things up, and have them deliver you a few more designs. But, with a pattern like

    • by martas ( 1439879 )
      For simple non-critical things like web design what parent describes is all well and good, but please don't use any similar method for a problem with serious consequences, be it in medicine or science or anything like that. There are statistically sound ways of doing experimental design, e.g. for deciding when to stop an experiment, and they are not Bayesian (usually).
      • I am honestly curious: why should Bayesian experimental design not be used for serious work?

        • by martas ( 1439879 )
          Put simply, because it is the wrong tool. Frequentist methods for problems like hypothesis testing and confidence set estimation were designed based on some simple assumptions that probably never really hold in the real world, but probably aren't very far from the truth. Bayesian methods rely on assumptions (and definitions of what kind of error is to be avoided) that are not suitable for many problems in science and medicine. E.g. Bayesian confidence interval estimation will tell you that "on average" over
  • To be valid, the last step (of which the author makes no mention) should be to compare the three groups to see if their differences are statistically significant. With tens of thousands of clicks, it's likely that they are, but the percentages were awfully close in the 2-3% range.
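    One way to run the check this comment asks for is a chi-square test of equal click rates across the three buttons. The sketch below is only an illustration: the counts are invented, not the article's figures, and the function name is hypothetical. It relies on the fact that with three groups and two outcomes the test has 2 degrees of freedom, where the p-value is exactly exp(-x / 2).

```python
import math


def chi_square_three_groups(clicks, shows):
    """Chi-square test that three designs share one click rate.

    Returns (statistic, p_value). The closed-form p-value exp(-x / 2) is
    only valid for 2 degrees of freedom, hence exactly three groups.
    Assumes at least one click and at least one non-click overall.
    """
    if len(clicks) != 3 or len(shows) != 3:
        raise ValueError("this shortcut only holds for three groups")
    pooled = sum(clicks) / sum(shows)  # click rate if all designs were equal
    stat = 0.0
    for c, n in zip(clicks, shows):
        expected_click = n * pooled
        expected_miss = n * (1 - pooled)
        stat += (c - expected_click) ** 2 / expected_click
        stat += ((n - c) - expected_miss) ** 2 / expected_miss
    return stat, math.exp(-stat / 2)


# Made-up counts: 2.0%, 3.0% and 2.5% click rates on 10,000 views each.
stat, p = chi_square_three_groups([200, 300, 250], [10000, 10000, 10000])
print(round(stat, 2), p)
```

    A small p-value would suggest the differences are unlikely to be chance alone, though it says nothing about which pair of buttons differs.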

  • I do it even better with my Accelerated Market Research, which is based on Bayesian reasoning.

    http://oyhus.no/AcceleratedMarketResearch.html [oyhus.no]

  • The multiarmed bandit problem is a problem in which you simultaneously try to optimize your overall reward and still explore. As a consumer, I face that problem (switch brands or stick with the tried-and-true). However, for focus groups, maximizing rewards for participants doesn't matter; it's all about finding the best solution for the organizer of the focus group. The participants already get the products for free. That means that it is not a multiarmed bandit problem, and algorithms for solving such

  • and suddenly the button with the racial epithet on it becomes the most popular one and you lose all your real customers.
