Can Machine Learning Replace Focus Groups?
itwbennett writes "In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing. Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons. Why don't people use this method? Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win."
OK, so... (Score:5, Insightful)
I have read the synopsis 4 (four) times and I didn't get shit.
Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.
Re: (Score:2)
I have read the synopsis 4 (four) times and I didn't get shit.
Read this AC-submitted summary [slashdot.org]. It may (or may not) enlighten you.
Re:OK, so... (Score:5, Funny)
I have read the synopsis 4 (four) times and I didn't get shit.
Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.
If you had just clicked the green button the machine would have understood it for you.
Re: (Score:1)
The article doesn't make any sense, either. Who, other than scammers, cares about trivial shit like one button being pressed by a random person that wandered to some web page? People write software that users use to accomplish some work. You can't recruit random people to perform random actions on a randomly changing user interface, and then collect statistics on what they accomplished.
Come to think of it, if someone did that, the "best" interface would look just like GNOME3... Oh shit...
Re: (Score:2)
Anybody who wants their users to take a certain action?
Think of websites (as stated in TFS) or focus group testing (also stated in TFS).
A lot of user interface testing is basically looking at how a user interacts with a UI. Things like automated testing could show you that people more easily recognize the functionality of the [OK] button over a functionally identical [Well, might as well try and go ahead with doing what I wanted to do] button.
As for websites; even on my open source project websites I pre
Re: (Score:2)
Anybody who wants their users to take a certain action?
Think of websites (as stated in TFS) or focus group testing (also stated in TFS).
My response to that is identical to the comment you are replying to.
Re:OK, so... (Score:5, Insightful)
In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing.
It starts off talking about a nobody who did something that is apparently so trivial that it can be outdone by 20 lines of code. You might think that the following sentence will answer at least one of the questions raised by this sentence: Who is Steve Hanov? What is A/B testing? What do Steve's 20 lines of code do? But you'd be wrong.
Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons.
Because the next sentence jumps to a topic whose banality and seeming irrelevance to the matter at hand defy belief. Three coloured buttons, one of which 'outperformed' the others, with nary a hint as to what these buttons do, or how one can outperform the others.
Why don't people use this method?
The third sentence appears to pick up where the first left off. Why don't people use the A/B testing method? Or are we talking about the three coloured buttons method?
Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win.
The final sentence is a tour-de-force of disjointed confusion. It skips from machine learning algorithms that haven't been discussed, to tools with unknown purpose, to the design of something which was never specified.
It's like the summary is some kind of abstract art installation whose purpose is to be as uninformative as possible. It is literally the opposite of informative: Not only does it provide no information, it raises questions which you can't even be sure relate to the purported topic at hand, because you don't know what the topic at hand is.
It is either a bizarrely confused summary or one of the most artful trolls ever to grace Slashdot's front page.
Re: (Score:2)
Sadly, it learned to generate summaries by reading Slashdot :-(
Re: (Score:2)
In the time you took to complain you could have RTFA.
Reread my post. I clicked through and read the article before posting my comment.
I understood it the first time I read it yesterday.
No you didn't. As the summary contains no actual information, you filled it in with your own prejudices and preconceptions, no doubt because you are not in the habit of reading things carefully. cf My first point.
Today, listening to complaints, I read it again and still understand it.
What is this supposed to prove? Of course you "still" understand it after having read the full article, unless you think people habitually lose all knowledge of their previous experiences after sleeping for eight hours.
Re: (Score:1)
RTFA
I understood it the first time I read it yesterday.
Today, listening to your complaint, I read it again and still understand it.
Maybe you're not bright enough to either a) read it carefully or b) understand it. No problema - there are plenty of jobs as janitors and car salesmen.
Re: (Score:2)
Smart. Very smart. You should be proud of yourself, being part of an elite that has the inherent right of stomping less-gifted people. Gratz!
Re: (Score:2)
The only possible benefit I can see from this is *maybe* adjusting a site's color-scheme or layout to be more intuitive and easy to navigate.
Well, for those of us who do use testing and usability reporting on a daily basis, or have jobs that *require* us to know what is easiest for people to navigate (read: any and all web designers), this is pretty nice, and I intend to use the concept heavily.
Re: (Score:2)
Don't forget the sales pitch. It could help you choose between different texts. Real-world trials are far better than gut feel on that.
Re: (Score:2)
if I decide to add something to my cart, I'm confident I'll find the button even if it's 1.2% less optimized.
That's very well and good for you, but marketing and layout-optimization people are more interested in the question of whether one site or the other makes you more likely to decide to add something to your cart, and not whether you'll succeed once you've decided to do so.
Re: (Score:1)
For most people, myself included, I'd imagine the deciding factor is not website layout, but something much more obvious.
Money, dear boy. (spoken with an English accent, ofc)
Plus a variety of other factors like shipping speeds, general reputation of the sites, ease of RMA, etc... Whether the "buy" button is Green, Orange or White is quite simply the last on my list of priorities, and pulling metrics on it without examining the other factors will net faulty results.
Re: (Score:2)
Ah, I see. You're one of those few people whose every decision is the logical, incontrovertible result of sober factual considerations.
"Psychology" is merely the study of what forces mold the choices of everyone's mind but yours.
Re: (Score:2)
Clever retort, sir; however, might I interest you in a long-forgotten theory of economics: that something bought or sold might possibly have greater value than the mechanism by which it is sold?
Which is why advertising and marketing are such underfunded spheres of public endeavour....
Re: (Score:2)
Sorry marketing and sales department, you're fired.
You can thank jxander for proving your jobs were never useful in the first place.
But don't feel bad; since competitor A offers the same service for 0.01% less, we'll soon be bankrupt anyway.
Re: (Score:2)
I lol'd.
Re: (Score:2)
Oh FFS -- the use of button colours was what is known in technical jargon as an "example". The point of the article applies to all variables. And while you may think "layout" is less important than "shipping speeds", how do you find out shipping speeds? You have to look for them. If you can't find them, you walk. If you can't find them, chances are it's because of something we call in technical jargon "site design", which includes details such as "layout".
It's easy when you're designing something (I'm
Translation (Score:5, Informative)
So that you don't have to click through the slashvertisement, I have read TFA for you.
Here is a summary: Let's say you have several different designs for a web interface that you want to test to find out which one works the best.
One method is to have a "testing period" in which you show each person one of the designs at random and record how well it works for that person. Then, once each design has been shown to 1,000 people, you figure out which one is the best on average. Now the "testing period" is over, and the best design is shown to everyone from that point forward. That is the "old" method.
The "new" method is to dispense with the testing period. Instead, you show the first person one design at random. If it works (e.g. they click on the ad), it gets bonus points. If it doesn't work, it gets a penalty. At any time, you show the design with the most points; if it is bad, it will lose points over time and eventually stop being shown.
The goal of the "new" method is to avoid showing bad designs to 2000 people just to figure out which one is the best.
If you care about the details then you should probably read the article. This summary is just an approximation for those who can't be bothered or who object to slashvertisements on principle.
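For the curious, the point-scoring idea in this summary can be sketched in a few lines of Python. This is only an illustration of the summary as written, not the article's actual code: the function names, the design names, and the bonus and penalty of one point each are all assumptions made for the example.

```python
import random


def pick_design(scores, rng=random):
    """Return a design with the highest score; ties are broken at random."""
    best = max(scores.values())
    return rng.choice([name for name, s in scores.items() if s == best])


def record_result(scores, design, worked, bonus=1, penalty=1):
    """Give the design bonus points if it worked, a penalty if it did not."""
    scores[design] += bonus if worked else -penalty


# Hypothetical designs, all starting level, so the first pick is random.
scores = {"orange": 0, "green": 0, "white": 0}
shown = pick_design(scores)
record_result(scores, shown, worked=False)  # this visitor did not click
print(shown, scores)
```

Because every design starts on the same score, the first visitor sees a random design, and a design that keeps failing eventually drops below the others and stops being shown.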
Re: (Score:2, Interesting)
The "new" method has the problem of immediately favoring the first design to get a positive response.
My own experience with focus groups is that they were more interested in _WHY_ you chose something the way you did, rather than in just what you chose. I'm not entirely sure how this algorithm will determine that.
Re:Translation (Score:4, Informative)
The "new" method has the problem of immediately favoring the first design to get a positive response.
No it doesn't. The designs are ranked according to what percentage of responses have been positive so far, not by the total number of positive responses. The first design to get a positive response will get shown more, and thus it will get more positive responses, and more negative responses.
Re: (Score:3)
More people will inevitably vote it down (unless it is indeed the best option), because it's getting more exposure.
Unless you're saying that display frequency will actually affect click-through rate. Are you suggesting that, for instance, a design which only gets shown 300 times and gets 100 positive responses, if it were shown 3000 times instead it should be expected to get more than 1000 positive responses? This seems unlikely if successive tests are causally independent (and given that successive tests a
Re:Translation (Score:5, Informative)
No.... I'm suggesting that the algorithm presented above, which only ever displays the single highest scoring design, is biased against designs that haven't yet had a chance to be viewed by anybody, and thus have not had an opportunity to get a positive response, when people are already showing some favor towards others.
What you're missing is the implied assumption that all of the options will fail most of the time, and that all options are initialized with maximum scores. The goal is to find the design that best motivates the user to take some action (e.g. click a link), and the assumption is that most of the time the user will not take that action. By starting all of the choices at a high value, they will all gradually converge downward to their true effectiveness rate, at which point the most effective will be chosen nearly all of the time. During the convergence process, the "leader" may change, but if the current leader isn't the true best, as it gets driven towards its true rate, it will eventually dip under one of the others.
If, by chance, a more effective option has a really bad run early on and gets pushed below the true effectiveness rate of another option, it would never recover -- which is why the author includes an occasional randomly-selected choice. If there is a large difference between the effectiveness of the options this is really unlikely to happen, but in the rare event it happens the randomization will eventually fix it. The author also covers a method of handling the case where the audience preferences drift over time, by including the ability to "forget" old input via simple exponential decay.
The only really bad thing about this approach is that it assumes you don't have a lot of repeat visitors. If you do, they'll be annoyed by seeing different versions, apparently at random (from their perspective).
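A rough Python sketch of the scheme described above, for anyone who wants to poke at it. It is not the article's code. The class name, the starting counts of one click from one view, the `decay` parameter, and the choice to decay only the design that was just shown are assumptions; the 10% exploration rate is the figure quoted elsewhere in this discussion.

```python
import random


class EpsilonGreedy:
    """Hypothetical helper: show the design with the best observed click
    rate, but show a random design a fraction `epsilon` of the time."""

    def __init__(self, designs, epsilon=0.1, decay=1.0, rng=None):
        self.epsilon = epsilon
        self.decay = decay  # 1.0 means old results are never forgotten
        self.rng = rng or random.Random()
        # Optimistic start: pretend every design has 1 click from 1 view.
        self.clicks = {d: 1.0 for d in designs}
        self.views = {d: 1.0 for d in designs}

    def rate(self, design):
        return self.clicks[design] / self.views[design]

    def choose(self):
        designs = list(self.views)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(designs)  # occasional random exploration
        best = max(self.rate(d) for d in designs)
        return self.rng.choice([d for d in designs if self.rate(d) == best])

    def record(self, design, clicked):
        # Shrink this design's old counts so recent results weigh more.
        self.clicks[design] *= self.decay
        self.views[design] *= self.decay
        self.views[design] += 1
        if clicked:
            self.clicks[design] += 1


# Simulated visitors with made-up "true" click rates.
true_rates = {"orange": 0.02, "green": 0.04, "white": 0.03}
bandit = EpsilonGreedy(true_rates, epsilon=0.1, rng=random.Random(42))
for _ in range(5000):
    shown = bandit.choose()
    bandit.record(shown, clicked=random.random() < true_rates[shown])
print({d: round(bandit.views[d]) for d in true_rates})
```

Setting `decay` slightly below 1.0 (say 0.999) gives the "forgetting" behaviour for drifting preferences; leaving it at 1.0 keeps plain running totals.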
Re: (Score:2)
The only really bad thing about this approach is that it assumes you don't have a lot of repeat visitors. If you do, they'll be annoyed by seeing different versions, apparently at random (from their perspective).
What he doesn't discuss is what "one" instance of the site is. If you've got tracking cookies switched on, then you can assign one version of the site to the user at first visit and have it persist across browsing sessions.
An oversight on the author's part, but not a huge leap of logic.
Re: (Score:2)
designs that haven't yet had a chance to be viewed by anybody,
There are no such designs in this model, owing to the fact that 10% of all visitors are shown a design at random, unweighted by previous measurements.
Seriously, the algorithm presented in TFA anticipates and addresses your objection perfectly. You'd do well to check it out; AC's summary up there was good but incomplete.
Re: (Score:2)
Take a piece of paper and try to run down some scenarios. Try to find a scenario that disproves your own theory, then figure out why.
I'm sure there are edge cases where this "new" method fails, but there are also edge cases where classical focus group testing fails.
Since my job involves some A/B testing, I did the above and found some edge cases. But they're far less likely and with some job-specific optimizing (we have relatively long feedback delays) these edge cases can be mitigated.
Most interesting issu
Re: (Score:2)
Only if you're stupid enough to only show the design with the highest score. Something as simple as choosing randomly among the top .75n results (where n=number of designs under test) fixes that.
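As a sketch of that suggestion, assuming each design already has an observed success rate, picking uniformly among the top 0.75n could look like the following. The function name and the decision to round 0.75n up are assumptions for illustration.

```python
import math
import random


def pick_from_top(rates, fraction=0.75, rng=random):
    """Pick uniformly among the top `fraction` of designs, ranked by rate."""
    ranked = sorted(rates, key=rates.get, reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))  # always keep at least one
    return rng.choice(ranked[:k])


# Made-up rates for four designs: only the worst one ("d") is never shown.
rates = {"a": 0.5, "b": 0.3, "c": 0.2, "d": 0.1}
print(pick_from_top(rates))
```

Note that a design outside the top group is never shown again under this rule alone, so it cannot climb back unless something else occasionally shows it.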
Re: (Score:2)
Imagine you have 3 buttons...
First user sees button 1, clicks it. ...repeat...
Next user sees button 1 (because repeat), doesn't click it.
Next user sees button 2, doesn't click it.
Next user sees button 3, doesn't click it.
Next user sees button 1, clicks it.
Next user sees button 1 (because repeat), doesn't click it.
Next user sees button 2, doesn't click it.
Next user sees button 3, doesn't click it.
Even though button 1 has a 50% success rate and the other buttons 0% (and is thus infinitely better), it's only s
Re: (Score:2)
1(1:1) 2(1:1) 3(1:1)
First user sees 1, clicks it:
1(2:2) 2(1:1) 3(1:1)
At this point, the algorithm could still pick any of the three.
Say it picks 1 again, and this is not clicked:
1(2:3) 2(1:1) 3(1:1)
So say it picks 2 for the next user, but the user doesn't click it:
1(2:3) 2(1:2) 3(1:1)
Well, this time it has to pick 3 (unless the 10% random kicks in). Let's assume that's unsuccessful.
1(2:
This is not exclusively machine learning (Score:5, Insightful)
This is not "machine learning" substituting for human A/B testing. It's just changing the ratio of the number of visitors exposed to the "new" feature to be tested from 50% to 10%, while keeping the rest (90%) of the visitors using the "best so far" feature. There's also a bit of randomness thrown in when choosing which new feature the 10% of visitors get to test.
In this scheme, the human visitors are still doing the A/B testing, it's just that determination of which human is testing which feature dynamically adapts over time.
Now, if this guy had substituted human A/B testing completely with a machine learning technology that could somehow determine which feature is better without any input from humans, then I'd be impressed. That's kind of what the summary and article imply. But that's not what he's done. He's just being a bit more sophisticated regarding which humans get to test which feature.
He's also made a big fat claim regarding the effectiveness of his method with zero evidence to back it up. Theoretical results regarding multi-armed bandit problems are quite a far cry from real-world results regarding website feature selection. I'm looking forward to seeing some results of the proposed method on the latter.
Re: (Score:2)
So you want to do A/B testing on whether this algorithm is better than A/B testing?
It'd probably be better to use the epsilon-greedy method when deciding whether the A/B testing or epsilon-greedy algorithm is better.
Or maybe not. We'll have to test that too.
It's testing all the way down.
Re: (Score:3)
Indeed, this has no relation to machine learning, whatsoever. The summary is once again ... deceptive.
And I'm sure the proof, that the best one gets chosen, doesn't exist. I'm also sure that this way of choosing an interface has a high probability of choosing the preferred one, but there is also a big difference with A/B testing: you'll never know how big the difference between the two is. In straightforward testing with two groups (which is not really A/B, by the way: that is alternating between A
Re: (Score:2)
Indeed, this has no relation to machine learning, whatsoever.
Is there an algorithm? Does the machine use the algorithm to obtain the optimum result? Just because the machine uses humans as its test subjects doesn't stop it being machine learning.
Re: (Score:2)
So ... sorting is machine learning? MS Word is machine learning? Don't think so.
Nowhere did I nor the GP claim that machines have to be involved. And the machine doesn't use humans in this case, it just uses their choices as its data. So your rebuttal is somewhat unfounded.
Machine learning is learning in the first place, through algorithm: a machine can learn to do a task on its own. Not: a machine assists in a task where someone else learns. In this case, the machine doesn't learn anything. It just acts as
Re: (Score:2)
So ... sorting is machine learning? MS Word is machine learning? Don't think so.
Nowhere did I nor the GP claim that machines have to be involved. And the machine doesn't use humans in this case, it just uses their choices as its data. So your rebuttal is somewhat unfounded.
Machine learning is learning in the first place, through algorithm: a machine can learn to do a task on its own. Not: a machine assists in a task where someone else learns. In this case, the machine doesn't learn anything. It just acts as a biased dice. The outcome of the process might be called "learned", but the knowledge is in the head of the one that runs the experiment and overlooks the outcome, not in the machine. And the "learning" doesn't generalize, so it doesn't help in improving performance on any other task than selecting between these two designs.
That's why it's not machine learning.
A hell of a lot of machine learning is based around giving the computer an equation and letting it work out the particular coefficients that give the best possible answer. There are very few machine learning tasks that don't have some sort of experimenter assumptions built in, and no machine learning algorithm is ever 100% generalisable (otherwise machine learning would be a pretty small field, as there would only be one machine learning algorithm!)
The reason that this is classed as a machine learning problem and
Re: (Score:2)
Both Hanov and you are mixing up a couple of things. A/B testing is done with focus groups, not live visitors. When you test with focus groups, you don't run a live web server, and you're willing to pay for completion of some test design.
Algorithms for use with the multiarmed bandit are already widely used in live testing. Those algorithms properly belong to the field of machine learning (reinforcement learning), but it turns out that very simple algorithms or strategies are hard to beat. You're right t
Re: (Score:2)
You can A/B test with live visitors. Works well too.
I think his approach has merit, but it's really just an automatically applied implementation of the outcome of the test - at some point you'd want to switch off A or B completely anyway.
Of course, far more interesting would be understanding why people chose A or B and offering the appropriate one based on what you know of the person involved. That's more sophisticated, but already done by people like Amazon: My amazon.co.uk web page will be very different
Re: (Score:2)
It's still not a multi-armed bandit situation. The multi-armed bandit situation specifically means that you present either A or B, not an A/B choice. There are other machine learning techniques for optimizing A/B tests, just not the ones in the article.
This Is News? (Score:3)
Re: (Score:2)
Maybe, given that most sites aren't doing it means it comes under "stuff that matters".
The article's premise is entirely wrong (Score:5, Insightful)
A/B focus testing is about observing how customers or users choose between two alternatives based on their qualitative sense of aesthetics. ML is about classifying data based on quantifying the data into defined classes or toward optimal values.
Predicting the outcome of a focus group is a completely different problem from the multi-armed bandit (slot machine) problem. In focus groups there is no objective metric, so focus group problems are not amenable to machine learning unless your machine can define, measure, and perhaps predict aesthetic criteria.
Now THAT I'd like to see.
Re: (Score:3)
i don't know what the fuck a "double-blind" focus group is, since the user is clearly not blind to the design (this is the entire point).
and the reason why this is "like" a focus group, is that it is a focus group. all the information is coming from humans; it's just being used in a not-completely-idiotic way.
it's such an obvious idea it's surprising that no one has done this yet. oh, wait: http://m6d.com/about/about-us/ [m6d.com]
"Because the approach is rooted in machine learning, it continuously updates advertising
Re: (Score:2)
No, it's not a focus group. A focus group is a bunch of people talking about what they like/don't like. However, humans are very poor at judging what they like. Most living room (en_US "lounge") chairs are uncomfortable. People buy them because when they sit down on them in the showroom, they appear comfortable. Because they encourage poor posture, they take the strain off the sitting muscles. This gives the illusion of relaxation, and tricks people into believing the uncomfortable is comfortable.
A re
Re: (Score:2)
yeah i know what a real focus group is, but it's a reasonable metonymic usage imho. welcome to today's internet, where you're never more than a statistic, unless someone actually notices you, in which case god help you.
medium rare: well, it's also what i'd personally recommend to someone... it's a good starting point. imho anything more than medium is a waste of decent steak, so medium-rare is in the middle of acceptable. personally, i go for rare at most if i'm at a good place (which is none-too-often, sad
OT: steaks (Score:2)
Re: (Score:2)
that's a shame, but in line with the stereotypes of english food i suppose.
by the way, i've only read about and seen pictures of beef wellington, but it seems to me to be the culinary equivalent of an orgy, and would be, in and of itself, a total redemption of british cuisine. am i wrong here?
Re: (Score:2)
Neither the article nor the summary says anything about A/B focus testing. Or mention focus groups at all. It refers to A/B testing, where 2 different websites are offered to customers, and the better one found according to how objectively successful it has been. (by sales, clicks or whatever numerical measure.)
Re: (Score:2)
You're right. My criticism was misdirected. The article is fine; it's not about ML or focus groups but minimizing trial size.
It was the Slashdot summary that somehow saw it as 'ML Replaces Focus Groups'. Thee-a-culpa.
Re: (Score:2)
Somebody in the chain, probably the submitter, thinks "user trials" and "focus groups" are synonyms.
you got it wrong too (Score:2)
He isn't trying to use ML to predict the outcome of a focus group.
ML is about many things. One thing it is about is how a learner should explore an environment in order to maximize what he learns. It is one of those techniques that Hanov refers to, and it's a good idea in principle. But he picked the
Bayesian modelling and experiment design (Score:3)
It's a 'good-enough' approximation to an optimal selection process.
The probability of someone clicking on option A, B or C is unknown, but is expected to be constant when averaged over the population. Given the ratio of clicks versus views on any given option, the posterior distribution of that probability can be modelled as a Beta distribution. The experimental question is then: given the current estimates, which option should be presented to maximise the utility of the test?
For simply ranking the options, the utility may be the Shannon information [wikipedia.org]. In this case though, the utility also has to incorporate the expected benefit of a click-through. One could set up a utility function which is weighted between the two outcomes, possibly varying over time.
In practice though, Beta distributions with different means tend to converge to separate peaks quite quickly, so taking a possible 10% hit on the current best estimate click-through outcome seems an entirely plausible approximation. Bayesian experimental design though could also tell you when to stop testing and stick with the winner.
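To make the Beta-posterior idea concrete, here is a minimal Python sketch. It assumes a uniform Beta(1, 1) prior, which is a choice made for this example and not something stated above; the function names and the click counts are likewise made up. Under that prior, c clicks in v views give a Beta(1 + c, 1 + v - c) posterior, and comparing two options can be done by drawing samples from each posterior.

```python
import random


def posterior_params(clicks, views, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters after `clicks` successes in `views` trials."""
    return prior_a + clicks, prior_b + (views - clicks)


def posterior_mean(clicks, views):
    """Mean of the Beta posterior under the assumed uniform prior."""
    a, b = posterior_params(clicks, views)
    return a / (a + b)


def prob_first_beats_second(first, second, draws=20000, rng=None):
    """Monte Carlo estimate of P(rate of first > rate of second).

    `first` and `second` are (clicks, views) pairs.
    """
    rng = rng or random.Random()
    a1, b1 = posterior_params(*first)
    a2, b2 = posterior_params(*second)
    wins = sum(
        rng.betavariate(a1, b1) > rng.betavariate(a2, b2) for _ in range(draws)
    )
    return wins / draws


# Made-up counts: option A has 30 clicks in 1000 views, option B 20 in 1000.
print(posterior_mean(30, 1000), posterior_mean(20, 1000))
print(prob_first_beats_second((30, 1000), (20, 1000)))
```

One possible stopping rule along these lines is to stop testing once the estimated probability that the leader beats every rival passes some threshold, though choosing that threshold is a judgment call and not something the sketch settles.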
Re: (Score:2)
If you used this type of algorithm to rotate a selection of different-but-good style sheets on a website, you'd be able to go past "which one is best at the time the test was devised" and actually build sites that pre-emptively and reactively stay "fashionable", "trendy" and "cool".
Re: (Score:2)
You're not wrong... but, there are scenarios where, for example, a designer comes up with 4 proposed designs, all of which are good, and someone needs to make a decision as to which one to go with without any meaningful way to differentiate. This algorithm allows all 4 to be approved as "functional and not embarrassing" and put into place.
And yes, 2 years later, you might decide it's a good idea to hire a designer to freshen things up, and have them deliver you a few more designs. But, with a pattern like
Re: (Score:1)
I am honestly curious: why should Bayesian experimental design not be used for serious work?
Er, how about statistical significance? (Score:2)
To be valid, the last step (of which the author makes no mention) should be to compare the three groups to see if their differences are statistically significant. With tens of thousands of clicks, it's likely that they are, but the percentages were awfully close in the 2-3% range.
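One way to run such a check is a Pearson chi-squared test on the click counts of the three groups. The Python sketch below is an illustration only: the function name is hypothetical and the counts are made up, not taken from the article. With three groups and two outcomes (click or no click) the test has 2 degrees of freedom, and for exactly that case the p-value is exp(-statistic / 2), so no statistics library is needed.

```python
import math


def chi_squared_three_groups(clicks, views):
    """Test whether three designs plausibly share one click rate.

    Returns (statistic, p_value). Only valid for exactly three groups, and
    assumes at least one click and at least one non-click overall.
    """
    if len(clicks) != 3 or len(views) != 3:
        raise ValueError("this sketch handles exactly three groups")
    pooled = sum(clicks) / sum(views)  # click rate if all designs were equal
    stat = 0.0
    for c, v in zip(clicks, views):
        expected_clicks = v * pooled
        expected_misses = v * (1 - pooled)
        stat += (c - expected_clicks) ** 2 / expected_clicks
        stat += ((v - c) - expected_misses) ** 2 / expected_misses
    return stat, math.exp(-stat / 2)  # exact tail probability for 2 d.o.f.


# Made-up counts with rates in the 2-3% range.
stat, p = chi_squared_three_groups([200, 260, 230], [10000, 10000, 10000])
print(stat, p)
```

A small p-value (conventionally below 0.05) suggests the differences between the groups are unlikely to be chance alone; it does not say which design is the better one.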
Even better (Score:2)
I do it even better with my Accelerated Market Research, which is based on Bayesian reasoning.
http://oyhus.no/AcceleratedMarketResearch.html [oyhus.no]
wrong algorithm (Score:2)
The multiarmed bandit problem is a problem in which you simultaneously try to optimize your overall reward and still explore. As a consumer, I face that problem (switch brands or stick with the tried-and-true). However, for focus groups, maximizing rewards for participants doesn't matter; it's all about finding the best solution for the organizer of the focus group. The participants already get the products for free. That means that it is not a multiarmed bandit problem, and algorithms for solving such
Then /b finds your site... (Score:2)
and suddenly the button with the racial epithet on it becomes the most popular one and you lose all your real customers.