Why Programmers Need To Learn Statistics
David Gerard writes "Zed Shaw writes an impassioned plea to programmers: Programmers Need To Learn Statistics Or I Will Kill Them All. Quoting: 'I go insane when I hear programmers talking about statistics like they know s*** when it's clearly obvious they do not. I've been studying it for years and years and still don't think I know anything. ... I have taken a bunch of math classes, studied statistics in grad school, learned the R language, and read tons of books on the subject. Despite all of this I'm not at all confident in my understanding of such a vast topic. What I can do is apply the techniques to common problems I encounter at work. My favorite problem to attack with the statistics wolverine is performance measurement and tuning. All of this leads to a curse since none of my colleagues have any clue about what they don't understand. I'll propose a measurement technique and they'll scoff at it. I try to show them how to properly graph a run chart and they're indignant. I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.'"
Statistics is HARD (Score:5, Informative)
(a) Probability theory, on which all practical Statistics is based, is both (i) counterintuitive and (ii) difficult
(b) The very Mathematics on which it is based is obscure
And, worst of all, it is uniformly badly taught, even in good universities, and the Statistics for XXX are uniformly awful, blind leading the blind.
Lastly, it is very hard to get a straight answer from a mathematical Statistician.
Zed Shaw is a tosser. (Score:2, Informative)
Nothing new to see here.
Stats? Fuck that. (Score:2, Informative)
Statistics is WAY beyond what a programmer cares about. Logic is all that matters. Statistics>logic is the problem of the software engineer, not the programmer.
Re:Reply from a programmer that knows no statistic (Score:2, Informative)
You probably still think I am a lunatic, but hear me out.
You don't qualify as a lunatic; just as someone who has no idea of what he's talking about. Absolutely no idea. Your post, my friend, is so full of ideas you obviously misunderstood that I won't even attempt to make a list.
And yes, I do statistics for a living.
Knowledge isn't the problem (Score:2, Informative)
Re:Title fail. (Score:4, Informative)
Don't you mean free()?
Everyone should learn statistics (Score:5, Informative)
Before computers, stats involved using parametric tests (t-tests, ANOVA, etc.) which made assumptions like "the data comes from an underlying normal distribution". BTW, in stats terms "normal" means "Gaussian" [wikipedia.org].
Now, with cheap and fast computers, we can actually compute the confidence intervals nonparametrically through permutation tests and bootstrapping [wikipedia.org] without assuming anything about underlying distributions. In most cases, this nonparametric test is the "right thing to do". Most of the time, the results are the same as using a parametric test.
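To make the bootstrap idea above concrete, here is a minimal Python sketch of a percentile bootstrap confidence interval for a mean. The function name `bootstrap_ci`, the made-up exponential sample, and the number of resamples are all assumptions for illustration; this is one simple variant, not the only way to bootstrap.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.fmean, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read off the alpha/2 and 1 - alpha/2 percentiles of the resampled statistics.
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical, deliberately skewed sample (exponential with true mean 1.0).
rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(200)]
low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.fmean(sample):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Nothing in the sketch assumes the data are Gaussian; the interval comes entirely from resampling the observed values.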
However, a HUGE disaster in empirical science has been the problem of multiple comparisons. With computers it is so easy to compute correlations and significance tests between every possible slice of your data set. Many "scientists" don't have good statistical knowledge and pray at the altar of "p < 0.05". They don't know about or understand the problem of multiple comparisons [wikipedia.org]. So they do 20 tests, find one that comes out p < 0.05 and write a paper about it. They don't get that if you do 20 tests you are more likely than not to find one that comes out p < 0.05 by chance alone.
Anyone who has access to excel or matlab can do this little experiment.
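sig = zeros(1,100);
for x = 1:100
    samp = randn(50,1);          % 50 normally distributed random numbers (mean=0, var=1)
    test = randn(50,1);          % 50 more from the same distribution, redrawn every pass
    sig(x) = ttest2(samp, test); % 1 if the two-sample t-test rejects at p < 0.05
end
Now look at the sig vector. OMG, about 5% of the tests came out significant!!!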
Now you are writing a paper all about how x is linked to y. But you are essentially throwing dice and then writing a paper about why it came up '33'.
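The little experiment above can also be sketched in Python with only the standard library. Two assumptions to flag: the standard library has no t-distribution, so the two-sided 5% critical value for 98 degrees of freedom is hardcoded as roughly 1.984, and both samples are redrawn on every pass so that the tests are independent. The function names are made up for illustration.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 50 + 50 - 2 = 98 degrees
# of freedom. Hardcoded (approximate) because the standard library has no
# t-distribution.
T_CRIT_98 = 1.984

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))

def false_positive_rate(n_tests=2000, n=50, seed=1):
    """Fraction of 'significant' results when both samples are pure noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if abs(t_statistic(a, b)) > T_CRIT_98:
            hits += 1
    return hits / n_tests

rate = false_positive_rate()
# Chance of at least one false positive among 20 independent tests at 0.05.
familywise = 1 - 0.95 ** 20
print(f"false positive rate: {rate:.3f}")
print(f"P(at least one hit in 20 tests): {familywise:.2f}")
```

The second number is the one that matters for the "20 tests" scenario: with 20 independent tests at the 0.05 level, the chance of at least one spurious hit is 1 - 0.95^20, or about 64%.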
Zed Shaw sounds like a douche. (Score:3, Informative)
Wikipedia on Zed Shaw (Score:3, Informative)
Re:Statistical analysis of the summary (Score:5, Informative)
I don't know if there is an invisible elephant in my kitchen, so I guess I should assign equal probability to both outcomes. I also don't really know how Baccarat works, I guess my odds are 50/50.
Without knowing something about him or his coworkers, you by definition cannot make any statistical statements. To make any statements, you would first need to make some observations. This is how statistics is different from logic. Statistics is grounded in data.
I don't agree with Zed, but you may have just proved his point.
Re:93% of Programmers Think You're Wrong (Score:3, Informative)
I'm sure it's not 50%, and not 25%
heads=1, tails = 0
00 01 10 11
so if one of them is 1, there's a 33.33% chance the other is 1 too.
I can work it out that way for 2 binary possibilities. I couldn't generalize it to x coins with y sides :/
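One way to generalize the counting argument above is to let the computer list the outcomes. The following Python sketch enumerates every equally likely outcome for any number of "coins" with any number of sides; the function name and parameters are made up for illustration, and it assumes fair, independent coins.

```python
from fractions import Fraction
from itertools import product

def p_all_given_at_least_one(n_coins=2, n_sides=2, face=1):
    """P(every coin shows `face` | at least one coin shows `face`)."""
    outcomes = product(range(n_sides), repeat=n_coins)
    # Keep only the outcomes that satisfy the condition.
    allowed = [o for o in outcomes if face in o]
    favourable = [o for o in allowed if all(x == face for x in o)]
    return Fraction(len(favourable), len(allowed))

print(p_all_given_at_least_one())        # two fair coins
print(p_all_given_at_least_one(3, 2))    # three fair coins
print(p_all_given_at_least_one(2, 6))    # two six-sided dice
```

The same count gives a closed form: with x coins of y sides there are y^x outcomes, (y-1)^x of them contain no "heads" at all, so the answer is 1 / (y^x - (y-1)^x). For two ordinary coins that is 1 / (4 - 1) = 1/3.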
Re:Very good (from someone who's taken BOTH)... ap (Score:5, Informative)
You know, that particular citation has made me wonder in the past, but not enough to actually research it. So, I went off looking for more information and found it [straightdope.com].
The statistic was generated from a July 1976 survey.
The sample group for this statistic was 1,200 dentists. These dentists were hand picked by the research company, probably with good reason.
They were asked what advice they would give gum-chewing patients:
1) sugared gum
2) sugarless gum
3) no gum at all.
Sugarless gum got 85% of the vote. Not terribly surprising. I'd be fairly confident that their time had been paid for, or at the very least they were told "This survey is being done for Trident Sugarless Gum." That is only speculation, so hush up.
17/20 doesn't really sound very good. It just doesn't stick in your head. 4/5 is close enough, even though it reduces your answer to 80% (ahhh, a lie). Since these are marketing folks, I'm sure they pushed all kinds of values past focus groups, until "4 in 5" was accepted as most favorable.
As the link cites, they're fairly confident that the "sugared gum" answer got at least one response. There's always someone that'll take the obvious wrong answer. If you don't believe that, look at any Slashdot poll. :)
What they don't say is how many of the 1,200 samples were dropped. I'm sure there were nonresponses, and they could have easily added any number of unfavorable answers in as nonresponses. Of course, they couldn't have 100% in their favor, so they had to keep some.
Re:Mathematicians just need to shutup. (Score:1, Informative)
I'd say that if someone has not completed calculus then any statistics in their reach is simply memorize and regurgitate.
Put things in the correct order. Finish calculus then study stats.
Horseshit. You can use a distribution without having to integrate it in a great many scenarios that would benefit many people. Also, for discrete statistics (which is probably of more immediate use to most people) you can replace that nasty integral with addition.
You don't have to know everything about a field to use parts of it. As parent said, I think many, many people would benefit from commonsense concepts combining statistics and logic, just so you can make good decisions about purchases and such. Read a book called "Innumeracy" to see the level of stupid I'm talking about; calculus is so far beyond that kind of dumb it's sad. I would settle for people being able to intuitively understand the implications of Bayes' rule. Understanding why prior probabilities are important would be a big start, and there's no calculus in that.
For reference, I have had calc and stats as part of my math minor. In my job, I use statistics daily. I use things I learned in calculus alone far less often.
The business major's understanding of statistics is the most dangerous.
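To show how much the prior matters, and that no calculus is involved, here is a small Python sketch of Bayes' rule for a yes/no test. The numbers (a test that catches 99% of cases and wrongly flags 5% of healthy people) are hypothetical, as is the function name.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical numbers: the same test applied at two different base rates.
rare = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.05)
common = posterior(prior=0.30, sensitivity=0.99, false_positive_rate=0.05)
print(f"prior 0.1%: posterior = {rare:.3f}")
print(f"prior 30%:  posterior = {common:.3f}")
```

With the rare condition a positive result still means under a 2% chance of having it; with the common one the same positive result means about 89%. The test is identical in both cases, only the prior changed.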
Oh yeah. Are you a six sigma black belt? ;)
Re:93% of Programmers Think You're Wrong (Score:3, Informative)
I'm not assuming anything, just reading the question correctly: The question is NOT "if I flip two coins and THE FIRST ONE is heads..." (answer would then indeed be 50%), but "If I flip two coins and ONE OF THEM is heads..."
I'm listing all 4 combinations for 2 flips, and out of the 3 that satisfy the prerequisite ("one of them is heads") counting how many combinations turn up with the other one also being heads. There's one out of 3 possibilities, so that's 33%.
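For anyone who doesn't trust the counting, a quick simulation shows the two readings of the question side by side. This is a rough Python sketch assuming fair, independent coins; the function name, trial count, and seed are arbitrary.

```python
import random

def simulate(n_trials=200_000, seed=7):
    rng = random.Random(seed)
    first_heads = both_given_first = 0
    any_heads = both_given_any = 0
    for _ in range(n_trials):
        a, b = rng.random() < 0.5, rng.random() < 0.5  # True means heads
        if a:                      # reading 1: THE FIRST coin is heads
            first_heads += 1
            both_given_first += b
        if a or b:                 # reading 2: AT LEAST ONE coin is heads
            any_heads += 1
            both_given_any += a and b
    return both_given_first / first_heads, both_given_any / any_heads

p_first, p_any = simulate()
print(f"P(both | first is heads)        ~ {p_first:.3f}")
print(f"P(both | at least one is heads) ~ {p_any:.3f}")
```

The first estimate should land near 1/2 and the second near 1/3, which is the whole disagreement in this thread.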
Re:yea (Score:1, Informative)
Unfortunately, your posts have been modded down, even though it is a valid discussion point.
Obviously, you want this discussion to handle PoV on political views and the accompanying philosophies. In my profession I'm a pragmatist: one should view vexing problems from different perspectives (categorical, logical, libertarian) and choose 'wisely' (meaning, that which will eventually reach the common good, even though it means a personal gain at the start). Also, my methods of measuring the merit of a solution are not the test for morality, although they could be used for this.
Since the original discussion was dealing with logic versus statistics, I would like to stress that even very basic math operators, such as big-O (O), are fundamental to statistical methods as well, and in modelling often have an identical measure. Thus, when using computational theory, you could imply you are using statistical theory as well.
Re:Statistics is HARD (Score:2, Informative)
Thought experiment 1: this would be a significant finding, provided that you did not ask how many days have passed as the number you are asked for; if you ask for a number and they respond in a pattern, that is a significant finding. It would be statistically significant, however, that patterns are used if you instruct the person to follow such a specific pattern, because you removed the opportunity for variability by your instruction. You don't have a random sample if there is no variability in a population. I already covered this.
A sample, on the other hand, would not be one person responding with a number day after day after day; rather it would at least be hundreds (if not thousands) if you wish to extrapolate to the population (either people or numbers). You can't do it with 1 data point.
Your third paragraph is confusing, because it is a paraphrase of what I said, only it doesn't fit with everything else you say in your reply.
Second thought experiment: sex isn't male (neither are Y chromosomes); you are thinking of gender. Regardless, you would find a statistically significant relationship between males having a Y chromosome, because it's part of the definition of how you define gender. This would be like sampling a population of red cars to determine if they are red.
Try coming up with a population that has variability, so that taking a sample makes sense, and you will see that statistical significance matters.
Re:correlation != causation (Score:1, Informative)
Statistics is like Photography. A subject hard to master, yet taking, for instance, a photograph of a poor man will not solve poverty.
Re:Percent probability that Zed Shaw is a jerk (Score:1, Informative)
Imaginary conversation with statistician:
- So, you say that my measurements are phony because I do not have confidence intervals plotted? ... ...
- Yes, it is very unscientific!
- And what does a confidence interval of 95% mean? Does it mean that there is a 95% probability that the value is in that interval?
- Well, no, there is only 95% likelihood.
- And what does likelihood mean?
- Er... It is... It is a function you know, f(y) = P(Y=y|X=x) for random variables X and Y
- So it is not a probability?
- No it is not.
- So it does not guarantee anything, as it is pretty meaningless from a practical viewpoint. I would need prior probabilities to be able to interpret this number as a probability.
- Well...
- Does the confidence interval assume that my error distribution is Gaussian?
- Yes, of course, that is pretty standard.
- And what if it is not?
- It is unlikely.
- Why? What guarantees existing moments and nicely behaving distributions in the real world? I do not see the axioms of probability prohibiting this.
- No, but these are degenerate cases.
- Are there methods to make sure that my distribution is not some heavy-tailed one, but a light-tailed one?
- No.
- Is it true that Gaussians are frequent?
- Yes, because of the central limit theorem.
- Then the Cauchy distribution must be frequent also, as the ratio of two independent zero-mean Gaussians is Cauchy, isn't it?
- Yes.
- So can you still assert that a Gaussian is that likely as a generating pdf?

- So, we have a meaningless number, called likelihood, we have a meaningless section, called confidence, and a phony assumption of Gaussian error.

- And you call my measurements unscientific.
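For what the "95%" is usually taken to mean, here is a rough Python sketch under exactly the assumptions the dialogue questions: Gaussian noise with a known standard deviation. The true mean, sigma, sample size, and function name are made up. It repeats the experiment many times and counts how often the interval contains the true value.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(n_experiments=5000, n=30, true_mean=10.0, sigma=2.0, seed=3):
    """Fraction of 95% z-intervals (sigma known) that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)          # about 1.96
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_experiments):
        xbar = fmean([rng.gauss(true_mean, sigma) for _ in range(n)])
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / n_experiments

print(f"coverage ~ {coverage():.3f}")
```

The 95% describes the procedure over many repetitions, not any single interval, and the sketch says nothing about what happens when the noise is heavy-tailed rather than Gaussian.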
In 1976... (Score:3, Informative)
Re:93% of Programmers Think You're Wrong (Score:3, Informative)
I'm not sure why I'm wasting time responding to a troll but whatever.
> The question is 1 coin is heads, what is the probability that the other coin is heads. In other words, your girlfriend is pregnant. What are the odds that my girlfriend is also pregnant?
No, you read it wrong. What it's actually asking is (if we pretend all girlfriends have exactly a 50% chance of being pregnant): "two girlfriends exist. At least one of the two is pregnant. What are the odds that both girlfriends are pregnant?"
You just read it wrong and you're too stubborn to admit that you could ever be wrong, even though this puzzle is FIFTY YEARS OLD and is well documented all over the internet. Just see the Wikipedia article on it [wikipedia.org].
Re:93% of Programmers Think You're Wrong (Score:3, Informative)
Please see this [wikipedia.org]: this is a well-known puzzle over 50 years old, and I'm surprised that there are people on Slashdot who weren't familiar with it already.