Math Programming

Why Programmers Need To Learn Statistics 572

Posted by Soulskill
from the because-they-suck-at-poker dept.
David Gerard writes "Zed Shaw writes an impassioned plea to programmers: Programmers Need To Learn Statistics Or I Will Kill Them All. Quoting: 'I go insane when I hear programmers talking about statistics like they know s*** when it's clearly obvious they do not. I've been studying it for years and years and still don't think I know anything. ... I have taken a bunch of math classes, studied statistics in grad school, learned the R language, and read tons of books on the subject. Despite all of this I'm not at all confident in my understanding of such a vast topic. What I can do is apply the techniques to common problems I encounter at work. My favorite problem to attack with the statistics wolverine is performance measurement and tuning. All of this leads to a curse since none of my colleagues have any clue about what they don't understand. I'll propose a measurement technique and they'll scoff at it. I try to show them how to properly graph a run chart and they're indignant. I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.'"

  • Statistics is HARD (Score:5, Informative)

    by omb (759389) on Saturday January 09, 2010 @06:54PM (#30710694)
    Statistics is HARD, for two reasons:

    (a) Probability theory, on which all practical Statistics is based, is both (i) counter-intuitive and (ii) difficult

    (b) The very Mathematics on which it is based is obscure

    And, worst of all, it is uniformly badly taught, even in good universities, and the "Statistics for XXX" courses are uniformly awful: the blind leading the blind.

    Lastly, it is very hard to get a straight answer from a mathematical Statistician.
  • by toby (759) * on Saturday January 09, 2010 @07:03PM (#30710766) Homepage Journal

    Nothing new to see here.

  • Stats? Fuck that. (Score:2, Informative)

    by delysid-x (18948) on Saturday January 09, 2010 @07:05PM (#30710778)

    Statistics is WAY beyond what a programmer cares about. Logic is all that matters. Statistics->logic is the problem of the software engineer, not the programmer.

  • by doublegauss (223543) on Saturday January 09, 2010 @07:35PM (#30711010)

    You probably still think I am a lunatic, but hear me out.

    You don't qualify as a lunatic; just as someone who has no idea of what he's talking about. Absolutely no idea. Your post, my friend, is so full of ideas you obviously misunderstood that I won't even attempt to make a list.

    And yes, I do statistics for a living.

  • by NitWit005 (1717412) on Saturday January 09, 2010 @07:48PM (#30711118)
    From his complaints, I can tell knowledge isn't the real issue. Testing performance takes a huge amount of time. You need to simulate other programs running, multiple users and make sure the test matches what real users might do. Generally, this requires writing completely independent test programs and charting the logging from them. People just don't want to go to that kind of effort. It can take weeks just to create proper tests for complex programs like web servers.
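A full load test is a lot of work, but the measurement step itself can start small. Below is a minimal sketch, assuming a single-process benchmark; `measure` and `workload` are hypothetical names invented for illustration, and the normal-approximation interval is only a rough first look because timing data is often skewed.

```python
import statistics
import time


def measure(func, runs=30):
    """Time func() `runs` times and summarize the samples (in seconds)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Rough 95% interval for the mean via the normal approximation.
    half_width = 1.96 * stdev / len(samples) ** 0.5
    return {
        "n": len(samples),
        "mean": mean,
        "stdev": stdev,
        "ci": (mean - half_width, mean + half_width),
    }


def workload():
    # Hypothetical stand-in for the code under test.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(measure(workload))
```

Reporting the spread alongside the mean is the point: a single timing number says nothing about run-to-run variation.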
  • Re:Title fail. (Score:4, Informative)

    by girlintraining (1395911) on Saturday January 09, 2010 @07:48PM (#30711124)

    Don't you mean free()?

    #include <stdhumor.h>
    #include <stdlib.h>   /* free() */

    void demalloc(void *ptr);

    void demalloc(void *ptr)
    {
        /* I meant to say */
        free(ptr);
    }

  • by jackchance (947926) on Saturday January 09, 2010 @08:17PM (#30711326) Homepage

    Before computers, stats involved using parametric tests (t-tests, ANOVA, etc.), which made assumptions like "the data comes from an underlying normal distribution". BTW, in stats terms "normal" means "Gaussian" [wikipedia.org].

    Now, with cheap and fast computers, we can actually compute the confidence intervals non-parametrically through permutation tests and bootstrapping [wikipedia.org] without assuming anything about underlying distributions. In most cases, this non-parametric test is the "right thing to do". Most of the time, the results are the same as using a parametric test.
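For readers who have not seen bootstrapping, here is a minimal sketch of a percentile bootstrap confidence interval using only the Python standard library. The function name `bootstrap_ci` and the sample latencies are assumptions made up for illustration, and the percentile method is only one of several bootstrap variants.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


if __name__ == "__main__":
    # Hypothetical response times in ms, skewed by two slow requests.
    latencies = [12, 14, 13, 15, 11, 90, 13, 12, 16, 14, 13, 120, 12, 15, 14]
    low, high = bootstrap_ci(latencies)
    print(f"mean = {statistics.mean(latencies):.1f} ms, 95% CI = ({low:.1f}, {high:.1f})")
```

No distribution is assumed here: the interval comes entirely from resampling the observed data, which is why it copes with the skewed sample.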

    However, a HUGE disaster in empirical science has been the problem of multiple comparisons. With computers it is so easy to compute correlations and significance tests between every possible slice of your data set. Many "scientists" don't have good statistical knowledge and pray at the altar of "p < 0.05". They don't know about or understand the problem of multiple comparisons [wikipedia.org]. So they do 20 tests, find one that comes out p < 0.05 and write a paper about it. They don't get that if you do 20 tests you are very very very likely to find one that comes out p < 0.05.

    Anyone who has access to excel or matlab can do this little experiment.

    samp = randn(50, 1);              % 50 normally distributed random numbers

    sig = zeros(1, 100);
    for x = 1:100
        test = randn(50, 1);          % 50 more draws (mean=0, var=1)
        sig(x) = ttest2(samp, test);  % 1 if "significant" at p < 0.05
    end

    Now look at the sig vector. OMG, about 5% of the tests came out significant!!!

    Now you are writing a paper all about how x is linked to y. But you are essentially throwing dice and then writing a paper about why it came up '3-3'.
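The "20 tests" claim can be put in numbers. The sketch below assumes every null hypothesis is true and the tests are independent, so each p-value is uniform on [0, 1]; the function names are hypothetical, and the Bonferroni threshold is shown only as one common correction, not something the comment above proposes.

```python
import random


def familywise_error(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


def simulate(n_tests, alpha=0.05, n_studies=20000, seed=1):
    """Fraction of simulated studies with at least one 'significant' result.

    Assumes every null hypothesis is true, so p-values are uniform on [0, 1].
    """
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return hits / n_studies


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    print(f"analytic:  {familywise_error(20):.3f}")
    print(f"simulated: {simulate(20):.3f}")
    print(f"with Bonferroni: {familywise_error(20, bonferroni_alpha(20)):.3f}")
```

With 20 tests the chance of at least one false positive is 1 - 0.95^20, roughly 64%, so a lone "p < 0.05" among 20 comparisons is closer to the expected outcome than to a discovery.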

  • by Evil Shabazz (937088) on Saturday January 09, 2010 @08:53PM (#30711600)
    So I read through his article. Yes, the whole mindless rant. The conclusion that one should REALLY draw from it is: Zed Shaw is a douche with Asperger's who clearly feels like his own personal area of expertise is underappreciated. Hey Zed, get over it.
  • by Selfbain (624722) on Saturday January 09, 2010 @08:55PM (#30711620)
    I like how the first part of his Wikipedia article says "Zed A. Shaw is a troll" with four citations.
  • by brian_tanner (1022773) on Saturday January 09, 2010 @08:56PM (#30711630)
    Wow. What class did you take that says if you don't know something you should assume equal probability?

    I don't know if there is an invisible elephant in my kitchen, so I guess I should assign equal probability to both outcomes. I also don't really know how Baccarat works, I guess my odds are 50/50.

    Without knowing something about him or his coworkers, you by definition cannot make any statistical statements. To make any statements, you would first need to make some observations. This is how statistics is different from logic. Statistics is grounded in data.

    I don't agree with Zed, but you may have just proved his point.
  • by obarthelemy (160321) on Saturday January 09, 2010 @09:57PM (#30712078)

    I'm sure it's not 50%, and not 25%

    heads=1, tails = 0

    0-0 0-1 1-0 1-1

    so if one of them is 1, there's a 33.33% chance the other is 1 too.

    I can work it out that way for 2 binary possibilities. I couldn't generalize it to x coins with y sides :-/
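The same counting argument does generalize by brute-force enumeration. This is a sketch under one reading of the question: all coins are fair, we are told at least one shows a given face, and we ask for the chance that every coin shows it. The name `prob_all_match` is made up for illustration.

```python
from fractions import Fraction
from itertools import product


def prob_all_match(n_coins=2, n_sides=2, face=1):
    """P(every coin shows `face` | at least one coin shows `face`), fair coins."""
    outcomes = list(product(range(n_sides), repeat=n_coins))
    # Keep only the outcomes consistent with what we were told.
    consistent = [o for o in outcomes if face in o]
    favorable = [o for o in consistent if all(x == face for x in o)]
    return Fraction(len(favorable), len(consistent))


if __name__ == "__main__":
    print(prob_all_match(2, 2))  # two ordinary coins
    print(prob_all_match(3, 2))  # three coins
    print(prob_all_match(2, 6))  # two six-sided dice
```

Only one outcome has every coin on the given face, and y^x - (y-1)^x outcomes contain that face at least once, so the answer for x coins with y sides is 1 / (y^x - (y-1)^x), which gives 1/3 for two ordinary coins.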

  • 1.) EASILY SKEWED (as in "4/5 dentists chew Trident", oh "sure, sure", especially when they're on the corporate payroll, or paid off to say so by said corporation so their "evidence & observation" looks good)

    and

    2.) IS THE SAMPLE SET LARGE & COMPREHENSIVE ENOUGH? (most?? Most are not, period)...

    You know, that particular citation has made me wonder in the past, but not enough to actually research it. So, I went off looking for more information and found it [straightdope.com].

        The statistic was generated from a July 1976 survey.

        The sample group for this statistic was 1,200 dentists. These dentists were hand picked by the research company, probably with good reason.

        They were asked, what advice would they give gum-chewing patients

        1) sugared gum
        2) sugarless gum
        3) no gum at all.

        Sugarless gum got 85% of the vote. Not terribly surprising. I'd be fairly confident that their time had been paid for, or at the very least they were told "This survey is being done for Trident Sugarless Gum." That is only speculation, so hush up.

        17/20 doesn't really sound very good. It just doesn't stick in your head. 4/5 is close enough, even though it reduces your answer to 80% (ahhh, a lie). Since these are marketing folks, I'm sure they pushed all kinds of values past focus groups, until "4 in 5" was accepted as most favorable.

        As the link cites, they're fairly confident that the "sugared gum" answer got at least one response. There's always someone that'll take the obvious wrong answer. If you don't believe that, look at any Slashdot poll. :)

        What they don't say is how many of the 1,200 samples were dropped. I'm sure there were non-responses, and they could have easily added any number of unfavorable answers in as non-responses. Of course, they couldn't have 100% in their favor, so they had to keep some.

  • by Anonymous Coward on Sunday January 10, 2010 @12:05AM (#30712658)

    I'd say that if someone has not completed calculus then any statistics in their reach is simply memorize and regurgitate.

    Put things in the correct order. Finish calculus then study stats.

    Horseshit. You can use a distribution without having to integrate it in a great many scenarios that would benefit many people. Also, for discrete statistics - which is probably of more immediate use to most people - you can replace that nasty integral with addition.

    You don't have to know everything about a field to use parts of it. As parent said, I think many, many people would benefit from common-sense concepts combining statistics and logic, just so you can make good decisions about purchases and such. Read a book called "Innumeracy" to see the level of stupid I'm talking about - calculus is so far beyond that kind of dumb it's sad. I would settle for people being able to intuitively understand the implications of Bayes' rule. Understanding why prior probabilities are important would be a big start, and there's no calculus in that.
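To make the point about prior probabilities concrete, here is a small sketch of Bayes' rule applied to a diagnostic test. The numbers (1% prevalence, 99% sensitivity, 5% false-positive rate) are hypothetical values chosen for illustration, not data from any real test.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) by Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical rare condition: 1% of people have it.
    print(f"rare condition:   {posterior(0.01, 0.99, 0.05):.3f}")
    # Same test, but the condition is as likely as not beforehand.
    print(f"common condition: {posterior(0.50, 0.99, 0.05):.3f}")
```

With these assumed numbers a positive result on the rare condition leaves only about a 1-in-6 chance of actually having it, because false positives from the healthy 99% outnumber the true positives. Nothing here needs more than multiplication and division.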

    For reference, I have had calc and stats as part of my math minor. In my job, I use statistics daily. I use the things I learned in calculus alone far less often.

    The business major's understanding of statistics is the most dangerous.

    Oh yeah. Are you a six sigma black belt? ;)

  • by obarthelemy (160321) on Sunday January 10, 2010 @01:41AM (#30713024)

    I'm not assuming anything, just reading the question correctly: The question is NOT "if I flip two coins and THE FIRST ONE is heads..." (answer would then indeed be 50%), but "If I flip two coins and ONE OF THEM is heads..."

    I'm listing all 4 combinations for 2 flips, and out of the 3 that satisfy the prerequisite ("one of them is heads") counting how many combinations turn up with the other one also being heads. There's one out of 3 possibilities, so that's 33%.
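The difference between the two readings of the question can also be checked by simulation. A minimal sketch, assuming fair and independent coins; `simulate_two_coins` is a hypothetical helper name.

```python
import random


def simulate_two_coins(trials=200_000, seed=42):
    """Estimate P(both heads) under the two readings of the puzzle."""
    rng = random.Random(seed)
    first_heads = both_given_first = 0
    any_heads = both_given_any = 0
    for _ in range(trials):
        a = rng.random() < 0.5  # True means heads
        b = rng.random() < 0.5
        if a:  # reading 1: THE FIRST coin is heads
            first_heads += 1
            both_given_first += b
        if a or b:  # reading 2: AT LEAST ONE coin is heads
            any_heads += 1
            both_given_any += a and b
    return both_given_first / first_heads, both_given_any / any_heads


if __name__ == "__main__":
    p_first, p_any = simulate_two_coins()
    print(f"given the first is heads:      {p_first:.3f}")
    print(f"given at least one is heads:   {p_any:.3f}")
```

The first estimate settles near 1/2 and the second near 1/3, which matches the enumeration: conditioning on "at least one" keeps three equally likely outcomes instead of two.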

  • Re:yea (Score:1, Informative)

    by Anonymous Coward on Sunday January 10, 2010 @04:26AM (#30713512)

    Unfortunately, your posts have been modded down, even though it is a valid discussion point.

    Obviously, you want this discussion to handle PoV on political views and the accompanying philosophies. In my profession I'm a pragmatist: one should view vexing problems from different perspectives (categorical, logical, libertarian) and choose 'wisely' (meaning, that which will eventually reach the common good, even though it means a personal gain at the start). Also, my methods of measuring the merit of a solution are not the test for morality, although they could be used for this.

    Since the original discussion was dealing with logic versus statistics, I would like to stress that even very basic mathematical notions, such as big-oh (O), are fundamental to statistical methods as well, and in modelling often have an identical measure. Thus, when using computational theory, you could imply you are using statistical theory as well.

  • by kramerd (1227006) on Sunday January 10, 2010 @05:01AM (#30713624)

    Thought experiment 1 - this would be a significant finding, provided that you did not ask how many days have passed as the number you are asked for; if you ask for a number and they respond in a pattern, that is a significant finding. It would be statistically significant, however, that patterns are used if you instruct the person to follow such a specific pattern, because you removed the opportunity for variability by your instruction. You don't have a random sample if there is no variability in a population. I already covered this.

    A sample, on the other hand, would not be one person responding with a number day after day after day; rather it would at least be hundreds (if not thousands) if you wish to extrapolate to the population (either people or numbers). You can't do it with 1 data point.

    Your third paragraph is confusing, because it is a paraphrase of what I said, only it doesn't fit with everything else you say in your reply.

    Second thought experiment - Sex isn't male (neither are y-chromosomes), you are thinking of gender. Regardless, you would find a statistically significant relationship between males having a y-chromosome, because it's part of the definition of how you define gender. This would be like sampling a population of red cars to determine if they are red.

    Try coming up with a population that has variability, so that taking a sample makes sense, and you will see that statistical significance matters.

  • by Anonymous Coward on Sunday January 10, 2010 @07:06AM (#30713906)

    Statistics is like photography: a subject that is hard to master, yet taking a photograph of a poor man, for instance, will not solve poverty.

  • by Anonymous Coward on Sunday January 10, 2010 @07:45AM (#30714016)

    Imaginary conversation with statistician:

      - So, you say that my measurements are phony because I do not have confidence intervals plotted?
      - Yes, it is very unscientific!
      - And what does a confidence interval of 95% mean? Does it mean that there is a 95% probability that the value is in that interval?
      - Well, no, there is only 95% likelihood.
      - And what does likelihood mean?
      - Er... It is... It is a function you know, f(y) = P(Y=y|X=x) for random variables X and Y
      - So it is not a probability?
      - No it is not.
      - So it does not guarantee anything, as it is pretty meaningless from a practical viewpoint. I would need prior probabilities to be able to use this number as a probability.
      - Well...
      - Does the confidence interval assume that my error distribution is gaussian?
      - Yes, of course, that is pretty standard.
      - And what if it is not?
      - It is unlikely.
      - Why? What guarantees existing moments and nicely behaving distributions in the real world? I do not see the axioms of probability prohibiting this.
      - No, but these are degenerate cases.
      - Are there methods to make sure that my distribution is not some heavy tail one, but a light tail one?
      - No.
      - Is it true that gaussians are frequent?
      - Yes, because of the central limit theorem.
      - Then the Cauchy distribution must be frequent also, as the ratio of two gaussians is Cauchy, isn't it?
      - Yes.
      - So can you still assert that gaussian is that likely as a generating pdf?
      - ...
      - So, we have a meaningless number, called likelihood, we have a meaningless section, called confidence, and a phony assumption of gaussian error.
      - ...
      - And you call my measurements unscientific.
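The Cauchy remark in the dialogue is easy to check numerically. The sketch below assumes two independent zero-mean, unit-variance Gaussians, whose ratio follows a standard Cauchy distribution, and compares how often each variable lands more than 10 units from zero; the function names are hypothetical.

```python
import math
import random


def tail_fractions(n=100_000, threshold=10.0, seed=7):
    """Fraction of samples with |x| > threshold: Gaussian vs. ratio of Gaussians."""
    rng = random.Random(seed)
    normal_tail = ratio_tail = 0
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rng.gauss(0, 1)
        normal_tail += abs(x) > threshold
        # x / y for independent zero-mean Gaussians is standard Cauchy.
        # An exact zero denominator is skipped; it is vanishingly unlikely.
        if y != 0 and abs(x / y) > threshold:
            ratio_tail += 1
    return normal_tail / n, ratio_tail / n


def cauchy_tail(threshold):
    """Exact P(|X| > threshold) for a standard Cauchy variable."""
    return 1 - (2 / math.pi) * math.atan(threshold)


if __name__ == "__main__":
    normal, ratio = tail_fractions()
    print(f"Gaussian beyond 10:        {normal:.4f}")
    print(f"ratio of Gaussians beyond 10: {ratio:.4f} (exact {cauchy_tail(10):.4f})")
```

About 6% of the ratio samples fall beyond 10, an event that is practically impossible for a unit Gaussian. This is the heavy-tail behaviour the dialogue is pointing at: the Cauchy distribution has no finite mean or variance, so methods that assume them can mislead.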

  • In 1976... (Score:3, Informative)

    by alispguru (72689) <bane.gst@com> on Sunday January 10, 2010 @09:48AM (#30714450) Journal
    ... I ran into a professor of statistics who said that computers were going to be a passing fad in his field.
  • by Gendou (234091) on Monday January 11, 2010 @06:30AM (#30721500) Homepage

    I'm not sure why I'm wasting time responding to a troll but whatever.

    > The question is 1 coin is heads, what is the probability that the other coin is heads. In other words, your girlfriend is pregnant. What are the odds that my girlfriend is also pregnant?

    No, you read it wrong. What it's actually asking is (if we pretend all girlfriends have exactly a 50% chance of being pregnant): "two girlfriends exist. At least one of the two is pregnant. What are the odds that both girlfriends are pregnant?"

    You just read it wrong and you're too stubborn to admit that you could ever be wrong, even though this puzzle is FIFTY YEARS OLD and is well documented all over the internet. Just see the Wikipedia article on it [wikipedia.org].

  • by Gendou (234091) on Monday January 11, 2010 @06:32AM (#30721516) Homepage

    Please see this [wikipedia.org] -- this is a well-known puzzle over 50 years old, and I'm surprised that there are people on Slashdot who weren't familiar with it already.
