
Why Programmers Need To Learn Statistics

David Gerard writes "Zed Shaw writes an impassioned plea to programmers: Programmers Need To Learn Statistics Or I Will Kill Them All. Quoting: 'I go insane when I hear programmers talking about statistics like they know s*** when it's clearly obvious they do not. I've been studying it for years and years and still don't think I know anything. ... I have taken a bunch of math classes, studied statistics in grad school, learned the R language, and read tons of books on the subject. Despite all of this I'm not at all confident in my understanding of such a vast topic. What I can do is apply the techniques to common problems I encounter at work. My favorite problem to attack with the statistics wolverine is performance measurement and tuning. All of this leads to a curse since none of my colleagues have any clue about what they don't understand. I'll propose a measurement technique and they'll scoff at it. I try to show them how to properly graph a run chart and they're indignant. I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.'"

  • by Greyfox ( 87712 ) on Saturday January 09, 2010 @07:37PM (#30710568) Homepage Journal
    Everything I needed to know about statistics I learned playing poker.
  • by Anonymous Coward on Saturday January 09, 2010 @07:49PM (#30710662)
  • by gardyloo ( 512791 ) on Saturday January 09, 2010 @08:02PM (#30710756)

    We know as much statistics as we need to know.

    Some know more, some less.

    That's either the most honest, insightful comment I've ever seen, or the most useless. I'm 92% sure, with an uncertainty of about +/-5%, that it's the latter.

  • by mmmmbeer ( 107215 ) on Saturday January 09, 2010 @08:13PM (#30710846)

    Let's see, we have one guy complaining about how none of his programmer coworkers understand statistics, and we have X coworkers who undoubtedly disagree with him. Since we do not know him or any of his colleagues to any meaningful degree, we have to assign equal weight to each of their opinions. Statistics then tells us there is a 1/(X+1) chance of his being right, and an X/(X+1) chance of their being right. We can assume that X >= 2 based on his ranting, therefore resulting in the odds favoring them by at least 2/3, and probably much more. Therefore it is only rational to assume they are correct.

  • by thesandtiger ( 819476 ) on Saturday January 09, 2010 @08:35PM (#30711006)

    I don't think it's hard - I just think it requires a different way of thinking than most programmers usually take to maths.

    As a programmer/developer who went into research (in social sciences, so it's really soft), I can say that in my experience stats is really closer to a programming language than it is to other maths. Here's why:

    1) You have a LOT of tools to pick from. What kind of analysis do you want to do? What kind will give you the most useful result? What kind is your data amenable to?

    2) You don't always have a clear choice as to which is the best for a given situation. Sometimes you need multiple different types of analysis to really get the full picture.

    3) Just because it's math doesn't always mean it's right. There's some crazy ass black-box magic stats stuff we use for one project of ours that, in theory, will let us figure out the demographic composition of an unknown target population. Maybe. Sometimes. If the wind is right. Or not.

    4) At the advanced levels, it's fucking insane. People who hack stuff like ultra optimized 3d engines with large quantities of assembler or whatever always wigged me out because my brain just doesn't work that way. With the really complex stats stuff it's the same way - I can plug and chug with the formulas, but I honestly have about as much comprehension of why some of the more advanced stuff works as my dog has of CPU design.

    5) If you know the basics, you know just enough to be dangerous and really piss off people who know what they're doing. Being able to run an ANOVA or determine correlation makes some people think they actually know what's going on because, hey, it's math. But a lot of people who just do the basic stuff think their results are more meaningful than they actually are - falling prey to the whole "it's statistically significant therefore it must be IMPORTANT" fallacy (when you can certainly have things that are "statistically significant" but actually have virtually no impact on the outcome).

    6) Even when people know their shit, they disagree. A fine example of this would be the Space Shuttle failure rate - you had people saying that the shuttle would suffer a critical failure from everywhere between 1 in 5 and 1 in 50,000 launches. And depending on what tools they used to do their analysis, they were correct. Same as with programming languages - depending on the problem, equally skilled programmers might pick entirely different languages to use because they think one part or another is more critical.
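
    A minimal sketch of point 5, assuming a plain two-sided z-test with a known common standard deviation and equal group sizes; the function name and all the numbers are made up for illustration. With a million samples per group, a difference of 0.1 on a scale whose standard deviation is 15 comes out "statistically significant" while the effect size stays negligible.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n):
    """Two-sided z-test for a difference in means (equal n, known common sd).

    Returns the z statistic, the two-sided p-value, and Cohen's d.
    """
    se = sd * math.sqrt(2.0 / n)             # standard error of the difference
    z = (mean_b - mean_a) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    d = (mean_b - mean_a) / sd               # standardized effect size
    return z, p, d

# Hypothetical numbers: a 0.1-point shift, sd of 15, one million samples per group.
z, p, d = two_sample_z(100.0, 100.1, 15.0, 1_000_000)
print(f"z = {z:.2f}, p = {p:.2g}, Cohen's d = {d:.4f}")
```

    The p-value here mostly reflects how much data there is; the effect size is what says whether anyone should care.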

    Honestly, I really enjoy stats - if I had to do it all over again I would probably have spent a LOT more time working with stats than I did as a programmer in my younger years - but I won't pretend that it's totally clear what tools to use when. The author of TFA should do well to realize that even fellow statisticians would probably slap the shit out of him over some of his beliefs about how to properly go about utilizing stats toolsets.

  • by HornWumpus ( 783565 ) on Saturday January 09, 2010 @08:38PM (#30711034)

    Statistics does not benefit everybody equally.

    I'd say that if someone has not completed calculus then any statistics in their reach is simply memorize and regurgitate.

    Put things in the correct order. Finish calculus then study stats.

    The business major's understanding of statistics is the most dangerous.

    They don't even know what they don't know.

    They can regurgitate the definition of standard deviation but don't remember what normal distribution means.

  • by Anonymous Coward on Saturday January 09, 2010 @08:44PM (#30711074)

    "Statisticians need to learn programming or I will kill them all." - by halivar (535827) on Saturday January 09, @06:43PM (#30710618) Homepage

    Well put, Halvar! Now, I'll add to it, as I have backgrounds in both areas he "bitches here" about.

    First of all:

    I'm in possession of degrees from both the business world (where I took STAT 1 & STAT 2 & "aced" both w/ A grades no less) & also Comp. Sci. & CIS concentration/minor (where you get exposure to a good deal of "higher mathematics" such as Calculus, & Discrete Math to name only a couple possibles)...

    LOL! Man... I "just loved" (not) his "logic & reasoning is inferior to evidence & observation"...

    (Especially since I know 1 VERY important thing: That stat teaches you 1 extremely IMPORTANT concept: It's ALL BASED ON SAMPLE SETS...)

    As to "sample sets"? Well, those are USUALLY either:


    1.) EASILY SKEWED (as in "4/5 dentists chew trident", oh "sure, sure", especially when they're on the corporate payroll (or paid off to say so by said corporation so their "evidence & observation looks good"))


    2.) IS THE SAMPLE SET LARGE & COMPREHENSIVE ENOUGH? (most?? Most are not, period)...


    Simply because you cannot:

    A.) Sample EVERYONE

    B.) Nor can you judge the veracity & accuracy of who you are sampling!


    E.G. #1 - Let's say I had a poll question of "Are Democrats better than Republicans?" & I sampled from a PRIMARILY REPUBLICAN AREA - So, that all "said & aside"??

    What kind of answers do you think I'd get???

    Would THAT be a "good/fair & representative sample set"????

    Answer = Hell no!

    Math people sometimes make me laugh... especially when they *THINK* they "know it all".

    Life's a BALANCE, people, & there are very few "absolutes", because people are not "binary". Human beings have a LOT of "shades of grey" (or, is it "gray"?? Inquiring minds, want to know, lol!)


    P.S.=> Personally - I feel that life's REAL answers & REAL problems, in my estimation & opinion, aren't going to even be answered by "hard sciences" alone...

    I actually tend to think that the REAL ANSWERS (for the REAL problems) will come from philosophers really!

    (E.G. #2 - The serious questions to answer, like "why is man unjust to man" for example).

    Yes, THAT coming from me may sound weird, especially coming from someone with fairly extensive classical education in the business sciences & computer sciences here in myself, but I do hold to that (and, all the math that comes with them like STATS, CALC, DISCRETE MATH, etc. et al, from the 'hard sciences'? They're JUST TOOLS that others should definitely use, but not "base all" on them, either, because they too can be misused, as in the examples above I note from stats itself))... apk

  • It's the Zed Effect (Score:4, Interesting)

    by greg_barton ( 5551 ) < minus herbivore> on Saturday January 09, 2010 @09:13PM (#30711292) Homepage Journal

    The Zed Effect: Whether you're right or wrong people will disagree with you just to piss you off.

  • by jc42 ( 318812 ) on Saturday January 09, 2010 @09:15PM (#30711312) Homepage Journal

    So what if it is 1 out of 10 million that it will happen.

    When I hear this sort of reasoning, I like to point out that with modern computers, something that happens only 1 time out of a million can very easily mean thousands of occurrences per day, each of which will get us a support call. This usually ends the discussion really fast, and they agree to properly implementing the "unlikely" edge cases.

    I've also heard the observation that in computing, statistical behavior is generally referred to as "bugs".
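
    The arithmetic behind that argument, as a rough sketch. The event volume is a hypothetical figure chosen only to show the scale, and the events are assumed to be independent.

```python
import math

def expected_hits(p, events_per_day):
    """Expected number of occurrences per day of an event with probability p."""
    return p * events_per_day

def prob_at_least_one(p, events_per_day):
    """P(at least one occurrence) = 1 - (1 - p)^n, computed without losing precision."""
    return -math.expm1(events_per_day * math.log1p(-p))

p = 1e-6                         # "one in a million"
events_per_day = 5_000_000_000   # hypothetical load: five billion operations a day

print(expected_hits(p, events_per_day))       # about 5000 occurrences per day
print(prob_at_least_one(p, events_per_day))   # effectively 1.0
print(prob_at_least_one(p, 1_000_000))        # about 0.63 even at a million events
```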

  • Wiki (Score:1, Interesting)

    by Anonymous Coward on Saturday January 09, 2010 @09:18PM (#30711332)
    I'll be honest, I didn't know who Zed Shaw was, so I fired up google. His wikipedia entry reads thus:

    Zed A. Shaw is a troll[1][2][3][4], writer, software developer, and musician, most commonly known for creating the Mongrel web server for Ruby web applications, as well as his controversial opinion pieces on technology, business, and technical communities. He is frequently referred to simply as 'Zed'.


  • by Improv ( 2467 ) <> on Saturday January 09, 2010 @09:22PM (#30711362) Homepage Journal

    In practice, statistics is an attempt to quantify messy, uncertain events into a figure. We can even measure the extent to which this works, roughly speaking. Your hard drive has a rough time-to-failure, based on analyses of the things that tend to go wrong in that system. Sure, any time it fails, it's not statistics that broke it; it's one of the kinds of problems captured in the statistical analysis. And sure, you could break it down further for disks and note that the controller has a different failure rate than some other component, just as a bridge has a number of possible failures. Problem is, for any of those, you could break it down further and get failure rates for subcomponents, regions, etc. So what? It's still useful to have statistical measures - the real world is complex, and statistics helps us capture things we otherwise couldn't.

    Programmers (particularly but not only young programmers) might not like to acknowledge any field but their own has any depth ("Everything is simple! Just do it my way", hence Ron Paul/Ayn Rand fanboyism and all sorts of other stupidities) - I don't know if there's a lot we can do but hope they grow out of it (It took me awhile to do it, as did a number of people I knew when I was younger, but I made it out).

    Basically, if your worldview doesn't wed empiricism and a reasonably flexible practical philosophy, your worldview is (if you err on the pro-logic end) too inflexible and you're going to miss out on standing on the shoulders of giants. Neither the logician nor the mystic understands the world.

  • by Daniel Dvorkin ( 106857 ) * on Saturday January 09, 2010 @09:57PM (#30711640) Homepage Journal

    Resampling-based statistics haven't replaced parametric models, and I doubt they ever will, for one very simple reason: as the available processing power grows, so does the amount of data. In my field, bioinformatics, the size and complexity of the data sets follows a Moore's Law of its own, and I don't think bioinformatics is unique in this. "Just bootstrap it" is easy to say, and certainly there have been many times when dealing with an analytically intractable distribution when I've done just that, but if the analytical solution takes minutes and the bootstrap solution takes weeks, you have to take this into account.
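
    For readers who have not met it, "just bootstrap it" looks roughly like the sketch below: a percentile bootstrap interval for the median, standard library only. The function name, the defaults, and the synthetic skewed data are assumptions for illustration. Note that the cost is the number of resamples times the cost of the statistic, which is the scaling problem described above.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic each time, and sort.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic, skewed sample (log-normal), standing in for real measurements.
rng = random.Random(42)
data = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]

lo, hi = bootstrap_ci(data)
print(f"median = {statistics.median(data):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```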

    Of course, resampling isn't the only way to look at problems non-parametrically. Often a good compromise is to go with rank-based statistics, which are fast and easy to calculate -- and you may not have an analytically tractable model for the distribution of the original data, but you don't have to, since by working with ranks you can define a distribution with good analytical properties. You still need to do some reality-checking exploratory data analysis, of course, but this is an approach that generally works well in practice.
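
    A hedged sketch of the rank-based route: a Mann-Whitney U statistic with a normal approximation for the p-value. This version applies no tie or continuity correction, so treat the p-value as approximate; the function names and sample values are invented for the example.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """U statistic for sample a, plus an approximate two-sided p-value."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # no tie correction
    z = (u1 - mu) / sigma
    return u1, math.erfc(abs(z) / math.sqrt(2.0))

a = [0.11, 0.12, 0.10, 0.13, 0.12, 60.0]   # one wild outlier
b = [0.20, 0.22, 0.21, 0.25, 0.23, 0.24]
print(mann_whitney_u(a, b))
# Shrinking the outlier from 60.0 to 0.5 leaves every rank, and so U, unchanged.
print(mann_whitney_u(a[:-1] + [0.5], b))
```

    Because only the ordering matters, the distribution of U under the null hypothesis is known regardless of how the raw data are distributed, which is the "good analytical properties" point.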

  • by Anonymous Coward on Saturday January 09, 2010 @10:03PM (#30711698)

    The issue is that the OP has a more realistic evaluation of his skills in the field. It's fine for a programmer to say "I'm not too good with statistics, could someone give me some advice?" It's not fine to say "I'm good at stats, I don't need your help" IF you're wrong. Going further, it's even worse if you overestimate your abilities and then ignore good advice, or offer bad advice as a result.

    At a minimum a lesson is that it's fine to ask for help. You impress people more with timely, working results than with a hacked together system because a bad understanding of the problem resulted in a lot of last minute changes when things didn't work.

  • by weston ( 16146 ) <westonsd&canncentral,org> on Saturday January 09, 2010 @10:03PM (#30711700) Homepage

    not understanding a topic that even you are unwilling to acknowledge mastery of.

    Personally, I think that little acknowledgment increases his credibility quite a bit. It suggests to me that he's actually spent some real time coming to grips not just with the glossy overview you get in a high school or college course but with some of the devilish subtleties of actually using the stuff.

    The funny thing about knowledge... the more it grows, the bigger you realize the frontier is. So, how good of a heuristic is apparent confidence?

  • by dbarclay10 ( 70443 ) on Saturday January 09, 2010 @10:05PM (#30711712)
    Your comment ("the reason people ignore you is because you're a dick") is clearly a troll, but it was also moderated Insightful ... which might also be a troll :)

    Nevertheless, assuming for a moment that you're being truthful in your expression, then I have this to say:

    This is what is wrong with the world today. Billions upon millions of morons who don't know what they're doing, and people trying to show them how to (or, hell, what the fuck - people trying to beat them into) do(ing) it the right way.

    You want these assholes who can't even figure out how to correctly measure something to build the bridge you drive over twice a day? How about the building you work in?

    Or I dunno, maybe you'd prefer having _only_ people who will point out errors when they see them working on it? How about your doctor? You want your operating room filled with maybe one smart guy who recognizes an error and six people who don't know any better? And you're saying that, when the smart guy recognizes the error and tries to point it out (no matter HOW he does it, though I'm betting the original poster isn't that much of an asshat at work), he's being a dick?

    Christ, what's wrong with you? Seriously?
  • by Anonymous Coward on Saturday January 09, 2010 @10:20PM (#30711820)

    If you think that then answer the following problem:
    If I flip two coins and one of them is heads, what are the odds the other one is also heads?
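
    The puzzle can be settled by enumeration, as sketched below. The answer depends on how the wording is read: taking "one of them is heads" to mean "at least one is heads" gives 1/3, while reading it as "a specific coin, say the first, is heads" gives 1/2. Both readings are assumptions about what the poster meant.

```python
from fractions import Fraction
from itertools import product

# All four equally likely outcomes of two fair coin flips.
outcomes = list(product("HT", repeat=2))

def p_both_heads_given(condition):
    """P(both heads | condition), by counting equally likely outcomes."""
    kept = [o for o in outcomes if condition(o)]
    hits = [o for o in kept if o == ("H", "H")]
    return Fraction(len(hits), len(kept))

p_at_least_one = p_both_heads_given(lambda o: "H" in o)       # HH, HT, TH remain
p_first_is_head = p_both_heads_given(lambda o: o[0] == "H")   # HH, HT remain

print(p_at_least_one)    # 1/3
print(p_first_is_head)   # 1/2
```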

  • by SuperKendall ( 25149 ) on Saturday January 09, 2010 @11:59PM (#30712388)

    He's just as arrogantly claiming that he's right and they're wrong.

    No he doesn't.

    He claims that programmers need to understand statistics more. The people he is talking about are therefore not wrong - they are ignorant.

    But that term is loaded with negative meaning; it's more accurate to say they are like a variable named "statistics" with a value that has never been set. Basically, they don't know what they are missing.

    It's like when programmers try to argue about how a language is bad when they've never used it. How would they know? Yet many without understanding of statistics are saying the same thing, they don't need to know any more.

    I know enough to know statistics can be a valuable tool. Why would you not want another tool that could help you? The people who refuse to do so are less than they could be (as a programmer).

  • by ShakaUVM ( 157947 ) on Sunday January 10, 2010 @12:03AM (#30712402) Homepage Journal

    >>Spiking vs non-spiking is something pretty easy to see when you glance at the data.

    Yeah, in fact, the way that he presents it is bad statistics. =)

    If the problem is that one out of 1000 queries is taking a minute to return instead of 0.1 seconds, then using the std deviation to describe the problem is nonsense. It is not a Gaussian distribution!

    But of course someone who "has spent his life studying statistics and even R language" would know that, right? :p

    Instead, as you point out, any programmer who did the same testing would see that one out of a thousand queries were taking far too long, and come to the same conclusion as him, without making the ghost of Gauss cry.
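
    To put numbers on the 1-in-1000 example, here is a small sketch using exactly the figures quoted above (999 queries at 0.1 seconds, one at a minute). The 1-second threshold for "slow" is an arbitrary assumption.

```python
import statistics

# 999 fast queries and a single one-minute spike, as in the example above.
latencies = [0.1] * 999 + [60.0]

mean = statistics.mean(latencies)       # pulled up to about 0.16 s by one query
sd = statistics.pstdev(latencies)       # about 1.89 s, larger than the mean itself
median = statistics.median(latencies)   # 0.1 s, what a typical query looks like
worst = max(latencies)                  # 60 s, the actual problem
slow = sum(1 for t in latencies if t > 1.0)  # count of queries over 1 s

print(f"mean = {mean:.4f} s, sd = {sd:.2f} s")
print(f"median = {median} s, max = {worst} s, slow queries = {slow}")
```

    "0.16 plus or minus 1.89 seconds" describes no query that actually ran; the median, the maximum, and a count of outliers say what happened.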

  • by upuv ( 1201447 ) on Sunday January 10, 2010 @01:49AM (#30712832) Journal

    I hear you, I do performance engineering of web based systems. The developers, the managers, the testers, the architects all have no clue. You are correct here.

    However, if you cannot present your "theory" of how to do something in a dumbed-down enough format, then who cares? Because the pretty graph is pointless. It will be misinterpreted, misunderstood, and misused.

    All the stats theory on the planet will not get you past the dumb manager or developer. Don't lose sleep over this. There is no point. Simply find metrics in your analysis procedure that do mean something to these people. They may not be the total picture but they are something. Build a reputation for being correct by starting with simple things. You are always going to butt heads with a know-it-all developer / architect / manager. Fine, let them go off and waste money and time. They will be found out as morons in time. You do your thing and simply become the guy to ask about performance and how to do this.

    Being understated and consistently showing above-average results for your work is how you will rise up. Being an A-hole about it is not going to help anyone. As a matter of fact I would can your butt for being a D#ck.

  • by Anonymous Coward on Sunday January 10, 2010 @02:41AM (#30713022)

    No. You cut out a large portion of my comment.

    I said people ignore Zed
    a) because he's a dick.
    b) not because they don't understand statistics.

    So what I am saying is Zed is not the one single 'smart guy' surrounded by a whole lot of incompetent assholes, he is one of those guys who thinks he knows better than everyone else about everything.
    He can't understand, and can't be made to understand why most of the crap he likes to bang on about is irrelevant to the problem at hand and so he is largely ignored. If he thinks women listen to him more, it's simply because women are generally better at *seeming* to listen to dickheads in order to make them shut up.

    One of my degrees is in experimental physics. I have a strong understanding of the application of statistics to measured data. I often take performance measurements without considering deviation, because deviation is not relevant to what I am doing at the time. Apparently this makes Zed want to kill me.

  • by Kludge ( 13653 ) on Sunday January 10, 2010 @06:59AM (#30713744)

    I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.

    I work with a number of statisticians and I have the opposite problem. They look at the data, apply mathematical transforms to it, and come to a conclusion, whether that conclusion makes any sense or not. They make little attempt to reason that the data may be flawed (which experiments often are), or does not really represent what we are trying to measure, or that they are using the wrong statistic to summarize the effect. It is very frustrating.

  • by AliasMarlowe ( 1042386 ) on Sunday January 10, 2010 @11:47AM (#30714738) Journal

    Statistics are important; it is highly unlikely that anyone with an MBA will know how or why, but they want them.

    In fact, it is almost a certainty that any given MBA will either lack statistical expertise or will misapply it unthinkingly in a cook-book style. The pseudo-statistics behind Six Sigma comes immediately to mind.

    I had repeated theoretical discussions with the four MBA experts who "trained" us (a group of six PhDs in Physics & Engineering doing R&D) in the ways of Six Sigma. There were problems with the statistical theory they presented right from the start - and they were clearly unaccustomed to being contradicted along the lines of "that's not right/applicable in this case, and here's why". For instance, they failed to acknowledge that non-Gaussian distributions could exist, then refused to accept that procedures should be adapted to the data if it was non-Gaussian. Next, they adamantly refused to believe that the 1.5 Z shift hypothesis was supported only by a few studies, all relying on a single dataset from the 1950s for die-based manufacture, and totally irrelevant to most other processes. The Six Sigma books all say "many studies" over decades support the Z shift hypothesis, but fail to cite them, and our MBA experts could not cite any such studies either. Thirdly, they refused to accept that an additional mode of variability (not in the Six Sigma beliefs) existed in processes with feedback (such as recycle lines or controllers). In many cases, this mode guarantees non-Gaussian variability in the process output.
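
    For context, the arithmetic that the 1.5 Z shift feeds into can be sketched as follows, assuming a Gaussian process and counting only the nearer specification limit (the convention usually used to reach the well-known 3.4 defects-per-million figure). The function names are illustrative.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def dpmo(sigma_level, shift=1.5):
    """Defects per million opportunities past one spec limit after a mean shift."""
    return upper_tail(sigma_level - shift) * 1_000_000

print(f"{dpmo(6.0):.2f}")        # about 3.40 with the 1.5 sigma shift
print(f"{dpmo(6.0, 0.0):.5f}")   # about 0.001 with no shift at all
```

    Both numbers rest entirely on the normal tail, so neither the 3.4 figure nor the shift tells you anything once the process output is non-Gaussian.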

    Their advice was that to pass the course, we should ignore our knowledge of statistics (which they acknowledged was far better than theirs) and of process variability, and just "apply the documented methods". We did, and we all passed the course. Then we ignored the Six Sigma bogus statistics bullshit and got on with our jobs using proper statistics to analyze and solve problems in variability with the products we were developing.

    MBAs seem to want statistics, but the vast majority appear to lack the training in how to generate proper statistics, or how to use them competently if someone else supplies them. Most MBAs appear to think the world is described adequately using Gaussian distributions, and a few "experts" know the Weibull distribution or the t-distribution. Other distribution types (Poisson, discrete/categorical, etc.) are totally foreign, and methods of inference beyond simple unconditional analyses are also quite alien to them.

    I also understand that people who are good at it are rare.

    Perhaps not as rare as you might think. But those who have some aptitude in statistics know enough to keep their mouths shut when the data tells them to. MBAs on the other hand, ignorant of their own ignorance, are as verbally promiscuous as politicians...
