Forgot your password?
Programming Math

What Every Programmer Should Know About Floating-Point Arithmetic 359

Posted by Soulskill
from the gaining-understanding-bit-by-bit dept.
-brazil- writes "Every programmer forum gets a steady stream of novice questions about numbers not 'adding up.' Apart from repetitive explanations, SOP is to link to a paper by David Goldberg which, while very thorough, is not very accessible for novices. To alleviate this, I wrote The Floating-Point Guide, as a floating-point equivalent to Joel Spolsky's excellent introduction to Unicode. In doing so, I learned quite a few things about the intricacies of the IEEE 754 standard, and just how difficult it is to compare floating-point numbers using an epsilon. If you find any errors or omissions, you can suggest corrections."
This discussion has been archived. No new comments can be posted.

What Every Programmer Should Know About Floating-Point Arithmetic

Comments Filter:
  • Analog Computers (Score:3, Informative)

    by Tisha_AH (600987) <> on Sunday May 02, 2010 @11:42AM (#32064180) Journal

    I seems to me that this problem would pop up any time you worked with an irrational number.

    Back in the early days the analog computer was used for things like ballistic calculations. I would think that they would be less prone to this type of problem.

    Linearity may still be an issue (analog systems have their own set of problems).

    • Re: (Score:3, Insightful)

      by maxume (22995)

      Precision isn't that big a deal (we aren't so good at making physical things that 7 decimal digits become problematic, even on something the scale of an aircraft carrier, 6 digits is enough to place things within ~ 1 millimeter).

      The bigger issue is how the errors combine when doing calculations, especially iterative calculations.

      • Re: (Score:3, Insightful)

        by Chris Mattern (191822)

        The problem is, if you're doing a long string of calculations--say a loop that repeats calculations thousands of times with the outcome of the last calculation becoming the input for the next (approximating integrals often does this) then the rounding errors can accumulate if you're not paying attention to how the floating point works.

      • by pentalive (449155)
        Don't CAD systems represent everything in very large integers?
      • by JWSmythe (446288) <(moc.ehtymswj) (ta) (ehtymswj)> on Sunday May 02, 2010 @07:17PM (#32067320) Homepage Journal

            Well, it would depend on what you're doing the calculations for, and how you're doing them.

            Say it used diesel fired engines, and you were instructed to calculate the fuel consumption per engine revolution, and then apply that to a trip. I don't know the specifics on an aircraft carrier, so I'll just make up some numbers.

            At full speed, the ship travels at 12 nautical miles per hour (knots). The engines spin at 300rpm. It burns 1275 gallons of fuel per hour.

          That's 18,000 engine revolutions per hour, or 0.0708334 gallons per revolution.

            1,000 miles at 12 knots = 84.3333334 hours.

            If you are to travel 1,000 nautical miles, 18,000 * 83.3333334 = 1,500,000.0012 revolutino. At 0.0707334 gallons per revolution, that would be 106,100.100085 gallons.

            But knowing that it burns 1,275 gallons per hour at 12 knots, and you will be traveling for 83.3333334 hours, you will require 106,250.000085 gallons. Using the measure of gallons per revolution to try to come up with a very precise number to work with, you've actually fallen short by 150 gallons for the trip. I can imagine a slight embarrassment by having your aircraft carrier run out of fuel just 7 minutes from its destination.

            Using 7 decimal points of precision, when it's multiplied so many times, it can easily cause errors.

            I'd be pretty sure they aren't counting gallons per revolution, I only used that as an example of where errors could happen. If you're considering the full length of the ship, 0.1 inches is more than enough to believe you have a good number. :) I believe due to expansion of the metals, the total length of the ship may change more than that depending on if it's a hot or cold day. :)

        • Re: (Score:3, Insightful)

          by maxume (22995)

          Sure, you can make it a problem, but it isn't particularly insidious.

          And the part where I say "The bigger issue is how the errors combine when doing calculations" is a pretty compact version of what you said.

    • Re:Analog Computers (Score:5, Informative)

      by Anonymous Coward on Sunday May 02, 2010 @12:07PM (#32064350)

      No, irrationality has nothing to do with it. It's a matter of numeric systems, i.e. binary vs. decimal. For example, 0.2 is a rational number. Express it in binary floating point and you'll see the problem: 2/10 is 1/5 is 1/101 in binary. Let's calculate the mantissa: 1/101=110011001100... (long division: 1/5->2/5->4/5->8/5=1,r3->6/5=1,r1->2/5->4/5->8/5...)

      All numeric systems have this problem. It keeps tripping up programmers because of the conversion between them. Nobody would expect someone to write down 1/3 as a decimal number, but because people keep forgetting that computers use binary floating point numbers, they do expect them not to make rounding errors with numbers like 0.2.

      • by -brazil- (111867)

        True, but irrational numbers are those that cannot be written down exactly in *any* base - not even if you use recurring digits.

        • Re: (Score:3, Informative)

          by moonbender (547943)

          True, but irrational numbers are those that cannot be written down exactly in *any* base

          ... except the irrational number's own base. ;)

      • Re: (Score:3, Funny)

        by adonoman (624929)

        Nobody would expect someone to write down 1/3

        I use base 3, so 0.1 is a perfectly easy number to express in floating point.

      • Re:Analog Computers (Score:4, Interesting)

        by RAMMS+EIN (578166) on Sunday May 02, 2010 @03:55PM (#32066068) Homepage Journal

        ``Nobody would expect someone to write down 1/3 as a decimal number, but because people keep forgetting that computers use binary floating point numbers, they do expect them not to make rounding errors with numbers like 0.2.''

        A problem which is exacerbated by the fact that many popular programming languages use (base 10) decimal syntax for (base 2) floating point literals. Which, first of all, puts people on the wrong foot (you would think that if "0.2" is a valid float literal, it could be represented accurately as a float), and, secondly, makes it impossible to write literals for certain values that _could_ actually be represented exactly as a float.

    • Re: (Score:2, Interesting)

      by cupantae (1304123)

      Irrational numbers are not such a problem as rational numbers which can't be represented in the base used.

      Lets say our computer has 6-digit-decimal precision. If you add two irrational numbers, say pi and e, you'll get 5.85987. It's imprecise, but imprecision is necessary, since it can't be represented in any base.

      But if you add 3/7 and 2/3 you get 1.90524 which is imprecise even though a precise answer does exist.

  • by SolusSD (680489) on Sunday May 02, 2010 @11:44AM (#32064198) Homepage
    Floating point math should be properly verified using interval arithmetic: []
    • by harshaw (3140) on Sunday May 02, 2010 @02:24PM (#32065454)

      Gah. Yet another unintelligible wikipedia mathematics article. For once I did like to see an article that does a great job *teaching* about a subject. Perhaps wikipedia isn't the right home for this sort of content, but my general feeling whenever reading something is wikipedia is that the content was drafted by a bunch of overly precise wankers focusing on the absolute right terminology without focusing on helping the reader understand the content.

  • by tonywestonuk (261622) on Sunday May 02, 2010 @11:44AM (#32064210)

    Damn...Missed it! lol

  • by ameline (771895) <<moc.liamg> <ta> <enilema.nai>> on Sunday May 02, 2010 @11:45AM (#32064214) Homepage Journal

    You really need to talk about associativity (and the lack of it). ie a+b+c != c+b+a, and the problems this can cause when vectorizing or otherwise parallelizing code with fp.

    And any talk about fp is incomplete without touching on catastrophic cancellation.

    • The lack of associativity is a bigger problem than you might think, because the compiler can rearrange things. If you're using x87 FPU code, you get 80-bit precision in registers, but only 64-bit or 32-bit precision when you spill to the stack. Depending on the optimisations that are run, this spill happens at different times, meaning that you can get different results depending on how good your register allocator is. Even upgrading the compiler can change the results.

      If you are using floating point v

      • by ameline (771895)

        If you're using the x87, just give up. It is very hard to efficiently conform to IEEE on that evil beast. (even setting the control register to mung precision only affects the fraction, not the exponent, so you still have to store to memory and reload to properly set precision.)

        A former colleague described it (the entire x87 unit) as "Satan incarnate in silicon". :-)

  • strictfp (Score:4, Informative)

    by lenmaster (598077) on Sunday May 02, 2010 @11:56AM (#32064272)
    This article should mention strictfp in the section on Java.
    • by -brazil- (111867)

      Done :)

    • Re: (Score:3, Insightful)

      by owlstead (636356)

      Not really. It might point to BigDecimal, but leave strictfp out of it. Remember, this is for starting programmers, not creators of advanced 3D or math libs.

  • by Nutria (679911) on Sunday May 02, 2010 @11:57AM (#32064276)

    use BCD math. With h/w support it's fast enough...

    Why don't any languages except COBOL and PL/I use it?

    • by JamesP (688957) on Sunday May 02, 2010 @12:17PM (#32064430)

      Maybe because BCD is the worse possible way to do 'proper' decimal arithmetic, also it would absolutely be very slow.

      BCD = 2 decimal digits per 8 bits (4 bits per dd). Working 'inside' the byte sucks

      Instead you can put 20 decimal digits in 64bits (3.2 bits per db) and do math much more faster

      Why don't any languages except COBOL and PL/I use it?


      • by TheRaven64 (641858) on Sunday May 02, 2010 @12:30PM (#32064544) Journal

        also it would absolutely be very slow

        Depends on the architecture. IBM's most recent POWER and System-Z chips have hardware for BCD arithmetics.

      • by Nutria (679911)

        Maybe because BCD is the worse possible way to do 'proper' decimal arithmetic,

        "0.1 + 0.2 = 0.30000000000000004" doesn't really seem all that proper to me.

        But BCD *does* do "0.1 + 0.2 = 0.3".

        also it would absolutely be very slow.

        Without h/w support.

        Instead you can put 20 decimal digits in 64bits (3.2 bits per db) and do math much more faster

        I want accurate math, not estimates.


        Do you pride yourself a Rational Man, or a low down dirty bigot?

        • by cgenman (325138)

          How is that Hardware Support going? Just curious.

          • by Nutria (679911)

            How is that Hardware Support going?

            Very well, on machines designed for business (i.e., mainframes and VAXen).

        • Re: (Score:3, Insightful)

          by amorsen (7485)

          Instead you can put 20 decimal digits in 64bits (3.2 bits per db) and do math much more faster

          I want accurate math, not estimates.

          Math with 20 decimal digits in 64 bits is proper decimal arithmetic. It acts exactly like BCD does, it just doesn't waste tons of space and CPU power.

        • Re: (Score:3, Insightful)

          by JamesP (688957)

          You completely missed my point.

          I'm not comparing BCD to floating point, I'm comparing BCD with other ways of encoding decimal numbers in a computer

        • by AuMatar (183847) on Sunday May 02, 2010 @07:01PM (#32067182)

          If you want accuracy, BCD is still a failure. It only does base 10 instead of base 2. A truly accurate math system would use 2 integers, one for numerator and one for denominator and thus get all rational numbers. If you need irrationals you get even more complicated. But don't pretend BCD is accurate, it fails miserably on common math problems like 1/3.

    • A rational number class seems like a better solution, though there are some issues with representing every number as a ratio of integers... For instance, you need extremely large numerators as your denominators get large. For another, you need to keep computing gcfs to reduce your fractions. Still, this is the approach used in some calculator programs.

      I wonder if a continued fraction [] representation would have advantages -- and if it has ever been used as a number representation in practice? It seems lik

      • Re: (Score:2, Informative)

        by poopdeville (841677)

        Continued fractions are a straightforward way to implement exact computable real arithmetic. So yes, it's been used. And it's slow. But it is exact.

    • and Ada (Score:3, Informative)

      by krischik (781389)

      Correction: COBOL, PL/I and Ada. Ada has both fixed and floating point BCD arithmetic. And yes I too wonder why it is not in wider use. Perhaps it has to do with the ill conceived notion of "light weight languages" - most of which are not light weight at all any more once they are on the market for decade or so.


  • by renoX (11677) on Sunday May 02, 2010 @12:02PM (#32064310)
    Maybe in your list of solutions you could mention interval arithmetic [], it's not very much used, but it gives "exact" solution.
    • by -brazil- (111867)

      Why don't you write it up yourself and give me a github pull request? :)

      • by renoX (11677)

        Because I don't know how to use github and it looks to me as a really, really complicated way to make a wiki..

        • by -brazil- (111867)

          Well, I'll do it when I get around to it. Doing it as a wiki would mean that I'd have to deal with vandalism and spam, and it's really intended more as small self-contained site than a community thing.

    • by OSPolicy (1154923) on Sunday May 02, 2010 @03:50PM (#32066032) Homepage

      Internal arithmetic always includes the exact solution, but only the rarest circumstances does it actually give the exact solution. For example, an acceptable interval answer for 1/3 would be [0.33,0.34]. That interval includes the exact answer, but does not express it.

  • Before we get (Score:5, Informative)

    by toxygen01 (901511) on Sunday May 02, 2010 @12:10PM (#32064374) Journal
    to floating point, please, everyone should've read Everything you ever wanted to know about C types [] and part 2 [] (which explains fp too).

    this will save a lot of time & questions to most beginning (and maybe mediocre) programmers.
  • I'd just avoid it (Score:5, Interesting)

    by Chemisor (97276) on Sunday May 02, 2010 @12:13PM (#32064396)

    Given the great complexity of dealing with floating point numbers properly, my first instinct, and my advice to anybody not already an expert on the subject, is to avoid them at all cost. Many algorithms can be redone in integers, similarly to Bresenham, and work without rounding errors at all. It's true that with SSE, floating point can sometimes be faster, but anyone who doesn't know what he's doing is vastly better off without it. At the very least, find a more experienced coworker and have him explain it to you before you shoot your foot off.

    • Re:I'd just avoid it (Score:5, Informative)

      by -brazil- (111867) on Sunday May 02, 2010 @12:34PM (#32064576) Homepage

      The non-trivial problems with floating-point really only turn up in the kind of calculations where *any* format would have the same or worse problems (most scientific computing simply *cannot* be done in integers, as they overflow too easily).

      Floating-point is an excellent tool, you just have to know what it can and cannot do.

      • Re: (Score:3, Insightful)

        by Rogerborg (306625)

        Depends how many "integers" you use. If you need accuracy - and "scientific" computing certainly does - then don't use floats. Decide how much accuracy you need, and implement that with as many bytes of data as it takes.

        Floats are for games and 3D rendering, not "science".

        • by -brazil- (111867) on Sunday May 02, 2010 @01:49PM (#32065208) Homepage

          You've never done any scientific computing, it seems. While it's a very broad term, and floats certainly not the best tool for *all* computing done by science, anyone with even the most basic understanding knows that IEEE 754 floats *are* the best tool most of the time and exactly the result of deciding how much accuracy you need and implementing that with as many bytes of data as it takes. Hardly anything in the natural sciences needs more accuracy than a 64 bit float can provide.

    • And once you've finished writing your algorithm in manually coded fixed point to avoid the "complexities" of float-point you can sit down and tuck into a tasty plate of quiche.

  • Here's a one-page intuitive description of floating-point [], to give a feel for why multiplication and division are fairly benign, but addition and subtraction aren't. None of the other descriptions have made it as clear as this.
  • by Animats (122034) on Sunday May 02, 2010 @12:19PM (#32064454) Homepage

    The article gives the impression that base 10 arithmetic is somehow "more accurate". It's not. You still get errors for, say, 1/3 + 1/3 + 1/3. It's just that the errors are different.

    Rational arithmetic, where you carry along a numerator and denominator, is accurate for addition, subtraction, multiplication, and division. But the numerator and denominator tend to get very large, even if you use GCD to remove common factors from both.

    It's worth noting that, while IEEE floating point has an 80-bit format, PowerPCs, IBM mainframes, Cell processors, and VAXen do not. All machines compliant with the IEEE floating point standard should get the same answers. The others won't. This is a big enough issue that, when the Macintosh went from Motorola 68xxx CPUs to PowerPC CPUs, most of the engineering applications were not converted. Getting a different answer from the old version was unacceptable.

    • by -brazil- (111867)

      Actually, I tried to make it very clear in several places that base 10 has just the same problems. I am open to any suggestions for improvement, though.

    • The article gives the impression that base 10 arithmetic is somehow "more accurate". It's not. You still get errors for, say, 1/3 + 1/3 + 1/3. It's just that the errors are different.

      What kind of errors are you referring to?

  • by sunderland56 (621843) on Sunday May 02, 2010 @12:20PM (#32064458)
    Look, times are tough for programmers already. Knowing how to do things correctly - like proper floating point math - is one of the ways to separate the true CS professional from the wannabe new graduates. Articles like this just make everyone smarter, and make finding a job that much harder.
    • Re: (Score:3, Insightful)

      by sohp (22984)

      Knowing how to do things correctly - like proper floating point math - is one of the ways to separate the true CS professional from the wannabe new graduates.

      True, except that HR people and hiring managers neither know nor care about doing things correctly, they just want cheap and fast. Just make sure you have all the right TLAs on your resume, you'll get a job. You can put "IEEE 754 expert" down though. They won't recognized the reference so maybe they'll be impressed by it.

    • Re: (Score:3, Funny)

      by jellomizer (103300)

      Except for the fact that companies don't care about floating point they are looking for 3+ years on windows 7. 20 years of Linux. and 15 years of .NET.

  • by dr2chase (653338) on Sunday May 02, 2010 @12:20PM (#32064466) Homepage
    Other issues that might be worth mentioning:
    • Catastrophic cancellation in complex arithmetic.
    • Single vs double rounding, in fused vs cascaded multiply-add operations.
    • Range reduction in trig functions (Intel hardware only uses a 68-bit PI, this causes problems sometimes).
    • Double-rounding when converting FP precision (e.g., 64-bit mantissa to 53, or 53 with extended exponent to 53 with regular exponent when the mantissa goes denorm).
    • Conversion to/from string representation.
    • Issues with "constructive reals" (a=b? is not necessarily an answerable question -- you might need to look at "all" the digits in order to answer "yes").
    • Distributive law DOES NOT HOLD -- a * (b+c) != a*b + a*c
  • This sounds a lot like the stuff i heard in CS/Mathematics classes more than 20 years ago (Hackbusch, Praktische Analysis, Summer term 1987/88). That course was mandatory then for any CS, Mathematics and Physics student. Has that changed yet? It's about the differences between a number at it's binary representation (and examples about consequences).

    I completely agree, that every programmer should know about that. But this is nothing new, it was already important 40 years ago. I'm pretty sure some space prob

  • Please look here (Score:5, Informative)

    by ctrl-alt-canc (977108) on Sunday May 02, 2010 @01:18PM (#32064916)
    People interested into floating point math will find some very interesting materials and horror stories in the documents collected at the home page [] of professor William Kahan, the man behind IEEE754 standard.
    According to my personal experience the paper by David Goldberg cited in the post isn't that difficult after all. Plenty of interesting materials can also be found in the Oppenheim & Shafer [] textbook about digital signal processing.
    • I was brought in a bit after the start of a state project to write a system to track about a half billion dollars in money for elderly and disabled indigent care. I was horrified to find that all the money variables were float. After raising the issue and explaining the technical details, I proposed a Money class and if they didn't want that gave them the option of a simple fix: just change all the floats to long and keep the amounts in pennies, inserting the decimal point only when displaying numbers. The

  • by Cliff Stoll (242915) on Sunday May 02, 2010 @01:19PM (#32064930) Homepage

    Over at Evans Hall at UC/Berkeley, stroll down the 8th floor hallway. On the wall, you'll find an envelope filled with flyers titled, "Why is Floating-Point Computation so Hard to Debug whe it Goes Wrong?"

    It's Prof. Kahan's challenge to the passerby - figure out what's wrong with a trivial program. His program is just 8 lines long, has no adds, subtracts, or divisions. There's no cancellation or giant intermediate results.

    But Kahan's malignant code computes the absolute value of a number incorrectly on almost every computer with less than 39 significant digits.

    Between seminars, I picked up a copy, and had a fascinating time working through his example. (Hint: Watch for radioactive roundoff errors near singularities!)

    Moral: When things go wrong with floating point computation, it's surprisingly difficult to figure out what happened. And assigning error-bars and roundoff estimates is really challenging!

    Try it yourself at: []

  • Most of you are too young to have dealt with architectures like VAX and Gould PowerNode with their awful floating point implementations.

    IEEE 754 has problems but it's a big improvement over what came before.

  • by John Hasler (414242) on Sunday May 02, 2010 @01:41PM (#32065138) Homepage

    It's missing the irritating cutesy "humor".

  • If anyone with ACM digital libray access wants the DOI link to the original article, rather than the edited version Sun/Oracle's site it's [].

    It is an old article though, so it's a 44 page scanned PDF.

  • Thanks to Sun (Score:5, Interesting)

    by khb (266593) on Sunday May 02, 2010 @04:04PM (#32066118)

    Note that the cited paper location is; this version of the article has corrections and improvements from the original ACM paper. Sun has provided this to interested parties for 20odd years (I have no idea what they paid ACM for rights to distribute). [] is the Sun provided freely distributable libm that follows (in a roundabout way) from the paper.

    I don't recall if K.C. Ng's terrific "infinite pi" code is included (it was in Sun's libm) which takes care of intel hw by doing the range reduction with enough bits for the particular argument to be nearly equivalent to infinite arithmetic.

    Sun's floating point group did much to advance the state of the art in deployed and deployable computer arithmetic.

    Kudos to the group (one hopes that Oracle will treat them with the respect they deserve)

"Look! There! Evil!.. pure and simple, total evil from the Eighth Dimension!" -- Buckaroo Banzai