Spot the defect: rounding, part two

Posted on May 16, 2013 by ericlippert

Last time I challenged you to find a value which does not round correctly using the algorithm

Math.Floor(value + 0.5)

The value which does not round correctly is the double 0.49999999999999994, which is the largest double that is smaller than 0.5. With the given algorithm this rounds up to 1.0, even though clearly 0.49999999999999994 is less than one half, and therefore should round down.

What the heck is going on here?

As you probably know, doubles are represented internally as binary fractions. The original double in binary[1. Obviously the mantissa is in binary and the exponent is in decimal.] is

1.1111111111111111111111111111111111111111111111111111 x 2^-2

0.011111111111111111111111111111111111111111111111111111

which in decimal is

0.499999999999999944488848768742172978818416595458984375

As you can see it is extremely close to the stated value. Again, just so we are clear: when in source code you say 0.49999999999999994, that decimal fraction is rounded to the nearest binary fraction that has no more than 52 bits after the decimal place.

Already we have one rounding, and as you can see, we rounded up — but we’re still less than half.

Now suppose we took this value and added exactly 0.5 to it. We’d get

0.999999999999999944488848768742172978818416595458984375

which is still smaller than one. Therefore we’d expect this to go to zero when passed to Math.Floor, right?

Wrong; that fraction isn’t a legal double because it would require 53 bits of precision to represent exactly. A double only has 52 bits of precision, and therefore this must be rounded off when it becomes a double.

But rounded to what? Clearly it is extremely close to 1.0. What is the largest double value that is smaller than 1.0? In binary that would be

1.1111111111111111111111111111111111111111111111111111 x 2^-1

0.11111111111111111111111111111111111111111111111111111

which is

0.99999999999999988897769753748434595763683319091796875

in decimal.

So we now have three relevant values:

0.999999999999999888977697537484345957636833190917968750 (largest double smaller than 1.0)
0.999999999999999944488848768742172978818416595458984375 (exact)
1.000000000000000000000000000000000000000000000000000000 (1.0)

Which one should we round the middle one to? Well, which is closer?

It’s OK, I’ll wait while you do the math by hand.

Neither is closer. The value is exactly in the middle of the two possibilities. The difference is

0.000000000000000055511151231257827021181583404541015625

in both directions.

Of course you could have worked that out a lot more easily in binary than in decimal! The three numbers are in binary:

0.111111111111111111111111111111111111111111111111111110
0.111111111111111111111111111111111111111111111111111111
1.000000000000000000000000000000000000000000000000000000

From which you can read off that difference between both pairs is 2^-54.

When faced with this situation the floating point chip has got to make a decision and it decides to once again round up; it has no reason to believe that you really want it to round down. So the sum rounds to the double 1.0, which is then “floored” to 1.0.

Once again we see that floating point math does not behave at all like pen-and-paper math. Just because an algorithm would always work in exact arithmetic does not mean that it always works in inexact floating point. Multiple roundings of extremely small values can add up to a large difference.

I’ve posted the code I used to make all these calculations below; long-time readers will recognize it from my previous article on how to decompose a double into its constituent parts.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SolverFoundation.Common;

class Program
{
    static void Main()
    {
        MyDouble closeToHalf_Double = 0.49999999999999994;
        Rational closeToHalf_Rational = DoubleToRational(closeToHalf_Double);

        Console.WriteLine("0.49999999999999994");
        Console.WriteLine("Actual value of double in binary:");
        Console.WriteLine("1.{0} x 2 to the {1}", 
            closeToHalf_Double.MantissaBits.Join(), 
            closeToHalf_Double.Exponent - 1023);
        Console.WriteLine("Actual value of double as fraction:");
        Console.WriteLine(closeToHalf_Rational.ToString());
        Console.WriteLine("Actual value of double in decimal:");
        Console.WriteLine(closeToHalf_Rational.ToDecimalString());
        Console.WriteLine();

        MyDouble closeToOne_Double = 0.99999999999999994;
        Rational closeToOne_Rational = DoubleToRational(closeToOne_Double);

        Console.WriteLine("0.99999999999999994");
        Console.WriteLine("Actual value of double in binary:");
        Console.WriteLine("1.{0} x 2 to the {1}",
            closeToOne_Double.MantissaBits.Join(),
            closeToOne_Double.Exponent - 1023);
        Console.WriteLine("Actual value of double as fraction:");
        Console.WriteLine(closeToOne_Rational.ToString());
        Console.WriteLine("Actual value of double in decimal:");
        Console.WriteLine(closeToOne_Rational.ToDecimalString());
        Console.WriteLine();

        // Now let's do the arithmetic in "infinite precision":

        Rational sum = closeToHalf_Rational + (Rational)0.5;

        Console.WriteLine("Sum in exact arithmetic as fraction:");
        Console.WriteLine(closeToOne_Rational.ToString());
        Console.WriteLine("Sum in exact arithmetic in decimal:");
        Console.WriteLine(closeToOne_Rational.ToDecimalString());

        // But that has more precision than will fit into a double, 
        // so we have to round off. We have two possible values:

        Rational d1 = Rational.One - sum;
        Rational d2 = sum - closeToOne_Rational;

        Console.WriteLine("Difference from one:");
        Console.WriteLine(d1.ToDecimalString());
        Console.WriteLine("Difference from closeToOne:");
        Console.WriteLine(d2.ToDecimalString());
    }

    static Rational DoubleToRational(MyDouble d)
    {
        bool subnormal = d.Exponent == 0;
        var two = (Rational)2;
        var fraction = subnormal ? Rational.Zero : Rational.One;
        var adjust = subnormal ? 1 : 0;
        for (int bit = 51; bit >= 0; --bit)
            fraction += d.Mantissa.Bit(bit) * two.Exp(bit - 52 + adjust);
        fraction = fraction * two.Exp(d.Exponent - 1023);
        if (d.Sign == 1)
            fraction = -fraction;
        return fraction;
    }
}

struct MyDouble
{
    private ulong bits;
    public MyDouble(double d)
    {
        this.bits = (ulong)BitConverter.DoubleToInt64Bits(d);
    }

    public int Sign
    {
        get
        {
            return this.bits.Bit(63);
        }
    }

    public int Exponent
    {
        get
        {
            return (int)this.bits.Bits(62, 52);
        }
    }

    public IEnumerable<int> ExponentBits
    {
        get
        {
            return this.bits.BitSeq(62, 52);
        }
    }

    public ulong Mantissa
    {
        get
        {
            return this.bits.Bits(51, 0);
        }
    }

    public IEnumerable MantissaBits
    {
        get
        {
            return this.bits.BitSeq(51, 0);
        }
    }

    public static implicit operator MyDouble(double d)
    {
        return new MyDouble(d);
    }
}

static class Extensions
{
    public static int Bit(this ulong x, int bit)
    {
        return (int)((x >> bit) & 0x01);
    }

    public static ulong Bits(this ulong x, int high, int low)
    {
        x <<= (63 - high);
        x >>= (low + 63 - high);
        return x;
    }

    public static IEnumerable<int> BitSeq(this ulong x, int high, int low)
    {
        for (int bit = high; bit >= low; --bit)
            yield return x.Bit(bit);
    }

    public static Rational Exp(this Rational x, int y)
    {
        Rational result;
        Rational.Power(x, y, out result);
        return result;
    }

    public static string ToDecimalString(this Rational x)
    {
        var sb = new StringBuilder();
        x.AppendDecimalString(sb, 50000);
        return sb.ToString();
    }

    public static string Join<T>(this IEnumerable seq)
    {
        return string.Concat(seq);
    }
}

27 thoughts on “Spot the defect: rounding, part two”

Jeffrey Sax on May 16, 2013 at 8:53 am said:

A tiny correction to your first bold statement: I think you should count the implicit leading 1 bit in the precision. Your binary representation of the number uses 54 1’s, so it makes more sense to say it needs 54 bits of precision. Your fixed point representation of the original number also uses 54 bits after the decimal point, not 52, but it does use 53 1’s. Again, saying it has 53 bits of precision makes more sense..

Also, it might be useful to explain that the decision to round up is not arbitrary but the result of long discussions in a committee. The poor chip has no say in it. The rounding is done so that the last bit of the rounded number is 0. So positive …11 is rounded up, while positive …01 is rounded down. This is also called “round to even.”

Reply ↓
- Douglas McClean on May 16, 2013 at 11:28 am said:
  
  Actually to be even more picky IEEE-754 defines 4 rounding modes. The one you mention is the default, though, and AFAIK is the only one used for .Net System.Double and System.Float.
  
  https://en.wikipedia.org/wiki/IEEE_floating_point#Rounding_rules
  
  See here for how to change modes with the Microsoft C runtime library. http://msdn.microsoft.com/en-gb/library/c9676k6h.aspx
  
  Reply ↓
  - Stilgar on May 17, 2013 at 4:44 am said:
    
    Since this subthread is all about being picky…
    It is not System.Float but System.Single
    Sorry couldn’t stop myself from posting:)
    
    Reply ↓
    - Douglas McClean on May 17, 2013 at 5:06 pm said:
      
      “To save one from a mistake is a gift of paradise”, after all. Well spotted. 🙂
      
      Reply ↓
Harald van Dijk on May 16, 2013 at 10:22 am said:

So now you have me wondering: how should you make sure to round halves up, correctly, even for the unreasonable numbers, if I can call them that? There is no Math.Round(val, MidpointRounding.Up). Math.Floor(value + nextafter(0.5, 0)) is a valid trick for positive numbers, but gives the wrong result for -0.5. Conditionally adding either 0.5 or nextafter(0.5, 0), depending on the sign of value, works, but still, it does not necessarily preserve the sign of negative numbers (they may round to +0.0 instead of -0.0). Taking a different approach, Math.Round(value < 0 ? nextafter(value, 0) : value, MidpointRounding.AwayFromZero) comes to mind, but that would give the wrong result when nextafter(value, 0) differs from value by more than one. The only thing I can come up with that I haven’t yet been able to rule out is special casing based on value – Math.Truncate(value): if it’s exactly 0.5, return value + 0.5. If it’s exactly -0.5, return -(-value – 0.5). Otherwise, just return Math.Round(value);. But I’m not sure I am not overlooking anything, and I’m convinced there must be better ways.

Reply ↓
- hacksoncode on May 16, 2013 at 1:52 pm said:
  
  It turns out that, for exactly the same reasons adding 0.5 *doesn’t* work, adding the number he references that is the largest value smaller than 0.5 *does* work. Think about all of the boundary conditions and you’ll see that this is the case (0.5 works, see above, 0.4999…etc works because it rounds correctly.
  
  Interesting hack.
  
  Reply ↓
  - Harald van Dijk on May 16, 2013 at 2:49 pm said:
    
    That does not work, as I already noted in my comment: “Math.Floor(value + nextafter(0.5, 0)) is a valid trick for positive numbers, but gives the wrong result for -0.5.” Note that nextafter(0.5, 0) is just another way of writing 0.49999999999999994, except without relying on any specific precision.
    
    Reply ↓
- Harald van Dijk on May 18, 2013 at 5:41 am said:
  Continuing with this, it appears to be simpler to just do it yourself, and not even use the default Math.Round at all:
```
static double Round(double value) {
  double intpart = Math.Truncate(value);
  double fracpart = value - intpart;
  if (fracpart < -0.5)
    return intpart - 1;
  else if (fracpart >= 0.5)
    return intpart + 1;
  else
    return intpart;
}
```
  Reply ↓
flug on May 16, 2013 at 11:11 am said:

FWIW the first ‘strange’ bug I remember encountering was a report that on an accounting system created for Apple II computers, dollar values of $8.12 were recorded as $8.11

The system programmers were stumped as to the cause and the program gave correct values for every other dollar & cents value but this one.

The problem turned out to be they used a routine to store the dollar/cents amounts like this:

int (100*$amount)

In Applesoft, 100*8.12=811.9999998 or thereabouts.

But to make matters worse, Applesoft rounded the answer for display purposes, so you get contradictory-seeming results like these:

print (8.12*100)
812

print int(8.12*100)
811

print (812-(8.12*100))
2.68220901E-07

To make matters worse, reversing the order of the multiplicands changes the answer:

print (812-(100*8.12))
0

Moral of the story: Expect the results of floating point operations to deviate from the expected in the last significant digit (or more, especially if you’re chaining operations) and to not necessarily display the discrepancy.

FYI back in the day I tested this up to many millions or billions of dollars (I can’t remember exactly but I let a loop run for pretty much all day on it) and 8.12 was the only two-digit decimal to have this problem in Applesoft.

You can still see the bug at work today on Apple II emulators like http://www.virtualapple.org/a2-fs1disk.html

Reply ↓
flug on May 16, 2013 at 11:14 am said:

> In Applesoft, 100*8.12=811.9999998 or thereabouts.

Oops, actually 100*8.12=812 whereas 8.12*100=811.9999998 (approx.). The order of the multiplicands is important.

Reply ↓
rdm on May 16, 2013 at 11:14 am said:

I think, when dealing with issues like this, we need to make a careful distinction between accuracy and precision.

Using the IEEE 754 double-precision binary floating-point format, 2^53 = 1+2^53, and this also holds for 2^54, 2^55, …

So if you need that 1+2^53 result, you should not be using numbers in that IEEE-754 format. The underlying reasons for this also hold for fixed width integers. Some results will not fit. About half the results from + and almost all the results from * have this characteristic.

As a rough rule of thumb, if the accuracy you need uses the square root of the number of bits of precision you can mostly ignore fixed width format issues. That’s not completely true (floating point subtraction of positive decimal values can be a good counter example), but it’s a plausible starting point.

Note also that we teach kids how to deal with fixed width numeric issues in grade school. We teach them how to carry when adding, borrow when subtracting, and so on. The digits we use in our decimal numbers are “fixed width” and it’s simply not the case that the sum of any two single digit numbers will fit in a single digit result.

Computer hardware often includes support for “carry”, and programming languages routinely do not expose this to the programmer. But we can work around this by splitting our numeric formats in half and use the upper half to represent “carry” (and using a sequence of “numbers” to represent our numeric values).

Reply ↓
Ralph Becket on May 16, 2013 at 4:50 pm said:

There’s a web site of quotations from the late, great Ken Iverson, http://keiapl.org/anec/, and this one seems particularly apposite:

‘In an early talk Ken was explaining the advantages of tolerant comparison. A member of the audience asked incredulously, “Surely you don’t mean that when A=B and B=C, A may not equal C?” Without skipping a beat, Ken replied, “Any carpenter knows that!” and went on to the next question.’

Reply ↓
- Joker_vD on May 17, 2013 at 1:40 am said:
  
  The problem with real values (oh, the pun) is that you, strictly speaking, never can be sure that A=B and B=C. What you know is that A≈B and B≈C, but that doesn’t imply that A≈C! “If you throw a frog into a boiling water, it will jump out, but if you throw it into a cold water and then slowly heat it, the frog won’t notice and will get boiled”.
  
  But of course, no one likes to carry all this A=B±ε, B=C±ε notation, because it’s tedious and not funny at all — though you clearly see that A=C±2ε.
  
  Reply ↓
  - Ralph Becket on May 18, 2013 at 5:33 pm said:
    
    I agree, I was always taught that comparing floats for equality should be considered a bug.
    
    That said, I worked for a long time on a constraint programming language project and I couldn’t convince anyone else on the project that including zero tolerance comparison of reals in the language was a mistake.
    
    Reply ↓
    - Brooks Moses on May 19, 2013 at 6:27 pm said:
      
      It is _often_ a bug. It is not _always_ a bug.
      
      An example of when it isn’t is a function that operates on an array to produce another array (such as a Fourier transform or matrix product), and takes an optional floating-point “scale” input that it will multiply each element of the output by. It is entirely reasonable — and common — to compare that “scale” input to 1.0 exactly. If it is exactly 1.0, which it often is because the user often supplies a literal 1.0 input, then we can skip the multiplication. But if it is anything other than _exactly_ 1.0, we need to do the multiplication.
      
      Another example is that we may have a function that uses different computation paths depending on whether the input array is aligned to an exact multiple of a SIMD vector width, and we want to test to make sure that these computation paths produce precisely the same answer.
      
      Reply ↓
    - John Payson on May 20, 2013 at 9:01 am said:
      
      Having an equivalence test for floating-point values is useful, in that if X is equivalent to Y, then for any pure function F(), one may substitute F(X) for F(Y) when convenient (e.g. if one has computed one, one need not repeat the computation for the other). Note that equivalence does not imply that the real numbers the floats are “supposed” to represent are equal, nor does non-equivalence mean the real numbers are not equal. Equivalence, however, is a useful concept in its own right even though languages and frameworks may not expose it very well (e.g. floating-point types in .NET override `Equals(Object)` in such fashion as to report some non-equivalent values as equal).
      
      As for why languages have a “numerically-equal” operator, that may tie in with the fact that floating-point types are often used to store whole-number quantities which are large, but which within the precisely-representable range. Historically they were often by far the best means of storing such values; today there are still contexts where they remain the best (e.g. when mixing real and whole-number quantities, it’s often better to use floating-point types for everything than to perform lots of integer-to-floating-point conversions).
      
      Reply ↓
Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1358
pavan on May 17, 2013 at 1:48 am said:

Wow what an information.. This article has good information. Thanks for the post..!

Reply ↓
Greg Sohl on May 17, 2013 at 7:18 pm said:

While it’s an interesting defect, it’s important to note that there are multiple precision issues with floating point types and associated math. If one needs precision without rounding error, use the Decimal type.

Reply ↓
- Eric Lippert on May 17, 2013 at 9:23 pm said:
  
  So you’re telling me that in decimal, one divided by three and then multiplied by three is one, because there are no rounding errors with decimal?
  
  There are plenty of rounding errors with decimal; they’re just different rounding errors.
  
  Reply ↓
  - John Payson on May 20, 2013 at 9:24 am said:
    
    Hear hear. No numerical representation can express all possible numerical quantities. Even using only the four basic arithmetic operators (or, for that matter, using any one of them), it will be possible to express a sequence of computations whose numerical result will not be representable, thus requiring the computation to return an imprecise result, return an error indication, throw an exception, run out of memory, or hang. For financial calculations it may be helpful to have a fixed-precision type which would always either yield exact results or report a failure, but Decimal is not such a type, since any operation on Decimal values may produce rounded results without any warning.
    
    Reply ↓
- Ralph Becket on May 18, 2013 at 5:35 pm said:
  
  Any finite precision representation is going to have these problems.
  
  Reply ↓
M Welinder on December 18, 2013 at 11:33 am said:

There another class of numbers for which +0.5 will
mess things up:

2^52+1, 2^52+3, …, 2^53-1

and the negatives of these.

Reply ↓
- Eric Lippert on December 18, 2013 at 12:56 pm said:
  
  Nice catch!
  
  Reply ↓
Gabe Prieto on March 4, 2014 at 9:17 am said:

Wall cavities especially shouldn’t be sealed back up until it is
completely dry inside. Mold problems, however, when left unattended for too long a time, can cause a severely weakened home architecture and even health problems
such as allergies and poisoning. If you hire a company that handles both processes from the start, they are likely to be able to discover that mold is going to be a
problem and take care of it before it becomes a health issue.

Reply ↓
Lol Elo Boost on April 16, 2014 at 7:49 pm said:

Bullying is a problem in the online gaming world. The info above should make sure you have a great gaming experience.
This means that you have to understand more about making the avocation as
pleasurable as it can be.

Feel free to visit my web site :: Lol Elo Boost

Reply ↓
Mr Man on August 13, 2023 at 9:32 am said:

ded

Reply ↓