Floating Point Arithmetic, Part One

A month ago I was discussing some of the issues in integer arithmetic, and I said that issues in floating point arithmetic were a good subject for another day. Over the weekend I got some questions from a reader about floating point arithmetic, so this seems like as good a time as any to address them.

Before I talk about some of the things that can go terribly wrong with floating point arithmetic, it’s helpful (and character building) to understand how exactly a floating point number is represented internally.

To distinguish between decimal and binary numbers, I’m going to do all binary numbers in fixed-width.

Here’s how floating point numbers work. A float is 64 bits. Of that, one bit represents the sign: 0 is positive, 1 is negative.

Eleven bits represent the exponent. To determine the exponent value, treat the exponent field as an eleven-bit unsigned integer, then subtract 1023. However, note that the exponent fields 00000000000 and 11111111111 have special meaning, which we’ll come to later.

The remaining 52 bits represent the mantissa.

To compute the value of a float, here’s what you do.

  • Write out all 52 bits of the mantissa.
  • Stick a 1. onto the left hand side.
  • Compute that value as a 53 bit fraction with 52 fractional places.
  • Multiply that by two to the power of the given exponent value.
  • Sign it appropriately.

So for example, the number -5.5 is represented like this: (sign, exponent, mantissa)

(1, 10000000001, 0110000000000000000000000000000000000000000000000000)

The sign is 1, so it is a negative number.

The exponent is 1025 – 1023 = 2.

Put a 1. on the top of the mantissa and you get


which in decimal is 1.375 and sure enough, -1.375 x 22 = -5.5

This system is nice because it means that every number in the range of a float has a unique representation, and therefore doesn’t waste bits on duplicates.

However, you might be wondering how zero is represented, since every bit pattern has 1. plunked onto the left side. That’s where the special values for the exponent come in.

If the exponent is 00000000000, then the float is considered a “denormal”. It gets 0. plunked onto the left side, not 1., and the exponent is assumed to be -1022.

This has the nice property that if all bits in the float are zero, it is representing zero.

Note that this lets you represent smaller numbers than you would be able to otherwise, as we’ll see, though you pay the price of lower precision. Essentially, denormals exist so that the chip can do “graceful underflow” — represent tiny values without having to go straight to zero.

If the exponent is 11111111111 and the mantissa is all zeros, that’s Infinity.

If the exponent is 11111111111 and the mantissa is not all zeros, that’s considered to be Not A Number — this is a bit pattern reserved for errors.

So the biggest positive normalized float is

(0, 11111111110, 1111111111111111111111111111111111111111111111111111)

which is

1.1111111111111111111111111111111111111111111111111111 x 21023.

The smallest positive normalized float is

(0, 00000000001, 0000000000000000000000000000000000000000000000000000)

which is

1.000 x 2-1022

The biggest positive denormalized float is:

(0, 00000000000, 1111111111111111111111111111111111111111111111111111)

which is

0.1111111111111111111111111111111111111111111111111111 x 2-1022

The smallest positive denormalized float is

(0, 00000000000, 0000000000000000000000000000000000000000000000000001)

which is

0.0000000000000000000000000000000000000000000000000001 x 2-1022 = 2-1074


Next time on FAIC: floating point math is nothing like real number math.

2 thoughts on “Floating Point Arithmetic, Part One

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s