I’ve talked a lot about floating point math over the years in this blog, but a quick refresher is in order for this episode.
A double represents a number of the form +/- (1 + F / 2^52) x 2^(E - 1023), where F is a 52 bit unsigned integer and E is an 11 bit unsigned integer; that makes 63 bits and the remaining bit is the sign, zero for positive, one for negative. You’ll note that there is no way to represent zero in this format, so by convention if F and E are both zero, the value is zero. (And similarly there are other reserved bit patterns for infinities, NaN and denormalized floats which we will not get into today.)
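To make the double layout concrete, here is a minimal sketch that pulls the sign, E and F fields out of a 64 bit double and rebuilds the value from the formula. It is written in Python purely for illustration; the helper names `decode_double` and `rebuild` are made up for this example, and the rebuild step assumes a normal value (not zero, infinity, NaN or a denormal).

```python
import struct
from fractions import Fraction


def decode_double(x: float):
    """Return (sign, E, F) for the 64 bit IEEE 754 encoding of x."""
    # Reinterpret the 8 bytes of the double as one unsigned 64 bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63               # top bit: 0 is positive, 1 is negative
    e = (bits >> 52) & 0x7FF        # next 11 bits: the biased exponent E
    f = bits & ((1 << 52) - 1)      # low 52 bits: the fraction F
    return sign, e, f


def rebuild(sign: int, e: int, f: int) -> Fraction:
    """Evaluate +/- (1 + F / 2^52) x 2^(E - 1023) exactly (normal values only)."""
    magnitude = (1 + Fraction(f, 2**52)) * Fraction(2) ** (e - 1023)
    return -magnitude if sign else magnitude


sign, e, f = decode_double(0.1)
print(sign, e, f)
# Fraction(0.1) is the exact value of the double nearest to 0.1,
# so this comparison checks the formula without any rounding.
print(rebuild(sign, e, f) == Fraction(0.1))
```

Using exact fractions rather than floats for the rebuild avoids introducing a second round of rounding into the check.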
A decimal represents a number in the form +/- V / 10^X where V is a 96 bit unsigned integer and X is an integer between 0 and 28.
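The decimal form can be sketched the same way. The snippet below models only the arithmetic of +/- V / 10^X with the stated ranges for V and X; it is not the actual in-memory layout of the type, and `decimal_value` and `MAX_V` are hypothetical names chosen for this example.

```python
from fractions import Fraction

MAX_V = 2**96 - 1  # largest 96 bit unsigned integer


def decimal_value(negative: bool, v: int, x: int) -> Fraction:
    """Evaluate +/- V / 10^X exactly, enforcing the ranges given in the text."""
    if not 0 <= v <= MAX_V:
        raise ValueError("V must fit in 96 unsigned bits")
    if not 0 <= x <= 28:
        raise ValueError("X must be between 0 and 28")
    magnitude = Fraction(v, 10**x)
    return -magnitude if negative else magnitude


# One tenth is exactly representable here: V = 1, X = 1.
print(decimal_value(False, 1, 1) == Fraction(1, 10))
# The same number can have more than one (V, X) pair, e.g. 1.0 and 1.00.
print(decimal_value(False, 10, 1) == decimal_value(False, 100, 2))
```

Because the scale is a power of ten rather than a power of two, values like 0.1 that a double can only approximate are exact in this form.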
Both are of course “floating point” because the number of bits of precision in each case is fixed, but the position of the decimal point can effectively vary as the exponent changes.