Floating point routines from <cmath> and various utilities.
IEEE floating point numbers do some trickery to get an extra bit of precision in the significand (aka mantissa). Every floating point number is represented as
f = ± sig × 2^exp,
where the significand satisfies 1 ≤ sig < 2.
The function frexp(f)
returns sig/2 and exp + 1, so the returned fraction lies in [1/2, 1).
The function ldexp(s,e)
converts a significand and exponent
back to the original floating point number.
For IEEE 64 bit floats, the sign is 1 bit, the exponent is 11 bits, and the mantissa is 53 bits. Note 1 + 11 + 53 = 65, which is 1 greater than 64: only 52 mantissa bits are stored and the leading 1 is implicit. The sign bit is 0 for positive and 1 for negative. To convert the exponent bits, take the 11 bit base 2 number and subtract the bias = 1023. To convert the mantissa, put a base 2 decimal point in front of the 52 stored bits and tack a 1 on the front.
Unlike the mathematical real numbers we can have 1 + x =
1 with x ≠ 0. Machine epsilon is the difference between 1 and the
next larger floating point number. For 64-bit
floating point numbers it is 2^-52 ≈ 2.22e-16 and is denoted
std::numeric_limits<double>::epsilon()
in C++, or DBL_EPSILON
in C. Adding any positive x less than half of machine epsilon to 1 gives back 1.
This may seem annoying at first, but it is quite useful when it comes to summing series. E.g., how many terms of Σ_{n≥0} x^n/n! should we use when computing exp(x)? Stop when the terms are less than machine epsilon.
Certain bit patterns have special meaning.
If all the bits are 0 then the corresponding floating point number is 0. If only the sign bit is 1 then the number is -0, which compares equal to 0 even though the bit patterns differ.
There are many positive numbers less than machine epsilon that are not 0, for example 2^-52 down to 2^-1022. There are even smaller nonzero floating point numbers: if all the exponent bits are 0 then the significand no longer has the implicit 1 prefixed and the exponent is -1022. These subnormal numbers go down to 2^-1074.
If all the exponent bits are 1 and the significand is nonzero then the number is a NaN:
not a number. There are 2 × (2^52 − 1) NaNs. Any arithmetical
computation with a NaN results in a NaN. It is also the case that
no NaN compares equal to anything, not even itself. In fact x != x
is a way to test if
x
is a NaN.
If all the exponent bits are 1 and all the significand bits are 0 then the number is infinity. There is also a negative infinity.
One handy feature of the IEEE representation is that the order of the underlying bits, interpreted as a 64 bit integer, is the same as the order of the corresponding floats, at least for floats of the same sign. Units in the last place (ULP) is the number of such integer values between two floats (plus 1).
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/