Floating point routines from <cmath> and various utilities.
IEEE floating point numbers do some trickery to get an extra bit of precision in the significand (aka mantissa). Every floating point number is represented as
f = ± sig × 2^exp,
where the significand satisfies 1 ≤ sig < 2.
The function frexp(f)
returns sig/2 and exp + 1, so the returned fraction lies in [1/2, 1).
The function ldexp(s,e)
converts a significand and exponent
back to the original floating point number.
For IEEE 64 bit floats, the sign is 1 bit, the exponent is 11 bits, and the mantissa is 53 bits. Note 1 + 11 + 53 = 65, which is 1 greater than 64: only 52 mantissa bits are stored and the leading 1 is implicit. The sign bit is 0 for positive and 1 for negative. To convert the exponent bits, take the 11 bit base 2 number and subtract the bias = 1023. To convert the mantissa, put a base 2 decimal point in front of the 52 stored bits and tack a 1 on the front.
Unlike the mathematical real numbers we can have 1 + x =
1 with x ≠ 0. Machine epsilon is the difference between 1 and the
next larger floating point number. For 64-bit
floating point numbers it is 2^-52 ≈ 2.22e-16 and is denoted
std::numeric_limits<double>::epsilon()
in C++, or DBL_EPSILON
in C. Adding any positive x less than half of machine epsilon to 1 gives back 1.
This may seem annoying at first, but it is quite useful when it comes to summing series. E.g., how many terms of Σ_{n≥0} x^n/n! should we use when computing exp(x)? Stop when the terms are less than machine epsilon.
Certain bit patterns have special meaning.
If all the bits are 0 then the corresponding floating point number is 0. If only the sign bit is 1 then the number is -0, which compares equal to 0 even though the bit patterns differ.
There are many positive numbers less than machine epsilon that are not 0, for example 2^-52 down to 2^-1022. There are even smaller nonzero floating point numbers: if all the exponent bits are 0 then the significand no longer has the implicit 1 prefixed and the exponent is -1022. These subnormal numbers go down to 2^-1074.
If all the exponent bits are 1 and the significand is nonzero then the number is a NaN:
not a number. There are 2 × (2^52 − 1) NaNs. Any arithmetical
computation with a NaN results in a NaN. It is also the case that
no NaN compares equal to anything, not even itself. In fact x != x
is a way to test if
x
is a NaN.
If all the exponent bits are 1 and all the significand bits are 0 then the number is infinity. There is also a negative infinity.
One handy feature of the IEEE representation is that the order of the underlying bits, interpreted as a 64 bit integer, is the same as the order of the corresponding floats, at least for floats of the same sign. Units in the last place (ULP) is the number of such integer values between two floats (plus 1).
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/