Double-precision Floating-point
This article looks at floating-point numbers. In particular, we concentrate on double-precision
floating-point.
Review of Integers
The natural integers 1, 2, 3, … are called natural because they are used to count
objects; for example, there were seven stars in the constellation Ursa Major (Big
Dipper, or Plough) long before humans came along to count them.
Computers store information in binary form by using two-state devices (typically,
on or off) to represent information. We employ the natural binary-weighted positional
system to represent data so that the binary string 011010 is equivalent to 0 x 2^5 + 1 x 2^4 + 1 x 2^3 + 0 x 2^2 + 1 x 2^1 + 0 x 2^0 = 0 + 16 + 8 + 0 + 2 + 0 = 26.
Computers with 8-bit data words can represent unsigned integers in the range 0 to
255. Sixteen-bit integers can be represented in the range 0 to 65,535, and 32-bit
integers can be represented in the range 0 to 4,294,967,295.
We can extend the range of integers into the negative region. Since we can’t increase the total number of integers represented by m bits (i.e., 2^m unique values), we typically use two’s complement notation to represent the signed range -2^(m-1) to 2^(m-1) - 1 (that is, 2^(m-1) negative values, zero, and 2^(m-1) - 1 positive values, giving 2^m values in total).
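As a quick check, here is a minimal C sketch (assuming the fixed-width types of stdint.h are available) that prints these unsigned and two’s complement ranges:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Unsigned ranges: 0 to 2^m - 1 */
    printf("8-bit  unsigned: 0 to %llu\n", (unsigned long long)UINT8_MAX);   /* 255 */
    printf("16-bit unsigned: 0 to %llu\n", (unsigned long long)UINT16_MAX);  /* 65,535 */
    printf("32-bit unsigned: 0 to %llu\n", (unsigned long long)UINT32_MAX);  /* 4,294,967,295 */

    /* Two's complement ranges: -2^(m-1) to 2^(m-1) - 1 */
    printf("8-bit  signed: %lld to %lld\n", (long long)INT8_MIN,  (long long)INT8_MAX);   /* -128 to 127 */
    printf("32-bit signed: %lld to %lld\n", (long long)INT32_MIN, (long long)INT32_MAX);  /* -2,147,483,648 to 2,147,483,647 */
    return 0;
}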
Unfortunately, integers can’t be used to represent fractional values like ½ or
√2. Such numbers are called real numbers and these have a fractional part (i.e.,
a part less than 1). Moreover, these numbers may be incapable of representation with
a finite number of digits using positional notation (for example, you can’t represent
⅓ exactly as a decimal fraction – you can write 0.333 or 0.333333333333333333, but
you can never exactly express ⅓). This inability to represent all real numbers exactly
in decimal or binary form has implications for the accuracy of calculations. In particular,
it has implications for the accuracy of chained calculations involving sequences of
arithmetic operations. Ultimately, this property of numbers can mean that the results
of some calculations are effectively meaningless. The computer scientist has to appreciate
this fact.
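A short C sketch of this effect, assuming IEEE 754 double-precision arithmetic: neither 1/3 nor 1/10 has a finite binary expansion, so the stored values are only approximations.

#include <stdio.h>

int main(void)
{
    /* 1/3 and 1/10 have no finite binary representation, so the stored
       values are approximations of the true real numbers. */
    double third = 1.0 / 3.0;
    double sum   = 0.1 + 0.2;

    printf("1/3       = %.20f\n", third);   /* 0.33333333333333331483... */
    printf("0.1 + 0.2 = %.20f\n", sum);     /* 0.30000000000000004441... */
    printf("0.1 + 0.2 == 0.3 ? %s\n", (sum == 0.3) ? "yes" : "no");  /* no */
    return 0;
}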
As well as representing fractional or real values, we have to represent a range of
values from the very large to the very small. Consider big numbers. The population
of the Earth is about 7,200,000,000, which is a moderately large number. If you want a truly big number, then we can use the mass of the sun, which is about 1,988,920,000,000,000,000,000,000,000,000 kg. In conventional arithmetic, we represent such large numbers by means of scientific notation, where we separate out precision and magnitude by writing numbers of the form m x 10^e, where m is a mantissa and e an exponent. In this case, the sun’s mass can be written 1.98892 x 10^30 kg. This notation requires fewer digits than integer notation, but also provides fewer significant digits; for example, the mass is represented by 1.98892, which provides a precision of about one part in 10^6.
As well as large numbers, we have to deal with tiny numbers; for example, the chance of a US citizen being hit by lightning in any given year is about 1 in 500,000, or 0.000,002, or 2.0 x 10^-6. The mass of an electron is 9.109,382,914 x 10^-31 kg. This notation uses a negative exponent to indicate tiny values.
So, we use scientific notation to represent large and small numbers by writing a
number in the form m x 10^e. Typically, the mantissa is expressed in the range 1.000…000
to 9.999…999 when using scientific notation.
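In C, the %e conversion prints a double in exactly this mantissa-and-exponent form; a small sketch using the figures quoted above:

#include <stdio.h>

int main(void)
{
    double sun_mass      = 1.98892e30;       /* kg, mass of the sun       */
    double electron_mass = 9.109382914e-31;  /* kg, mass of an electron   */

    /* %e prints a number as a mantissa and a signed power of ten. */
    printf("%e\n", sun_mass);        /* 1.988920e+30 */
    printf("%e\n", electron_mass);   /* 9.109383e-31 */
    return 0;
}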
Floating-point
It was natural that our scientific notation would be converted into an appropriate
binary form for use by computers. This form is called binary floating-point. The
term indicates that the binary point is not at a fixed place in the number; its position
is variable. Although dealing with binary integers is easy because you have just
two choices: the length of the integer in bits and whether to use signed or unsigned
representation, dealing with floating-point is not. Most definitely not.
If you choose a floating-point representation of real numbers, you have at least
four factors to consider. The first is the way in which you represent positive and
negative numbers. The second choice is the number of bits dedicated to the mantissa
(this determines the precision of the number). The third choice is the way in which
the sign of the exponent is represented (e.g., 2^+8 or 2^-4). The fourth choice is the number of bits for the exponent. If the exponent is too short (too few bits), we can’t handle very large and very small numbers. If the exponent is needlessly large, we
waste bits that could be better used by the mantissa.
The actual situation (i.e., trade-offs in the design of floating-point numbers) is
even worse because there are additional considerations (such as guard bits and sticky bits) that we will introduce later.
For a long time, each computer manufacturer implemented their own floating-point format, each choosing a different range/precision/accuracy trade-off. This situation
did not support portability in the industry and made academic research difficult,
because comparing results performed on different computers was not easy. However,
here we are talking about calculations at the cutting edge in astronomy or the numerical
simulation of airflow over an aircraft wing. The various floating-point standards
were better behaved when performing more mundane tasks (i.e., no long chains of calculations
and not relying on all the available precision).
A great step forward took place in the 1980s when the IEEE produced a standard for
the floating-point representation of real numbers. The full name of the standard
is IEEE 754-1985 Standard for Binary Floating-Point Arithmetic. This standard defined, in fact, a family of formats for floating-point numbers represented in 32 bits and 64 bits (a 128-bit format was added in a later revision of the standard). In this article we will look at the standard for 64-bit floating-point numbers (also called double precision).
IEEE 754 defines:
- Storage formats
- Precise specifications of how results are generated
- The handling of special (non-floating-point) values
- The handling of exceptions and illegal operations
Double Precision IEEE 754 Format
The IEEE double-precision floating-point format uses a 64-bit word and is called
double in C and REAL*8 in FORTRAN.
Floating-point numbers are stored in normalized form to make greatest use of the
available precision; that is, the mantissa is always in the range 1.000…00 to 1.111…11.
To illustrate this notion consider a 5-bit mantissa with the normalized value 1.1101.
If this number were denormalized to 0.0111, we would have lost two bits of precision.
A normalized number always has a leading 1. If it always has a leading 1, we know
it’s there and do not have to store it. Consequently, when dealing with IEEE single-
and double-precision floating point numbers, we do not store the leading 1 in front
of the mantissa. This technique saves a single bit of storage. The actual number
stored in the mantissa field of an IEEE 754 floating-point number is the fractional
mantissa field F. When the floating-point number is read from memory (unpacked),
the leading 1 is re-inserted. In other words, the normalized mantissa is 1.F but
we store only the F.
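A minimal C sketch of this packing, assuming double is the IEEE 754 64-bit format: copying the bits of 1.5 (binary 1.1) into a 64-bit integer shows that only the fraction after the implicit leading 1 is stored.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double x = 1.5;                 /* binary 1.1 x 2^0, so F = .1000...0 */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* reinterpret the 64 bits of the double */

    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  /* low 52 bits: the stored F */
    printf("stored fraction = 0x%013llX\n", (unsigned long long)fraction);
    /* Prints 0x8000000000000: the leading 1 of 1.1 is not stored,
       only the .1 that follows it (the top fraction bit). */
    return 0;
}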
The format of a double-precision value is: a sign bit S, an 11-bit biased exponent E, and a 52-bit fractional mantissa F.
The sign-bit, S, is the most-significant bit, bit 63, and is 1 to represent a negative
number and 0 to represent a positive number. This mechanism is, of course, the same
as sign and magnitude notation.
The exponent occupies 11 bits, from bit 52 to bit 62. In principle, this gives an exponent range of 0 to 2^11 - 1 = 2047. In practice, this is not the actual exponent range
for two reasons: first, biased representation is used and second, certain exponents
are reserved as special tokens.
Negative exponents are handled by adding a constant to all actual exponents. This constant is 1023 (i.e., 2^10 - 1). Suppose the actual exponent is 1. We add the bias
of 1023 and store 1024. When we read the biased exponent we subtract the 1023 bias
to get 1024 - 1023 = 1.
Suppose that the actual exponent is -6, a negative value. Adding the 1023 bias to
-6 gives 1017; that is, we have avoided having a negative value of the exponent.
If the actual exponent is -1000, we add the bias 1023 to get -1000 + 1023 = 23.
We cannot have a true exponent of -1023 because that would require a stored exponent
of -1023 + 1023 = 0 and that exponent value is forbidden. The reason for this is
practical. The stored exponent 0 is reserved for special applications as we shall
soon see.
The smallest stored exponent, Emin, is 1, corresponding to an actual exponent of 1 - 1023 = -1022 (i.e., a multiplier of 2^-1022).
The largest stored exponent, Emax, is 2046, corresponding to an actual exponent of 2046 - 1023 = 1023 (i.e., a multiplier of 2^1023).
The all-zeros exponent, 00000000000, does not represent 2^(0-1023) = 2^-1023. It represents
a signed zero if the mantissa is 0. Let’s say this once again: if the mantissa and
the exponent are both all zeros then the floating-point number is 0. This arrangement
makes it easy to test for the value 0.
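A small sketch of the two zeros, again assuming double is the IEEE 754 64-bit format: +0.0 is the all-zeros word, -0.0 differs only in the sign bit, and the two compare equal.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern of a double. */
static uint64_t bits_of(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

int main(void)
{
    printf("+0.0 = 0x%016llX\n", (unsigned long long)bits_of(+0.0)); /* 0x0000000000000000 */
    printf("-0.0 = 0x%016llX\n", (unsigned long long)bits_of(-0.0)); /* 0x8000000000000000 */
    printf("+0.0 == -0.0 ? %s\n", (+0.0 == -0.0) ? "yes" : "no");    /* yes: they compare equal */
    return 0;
}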
Using a biased exponent representation for floating-point numbers means that they
are lexically ordered. Lexical ordering (or lexicographical ordering) means dictionary
order; that is, in the same order that they would be stored as words in a dictionary.
In the case of floating point numbers it means that their order in terms of size
is the same order they would have if they were interpreted as integers. Consequently,
you can compare two floating point numbers using a simple integer comparator without
having to worry about exponent or mantissa. That’s nice.
However, there are exceptions to the lexical ordering. First, the ordering excludes the sign bit. Second, positive and negative zero are special cases. Third, NaNs (not-a-number values) are not included.
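A brief sketch of the ordering property for positive, finite values (the cases excluded above need separate handling): comparing the raw bit patterns as unsigned integers gives the same ordering as comparing the doubles.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern of a double. */
static uint64_t bits_of(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

int main(void)
{
    /* For positive, finite, non-NaN values, the integer order of the bit
       patterns matches the numeric order of the doubles. */
    double a = 1.5, b = 2.75;
    printf("a < b as doubles : %d\n", a < b);                    /* 1 */
    printf("a < b as integers: %d\n", bits_of(a) < bits_of(b));  /* 1 */
    return 0;
}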
If the mantissa is not zero (but the exponent is all zeros) then the value is a subnormal; that is, a number too small to be represented in normalized form in IEEE 754 double-precision format, and one that carries reduced precision.
The other special exponent is all 1s, 11111111111. If the mantissa is not zero, this represents a NaN, or not-a-number. A NaN is a value outside the floating-point system; it is produced by invalid operations (such as 0/0), and its mantissa bits can carry programmer-defined information. In other words, a NaN allows programmers to deal with non-floating-point values.
If the exponent is all 1s and the mantissa is 0, then the number is treated as infinity.
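The standard C library exposes these special cases; a short sketch using the math.h classification macros (the values are produced at run time, assuming default IEEE behaviour):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double zero = 0.0;
    double inf  = 1.0 / zero;    /* exponent all 1s, mantissa 0: infinity         */
    double qnan = zero / zero;   /* exponent all 1s, mantissa non-zero: a NaN     */
    double sub  = 4.9e-324;      /* exponent all 0s, mantissa non-zero: subnormal */

    printf("isinf(inf)     = %d\n", isinf(inf) != 0);                  /* 1 */
    printf("isnan(qnan)    = %d\n", isnan(qnan) != 0);                 /* 1 */
    printf("sub subnormal? = %d\n", fpclassify(sub) == FP_SUBNORMAL);  /* 1 */
    printf("qnan == qnan ? = %d\n", qnan == qnan);  /* 0: a NaN never compares equal to anything */
    return 0;
}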
The 52-bit fractional mantissa is stored in bits 0 to 51; for example, if the true mantissa were 1.125 (1.001 in binary), the fractional mantissa would be stored as 0010000000000000000000000000000000000000000000000000. As you can see, the leading 1 has been dropped.
Example
Consider the decimal value -42.625
The sign is negative so that S = 1.
42.625 can be converted to binary form as 101010.101
We can normalize this to 1.01010101 x 2^5.
The exponent is 5. In biased form this is 5 + 1023 = 1028 = 10000000100
The mantissa is 1.01010101 or 01010101 in fractional form.
We can now write the double-precision IEEE floating-point version of -42.625 as
1 10000000100 0101010100000000000000000000000000000000000000000000
In hexadecimal form, this is C045500000000000.
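We can check this result by unpacking the stored bits of -42.625; a minimal C sketch, assuming double is the IEEE 754 64-bit format:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double x = -42.625;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                  /* raw 64-bit pattern */

    uint64_t sign     = bits >> 63;                  /* bit 63      */
    uint64_t exponent = (bits >> 52) & 0x7FF;        /* bits 62..52 */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;   /* bits 51..0  */

    printf("hex      = 0x%016llX\n", (unsigned long long)bits);      /* 0xC045500000000000 */
    printf("sign     = %llu\n", (unsigned long long)sign);           /* 1                  */
    printf("exponent = %llu (biased), %lld (actual)\n",
           (unsigned long long)exponent, (long long)exponent - 1023); /* 1028, 5           */
    printf("fraction = 0x%013llX\n", (unsigned long long)fraction);  /* 0x5500000000000    */
    return 0;
}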
Range
The largest normalized value we can represent in 64-bit double-precision format is
1.1111111111111111111111111111111111111111111111111111 x 2^(2046-1023) = 1.111…111 x 2^1023
This is approximately equivalent to the decimal value 1.8 x 10^308. It is impossible
to specify a value greater than this. If you try to, you generate an exponent overflow
(an exponent bigger than the maximum in 11 bits) and the resulting value is stored
as the special number infinity.
The minimum normalized number is 2.2 x 10^-308. The corresponding decimal precision
is approximately 15 digits.
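These limits are available from float.h; a small sketch that also shows exponent overflow saturating to infinity:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    printf("DBL_MAX = %.17g\n", DBL_MAX);   /* about 1.7976931348623157e+308                        */
    printf("DBL_MIN = %.17g\n", DBL_MIN);   /* about 2.2250738585072014e-308 (smallest normalized)  */
    printf("DBL_DIG = %d\n",    DBL_DIG);   /* 15 decimal digits of precision                       */

    double too_big = DBL_MAX * 2.0;         /* exponent overflow */
    printf("DBL_MAX * 2 is infinite? %d\n", isinf(too_big) != 0);  /* 1 */
    return 0;
}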
This range of values can cope with the astronomical values we mentioned at the beginning
of this section. However, 15 places of decimal precision are not always sufficient
for some scientific calculations or for applications where two nearly identical numbers
are subtracted leading to a catastrophic loss of precision.
Floating-point Characteristics
Floating-point numbers are not exactly like real numbers for several reasons. Floating-point numbers suffer from floating-point error, caused by approximating a real number
to a floating-point number using a finite number of bits.
When a real number is approximated, the new value can be rounded to a finite number
of bits in several ways; for example, simple truncation, rounding up or rounding
down, rounding towards zero, and so on.
Floating-point numbers form a finite and discrete set. If you are using 64 bits to represent real values, then there can be only 2^64 different possible values. Floating-point numbers are not evenly distributed; that is, the interval between consecutive floating-point numbers varies with the size (magnitude) of the number.
A key parameter of a floating-point system is called machine epsilon, εmachine. It is defined as the smallest number such that 1 + εmachine > 1. The value of machine epsilon is determined by the size of the least-significant bit of the mantissa. In the case of double-precision arithmetic, εmachine = 2^-52 (i.e., approximately 2.2 x 10^-16).
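A short sketch using DBL_EPSILON from float.h (which is 2^-52 for IEEE 754 doubles); the second test assumes the arithmetic is actually performed in double precision rather than in a wider internal format.

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* DBL_EPSILON is the gap between 1.0 and the next representable double. */
    printf("DBL_EPSILON = %.17g\n", DBL_EPSILON);   /* 2.2204460492503131e-16, i.e. 2^-52 */

    double eps = DBL_EPSILON;
    printf("1 + eps   > 1 ? %d\n", (1.0 + eps)       > 1.0);  /* 1 */
    printf("1 + eps/2 > 1 ? %d\n", (1.0 + eps / 2.0) > 1.0);  /* 0: eps/2 is lost in rounding */
    return 0;
}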
In double-precision floating-point arithmetic, the smallest subnormal number is 2^-1074, which is approximately 4.9 x 10^-324. There is only one number smaller than this, and that is zero. The smallest subnormal is the smallest mantissa with the smallest exponent, which is
0.0000000000000000000000000000000000000000000000000001 x 2^-1022 = 2^-52 x 2^-1022 = 2^-1074.
This value is, of course, machine epsilon multiplied by two raised to the most negative exponent; that is, εmachine x 2^Emin.
Note that the range of double-precision floating-point values quoted by high-level language manuals is 4.94065645841246544 x 10^-324 to 1.79769313486231570 x 10^308.
This range covers the smallest representable number to the largest number prior to
exponent overflow.
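The smallest subnormal corresponds to the bit pattern in which only the least-significant fraction bit is set; a minimal sketch (assuming double is the IEEE 754 64-bit format):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* The bit pattern 0x0000000000000001: exponent field 0, fraction 1,
       i.e. 2^-52 x 2^-1022 = 2^-1074, the smallest subnormal. */
    uint64_t bits = 1;
    double smallest;
    memcpy(&smallest, &bits, sizeof smallest);

    printf("smallest subnormal = %.17g\n", smallest);      /* 4.9406564584124654e-324      */
    printf("half of it         = %.17g\n", smallest / 2);  /* 0: nothing smaller than zero */
    return 0;
}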
Demonstration of Floating-point Distribution
It would be nice to plot each of the 2^64 discrete double-precision values on a graph, but that would take a rather large sheet of paper. Let’s take a tiny system for demonstration and assume that we have a floating-point number represented by a 2-bit mantissa and a 2-bit exponent (we’ll assume positive numbers only, actual mantissas rather than fractional mantissas, and exponents biased by 2; we’ll forget about NaNs and special exponents but will regard a 0 exponent as representing zero).
A 4-bit floating-point number provides the following exponents:
Binary value Exponent Multiplier
00 special special
01 -1 x 0.5
10 0 x 1.0
11 1 x 2.0
For example, if the binary value of the exponent is 11, the actual exponent is 3 - 2 = 1 and the mantissa is multiplied by 2^1 = 2.
The mantissas are
Binary value Mantissa
00 0.0 (zero)
01 0.5 (unnormalized)
10 1.0
11 1.5
Applying each exponent to each mantissa, the valid floating-point numbers are:
exponent 00: 0, 0, 0, 0 (all zero); exponent 01 (x 0.5): 0, 0.25, 0.50, 0.75; exponent 10 (x 1.0): 0, 0.50, 1.0, 1.5; exponent 11 (x 2.0): 0, 1.0, 2.0, 3.0
The values produced by the unnormalized mantissa 0.5 (i.e., the leading bit is 0 rather than 1) are themselves unnormalized. The following table defines all possible values of the 4-bit code.
Code Value Notes
0000 0 zero (exponent is 0)
0001 0 zero (exponent is 0)
0010 0 zero (exponent is 0)
0011 0 zero (exponent is 0)
0100 0 unnormalized, zero mantissa
0101 0.25 unnormalized
0110 0.50
0111 0.75
1000 0 unnormalized, zero mantissa
1001 0.50 unnormalized
1010 1.00
1011 1.50
1100 0 unnormalized, zero mantissa
1101 1.00 unnormalized
1110 2.00
1111 3.00
Note that the interval between consecutive values is not constant and depends on the exponent (because the minimum step in the mantissa is scaled by the exponent’s multiplier). Note also that some values are not unique; for example, 0.50 can be represented by 1.0 x 2^-1 or by 0.5 x 2^0. However, the 0.5 x 2^0 value is unnormalized and has only one bit of precision.
The value 0.25 is the smallest non-zero number that can be represented by this system; it is the counterpart of the smallest subnormal (i.e., εmachine multiplied by the smallest exponent multiplier).
The following figure shows all the positive floating point values that can be represented
by this system as blue dots. The 0.25 is in red to indicate that it is not normalized
because it is between zero and the smallest normalized number.
This figure illustrates zero, the smallest number, the maximum number and demonstrates
that the interval between consecutive numbers is not constant.
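The toy format is small enough to enumerate in a few lines of C; a sketch that decodes each 4-bit code (exponent field first, then mantissa field) and reproduces the table above:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Toy format: code = eemm. The exponent ee is biased by 2 and ee = 00
       is reserved to mean zero; mm is an actual 2-bit mantissa (m.m). */
    for (int code = 0; code < 16; code++) {
        int ee = code >> 2;                 /* 2-bit exponent field */
        int mm = code & 3;                  /* 2-bit mantissa field */
        double mantissa = mm / 2.0;         /* 00->0.0, 01->0.5, 10->1.0, 11->1.5 */
        double value = (ee == 0) ? 0.0 : ldexp(mantissa, ee - 2);  /* mantissa * 2^(ee-2) */

        printf("%d%d%d%d -> %.2f\n",
               (code >> 3) & 1, (code >> 2) & 1, (code >> 1) & 1, code & 1, value);
    }
    return 0;
}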
Further Comments
The mantissa of a floating-point number can be extended by the addition of one or
more guard bits in order to preserve accuracy during rounding. Guard bits are not
stored as part of a floating-point number; they are used during internal operations.
In other words, a floating-point number can be expanded internally in the ALU during
operations before being repacked for storage.
The greatest problems with floating-point arithmetic occur when two similar numbers
are subtracted or two widely differing numbers are added. Consider first subtraction.
We’ll use two normalized floating-point values with 6-bit mantissas: a = 1.00111
and b = 1.00110. If we subtract b from a, we get 1.00111 - 1.00110 = 0.00001. This
has only one bit of precision which may itself be wrong due to previous rounding.
Similarly, suppose we add 1.00111 x 2^5 to 1.11011 x 2^1. In order to add these numbers we have to equalize the exponents to get:
A = 1.00111 x 2^5
B = 0.000111011 x 2^5
The sum A + B is 1.010101011 x 2^5; when we truncate this mantissa to six bits we get 1.01010. We have lost the contribution made by the least-significant bits of B.
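A short C sketch of both effects in double precision: subtracting nearly equal values (cancellation) and adding values of very different magnitude (absorption of the smaller operand).

#include <stdio.h>

int main(void)
{
    /* Cancellation: the operands agree in their leading digits, so the
       difference retains only a few significant bits of useful information. */
    double a = 1.2345678901234567;
    double b = 1.2345678901234560;
    printf("a - b = %.17g\n", a - b);   /* only the leading digit or so is meaningful */

    /* Absorption: the small operand falls below the precision of the large one. */
    double big   = 1.0e16;
    double small = 1.0;
    printf("(big + small) - big = %.17g\n", (big + small) - big);  /* 0: the 1 has been absorbed */
    return 0;
}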