This article looks at floating-
Review of Integers
The natural integers 1, 2, 3, … are called natural because they are used to count objects; for example, there were seven stars in the consternation Ursa Major (Big Dipper, or Plough) long before humans came along to count them.
Computers store information in binary form by using two-
Computers with 8-
We can extend the range of integers into the negative region. Since we can’t increase
the total number of integers represented by m bits (i.e., 2m unique values), we typically
use two’s complement notation to represent the signed range -
Unfortunately, integers can’t be used to represent fractional values like ½ or or √2. Such numbers are called real numbers and these have a fractional part (i.e., a part less than 1). Moreover, these numbers may be incapable of representation with a finite number of digits using positional notation (for example, you can’t represent ⅓ exactly as a decimal fraction – you can write 0.333 or 0.333333333333333333, but you can never exactly express ⅓). This inability to represent all real numbers exactly in decimal or binary form has implications for the accuracy of calculations. In particular, it has implications for the accuracy of chained calculation involving sequences of arithmetic operations. Ultimately, this property of numbers can mean that the results of some calculations are effectively meaningless. The computer scientist has to appreciate this fact.
As well as representing fractional or real values, we have to represent a range of values from the very large to the very small. Consider big numbers. The population of the Earth is about 7,200,000,000 which is a moderately large number. If you want a truly big number, then we can use the mass of the sun which is about 198,892,000,000,000,000,000,000,000,000 kg. In conventional arithmetic, we represent such large numbers by means of scientific notation where we separate out precision and magnitude by writing numbers of the form m x 10e, where m is a mantissa and e an exponent. In this case, the sun’s weight can be written 1.98892 x 1030 kg. This notation requires fewer digits than integer notation, but also provides fewer significant digits; for example, the weight is represented by 1.98892 with provides a precision of one in 106.
As well as small numbers, we have to deal with tiny numbers; for example, the chance
of a US citizen being hit by lightning in any given year is about 1 in 500,000 or
0.000,005 or 5.0 x 10-
So, we use scientific notation to represent large and small numbers by writing a number in the form m x 10e. Typically, the mantissa is expressed in the range 1.000…000 to 9.999…999 when using scientific notation.
It was natural that our scientific notation would be converted into an appropriate
binary form for use by computers. This form is called binary floating-
If you choose a floating-
The actual situation (i.e., trade-
For a long time each computer manufacturer implemented their own floating-
A great step forward took place in the 1980s when the IEEE produced a standard for
IEEE 754 defines:
Double Precision IEEE 754 Format
The IEEE double-
A normalized number always has a leading 1. If it always has a leading 1, we know
it’s there and do not have to store it. Consequently, when dealing with IEEE single-
The format of a double-
The exponent occupies 11 bits from bit 52 to 62. In principle, this gives an exponent
range of 0 to 211 -
Negative exponents are handed by adding a constant to all actual exponents. This
constant is 1023 (i.e., 210 -
Suppose that the actual exponent is -
We cannot have a true exponent of -
The smallest stored exponent Emin is 1 corresponding to an actual exponent of 21-
The largest stored exponent is Emax is 2046 corresponding to an actual exponent of
The all zeros exponent, 00000000000, does not represent 20-
Using a biased exponent representation for floating-
However, there are exceptions to the lexical ordering. First, the ordering excludes
If the mantissa is not zero (but the exponent is) then the value is a subnormal;
that is, a number too small to be represented precisely in IEEE 754 double-
The other special exponent is all 1s, 11111111111. If the mantissa is not zero, this
represents a NaN or not-
If the exponent is all 1s and the mantissa is 0, then the number is treated as infinity.
Consider the decimal value -
The sign is negative so that S = 1.
42.625 can be converted to binary form as 101010.101
We can normalize this to 1.01010101 x 25.
The exponent is 5. In biased form this is 5 + 1023 = 1028 = 10000000011
The mantissa is 1.01010101 or 01010101 in fractional form.
We can now write the double-
1 10000000011 0101010100000000000000000000000000000000000000000000
In hexadecimal form, this is C035500000000000.
The largest normalized value we can represent in 64-
1.1111111111111111111111111111111111111111111111111111 x 22046-
This is approximately equivalent to the decimal value 1.8 x 10308. It is impossible to specify a value greater than this. If you try to, you generate an exponent overflow (an exponent bigger than the maximum in 11 bits) and the resulting value is stored as the special number infinity.
The minimum normalized number is 2.2 x 10-
This range of values can cope with the astronomical values we mentioned at the beginning of this section. However, 15 places of decimal precision are not always sufficient for some scientific calculations or for applications where two nearly identical numbers are subtracted leading to a catastrophic loss of precision.
When a real number is approximated, the new value can be rounded to a finite number of bits in several ways; for example, simple truncation, rounding up or rounding down, rounding towards zero, and so on.
A key parameter of a floating-
0.000000000000000000000000000000000000000000000000001 x 2-
This value is, of course machine epsilon multiplied by the most negative exponent; that is, machine x 2Emin
Note that the range of double-
Demonstration of Floating-
If would be nice to plot each of the 264 discrete double-
Binary value Exponent Multiplier
00 special special
10 0 x 1.0
11 1 x 2.0
For example, if the binary value of the exponent is 11, the actual exponent is 3
The mantissas are
Binary value Mantissa
00 0.0 (zero)
01 0.5 (unnormalized)
Applying exponents to mantissas, the valid floating-
0, 0.25, 0.50, 0.75, 0, 0.50, 1.0, 1.5, 0, 1.0, 2.0, 3.0, 0, 2.0, 4.0, 6.0
The values in red represent unnormalized values (i.e., the leading bit is 0 rather
than 1). The following table defines all possible values of the 4-
Code Value Notes
0000 0 zero exponent is 0
0001 0 zero exponent is 0
0010 0 zero exponent is 0
0011 0 zero exponent is 0
0100 0 unnormalized zero mantissa
0101 0.25 unnormalized
1000 0 unnormalized zero mantissa
1001 0.50 unnormalized
1100 0 unnormalized zero mantissa
1101 1.00 unnormalized
Note that the interval between consecutive values is not constant and depends on
the exponent (because the minimum step is multiplied by the exponent). Note also
that some values are not unique; for example 0.50 can be represented by 1.0 x 2-
The value 0.25 is the smallest non-
The following figure shows all the positive floating point values that can be represented by this system as blue dots. The 0.25 is in red to indicate that it is not normalized because it is between zero and the smallest normalized number.
This figure illustrates zero, the smallest number, the maximum number and demonstrates that the interval between consecutive numbers is not constant.
The mantissa of a floating-
The greatest problems with floating-
Similarly, suppose we add 1.00111 x 25 to 1.11011 x 21. In order to add these numbers we have to equalize exponents to get:
A = 1.00111 x 25
B = 0.00011011 x 25
The sum A+B is 1.01010011 when we truncate this mantissa to six bits we get 1.01010.
We have lost the contribution made by the three least-