Floating-point Numbers

The IEEE-754 Standard

In order to make computation less sensitive to the idiosyncrasies of different hardware architectures, the IEEE promulgated a standard in 1985. It is not adhered to universally; in fact, many architectures that claim to adhere to it do not do so perfectly. One sore point is the requirement for gradual underflow, in which numbers too small for normal representation become denormal numbers. Because it is computationally expensive to deal with such numbers, some manufacturers prefer sudden underflow, in which the underflowed value is set to zero.

The standard provides for two principal types of precision, single and double. There is a third, extended, but it is not normally available under C. Constants relating to floating-point values are found in float.h.

IEEE Standard 754 Floating-Point Types

type      exponent   max. exponent   fraction   precision
           (bits)      (base 10)      (bits)     (digits)
Single        8            38            23          6
Double       11           308            52         15
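
These limits appear as constants in float.h. The following minimal sketch prints the ones behind the table; the commented values are the usual IEEE-754 ones (note that FLT_MANT_DIG and DBL_MANT_DIG count the implicit leading bit, so they read one higher than the stored fraction bits above):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_MAX_10_EXP = %d\n", FLT_MAX_10_EXP);   /* 38  */
    printf("FLT_MANT_DIG   = %d\n", FLT_MANT_DIG);     /* 24 = 23 + implicit bit */
    printf("FLT_DIG        = %d\n", FLT_DIG);          /* 6   */
    printf("FLT_EPSILON    = %g\n", FLT_EPSILON);
    printf("DBL_MAX_10_EXP = %d\n", DBL_MAX_10_EXP);   /* 308 */
    printf("DBL_MANT_DIG   = %d\n", DBL_MANT_DIG);     /* 53 = 52 + implicit bit */
    printf("DBL_DIG        = %d\n", DBL_DIG);          /* 15  */
    printf("DBL_EPSILON    = %g\n", DBL_EPSILON);
    return 0;
}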

The standard provides for signed infinities and for both quiet and signaling NaNs (Not a Number). For example, taking the logarithm of zero returns negative infinity, while the logarithm of a negative number generally returns a quiet NaN. Most operations on erroneous arguments produce quiet NaNs. A signaling NaN causes a trap (interrupt) if used as an argument, while a quiet NaN generally propagates through a computation without raising an exception condition.
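
A minimal illustration, assuming IEEE-754 defaults with traps disabled (log() may also set errno on these arguments):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double zero = 0.0;
    printf("log(0.0)  = %g\n", log(zero));    /* -infinity  */
    printf("log(-1.0) = %g\n", log(-1.0));    /* quiet NaN  */
    printf("0.0/0.0   = %g\n", zero / zero);  /* quiet NaN  */
    return 0;
}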

Example

Program for examining the properties of floating-point numbers.

Download float.zip to obtain the source, executable, and sample output.

source: float.c
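
float.c itself is included in the archive; the sketch below shows one way such a sign/exponent/fraction dump can be produced (the dump() helper is illustrative, not the actual source; it assumes 32-bit IEEE-754 floats and prints the raw, biased exponent, as the output does):

#include <stdio.h>
#include <string.h>

static void dump(const char *label, float f)
{
    unsigned long bits = 0;
    /* view the float's bits; assumes long is 32 bits wide,
       or a little-endian layout if it is wider */
    memcpy(&bits, &f, sizeof f);
    printf("%-12s %x %2lx %6lx %21.14g\n", label,
           (unsigned)((bits >> 31) & 1),   /* sign bit               */
           (bits >> 23) & 0xffUL,          /* 8-bit biased exponent  */
           bits & 0x7fffffUL,              /* 23-bit fraction        */
           f);
}

int main(void)
{
    dump("number 1:", 1.0f);
    dump("number 2:", 2.0f);
    dump("number 1/2:", 0.5f);
    dump("number 1/10:", 0.1f);
    return 0;
}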

sample output

number  0:   0  0      0                     0
number  1:   0 7f      0                     1
number  2:   0 80      0                     2
number  3:   0 80 400000                     3
number  4:   0 81      0                     4
number  5:   0 81 200000                     5
number  6:   0 81 400000                     6
number  7:   0 81 600000                     7
number  8:   0 82      0                     8
number  9:   0 82 100000                     9
number 10:   0 82 200000                    10
number 11:   0 82 300000                    11
number 12:   0 82 400000                    12
number 13:   0 82 500000                    13
number 14:   0 82 600000                    14
number 15:   0 82 700000                    15
number 16:   0 83      0                    16
number 17:   0 83  80000                    17
number 18:   0 83 100000                    18
number 19:   0 83 180000                    19
number 20:   0 83 200000                    20
number 21:   0 83 280000                    21
number 22:   0 83 300000                    22
number 23:   0 83 380000                    23
number 24:   0 83 400000                    24
number 25:   0 83 480000                    25
number 26:   0 83 500000                    26
number 27:   0 83 580000                    27
number 28:   0 83 600000                    28
number 29:   0 83 680000                    29
number 30:   0 83 700000                    30
number 31:   0 83 780000                    31
number 32:   0 84      0                    32


number 1/2: 0 7e      0                   0.5
number 1/4: 0 7d      0                  0.25
number 1/10:  0 7b 4ccccd      0.10000000149012
FLT_EPSILON  0 68      0  1.1920928955078e-007


0.1 added 10 times:        error (1.19e-007) 0 7f      1       1.0000001192093
0.01 added 100 times:      error (6.56e-007) 0 7e 7ffff5      0.99999934434891
0.001 added 1,000 times:   error (9.30e-006) 0 7e 7fff64      0.99999070167542
0.0001 added 10,000 times: error (5.35e-005) 0 7f    1c1        1.000053524971


0  0      0                     0
0  1      0  1.1754943508223e-038
0  2      0  2.3509887016446e-038
0  3      0  4.7019774032892e-038


0 ff      0                1.#INF
0 ff 7fffff               1.#QNAN
0 fe 7fffff  3.4028234663853e+038
0 fd 7fffff  1.7014117331926e+038
0 fc 7fffff  8.5070586659632e+037
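
The first of the accumulation experiments above can be reproduced with a loop like the following minimal sketch (output formatting differs from float.c, and the printed error assumes IEEE-754 single precision):

#include <stdio.h>

int main(void)
{
    float sum = 0.0f;
    int i;

    /* 0.1 has no finite binary representation, so each addition
       rounds, and the rounding errors accumulate */
    for (i = 0; i < 10; i++)
        sum += 0.1f;

    printf("0.1 added 10 times: %.14g  error %.3g\n",
           sum, sum - 1.0);
    return 0;
}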

Cost of Operations

In optimizing code, it is necessary to understand the relative cost of operations. Table 2 compares the costs of multiplication, division, and common function evaluations for the Motorola, Intel, and Cyrix math co-processors. Note that this table is not intended to compare the processors with one another; the timing for floating-point addition has been normalized to 1.0 for each processor. Timings are approximate and depend on memory access type, cache hits, and other hardware considerations. Furthermore, math co-processor designs continue to evolve, and these timing statistics can be expected to change.

Relative Times for Floating-Point Operations

Operation      Motorola 68881   Cyrix   Intel 486   Intel 386
Add/Subtract        1.0          1.0       1.0         1.0
Multiply            1.4          1.3       1.6         1.7
Divide              2.0          2.0       7.3         3.0
Square root         2.1          2.0       8.5         4.0
sine               11.4          4.2      24.0        15.0
cosine              7.7          5.8      24.0        17.0
tangent             9.3          5.0      24.0        10.5
arctan              7.9          5.5      29.0        13.0
logarithm          11.4          5.8      31.0        15.4

Note that for most processors multiplication is not much more expensive than addition, and division only somewhat more so. Thus, replacing 2.0*x with x+x saves a bit of time, but trying the same thing with 3.0*x would not be a good idea. There is a trick often suggested for multiplying two complex numbers, sketched below: it saves one multiplication at the cost of three additions and subtractions and the storage of two intermediate results. On newer processors this produces a loss, not a savings, in time. To further complicate matters, RISC processors are now appearing with parallel units for addition and multiplication. On such hardware it might be desirable to balance the number of multiplies and adds, to keep both units occupied.
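
For concreteness, here is a minimal sketch of that trick (the complexd type and the function names cmul4 and cmul3 are illustrative, not from any particular library):

#include <stdio.h>

typedef struct { double re, im; } complexd;

/* standard form: (a+bi)(c+di) = (ac-bd) + (ad+bc)i
   4 multiplies, 2 add/subtracts */
static complexd cmul4(complexd x, complexd y)
{
    complexd z;
    z.re = x.re * y.re - x.im * y.im;
    z.im = x.re * y.im + x.im * y.re;
    return z;
}

/* three-multiply form: one multiply traded for three add/subtracts
   3 multiplies, 5 add/subtracts */
static complexd cmul3(complexd x, complexd y)
{
    complexd z;
    double t1 = y.re * (x.re + x.im);   /* ac + bc */
    double t2 = x.re * (y.im - y.re);   /* ad - ac */
    double t3 = x.im * (y.re + y.im);   /* bc + bd */
    z.re = t1 - t3;                     /* ac - bd */
    z.im = t1 + t2;                     /* ad + bc */
    return z;
}

int main(void)
{
    complexd a = { 1.0, 2.0 }, b = { 3.0, 4.0 };
    complexd p = cmul4(a, b), q = cmul3(a, b);
    printf("4-mult: %g%+gi\n", p.re, p.im);   /* -5+10i */
    printf("3-mult: %g%+gi\n", q.re, q.im);   /* -5+10i */
    return 0;
}

With a multiply costing roughly 1.4 to 1.7 adds, as in the table, the one multiply saved is worth less than the three add/subtracts spent, which is why the trick loses on these processors.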


Maintained by John Loomis, last updated Jan 22, 1997