In order to make computation less sensitive to the idiosyncrasies of different hardware architectures, the IEEE promulgated a floating-point standard in 1985. It is not adhered to universally; in fact, many architectures that claim to adhere to it do not do so perfectly. One sore point is the requirement for gradual underflow, in which numbers too small for normal representation become denormal numbers. Because such numbers are computationally expensive to handle, some manufacturers prefer sudden underflow, in which the underflowed value is set to zero.
The standard provides for two principal types of precision, single and double. There is a third, extended, but it is not normally available under C. Constants describing floating-point values are found in float.h.
IEEE Standard 754 Floating-Point Types
| type | exponent size (bits) | max. decimal exponent | mantissa size (bits) | precision (decimal digits) |
|---|---|---|---|---|
| Single | 8 | 38 | 23 | 6 |
| Double | 11 | 308 | 52 | 15 |
The standard provides for signed infinities and for both quiet and signaling NaNs (Not a Number). For example, taking the logarithm of zero returns negative infinity, while the logarithm of a negative number generally returns a quiet NaN. Most operations on erroneous arguments produce quiet NaNs. A signaling NaN causes a trap (interrupt) if used as an argument, while a quiet NaN generally propagates through a computation without raising an exception condition.
Program for examining the properties of floating-point numbers.
Download float.zip to obtain the source, executable, and sample output.
source: float.c
Sample output (sign, biased exponent, and mantissa fields in hexadecimal, followed by the value):

```
number 0:    0 0  0      0
number 1:    0 7f 0      1
number 2:    0 80 0      2
number 3:    0 80 400000 3
number 4:    0 81 0      4
number 5:    0 81 200000 5
number 6:    0 81 400000 6
number 7:    0 81 600000 7
number 8:    0 82 0      8
number 9:    0 82 100000 9
number 10:   0 82 200000 10
number 11:   0 82 300000 11
number 12:   0 82 400000 12
number 13:   0 82 500000 13
number 14:   0 82 600000 14
number 15:   0 82 700000 15
number 16:   0 83 0      16
number 17:   0 83 80000  17
number 18:   0 83 100000 18
number 19:   0 83 180000 19
number 20:   0 83 200000 20
number 21:   0 83 280000 21
number 22:   0 83 300000 22
number 23:   0 83 380000 23
number 24:   0 83 400000 24
number 25:   0 83 480000 25
number 26:   0 83 500000 26
number 27:   0 83 580000 27
number 28:   0 83 600000 28
number 29:   0 83 680000 29
number 30:   0 83 700000 30
number 31:   0 83 780000 31
number 32:   0 84 0      32
number 1/2:  0 7e 0      0.5
number 1/4:  0 7d 0      0.25
number 1/10: 0 7b 4ccccd 0.10000000149012
FLT_EPSILON  0 68 0      1.1920928955078e-007
0.1 added 10 times:       error (1.19e-007) 0 7f 1      1.0000001192093
0.01 added 100 times:     error (6.56e-007) 0 7e 7ffff5 0.99999934434891
0.001 added 1,000 times:  error (9.30e-006) 0 7e 7fff64 0.99999070167542
0.0001 added 10,000 times: error (5.35e-005) 0 7f 1c1   1.000053524971
0 0  0      0
0 1  0      1.1754943508223e-038
0 2  0      2.3509887016446e-038
0 3  0      4.7019774032892e-038
0 ff 0      1.#INF
0 ff 7fffff 1.#QNAN
0 fe 7fffff 3.4028234663853e+038
0 fd 7fffff 1.7014117331926e+038
0 fc 7fffff 8.5070586659632e+037
```
In optimizing code, it is necessary to understand the cost of operations. Table 2 compares the costs of multiplication, division, and common function evaluation for the Motorola, Intel, and Cyrix math co-processors. Note that this table is not intended to compare the processors; the timing for floating-point addition has been normalized to 1.0 for each processor. Timings are approximate and depend on memory access type, cache hits, and other hardware considerations. Furthermore, math co-processor designs continue to evolve, and these timing statistics can be expected to change.
Relative Times for Floating-Point Operations
| Operation | Motorola 68881 | Cyrix | Intel 486 | Intel 386 |
|---|---|---|---|---|
Add/Subtract | 1.0 | 1.0 | 1.0 | 1.0 |
Multiply | 1.4 | 1.3 | 1.6 | 1.7 |
Divide | 2.0 | 2.0 | 7.3 | 3.0 |
Square root | 2.1 | 2.0 | 8.5 | 4.0 |
sine | 11.4 | 4.2 | 24.0 | 15.0 |
cosine | 7.7 | 5.8 | 24.0 | 17.0 |
tangent | 9.3 | 5.0 | 24.0 | 10.5 |
arctan | 7.9 | 5.5 | 29.0 | 13.0 |
logarithm | 11.4 | 5.8 | 31.0 | 15.4 |
Note that for most processors multiplication is not much more expensive than addition, and division only somewhat more so. Thus replacing 2.0*x by x+x saves a bit of time, but trying the same thing with 3.0*x would not be a good idea. There is an often-suggested trick for multiplying two complex numbers: it trades one multiplication for three extra additions and subtractions and the storage of two intermediate results. On newer processors this produces a loss, not a savings, in time. To further complicate matters, RISC processors are now appearing with parallel units for addition and multiplication. On such hardware it might be desirable to balance the number of multiplies and adds, to keep both units occupied.
Maintained by John Loomis, last updated Jan 22, 1997