This chapter describes how numbers are represented, stored and processed in digital computers.
We will see that positive integers are conceptually easy to handle in a computer; extend to include negative numbers and things get more difficult; finally, extending to the real number system introduces a further complexity.
We will also see how other entities, such as text characters, are stored in computers. Essentially, computers can store only numbers, so anything else must be coded as a number; and, for these codes to be of any use, the coding scheme must be agreed by those who wish to exchange information - within the same computer and across different computers.
We have two reasons to introduce sets:
Informally, a set is a collection of objects. The objects are called elements, or ``members'' of the set.
The statement `x is a member of X' is written x ∈ X. `x ∈ X' is a proposition; a proposition is a statement that can be either affirmed, `is true', or denied, `is false'. The truth-values, {true, false}, of such propositions may be manipulated by the propositional calculus (aka Boolean algebra).
There are two means of defining a set:
{1, 2, 3, 4, 5, 6}, the base set of possible dice counts,
The notation for set comprehension is given by eqn. 2.1: {x ∈ U : P(x)}, the set of all elements x of U for which the proposition P(x) is true.
Often we describe sets in terms of some universal set - a set to which all possible entities belong, and, as above, we then define other sets using some rule.
(Compare the type boolean in Java.)
The closed interval: [a, b] = {x : a ≤ x ≤ b}, which includes the endpoints.
The open interval: (a, b) = {x : a < x < b}, which excludes them.
The power set of a set X, written P(X), is the set of all subsets of X.
[Here we use the term word for the contents of a location or a register; a word is usually a collection of one or more bytes, interpreted as a unit, e.g. a Java int is four bytes, 32 bits].
Because of finite data word size, when we represent numbers on a digital computer, we normally need to approximate. It should be clear that approximation of the natural and integer numbers is not such a big problem; you just cut off at some sufficiently large negative and positive numbers, say -Nm and +Np, to replace -∞ and +∞ as the bounds.
Rationals could be handled similarly - as a pair of integers - but they are rarely used for general purpose arithmetic.
Real numbers are the big problem - to represent them completely requires an infinity of digits. Note: between any two real numbers, no matter how close, there is an infinite set of real numbers - the reals form a continuum. However, in a digital computer they must be represented by selected examples - approximations; thus, on a very limited system, we might have 2.5, 2.55, 2.3, ..., and in such a system any real number sufficiently close to 2.5 would be approximated by 2.5.
In summary:
For an encyclopedic account of this subject see [Knuth, 1997].
Digital computers use binary (base 2) as their `native' representation; however, in addition to binary, and decimal, it is also useful to discuss hexadecimal (base 16) and octal (base 8) representations.
Our everyday representation of numbers is decimal; it uses the 10 symbols 0, 1, ..., 9 - digits. 10 is its base or radix.
Our decimal number system is a positional system: the meaning of any digit is determined by its position in the string of digits that represent the number. Thus: 123 = 1 × 10² + 2 × 10¹ + 3 × 10⁰.
In general, given a base b, and number of digits p, the number is written:
d_{p-1} d_{p-2} ... d_1 d_0
its value is:
d_{p-1} × b^{p-1} + d_{p-2} × b^{p-2} + ... + d_1 × b + d_0.
The leftmost digit, d_{p-1}, is the most significant digit; the rightmost, d_0, is the least significant digit.
Base 2 requires just two symbols, {0, 1}, but the pattern remains the same as for decimal. 0 and 1 are termed binary digits - bits.
Example. 10110₂ = 1×16 + 0×8 + 1×4 + 1×2 + 0×1 = 22₁₀. [Subscript: x₁₀ signifies that x is decimal, y₂ signifies that y is binary].
Notation: In books, and here, you will find a range of notations used. Thus: 10110₂, 10110b and 0b10110 may all denote the same binary number, and 16₁₆, 16H and 0x16 the same hexadecimal number; this is by no means an exhaustive list of possible notations.
The example of 10110₂ = 22₁₀, above, shows how binary-to-decimal conversion is done: simply add the powers of two corresponding to the digits that are 1. Example. Convert 22₁₀ to binary.
Method 1: successively divide by 2 until you get to 0; the remainders (r), read backwards, are the binary digits: 22 ÷ 2 = 11 r 0; 11 ÷ 2 = 5 r 1; 5 ÷ 2 = 2 r 1; 2 ÷ 2 = 1 r 0; 1 ÷ 2 = 0 r 1; reading backwards gives 10110₂.
To check, convert back to decimal: 16 + 4 + 2 = 22₁₀.
Method 2: Same principle as method 1, but easier to do in your head. First you must remember the powers of 2: 1, 2, 4, 8, 16, 32, 64, 128, .... Then pick out the largest power not exceeding the number and subtract, repeating until you reach 0: 22 = 16 + 4 + 2 = 10110₂.
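Method 1 can be sketched in Java (my own illustrative helper; the class and method names are not from the text):

```java
// Convert a non-negative decimal integer to its binary digit string
// using Method 1: repeatedly divide by 2 and read the remainders backwards.
public class ToBinary {
    public static String toBinary(int n) {
        if (n == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            sb.append(n % 2);   // remainder is the next binary digit
            n /= 2;
        }
        return sb.reverse().toString(); // remainders read backwards
    }

    public static void main(String[] args) {
        System.out.println(toBinary(22)); // prints 10110
    }
}
```

Java's built-in Integer.toBinaryString does the same job in practice; the sketch merely makes the divide-and-remainder principle explicit.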
Hexadecimal is base 16: the sixteen symbols used are {0, 1, ..., 9, A, B, C, D, E, F}. A table of the values is shown in Table 2.1. For completeness, we also give binary and octal (base 8).
When dealing with computers at a low level it is almost essential to be able to work with hexadecimal ("hex") or octal; hex is more common now, but it may be difficult to memorise the symbol, number, bit-pattern table above. If you need to use hex in an examination or otherwise, make up the table on a sheet of paper - it takes only a few seconds.
Hex codes a group of four bits with a single symbol; octal, with symbols {0 ... 7}, codes a group of three bits.
Example. Convert 22₁₀ to hex.
For a start, convert this to binary, i.e. 10110₂. Then, starting from the right (least significant digit), divide into groups of four, 0001 0110; then use Table 2.1:
0001 0110₂ = 16₁₆
Example. Convert 25₁₀ to octal. Ans. binary: 11001₂; group into threes starting from the right: 011 001. This is 31₈.
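Conversion to hex or octal can also be done by repeated division, exactly as for binary; the following Java sketch (hypothetical names, not from the text) generalises Method 1 to any base up to 16:

```java
public class Regroup {
    // Digit symbols for bases up to 16.
    static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

    // Convert a non-negative integer to a digit string in the given base
    // (2, 8 or 16) by repeated division - the same principle as for binary.
    public static String toBase(int n, int base) {
        if (n == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            sb.append(DIGITS[n % base]);
            n /= base;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBase(22, 16)); // prints 16  (hex)
        System.out.println(toBase(25, 8));  // prints 31  (octal)
    }
}
```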
Q. Why do real programmers (who, at least in the old days, always worked in octal) confuse Halloween and Christmas? A. Because Dec 25 = Oct 31 - ha, ha!
Assuming you do decimal addition as follows: (19 + 22): 9 + 2 = 11, i.e. 1 and carry 1; then 1 + 1 + 2 = 4; Answer: 41.
Binary addition (and addition in other bases too) follows the same pattern - only the addition of single digits and the carry rule need to be amended. Thus: 1 + 1 = 10, i.e. 0 and carry 1; 1 + 1 + 1 = 11, i.e. 1 and carry 1; 0 + 1 = 1 (no carry this time).
Example:
Check: 43₁₆ = 4 × 16 + 3 = 64 + 3 = 67₁₀.
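Column-by-column binary addition with carry, as described above, can be sketched in Java (illustrative code, my own names):

```java
public class BinaryAdd {
    // Add two binary digit strings using the single-digit rules and
    // carry propagation described above.
    public static String add(String a, String b) {
        StringBuilder sb = new StringBuilder();
        int i = a.length() - 1, j = b.length() - 1, carry = 0;
        while (i >= 0 || j >= 0 || carry != 0) {
            int sum = carry;
            if (i >= 0) sum += a.charAt(i--) - '0';
            if (j >= 0) sum += b.charAt(j--) - '0';
            sb.append(sum % 2);   // result digit for this column
            carry = sum / 2;      // carry into the next column
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(add("10110", "101011")); // 22 + 43 = 65 -> 1000001
    }
}
```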
The three primary methods used to represent signed binary numbers are: sign-magnitude, twos-complement, and biased (also called excess). Of these, twos-complement is by far the most common.
In a biased system, a fixed bias is added - such that the sum of the bias and the number being represented will always be non-negative; thus, what would normally be interpreted as 0 represents the most negative number.
Twos-complement makes adding and subtracting particularly easy to implement; in a twos-complement system, negating a number is done by taking its twos-complement. Computing the twos-complement of a binary number involves two steps: (1) invert each bit (forming the ones-complement); (2) add 1, ignoring any carry out of the most significant bit.
Thus, to negate 0011 (3): invert to get 1100, then add 1 to get 1101, which represents -3.
Example: In a four bit number system, represent -3 in each of the formats.
Table 2.2 shows a comparison of the three methods of representing signed numbers.
Note (*): In twos-complement and biased, there is only one zero!
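The invert-and-add-1 rule, restricted to our four-bit system, can be sketched in Java (my own helper names, not from the text):

```java
public class TwosComp {
    // Twos-complement of a 4-bit value: invert the bits, add 1, keep 4 bits
    // (any carry out of the most significant bit is discarded by the mask).
    public static int negate4(int x) {
        return (~x + 1) & 0xF;
    }

    // Interpret a 4-bit pattern as a signed twos-complement value:
    // if the top bit is set, the pattern represents bits - 16.
    public static int value4(int bits) {
        return (bits & 0x8) != 0 ? bits - 16 : bits;
    }

    public static void main(String[] args) {
        int minus3 = negate4(0b0011);                   // negate +3
        System.out.println(Integer.toBinaryString(minus3)); // prints 1101
        System.out.println(value4(minus3));                 // prints -3
    }
}
```

Subtraction then needs no new hardware: A - B is computed as A + negate4(B).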
Addition is straightforward: simply add and discard any carry out from the most significant bit.
Example: 2 + (-3) = -1, i.e. 0010 + 1101 = 1111 = -1₁₀.
Example: 3 + (-3) = 0, i.e. 0011 + 1101 = 1 0000; the 1 carried out of the most significant bit is discarded, leaving 0000 - note, this does not signify overflow, see the next subsection.
Subtraction: negate the number to be subtracted, then add.
Overflow comes about when the result of an operation does not fit the representation being used. It is a common trap for the unwary, especially in programming languages, for example C, C++, various assembly languages, and machine language, which have no built-in checks for it.
Example: Positive overflow. In our four-bit number system, 3 + 6: 0011 + 0110 = 1001. Hmmm! Consulting Table 2.2, we find that 1001 represents -7, not +9; the result has overflowed.
Example: Negative overflow. In our four-bit number system, -3 - 6: 1101 + 1010 = 1 0111. As is normal, the 1 carried out of the most significant bit is discarded. Hmmm again! Consulting Table 2.2, we find that 0111 represents +7, not -9; again the result has overflowed.
For unsigned numbers, detecting overflow is easy: it occurs exactly when there is carry out of the most significant bit.
For twos-complement things are trickier; overflow occurs exactly when the carry-in to the most significant bit is different from the carry-out of that bit.
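The carry-in versus carry-out rule can be demonstrated for the four-bit system in Java (my own sketch; the helper names are illustrative):

```java
public class Overflow4 {
    // Add two 4-bit twos-complement patterns; report whether signed overflow
    // occurred, using the rule: overflow iff the carry into the most
    // significant bit differs from the carry out of it.
    public static boolean addOverflows(int a, int b) {
        int low = (a & 0x7) + (b & 0x7);  // sum of the lower 3 bits
        int carryIn = (low >> 3) & 1;     // carry into bit 3 (the MSB)
        int full = (a & 0xF) + (b & 0xF); // full 4-bit sum
        int carryOut = (full >> 4) & 1;   // carry out of bit 3
        return carryIn != carryOut;
    }

    public static void main(String[] args) {
        System.out.println(addOverflows(0b0011, 0b0110)); // 3 + 6: true
        System.out.println(addOverflows(0b0011, 0b1101)); // 3 + (-3): false
    }
}
```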
If you examine Table 2.2, you will see (obviously enough when you think about it) that you can iterate in a positive direction along the twos-complement numbers by simply adding 1. Finally, when you get to 7₁₀ = 0111, adding 1 gives 1000 = -8₁₀. In other words, most positive number + 1 = most negative number; so, we wrap around.
Exercise. Draw the circle for four bit numbers.
Often, the quantities we want to represent and compute are not integers. Furthermore, the intermediate results in a computation may not be integer.
To represent non-integers, we need, in addition to the digits themselves, some way of recording the position of the point (the radix point).
In a fixed point representation, the point is assumed to be in some fixed position - and, since it is fixed, the position need not be stored. In a floating point representation, the position of the point can be varied according to the size of the number. Floating point is by far the most common and useful on general purpose computers.
To represent fractional numbers, we simply extend the positional system, introduced in section 2.3.1, to include digits corresponding to negative powers. We do this by introducing a point `.'; the integer part is as before, and is shown to the left of the point; the fraction part is shown to the right of the point.
Therefore, generalising what was given in section 2.3.1, a number which includes a q-digit fractional part - for a total of p+q digits - is written:
d_{p-1} ... d_1 d_0 . d_{-1} d_{-2} ... d_{-q}
and its value is:
d_{p-1} × b^{p-1} + ... + d_0 × b^0 + d_{-1} × b^{-1} + ... + d_{-q} × b^{-q}.
Thus, 123.45₁₀ = 1×10² + 2×10¹ + 3×10⁰ + 4×10⁻¹ + 5×10⁻². In binary, the weights to the right of the point are 2⁻¹ = 0.5, 2⁻² = 0.25, 2⁻³ = 0.125, ....
Example. Convert 11.101₂ to decimal. Ans. 2 + 1 + 0.5 + 0.125 = 3.625₁₀.
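Conversion of a binary string with a point, as in the example above, can be sketched in Java (illustrative code, not from the text):

```java
public class BinFraction {
    // Value of a binary string with a point, e.g. "11.101" -> 3.625,
    // by summing digit * 2^i over positive and negative powers i.
    public static double value(String s) {
        int point = s.indexOf('.');
        if (point < 0) { point = s.length(); s = s + "."; }
        double v = 0;
        for (int k = 0; k < s.length(); k++) {
            if (k == point) continue;
            // power of 2 for this position, counting from the point
            int i = (k < point) ? (point - 1 - k) : (point - k);
            if (s.charAt(k) == '1') v += Math.pow(2, i);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(value("11.101")); // prints 3.625
    }
}
```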
The magnitude of the number depends on the fixed position of the point. The problem is that sometimes we want most of the digits after the `point', i.e. small numbers, and sometimes we want most of them before the `point', i.e. large numbers. A single fixed-point representation cannot do both.
Other methods that have been proposed for storage of fractions:
But, floating point is the widely adopted solution.
A floating point number is composed of two parts: a fraction (or mantissa) and an exponent.
The floating point representation is similar to the `scientific', or `exponential', representation of very large and very small numbers on some calculators and in books; e.g. 1.2345 × 10³, i.e. 1234.5. In this scientific representation, you can think of the exponent as representing the `point' position.
In fact, the exponent is the integer part of the logarithm to which the base must be raised in order to place the point in the right position.
Examples of decimal floating-point with five digit fraction and three digit exponent are shown in Table 2.3.
We will explain binary floating point using a total of 8 bits (1 for sign, 3 for exponent, 4 for fraction); most practical systems use 32 bits, but the following explains the principles without the confusion of many bits. The bits are arranged as follows: sign (1 bit) | exponent (3 bits) | fraction (4 bits).
The fraction (mantissa) is just a straight fraction, i.e. the (virtual) point is at the left hand side of the four bits. The exponent is stored as a signed number, using excess 4 (biased by 4) representation - recall Table 2.2. Thus, the bit patterns in the exponent represent values as shown in Table 2.4.
Of course, you don't have to look up Table 2.4, simply subtract 4 from the stored number.
The value, v, of the stored floating point number is given by eqn. 2.2: v = f × 2^e, where e is the (signed) exponent value. Or, to make the excess in the exponent more explicit: v = f × 2^(e' - 4), where e' is the stored exponent. Unless otherwise stated, f is bounded as follows: 0.1000₂ ≤ f ≤ 0.1111₂, i.e. 0.5 ≤ f < 1.
The most common practical floating point number system (IEEE Standard 754, single precision) uses 8 bits for exponent, 23 for fraction, and 1 for sign; a total of 32 bits. There is a double precision format at 64 bits and an extended precision format at 80 bits.
Notice that, in a normalised number, eqn 2.4, the top bit is always set; thus, we don't really have to store it - this is a trick that the IEEE standard uses to `steal' an additional bit for the fraction; actually, they work it so that the hidden bit is assumed to be just to the left of the `point'. We will adopt this trick for our floating-point-byte system. Finally, we end up with eqn. 2.5, where we show the hidden bit and make the sign explicit: v = ±1.ffff₂ × 2^(e' - 4).
In special circumstances, IEEE 754 allows numbers to be stored with a denormalised fraction; for such cases they use exponent 000 which flags the special case. The major advantage is that we can store a `pure' zero.
Henceforth, we will call this small floating point number system flob - floating-point byte.
Example: Stored representation is 01101011; what is the value?
Sign bit = 0, so the number is positive. Stored exponent = 110₂ = 6₁₀, actual exponent = 6 - 4 = +2. Fraction = 1011, i.e. with hidden bit, 1.1011. Therefore, v = +1.1011₂ × 2² = 110.11₂ = 6.75₁₀.
Example: Stored representation is 10111100; what is the value?
Sign bit = 1, so the number is negative. Stored exponent = 011₂ = 3₁₀, actual exponent = 3 - 4 = -1. Fraction = 1100, i.e. with hidden bit, 1.1100. Therefore, v = -1.1100₂ × 2⁻¹ = -0.111₂ = -0.875₁₀.
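The decoding steps in these examples can be collected into a small Java sketch (my own helper, assuming the flob layout defined above: 1 sign bit, excess-4 exponent, 4 fraction bits with hidden bit, and stored exponent 000 meaning zero):

```java
public class Flob {
    // Decode an 8-bit flob value: 1 sign bit, 3-bit exponent (excess 4),
    // 4-bit fraction with a hidden leading 1; stored exponent 000 means 0.
    public static double decode(int byteVal) {
        int sign = (byteVal >> 7) & 1;
        int expStored = (byteVal >> 4) & 0x7;
        int frac = byteVal & 0xF;
        if (expStored == 0) return 0.0;            // denormalised: only 0 allowed
        double f = 1.0 + frac / 16.0;              // hidden bit: 1.ffff
        double v = f * Math.pow(2, expStored - 4); // remove the excess-4 bias
        return sign == 1 ? -v : v;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x6B)); // 01101011 -> 6.75
        System.out.println(decode(0xBC)); // 10111100 -> -0.875
    }
}
```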
Note: In the following we will assume that the only allowed denormalised number is 0, i.e. the stored pattern 00000000; there is no other allowable case where the stored exponent can be 000.
As mentioned in the introduction, section 2.2.4, the real numbers form a continuum. However, instead of an infinity of them, exactly 225 can be represented in flob. In flob there are three legal regions on the real line, and four illegal; see Figure 2.1.
The dots (...)
in Figure 2.1
are meant to show that, even within the legal regions, we do not have a
continuum, but a discrete set of numbers. So, in reality, there are many other
illegal regions, besides those mentioned above.
For all other numbers, we must compromise and represent them with a close legal number; this entails rounding or truncation.
Normalisation is introduced to make maximum use of the fraction, i.e. to ensure that the leftmost digit is non-zero. Thus, as mentioned above, 0.5 ≤ f < 1 (or, with the hidden bit, 1 ≤ 1.f < 2).
The normalisation process is simple: shift the fraction one digit left at a time, each time decreasing the exponent by 1; when the left-most digit is non-zero, normalisation is complete.
Example: .
; we show the hidden bit.
The fraction must be shifted left 3 places, so the exponent has 3 subtracted from it. Therefore, we have,
and this would be stored as (N.B. the top bit is hidden).
Example. Convert 9 to flob, and express in hexadecimal format.
9₁₀ = 1001₂ = 1.0010₂ × 2³ (hidden bit shown).
Stored exponent = 3 + 4 = 7 = 111₂; stored fraction = 0010 (top bit hidden); sign = 0. Stored byte: 0111 0010₂ = 72H.
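The reverse, encoding, can be sketched too (my own illustration; it omits rounding-overflow and exponent-range checking, so it is a sketch rather than a complete implementation):

```java
public class FlobEncode {
    // Encode a value as a flob byte: normalise to 1.ffff x 2^e, store
    // e + 4 in the 3-bit exponent field and the fraction bits without
    // the hidden leading 1.
    public static int encode(double v) {
        if (v == 0.0) return 0x00;                   // special case: zero
        int sign = v < 0 ? 1 : 0;
        double m = Math.abs(v);
        int e = 0;
        while (m >= 2.0) { m /= 2; e++; }            // normalise: 1 <= m < 2
        while (m < 1.0)  { m *= 2; e--; }
        int frac = (int) Math.round((m - 1.0) * 16); // 4 fraction bits
        return (sign << 7) | ((e + 4) << 4) | frac;  // excess-4 exponent
    }

    public static void main(String[] args) {
        System.out.printf("%02XH%n", encode(9.0));    // prints 72H
        System.out.printf("%02XH%n", encode(-0.875)); // prints BCH
    }
}
```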
Example. Convert to flob, and express in hexadecimal format.
.
-1 + 4 = 3 = 011₂
Example.
See above, same except:
; normally written as 23H.
Example. 0 - exactly zero.
Stored as 0000 0000₂ = 00H. Zero is a special case, see above, which is not normalised.
Example. Convert 00H from flob representation to decimal value.
You just have to remember that this is zero; see above.
Example. Convert 35H from flob representation to decimal value. Ans. 35H = 0011 0101₂: sign = 0; stored exponent = 011₂ = 3, actual exponent = -1; fraction = 0101, with hidden bit 1.0101₂. Therefore v = 1.0101₂ × 2⁻¹ = 0.10101₂ = 0.65625₁₀.
Here, we merely show the principle; and, to avoid some detail we use a decimal format.
Example. . Ans. .
The general method for addition and subtraction (A ± B) is: (1) align the points, by shifting the fraction of the number with the smaller exponent right, incrementing its exponent, until the two exponents are equal; (2) add (or subtract) the fractions; (3) normalise the result.
Example. .
Example. .
Strangely, these are easier than addition and subtraction: there is no need to align the points - simply multiply (divide) the fractions, add (subtract) the exponents, and normalise the result.
Example: 1.5 × 10² multiplied by 2.0 × 10¹ gives 3.0 × 10³.
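The align-add-normalise steps can be sketched in Java using the decimal format of these examples (illustrative code; it ignores limited fraction length and rounding):

```java
public class FloatAdd {
    // Add two (fraction, exponent) pairs in decimal scientific form,
    // following the steps above: align exponents, add fractions, normalise.
    // Returns {fraction, exponent} with 1 <= |fraction| < 10 (or 0).
    public static double[] add(double fa, int ea, double fb, int eb) {
        // Step 1: align - shift the smaller-exponent operand's fraction right.
        while (ea < eb) { fa /= 10; ea++; }
        while (eb < ea) { fb /= 10; eb++; }
        // Step 2: add the fractions.
        double f = fa + fb;
        int e = ea;
        // Step 3: normalise the result.
        while (Math.abs(f) >= 10)         { f /= 10; e++; }
        while (f != 0 && Math.abs(f) < 1) { f *= 10; e--; }
        return new double[]{f, e};
    }

    public static void main(String[] args) {
        double[] r = add(1.5, 2, 2.5, 1); // 1.5e2 + 2.5e1 = 1.75e2
        System.out.println(r[0] + " x 10^" + (int) r[1]);
    }
}
```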
Strictly speaking, accuracy is not a feature of the number system. Accuracy is determined by the error inherent in the production of the number, for example, the error in reading a meter. Thus, 12.34 ± 0.01 indicates that the actual number could be anywhere between 12.33 and 12.35. It might be possible to include better resolution, e.g. express it as 12.34567, but, due to the accuracy limitation, the 567 would be meaningless.
Resolution is measured by Δv - signifying value difference. Δv is the difference in values represented by neighbouring states. In floating-point systems, Δv varies according to the magnitude of the number (v); in fixed-point systems Δv remains constant across the range of v.
Example. Consider 0100 0111 in flob: v = 1.0111₂ × 2⁰ = 1.4375. The next value up is 1.1000₂ × 2⁰ = 1.5. Hence, Δv = 0.0625.
At v = 1.0111₂ × 2³ = 11.5, the next value up is 1.1000₂ × 2³ = 12.0. Hence, Δv = 0.5. If you used the whole eight bits for fixed-point then Δv would be constant over the whole range.
Range, or dynamic range, is the difference between the largest positive number and the largest negative number. In `flob' this is 15.5 - (-15.5) = 31.0, see above.
Precision is defined as the ratio of resolution to value, i.e. Δv/v. Floating point gives an almost constant precision across the dynamic range.
Example. See section 2.6.13. At v = 1.0111₂ × 2³ = 11.5, Δv = 0.5, so precision = 0.5/11.5 ≈ 0.043.
At v = 1.1000₂ × 2⁻³ = 0.1875, Δv = 2⁻⁴ × 2⁻³ = 0.0078125, so precision = 0.0078125/0.1875 ≈ 0.042 - almost the same.
Table 2.5 shows some commonly used integer formats and their ranges.
IEEE single precision (32 bits): range approximately ±3.4 × 10³⁸, about 7 significant decimal digits.
IEEE double precision (64 bits): range approximately ±1.8 × 10³⁰⁸, about 15 to 16 significant decimal digits.
IEEE floating-point uses the hidden bit trick mentioned above.
[Chapter 3 is a prerequisite for this section].
In most assembly languages, and in C and C++, there are bitwise operations which perform logical operations (AND, OR, XOR, NOT) on each corresponding pair of bits.
Let us say that we have two four-bit numbers packed into a byte. How do we disentangle them? Take the byte 0111 1001; the two four-bit numbers are 7 and 9. First, the lower four bits: here we mask (bitwise AND) with 0000 1111, giving 0000 1001 = 9.
For the higher four bits, we shift right by four bits (see subsection 2.8.3) and use the same mask, giving 0000 0111 = 7.
Thus, you sometimes find assembly language programmers using shift for multiplying (shift left) and dividing (shift right) by powers of two.
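Masking and shifting, as above, look like this in Java (my own sketch; the 0b binary literals require Java 7 or later):

```java
public class Nibbles {
    // Unpack two four-bit numbers from one byte using mask and shift.
    public static int low(int b)  { return b & 0x0F; }        // mask lower 4 bits
    public static int high(int b) { return (b >> 4) & 0x0F; } // shift, then mask

    public static void main(String[] args) {
        int packed = 0b01111001;          // 7 and 9 packed into one byte
        System.out.println(high(packed)); // prints 7
        System.out.println(low(packed));  // prints 9
        // Shifts as cheap multiply/divide by powers of two:
        System.out.println(3 << 2);       // shift left 2 = multiply by 4: 12
        System.out.println(22 >> 1);      // shift right 1 = divide by 2: 11
    }
}
```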
Rotate is like shift, only bits can wrap-around.
Text characters, as well as numbers, must be binary coded to get them into a computer memory. A code that copes with alphabetic characters {a ... z, A ... Z} as well as digits {0 ... 9} is called alphanumeric.
Until recently, the most frequently used code was ASCII (American Standard Code for Information Interchange); ASCII uses 7 bits (0 to 127); this includes printable and control characters.
Table 2.6 gives some codes worth remembering.
ASCII has proved adequate for English text. But what about languages with accents? And those with differently shaped characters? UNICODE is a newer international standard, governed by the UNICODE consortium, aimed at allowing a richer character set. In fact, Java uses UNICODE - a 16-bit code - so the Java char type is 16-bit, compared to 8-bit in C and C++. According to [Tanenbaum, 1999] the known characters in the world's languages number some 200,000 - thus, there is already pressure on UNICODE.
41H could be `A' or the number 41H. "AB" together in a 16-bit word could be part of a string, or the number 4142H, or part of an instruction, or .... The computer program must know beforehand what each memory cell represents - thus, types.
When information is transferred between various parts of a computer system it is possible that errors may be introduced. These errors take the form of transmitted 0s inadvertently being received as 1s or vice-versa.
Parity provides a single-error detecting capability. The scheme involves appending an additional bit to the string of 1s and 0s - the parity bit. Depending on the chosen convention, the appended bit is chosen so that the total number of 1s in the string is always even (even parity) or always odd (odd parity).
Bit 7 is often used for the parity bit - pure ASCII uses only 7 bits, bits 0 to 6.
Example. ASCII character `A' using odd parity. `A' = 41H = 100 0001₂, which contains two 1s, so the parity bit must be 1; with the odd parity bit placed in bit 7, the stored byte is 1100 0001₂ = C1H.
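Computing the odd-parity bit, as in the example above, can be sketched in Java (illustrative helper names, not from the text):

```java
public class Parity {
    // Append an odd-parity bit in bit 7 of a 7-bit ASCII code:
    // the parity bit is chosen so that the total number of 1s is odd.
    public static int withOddParity(int ascii7) {
        int ones = Integer.bitCount(ascii7 & 0x7F);
        int parity = (ones % 2 == 0) ? 1 : 0;  // add a 1 only if count is even
        return (parity << 7) | (ascii7 & 0x7F);
    }

    public static void main(String[] args) {
        System.out.printf("%02XH%n", withOddParity('A')); // 41H -> prints C1H
    }
}
```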
Consider the Java type boolean; what is the set of values for it? Identify some legitimate operations on boolean values. (Hint: how would you write an if statement to do something if the integers i and j satisfy the following conditions: (a) i is between 1 and 10 and j is between 21 and 30, (b) i is between 1 and 10 or between 21 and 30, (c) i is any value except those between 50 and 60.) Identify some operations that are not legitimate on boolean values, i.e. that will cause compiler syntax errors; hint: int, see below.
Consider the Java type int; what, roughly, is the set of values for int? Identify some legitimate operations on int values. Identify some operations that are not legitimate on int values, i.e. that will cause compiler syntax errors; hint: boolean, see above.
Section 2.3.
Section 2.5. Note: in an exam, I'd expect you to be able to describe the three methods of representing signed numbers - with examples and brief compare/contrast.
68656C6C6F20776F726C640D0A4279650D0A; if necessary, see Deitel & Deitel p. 1242, but my Table 2.6 should be adequate, i.e. count ahead, 61 = 'a', 62 = 'b', 63 = 'c', etc.
30313233. Make some suggestions as to what is represented. In the context of types, comment on this state of affairs.
68656C6C6F20.
Integers (int) in Java are usually 32 bits. Explain how you would construct an integer number system that would handle 256 bits (say), i.e. which, for a twos-complement system, would give a range of roughly -2²⁵⁵ to +2²⁵⁵ - 1.