Computer Number Systems and Arithmetic

Introduction

Overview

This chapter describes how numbers are represented, stored and processed in digital computers.

We will see that positive integers are conceptually easy to handle in a computer; extending to include negative numbers makes things more difficult; and, finally, extending to the real number system introduces further complexity.

We will also see how other entities are stored in computers - like text characters; essentially, computers can store only numbers, so anything else must be coded as a number; and, for these codes to be any use at all, the coding scheme must be agreed by those who wish to exchange information - within the same computer and across different computers.

Numbers and Types

Sets

 

We have two reasons to introduce sets:

Informally, a set is a collection of objects. The objects are called elements, or ``members'' of the set.

The statement `x is a member of X' is written $ x \in X$. $ x \in X$ is a proposition; a proposition is a statement that can be either affirmed, `is true', or denied, `is false'. The truth-values, $\{true, false\}$, of such propositions may be manipulated by the propositional calculus (aka. Boolean algebra).

There are two means of defining a set:

1.
By enumeration: e.g. $X = \{1, 2, 3\}$;

2.
By comprehension, i.e. in terms of an existing set using a `filter' predicate for selecting. For example, the set of dice counts less than or equal to three:

$ D = \{1, 2, 3, 4, 5, 6 \}$, the base set of possible dice counts,

$ X = \{\ x \colon D \mid x \le 3 \bullet x\}$

The notation for set comprehension is given by eqn. 2.1:

 \begin{displaymath}\{\ x \colon S \mid p(x) \bullet f(x)\}
\end{displaymath} (2.1)

This denotes the set of elements x in S which satisfy the predicate p(.), with, finally, the function f(.) applied to each element. If we are interested in just the elements x, as is often the case, the ` $\bullet f(x)$' may be omitted.
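As an aside (not part of the original notes), the comprehension above has a close analogue in Java streams: filter plays the role of the predicate p(.) and map the role of f(.). A minimal sketch, with purely illustrative class and variable names:

\begin{verbatim}
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class Comprehension {
    public static void main(String[] args) {
        List<Integer> d = List.of(1, 2, 3, 4, 5, 6);   // the base set D
        Set<Integer> x = d.stream()
                          .filter(n -> n <= 3)         // predicate p(x): x <= 3
                          .map(n -> n)                 // f(x) = x (identity here)
                          .collect(Collectors.toSet());
        System.out.println(x);                         // the set {1, 2, 3}
    }
}
\end{verbatim}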

Important Sets

Universal Set

Often we describe sets in terms of some universal set - a set to which all possible entities belong, and, as above, we then define other sets using some rule.

Subset
If Y and X are sets whose elements are such that membership of Y implies membership of X, then Y is a subset of X, denoted by $Y \subseteq X$.

Empty Set
The set which contains no element is called the empty set, signified by $\{\}$ or $\emptyset$.

Natural Numbers
$ \mathbf{N} = \{0, 1, 2, \ldots +\infty \}$, the set of natural numbers or non-negative integers.

Integers
$ \mathbf{Z} = \{-\infty, \ldots 0, 1, 2, \ldots +\infty \}.$

Rational numbers
$\frac{n}{d}$, where $n, d \in \mathbf{N}\ \text{and}\ d \neq 0$; each rational can be represented by a terminating or recurring decimal with a finite repeating block, e.g. $\frac{3}{7} = 0.\overline{428571} \ldots$.

Real numbers
$\mathbf{R}$, the real number line. Real numbers are truly representable only by an infinity of nonrecurring decimals, e.g. $\sqrt{2} = 1.414213562 \ldots$.

Truth values
$\{false, true \}$, corresponding to the type boolean in Java.

Interval Subsets

The closed interval: $[a, b] = \{\ x \colon \mathbf{R} \mid a \le x \le b \bullet x\}.$

The open interval: $(a, b) = \{\ x \colon \mathbf{R} \mid a < x < b \bullet x\}.$

Power Set

 

The power set of set X, P X, is the set of all subsets of X.

Operations on Sets

 

Union

$S \cup T = \{ x \vert x \in S \lor x \in T \} $

Intersection

$S \cap T = \{ x \vert x \in S \land x \in T \} $

Complement

The complement of Y with respect to X, signified by X - Y (or in some usage, $X \sim Y$), is the subset of X of all elements of X that are not members of Y:

\begin{displaymath}X - Y = \{y': X \vert y' \notin Y \bullet y'\}. \end{displaymath}

If X is understood to be the universe, then the complement can simply be written $\sim Y $; alternatively, $\bar{Y}$.
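For what it's worth, the same three operations are available on Java's Set type; a small sketch with illustrative example sets (not taken from the notes):

\begin{verbatim}
import java.util.HashSet;
import java.util.Set;

public class SetOps {
    public static void main(String[] args) {
        Set<Integer> s = new HashSet<>(Set.of(1, 2, 3, 4));
        Set<Integer> t = new HashSet<>(Set.of(3, 4, 5, 6));

        Set<Integer> union = new HashSet<>(s);
        union.addAll(t);                    // S union T = {1,2,3,4,5,6}

        Set<Integer> intersection = new HashSet<>(s);
        intersection.retainAll(t);          // S intersect T = {3,4}

        Set<Integer> complement = new HashSet<>(s);
        complement.removeAll(t);            // S - T = {1,2}

        System.out.println(union + " " + intersection + " " + complement);
    }
}
\end{verbatim}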

Representation of Values in Finite Data Word Sizes

 

[Here we use the term word for the contents of a location or a register; a word is usually a collection of one or more bytes, interpreted as a unit, e.g. a Java int is four bytes, 32 bits].

Because of finite data word size, when we represent numbers on a digital computer, we normally need to approximate. It should be clear that approximation of the natural and integer numbers is not such a big problem; you just cut off at some sufficiently large negative and positive numbers, say $-N_m$ and $+N_p$, to replace $-\infty$ and $+\infty$ as the bounds.

Rationals could be handled similarly - as a pair of integers - but these are rarely ever used for general purpose arithmetic.

Real numbers are the big problem - to represent them completely requires an infinity of digits. Note: between any two real numbers, no matter how close, there is an infinite set of real numbers - the reals form a continuum. However, in a digital computer they must be represented by selected examples - approximations; thus, on a very limited system, we might have 2.5, 2.55, 2.3, ... and in such a system, $2.512345 \ldots$ would be approximated by 2.5.

In summary:

1.
Integers are easy to simulate. Normal integer arithmetic and computer integer arithmetic are effectively the same. Handling a sign introduces some extra complexity.

2.
Reals are more difficult to simulate. You have to be careful with the computer approximations of real arithmetic; even the best approximation - floating point numbers and floating point arithmetic - are inherently inaccurate.

For an encyclopedic account of this subject see [Knuth, 1997].

Representations of Numbers - Bases

General

 

Digital computers use binary (base 2) as their `native' representation; however, in addition to binary, and decimal, it is also useful to discuss hexadecimal (base 16) and octal (base 8) representations.

Our everyday representation of numbers is decimal; it uses the 10 symbols $\{0,1,2,3,4,5,6,7,8,9 \}$ - digits. 10 is its base or radix.

Our decimal number system is a positional system: the meaning of any digit is determined by its position in the string of digits that represent the number. Thus:


\begin{align*}123 &= 1 \times 10^2 + 2 \times 10^1 + 3 \times 10^0 \\
&= 1 \times 100 + 2 \times 10 + 3 \times 1 \\
&= 100 + 20 + 3
\end{align*}

In general, given a base b, and number of digits p, the number is written:

$d_{p-1}d_{p-2}\ldots d_2d_1d_0$

its value is:

$value = d_{p-1} \times b^{p-1} + d_{p-2} \times b^{p-2} + \ldots + d_2 \times b^{2} + d_1 \times b^{1} + d_0 \times b^{0}$

The leftmost digit, $d_{p-1}$, is the most significant digit; the rightmost, $d_0$, is the least significant digit.

Binary Representation

Base 2 requires just two symbols, $\{0, 1\}$, but the pattern remains the same as for decimal. $\{0, 1\}$ are termed binary digits - bits.

Example. [Subscript: $x_{10}$ signifies that x is decimal, $y_2$ signifies binary].
\begin{align*}101_2 &= 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 \\
&= 4 + 0 + 1 \\
&= 5_{10}
\end{align*}

Notation: In books, and here, you will find a range of notations used. Thus:

$101_2 = 101_B = 101\ Bin = Bin\ 101$

$22_{10} = 22_D = 22\ Dec = Dec\ 22$

and this is by no means an exhaustive list of possible notations.

Conversion - Binary-Decimal, Decimal-Binary

The example of $101_B \rightarrow 5_D$, above, shows how it is done: simply add the powers of two corresponding to the digits that are set.

Example. Convert $22_{10}$ to binary.

Method 1: successively divide by 2 until you get to 0; the remainders (r), read backwards, are the binary digits.
\begin{align*}\frac{22}{2} &= 11\ r\ 0 \leftarrow \text{least significant digit}\\
\frac{11}{2} &= 5\ r\ 1 \\
\frac{5}{2} &= 2\ r\ 1 \\
\frac{2}{2} &= 1\ r\ 0 \\
\frac{1}{2} &= 0\ r\ 1 \leftarrow \text{most significant digit}
\end{align*}

$22_{10} = 10110_2$.

Clearly, this will work for any base, and, actually, is the method that you would use to extract the digits (base 10, or any base) from a number to enable printing.

To check, convert back to decimal:
\begin{align*}10110_2 &= 1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 0 \times 1 \\
&= 16 + 0 + 4 + 2 + 0 \\
&= 22
\end{align*}
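A minimal Java sketch of Method 1 - repeated division - which, as noted above, is also how digits are extracted for printing; the helper name is mine, and the library call Integer.toString(n, base) does the same job in one call:

\begin{verbatim}
public class Digits {
    // Convert n (assumed >= 0) to a digit string in the given base by
    // repeated division; remainders appear least significant first.
    static String toBase(int n, int base) {
        if (n == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            sb.insert(0, Character.forDigit(n % base, base)); // remainder = next digit
            n /= base;                                        // carry on with the quotient
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBase(22, 2));    // 10110
        System.out.println(toBase(22, 16));   // 16
        System.out.println(toBase(25, 8));    // 31
    }
}
\end{verbatim}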

Method 2: Same principle as method 1, but easier to do in your head. First you must remember powers of 2: 1, 2, 4, 8, 16, 32, 64, 128, .... Then:

1.
Find the power of two just below the number (22), i.e. 16; $16 = 2^4$, so, bit 4 is set;

2.
Subtract 16, to get 6;

3.
Loop back with 6 as number; repeat until done.

Hexadecimal and Octal

Hexadecimal is base 16: the sixteen symbols used are $\{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F\}$. A table of the values is shown in Table 2.1. For completeness, we also give binary and octal (base 8).


 
Table 2.1: Decimal Binary, Octal, Hexadecimal
Dec Bin Oct Hex
0 0 0 0
1 1 1 1
2 10 2 2
3 11 3 3
4 100 4 4
5 101 5 5
6 110 6 6
7 111 7 7
8 1000 10 8
9 1001 11 9
10 1010 12 A
11 1011 13 B
12 1100 14 C
13 1101 15 D
14 1110 16 E
15 1111 17 F
 

When dealing with computers at a low-level it is almost essential to be able to work with hexadecimal ("hex") or octal; hex is more common now, but it may be difficult to memorise the symbol, number, bit-pattern table above. If you need to use hex in an examination or otherwise, make up the table on a sheet of paper - it takes only a few seconds.

A hex digit encodes a group of four bits, $\{0 \ldots F\}$, whilst an octal digit, $\{0 \ldots 7\}$, represents a group of three bits.
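As an aside (not in the original notes), the standard Java library will produce these representations directly; a small sketch, using 25 as an example value:

\begin{verbatim}
public class Bases {
    public static void main(String[] args) {
        int n = 25;
        System.out.println(Integer.toBinaryString(n));    // 11001
        System.out.println(Integer.toOctalString(n));     // 31  (bits grouped in threes)
        System.out.println(Integer.toHexString(n));       // 19  (bits grouped in fours)
        System.out.println(Integer.parseInt("11001", 2)); // 25, the reverse direction
    }
}
\end{verbatim}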

Conversion - Decimal-Hexadecimal

Example. Convert $22_{10}$ to hex.

Method 1

Successively divide by 16 until you get to 0; the remainders (r), read backwards, are the hex digits.


\begin{align*}\frac{22}{16} &= 1\ r\ 6 \leftarrow \text{least significant digit} \\
\frac{1}{16} &= 0\ r\ 1 \leftarrow \text{most significant digit}
\end{align*}

$22_{10} = 16_{16}$.

To check, convert back to decimal:
\begin{align*}16_{16} &= 1 \times 16^1 + 6 \times 16^0 \\
&= 1 \times 16 + 6 \times 1 \\
&= 16 + 6 \\
&= 22
\end{align*}

Method 2

Convert $22_{10}$ to hex. For a start, convert this to binary, i.e. $10110_2$. Then, starting from the right (least significant digit), divide into groups of four; then use Table 2.1.
\begin{align*}1\ &0110 \\
1\ &6
\end{align*}
$10110_2 = 16_{16}$

Octal

Example. Convert $25_{10}$ to octal. Ans. binary: $11001_2$; group into threes starting from the right: $11\ 001$. This is $31_8$.

Q. Why do real programmers (who, at least in the old days, always worked in octal) confuse Halloween and Christmas? A. Because Dec 25 = Oct 31 - ha, ha!

Binary Arithmetic

Addition

Assuming you do decimal addition as follows (19 + 22): $9 + 2 = 11$, i.e. 1 and carry 1; $2 + 1 = 3 + \text{carry}\ 1 = 4$; Answer: 41.
\begin{align*}&19\\
+&22\\
=&41
\end{align*}
Binary addition (and addition in other bases too) follows the same pattern - only the addition of single digits and the carry rule need to be amended. Thus:
\begin{align*}&101\\
+&001\\
=&110
\end{align*}
1 + 1 = 10 i.e. 0 and carry 1; $0 + 0 = 0 + \text{carry}\ 1 = 1$, 0 + 1 = 1 (no carry this time).

Hexadecimal addition

Example: $1E_{16} + 25_{16}\ (30_{10} + 37_{10} = 67_{10})$
\begin{align*}&1E\\
+&25\\
=&43
\end{align*}
$ 5 + E = 3\ \text{and carry}\ 1$; $ 2 + 1 = 3 + \text{carried}\ 1 = 4$; Check: $43_{16} = 64 + 3 = 67$.

Representation of Sign in Binary Numbers

The three primary methods used to represent signed binary numbers are: sign-magnitude, twos-complement, and biased (also called excess). Of these, twos-complement is by far the most common.

Sign-magnitude

In sign-magnitude, one bit is used for sign, typically 1 = -, 0 = +; usually the most significant bit is the sign, and the low-order p-1 bits are the magnitude of the number - where we have a total of p bits.

Biased

 

In a biased system, a fixed bias is added - such that the sum of the bias and the number being represented will always be non-negative; thus, what would normally be interpreted as 0 represents the most negative number.

Twos-complement

Twos-complement makes adding and subtracting particularly easy to implement; in a twos-complement system, negating a number is done by taking its twos-complement. Computing the twos complement of a binary number involves two steps:

1.
Complement each bit, i.e. ones complement; for example, in a four bit number: $0011 \rightarrow 1100$;

2.
Add 1, $1100 \rightarrow 1101$.

Thus, $0011_{2} = 3_{10} \rightarrow 1101$.
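Since Java ints are themselves 32-bit twos-complement numbers, the rule can be checked directly; a small sketch (the class name is mine):

\begin{verbatim}
public class TwosComplement {
    public static void main(String[] args) {
        int x = 3;
        int neg = ~x + 1;                        // ones complement, then add 1
        System.out.println(neg);                 // -3
        System.out.println(neg == -x);           // true
        // The low four bits of -3 are 1101, matching the 4-bit example above.
        System.out.println(Integer.toBinaryString(neg & 0xF)); // 1101
    }
}
\end{verbatim}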

Example: In a four bit number system, represent -3 in each of the formats.

Table 2.2 shows a comparison of the three methods of representing signed numbers.


 
Table 2.2: Comparison of signed number representation (four bit number)
Value Sign-mag. Twos-comp. Biased (8)
-8 not poss. 1000 0000
-7 1111 1001 0001
-6 1110 1010 0010
-5 1101 1011 0011
-4 1100 1100 0100
-3 1011 1101 0101
-2 1010 1110 0110
-1 1001 1111 0111
-0 1000 * *
0 0000 0000 1000
+1 0001 0001 1001
+2 0010 0010 1010
+3 0011 0011 1011
+4 0100 0100 1100
+5 0101 0101 1101
+6 0110 0110 1110
+7 0111 0111 1111
 

Note:

1.
(*) In twos-complement and biased, there is only one zero!

2.
In both biased and twos-complement there is one more negative number than positive.

3.
In a p-bit number, the set of possible twos-complement numbers is $\{-2^{p-1}, \ldots 2^{p-1} - 1\}$.

Twos-complement arithmetic

Addition is straightforward: simply add and discard any carry out from the most significant bit.

Example: 2 + (-3) = -1

\begin{eqnarray*}\ &0010\\
+ &1101\\
0 &1111\\
\end{eqnarray*}


Example: 3 + (-3) = 0

\begin{eqnarray*}\ &0011\\
+ &1101\\
1 &0000\\
\end{eqnarray*}


and the 1 carried out of the most significant bit is discarded - note, this does not signify overflow, see the next subsection.

Subtraction: negate the number to be subtracted, then add.

Overflow

Overflow comes about when the result of an operation does not fit the representation being used. It is a common trap for the unwary, especially in programming languages, for example C, C++, various assembly languages, and machine language, which have no built-in checks for it.

Example: Positive overflow. In our four-bit number system, 3 + 6
\begin{align*}0011\\
0110\\
1001\\
\end{align*}
Hmmm! Consulting Table 2.2, we find that $3 + 6 \rightarrow -7$.

Example: Negative overflow. In our four-bit number system, -3 - 6
\begin{align*}\ \ 1101\\
\ \ 1010\\
1\ 0111\\
\end{align*}
As is normal, the 1 carried out of the most significant bit is discarded. Hmmm again! Consulting Table 2.2, we find that $-3 - 6 \rightarrow +7$.

Detection of Overflow

For unsigned numbers, detecting overflow is easy, it occurs exactly when there is carry out of the most significant bit.

For twos-complement things are trickier; overflow occurs exactly when the carry-in to the most significant bit is different from the carry-out of that bit.
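In a high-level language the carries are not visible, but the same condition can be tested from the signs of the operands and the result; a hedged Java sketch (the helper name is mine):

\begin{verbatim}
public class Overflow {
    // Overflow in twos-complement addition: both operands have the same sign
    // and the sum's sign differs (equivalently, carry-in and carry-out of the
    // top bit disagree).
    static boolean addOverflows(int a, int b) {
        int sum = a + b;                      // Java ints silently wrap around
        return ((a ^ sum) & (b ^ sum)) < 0;   // negative => both top bits disagree with sum
    }

    public static void main(String[] args) {
        System.out.println(addOverflows(3, 6));                  // false (32-bit ints are roomy)
        System.out.println(addOverflows(Integer.MAX_VALUE, 1));  // true
        // Math.addExact(a, b) throws ArithmeticException in exactly these cases.
    }
}
\end{verbatim}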

The twos-complement circle, wrap-around

If you examine Table 2.2, you will see (obviously enough when you think about it) that you can iterate in a positive direction along the twos-complement numbers by simply adding 1. Finally, when you get to $7_{10} = 0111$:
\begin{align*}0111\\
0001\\
1000\\
\end{align*}
In other words, most positive number + 1 = most negative number; so, we wrap-around.

Exercise. Draw the circle for four bit numbers.

Fractions and Floating Point

Introduction

Often, the quantities we want to represent and compute are not integers. Furthermore, the intermediate results in a computation may not be integers.

To represent non integers, the following are required:

1.
Representation of the digits of the number - on both sides of the `point';

2.
Representation of the position of the `point'.

In a fixed point representation, the point is assumed to be in some fixed position - and, since it is fixed, the position need not be stored. In a floating point representation, the position of the point can be varied according to the size of the number. Floating point is by far the most common and useful on general purpose computers.

Fixed Point

To represent fractional numbers, we simply extend the positional system, introduced in section 2.3.1, to include digits corresponding to negative powers. We do this by introducing a point `.'; the integer number is as before, and is shown to the left of the point; the fraction part is shown to the right of the point.

Therefore, generalising what was given in section 2.3.1, a number which includes a q-digit fractional part - for a total of p+q digits - is written:

$d_{p-1}d_{p-2}\ldots d_2d_1d_0.d_{-1}d_{-2}\ldots d_{-q+1}d_{-q}$

and its value is:

$value = d_{p-1} \times b^{p-1} + d_{p-2} \times b^{p-2} + \ldots + d_1 \times b^{1} + d_0 \times b^{0} + d_{-1} \times b^{-1} + d_{-2} \times b^{-2} + \ldots + d_{-q+1} \times b^{-q+1} + d_{-q} \times b^{-q}$

Thus, for $123.45_{10}$,
\begin{align*}123.45_{10} &= 1 \times 10^{2} + 2 \times 10^{1} + 3 \times 10^{0} + 4 \times 10^{-1} + 5 \times 10^{-2}\\
&= 100 + 20 + 3 + 0.4 + 0.05\\
&= 123.45
\end{align*}

In binary,
\begin{align*}0.1 &= 1 \times 2^{-1} = 1 \times 0.5 = 0.5 \\
0.01 &= 1 \times 2^{-2} = 1 \times 0.25 = 0.25 \\
0.001 &= 1 \times 2^{-3} = 1 \times 0.125 = 0.125 \\
&\ldots
\end{align*}

Example. Convert $11.101_2$ to decimal.
\begin{align*}11.101 &= 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}\\
&= 2 + 1 + 0.5 + 0 + 0.125 \\
&= 3.625
\end{align*}

The magnitude of the numbers that can be represented depends on the fixed position of the point. The problem arises that sometimes we want most of the digits after the `point', i.e. small numbers, and sometimes we want most of them before the `point', i.e. large numbers. A single fixed-point representation cannot do both.

Other methods have been proposed for the storage of fractions, but floating point is the widely adopted solution.

Floating Point

A floating point number is composed of two parts:

1.
The digits of the number, the fraction; the fraction is often called the mantissa, but, strictly, this is an abuse of the term.

2.
The position of the point - the exponent.

The floating point representation is similar to the `scientific', or `exponential', representation of very large and very small numbers on some calculators and in books; e.g. $123.45 = 0.12345\ E\ +03$, i.e. $0.12345 \times 10^3$. In this scientific representation, you can think of the exponent as representing the `point' position.

In fact, the exponent is the integer part of the logarithm to which the base must be raised in order to place the point in the right position.

Examples of decimal floating-point with five digit fraction and three digit exponent are shown in Table 2.3.


 
Table 2.3: Examples of floating-point

fraction  exponent  floating-point             fixed-point
12345     003       $0.12345 \times 10^3$      123.45
12345     002       $0.12345 \times 10^2$      12.345
12345     001       $0.12345 \times 10^1$      1.2345
12345     000       $0.12345 \times 10^0$      0.12345
12345     -001      $0.12345 \times 10^{-1}$   0.012345


Binary Floating Point

We will explain binary floating point using a total of 8 bits (1 for sign, 3 for exponent, 4 for fraction); most practical systems use 32 bits, but the following explains the principles without the confusion of many bits. The bits are arranged as follows,

\begin{displaymath}\underbrace{sign}_{1\ bit}\ \underbrace{exp}_{3\ bits}\ \underbrace{frac}_{4\ bits}
\end{displaymath}

The fraction (mantissa) is just a straight fraction, i.e. the (virtual) point is at the left hand side of the four bits. The exponent is stored as a signed number, using excess 4 (i.e. biased by 4) representation - recall Table 2.2. Thus, the bit patterns in the exponent represent values as shown in Table 2.4.


 
Table 2.4: 3-bit, excess 4

bits   111  110  101  100  011  010  001  000
value   3    2    1    0   -1   -2   -3    *

 

Of course, you don't have to look up Table 2.4, simply subtract 4 from the stored number.

The value, v, of the stored floating point number is given by eqn. 2.2.

 \begin{displaymath}v = f \times r^e
\end{displaymath} (2.2)

where f is the fraction, r is the radix (base) of the exponent and e is the exponent.

Or, to make the excess in the exponent more explicit,

 \begin{displaymath}v = f \times r^{e - q}
\end{displaymath} (2.3)

where q is the excess.

Normalisation

Unless otherwise stated, f is bounded as follows:

\begin{displaymath}.1000 \leq f \leq .1111
\end{displaymath} (2.4)

i.e. the top bit of the fraction is always set.

The most common practical floating point number system (IEEE Standard 754, single precision) uses 1 bit for the sign, 8 bits for the exponent and 23 bits for the fraction; a total of 32 bits. There is a double precision format at 64 bits and an extended precision format at 80 bits.

Hidden Bit

Notice that, in a normalised number, eqn 2.4, the top bit is always set; thus, we don't really have to store it - this is a trick that the IEEE standard uses to `steal' an additional bit for the fraction; actually, they work it so that the top bit is assumed to be just to the left of the `point'. We will adopt this trick for our floating-point-byte system. Finally we end up with eqn. 2.5, where we show the hidden bit, and make the sign explicit,


 \begin{displaymath}v = (-1)^s \times 1.f \times r^{e - q}
\end{displaymath} (2.5)

Denormalised Numbers

In special circumstances, IEEE 754 allows numbers to be stored with a denormalised fraction; for such cases they use exponent 000 which flags the special case. The major advantage is that we can store a `pure' zero.

Henceforth, we will call this small floating point number system flob - floating-point byte.

Example: Stored representation is 01101011, what is the value?

Sign. $ 0 \rightarrow +$. Stored exponent = 110 = $6_{10}$, actual exponent = +2. Fraction = 1011, i.e. with hidden bit, 1.1011. Therefore,

\begin{eqnarray*}v & = & 1.1011 \times 2^{+2} \\
& = & 110.11 \times 2^0 \\
& = & 110.11 \\
& = & 4 + 2 + 0 + 0.5 + 0.25 \\
& = & 6.75
\end{eqnarray*}


Example: Stored representation is 10111100, what is the value?

Sign. $ 1 \rightarrow -$. Stored exponent = 011 = $3_{10}$, actual exponent = -1. Fraction = 1100, i.e. with hidden bit, 1.1100. Therefore,

\begin{eqnarray*}v & = & -1.1100 \times 2^{-1} \\
& = & -.11100 \times 2^0 \\
& = & -.11100 \\
& = & -(0.5 + 0.25 + 0.125) \\
& = & -0.875
\end{eqnarray*}


Note: In the following we will assume that the only allowed denormalised number is 0; i.e. $0000\ 0000 \rightarrow 0$; there is no other allowable case where the stored exponent can be 000.

Exercises

1.

(a)
What is the largest positive number that can be represented using our 'flob' number system? Ans. 15.5. Explain.

(b)
Largest negative number?

(c)
Smallest positive number? Ans. 0.125. Explain.

(d)
Smallest negative number?

(e)
Smallest number in magnitude?

2.

(a)
Explain how many individual numbers can be represented by flob? Hint 1: how many positive, how many negative, and don't forget zero. Hint 2: for the positives (say), how many different fraction patterns, how many exponent patterns; multiply to get the total. Ans. 225. Explain.

(b)
Normally a byte can represent 256 different values; explain how we have lost the remaining (256-225) = 31 values?

Rounding and Truncation

As mentioned in the introduction, section 2.2.4, the real numbers form a continuum. However, instead of an infinity of them, exactly 225 can be represented in flob. In flob there are three legal regions on the real line, and four illegal, see Figure 2.1.


  
Figure 2.1: The Seven Regions of The Real Number Line.
[Number line, left to right: $-\infty$ ... -15.5 ... -0.125 ... 0 ... +0.125 ... +15.5 ... $+\infty$; the seven regions, in order, are: Overflow, legal (negative), Underflow, zero, Underflow, legal (positive), Overflow.]

The dots (...) in Figure 2.1 are meant to show that, even within the legal regions, we do not have a continuum, but a discrete set of numbers. So, in reality, there are many other illegal regions, besides those mentioned above.

For actual numbers, we must compromise and represent them with a close legal number; this entails rounding or truncation.

Rounding

Rounding is probably the best - it takes the closest; e.g. in a one digit representation, $0.34 \rightarrow 0.3$, $0.36 \rightarrow 0.4$; therefore, the maximum error = 0.05, i.e. half the step between legal numbers (0.1).

Truncation

Truncation chops off the insignificant digits, e.g. $0.39 \rightarrow 0.3$; in this case, the maximum error is 0.09.

Normalisation

Normalisation is introduced to make maximum use of the fraction, i.e. ensure that the leftmost digit is non-zero. Thus, as mentioned above, $1.0 \leq f \leq 1.1111$.

The normalisation process is simple: shift the fraction one digit left at a time, each time decreasing the exponent by 1; when the left-most digit is non-zero, normalisation is complete.

Example: $0.0010 \times 2^0$.

$fraction = 0\ 0010,\ exponent = 000$; we show the hidden bit.

The fraction must be shifted left 3 places, so the exponent has 3 subtracted from it. The normalised value is $1.0000 \times 2^{-3}$. Therefore, we have,

$fraction= 1\ 0000,\ exponent = 001\ (-3)$ and this would be stored as $0\
001\ 0000$ (N.B. the top bit is hidden).

Conversion to Floating Point

Example. Convert 9 to 'flob', and express in hexadecimal format.

1.
Sign = 0.

2.
Convert fraction to binary: 1001.

3.
Normalise, i.e. shift the `point' left until you have 1.xxx: $1.0010 \times 2^3$ - the point shifted 3 left = divide by $2^3$, therefore we need to add 3 to the exponent - which starts off at 0.

4.
Discard the (hidden) `1' - before the point.

$.0010 \times 2^3$.

5.
Convert the exponent to excess representation (excess = bias = 4); i.e. $stored\ exponent = actual\ exponent + 4$.

$3 + 4 = 7 = 111_2$

6.
Assemble full representation,

\begin{displaymath}\overbrace{0}^{s}\ \overbrace{111}^{exp}\ \overbrace{0010}^{frac}
\end{displaymath}

7.
Group into fours for conversion to Hexadecimal, $0111\ 0010$

8.
Convert to Hex.: 72.

Example. Convert $-0.625 = - \frac{5}{8}$ to flob, and express in hexadecimal format.

1.
Sign = 1.

2.
Convert fraction to binary: .1010.

3.
Normalise, i.e. shift the `point' until you have 1.xxx: $1.0100 \times 2^{-1}$ - the point shifted 1 right = multiply by $2^1$, so to compensate we multiply by $2^{-1}$, therefore we need to subtract 1 from the exponent - which starts off at 0.

4.
Discard the (hidden) `1' - before the point.

$.0100 \times 2^{-1}$.

5.
Convert the exponent to excess representation (excess = bias = 4); i.e. $stored\ exponent = actual\ exponent + 4$.

$-1 + 4 = 3 = 011_2$

6.
Assemble full representation,

\begin{displaymath}\overbrace{1}^{s}\ \overbrace{011}^{exp}\ \overbrace{0100}^{frac}
\end{displaymath}

7.
Group into fours for conversion to Hexadecimal, 1011 0100

8.
Convert to Hex.: B4.

Example. $0.625 = \frac{5}{8}$

See above, same except: $sign \rightarrow 0$

$0011\ 0100 = 34_{16}$; normally written as 34H.

Example. 0 - exactly zero.

$0000\ 0000 = 00H$. Zero is a special case, see above, which is not normalised.

Example. Convert 00H from flob representations to decimal value.

You just have to remember that this is zero - see above.

Example. Convert 35H from flob representations to decimal value.

1.
Express in binary: $0011\ 0101$.

2.
Pick off sign, bit 7: $ 0 \rightarrow +$.

3.
Extract stored exponent (bits 6,5,4): 011 = 3.

4.
Compute $actual\ exponent = stored\ exponent - 4 = 3-4 = -1$.

5.
Extract fraction: $0101\ \ (.0101)$.

6.
Now add in 1 before the point, i.e. the one that was `hidden': 1.0101.

7.
Express as `fixed point', i.e. make the exponent 0 - reverse the normalisation process: $1.0101 \times 2^{-1} = 0.10101 \times 2^0$.

8.
Convert to decimal: $0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 0 \times 2^{-4} + 1 \times 2^{-5} = 0.5 + 0.125 + 0.03125 = 0.65625$
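A minimal Java sketch of the decoding steps just walked through; the class and method names are mine, not part of the notes, and Math.pow is used purely for brevity:

\begin{verbatim}
public class Flob {
    // Decode the 8-bit 'flob' format from these notes: 1 sign bit, 3-bit
    // exponent (excess 4), 4-bit fraction with a hidden leading 1; the only
    // allowed denormalised value is 0x00, meaning exactly zero.
    static double decode(int b) {
        if (b == 0x00) return 0.0;
        int sign = (b >> 7) & 1;
        int storedExp = (b >> 4) & 0x7;
        int frac = b & 0xF;
        double mantissa = 1.0 + frac / 16.0;                   // 1.ffff with hidden bit
        double value = mantissa * Math.pow(2, storedExp - 4);  // subtract the excess
        return sign == 1 ? -value : value;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x35));  // 0.65625, as in the worked example
        System.out.println(decode(0x72));  // 9.0
        System.out.println(decode(0xB4));  // -0.625
    }
}
\end{verbatim}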

Addition and Subtraction in Floating Point

Here, we merely show the principle; and, to avoid some detail we use a decimal format.

Example. $2 \times 10^6 + 13 \times 10^3$. Ans. $2.013 \times 10^6 = 2013 \times 10^3 = 2013000$.

The general method for addition and subtraction, $A \pm B$, is:

1.
Line up points by making exponents equal:

(a)
Identify the operand with the smaller exponent, say A, with exponent expA;

(b)
Successively shift A's fraction right while increasing its exponent;

(c)
until expA = expB.

2.
Add (or subtract) the fractions.

3.
Normalise the result.

Example. $0.28 \times 10^1 + 0.12 \times 10^0$.

\begin{eqnarray*}&\ 0.280 \times 10^1 \\
&+ 0.012 \times 10^1 \\
&= 0.292 \times 10^1.
\end{eqnarray*}


Example. $0.28 \times 10^2 - 0.24 \times 10^2$.

\begin{eqnarray*}&\ 0.280 \times 10^2 \\
&- 0.240 \times 10^2 \\
&= 0.040 \times 10^2 \\
&= 0.400 \times 10^1\ \ \ \text{after normalisation}.
\end{eqnarray*}
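The following toy Java sketch mirrors the align-add-normalise recipe above; the names, the three-digit decimal fraction format, and the use of truncation rather than rounding are all illustrative assumptions, not part of the notes:

\begin{verbatim}
public class FloatAdd {
    // f holds three decimal digits representing 0.f, so (f = 280, e = 1)
    // means 0.280 x 10^1.
    static int[] add(int fa, int ea, int fb, int eb) {
        while (ea < eb) { fa /= 10; ea++; }          // align: shift A's fraction right
        while (eb < ea) { fb /= 10; eb++; }          // align: shift B's fraction right
        int f = fa + fb, e = ea;                     // add the fractions
        while (f >= 1000) { f /= 10; e++; }          // fraction overflowed: shift right
        while (f != 0 && f < 100) { f *= 10; e--; }  // normalise: leading digit non-zero
        return new int[]{f, e};
    }

    public static void main(String[] args) {
        int[] r = add(280, 1, 120, 0);               // 0.280 x 10^1 + 0.120 x 10^0
        System.out.println("0." + r[0] + " x 10^" + r[1]);  // 0.292 x 10^1
    }
}
\end{verbatim}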


Multiplication and Division

Strangely, these are easier than add and subtract.

Multiplication

1.
Multiply the fractions;

2.
Add the exponents;

3.
Normalise the result.

Division

1.
Divide the fractions;

2.
Subtract the exponents;

3.
Normalise the result.


\begin{eqnarray*}x &= &f_x \times r^{e_x} \\
y &= &f_y \times r^{e_y} \\ \\
x \times y &= &(f_x \times f_y) \times r^{(e_x+e_y)}\\ \\
\frac{x}{y} &= &\frac{f_x}{f_y} \times r^{(e_x-e_y)}
\end{eqnarray*}


Example. $0.120 \times 10^1 \times 0.253 \times 10^2 = 1.2 \times 25.3$.


\begin{eqnarray*}0.120 \times 10^1 \times 0.253 \times 10^2 &= & 0.030360 \times 10^3 \\
& = &0.030 \times 10^3 \ \ \ \text{truncated to three significant digits},\\
& = &0.300 \times 10^2 \ \ \ \text{normalised}.
\end{eqnarray*}


Fixed versus Floating Point

Speed:
where there is no dedicated floating-point hardware, fixed-point arithmetic can be performed faster and more easily, see below;

Ease of Implementation:
fixed point may be easier to implement; in the past, many computers - like the one we encounter in later chapters - had only integer arithmetic units, and so had to use software (using algorithms like those shown above) to do floating point. Up to the Intel 80386 series - great-great-grand-daddy of the Pentium III - you had to have a special purpose floating-point co-processor (the 80387) for Intel XX86 microprocessors; modern digital-signal-processors (DSPs) may not have any floating-point hardware;

Ease of programming:
generally, floating point saves the programmer any worry about over/under-flow;

Range:
fixed point limits the range of the numbers being represented. Floating point gives greater range - at the cost of resolution; see below for definitions of range and resolution.

Accuracy

Strictly speaking, accuracy is not a feature of the number system. Accuracy is determined by the error inherent in the production of the number, for example, the error in reading a meter. Thus, $12.34 \pm 0.01$ indicates that the actual number could be anywhere between 12.33 and 12.35. It might be possible to include better resolution, e.g. express it as 12.34567, but, due to the accuracy limitation, the 567 would be meaningless.

Resolution

 

Resolution is measured by $\delta v$ - signifying value difference. $\delta v$ is the difference in values represented by neighbouring states. In floating-point systems, $\delta v$ varies according to the magnitude of the number (v); in fixed-point systems $\delta v$ remains constant across the range of v.

Example. Consider $0\ 111\ 1000$ in flob. $v_1 = 1.1000_2 \times 2^3 = (1.5 \times 8)_{10} = 12.0$. The next value up is $v_2 = 1.1001 \times 2^3 = (1.5625 \times 8)_{10} = 12.5$. Hence, $\delta v = v_2 - v_1 = 0.5$.

At $v_1 = 1.1000 \times 2^{-3} = 1.5 \times \frac{1}{8} = 0.1875$. Next value up is $v_2 = 1.1001 \times 2^{-3} = 1.5625 \times \frac{1}{8} = 0.1953125$. Hence, $\delta v = v_2 - v_1 = 0.0078125 (= \frac{1}{128})$. If you used the whole eight bits for fixed-point then $\delta v$ would be constant at $2^{-7} = \frac{1}{128} = 0.0078125$ over the whole range.

Range

Range or dynamic range is the difference between the largest positive number and the largest negative number. In `flob' this is 15.5 - -15.5 = 31.0, see above.

Precision

Precision is defined as the ratio of resolution to value, i.e. $\frac{\delta v}{v}$. Floating point gives an almost constant precision across the dynamic range.

Example. See section 2.6.13. At $v = 1.1000 \times 2^3 = 12.0$, $\frac{\delta v}{v} = \frac{0.5}{12.0} = 0.042$.

At $v = 1.1000 \times 2^{-3} = 0.1875$, $\frac{\delta v}{v} = \frac{1/128}{0.1875} = 0.042$.

Exercise

What about eight-bit fixed point? What is the precision at: (a) .00000001? (b) .11111111? Note the large difference.

Common Number Formats Used in Computers

Various Sizes of Integers and their Ranges

Table 2.5 shows some commonly used integer formats and their ranges.


 
Table 2.5: Commonly Used Integer Formats

Size Name in Java Minimum Maximum
8-bit signed byte -128 127
16-bit signed short -32768 32767
32-bit signed int -2,147,483,648 2,147,483,647

     

 

Floating Point

IEEE single precision (32 bits): $\pm 1.2 \times 10^{-38}\ \text{to}\ 3.4
\times 10^{38}$, 7 significant digits.

$value = (-1)^s \times 1.f \times 2^{exp-127}$.

Bit 31 = sign; bits 30 to 23 = 8-bit exponent; bits 22 to 0 = fraction.

IEEE floating-point uses the hidden bit trick mentioned above.
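In Java, Float.floatToIntBits exposes these fields directly; a small sketch, decoding 6.75 (the value from the flob example earlier):

\begin{verbatim}
public class Ieee754 {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(6.75f);  // raw IEEE 754 single-precision bits
        int sign = (bits >>> 31) & 1;
        int storedExp = (bits >>> 23) & 0xFF;    // excess-127 exponent field
        int fraction = bits & 0x7FFFFF;          // 23 fraction bits (hidden 1 not stored)
        System.out.println(sign);                              // 0
        System.out.println(storedExp - 127);                   // 2, since 6.75 = 1.1011 x 2^2
        System.out.println(Integer.toBinaryString(fraction));  // 10110000000000000000000
    }
}
\end{verbatim}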

Bitwise Logical Operations

[Chapter 3 is a prerequisite for this section].

In most assembly languages, and in C and C++, there are binary operations which perform logical operations on each corresponding pair of bits.

AND


\begin{eqnarray*}a & = & 0011\ 0000 \\
b & = & 0101\ 0111 \\
a\ \&\ b& = & 0001\ 0000
\end{eqnarray*}


Masking

Let us say that we have two four bit numbers packed into a byte. How do we disentangle them? Take the byte $1001\ 0111$; the two four bit numbers are 7 and 9. First, the lower four bits; here we mask with $0000\ 1111$.


\begin{eqnarray*}a & = & 1001\ 0111 \\
lowMask & = & 0000\ 1111\\
a\ \&\ lowMask & = & 0000\ 0111
\end{eqnarray*}


For the higher four bits, we shift right by four bits (see subsection 2.8.3), and use the same mask.
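A Java sketch of the same unpacking (variable names are illustrative):

\begin{verbatim}
public class Nibbles {
    public static void main(String[] args) {
        int b = 0b10010111;            // the packed byte 1001 0111 from the example
        int low  = b & 0x0F;           // mask with 0000 1111 -> low nibble
        int high = (b >> 4) & 0x0F;    // shift right four, then the same mask -> high nibble
        System.out.println(low);       // 7
        System.out.println(high);      // 9
    }
}
\end{verbatim}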

OR


\begin{eqnarray*}a & = & 0011\ 0000 \\
b & = & 0101\ 0111 \\
a\ \mid b& = & 0111\ 0111
\end{eqnarray*}


   
Shift


\begin{eqnarray*}a & = & 0011\ 0000\ (= 16 + 32 = 48_{10}) \\
a\ \text{shiftleft}\ 1 & = & 0110\ 0000\ (= 96_{10})\ \text{i.e.}\ 48 \times 2 \\
a\ \text{shiftright}\ 1 & = & 0001\ 1000\ (= 24_{10})\ \text{i.e.}\ \frac{48}{2}
\end{eqnarray*}


Thus, you sometimes find assembly language programmers using shift for multiplying (shift left) and dividing (shift right) by powers of two.

Rotate


\begin{eqnarray*}a & = & 1011\ 0000 \\
a\ \text{rotateleft}\ 1 & = & 0110\ 0001\ \\
a\ \text{rotateright}\ 1 & = & 0101\ 1000\
\end{eqnarray*}


Rotate is like shift, only bits can wrap-around.
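Java has no 8-bit rotate operator, so a sketch has to build one from shifts and a mask; the helper names below are mine (for 32-bit ints the library provides Integer.rotateLeft and Integer.rotateRight):

\begin{verbatim}
public class Rotate {
    // Rotate an 8-bit value: bits shifted out of one end re-enter at the other.
    static int rotl8(int a, int n) { return ((a << n) | (a >>> (8 - n))) & 0xFF; }
    static int rotr8(int a, int n) { return ((a >>> n) | (a << (8 - n))) & 0xFF; }

    public static void main(String[] args) {
        int a = 0b10110000;
        System.out.println(Integer.toBinaryString(rotl8(a, 1))); // 1100001  (0110 0001)
        System.out.println(Integer.toBinaryString(rotr8(a, 1))); // 1011000  (0101 1000)
    }
}
\end{verbatim}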

Alphanumeric codes

Text characters, as well as numbers, must be binary coded to get them into a computer memory. A code that copes with alphabetic characters, {a ...z, A ...Z}, as well as digits {0 ...9}, is called alphanumeric.

Until recently, the most frequently used code was ASCII (American Standard Code for Information Interchange); ASCII uses 7 bits, (0 to 127); this includes printable and control characters.

Table 2.6 gives some codes worth remembering.


 
Table 2.6: ASCII table

Name               Hex   if control
bel                07    ctrl-G
line-feed          0a    ctrl-J
carriage-return    0d    ctrl-M
blank-space        20
digits:
0                  30
...
9                  39
alphabetic upper-case:
A                  41
...
Z                  5a
alphabetic lower-case:
a                  61
...
z                  7a

 


UNICODE

ASCII has proved adequate for English text. But what about languages with accents? And those with other character shapes? UNICODE is a newer international standard, governed by the UNICODE consortium, aimed at allowing a richer character set. In fact, Java uses UNICODE - a 16-bit code - so the Java char type is 16 bits, compared to 8 bits in C and C++. According to [Tanenbaum, 1999] the known characters in the world's languages number some 200,000 - thus, there is already pressure on UNICODE.

Types

41H could be `A' or the number 41H. "AB" together in a 16-bit word could be part of a string, or the number 4142H, or part of an instruction, or .... The computer program must know beforehand what each memory cell represents - thus, types.

Error Detection

When information is transferred between various parts of a computer system it is possible that errors may be introduced. These errors take the form of transmitted 0s inadvertently being received as 1s or vice-versa.

Parity provides single-error detecting capability. The scheme involves appending an additional bit to the string of 1s and 0s - the parity bit. Depending on the chosen convention, the appended bit is chosen so that the total number of 1s in the string is always even (even parity) or always odd (odd parity).

Bit 7 is often used - pure ASCII uses 7 bits, bits $0 \ldots 6$.

Example. ASCII character `A' using odd parity.

$A = 41 = P100\ 0001 = 1100\ 0001$ with the odd parity bit placed in bit 7.
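A small Java sketch of the same calculation (the method name is mine); Integer.bitCount counts the 1s:

\begin{verbatim}
public class Parity {
    // Append an odd-parity bit in bit 7 of a 7-bit ASCII code: if the count
    // of 1s is already odd, the parity bit is 0; otherwise it is 1.
    static int withOddParity(int ascii7) {
        int ones = Integer.bitCount(ascii7 & 0x7F);
        int parity = (ones % 2 == 0) ? 1 : 0;
        return (parity << 7) | (ascii7 & 0x7F);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(withOddParity('A'))); // c1
    }
}
\end{verbatim}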

Exercises

1.
(Section 2.2.1) Java has a type boolean; what is the set of values for it? Identify some legitimate operations on boolean values. (Hint: how would you write an if statement to do something if the integers i and j satisfy the following conditions: (a) i is between 1 and 10 and j is between 21 and 30, (b) i is between 1 and 10 or it is between 21 and 30, (c) i is any value except those between 50 and 60.) Identify some operations on boolean values that are not legitimate, i.e. that will cause compiler syntax errors; hint: int, see below.

2.
Java has a type int; what, roughly, is the set of values for int? Identify some legitimate operations on int values. Identify some operations on int values that are not legitimate, i.e. that will cause compiler syntax errors; hint: boolean, see above.

3.
Is type checking worthwhile?

Section 2.3.

4.
Does the Roman number system fit into the model given in section 2.3.1? Discuss.

5.
The set of integers $\{0, 1, 2, \ldots\}$ are called natural numbers. Why? Is 0 really natural?

6.
Convert the following 4-bit binary numbers to decimal (assume unsigned): 0000, 1111, 0001, 0101, 0111, 1000, 1001.

7.
Convert the following decimal numbers to binary - use as many bits as you need (assume unsigned): 2, 4, 7, 8, 23, 127, 129, 192.

8.
Convert the following 4-bit binary numbers to hexadecimal (assume unsigned): 0000, 1111, 0001, 0101, 0111, 1000, 1001.

9.
Convert the following 8-bit binary numbers to hexadecimal (assume unsigned): 0000 0001, 0001 0000, 1111 0001, 1001 1000.

10.
Convert the following decimal numbers to hexadecimal - use as many digits as you require (assume unsigned): 2, 4, 7, 8, 23, 127, 129, 192; Hint: it may be easier to complete a previous question first.

11.
(a) Suggest symbols for a 'base-12' number system. (b) What is 33 base-12 in decimal? (c) What is 53 decimal in base-12? (d) What is 59 decimal in base-12?

12.
(a) Suggest symbols for a 'base-36' number system. (b) What is 22 base-36 in decimal? (c) What is 73 decimal in base-36? (d) What is 71 decimal in base-36?

13.
How many states are possible with: (a) 12 bits, (b) 16 bits, (c) 4 bits, (d) 1 bit.

14.
How many bits are required to uniquely represent: (a) 2048 things, (b) 1000 things, (c) 100 things, (d) the decimal digits, (e) the letters 'a' to 'z', (f) 1 thing.

15.
Perform the following arithmetic on 4-bit binary values: (a) 0111 + 0110; (b) 0111 + 0110; (c) 0101 + 0101 -- this is $101 \times 2$, notice what multiplying by 2 does; what do you think division by 2 does? (d) (problems here! -- explain) 0111 + 1101.

16.
Perform arithmetic on the following two digit hexadecimal values (assume that they must fit into 8-bits unsigned - point out any invalid results/overflows): (a) 2E + 15 (see section 2.4.2 if you need help); (b) 7E + 25; (c) 1F + 31; (d) FF + 01.

17.
(a) Give the amended rules for base-36 arithmetic. (b) Add 11 base-36 to 22 base-36. (c) Check your work in decimal.

Section 2.5. Note: in an exam, I'd expect you to be able to describe the three methods of representing signed numbers - with examples and brief compare/contrast.

18.
For this question, assume 4-bit twos-complement representation. In each case check for overflow and in your answer, note the validity of the result. Convert the following decimal numbers to binary: 2, 4, 7, -2, -4, -7, 8, -8.

19.
For an 8-bit 2-s complement format construct, and (where appropriate) give decimal value of: (a) the most negative number; (b) the most positive; (c) -1; (d) most negative plus 1; (e) zero.

20.
For this question, assume 4-bit twos-complement representation. In each case check for overflow and in your answer, note the validity of the result. Perform the following arithmetic: (a) 0001 + 0101; (b) 0001 + 1101; (c) 1101 + 0101; (d) 1000 + 0111; (e) 0111 + 0110; (f) 1001 + 1001; (g) 1001 - 1001; (h) 1001 + 0001; (i)1101 + 1101.

21.
(Section 2.5.6) Draw the circle formed by 4-bit twos-complement representation.

22.
(Section 2.6). Convert the following binary numbers with fractional parts to decimal: (a) 11.11, (b) 101.111;

23.
Convert the following decimal numbers with fractional parts to binary: (a) 5.5; (b) 2.25; (c) 3.625; (d) 5.875.

24.
Section 2.8. Perform the following bit-wise logical operations: (a) $0101\ 1001\ \mathbf{and}\ 0000\ 1111$; (b) $0101\ 1001\ \mathbf{and}\ 0000\ 0110$; (c) $0101\ 1001\ \mathbf{or}\ 0000\ 1111$; (d) $0101\ 1001\ \mathbf{or}\ 0000\ 0110$.

25.
Perform the following shift operations: (a) $0000\ 0110$ shiftright 1; (b) $0000\ 0110$ shiftleft 1; (c) convert the originals and the shifted values to decimal; hence deduce the arithmetic equivalent of shift-right? shift-left?

26.
Translate the following ASCII (8-bit) code string (in hexadecimal) to text characters: 68656C6C6f20776f726C640D0A4279650D0A; if necessary, see Deitel & Deitel p. 1242, but my Table 2.6 should be adequate, i.e. count ahead, 61 = 'a', 62 = 'b', 63 = 'c' etc.

27.
(a) Translate the text string "Good day" to ASCII (hex.) (b) Add three 'beeps' to it.

28.
If you did a hex print out of memory and found the following four contiguous bytes: 30313233. Make some suggestions as to what is represented. In the context of types comment on this state of affairs.

29.
Insert odd parity bits (top bit) to the following character string 68656C6C6f20.

30.
What is the major difference between UNICODE and ASCII?

31.
You have a file containing unsigned bytes. You want to transmit it to another computer using a line normally used for ASCII characters: (i) identify the major problems that you will encounter; (ii) design a failsafe (and fairly efficient) solution (involving a simple encoding and decoding); (iii) if your file starts off at 1000 bytes, how many characters will you transmit; (iv) how many bits.

32.
Suggest appropriate number systems for each of the following: (use integer where possible, use as few bits as possible, consider the use of a code if appropriate; when I say code I mean that the 'real-world' quantity may be represented by other than a 'straight' representation, e.g. if I had student marks in the range 33 to 63 (out of 100), I could code this into 5-bits, using a biased representation -- see notes above): (a) speeds of vehicles (e.g. in a radar speed trap machine); (b) identifiers for all the students in this class; (c) identifiers for all the people in Northern Ireland; (d) identifiers for all the people in the world; (e) identifiers for all the computers in the world (well, those connected to the Internet).

33.
Integers (int) in Java are usually 32 bits. Explain how you would construct an integer number system that would handle 256 bits (say), i.e. which would give a range: $-2^{255}\ to\ +2^{255} - 1$.

34.
Explain a simple manual way of generating random numbers in the range 0 to 255; note, I don't consider any of the following simple enough: (i) construction of a 256 sided dice; (ii) putting 256 labelled cards in a hat and drawing lots. (Hint: think of it as a binary number);

35.
Explain how many bits are required for a completely faithful representation of a Real number; let's say we are restricted to the range 0.0 to 1000.0 -- but that doesn't affect the answer at all! Hint: it's an awful lot!

36.
Explain how many bits are required to store a signed rational number $\frac{N}{D}$, where the numerator N, and the denominator D are restricted to the ranges $[0 \ldots 32767]$, $[1 \ldots 32768]$, respectively. Ans. 31 (explain).

