Addendum 2.1: Floating-Point (Real) Numbers

We already know the basic layout for the data type float (from figure 2.20):


1-bit Sign	7-bit Exponent	24-bit Mantissa

We also know that (generally) the left-most bit is the sign, and can take on either a ‘0’ or a ‘1’ value.

Let's store the value 7.89, or .789E+1 (i.e., 0.789 * 10¹). First, let's talk about how we would store the exponent value (+1).

The characteristic of the exponent is not that difficult to understand either. Since we have 7-bits, we have a total of 2⁷ = 128 combinations, but since we know that the exponent can be either negative or non-negative, we really only have ½ that number.

Not quite. The characteristic of the exponent is stored as a biased exponent. That means that rather than storing the sign and the value separately, we add (bias) a constant term to the true value. In our case, we would add the value 64 (which is ½ of 128) to the true value. The exponent value –17₁₀ (17₁₀ = 10001₂, or 0010001₂ on 7-bits) would actually be stored as the value –17 + 64 = 47₁₀ (= 101111₂, or 0101111₂ on 7-bits); the exponent value 23₁₀ (= 10111₂, or 0010111₂ on 7-bits) would actually be stored as the value 23 + 64 = 87₁₀ (= 1010111₂). In our case, the exponent value 1 would actually be stored as 1 + 64 = 65₁₀ (= 1000001₂).

There are some technical reasons for using a biased exponent, which we need not go into, but it does, of course, circumvent the step of having to store the sign and the value separately.

One final quick note on characteristics: the range of exponent values is actually –(2⁶ – 1) through +(2⁶ – 1), or –63 through +63. The binary representation 0000000₂ (the decimal value 0) is reserved for other uses.
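The biasing step can be sketched in a few lines of Python. This is only an illustration of the 7-bit, bias-64 layout described in this addendum; the names `BIAS`, `encode_characteristic` and `decode_characteristic` are our own and do not come from any library.

```python
BIAS = 64        # half of 2**7, as described above
EXP_BITS = 7     # width of the characteristic field


def encode_characteristic(exponent: int) -> str:
    """Return the 7-bit biased characteristic for a true exponent."""
    # The text gives the usable range as -63..+63; the stored value 0 is reserved.
    if not -(BIAS - 1) <= exponent <= BIAS - 1:
        raise ValueError(f"exponent {exponent} is outside -63..+63")
    stored = exponent + BIAS
    return format(stored, f"0{EXP_BITS}b")


def decode_characteristic(bits: str) -> int:
    """Recover the true exponent from a 7-bit characteristic."""
    return int(bits, 2) - BIAS


for true_exponent in (-17, 23, 1):
    print(true_exponent, encode_characteristic(true_exponent))
```

Run as written, the loop prints the three characteristics worked out above: 0101111, 1010111 and 1000001.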

Converting the mantissa to binary requires a somewhat different algorithm than we used to convert integers to binary, but it still has to do with exponent position. For example, the integer 456 would have the exponent positions shown at the left (in other words, 456 = 4*10² + 5*10¹ + 6*10⁰). If we were to consider the real number 456.789, however, the exponent positions would appear as they do on the right (in other words, 456.789 = 4*10² + 5*10¹ + 6*10⁰ + 7*10^-1 + 8*10^-2 + 9*10^-3). Notice that the exponents for the mantissa are the inverse of the positions for the integer portion of the number.

The procedure we need to use is also the inverse of the procedure we used when converting a decimal integer to binary. Previously, we divided the integer portion by two, kept track of the remainders, and collected them in the reverse of the order received. Now, we need to multiply the mantissa by two, keep track of the quotients (the digit that appears to the left of the point in each result), and collect them in the order received. In both cases, however, we can stop when the value to be multiplied or divided is 0.

The mantissa .789, for example, could be converted to binary as follows:

Mantissa * 2	=	Result	→	Quotient	Mantissa * 2	=	Result	→	Quotient
0.789 * 2	=	1.578		1	0.744 * 2	=	1.488		1
0.578 * 2	=	1.156		1	0.488 * 2	=	0.976		0
0.156 * 2	=	0.312		0	0.976 * 2	=	1.952		1
0.312 * 2	=	0.624		0	0.952 * 2	=	1.904		1
0.624 * 2	=	1.248		1	0.904 * 2	=	1.808		1
0.248 * 2	=	0.496		0	0.808 * 2	=	1.616		1
0.496 * 2	=	0.992		0	0.616 * 2	=	1.232		1
0.992 * 2	=	1.984		1	0.232 * 2	=	0.464		0
0.984 * 2	=	1.968		1	0.464 * 2	=	0.928		0
0.968 * 2	=	1.936		1	0.928 * 2	=	1.856		1
0.936 * 2	=	1.872		1	0.856 * 2	=	1.712		1
0.872 * 2	=	1.744		1	0.712 * 2	=	1.424		1

And COLLECTING FROM THE TOP, the mantissa .789 would be stored (on 24-bits) as: 110010011111101111100111
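The multiply-by-two procedure in the table can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `fraction_to_bits` is our own, and we use Python's `fractions.Fraction` so that the decimal mantissa is held exactly rather than as a hardware float.

```python
from fractions import Fraction


def fraction_to_bits(mantissa: str, width: int = 24) -> str:
    """Convert a decimal fraction in [0, 1), given as text, to `width` binary digits."""
    value = Fraction(mantissa)   # exact, so 0.789 is not disturbed by float rounding
    if not 0 <= value < 1:
        raise ValueError("mantissa must be in [0, 1)")
    bits = []
    for _ in range(width):
        value *= 2
        carry = int(value)       # the digit left of the point: 0 or 1
        bits.append(str(carry))
        value -= carry           # keep only the fractional part for the next step
    return "".join(bits)


print(fraction_to_bits("0.789"))
```

The sketch always produces `width` digits; once the remaining value reaches 0, every further digit is simply 0, which matches the stopping rule given earlier.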

and the value 7.89 would be stored as: 0 1000001 110010011111101111100111 (on 32-bits)

This is true. We previously noted that, unlike integers, not all mantissas can be converted to binary. Sometimes, they become an infinite series.
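One way to see which mantissas become an infinite series is to write the mantissa as a fraction in lowest terms: the binary expansion is finite only when the denominator is a power of two. The sketch below assumes that rule; the function name `terminates_in_binary` is our own.

```python
from fractions import Fraction


def terminates_in_binary(mantissa: str) -> bool:
    """True if the decimal fraction has a finite binary expansion."""
    denominator = Fraction(mantissa).denominator   # Fraction reduces to lowest terms
    # A power of two has exactly one bit set, so d & (d - 1) is 0.
    return denominator & (denominator - 1) == 0


for text in ("0.5", "0.625", "0.789", "0.1"):
    print(text, terminates_in_binary(text))
```

Here .5 and .625 terminate (1/2 and 5/8), while .789 and .1 do not (789/1000 and 1/10).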

Much in the same way we checked integers. For example we know that the binary representation of the integer 456₁₀ = 111001000₂ can be checked as:

2⁸ + 2⁷ + 2⁶ + 2³ = 256 + 128 + 64 + 8 = 456

Since we know that the positions of the mantissa are the inverse of the integer positions, we know that the binary mantissa 110010011111101111100111 (≈ .789₁₀) can be associated with the positions:

-1	-2	-3	-4	-5	-6	-7	-8	-9	-10	-11	-12	-13	-14	-15	-16	-17	-18	-19	-20	-21	-22	-23	-24
1	1	0	0	1	0	0	1	1	1	1	1	1	0	1	1	1	1	1	0	0	1	1	1

Meaning that the expression could be checked as:

2^-1 + 2^-2 + 2^-5 + 2^-8 + 2^-9 + 2^-10 + 2^-11 + 2^-12 + 2^-13 + 2^-15 + 2^-16 + 2^-17 + 2^-18 + 2^-19 + 2^-22 + 2^-23 + 2^-24

First, let’s approach it as a decimal mantissa. The decimal mantissa .789 could be rewritten as:

.789 = 7*10^-1 + 8*10^-2 + 9*10^-3

= 7/10¹ + 8/10² + 9/10³

= 7/10 + 8/100 + 9/1000

= .7 + .08 + .009

= .789

The binary mantissa (as with integers, ‘0’ bits are ignored) could be rewritten in the same fashion:

2^-1 + 2^-2 + 2^-5 + 2^-8 + 2^-9 + 2^-10 + 2^-11 + 2^-12 + 2^-13 + 2^-15 + 2^-16 + ···

= 1/2¹ + 1/2² + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹⁰ + 1/2¹¹ + 1/2¹² + 1/2¹³ + 1/2¹⁵ + 1/2¹⁶ + ···

= 1/2 + 1/4 + 1/32 + 1/256 + 1/512 + 1/1024 + 1/2048 + 1/4096 + 1/8192 + ···

= .5 + .25 + .0313 + .0039 + .0020 + .0010 + .0005 + .0002 + .0001 + ···

≈ .789
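The same check can be carried out mechanically by summing 2^-position for every ‘1’ bit. The sketch below is illustrative only; `bits_to_fraction` is a name we made up, and exact fractions are used so that the size of the error can be inspected.

```python
from fractions import Fraction


def bits_to_fraction(bits: str) -> Fraction:
    """Sum 2**-position for every '1' bit; position 1 is the left-most bit."""
    total = Fraction(0)
    for position, bit in enumerate(bits, start=1):
        if bit == "1":
            total += Fraction(1, 2 ** position)
    return total


stored = bits_to_fraction("110010011111101111100111")
error = Fraction("0.789") - stored

print(float(stored))                    # very close to, but slightly below, 0.789
print(0 <= error < Fraction(1, 2 ** 24))
```

Because the 24 bits were obtained by cutting the series off, the stored value falls short of .789 by less than 2^-24, so the second line prints True.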

To store the real number 456.789, we must first normalize the number:

456.789 = 45.6789 * 10¹

= 4.56789 * 10²

= .456789 * 10³

Where the last notation is what we will use to store the number:

Sign	Exponent	Mantissa
1-bit	7-bits	24-bits
0 (positive)	+3	.456789
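The normalization step can be sketched with Python's `decimal` module. This is a hedged illustration of the layout used in this addendum (mantissa of the form .ddd, exponent a power of ten); the function name `normalize` is our own, and zero is deliberately left out of the sketch.

```python
from decimal import Decimal


def normalize(number: str) -> tuple[int, int, Decimal]:
    """Split a nonzero decimal into (sign bit, exponent, mantissa), 0.1 <= mantissa < 1."""
    value = Decimal(number)
    if value == 0:
        raise ValueError("zero has no normalized form in this sketch")
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = magnitude.adjusted() + 1      # adjusted() is the power of ten of the leading digit
    mantissa = magnitude.scaleb(-exponent)   # shift the decimal point left by `exponent` places
    return sign, exponent, mantissa


print(normalize("456.789"))
print(normalize("7.89"))
```

For 456.789 this yields sign 0, exponent 3 and mantissa .456789, the same three fields shown in the table above.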

Storing the first 8-bits (the sign and the characteristic) is relatively easy:

Sign: 0 (positive)

Characteristic: 3₁₀ + 64₁₀ = 67₁₀ = 1000011₂ (on 7-bits)

Sign and characteristic layout: 01000011 (using the first 8-bits)

Converting the mantissa .456789 as we did before:

Mantissa * 2	=	Result	→	Quotient	Mantissa * 2	=	Result	→	Quotient
0.456789 * 2	=	0.9136		0	0.007744 * 2	=	0.0155		0
0.913578 * 2	=	1.8272		1	0.015488 * 2	=	0.0310		0
0.827156 * 2	=	1.6543		1	0.030976 * 2	=	0.0620		0
0.654312 * 2	=	1.3086		1	0.061952 * 2	=	0.1239		0
0.308624 * 2	=	0.6172		0	0.123904 * 2	=	0.2478		0
0.617248 * 2	=	1.2345		1	0.247808 * 2	=	0.4956		0
0.234496 * 2	=	0.4690		0	0.495616 * 2	=	0.9912		0
0.468992 * 2	=	0.9380		0	0.991232 * 2	=	1.9825		1
0.937984 * 2	=	1.8760		1	0.982464 * 2	=	1.9649		1
0.875968 * 2	=	1.7519		1	0.964928 * 2	=	1.9299		1
0.751936 * 2	=	1.5039		1	0.929856 * 2	=	1.8597		1
0.503872 * 2	=	1.0077		1	0.859712 * 2	=	1.7194		1

And COLLECTING FROM THE TOP, the mantissa (on 24-bits) is: 011101001111000000011111

Therefore, the real number +456.789 would be stored as:

01000011011101001111000000011111

(on 32-bits)
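As a check on the 32-bit result, the stored word can be unpacked again. The sketch below assumes exactly the simplified layout of this addendum (1-bit sign, 7-bit characteristic biased by 64 and read as a power of ten, 24-bit binary mantissa); formats used by real hardware, such as IEEE 754, are laid out differently. The name `decode_real` is our own.

```python
from fractions import Fraction

BIAS = 64


def decode_real(word: str) -> Fraction:
    """Unpack a 32-bit string laid out as sign (1), characteristic (7), mantissa (24)."""
    word = word.replace(" ", "")
    if len(word) != 32 or set(word) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    sign = -1 if word[0] == "1" else 1
    exponent = int(word[1:8], 2) - BIAS              # remove the bias to get the power of ten
    mantissa = Fraction(int(word[8:], 2), 2 ** 24)   # the 24 bits read as a binary fraction
    return sign * mantissa * Fraction(10) ** exponent


print(float(decode_real("01000011011101001111000000011111")))
print(float(decode_real("0 1000001 110010011111101111100111")))
```

The two results come out very close to, but not exactly equal to, 456.789 and 7.89, because the 24-bit mantissa was cut off.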

Sometimes, but not very often. Naturally, increasing the number of bits we allocate to the mantissa helps increase the precision of the mantissa, but it doesn’t always assure that we will be able to represent values exactly.

Yes.

That is beyond the realm of our discussion. However, there is enough information provided to allow the truly die-hard student to figure it out.

They are. As we noted previously, that is why some supercomputers indicate operating speed in terms of flops (floating-point operations per second).
