# FP Numbers Explained

How Do They Affect Our Music?
By David Nash

A reference guide to understanding the terms used in SOS articles and technical documentation.

Many of the articles in SOS, not to mention the specifications for audio hardware and software, use the terms fixed point numbers, floating point numbers and decibels. But what do terms like this actually mean? And what are the consequences for our music? This guide is intended to help SOS readers who would like to know a bit more about the relationship between the sounds that we hear, and the numbers that are pushed around inside our PCs.

## Sound Intensities And Decibels

The range of sound intensities which can be processed by the human ear covers about 14 powers of 10 ($10^{14}$). If we exclude all outside sounds, we can derive a threshold intensity at which sound can barely be heard. This is of the order of $10^{-12}$ Watts per square metre at a frequency of 1kHz, but it's not very practical for audio engineers and musicians to use these units from physics. It is more useful to have a measure which uses the threshold level (or some other reference level) as a starting point, and states the factor by which a given sound is larger than the threshold. That is, if $I_p$ is the particular sound intensity of interest, and $I_0$ is the threshold intensity, we can use the ratio $I_p/I_0$ as a measure. In order to bring the large numbers given by powers of 10 into a more manageable range, in keeping with our subjective perception of sound, we take logarithms to base 10 of the ratio, and use the following formula to measure sound in terms of its intensity level.

Intensity level $= 10 \log_{10}(I_p/I_0)$

This unit of intensity is the decibel, abbreviated dB, and measures the amount of sound power per unit area. From this definition we can obtain a useful scale with which to measure sounds as shown in Table 1. This puts the level of the threshold of hearing as zero decibels (0dB).

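Table 1: Sound intensities expressed in decibels

| Sound | Intensity (Watts/m²) | Ratio ($I_p/I_0$) | Level (dB) |
|---|---|---|---|
| Threshold of hearing | $10^{-12}$ | 1 | 0 |
| ppp (very soft) | $10^{-8}$ | $10^{4}$ | 40 |
| p | $10^{-6}$ | $10^{6}$ | 60 |
| f | $10^{-4}$ | $10^{8}$ | 80 |
| fff (very loud) | $10^{-2}$ | $10^{10}$ | 100 |
| Threshold of pain | $10^{0}$ | $10^{12}$ | 120 |
| Very painful | $10^{2}$ | $10^{14}$ | 140 |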

We now have a manageable range of numbers (0 to 140) which covers the range of our aural processing. It can be seen that the range from fff to ppp is 100−40 or 60dB, a very manageable set of numbers to use in music.

Doubling the intensity of a sound increases its measured level by 3dB, regardless of its current intensity:

dB = 10 log(2/1) ie. 10 x 0.3010
= 3

and halving the intensity decreases the measured level by 3dB:

dB = 10 log(1/2) ie. 10 x −0.3010
= -3
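The intensity-level formula and the 3dB rule can be checked with a few lines of Python. This is only an illustrative sketch: the function name `intensity_level_db` and the constant `I0` are names chosen here, and the threshold value is the $10^{-12}$ Watts per square metre quoted in the text.

```python
import math

I0 = 1e-12  # threshold of hearing in Watts per square metre (value from the text)

def intensity_level_db(intensity, reference=I0):
    """Intensity level in dB of `intensity` relative to `reference`."""
    return 10 * math.log10(intensity / reference)

# A few rows of Table 1
for label, intensity in [("ppp", 1e-8), ("fff", 1e-2), ("threshold of pain", 1.0)]:
    print(f"{label}: {intensity_level_db(intensity):.0f}dB")

# Doubling or halving the intensity changes the level by about 3dB
print(round(intensity_level_db(2 * I0), 2))
print(round(intensity_level_db(0.5 * I0), 2))
```

Because only the ratio matters, the same 3dB step appears whatever reference intensity is chosen.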

Microphones respond to pressure amplitude. The intensity (power) of a sound wave is proportional to the square of the sound pressure (amplitude, analogous to voltage), therefore when used to measure sound pressure level (SPL), the decibel becomes:

$\mathrm{dB_{SPL}} = 10 \log\left((SPL_p/SPL_0)^2\right)$
$= 20 \log(SPL_p/SPL_0)$

Psychologists have discovered that at a level of 30dB using a 1kHz signal, the smallest perceived change in sound level is about 1dB.

## Audio And Decibels

In audio engineering, the decibel still describes how much larger or smaller one value is than the other. It can also be used as an absolute unit of measurement for lining up audio equipment if the reference is fixed and known.

The decibel is now defined as 10 multiplied by the logarithm to base 10 of the ratio between the powers P of two signals:

(Equation 1)
dB = 10 x log(P2/P1)

As before with sound intensities, a doubling of the power will be an increase of 3dB. If we want to use the decibel to measure increases/decreases in signal levels using voltage or current, we need to take into account the relationship between power (W) and voltage (V) or current (I). Ohm's law gives $W = V^2/R$, or $I^2R$, so to compare two voltages:

(Equation 2)
$\mathrm{dB} = 10 \times \log\left((V_2/V_1)^2\right)$
$= 20 \times \log(V_2/V_1)$

If we double the voltage of a signal from V1 to V2, the gain in dB will be:

(Result 1)
dB = 20 x log(V2/V1)
= 20 x log(2/1)
= 20 x 0.3010
= 6

That is to say, the doubling of a voltage (in a permitted range) is a change of +6dB in the level of the audio signal that voltage represents. A halving of the voltage will be −6dB change. We can now use decibels to describe the voltage gain of an amplifier. If the gain is quoted as 40dB, we use equation 2 to find the ratio R by which the input signal is multiplied.

40 = 20 x log(R)
2 = log(R)
(taking anti logs)
$10^2$ = R
100 = R

Therefore a 40dB gain is the equivalent of multiplying the input voltage by 100.
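As a rough sketch of Equation 2 and its inverse, the two helper functions below convert between a voltage ratio and a gain in decibels. The function names are invented for this illustration and do not come from any audio library.

```python
import math

def voltage_gain_db(v_out, v_in):
    """Gain in dB between two voltages (Equation 2)."""
    return 20 * math.log10(v_out / v_in)

def db_to_voltage_ratio(gain_db):
    """Ratio by which a voltage is multiplied for a given gain in dB."""
    return 10 ** (gain_db / 20)

print(round(voltage_gain_db(2, 1), 2))     # doubling the voltage
print(round(db_to_voltage_ratio(40), 6))   # the 40dB amplifier example
print(round(db_to_voltage_ratio(4), 3))    # ratio for a 4dB increase
```

The last line gives the ratio of about 1.585 that is used for the 4dB examples later in this guide.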

## Computer Storage

In the digital domain, we typically capture sound by sampling an incoming signal voltage and converting the sample value to a digital value to be stored in some binary form. So we need to know how computer storage systems can store numbers in different forms to represent these signal samples, and how we can increase or decrease them by a specified decibel value.

## Fixed-point Systems

Storing integers in a computer is a relatively simple matter. A 16-bit storage unit has $2^{16}$ combinations of values. If we are not interested in storing negative numbers, then the largest number that can be held in 16 bits is $2^{16}-1$, which is 65,535. If we need to store negative numbers, then we split the available combinations into two halves. The lower half where the leftmost bit is zero is used to store positive numbers, and the upper half where the leftmost bit is always set to one is used to store negative numbers. Thus we still have the same number of combinations, but instead of being in the range 0 to $2^{16}-1$, they are now in the range $-2^{15}$ to $2^{15}-1$ (-32,768 to 32,767). Table 2 shows how various numbers are stored.

Table 2: How numbers are stored

| Decimal value | Binary value |
|---|---|
| +12 | 0000000000001100 |
| +255 | 0000000011111111 |
| -1 | 1111111111111111 |
| -21 | 1111111111101011 |

To find the positive value of a negative binary number, we invert all the bits, and then add 1. This representation of negative numbers is called the two's complement system. Fractions can also be represented if space is allocated: binary fractions to the right of the binary point take the values 1/2, 1/4, 1/8, 1/16 and so on, so decimal 20.625 is binary 10100.101, because 0.625 is 5/8, which is 1/2 + 0/4 + 1/8. Unfortunately, unless the denominator of the decimal fraction is a power of 2, the binary fraction will not convert exactly.
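The two's complement patterns and the binary fractions described above can be reproduced with a short Python sketch. Both function names are hypothetical, chosen for this example, and `binary_fraction` simply truncates after the requested number of places.

```python
def twos_complement(value, bits=16):
    """Bit pattern of a signed integer in two's complement form."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given number of bits")
    # Masking with 2**bits - 1 maps negative values onto the upper half
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def binary_fraction(fraction, places):
    """First `places` binary digits of a fraction in the range 0 <= fraction < 1."""
    digits = []
    for _ in range(places):
        fraction *= 2
        bit = int(fraction)      # 1 if the doubled value has reached 1, else 0
        digits.append(str(bit))
        fraction -= bit
    return "".join(digits)

print(twos_complement(12))        # 0000000000001100
print(twos_complement(-21))       # 1111111111101011
print(binary_fraction(0.625, 3))  # 101, so 20.625 is 10100.101
```

A fraction such as 0.1, whose denominator is not a power of 2, never terminates in this loop however many places are requested.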

In the decimal system, we can increase a number by a factor of 10 by shifting it one place to the left. For example, 21.6 becomes 216.0. In the binary system, we increase a number by a factor of 2 by shifting it one place to the left. For example, 1100.1 (12.5) becomes 11001.0 (25.0).

It can also be seen that in order to double the capacity of a binary storage system, one extra binary digit is needed. As we have seen, to double the value of a number in that system, we shift it one place to the left. From Result 1, the doubling of a signal is an increase of 6dB, therefore if we use a binary integer system to store a 16-bit sample, say, each binary digit represents a potential change (increase or decrease) of 6dB. A 16-bit storage unit can therefore store values in a (16 x 6) = 96dB range. This number is known as the dynamic range of the storage system, since whatever actual (sensible) values we use to record signal levels, they can be changed within a 96dB range. We could decide, for example, that when all the bits are set to 1 (as this is the only valid reference point), this represents a level of 0dB, the maximum value handled by the system (here we are talking about 0dBFS, where FS means 'full scale'). When all bits are set to 1 there is no more headroom to store any further increases. All other values are counted down from this level, and when the value of just 1 is present, this could represent −96dB, the minimum value handled by the system. As we shall see later, it is wise to leave some bits for headroom overflow, as well as some bits to accommodate fractional values of numbers.

It is clearly a relatively simple operation to change the value in a 16-bit storage system by 6dB by shifting the number one place to the left (+6dB), or to the right (-6dB).
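The '6dB per bit' rule of thumb can be checked directly, since each extra bit doubles the largest value that can be stored. The sketch below is illustrative only, and `dynamic_range_db` is a name made up for this example.

```python
import math

def dynamic_range_db(bits):
    """Ratio in dB between the largest and smallest non-zero steps of an integer store."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 24):
    print(bits, "bits:", round(dynamic_range_db(bits), 2), "dB")
```

The figures come out slightly above 96dB and 144dB because one doubling is 6.02dB rather than exactly 6dB.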

For example, let's consider a 16-bit sample containing the value 27.

0000000000011011 (27)

shifting one place left gives

0000000000110110 (54)

The gain in decibels is:

dB = 20 x log(54/27)
= 20 x log(2)
= 6

Note that if we attempt to reduce the value 27 by 6dB, by shifting one place to the right, the value becomes:

0000000000001101

This is 13 in decimal representation. The rightmost bit has been lost; we should have 1101.1, which is 13.5. Moreover, a reduction of 12dB (two shifts right) would have lost two bits, resulting in a value of 6 instead of 6.75. Further arithmetic on these numbers could introduce even more loss of accuracy. What can we do about it? And how can we make changes which are not multiples of 6dB, for example, 5 dB, or −4dB?
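The loss of low-order bits when shifting right is easy to demonstrate with Python's integer shift operators. This is a minimal sketch using the value 27 from the example; the helper `gain_db` is a name chosen here.

```python
import math

def gain_db(new, old):
    """Change in level, in dB, between two sample values."""
    return 20 * math.log10(new / old)

sample = 27
doubled = sample << 1     # 54: one place left, +6dB
halved = sample >> 1      # 13: the half (.1 in binary) is discarded
quartered = sample >> 2   # 6: the three quarters (.11 in binary) are discarded

print(doubled, halved, quartered)
print(round(gain_db(doubled, sample), 2))
print(round(gain_db(halved, sample), 2))   # not quite -6dB, because of the lost bit
```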

Table 1 shows that our ears can manage a dynamic range of about 140dB. In order to store values of this order, we need storage units of (140/6), ie. 24 bits. But both 16-bit and 24-bit integer storage can still only record changes in multiples of 6dB. Special-purpose hardware like digital mixing desks, and some applications development languages, allow integer arithmetic on 32-bit, fixed-point numbers, where the hardware determines the position of the binary point (if there is one), or permits the management of this to be left to the programmer.

Table 3 shows how 24 bits (three bytes) may be assigned to store the results of signal processing. Such an assignment would normally have to be managed by the applications programmer.

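Table 3:

| Headroom | Integer | Fraction |
|---|---|---|
| 4 bits | 16 bits | 4 bits |

The binary point lies between the 16-bit integer and the 4-bit fraction.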

The 4-bit headroom allows for 'overflow' calculations of up to 24dB. The 4-bit fraction permits calculations to be made to an accuracy of 1 decimal digit. The theoretical dynamic range, including the headroom is 20 x 6 = 120dB.

Let's assume that the above 24 bits contain the integer value 9, and that we want to increase its value by 4dB. The ratio for this is clearly less than the ratio 2 which gives a 6dB increase; it is 1.585 (more precisely, 1.58489). The required result is 9 x 1.58489 = 14.264. The integer value 14 can easily be accommodated in the allocated 16 bits. But what about the fractional part 0.264? When we convert this to a binary fraction, it has the following bit pattern (up to 24 bits):

.010000110110110010001011... and cannot convert exactly

If we store the most significant (leftmost) bits (0100) of the fraction result in the 4 bits allocated, the actual value stored is:

$(0 \times 2^{-1}) + (1 \times 2^{-2}) + (0 \times 2^{-3}) + (0 \times 2^{-4}) = 1/4 = 0.25$

The resulting fraction is accurate to one place of decimal, but it has already lost 0.014 from the actual result. Further arithmetic processing will cause more loss of accuracy in the results.
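The effect of keeping only four fractional bits can be imitated by scaling a value by 16, discarding anything after the point, and scaling back. The sketch below is one possible way to model the layout of Table 3, not a description of any particular mixing desk; `to_fixed`, `from_fixed` and the constants are names assumed for the example.

```python
FRACTION_BITS = 4
SCALE = 1 << FRACTION_BITS   # 16 steps per whole unit

def to_fixed(value):
    """Truncate a non-negative value to an integer count of 1/16ths."""
    return int(value * SCALE)

def from_fixed(raw):
    """Convert a count of 1/16ths back to an ordinary number."""
    return raw / SCALE

ratio = 10 ** (4 / 20)                 # voltage ratio for +4dB, about 1.585
exact = 9 * ratio                      # about 14.264
stored = from_fixed(to_fixed(exact))   # what the 4-bit fraction can hold

print(round(exact, 3), stored, round(exact - stored, 3))
```

Each further multiplication starts from the truncated value, which is how the errors accumulate.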

## Floating-point Systems

The basic problem in the fixed-point integer/fraction system for a computer implementation is how many binary digits to allocate to the integer and fractional parts respectively. An astronomer might want to record numbers as large as 1 billion. A physicist might want to use numbers as small as 1 millionth. Audio engineers use 16, 24, or 32-bit systems to store numbers in a relatively small range. To store numbers as large as $10^{35}$ would require some 128 bits just for the integer part of the number. Another 128 bits might be allocated to store the fractional part of the number. This is undesirable in several respects. The accuracy (discussed later) of 76 decimal digits afforded by the system is larger than most normal requirements; the system would use too many storage units (32 bytes); arithmetic operations on such units would be slow, and could still incur unacceptable rounding and truncation errors.

A practical solution to storing numbers which may vary over a large range of non-integer values is to store the number as two components: a fractional part, always justified in a particular manner, and an exponent which describes the justification of the fractional part (ie. the number of places it should be shifted to the left or right). The number of storage units allocated to the fractional part determines the accuracy of the system. The range of numbers that can be stored is determined by the number of bits allocated to the exponent, but since there is the requirement to store very small and very large numbers in the same system, the maximum value possible by the exponent is split into two, the upper half storing numbers greater than or equal to the smallest value of the normalised fraction (ie. exponents that shift the fraction to the left), and the lower half storing numbers less than the smallest value of the normalised fraction (ie. exponents that shift the fraction to the right).

Computer systems, especially the PC, have to be all things to all men, and implement their floating-point storage systems in different ways with regard to the assignment of storage units for the fraction and exponent. However, using 32 bits for the entire number is very common, and this can give a generally useful guarantee of range and accuracy. If you know what this is, you can decide whether its implementation is suitable for the numbers used within your work. Two such implementations are described. Note that outboard equipment often deals with numbers in a special range on which limited arithmetic operations will be performed, and can therefore have specialised storage systems to meet the particular processing requirement. In this case, scaled fixed-point systems may be preferable to floating-point systems, as the implementation of arithmetic calculations is far faster than it is in floating-point systems.

## What Do They Look Like?

In the decimal system, the number 326.42 can be represented as:

0.32642 x $10^3$ ($10^3$ = 1000)

In other words, it becomes a fraction, 0.32642, multiplied by an exponent base, 10, raised to an exponent, 3. This is the basic form of a floating-point number in a computer, except of course, the components of the number are stored as binary numbers. For numbers less than 1, the exponent would be negative, thus the number 0.00234 can be represented as 0.234 x $10^{-2}$ (as $10^{-2}$ = 1/100).

Two computer implementations are described. Implementation 1 is relatively straightforward, and facilitates an understanding of the theory and implementation of floating-point numbers. It was used by various third-generation computers in the '60s and '70s. Implementation 2 is defined by IEEE Floating Point Standard 754. This implementation is more complex. It has a 'hidden 1' bit to give the fraction more storage, and the exponent reserves particular (extreme) values to cope with overflow, underflow, true zero and infinity. Single-, double- and extended-precision forms are defined, the latter, perhaps, anticipating 64-bit hardware. Microsoft's Visual Basic, Steinberg's Cubase and Wavelab, Fortran and C++ use this implementation.

Implementation 1

A 32-bit system (four bytes) could have its bits assigned from left to right, as follows, to store a useful floating-point number:

• Bit 0 Sign bit (set to 1 if the number is negative)
• Bits 1 to 7 E, the exponent to base 16, in excess 64 format, of the number
• Bits 8 to 31 F, the fractional part of the number, always kept in the range $1/16 \le F < 1$

A number is stored in the form F multiplied by $16^E$. Excess 64 means that the stored value of the exponent E is always 64 larger than the actual value. This allows the storing of small numbers (see example 2 below). The fraction is normalised by hex (4-bit) shifts, so that prior to the start of, and after the completion of, arithmetic operations on the number, the value of the fraction will always lie in this range. To lie in this range, the leftmost bits will contain the most significant digits of the number, thus retaining in the rightmost bits as many as possible of the other significant digits arising from a calculation.

Implementation 2 (IEEE 754)

This implementation was specified so that processor manufacturers like Motorola and Intel could produce standardised architectures to process floating-point operations by hardware.

The single-precision system uses 32 bits as follows:

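• Bit 0 Sign bit (set to 1 if the number is negative)
• Bits 1 to 8 E, the exponent to base 2, in excess 127 format, of the number
• Bits 9 to 31 F, the fractional part of the number (preceded by a 'hidden 1' bit which is not stored)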

Double-precision and extended-precision systems are also defined. See Appendix A for a fuller description of this implementation.

The following examples show how various numbers would be stored under implementation 1. The | symbol is used to indicate the 4-bit hexadecimal groupings.

Example 1. How would +12.625 be stored? The fractional part 0.625 (5/8) converts to 0.101 as a binary fraction. The whole number as a fixed-point binary number is:

1100.101

ie. $1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 0 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}$

$= 8 + 4 + 0 + 0 + 1/2 + 0 + 1/8$

To put it into the range permitted by the fraction, one hexadecimal (4-bit) shift right is needed. For each of these shifts, 1 is added to the exponent excess.

1100.101 x $16^0$ ($16^0 = 1$)

(shift right 4 bits) = .1100101 x $16^1$ ($16^1 = 2^4$)

Thus the 32 bits will be assigned as follows:

• Bit 0 0 (+ve number)
• Bits 1 to 7 1000001 (this is 65, ie. excess + 1) representing $16^1$
• Bits 8 to 31 1100|1010|0000|0000|0000|0000|

To check the value stored, the exponent $16^1$, multiplied by the fraction 101/128 = 12.625. The original binary point now lies between (has floated to) bits 11 and 12. In later examples, it will 'float', and lie in different positions.

Example 2. How would a small number like 3/1024 be stored?

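$3/1024 = 0 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + 0 \times 2^{-4} + 0 \times 2^{-5} + 0 \times 2^{-6} + 0 \times 2^{-7} + 0 \times 2^{-8} + 1 \times 2^{-9} + 1 \times 2^{-10}$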

3/1024 = 0.|0000|0000|1100| (ie. 1/512 + 1/1024).

Note that in this case, to normalise the fraction, it is shifted to the left, and for each hex shift, 1 is subtracted from the exponent excess. It takes two hex shifts left to normalise the fraction in bits 8 to 31, so bits 1 to 7 will contain 64 - 2, ie. 62 (binary 0111110, which represents a value of $16^{-2}$). Thus the 32 bits will be assigned as follows:

• Bit 0 0 (+ve number)
• Bits 1 to 7 0111110 (this is 62, ie. excess - 2) representing $16^{-2}$ (ie. 1/256)
• Bits 8 to 31 1100|0000|0000|0000|0000|0000|

The normalised fraction in bits 8 to 31 now has the absolute value 1/2 + 1/4 = 3/4. The original binary point now lies in imaginary space, 8 bits to the left of bit 8, but no significant digits are involved there. To check the value stored, the exponent 1/256, multiplied by the fraction 3/4 = 3/1024.

The purpose of the 'excess 64' form of the exponent is now clearer. It splits the exponent into two parts. The upper part (ie. with leftmost bit set to 1) is used to store numbers ≥ 1/16, and the lower part is used to store numbers < 1/16.
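The normalisation steps of examples 1 and 2 can be sketched in Python as follows. This is a simplified model of implementation 1 as described in this guide (sign bit, 7-bit excess 64 exponent to base 16, 24-bit fraction that is truncated rather than rounded), not the code of any real machine; all function names are invented for the illustration.

```python
def encode_hex_float(value):
    """Return (sign, stored_exponent, fraction_bits) for implementation 1."""
    if value == 0:
        return 0, 0, 0
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    while magnitude >= 1:        # hex shift right, add 1 to the exponent
        magnitude /= 16
        exponent += 1
    while magnitude < 1 / 16:    # hex shift left, subtract 1 from the exponent
        magnitude *= 16
        exponent -= 1
    stored_exponent = exponent + 64          # excess 64 format
    if not 0 <= stored_exponent <= 127:
        raise OverflowError("exponent does not fit in 7 bits")
    fraction = int(magnitude * (1 << 24))    # keep the leftmost 24 bits
    return sign, stored_exponent, fraction

def decode_hex_float(sign, stored_exponent, fraction):
    """Value represented by the three stored fields."""
    value = fraction / (1 << 24) * 16.0 ** (stored_exponent - 64)
    return -value if sign else value

def show(value):
    """Bit pattern with the fraction in 4-bit hexadecimal groupings."""
    sign, stored_exponent, fraction = encode_hex_float(value)
    bits = f"{fraction:024b}"
    groups = "|".join(bits[i:i + 4] for i in range(0, 24, 4))
    return f"{sign} {stored_exponent:07b} {groups}"

print(show(12.625))     # 0 1000001 1100|1010|0000|0000|0000|0000
print(show(3 / 1024))   # 0 0111110 1100|0000|0000|0000|0000|0000
```

Decoding the fields again with `decode_hex_float` returns 12.625 and 3/1024 exactly, because both values fit within the 24-bit fraction.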

Range And Accuracy

The largest positive number that can be stored in implementation 1 is represented by the following 32-bit pattern:

• Bit 0 0 (+ve number)
• Bits 1 to 7 1111111 (representing $16^{63}$)
• Bits 8 to 31 111111111111111111111111 (almost 1)

This represents approximately 7.2 x $10^{75}$. (To derive this value, we solve for x the equation $16^{63} = 10^x$.)

The smallest positive number that can be stored is:

• Bit 0 0 (+ve number)
• Bits 1 to 7 0000000 (this is $16^{-64}$)
• Bits 8 to 31 000100000000000000000000 (1/16, or $16^{-1}$)

This represents $16^{-65}$, which is approximately 5.4 x $10^{-79}$. Note that the exponent values of implementation 1 provide, from the smallest value to the largest value, a range of approximately $10^{154}$.

The accuracy is determined by the number of bits used for the fraction. 3.32 bits are needed to store 1 decimal digit, therefore the accuracy in decimal digits is 24/3.32, that is 7. We obtain the value 3.32 by solving for x the equation $2^x - 1 = 10^1 - 1$, where each side of the equation represents the largest value that can be represented by x binary digits and one decimal digit respectively. Note, however, that 10 bits/3.32 gives 3.012, indicating that some bits are 'left over'; some 4-digit decimal numbers, from 1000 to 1023, can be represented in 10 bits, since $2^{10} - 1 = 1023$.

Negative numbers, depending upon the actual implementation, give the same accuracy and approximately the same negative range of values as positive numbers.
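The figures in this section can be recomputed with ordinary double-precision arithmetic, which comfortably covers the range of implementation 1. The variable names below are chosen for this sketch only.

```python
import math

bits_per_decimal_digit = math.log2(10)        # solves 2**x = 10
decimal_digits = 24 / bits_per_decimal_digit  # accuracy of a 24-bit fraction

largest = (1 - 2 ** -24) * 16.0 ** 63   # all fraction bits set, largest exponent
smallest = (1 / 16) * 16.0 ** -64       # smallest normalised fraction and exponent

print(round(bits_per_decimal_digit, 2))
print(int(decimal_digits))
print(f"{largest:.1e}")
print(f"{smallest:.1e}")
```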

## Processing Numbers In A Floating-point System

In the following example, implementation 1 is used, and the requirement is to increase the number 16,386.46875 (the fraction being 15/32, chosen for ease of representation) by 4dB. The ratio for 4dB is 1.585. The decimal product of 16,386.46875 x 1.585 is 25,972.55297, accurate to 10 decimal digits. The storage system below will now have a problem deriving this product accurately, as the fraction only has (24/3.32) = 7 decimal digits accuracy.

The binary value of 16,386.46875 is:

0100|0000|0000|0010.0111|1000

To store this in floating-point form, it can be seen that four hex shifts right are required to normalise the number. Leftmost zeros are inserted when necessary.

Thus the 32 bits will be assigned as follows:

• Bit 0 0 (+ve number)
• Bits 1 to 7 1000100 (this is 68, ie. excess + 4) representing $16^4$
• Bits 8 to 31 0100|0000|0000|0010|0111|1000|

Note that all of the 10 decimal digits of the number can be stored (fortuitously) in this system, because the fraction converts exactly using only 5 binary digits (bits 29 to 31 are not needed).

Compared with example 1, bit 28 now represents the value 1/32 in this fraction. The value of the exponent has determined this, and the binary point of the original number now lies between bits 23 and 24 compared with bits 11 and 12 in example 1.

The above result of 25,972.55297 as a fixed-point binary number is:

0110|0101|0111|0100.1000|1101|1000|1111

In this example, 16 bits are used for both the integer and fraction, but note that the fraction does not convert exactly. The hardware's floating-point working registers would need to operate in a more extended form, possibly up to 64 bits, in order to store intermediate results accurately.

So how do we store the result 25,972.55297 in this 32-bit floating-point system when it will not all fit because of the never-ending binary fraction? Do we lose some of the integer or the fraction? In the earlier introduction to floating-point numbers, the fraction was normalised to keep the most significant digits. The system 'looks after' the high-order digits, any loss of accuracy always occurring in the rightmost digits.

The actual calculations are done in floating-point format by specialised hardware, but a fixed-point representation of the result using 16 bits for each of the integer and the fraction is:

0110|0101|0111|0100.1000|1101|1000|1111

To put it into floating-point form will require four hex shifts right to normalise the fraction. Thus the 32 bits will be assigned as follows:

• Bit 0 0 (+ve number)
• Bits 1 to 7 1000100 (this is 68, ie. excess + 4) representing $16^4$
• Bits 8 to 31 0110|0101|0111|0100|1000|1101|

The 8 rightmost bits |1000|1111| are lost.

The binary point of the original number lies between bits 23 and 24, and bit 31 represents $2^{-8}$, with this value of the exponent. The integer part occupies bits 8 to 23, and there are now only 8 bits (24 to 31) used to store the fraction. If we convert this result back to decimal using the hex groupings we have:

$(6 \times 16^3) + (5 \times 16^2) + (7 \times 16^1) + (4 \times 16^0) + (8 \times 16^{-1}) + (13 \times 16^{-2})$

= 24,576 + 1,280 + 112 + 4 + 0.5 + 0.05078125

= 25,972.55078

Compared with the correct answer of 25,972.55297, we have an absolute error of 0.00219, but accuracy of 7 decimal digits has been achieved from left to right — the most significant digits have been saved.
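The truncation in this example can be reproduced with a small Python model of implementation 1. It is a sketch under the same assumptions as the text (positive values, hex normalisation, a 24-bit fraction whose surplus bits are simply discarded), and `truncate_to_hex_float` is a name made up for the purpose.

```python
def truncate_to_hex_float(value, fraction_bits=24):
    """Value actually held after hex normalisation and truncation (positive values only)."""
    fraction = value
    exponent = 0
    while fraction >= 1:         # hex shifts right
        fraction /= 16
        exponent += 1
    while fraction < 1 / 16:     # hex shifts left
        fraction *= 16
        exponent -= 1
    kept = int(fraction * (1 << fraction_bits))   # bits beyond the fraction are lost
    return kept / (1 << fraction_bits) * 16.0 ** exponent

original = 16386.46875
product = original * 1.585                  # the +4dB multiplication
stored = truncate_to_hex_float(product)

print(truncate_to_hex_float(original) == original)   # True: the input fits exactly
print(stored)                                        # 25972.55078125
print(round(product - stored, 5))                    # the absolute error
```

Python itself works in 64-bit double precision here, which plays the part of the extended working registers mentioned above.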

At last, we can now see the origin of the term 'floating' point. The original (now binary) point floats up and down the fraction for each calculation, finally coming to a position where the value of the exponent wants it. Thus the bits in the fraction are continually reassigned different values depending on the requirement for the fraction to stay normalised with respect to the exponent value. Of course to store a number of the order $10^8$ or more, given that only 7 decimal digits accuracy is provided, the binary point will always be somewhere to the right of bit 31, because 24 bits can only accommodate six hex shifts, ie. $16^6$ is less than $10^8$. Likewise, with a number of the order of $10^{-8}$, the binary point will be somewhere to the left of bit 8, but in both of these cases, the exponent value determines the location of the point exactly.

The floating-point system can clearly deal with very large increases in value, so that if a signal value is raised by 200dB, that value will be stored with the stated accuracy. There will be no overflow, unless trapped by the system, but there will be the problem of how to fix the number into the final fixed bit size of the external hardware device. Its real value in digital audio work is that it can maintain a high accuracy of calculation over a very wide range of numbers by continually reassigning its component bits to store the most significant parts of the processing.

## Appendix A

A Brief Description Of The IEEE 754 Standard For Floating-point Numbers

William Kahan was the primary architect behind the IEEE 754-1985 standard for floating-point computation, and is often referred to as 'The Father of Floating Point'.

Formats

Three of the formats defined by the standard are:

• Single precision 32 bits.
• Double precision 64 bits.
• Extended precision 80 bits.

To be compatible with the earlier described 32-bit system, the single-precision format will be used for the following examples.

The 32-bit allocation for a single-precision number is:

• Bit 0 Sign bit (set to 1 if the number is negative).
• Bits 1 to 8 E, the exponent to base 2, in excess 127 format, of the number.
• Hidden 1 bit (Not stored, but has an implied value of 1.)
• Bits 9 to 31 F, the fractional part of the number.

The term significand is used to avoid confusion with the earlier use of the term binary fraction, which begins with a binary point followed by a 1 bit when normalised in the range 1/2 ≤ binary fraction < 1. This means that in a normalised binary fraction, the bit to the right of the binary point must always be 1. (For example, 1/2 as a binary number is 0.1, and any number greater than this, but less than 1, will always have this bit set.) As it must always be 1, the standard does not store it (wasted space!) but assumes it to be present with an implied value of 1; it is often called the 'hidden 1' bit. The significand therefore comprises:

• the hidden 1 bit, with an implied value of 1
• the implied binary point
• the remaining 23 bits of the binary fraction

This arrangement allows 1 + 23 = 24 bits for the full significand. Its value is therefore 1 plus the value of the binary fraction in the remaining 23 bits.

To be normalised, the significand lies in the range 1 ≤ significand < 2, which means that its value must be greater than or equal to 1, and less than 2. In the following examples, the hidden 1 bit is to the left of the binary point.

With all the F bits set to 1: 1.11111111111111111111111 is just less than 2.

With the F rightmost bit set to 1: 1.00000000000000000000001 is just greater than 1.

With all the F bits set to 0: 1.00000000000000000000000 is equal to 1.

In the basic notation, numbers are stored in the form 1.F x $2^E$. For example, the decimal number 24, which as a binary number is 11000, could be represented as 1.1000 x $2^4$. In the IEEE standard, the exponent E is stored in excess 127 format, therefore the stored value of the exponent would be 127 + 4 = 131.

To normalise a number in the IEEE standard, we shift it left or right by 1 bit until it is a significand in the range 1 ≤ significand < 2. It is shifted by 1 bit because the base of the exponent is 2, not 16 as in implementation 1 where each shift was 4 bits. For each of these 1-bit shifts, we add or subtract 1 to or from the exponent excess of 127.

Using the earlier example of +12.625, how would this be stored in the IEEE standard?

+12.625 = 1100.101 x $2^0$

= 1.100101 x $2^3$

Stored exponent = 3 + 127 = 130 (when the exponent excess of 127 is added)

To derive the normalised significand, we shift the binary number three places right, and for each of these shifts we add 1 to the exponent excess, obtaining the value 130.

Thus the 32 bits will be assigned as follows:

• Bit 0 0 (+ve)
• Bits 1 to 8 10000010 (130)
• Hidden 1 bit (Not stored, but with the implied value of 1)
• Bits 9 to 31 10010100000000000000000

To check the value stored, we have a significand of (1 + {1/2 + 0/4 + 0/8 + 1/16 + 0/32 + 1/64}) and an exponent of $2^{130-127}$.

= (1 + 37/64) x $2^3$

= 101/64 x 8

= 12.625

Using the earlier example of a small number, how would 3/1024 be stored?

3/1024 = 0.0000000011, as a binary fraction (ie. 1/512 + 1/1024)

To derive the normalised significand, we shift the fraction 9 places left to obtain 1.1, and for each of these shifts, we subtract 1 from the exponent excess. Thus the 32 bits will be assigned as follows:

• Bit 0 0 (+ve)
• Bits 1 to 8 01110110 (127 − 9 = 118, representing $2^{-9}$)
• Hidden 1 bit (Not stored, but with the implied value of 1)
• Bits 9 to 31 10000000000000000000000

To check the value of the stored number, we have a significand of binary 1.1 (ie. 3/2), and an exponent of $2^{-9}$.

= 3/2 x 1/512

= 3/1024
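On a machine that follows the IEEE standard, the two worked examples can be checked with Python's standard `struct` module, which packs a number into the four bytes of a single-precision value. The helper name `float32_fields` is an assumption of this sketch; the bit numbering follows the text, with bit 0 as the leftmost (sign) bit.

```python
import struct

def float32_fields(value):
    """Split a number, stored as IEEE 754 single precision, into its three fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))  # the 32 bits as an integer
    sign = raw >> 31                  # bit 0 in the text's numbering
    exponent = (raw >> 23) & 0xFF     # bits 1 to 8, excess 127
    fraction = raw & 0x7FFFFF         # bits 9 to 31, hidden 1 not included
    return sign, f"{exponent:08b}", f"{fraction:023b}"

print(float32_fields(12.625))     # (0, '10000010', '10010100000000000000000')
print(float32_fields(3 / 1024))   # (0, '01110110', '10000000000000000000000')
```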

Reserved Values Of The Exponent

The exponent absolute binary values 0 and 255 are not permitted for normalised numbers, and are reserved for special occurrences thus:

| Case | Sign | Exponent | Fraction |
|---|---|---|---|
| 1. True zero | + or - | 0 | 0 |
| 2. Denormalised number | + or - | 0 | Non zero bit pattern |
| 3. Infinity | + or - | 11111111 | 0 |
| 4. NAN (Not A Number) | + or - | 11111111 | Non zero bit pattern |

Denormalised numbers: In computation, problems arise when the result of a calculation is less than the smallest normalised number that can be represented in the system, but is far enough from zero to be useful in further calculations. Essentially, denormalised numbers expand the range, and give a gradual underflow in the range from the smallest normalised value down to zero. A denormalised number has an exponent of zero, but the fraction is non-zero, and there is no 'hidden 1' to the left of the binary point. The smallest value that can be stored is represented by an exponent of value $2^{-126}$, and a 1 in the rightmost bit of the fraction having the value $2^{-23}$. The smallest positive number that can be represented is therefore $2^{-126} \times 2^{-23} = 2^{-149}$, which is approximately 1.4 x $10^{-45}$.

NAN (Not a number): These are values that do not represent a real number, for example, when an operation is not defined, as in the case of the square root of a negative number.

Range And Accuracy

The decimal digit accuracy of the single-precision system is the number of bits in the significand divided by 3.32, ie. 24/3.32 = 7 decimal digits.

The largest positive number that can be stored is represented by the following bit assignment:

• Bit 0 0 (+ve)
• Bits 1 to 8 11111110 (254, representing $2^{127}$)
• Hidden 1 bit (Not stored, but being an implied 1)
• Bits 9 to 31 11111111111111111111111 (representing almost 2 with the hidden 1 bit)

Thus the largest value is approximately $2^{127} \times 2^1 = 2^{128}$, which is approximately 3.4 x $10^{38}$.

The smallest positive number has already been identified as a denormalised number whose value is $2^{-149}$, approximately 1.4 x $10^{-45}$.
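The reserved exponent values and the limits of the single-precision range can be explored by assembling the 32 bits by hand and asking Python's `struct` module to interpret them. This is an illustrative sketch that assumes the usual IEEE 754 behaviour of the host machine; `float32_from_bits` is a hypothetical helper written for this example.

```python
import math
import struct

def float32_from_bits(sign, exponent, fraction):
    """Interpret sign (1 bit), exponent (8 bits) and fraction (23 bits) as a single-precision value."""
    raw = (sign << 31) | (exponent << 23) | fraction
    (value,) = struct.unpack(">f", struct.pack(">I", raw))
    return value

zero = float32_from_bits(0, 0, 0)                # exponent 0, fraction 0
smallest_denormal = float32_from_bits(0, 0, 1)   # exponent 0, rightmost fraction bit set
infinity = float32_from_bits(0, 255, 0)          # exponent 11111111, fraction 0
nan = float32_from_bits(0, 255, 1)               # exponent 11111111, non-zero fraction
largest = float32_from_bits(0, 254, 0x7FFFFF)    # largest exponent, all fraction bits set

print(zero, smallest_denormal, infinity, math.isnan(nan))
print(f"{largest:.1e}")
```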

Double-precision Format

Operations in double-precision format follow the same principles as those for single-precision format. In this case, the number of bits used to store the number is 64, and both the exponent and significand are extended using the following bit assignments:

• Bit 0 Sign bit (set to 1 if the number is negative)
• Bits 1 to 11 Exponent in excess 1023 format
• Hidden 1 bit (Not stored, but being an implied 1)
• Bits 12 to 63 Fraction, normalised as before

The accuracy in decimal digits of the system is the 53 bits of the significand (52 stored bits plus the hidden 1 bit) divided by 3.32, ie. just under 16.

The largest positive number that can be stored is represented by the following bit assignment:

• Bit 0 0 (+ve)
• Bits 1 to 11 11111111110 (2046, representing $2^{1023}$)
• Hidden 1 bit (Not stored, but being an implied 1)
• Bits 12 to 63 [all 52 bits set to 1] (representing almost 2 with the hidden 1 bit)

Thus the largest value is approximately $2^{1023} \times 2^1 = 2^{1024}$, which is approximately 1.8 x $10^{308}$.

The smallest positive value is the denormalised number $2^{-1074}$, which is approximately 4.9 x $10^{-324}$.
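Python's own `float` type is normally an IEEE 754 double-precision number, so the limits quoted above can be inspected through the standard `sys.float_info` structure. The sketch assumes a typical CPython build on IEEE hardware; on an unusual platform the figures could differ.

```python
import sys

significand_bits = sys.float_info.mant_dig   # includes the hidden 1 bit
largest = sys.float_info.max                 # all fraction bits set, exponent 2**1023
smallest_normal = sys.float_info.min         # smallest normalised value, 2**-1022
smallest_denormal = 2.0 ** -1074             # only the rightmost fraction bit set

print(significand_bits)
print(largest)
print(smallest_normal)
print(smallest_denormal)
print(smallest_denormal / 2)                 # underflows to zero
```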

Other Formats

The standard also includes Extended and Quadruple formats. The interested reader can find further information about these and the IEEE Standard in general by placing the text, IEEE754, in an Internet search engine.

Published June 2004