Wednesday, December 13, 2006

IEEE 754 - Floating Point number representation at a Glance

IEEE 754 at a Glance
A floating-point number representation on a computer uses something similar to scientific notation, with a base and an exponent. A scientific representation of 30,064,771 is 3.0064771 x 10^7, whereas 1.001 can be written as 1.001 x 10^0.

In the first example, 3.0064771 is called the mantissa, 10 the exponent base, and 7 the exponent.
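
As a small illustration of the decimal case, the sketch below splits a number into its mantissa and base-10 exponent using Python's string formatting. The helper name `scientific_parts` is made up for this example and is not part of any library.

```python
def scientific_parts(x, digits):
    """Return (mantissa, exponent) of x in base-10 scientific notation.

    `digits` is the number of digits kept after the decimal point.
    """
    mantissa, exponent = f"{x:.{digits}e}".split("e")
    return float(mantissa), int(exponent)

print(scientific_parts(30_064_771, 7))  # (3.0064771, 7)
print(scientific_parts(1.001, 3))       # (1.001, 0)
```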

IEEE standard 754 specifies a common format for representing floating-point numbers in a computer. Two grades of precision are defined: single precision and double precision. The representations use 32 and 64 bits, respectively. This is shown in Figure 2.



Figure 2: IEEE floating-point formats

In IEEE 754 floating-point representation, each number comprises three basic components: the sign, the exponent, and the mantissa. To maximize the range of possible numbers, the mantissa is divided into a fraction and leading digit. As I'll explain, the latter is implicit and left out of the representation.
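
To make the three components concrete, here is a minimal sketch that pulls them out of a single-precision bit pattern, assuming the standard 32-bit layout of 1 sign bit, 8 exponent bits, and 23 fraction bits. The function names are illustrative only; Python's `struct` module does the conversion to 32 bits.

```python
import struct

def float_to_bits(x):
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def split_fields(bits):
    """Split a 32-bit pattern into (sign, stored exponent, fraction)."""
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, still including the bias
    fraction = bits & 0x7FFFFF        # 23 bits, implicit leading digit not stored
    return sign, exponent, fraction

print(split_fields(float_to_bits(1.0)))   # (0, 127, 0)
print(split_fields(float_to_bits(-2.5)))  # (1, 128, 2097152)
```

The second result reads as sign 1 (negative), stored exponent 128, and a fraction with only bit 21 set, which is binary 0.01, or 0.25.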

The sign bit simply defines the polarity of the number. A value of zero means that the number is positive, whereas a 1 denotes a negative number.

The exponent represents a range of numbers, positive and negative; thus a bias value must be subtracted from the stored exponent to yield the actual exponent. The single precision bias is 127, and the double precision bias is 1,023. This means that a stored value of 100 indicates a single-precision exponent of -27. The exponent base is always 2, and this implicit value is not stored.
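
The bias arithmetic is simple enough to sketch in a few lines. The `BIAS` table and the two helper names below are made up for this illustration.

```python
# Bias values quoted in the text for the two IEEE 754 formats.
BIAS = {"single": 127, "double": 1023}

def actual_exponent(stored, precision="single"):
    """Subtract the bias from a stored exponent to get the real power of 2."""
    return stored - BIAS[precision]

def stored_exponent(actual, precision="single"):
    """Add the bias to a real exponent to get the value that is stored."""
    return actual + BIAS[precision]

print(actual_exponent(100))             # -27
print(actual_exponent(1023, "double"))  # 0
print(stored_exponent(-27))             # 100
```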

For both representations, exponent representations of all 0s and all 1s are reserved and indicate special numbers:

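  • Zero: exponent and fraction all 0s; the sign bit can be either 0 or 1
  • Denormalized (subnormal) numbers: exponent all 0s, non-zero fraction. Here the implicit leading digit is 0 rather than 1, which extends the range toward zero
  • ±∞: exponent all 1s, fraction all 0s
  • Not a Number (NaN): exponent all 1s, non-zero fraction. Two versions of NaN exist: quiet NaNs, which result from invalid operations such as 0/0 or ∞ - ∞ and propagate silently through later arithmetic, and signaling NaNs, which raise an exception when used and can flag problems such as non-initialized operand(s). Note that dividing a non-zero number by zero yields ±∞ rather than NaN.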
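
The reserved patterns can be recognized directly from the exponent and fraction fields. The following is a rough sketch for single precision; `classify` is a hypothetical helper, not a library function. It also reports the subnormal case (all-0 exponent with a non-zero fraction), which IEEE 754 defines for values too small to be normalized.

```python
import struct

def classify(x):
    """Name the IEEE 754 single-precision category that x falls into."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                      # exponent all 1s
        return "infinity" if fraction == 0 else "NaN"
    if exponent == 0:                         # exponent all 0s
        return "zero" if fraction == 0 else "subnormal"
    return "normal"

for value in (0.0, -0.0, float("inf"), float("nan"), 1e-40, 1.5):
    print(value, "->", classify(value))
```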

The mantissa represents the number to be multiplied by 2 raised to the power of the exponent. Apart from the special values, numbers are always normalized; that is, represented with one non-zero leading digit in front of the radix point. In binary math, there is only one non-zero digit, 1. Thus the leading digit is always 1, allowing us to leave it out and use all the mantissa bits to represent the fraction (the digits after the radix point).
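
Going the other way, a normalized value can be rebuilt from its fields by putting the implicit leading 1 back. This is a sketch for single precision only, and `decode_normal` is an illustrative name. The last line cross-checks the hand decoding against Python's `struct` module.

```python
import struct

def decode_normal(bits):
    """Rebuild the value of a normalized single-precision bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent in (0, 0xFF):
        raise ValueError("reserved exponent: not a normalized number")
    mantissa = 1 + fraction / 2**23            # restore the implicit leading 1
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)

print(decode_normal(0x3F800000))  # 1.0
print(decode_normal(0xC0200000))  # -2.5

pattern = 0x40490FDB              # single-precision approximation of pi
print(decode_normal(pattern) == struct.unpack(">f", struct.pack(">I", pattern))[0])  # True
```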

Following the previous number examples, here is what the single precision representation of the decimal value 30,064,771 will look like:

The binary integer representation of 30,064,771 is 1 1100 1010 1100 0000 1000 0011. This can be written as 1.110010101100000010000011 x 2^24. The leading digit is omitted, and the string of digits following the radix point is 1100 1010 1100 0000 1000 0011. That string is 24 bits long, but the single-precision fraction holds only 23 bits, so only the first 23, 1100 1010 1100 0000 1000 001, are stored. The sign is positive and the exponent is 24 decimal. Adding the bias of 127 gives 151, and converting to binary yields an IEEE 754 exponent of 1001 0111.
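
The steps above can be reproduced in code. The sketch below encodes a positive integer by hand and simply drops any bits that do not fit, which is what the worked example does. `encode_truncating` is a hypothetical helper written for this illustration. For comparison, the last lines show the pattern Python's `struct` module produces, which on typical platforms uses round-to-nearest-even and therefore differs in the last bit.

```python
import struct

def encode_truncating(n):
    """Encode a positive integer as single precision, dropping excess bits."""
    exponent = n.bit_length() - 1        # position of the leading 1
    fraction = n - (1 << exponent)       # strip the implicit leading 1
    if exponent > 23:
        fraction >>= exponent - 23       # keep only the top 23 fraction bits
    else:
        fraction <<= 23 - exponent       # pad with trailing zeros
    return ((exponent + 127) << 23) | fraction

bits = encode_truncating(30_064_771)
print(f"{bits:032b}")  # 01001011111001010110000001000001
print(hex(bits))       # 0x4be56041

rounded = struct.unpack(">I", struct.pack(">f", 30_064_771.0))[0]
print(hex(rounded))    # 0x4be56042
```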

Putting all of the pieces together, the single-precision representation for 30,064,771 is shown in Figure 3.


Figure 3: 30,064,771 represented in IEEE 754 single-precision format

Gain Some, Lose Some
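Notice that you lose the least significant bit (LSB) of value 1 from the 32-bit integer representation. This is because the 23-bit fraction, together with the implicit leading digit, provides only 24 bits of precision, one fewer than this 25-bit integer needs. The example simply truncates the extra bit, which leaves 30,064,770; under the default IEEE 754 round-to-nearest-even mode, the value would instead be rounded up to 30,064,772.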
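
A quick way to see the lost precision is to push a few neighboring integers through a 32-bit float and read them back. This sketch relies on Python's `struct` module, which on typical platforms converts with round-to-nearest-even rather than truncation. That is an assumption about the environment, and `to_single` is an illustrative helper.

```python
import struct

def to_single(x):
    """Round x to single precision and return the result as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

stored = {n: int(to_single(float(n))) for n in range(30_064_770, 30_064_775)}
for n, value in stored.items():
    print(n, "->", value)
# 30064770 -> 30064770
# 30064771 -> 30064772
# 30064772 -> 30064772
# 30064773 -> 30064772
# 30064774 -> 30064774
```

At this magnitude only every second integer has its own bit pattern; the odd values are absorbed by a neighbor.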

The range of numbers that can be represented with single precision IEEE 754 representation is ±(2 - 2^-23) x 2^127, or approximately ±10^38.53. This range is astronomical compared to the maximum range of 32-bit integer numbers, which by comparison is limited to around ±2.15 x 10^9. Also, whereas the integer representation cannot represent values between 0 and 1, single-precision floating-point can represent values down to ±2^-149, or approximately ±10^-44.85. And we are still using only 32 bits, so this has to be a much more convenient way to represent numbers, right?
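
The limits quoted above can be checked numerically. The sketch below computes them in Python (whose floats are double precision, so these values are held exactly) and compares them with the bit patterns for the largest finite and smallest positive single-precision values. The variable names are chosen for this example.

```python
import math
import struct

largest = (2 - 2**-23) * 2.0**127   # largest finite single-precision value
smallest = 2.0**-149                # smallest positive single-precision value
int32_max = 2**31 - 1               # largest 32-bit signed integer

print(f"{largest:.4e}", round(math.log10(largest), 2))    # 3.4028e+38 38.53
print(f"{smallest:.4e}", round(math.log10(smallest), 2))  # 1.4013e-45 -44.85
print(f"{int32_max:.3e}")                                 # 2.147e+09

# Cross-check against the raw bit patterns.
print(struct.unpack(">f", bytes.fromhex("7f7fffff"))[0] == largest)   # True
print(struct.unpack(">f", bytes.fromhex("00000001"))[0] == smallest)  # True
```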

The answer depends on the requirements.

  • Yes, because in our example of multiplying 30,064,771 by 1.001, we can simply multiply the two numbers and the result will be extremely accurate.
  • No, because as in the preceding example the number 30,064,771 is not represented with full precision. With the truncation used in that example, 30,064,771 and 30,064,770 are represented by the exact same 32-bit bit pattern (with the default round-to-nearest-even mode, 30,064,771 shares its pattern with 30,064,772 instead), meaning that a software algorithm will treat the numbers as identical. Worse yet, adjacent single-precision values at this magnitude are 2 apart, so incrementing by 1 soon stops changing the stored value at all, even if you repeat it a billion times. By using 64 bits and representing the numbers in double precision format, that particular example could be made to work, but even double-precision representation will face the same limitations once the numbers get big, or small, enough.
  • No, because the ALUs (arithmetic logic units) of most embedded processor cores only support integer operations, which leaves floating-point operations to be emulated in software. This severely affects processor performance. A 32-bit CPU can add two 32-bit integers with one machine code instruction; however, a library routine including bit manipulations and multiple arithmetic operations is needed to add two IEEE single-precision floating-point values. With multiplication and division, the performance gap just increases; thus for many applications, software floating-point emulation is not practical.
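
The stalled-increment problem from the second point can be simulated by rounding back to single precision after every addition. This is only a sketch: it assumes round-to-nearest-even, as used by Python's `struct` module on typical platforms, and the helper names are made up for the example.

```python
import struct

def to_single(x):
    """Round x to single precision and return the result as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def add_one_repeatedly(start, times):
    """Add 1.0 `times` times, storing the running total in single precision."""
    total = to_single(float(start))
    for _ in range(times):
        total = to_single(total + 1.0)   # every result is rounded back to 32 bits
    return total

print(int(add_one_repeatedly(30_064_772, 1_000)))  # 30064772
print(30_064_772 + 1_000)                          # 30065772 with integer arithmetic
```

A thousand increments are enough to show the effect; running the loop a billion times would give the same result, only more slowly.
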
Keywords: IEEE 754, Floating point representation

1 comment:

kdkariya said...

source: http://www.dspdesignline.com/howto/showArticle.jhtml;jsessionid=BGKFV5MRKBIVEQSNDLRCKHSCJUNN2JVN?articleId=196603215&pgno=2