Connexions

Fixed-Point Number Representation

Module by: Douglas L. Jones

Summary: Specialized DSP hardware typically uses fixed-point number representations for lower cost and complexity and greater speed. Interpretation of two's-complement binary numbers as signed fractions between -1 and 1 allows integer arithmetic to be used for DSP computations, but introduces quantization and overflow errors.

Fixed-point arithmetic is generally used when hardware cost, speed, or complexity is important. Finite-precision quantization issues usually arise in fixed-point systems, so we concentrate on fixed-point quantization and error analysis in the remainder of this course. For basic signal processing computations such as digital filters and FFTs, the magnitude of the data, the internal states, and the output can usually be scaled to obtain good performance with a fixed-point implementation.

Two's-Complement Integer Representation

As far as the hardware is concerned, fixed-point number systems represent data as $B$-bit integers. The two's-complement number system is usually used:

$$k = \begin{cases} \text{binary integer representation} & \text{if } 0 \le k \le 2^{B-1}-1 \\ \text{bit-by-bit inverse}(-k) + 1 & \text{if } -2^{B-1} \le k \le 0 \end{cases}$$
fig1FixedPoint.png
Figure 1
The most significant bit is known as the sign bit; it is 0 when the number is non-negative and 1 when the number is negative.
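The encoding rule above can be sketched in a few lines of Python. The function name `to_twos_complement` and the choice of a bit string as the return type are illustrative assumptions, not part of the original module.

```python
def to_twos_complement(k: int, B: int) -> str:
    """Return the B-bit two's-complement bit string for the integer k.

    Hypothetical helper for illustration only.
    """
    lo, hi = -(1 << (B - 1)), (1 << (B - 1)) - 1
    if not lo <= k <= hi:
        raise ValueError(f"{k} is not representable in {B} bits")
    mask = (1 << B) - 1
    if k >= 0:
        # Non-negative values use the plain binary integer representation.
        pattern = k
    else:
        # Negative values: bit-by-bit inverse of -k, plus 1, kept to B bits.
        pattern = ((~(-k) & mask) + 1) & mask
    return format(pattern, f"0{B}b")


print(to_twos_complement(3, 4))   # 0011
print(to_twos_complement(-3, 4))  # 1101
print(to_twos_complement(-8, 4))  # 1000
```

In each result the leading character is the sign bit: 0 for the non-negative input and 1 for the negative ones.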

Fractional Fixed-Point Number Representation

For the purposes of signal processing, we often regard the fixed-point numbers as binary fractions between $-1$ and $1$, by implicitly placing a binary point after the sign bit.
fig2FixedPoint.png
Figure 2
or $x = -b_0 + \sum_{i=1}^{B-1} b_i 2^{-i}$. This interpretation makes it clearer how to implement digital filters in fixed-point, at least when the coefficients have a magnitude less than 1.
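As a minimal sketch of the fractional interpretation, the following Python function evaluates $x = -b_0 + \sum_{i=1}^{B-1} b_i 2^{-i}$ directly from a bit string. The name `fraction_value` and the string input format are assumptions made for this example.

```python
def fraction_value(bits: str) -> float:
    """Interpret a two's-complement bit string as a fraction in [-1, 1).

    bits[0] is the sign bit b_0; bits[i] is b_i with weight 2**-i.
    Hypothetical helper for illustration only.
    """
    b = [int(c) for c in bits]
    return -b[0] + sum(b[i] * 2.0 ** -i for i in range(1, len(b)))


print(fraction_value("0100"))  # 0.5
print(fraction_value("1100"))  # -0.5
print(fraction_value("1000"))  # -1.0
print(fraction_value("0111"))  # 0.875
```

The same value results from reading the bits as a two's-complement integer and dividing by $2^{B-1}$, which is why ordinary integer hardware can carry out the arithmetic.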

Truncation Error

Consider the multiplication of two binary fractions:
fig3FixedPoint.png
Figure 3
Note that full-precision multiplication almost doubles the number of bits; if we wish to return the product to a $B$-bit representation, we must truncate the $B-1$ least significant bits. However, this introduces truncation error (also known as quantization error, or roundoff error if the number is rounded to the nearest $B$-bit fractional value rather than truncated). Note that this occurs after multiplication.
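The effect can be illustrated with a small Python sketch. It assumes each fraction is stored as a $B$-bit two's-complement integer equal to the fraction times $2^{B-1}$; the function names `multiply_truncate` and `multiply_round` are hypothetical, and the single corner case $(-1) \times (-1)$ is not handled.

```python
def multiply_truncate(a: int, b: int, B: int) -> int:
    """Multiply two B-bit fractions (stored as integers scaled by 2**(B-1))
    and truncate the product back to B bits.

    Hypothetical helper; the corner case (-1) * (-1) is not handled.
    """
    full = a * b            # full-precision product, 2*(B-1) fractional bits
    return full >> (B - 1)  # discard the B-1 least significant bits


def multiply_round(a: int, b: int, B: int) -> int:
    """Same product, rounded to the nearest B-bit value (assumes B >= 2)."""
    full = a * b
    return (full + (1 << (B - 2))) >> (B - 1)  # add half an LSB, then shift


B = 4
a, b = 5, 3                         # 5/8 = 0.625 and 3/8 = 0.375
exact = (a / 8) * (b / 8)           # 0.234375
print(multiply_truncate(a, b, B))   # 1, i.e. 1/8 = 0.125
print(multiply_round(a, b, B))      # 2, i.e. 2/8 = 0.25
print(exact)                        # 0.234375
```

With only four bits the truncated result (0.125) and the rounded result (0.25) both differ from the exact product (0.234375), and rounding lands closer in this case.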

Overflow Error

Consider the addition of two binary fractions:
fig4FixedPoint.png
Figure 4
Note the occurrence of wraparound overflow; this happens only with addition. The resulting error can be severe, because a sum slightly too large to represent wraps around to a large value of the opposite sign.
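A minimal Python sketch of wraparound addition follows, again assuming fractions are stored as $B$-bit two's-complement integers scaled by $2^{B-1}$. The function name `add_wraparound` is an assumption for this example.

```python
def add_wraparound(a: int, b: int, B: int) -> int:
    """Add two B-bit two's-complement integers, keeping only B bits.

    Hypothetical helper that mimics what a B-bit adder does on overflow.
    """
    mask = (1 << B) - 1
    s = (a + b) & mask          # keep the low B bits of the sum
    if s >= 1 << (B - 1):       # sign bit set: reinterpret as negative
        s -= 1 << B
    return s


B = 4
print(add_wraparound(2, 3, B))  # 5: 0.25 + 0.375 = 0.625, no overflow
print(add_wraparound(5, 6, B))  # -5: 0.625 + 0.75 wraps around to -0.625
```

The second sum should be 1.375, which is outside the representable range, so the stored result is $-0.625$.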
There are thus two types of fixed-point error: roundoff error, associated with data quantization and multiplication, and overflow error, associated with data quantization and additions. In fixed-point systems, one must strike a balance between these two error sources; by scaling down the data, the occurrence of overflow errors is reduced, but the relative size of the roundoff error is increased.
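The trade-off can be seen in a small Python sketch. It assumes 4-bit fractions stored as integers scaled by $2^{B-1}$, and it scales by a right shift that truncates; the helper names `wrap` and `scaled_add` are hypothetical.

```python
def wrap(s: int, B: int) -> int:
    """Reduce an integer to a B-bit two's-complement value (wraparound)."""
    s &= (1 << B) - 1
    return s - (1 << B) if s >= 1 << (B - 1) else s


def scaled_add(a: int, b: int, B: int, shift: int) -> int:
    """Scale both operands down by 2**shift (truncating), then add.

    Hypothetical helper for illustration only.
    """
    return wrap((a >> shift) + (b >> shift), B)


B = 4
a, b = 5, 6                          # 0.625 and 0.75
unscaled = scaled_add(a, b, B, 0)    # overflow: wraps around
scaled = scaled_add(a, b, B, 1)      # no overflow, but low bits were lost
print(unscaled / 8)                  # -0.625 (the true sum is 1.375)
print(scaled / 8)                    # 0.625 (the true half-sum is 0.6875)
```

Scaling by one half removes the overflow, but the bits discarded by the shift leave an error of 0.0625 in the scaled sum, which is one half of the smallest representable step of 0.125.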
Note: Since multiplies require a number of additions, they are especially expensive in terms of hardware (with a complexity proportional to $B_x B_h$, where $B_x$ is the number of bits in the data, and $B_h$ is the number of bits in the filter coefficients). Designers try to minimize both $B_x$ and $B_h$, and often choose $B_x \neq B_h$!
