Skip to content Skip to navigation

Connexions

You are here: Home » Content » Binary Codes: Huffman Codes for Source Coding

Navigation

Lenses

What is a lens?

Definition of a lens

Lenses

A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual Connexions member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

This content is ...

Affiliated with (What does "Affiliated with" mean?)

This content is either by members of the organizations listed or about topics related to the organizations listed. Click each link to see a list of all content affiliated with the organization.
  • Featured Content display tagshide tags

    This module is included inLens: Connexions Featured Content
    By: ConnexionsAs a part of collection:"A First Course in Electrical and Computer Engineering"

    Comments:

    "A First Course in Electrical and Computer Engineering provides readers with a comprehensive, introductory look at the world of electrical engineering. It was originally written by Louis Scharf […]"

    Click the "Featured Content" link to see all content affiliated with them.

    Click the tag icon tag icon to display tags associated with this content.

Recently Viewed

This feature requires Javascript to be enabled.

Tags

(What is a tag?)

These tags come from the endorsement, affiliation, and other lenses that include this content.

Binary Codes: Huffman Codes for Source Coding

Module by: Louis Scharf. E-mail the author

User rating (How does the rating system work?)
Ratings

Ratings allow you to judge the quality of modules. If other users have ranked the module then its average rating is displayed below. Ratings are calculated on a scale from one star (Poor) to five stars (Excellent).

How to rate a module

Hover over the star that corresponds to the rating you wish to assign. Click on the star to add your rating. Your rating should be based on the quality of the content. You must have an account and be logged in to rate content.

:
(0 ratings)

Note: Your browser may not currently support MathML. See our browser support page for additional details. You can always view the correct math in the PDF version.

Note:

This module is part of the collection, A First Course in Electrical and Computer Engineering. The LaTeX source files for this collection were created using an optical character recognition technology, and because of this process there may be more errors than usual. Please contact us if you discover any errors.

In 1838, Samuel Morse was struggling with the problem of designing an efficient code for transmitting information over telegraph lines. He reasoned that an efficient code would use short code words for common letters and long code words for uncommon letters. (Can you see the profit motive at work?) In order to turn this reasoned principle into workable practice, Morse rummaged around in the composition trays for typeface in a printshop. He discovered that typesetters use many more e'se's than s's. He then formed a table that showed the relative frequency with which each letter was used. His ingenious, variable-length Morse code assigned short codes to likely letters (like “dot” for e ' e ' ) and long codes to unlikely letters (like “dash dash dot dot” for z ' z ' ). We now know that Morse came within about 15% of the theoretical minimum for the average code word length for English language text.

A Variation on Morse's Experiment. In order to set the stage for our study of efficient source codes, let's run a variation on Morse's experiment to see if we can independently arrive at a way of designing codes. Instead of giving ourselves a composition tray, let's start with a communication source that generates five symbols or letters S0,S1,S2,S3,S4S0,S1,S2,S3,S4. We run the source for 100 transmissions and observe the following numbers of transmissions for each symbol:

50 S 0 ' s 20 S 1 ' s 20 S 2 ' s 5 S 3 ' s 5 S 4 ' s . 50 S 0 ' s 20 S 1 ' s 20 S 2 ' s 5 S 3 ' s 5 S 4 ' s . (1)

We will assume that these “source statistics” are typical, meaning that 1000 transmissions would yield 500 S0'sS0's and so on.

The most primitive binary code we could build for our source would use three bits for each symbol:

S 0 000 S 1 001 S 2 010 S 3 011 S 4 100 x 101 x 110 x 111 . S 0 000 S 1 001 S 2 010 S 3 011 S 4 100 x 101 x 110 x 111 . (2)

This code is inefficient in two ways. First, it leaves three illegal code words that correspond to no source symbol. Second, it uses the same code word length for an unlikely symbol (like S 4 S 4 ) that it uses for a likely symbol (like S 0 S 0 ). The first defect we can correct by concatenating consecutive symbols into symbol blocks, or composite symbols. If we form a composite symbol consisting of M M source symbols, then a typical composite symbol is S1S0S1S4S2S3S1S2S0S1S0S1S4S2S3S1S2S0. The number of such composite symbols that can be generated is 5 M 5 M . The binary code for these 5 M 5 M composite symbols must contain N N binary digits where

2N-1<5M<2N(NMlog25).2N-1<5M<2N(NMlog25).(3)

The number of bits per source symbol is

NMlog25=2.32.NMlog25=2.32.(4)

This scheme improves on the best variable length code of Table 2 from "Binary Codes: From Symbols to Binary Codes" by 0.08 bits/symbol.

Exercise 1

Suppose your source of information generates the 26 lowercase roman letters used in English language text. These letters are to be concatenated into blocks of length M. Complete the following table of N (number of bits) versus M (number of letters in a block) and show that NMNM approaches log226log226.

Table 1
  M
  1 2 3 4 5 6
N 5 10        
N / M N / M 5 5        

Now let's reason, as Morse did, that an efficient code would use short codes for likely symbols and long codes for unlikely symbols. Let's pick code #5 from Table 2 from "Binary Codes: From Symbols to Binary Codes" for this purpose:

S 0 S 1 S 2 S 3 S 4 1 01 001 0001 0000 . S 0 S 1 S 2 S 3 S 4 1 01 001 0001 0000 . (5)

This is a variable-length code. If we use this code on the 100 symbols that generated our source statistic, the average number of bits/symbol is

1 100 [ 50 ( 1 ) + 20 ( 2 ) + 20 ( 3 ) + 5 ( 4 ) + 5 ( 4 ) ] = 1 . 90 bits / symbol . 1 100 [ 50 ( 1 ) + 20 ( 2 ) + 20 ( 3 ) + 5 ( 4 ) + 5 ( 4 ) ] = 1 . 90 bits / symbol . (6)

Exercise 2

Use the source statistics of Equation 1 to determine the average number of bits/symbol for each code in Table 2 from "Binary Codes: From Symbols to Binary Codes".

Entropy. So far, each ad hoc scheme we have tried has produced an improvement in the average number of bits/symbol. How far can this go? The answer is given by Shannon's source coding theorem, which says that the minimum number of bits/symbol is

N M - i = 1 M p i log 2 p i N M - i = 1 M p i log 2 p i (7)

where pi is the probability that symbol Si is generated and -pilog2pi-pilog2pi is a fundamental property of the source called entropy. For our five-symbol example, the table of pi and -logpi-logpi is given in Table 2. The entropy is 1.861, and the bound on bits/symbol is

NM1.861.NM1.861. (8)

Code #5#5 comes within 0.039 of this lower bound. As we will see in the next paragraphs, this is as close as we can come without coding composite symbols.

Table 2: Source statistics for Five-Symbol Source
Symbol Probability ~ Log Probability
S 0 S 0 0.5 1
S 1 S 1 0.2 2.32
S 2 S 2 02 2.32
S 3 S 3 0.05 4.32
S 4 S 4 0.05 4.32

Exercise 3

Select an arbitrary page of English text. Build a table of source statistics containing p i p i (relative frequencies) and-log p i p i for a a through z z. (Ignore distinction between upper and lower case and ignore punctuation and other special symbols.) Compute the entropy -i=126pilog2pi-i=126pilog2pi.

Huffman Codes. In the late 1950s1950s, David Huffman discovered an algorithm for designing variable-length codes that minimize the average number of bits/symbol. Huffman's algorithm uses a principle of optimality that says, “the optimal code for M M letters has imbedded in it the optimal code for the M-1M-1 letters that result from aggregating the two least likely symbols.” When this principle is iterated, then we have an algorithm for generating the binary tree for a Huffman code:

  1. label all symbols as “children”;
  2. “twin” the two least probable children and give the twin the sum of the probabilities:
    Figure 1
    A binary tree labeled (0.10). Two branches extend down from the upper point and end at points labeled (s_4)(0.05) on the left and (S_3)(0.05) on the right.
  3. (regard the twin as a child; and
  4. repeat steps (ii) and (iii) until all children are accounted for.

This tree is now labeled with 1's and 0's to obtain the Huffman code. The labeling procedure is to label each right branch with a 1 and each left branch with a 0. The procedure for laying out symbols and constructing Huffman trees and codes is illustrated in the following examples.

Example 1

Consider the source statistics

Symbol S 0 S 1 S 2 S 3 S 4 Probability 0.5 0.2 0.2 0.05 0.05 Symbol S 0 S 1 S 2 S 3 S 4 Probability 0.5 0.2 0.2 0.05 0.05 (9)

for which the Huffman algorithm produces the following binary tree and its corresponding code:

Figure 2
Figure 2 (pic012.png)

Example 2

The Huffman code for the source statistics

Symbol S 0 S 1 S 2 S 3 S 4 Probability 0.75 0.075 0.075 0.05 0.05 Symbol S 0 S 1 S 2 S 3 S 4 Probability 0.75 0.075 0.075 0.05 0.05 (10)

is illustrated next:

Figure 3
Figure 3 (pic013.png)

Exercise 4

Generate binary trees and Huffman codes for the following source statistics:

Symbol S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 Probability 1 0.20 0.20 0.15 0.15 0.1 0.1 0.05 0.05 Probability 2 0.3 0.25 0.1 0.1 0.075 0.075 0.05 0.05 . Symbol S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 Probability 1 0.20 0.20 0.15 0.15 0.1 0.1 0.05 0.05 Probability 2 0.3 0.25 0.1 0.1 0.075 0.075 0.05 0.05 . (11)

Coding a FAX Machine. Symbols can arise in unusual ways and be defined quite arbitrarily. To illustrate this point, we consider the design of a hypothetical FAX machine. For our design we will assume that a laser scanner reads a page of black and white text or pictures, producing a high voltage for a black spot and a low voltage for a white spot. We will also assume that the laser scanner resolves the page at 1024 lines, with 1024 spots/line. This means that each page is represented by a two-dimensional array, or matrix, of pixels (picture elements), each pixel being 1 or 0. If we simply transmitted these l's and O'sO's, then we would need 1024×1024=1,059,5761024×1024=1,059,576 bits. If these were transmitted over a 9600 baud phone line, then it would take almost 2 minutes to transmit the FAX. This is a long time.

Let's think about a typical scan line for a printed page. It will contain long runs of 0's, corresponding to long runs of white, interrupted by short bursts of l's, corresponding to short runs of black where the scanner encounters a line or a part of a letter. So why not try to define a symbol to be “a run of k k 0's" and code these runs? The resulting code is called a “run length code.” Let's define eight symbols, corresponding to run lengths from 0 to 7 (a run length of 0 is a 1):

S 0 = run length of 0 zeros ( a 1 ) S 1 = run length of 1 zero S 7 = run legnth of 7 zeros. S 0 = run length of 0 zeros ( a 1 ) S 1 = run length of 1 zero S 7 = run legnth of 7 zeros. (12)

If we simply used a simple three-bit binary code for these eight symbols, then for each scan line we would generate anywhere from 3×10243×1024 bits (for a scan line consisting of all 1's) to 3×1024/74003×1024/7400 bits (for a scan line consisting of all 0's). But what if we ran an experiment to determine the relative frequency of the run lengths S 0 S 0 through S 7 S 7 and used a Huffman code to “run length encode” the run lengths? The following problem explores this possibility and produces an efficient FAX code.

Exercise 5

An experiment conducted on FAXed documents produces the following statistics for run lengths of white ranging from 0 to 7:

Symbol S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 Probability 0.01 0.06 0.1 0.1 0.2 0.15 0.15 0.2 Symbol S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 Probability 0.01 0.06 0.1 0.1 0.2 0.15 0.15 0.2 (13)

These statistics indicate that only 1% of a typical page is black. Construct the Huffman code for this source. Use your Huffman code to code and decode these scan lines:

Figure 4
Figure 4 (pic014.png)

Content actions

Give Feedback:

E-mail the module author | Rate module ( How does the rating system work?)

Rating system

Ratings

Ratings allow you to judge the quality of modules. If other users have ranked the module then its average rating is displayed below. Ratings are calculated on a scale from one star (Poor) to five stars (Excellent).

How to rate a module

Hover over the star that corresponds to the rating you wish to assign. Click on the star to add your rating. Your rating should be based on the quality of the content. You must have an account and be logged in to rate content.

(0 ratings)

Download:

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.

| A lens (?)

Definition of a lens

Lenses

A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual Connexions member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks