Skip to content Skip to navigation

Connexions

You are here: Home » Content » Source Coding

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

      What are tags? tag icon

      Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

    • External bookmarks
  • E-mail the author
  • Rate this module (How does the rating system work?)

    Rating system

    Ratings

    Ratings allow you to judge the quality of modules. If other users have ranked the module then its average rating is displayed below. Ratings are calculated on a scale from one star (Poor) to five stars (Excellent).

    How to rate a module

    Hover over the star that corresponds to the rating you wish to assign. Click on the star to add your rating. Your rating should be based on the quality of the content. You must have an account and be logged in to rate content.

    (0 ratings)

Recently Viewed

This feature requires Javascript to be enabled.

Source Coding

Module by: Behnaam Aazhang

Summary: An introduction to the concept of typical sequences, which lie at the heart of source coding. The idea of typical sequences leads to Shannon's source-coding Theorem.

Note: Your browser may not currently support MathML. See our browser support page for additional details. You can always view the correct math in the PDF version.

As mentioned earlier, how much a source can be compressed should be related to its entropy. In 1948, Claude E. Shannon introduced three theorems and developed very rigorous mathematics for digital communications. In one of the three theorems, Shannon relates entropy to the minimum number of bits per second required to represent a source without much loss (or distortion).

Consider a source that is modeled by a discrete-time and discrete-valued random process X 1 X 1 , X 2 X 2 , …, X n X n , … where x i a 1 a 2 a N x i a 1 a 2 a N and define p X i x i = a j = p j p X i x i a j p j for j= 1 , 2 , , N j 1 , 2 , , N , where it is assumed that X 1 X 1 , X 2 X 2 ,… X n X n are mutually independent and identically distributed.

Consider a sequence of length nn

X= X 1 X 2 X n X X 1 X 2 X n (1)
The symbol a 1 a 1 can occur with probability p 1 p 1 . Therefore, in a sequence of length nn, on the average, a 1 a 1 will appear n p 1 n p 1 times with high probabilities if nn is very large.

Therefore,

PX=x= p X 1 x 1 p X 2 x 2 p X n x n P X x p X 1 x 1 p X 2 x 2 p X n x n (2)
PX=x p 1 n p 1 p 2 n p 2 p N n p N =i=1N p i n p i P X x p 1 n p 1 p 2 n p 2 p N n p N i 1 N p i n p i (3)
where p i =P X j = a i p i P X j a i for all jj and for all ii.

A typical sequence X X may look like

X= a 2 a 1 a N a 2 a 5 a 1 a N a 6 X a 2 a 1 a N a 2 a 5 a 1 a N a 6 (4)
where a i a i appears n p i n p i times with large probability. This is referred to as a typical sequence. The probability of X X being a typical sequence is
PX=xi=1N p i n p i =i=1N2log2 p i n p i =i=1N2n p i log2 p i =2ni=1N p i log2 p i =2-nHX P X x i 1 N p i n p i i 1 N 2 2 p i n p i i 1 N 2 n p i 2 p i 2 n i 1 N p i 2 p i 2 n H X (5)
where HX H X is the entropy of the random variables X 1 X 1 , X 2 X 2 ,…, X n X n .

For large nn, almost all the output sequences of length nn of the source are equally probably with probability2-nHX probability 2 n H X . These are typical sequences. The probability of nontypical sequences are negligible. There are Nn N n different sequences of length nn with alphabet of size NN. The probability of typical sequences is almost 1.

k=1# of typical seq.2-nHX=1 k 1 # of typical seq. 2 n H X 1 (6)

Figure 1
Figure 1 (Figure7-17.png)

Example 1

Consider a source with alphabet {A,B,C,D} with probabilities { 12 1 2 , 14 1 4 , 18 1 8 , 18 1 8 }. Assume X 1 X 1 , X 2 X 2 ,…, X 8 X 8 is an independent and identically distributed sequence with X i ABCD X i A B C D with the above probabilities.

HX=-12log21214log21418log21818log218=12+24+38+38=4+4+68=148 H X 1 2 2 1 2 1 4 2 1 4 1 8 2 1 8 1 8 2 1 8 1 2 2 4 3 8 3 8 4 4 6 8 14 8 (7)

The number of typical sequences of length 8

28148=214 2 8 14 8 2 14 (8)

The number of nontypical sequences 48214=216214=21441=3×214 4 8 2 14 2 16 2 14 2 14 4 1 3 2 14

Examples of typical sequences include those with A appearing 812=4 8 1 2 4 times, B appearing 814=2 8 1 4 2 times, etc. {A,D,B,B,A,A,C,A}, {A,A,A,A,C,D,B,B} and much more.

Examples of nontypical sequences of length 8: {D,D,B,C,C,A,B,D}, {C,C,C,C,C,B,C,C} and much more. Indeed, these definitions and arguments are valid when n is very large. The probability of a source output to be in the set of typical sequences is 1 when n n . The probability of a source output to be in the set of nontypical sequences approaches 0 as n n .

The essence of source coding or data compression is that as n n , nontypical sequences never appear as the output of the source. Therefore, one only needs to be able to represent typical sequences as binary codes and ignore nontypical sequences. Since there are only 2nHX 2 n H X typical sequences of length nn, it takes nHX n H X bits to represent them on the average. On the average it takes HX H X bits per source output to represent a simple source that produces independent and identically distributed outputs.

Theorem 1: Shannon's Source-Coding

A source that produced independent and identically distributed random variables with entropy HH can be encoded with arbitrarily small error probability at any rate RR in bits per source output if RH R H . Conversely, if R<H R H , the error probability will be bounded away from zero, independent of the complexity of coder and decoder.

The source coding theorem proves existence of source coding techniques that achieve rates close to the entropy but does not provide any algorithms or ways to construct such codes.

If the source is not i.i.d. (independent and identically distributed), but it is stationary with memory, then a similar theorem applies with the entropy HX H X replaced with the entropy rate H=limnH X n | X 1 X 2 X n-1 H n H X n | X 1 X 2 X n-1

In the case of a source with memory, the more the source produces outputs the more one knows about the source and the more one can compress.

Example 2

The English language has 26 letters, with space it becomes an alphabet of size 27. If modeled as a memoryless source (no dependency between letters in a word) then the entropy is HX=4.03 H X 4.03 bits/letter.

If the dependency between letters in a text is captured in a model the entropy rate can be derived to be H=1.3 H 1.3 bits/letter. Note that a non-information theoretic representation of a text may require 5 bits/letter since 25 2 5 is the closest power of 2 to 27. Shannon's results indicate that there may be a compression algorithm with the rate of 1.3 bits/letter.

Although Shannon's results are not constructive, there are a number of source coding algorithms for discrete time discrete valued sources that come close to Shannon's bound. One such algorithm is the Huffman source coding algorithm. Another is the Lempel and Ziv algorithm.

Huffman codes and Lempel and Ziv apply to compression problems where the source produces discrete time and discrete valued outputs. For cases where the source is analog there are powerful compression algorithms that specify all the steps from sampling, quantizations, and binary representation. These are referred to as waveform coders. JPEG, MPEG, vocoders are a few examples for image, video, and voice, respectively.

Comments, questions, feedback, criticisms?

Send feedback