
Lab 9b - Speech Processing (part 2)

Module by: Charles A. Bouman

Questions or comments concerning this laboratory should be directed to Prof. Charles A. Bouman, School of Electrical and Computer Engineering, Purdue University, West Lafayette IN 47907; (765) 494-0340; bouman@ecn.purdue.edu

Introduction

This is the second part of a two-week experiment. During the first week we discussed basic properties of speech signals and performed some simple analyses in the time and frequency domains.

This week, we will introduce a system model for speech production. We will cover some background on linear predictive coding, and the final exercise will bring all the prior material together in a speech coding exercise.

A Speech Model

Figure 1: Discrete-Time Speech Production Model (model.png)

From a signal processing standpoint, it is very useful to think of speech production in terms of a model, as in Figure 1. The model shown is the simplest of its kind, but it includes all the principal components. The excitations for voiced and unvoiced speech are represented by an impulse train and white noise generator, respectively. The pitch of voiced speech is controlled by the spacing between impulses, $T_p$, and the amplitude (volume) of the excitation is controlled by the gain factor $G$.

As the acoustical excitation travels from its source (vocal cords, or a constriction), the shape of the vocal tract alters the spectral content of the signal. The most prominent effect is the formation of resonances, which intensify the signal energy at certain frequencies (called formants). As we learned in the Digital Filter Design lab, the amplification of certain frequencies may be achieved with a linear filter by an appropriate placement of poles in the transfer function. This is why our speech model uses an all-pole LTI filter. A more accurate model might include a few zeros in the transfer function, but if the order of the filter is chosen appropriately, the all-pole model is sufficient. The primary reason for using the all-pole model is the distinct computational advantage in calculating the filter coefficients, as will be discussed shortly.

Recall that the transfer function of an all-pole filter has the form

$$V(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}} \tag{1}$$

where $P$ is the order of the filter. This is an IIR filter that may be implemented with a recursive difference equation. With the input $G \cdot x(n)$, the speech signal $s(n)$ may be written as

$$s(n) = \sum_{k=1}^{P} a_k s(n-k) + G \cdot x(n) \tag{2}$$

Keep in mind that the filter coefficients will change continuously as the shape of the vocal tract changes, but speech segments of an appropriately small length may be approximated by a time-invariant model.
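As a minimal sketch of how Equation 2 maps onto Matlab's filter command, the following compares an explicit recursion with a call to filter. The second-order coefficients, the gain, and the impulse input are made-up illustration values, not data from this lab.

```matlab
% Hypothetical example values (not from coeff.mat).
a = [1.2 -0.5];            % a_1, a_2; poles lie inside the unit circle
G = 2;                     % gain factor
x = [1 zeros(1, 9)];       % unit impulse of length 10
P = length(a);

% Direct implementation of Equation 2 with zero initial conditions.
s = zeros(size(x));
for n = 1:length(x)
    acc = G * x(n);
    for k = 1:P
        if n - k >= 1
            acc = acc + a(k) * s(n - k);
        end
    end
    s(n) = acc;
end

% Same result from filter: numerator G, denominator [1 -a_1 ... -a_P].
s_ref = filter(G, [1 -a], x);
max_diff = max(abs(s - s_ref));   % should be at round-off level
```

Note the sign convention: the denominator passed to filter is [1 -a], because the coefficients appear with a minus sign in the denominator of Equation 1.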

This speech model is used in a variety of speech processing applications, including methods of speech recognition, speech coding for transmission, and speech synthesis. Each of these applications of the model involves dividing the speech signal into short segments, over which the filter coefficients are almost constant. For example, in speech transmission the bit rate can be significantly reduced by dividing the signal up into segments, computing and sending the model parameters for each segment (filter coefficients, gain, etc.), and re-synthesizing the signal at the receiving end, using a model similar to Figure 1. Most telephone systems use some form of this approach. Another example is speech recognition. Most recognition methods involve comparisons between short segments of the speech signals, and the filter coefficients of this model are often used in computing the "difference" between segments.

Synthesis of Voiced Speech

Download the file coeff.mat for the following section.

Download the file coeff.mat and load it into the Matlab workspace using the load command. This will load three sets of filter coefficients: A1, A2, and A3 for the vocal tract model in Equation 1 and Equation 2. Each vector contains coefficients $\{a_1, a_2, \ldots, a_{15}\}$ for an all-pole filter of order 15.

We will now synthesize voiced speech segments for each of these sets of coefficients. First write a Matlab function x=exciteV(N,Np) which creates a length $N$ excitation for voiced speech, with a pitch period of Np samples. The output vector x should contain a discrete-time impulse train with period Np (e.g. [1 0 0 0 1 0 0 ...]).
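One possible shape for such a function is sketched below. It assumes the first impulse is placed at the first sample and that a row vector is returned; neither choice is specified in the text.

```matlab
function x = exciteV(N, Np)
% EXCITEV  Sketch of a voiced excitation: length-N impulse train.
%   Assumption: the first impulse is at sample 1 and the output is a
%   row vector. Impulses repeat every Np samples.
x = zeros(1, N);
x(1:Np:N) = 1;
end
```

With this sketch, exciteV(7,4) places impulses at samples 1 and 5 of a length-7 vector.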

Assuming a sampling frequency of 8 kHz (0.125 ms/sample), create a 40 millisecond-long excitation with a pitch period of 8 ms, and filter it using Equation 2 for each set of coefficients. For this, you may use the command

s = filter(1,[1 -A],x)

where A is the row vector of filter coefficients (see Matlab's help on filter for details). Plot each of the three filtered signals. Use subplot() and orient tall to place them in the same figure.
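The sample counts follow from the sampling rate: 40 ms at 8 kHz is 320 samples, and a pitch period of 8 ms is 64 samples. The sketch below shows the filtering and plotting steps with three placeholder second-order coefficient sets; these are assumptions for illustration only and should be replaced by A1, A2, and A3 from coeff.mat.

```matlab
fs = 8000;                     % sampling frequency in Hz
N  = round(0.040 * fs);        % 40 ms  -> 320 samples
Np = round(0.008 * fs);        % 8 ms   -> 64 samples

% Impulse-train excitation (same idea as exciteV).
x = zeros(1, N);
x(1:Np:N) = 1;

% Placeholder coefficient sets (hypothetical, NOT the lab's data).
coeffs = {[1.2 -0.5], [0.5 -0.8], [-0.9 -0.6]};

t = (0:N-1) / fs * 1000;       % time axis in milliseconds
for i = 1:3
    s = filter(1, [1 -coeffs{i}], x);   % Equation 2 with G = 1
    subplot(3, 1, i);
    plot(t, s);
    xlabel('Time (ms)');
    ylabel(sprintf('s_%d(n)', i));
end
orient tall
```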

We will now compute the frequency response of each of these filters. The frequency response may be obtained by evaluating Equation 1 at points along $z = e^{j\omega}$. Matlab will compute this with the command [H,W]=freqz(1,[1 -A],512), where A is the vector of coefficients. Plot the magnitude of each response versus frequency in Hertz. Use subplot() and orient tall to plot them in the same figure.
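The frequencies returned by freqz are in radians per sample, so they need to be scaled by $f_s/(2\pi)$ to obtain Hertz. The following sketch does this conversion and, as a check, also evaluates Equation 1 directly on the unit circle. The coefficient vector is a placeholder, not one of the lab's coefficient sets.

```matlab
fs = 8000;                        % sampling frequency in Hz
A  = [1.2 -0.5];                  % placeholder coefficients (hypothetical)

[H, W] = freqz(1, [1 -A], 512);   % W in rad/sample, 0 <= W < pi
f = W * fs / (2*pi);              % convert to Hz, 0 <= f < fs/2

% Direct evaluation of Equation 1 at z = exp(j*W) for comparison.
k = 1:length(A);
H_direct = 1 ./ (1 - exp(-1j * W * k) * A(:));
err = max(abs(H - H_direct));     % should be at round-off level

plot(f, abs(H));
xlabel('Frequency (Hz)');
ylabel('Magnitude');
```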

The locations of the peaks in the spectrum correspond to the formant frequencies. For each vowel signal, estimate the first three formants (in Hz) and list them in the figure.

Now generate the three signals again, but use an excitation which is 1-2 seconds long. Listen to the filtered signals using soundsc. Can you hear qualitative differences in the signals? Can you identify the vowel sounds?

INLAB REPORT

Hand in the following:

  • A figure containing the three time-domain plots of the voiced signals.
  • Plots of the frequency responses for the three filters. Make sure to label the frequency axis in units of Hertz.
  • For each of the three filters, list the approximate center frequency of the first three formant peaks.
  • Comment on the audio quality of the synthesized signals.

Linear Predictive Coding

The filter coefficients which were provided in the previous section were determined using a technique called linear predictive coding (LPC). LPC is a fundamental component of many speech processing applications, including compression, recognition, and synthesis.

In the following discussion of LPC, we will view the speech signal as a discrete-time random process.

Forward Linear Prediction

Suppose we have a discrete-time random process $\{\ldots, S_{-1}, S_0, S_1, S_2, \ldots\}$ whose elements have some degree of correlation. The goal of forward linear prediction is to predict the sample $S_n$ using a linear combination of the previous $P$ samples.

$$\hat{S}_n = \sum_{k=1}^{P} a_k S_{n-k} \tag{3}$$

$P$ is called the order of the predictor. We may represent the error of predicting $S_n$ by a random sequence $e_n$.

$$\begin{aligned} e_n &= S_n - \hat{S}_n \\ &= S_n - \sum_{k=1}^{P} a_k S_{n-k} \end{aligned} \tag{4}$$

An optimal set of prediction coefficients $a_k$ for Equation 4 may be determined by minimizing the mean-square error $E[e_n^2]$. Note that since the error is generally a function of $n$, the prediction coefficients will also be functions of $n$. To simplify notation, let us first define the following column vectors.

$$\mathbf{a} = [\, a_1 \;\; a_2 \;\; \cdots \;\; a_P \,]^T \tag{5}$$
$$\mathbf{S}_{n,P} = [\, S_{n-1} \;\; S_{n-2} \;\; \cdots \;\; S_{n-P} \,]^T \tag{6}$$

Then,

$$\begin{aligned} E[e_n^2] &= E\left[ \left( S_n - \sum_{k=1}^{P} a_k S_{n-k} \right)^2 \right] \\ &= E\left[ \left( S_n - \mathbf{a}^T \mathbf{S}_{n,P} \right)^2 \right] \\ &= E\left[ S_n^2 - 2 S_n \mathbf{a}^T \mathbf{S}_{n,P} + \mathbf{a}^T \mathbf{S}_{n,P} \, \mathbf{a}^T \mathbf{S}_{n,P} \right] \\ &= E\left[ S_n^2 \right] - 2 \mathbf{a}^T E\left[ S_n \mathbf{S}_{n,P} \right] + \mathbf{a}^T E\left[ \mathbf{S}_{n,P} \mathbf{S}_{n,P}^T \right] \mathbf{a} \end{aligned} \tag{7}$$

The second and third terms of Equation 7 may be written in terms of the autocorrelation sequence $r_{SS}(k,l)$.

$$E[S_n \mathbf{S}_{n,P}] = \begin{bmatrix} E[S_n S_{n-1}] \\ E[S_n S_{n-2}] \\ \vdots \\ E[S_n S_{n-P}] \end{bmatrix} = \begin{bmatrix} r_{SS}(n,n-1) \\ r_{SS}(n,n-2) \\ \vdots \\ r_{SS}(n,n-P) \end{bmatrix} \equiv \mathbf{r}_S \tag{8}$$

$$\begin{aligned} E\left[ \mathbf{S}_{n,P} \mathbf{S}_{n,P}^T \right] &= E \begin{bmatrix} S_{n-1}S_{n-1} & S_{n-1}S_{n-2} & \cdots & S_{n-1}S_{n-P} \\ S_{n-2}S_{n-1} & S_{n-2}S_{n-2} & \cdots & S_{n-2}S_{n-P} \\ \vdots & \vdots & \ddots & \vdots \\ S_{n-P}S_{n-1} & S_{n-P}S_{n-2} & \cdots & S_{n-P}S_{n-P} \end{bmatrix} \\ &= \begin{bmatrix} r_{SS}(n-1,n-1) & r_{SS}(n-1,n-2) & \cdots & r_{SS}(n-1,n-P) \\ r_{SS}(n-2,n-1) & r_{SS}(n-2,n-2) & \cdots & r_{SS}(n-2,n-P) \\ \vdots & \vdots & \ddots & \vdots \\ r_{SS}(n-P,n-1) & r_{SS}(n-P,n-2) & \cdots & r_{SS}(n-P,n-P) \end{bmatrix} \equiv \mathbf{R}_S \end{aligned} \tag{9}$$

Substituting into Equation 7, the mean-square error may be written as

$$E\left[ e_n^2 \right] = E\left[ S_n^2 \right] - 2 \mathbf{a}^T \mathbf{r}_S + \mathbf{a}^T \mathbf{R}_S \mathbf{a} \tag{10}$$

Note that while $\mathbf{a}$ and $\mathbf{r}_S$ are vectors, and $\mathbf{R}_S$ is a matrix, the expression in Equation 10 is still a scalar quantity.

To find the optimal $a_k$ coefficients, which we will call $\hat{\mathbf{a}}$, we differentiate Equation 10 with respect to the vector $\mathbf{a}$ (compute the gradient), and set it equal to the zero vector.

$$\nabla_{\mathbf{a}} E\left[ e_n^2 \right] = -2 \mathbf{r}_S + 2 \mathbf{R}_S \hat{\mathbf{a}} = \mathbf{0} \tag{11}$$

Solving,

$$\mathbf{R}_S \hat{\mathbf{a}} = \mathbf{r}_S \tag{12}$$

The vector equation in Equation 12 is a system of $P$ scalar linear equations, which may be solved by inverting the matrix $\mathbf{R}_S$.

Note from Equation 8 and Equation 9 that $\mathbf{r}_S$ and $\mathbf{R}_S$ are generally functions of $n$. However, if $S_n$ is wide-sense stationary, the autocorrelation function is only dependent on the difference between the two indices, $r_{SS}(k,l) = r_{SS}(|k-l|)$. Then $\mathbf{R}_S$ and $\mathbf{r}_S$ are no longer dependent on $n$, and may be written as follows.

$$\mathbf{r}_S = \begin{bmatrix} r_{SS}(1) \\ r_{SS}(2) \\ \vdots \\ r_{SS}(P) \end{bmatrix} \tag{13}$$

$$\mathbf{R}_S = \begin{bmatrix} r_{SS}(0) & r_{SS}(1) & \cdots & r_{SS}(P-1) \\ r_{SS}(1) & r_{SS}(0) & \cdots & r_{SS}(P-2) \\ r_{SS}(2) & r_{SS}(1) & \cdots & r_{SS}(P-3) \\ \vdots & \vdots & \ddots & \vdots \\ r_{SS}(P-1) & r_{SS}(P-2) & \cdots & r_{SS}(0) \end{bmatrix} \tag{14}$$

Therefore, if $S_n$ is wide-sense stationary, the optimal $a_k$ coefficients do not depend on $n$. In this case, it is also important to note that $\mathbf{R}_S$ is a Toeplitz (constant along diagonals) and symmetric matrix, which allows Equation 12 to be solved efficiently using the Levinson-Durbin algorithm (see [1]). This property is essential for many real-time applications of linear prediction.
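To illustrate how the Toeplitz structure can be exploited, here is a sketch of a common textbook form of the Levinson-Durbin recursion, checked against a direct solve of Equation 12. The autocorrelation values are hypothetical, chosen only so that the matrix is positive definite; this is not necessarily the exact formulation given in [1].

```matlab
% Hypothetical autocorrelation values r_SS(0), ..., r_SS(3).
r = [1 0.8 0.5 0.2];
P = length(r) - 1;

a = zeros(P, 1);      % predictor coefficients, built up order by order
E = r(1);             % prediction error power of the order-0 predictor
for i = 1:P
    % Reflection coefficient for order i (r(m+1) holds r_SS(m)).
    acc = r(i + 1);
    for j = 1:i-1
        acc = acc - a(j) * r(i - j + 1);
    end
    k = acc / E;

    % Update the lower-order coefficients, then append the new one.
    a_prev = a;
    for j = 1:i-1
        a(j) = a_prev(j) - k * a_prev(i - j);
    end
    a(i) = k;

    E = (1 - k^2) * E;    % updated prediction error power
end

% Reference: solve Equation 12 directly.
a_direct = toeplitz(r(1:P)) \ r(2:P+1).';
max_diff = max(abs(a - a_direct));
```

The recursion needs on the order of $P^2$ operations, compared with roughly $P^3$ for a general linear solve.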

Linear Predictive Coding of Speech

An important question has yet to be addressed. The solution in Equation 12 to the linear prediction problem depends entirely on the autocorrelation sequence. How do we estimate the autocorrelation of a speech signal? Recall that the applications to which we are applying LPC involve dividing the speech signal up into short segments and computing the filter coefficients for each segment. Therefore we need to consider the problem of estimating the autocorrelation for a short segment of the signal. In LPC, the following "biased" autocorrelation estimate is often used.

$$\hat{r}_{SS}(m) = \frac{1}{N} \sum_{n=0}^{N-m-1} s(n)\, s(n+m), \qquad 0 \le m \le P \tag{15}$$

Here we are assuming we have a length $N$ segment which starts at $n = 0$. Note that this is the single-parameter form of the autocorrelation sequence, so that the forms in Equation 13 and Equation 14 may be used for $\mathbf{r}_S$ and $\mathbf{R}_S$.
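As a small worked example of Equation 15, the sketch below evaluates the biased estimate for a made-up four-sample segment. Matlab indices start at 1, so s(n) in the equation corresponds to s(n+1) in the code.

```matlab
s = [1 2 3 4].';      % hypothetical segment, N = 4
N = length(s);
P = 2;                % highest lag to compute

r_hat = zeros(P + 1, 1);          % r_hat(m+1) holds the estimate at lag m
for m = 0:P
    % Sum of s(n)*s(n+m) for n = 0..N-m-1, divided by N (not N-m).
    r_hat(m + 1) = sum(s(1:N-m) .* s(1+m:N)) / N;
end
% r_hat is [7.5; 5; 2.75] for this segment.
```

Dividing by $N$ rather than $N-m$ is what makes the estimate "biased". In Matlab's Signal Processing Toolbox, xcorr(s,P,'biased') returns the same values for lags $-P$ through $P$, so the non-negative lags are the last $P+1$ entries.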

LPC Exercise

Download the file test.mat for this exercise.

Write a function coef=mylpc(x,P) which will compute the order-P LPC coefficients for the column vector x, using the autocorrelation method ("lpc" is a built-in Matlab function, so use the name mylpc). Consider the input vector x as a speech segment; in other words, do not divide it up into pieces. The output vector coef should be a column vector containing the $P$ coefficients $\{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_P\}$. In your function you should do the following:

  1. Compute the biased autocorrelation estimate of Equation 15 for the lag values $0 \le m \le P$. You may use the xcorr function for this.
  2. Form the vector $\mathbf{r}_S$ and the matrix $\mathbf{R}_S$ as in Equation 13 and Equation 14. Hint: Use the toeplitz function to form $\mathbf{R}_S$.
  3. Solve the matrix equation in Equation 12 for $\hat{\mathbf{a}}$.

To test your function, download the file test.mat, and load it into Matlab. This file contains two vectors: a signal x and its order-15 LPC coefficients a. Use your function to compute the order-15 LPC coefficients of x, and compare the result to the vector a.
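The three steps could be arranged roughly as follows. This is only one possible sketch: it computes the autocorrelation with an explicit loop instead of xcorr, and solves Equation 12 with the backslash operator rather than an explicit inverse.

```matlab
function coef = mylpc(x, P)
% MYLPC  Sketch of order-P LPC by the autocorrelation method.
%   Assumption: x is treated as a single segment; coef is P-by-1.
x = x(:);                         % force a column vector
N = length(x);

% Step 1: biased autocorrelation estimate for lags 0..P (Equation 15).
r = zeros(P + 1, 1);
for m = 0:P
    r(m + 1) = sum(x(1:N-m) .* x(1+m:N)) / N;
end

% Step 2: r_S (Equation 13) and the Toeplitz matrix R_S (Equation 14).
r_S = r(2:P+1);
R_S = toeplitz(r(1:P));

% Step 3: solve R_S * coef = r_S (Equation 12).
coef = R_S \ r_S;
end
```

A quick sanity check is a decaying exponential such as x = 0.5.^(0:99).', for which an order-1 predictor should return a coefficient very close to 0.5.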

INLAB REPORT:

Hand in your mylpc function.

Speech Coding and Synthesis

Download the file phrase.au for the following section.

One very effective application of LPC is the compression of speech signals. For example, an LPC vocoder (voice-coder) is a system used in many telephone systems to reduce the bit rate for the transmission of speech. This system has two overall components: an analysis section which computes signal parameters (gain, filter coefficients, etc.), and a synthesis section which reconstructs the speech signal after transmission.

Since we have introduced the speech model in "A Speech Model", and the estimation of LPC coefficients in "Linear Predictive Coding", we now have all the tools necessary to implement a simple vocoder. First, in the analysis section, the original speech signal will be split into short time frames. For each frame, we will compute the signal energy, the LPC coefficients, and determine whether the segment is voiced or unvoiced.

Download the file phrase.au. This speech signal is sampled at a rate of 8000 Hz.

  1. Divide the original speech signal into 30 ms non-overlapping frames. Place the frames into L consecutive columns of a matrix S (use reshape). If the samples at the tail end of the signal do not fill an entire column, you may disregard these samples.
  2. Compute the energy of each frame of the original word, and place these values in a length L vector called energy.
  3. Determine whether each frame is voiced or unvoiced. Use your zero_cross function from the first week to compute the number of zero-crossings in each frame. For length $N$ segments with fewer than $N/2$ zero-crossings, classify the segment as voiced, otherwise unvoiced. Save the results in a vector VU which takes the value "1" for voiced and "0" for unvoiced.
  4. Use your mylpc function to compute order-15 LPC coefficients for each frame. Place each set of coefficients into a column of a $15 \times L$ matrix A.
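Steps 1 through 3 can be sketched as below. The test signal is synthetic (a 100 Hz sine followed by a sign-alternating sequence) and stands in for phrase.au, and the inline sign-change count is only a stand-in for your own zero_cross function. Step 4 would then call mylpc on each column of S.

```matlab
fs = 8000;
frameLen = round(0.030 * fs);         % 30 ms -> 240 samples

% Synthetic stand-in signal (hypothetical, not phrase.au):
% 480 samples of a 100 Hz sine, then 500 samples alternating +1/-1.
sig = [sin(2*pi*100*(0:479).'/fs); (-1).^(0:499).'];

% Step 1: keep only whole frames and put one frame in each column.
L = floor(length(sig) / frameLen);    % 4 full frames, 20 samples dropped
S = reshape(sig(1:L*frameLen), frameLen, L);

% Step 2: energy of each frame (sum of squared samples).
energy = sum(S.^2, 1);

% Step 3: count sign changes per frame and classify.
zc = sum(S(1:end-1, :) .* S(2:end, :) < 0, 1);
VU = double(zc < frameLen/2);         % 1 = voiced, 0 = unvoiced
```

For this synthetic signal the two sine frames are classified as voiced and the two alternating frames as unvoiced.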

To see the reduction in data, add up the total number of bytes Matlab uses to store the encoded speech in the arrays A, VU, and energy (use the whos function). Compute the compression ratio by dividing this by the number of bytes Matlab uses to store the original speech signal. Note that the compression ratio can be further improved by using a technique called vector quantization on the LPC coefficients, and also by using fewer bits to represent the gain and voiced/unvoiced indicator.
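As a rough sketch of this bookkeeping, assume every array is stored in double precision (8 bytes per element). Each 240-sample frame is then replaced by 15 + 1 + 1 = 17 values, so the ratio comes out to 17/240, or about 0.071. The arrays below are zero-filled placeholders with a hypothetical frame count, used only to show the whos call.

```matlab
L = 10;                         % hypothetical number of frames
frameLen = 240;                 % 30 ms at 8 kHz

% Placeholder arrays with the shapes described in the text.
A      = zeros(15, L);
VU     = zeros(1, L);
energy = zeros(1, L);
orig   = zeros(frameLen * L, 1);    % stands in for the original signal

% whos with an output argument returns a struct array with a bytes field.
encInfo  = whos('A', 'VU', 'energy');
origInfo = whos('orig');
encodedBytes = sum([encInfo.bytes]);
ratio = encodedBytes / origInfo.bytes;   % 17/240 under these assumptions
```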

Now the computed parameters will be used to re-synthesize the phrase using the model in Figure 1. Similar to your exciteV function from "Synthesis of Voiced Speech", create a function x=exciteUV(N) which returns a length $N$ excitation for unvoiced speech (generate a Normal(0,1) sequence). Then for each encoded frame do the following:

  1. Check if current frame is voiced or unvoiced.
  2. Generate the frame of speech by using the appropriate excitation into the filter specified by the LPC coefficients (you did this in "Synthesis of Voiced Speech"). For voiced speech, use a pitch period of 7.5 ms. Make sure your synthesized segment is the same length as the original frame.
  3. Scale the amplitude of the segment so that the synthesized segment has the same energy as the original.
  4. Append the frame to the end of the output vector.
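The synthesis loop might look like the sketch below. The encoded parameters are made-up values for two frames with order-2 filters, kept small for readability; in the exercise they would come from your analysis stage with order 15. The excitations are generated inline rather than through exciteV and exciteUV so that the sketch is self-contained.

```matlab
fs = 8000;
frameLen = round(0.030 * fs);     % 240 samples per frame
Np = round(0.0075 * fs);          % 7.5 ms pitch period -> 60 samples

% Hypothetical encoded parameters (one column or entry per frame).
A      = [1.2 0.3; -0.5 0];       % LPC coefficients, order 2 for brevity
VU     = [1 0];                   % frame 1 voiced, frame 2 unvoiced
energy = [4 9];
L = length(VU);

out = zeros(frameLen * L, 1);     % preallocated output signal
for i = 1:L
    % Steps 1 and 2: pick the excitation and filter it.
    if VU(i) == 1
        x = zeros(frameLen, 1);
        x(1:Np:end) = 1;          % impulse train
    else
        x = randn(frameLen, 1);   % white Gaussian noise
    end
    s = filter(1, [1 -A(:, i).'], x);

    % Step 3: scale so the frame energy matches the original.
    s = s * sqrt(energy(i) / sum(s.^2));

    % Step 4: place the frame at the end of the output built so far.
    out((i-1)*frameLen + (1:frameLen)) = s;
end
```

Note that this sketch restarts the filter from zero initial conditions in every frame, which is a simplification and one possible source of audible artifacts at frame boundaries.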

Listen to the original and synthesized phrase. Can you recognize the synthesized version as coming from the same speaker? What are some possible ways to improve the quality of the synthesized speech? Subplot the two speech signals in the same figure.

INLAB REPORT

Hand in the following:

  • Your analysis and synthesis code.
  • The compression ratio.
  • Plots of the original and synthesized words.
  • Comment on the quality of your synthesized signal. How might the quality be improved?

References

  1. J. G. Proakis and D. G. Manolakis. (1996). Digital Signal Processing. (3rd ed.). Englewood Cliffs, New Jersey: Prentice-Hall.
  2. J. R. Deller, Jr., J. G. Proakis, and J. H. Hansen. (1993). Discrete-Time Processing of Speech Signals. New York: Macmillan.
