Inside Collection (Course): Purdue Digital Signal Processing Labs (ECE 438)
Questions or comments concerning this laboratory should be directed to Prof. Charles A. Bouman, School of Electrical and Computer Engineering, Purdue University, West Lafayette IN 47907; (765) 494-0340; bouman@ecn.purdue.edu
This is the second part of a two-week experiment. During the first week we discussed basic properties of speech signals, and performed some simple analyses in the time and frequency domains.
This week, we will introduce a system model for speech production. We will cover some background on linear predictive coding, and the final exercise will bring all the prior material together in a speech coding exercise.
Figure 1: Discrete-time speech production model.
From a signal processing standpoint, it is very useful to think of speech
production in terms of a model, as in Figure 1.
The model shown is the simplest of its kind,
but it includes all the principal components.
The excitations for voiced and unvoiced speech are represented by an
impulse train and white noise generator, respectively.
The pitch of voiced speech is controlled by the spacing between
impulses, and the amplitude (volume) of the excitation is controlled by a gain factor.
As the acoustical excitation travels from its source (vocal cords, or a constriction), the shape of the vocal tract alters the spectral content of the signal. The most prominent effect is the formation of resonances, which intensify the signal energy at certain frequencies (called formants). As we learned in the Digital Filter Design lab, the amplification of certain frequencies may be achieved with a linear filter by an appropriate placement of poles in the transfer function. This is why our speech model uses an all-pole LTI filter. A more accurate model might include a few zeros in the transfer function, but if the order of the filter is chosen appropriately, the all-pole model is sufficient. The primary reason for using the all-pole model is the distinct computational advantage in calculating the filter coefficients, as will be discussed shortly.
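Recall that the transfer function of an all-pole filter has the form

$$V(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}} \qquad (1)$$

where $P$ is the order of the filter. This is an IIR filter that may be implemented with a recursive difference equation. With the input $G \cdot x(n)$, where $x(n)$ is the excitation and $G$ is the gain, the speech signal $s(n)$ may be written as

$$s(n) = \sum_{k=1}^{P} a_k \, s(n-k) + G \cdot x(n) \qquad (2)$$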
Keep in mind that the filter coefficients will change continuously as the shape of the vocal tract changes, but speech segments of an appropriately small length may be approximated by a time-invariant model.
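As a minimal sketch of how an all-pole recursion of the form s(n) = a_1 s(n-1) + ... + a_P s(n-P) + G x(n) maps onto Matlab's filter command, the script below runs the recursion by hand and compares it with the built-in function. The second-order coefficients and the impulse excitation are hypothetical values chosen only for illustration; they are not taken from the lab's data files.

```matlab
% Hypothetical second-order coefficients (NOT from the lab data), chosen
% so that both poles lie inside the unit circle.
a = [1.2 -0.8];          % a_1, a_2
G = 1;                   % gain applied to the excitation
P = length(a);
x = zeros(1, 50);        % excitation: a single unit impulse
x(1) = 1;

% Direct recursion: s(n) = a_1*s(n-1) + ... + a_P*s(n-P) + G*x(n)
s_loop = zeros(size(x));
for n = 1:length(x)
    acc = G * x(n);
    for k = 1:P
        if n - k >= 1                      % samples before n = 1 are zero
            acc = acc + a(k) * s_loop(n - k);
        end
    end
    s_loop(n) = acc;
end

% The same filter with the built-in function; note the sign flip in [1 -a].
s_filt = filter(G, [1 -a], x);

max_diff = max(abs(s_loop - s_filt));      % expected to be at round-off level
```

The denominator vector passed to filter is [1 -a] because the recursion moves the feedback terms to the left-hand side with negative signs.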
This speech model is used in a variety of speech processing applications, including methods of speech recognition, speech coding for transmission, and speech synthesis. Each of these applications of the model involves dividing the speech signal into short segments, over which the filter coefficients are almost constant. For example, in speech transmission the bit rate can be significantly reduced by dividing the signal up into segments, computing and sending the model parameters for each segment (filter coefficients, gain, etc.), and re-synthesizing the signal at the receiving end, using a model similar to Figure 1. Many telephone systems use some form of this approach. Another example is speech recognition. Most recognition methods involve comparisons between short segments of the speech signals, and the filter coefficients of this model are often used in computing the "difference" between segments.
Download the file coeff.mat and load it into the Matlab workspace using the load command.
This will load three sets of filter coefficients for the vocal tract model in Equation 1 and Equation 2.
We will now synthesize voiced speech segments for each of these sets of
coefficients.
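First write a Matlab function x=exciteV(N,Np) which creates a length-N excitation for voiced speech with a pitch period of Np samples. The output vector x should contain a discrete-time impulse train with period Np.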
Assuming a sampling frequency of 8 kHz (0.125 ms/sample), create a 40 millisecond-long excitation with a pitch period of 8 ms, and filter it using Equation 2 for each set of coefficients. For this, you may use the command
s = filter(1,[1 -A],x)
where A is the row vector of filter coefficients. Plot each of the three filtered signals, using subplot() and orient tall to place them in the same figure.
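The following is one possible sketch of the excitation and filtering step, assuming the 8 kHz sampling rate, 40 ms duration, and 8 ms pitch period stated above. The anonymous function stands in for an exciteV function file, and the coefficient vector A is a hypothetical placeholder rather than one of the sets in coeff.mat.

```matlab
fs = 8000;                       % sampling frequency in Hz
N  = round(0.040 * fs);          % 40 ms excitation  -> 320 samples
Np = round(0.008 * fs);          % 8 ms pitch period -> 64 samples

% One possible impulse-train generator: ones at n = 0, Np, 2*Np, ...
exciteV = @(N, Np) double(mod((0:N-1), Np) == 0);

x = exciteV(N, Np);              % row vector with an impulse every Np samples

% Placeholder coefficients (hypothetical, NOT one of the sets in coeff.mat).
A = [1.2 -0.8];
s = filter(1, [1 -A], x);        % synthesized voiced segment
```

With real coefficient sets, the same two lines (excitation, then filter) would be repeated for each set before plotting.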
We will now compute the frequency response of each of these filters.
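The frequency response may be obtained by evaluating
Equation 1 at points along the unit circle, $z = e^{j\omega}$. Matlab will compute this with the command [H,W]=freqz(1,[1 -A],512),
where A is the vector of coefficients. Plot the magnitude of each response versus frequency in Hertz, using subplot() and orient tall to plot them in
the same figure.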
The locations of the peaks in the spectrum correspond to the formant frequencies. For each vowel signal, estimate the first three formants (in Hz) and list them in the figure.
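To illustrate how a spectral peak is located in Hertz, the sketch below evaluates an all-pole transfer function directly on the upper half of the unit circle, on the same 512-point grid that freqz(1,[1 -A],512) uses, and converts the frequency axis from radians per sample to Hz. The coefficients are hypothetical and produce a single resonance; real vowel filters would show several peaks.

```matlab
fs   = 8000;               % sampling frequency in Hz
A    = [1.2 -0.8];         % hypothetical coefficients, NOT from coeff.mat
P    = length(A);
nfft = 512;

% Frequencies 0 <= w < pi in rad/sample.
w = (0:nfft-1)' * pi / nfft;

% Evaluate V(z) = 1 / (1 - sum_k a_k z^(-k)) at z = exp(j*w).
H = 1 ./ (1 - exp(-1j * w * (1:P)) * A(:));

f_hz = w * fs / (2 * pi);  % convert rad/sample to Hz (0 up to fs/2)

% Location of the largest spectral peak (one resonance for this toy filter).
[peak_val, idx] = max(abs(H));
f_peak = f_hz(idx);
```

The conversion f = w*fs/(2*pi) is what places the frequency axis in Hertz when plotting abs(H).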
Now generate the three signals again, but use an excitation which is 1-2
seconds long.
Listen to the filtered signals using soundsc.
Can you hear qualitative differences in the signals?
Can you identify the vowel sounds?
Hand in the following:
The filter coefficients which were provided in the previous section were determined using a technique called linear predictive coding (LPC). LPC is a fundamental component of many speech processing applications, including compression, recognition, and synthesis.
In the following discussion of LPC, we will view the speech signal as a discrete-time random process.
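Suppose we have a discrete-time random process $\{\ldots, S_{-1}, S_0, S_1, S_2, \ldots\}$ whose elements have some degree of correlation. The goal of forward linear prediction is to predict the sample $S_n$ using a linear combination of the previous $P$ samples,

$$\hat{S}_n = \sum_{k=1}^{P} a_k S_{n-k} \qquad (3)$$

where $P$ is called the order of the predictor. We may represent the error of predicting $S_n$ by a random sequence $e_n$.

$$e_n = S_n - \hat{S}_n \qquad (4)$$

$$e_n = S_n - \sum_{k=1}^{P} a_k S_{n-k} \qquad (5)$$

An optimal set of prediction coefficients $a_k$ may be determined by minimizing the mean-square error $E[e_n^2]$. Note that since the error is generally a function of $n$, the prediction coefficients will also be functions of $n$. To simplify notation, let us first define the following column vectors.

$$\mathbf{a} = [a_1 \; a_2 \; \cdots \; a_P]^T$$

$$\mathbf{S}_{n,P} = [S_{n-1} \; S_{n-2} \; \cdots \; S_{n-P}]^T$$

Then,

$$E[e_n^2] = E\left[\left(S_n - \mathbf{a}^T \mathbf{S}_{n,P}\right)^2\right] \qquad (6)$$

$$E[e_n^2] = E[S_n^2] - 2\,\mathbf{a}^T E[S_n \mathbf{S}_{n,P}] + \mathbf{a}^T E[\mathbf{S}_{n,P} \mathbf{S}_{n,P}^T]\,\mathbf{a} \qquad (7)$$

The second and third terms of Equation 7 may be written
in terms of the autocorrelation sequence $r_{SS}(k,l) = E[S_k S_l]$.

$$E[S_n \mathbf{S}_{n,P}] = \begin{bmatrix} r_{SS}(n,n-1) \\ r_{SS}(n,n-2) \\ \vdots \\ r_{SS}(n,n-P) \end{bmatrix} \equiv \mathbf{r}_S \qquad (8)$$

$$E[\mathbf{S}_{n,P} \mathbf{S}_{n,P}^T] = \begin{bmatrix} r_{SS}(n-1,n-1) & r_{SS}(n-1,n-2) & \cdots & r_{SS}(n-1,n-P) \\ r_{SS}(n-2,n-1) & r_{SS}(n-2,n-2) & \cdots & r_{SS}(n-2,n-P) \\ \vdots & \vdots & \ddots & \vdots \\ r_{SS}(n-P,n-1) & r_{SS}(n-P,n-2) & \cdots & r_{SS}(n-P,n-P) \end{bmatrix} \equiv \mathbf{R}_S \qquad (9)$$

Substituting into Equation 7, the mean-square error may be written as

$$E[e_n^2] = E[S_n^2] - 2\,\mathbf{a}^T \mathbf{r}_S + \mathbf{a}^T \mathbf{R}_S \mathbf{a} \qquad (10)$$

Note that while $\mathbf{a}$ and $\mathbf{r}_S$ are vectors and $\mathbf{R}_S$ is a matrix, the expression in Equation 10 is still a scalar quantity.

To find the optimal $a_k$ coefficients, which we will call $\hat{\mathbf{a}}$, we differentiate Equation 10 with respect to the vector $\mathbf{a}$ (compute the gradient) and set it equal to the zero vector.

$$\nabla_{\mathbf{a}} E[e_n^2] = -2\,\mathbf{r}_S + 2\,\mathbf{R}_S \hat{\mathbf{a}} \equiv \mathbf{0} \qquad (11)$$

Solving,

$$\mathbf{R}_S \hat{\mathbf{a}} = \mathbf{r}_S \qquad (12)$$

The vector equation in Equation 12 is a system of $P$ scalar linear equations, which may be solved by inverting the matrix $\mathbf{R}_S$.

Note from Equation 8 and Equation 9 that $\mathbf{r}_S$ and $\mathbf{R}_S$ are generally functions of $n$. However, if $S_n$ is wide-sense stationary, the autocorrelation function depends only on the difference between the two indices, $r_{SS}(k,l) = r_{SS}(|k-l|)$. Then $\mathbf{R}_S$ and $\mathbf{r}_S$ no longer depend on $n$, and may be written as follows.

$$\mathbf{r}_S = \begin{bmatrix} r_{SS}(1) \\ r_{SS}(2) \\ \vdots \\ r_{SS}(P) \end{bmatrix} \qquad (13)$$

$$\mathbf{R}_S = \begin{bmatrix} r_{SS}(0) & r_{SS}(1) & \cdots & r_{SS}(P-1) \\ r_{SS}(1) & r_{SS}(0) & \cdots & r_{SS}(P-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{SS}(P-1) & r_{SS}(P-2) & \cdots & r_{SS}(0) \end{bmatrix} \qquad (14)$$

Therefore, if $S_n$ is wide-sense stationary, the optimal $a_k$ coefficients do not depend on $n$. In this case, $\mathbf{R}_S$ is also a symmetric Toeplitz matrix (constant along its diagonals), which allows Equation 12 to be solved efficiently using the Levinson-Durbin algorithm.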
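As a small worked example of solving the linear prediction equations (the system in Equation 12) in the stationary case, the sketch below assumes a hypothetical autocorrelation sequence r(k) = rho^|k|, which is the autocorrelation shape of a first-order autoregressive process. For such a process only the most recent sample should carry predictive information, so the solution is expected to be [rho; 0; 0].

```matlab
rho = 0.9;                  % hypothetical correlation between adjacent samples
P   = 3;                    % predictor order

r  = (rho .^ (0:P))';       % autocorrelation at lags 0..P (assumed, not estimated)
RS = toeplitz(r(1:P));      % P-by-P symmetric Toeplitz matrix built from lags 0..P-1
rS = r(2:P+1);              % right-hand side built from lags 1..P

a_hat = RS \ rS;            % solve RS * a_hat = rS

% Minimum mean-square error: r(0) - a_hat' * rS when a_hat is optimal.
mse = r(1) - a_hat' * rS;
```

The backslash operator is used here instead of an explicit inverse; for large orders a Levinson-Durbin recursion would exploit the Toeplitz structure, which this sketch does not do.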
An important question has yet to be addressed. The solution in Equation 12 to the linear prediction problem depends entirely on the autocorrelation sequence. How do we estimate the autocorrelation of a speech signal? Recall that the applications to which we are applying LPC involve dividing the speech signal up into short segments and computing the filter coefficients for each segment. Therefore we need to consider the problem of estimating the autocorrelation for a short segment of the signal. In LPC, the following "biased" autocorrelation estimate is often used.
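$$\hat{r}_{SS}(m) = \frac{1}{N} \sum_{n=0}^{N-m-1} s(n)\, s(n+m), \qquad 0 \le m \le P \qquad (15)$$

Here we are assuming we have a length-$N$ speech segment $s(n)$ which starts at $n=0$.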
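To make the biased estimate concrete, the sketch below computes it for a toy four-sample segment: each lag-m value is the sum of products of samples m apart, divided by the full segment length N regardless of how many products are in the sum. The segment values are made up for illustration.

```matlab
x = [1 2 3 4]';             % toy segment (hypothetical values)
N = length(x);
P = 2;                      % largest lag to compute

r = zeros(P + 1, 1);        % r(m+1) holds the estimate at lag m
for m = 0:P
    % Sum over the N-m overlapping sample pairs, but always divide by N.
    r(m + 1) = sum(x(1:N-m) .* x(1+m:N)) / N;
end
% r = [7.5; 5; 2.75]
```

Dividing by N rather than N-m is what makes the estimate "biased"; xcorr(x,P,'biased') returns the same values for lags 0 through P in the second half of its output (its first half holds the negative lags).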
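Write a function coef=mylpc(x,P) which will compute the order-P LPC coefficients for the column vector x using the autocorrelation method. In your function, first compute the biased autocorrelation estimate for the lag values 0 through P; you may use the xcorr function for this. Then form the autocorrelation vector and matrix of the linear prediction problem; use the toeplitz function to form the matrix. Finally, solve Equation 12 for the coefficients.

To test your function, download the file test.mat, and load it into Matlab. This file contains two vectors: a signal and its LPC coefficients. Use them to check the output of your mylpc function.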
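One way to sanity-check an LPC routine without external data is to analyze a signal whose coefficients are known. The sketch below is a script-style outline of the autocorrelation method (not a finished mylpc function) applied to the impulse response of a hypothetical second-order all-pole filter; for such a signal the method should return the filter's own coefficients. The autocorrelation is computed with conv here, which is an assumption of this sketch standing in for xcorr.

```matlab
a_true = [1.2; -0.8];                        % hypothetical "true" coefficients
P = length(a_true);

% Test signal: impulse response of 1/(1 - 1.2 z^-1 + 0.8 z^-2), long enough
% that the truncated tail is negligible.
x = filter(1, [1; -a_true], [1; zeros(399, 1)]);
N = length(x);

% Biased autocorrelation at lags 0..P. Index N of the full correlation
% is lag 0, index N+m is lag m.
r_full = conv(x, flipud(x)) / N;
r = r_full(N:N+P);

rS = r(2:P+1);                               % lags 1..P
RS = toeplitz(r(1:P));                       % lags 0..P-1, symmetric Toeplitz

coef = RS \ rS;                              % estimated LPC coefficients
err  = norm(coef - a_true);                  % expected to be very small
```

For real speech frames the recovered coefficients will not match any "true" filter exactly, so a comparison against reference coefficients is the more relevant check there.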
One very effective application of LPC is the compression of speech signals. For example, an LPC vocoder (voice-coder) is a system used in many telephone systems to reduce the bit rate for the transmission of speech. This system has two overall components: an analysis section which computes signal parameters (gain, filter coefficients, etc.), and a synthesis section which reconstructs the speech signal after transmission.
Since we have introduced the speech model in "A Speech Model", and the estimation of LPC coefficients in "Linear Predictive Coding", we now have all the tools necessary to implement a simple vocoder. First, in the analysis section, the original speech signal will be split into short time frames. For each frame, we will compute the signal energy, the LPC coefficients, and determine whether the segment is voiced or unvoiced.
Download the file phrase.au. This speech signal is sampled at a rate of 8000 Hz.
Use your mylpc function to compute order-15 LPC coefficients
for each frame. Place each set of coefficients into a column of a matrix A, so that A has 15 rows and one column per frame.
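The framing and per-frame energy computation can be sketched as follows. The frame length of 240 samples and the sinusoidal stand-in signal are hypothetical values chosen for illustration; the actual frame length should be whatever the exercise specifies, and the real input would be the samples of phrase.au.

```matlab
fs = 8000;                               % sampling frequency in Hz
frame_len = 240;                         % hypothetical frame length in samples

% Stand-in signal (NOT the lab's speech file): 1000 samples of a 200 Hz tone.
sig = sin(2 * pi * 200 * (0:999)' / fs);

% Keep only as many samples as fill whole frames, one frame per column.
L = floor(length(sig) / frame_len);      % number of complete frames
S = reshape(sig(1:L * frame_len), frame_len, L);

% Energy of each frame: sum of squared samples down each column.
energy = sum(S .^ 2, 1);                 % 1-by-L row vector
```

Because reshape fills column by column, column k of S holds samples (k-1)*frame_len+1 through k*frame_len of the signal, so per-frame quantities can be computed column-wise.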
To see the reduction in data, add up the total number of bytes Matlab
uses to store the encoded speech in the arrays A, VU, and energy
(use the whos function).
Compute the compression ratio by dividing this by the number of bytes
Matlab uses to store the original speech signal.
Note that the compression ratio can be further improved by using a technique
called vector quantization on the LPC coefficients, and also by using
fewer bits to represent the gain and voiced/unvoiced indicator.
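A sketch of the byte count using whos is shown below. The array sizes are hypothetical stand-ins (100 frames of 240 samples each, all stored as doubles), and the variable name speech for the original signal is an assumption of this example.

```matlab
% Stand-in arrays with hypothetical sizes, all of class double.
L      = 100;
A      = zeros(15, L);          % LPC coefficients, one column per frame
VU     = zeros(1, L);           % voiced/unvoiced flags
energy = zeros(1, L);           % frame energies
speech = zeros(240 * L, 1);     % original signal (hypothetical name and length)

info_enc = whos('A', 'VU', 'energy');     % struct array, one entry per variable
encoded_bytes = sum([info_enc.bytes]);

info_orig = whos('speech');
original_bytes = info_orig.bytes;

ratio = encoded_bytes / original_bytes;   % encoded size relative to original
```

With these stand-in sizes the encoded data occupies 17 doubles per frame against 240 for the original, so the ratio is 17/240; storing VU in a smaller data type would lower it further.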
Now the computed parameters will be used to re-synthesize the phrase
using the model in Figure 1.
Similar to your exciteV function from "Synthesis of Voiced Speech",
create a function x=exciteUV(N)
which returns a length-N excitation for unvoiced speech, that is, a white noise sequence.
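A frame-by-frame synthesis loop following the speech model (excitation, all-pole filter, gain) might look like the sketch below. Everything numeric here is a hypothetical placeholder: the frame length, the pitch period, the second-order coefficients, and the VU and energy values. The anonymous functions stand in for exciteV and exciteUV function files, and the filter state is reset at each frame for simplicity.

```matlab
frame_len = 240;                          % hypothetical frame length in samples
Np        = 60;                           % hypothetical pitch period in samples

% Hypothetical encoded parameters for three frames.
A      = repmat([1.2; -0.8], 1, 3);       % one column of coefficients per frame
VU     = [1 0 1];                         % 1 = voiced, 0 = unvoiced
energy = [4 1 9];                         % target energy of each frame

exciteV  = @(N, Np) double(mod((0:N-1)', Np) == 0);   % impulse train
exciteUV = @(N) randn(N, 1);                          % white Gaussian noise

L   = length(VU);
out = zeros(frame_len * L, 1);            % preallocated output signal
for k = 1:L
    if VU(k) == 1
        x = exciteV(frame_len, Np);
    else
        x = exciteUV(frame_len);
    end
    seg = filter(1, [1; -A(:, k)], x);    % all-pole filter for this frame
    seg = seg * sqrt(energy(k) / sum(seg .^ 2));   % match the frame energy
    out((k - 1) * frame_len + (1:frame_len)) = seg;
end
```

Scaling by the square root of the energy ratio makes the sum of squares of each synthesized frame equal to the stored energy of the corresponding original frame.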
Listen to the original and synthesized phrase.
Can you recognize the synthesized version as coming from the same speaker?
What are some possible ways to improve the quality of the synthesized
speech?
Use subplot to plot the two speech signals in the same figure.
Hand in the following: