Correlation coefficients
Correlation, a measure of similarity between two signals, is
frequently used in the analysis of speech and other signals.
The cross-correlation between two discrete-time signals
$x[n]$ and $y[n]$ is defined as
$$r_{xy}[l] = \sum_{n=-\infty}^{\infty} x[n]\, y[n-l] \tag{1}$$
where $n$ is the sample index, and $l$ is the lag or time
shift between the two signals (Proakis and Manolakis,
pg. 120). Since speech signals are not stationary, we are
typically interested in the similarities between signals
only over a short time duration ($< 30$ ms). In this case,
the cross-correlation is computed only over a window of time
samples and for only a few time delays
$l = 0, 1, \ldots, P$.
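The windowed computation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the `start` and `length` window parameters, and the choice to treat samples of $y$ before index 0 as zero are all assumptions made for the example.

```python
def short_time_xcorr(x, y, start, length, max_lag):
    """Cross-correlation of x and y over one analysis window.

    Sums x[n] * y[n - l] for n in [start, start + length) and returns
    the values for lags l = 0, 1, ..., max_lag. Samples of y before
    index 0 are treated as zero (an assumption of this sketch).
    """
    r = []
    for lag in range(max_lag + 1):
        total = 0.0
        for n in range(start, start + length):
            m = n - lag
            if m >= 0:
                total += x[n] * y[m]
        r.append(total)
    return r


# x is y delayed by two samples, so the only nonzero value is at lag 2.
y = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
x = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
r_xy = short_time_xcorr(x, y, start=0, length=8, max_lag=3)
```

Because only lags $l \ge 0$ are evaluated, a peak appears only when $x$ lags behind $y$; swapping the arguments probes the opposite direction.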
Now consider the autocorrelation sequence $r_{ss}[l]$, which
describes the redundancy in the signal $s[n]$.
$$r_{ss}[l] = \frac{1}{N} \sum_{n=0}^{N-1} s[n]\, s[n-l] \tag{2}$$
where $s[n]$, $n = -P, -P+1, \ldots, N-1$ are the known
samples (see Figure 1) and the $\frac{1}{N}$ is a
normalizing factor.
Another related method of measuring the redundancy in a
signal is to compute its autocovariance
$$r_{ss}[l] = \frac{1}{N-1} \sum_{n=l}^{N-1} s[n]\, s[n-l] \tag{3}$$
where the summation is over $N - l$ products (the samples
$s[-P], \ldots, s[-1]$ are ignored).
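To make the difference between the two estimates concrete, here is a small Python sketch of Equations 2 and 3. The buffer layout is an assumption of the example: `buf` stores the known samples $s[-P], \ldots, s[N-1]$ in order, so $s[n]$ lives at `buf[n + P]`. The function names are hypothetical.

```python
def autocorrelation(buf, P, lag):
    """Equation 2: uses all N products, reaching back into s[-P..-1]."""
    N = len(buf) - P
    total = sum(buf[n + P] * buf[n + P - lag] for n in range(N))
    return total / N


def autocovariance(buf, P, lag):
    """Equation 3: uses only n = lag, ..., N-1, so s[-P..-1] is ignored."""
    N = len(buf) - P
    total = sum(buf[n + P] * buf[n + P - lag] for n in range(lag, N))
    return total / (N - 1)


# s[-2], s[-1], s[0], s[1], s[2], s[3] with P = 2 and N = 4
buf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r1_corr = autocorrelation(buf, P=2, lag=1)   # (3*2 + 4*3 + 5*4 + 6*5) / 4
r1_cov = autocovariance(buf, P=2, lag=1)     # (4*3 + 5*4 + 6*5) / 3
```

Both functions assume `lag <= P`; larger lags would index before the start of the buffer in the autocorrelation case.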
Linear prediction model
Linear prediction is a good tool for the analysis of speech
signals. It models the human vocal tract as an infinite
impulse response (IIR) system that produces the speech
signal.
For vowel sounds and other voiced regions of speech, which
have a resonant structure and high degree of similarity over
time shifts that are multiples of their pitch period, this
modeling produces an efficient representation of the
sound.
Figure 2 shows how the resonant
structure of a vowel could be captured by an IIR system.
The linear prediction problem can be stated as finding the
coefficients $a_k$ which result in the best prediction
(which minimizes mean-squared prediction error) of the
speech sample $s[n]$ in terms of the past samples
$s[n-k]$, $k = 1, \ldots, P$. The predicted sample
$\hat{s}[n]$ is then given by (Rabiner and Juang)
$$\hat{s}[n] = \sum_{k=1}^{P} a_k\, s[n-k] \tag{4}$$
where $P$ is the number of past samples of $s[n]$ which we
wish to examine.
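As a rough illustration of Equation 4, the following Python sketch forms the predicted sample and the mean-squared prediction error for a given coefficient set. The function names are hypothetical, and the error is averaged only over samples that have $P$ predecessors inside the list, which is a simplification for the example.

```python
def predict(samples, a, n):
    """Equation 4: predict s[n] from s[n-1], ..., s[n-P]."""
    P = len(a)
    return sum(a[k - 1] * samples[n - k] for k in range(1, P + 1))


def mean_squared_error(samples, a):
    """Average squared prediction error over n = P, ..., len(samples) - 1."""
    P = len(a)
    errors = [samples[n] - predict(samples, a, n)
              for n in range(P, len(samples))]
    return sum(e * e for e in errors) / len(errors)


# A linear ramp satisfies s[n] = 2 s[n-1] - s[n-2] exactly.
ramp = [1.0, 2.0, 3.0, 4.0, 5.0]
mse_exact = mean_squared_error(ramp, [2.0, -1.0])   # perfect predictor
mse_hold = mean_squared_error(ramp, [1.0, 0.0])     # "repeat last sample"
```

The second coefficient set is always off by one step of the ramp, so its error is larger; the linear prediction problem is the search for the set with the smallest such error.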
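Next we derive the frequency response of the system in terms
of the prediction coefficients $a_k$. In Equation 4, when
the predicted sample equals the actual signal (i.e.,
$\hat{s}[n] = s[n]$), we have
$$s[n] = \sum_{k=1}^{P} a_k\, s[n-k]$$
$$S(z) = \sum_{k=1}^{P} a_k\, S(z)\, z^{-k}$$
$$S(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}} \tag{5}$$
The last line describes the all-pole system that generates
$s[n]$ when it is driven by a unit impulse, which is the
excitation used for synthesis below.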
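Evaluating Equation 5 on the unit circle, $z = e^{j\omega}$, gives the frequency response. The sketch below does this with Python's standard `cmath` module; the function name and the two-pole example (pole radius 0.9, pole angle $\pi/4$) are illustrative assumptions, not values from the text.

```python
import cmath
import math


def lpc_frequency_response(a, omega):
    """Evaluate 1 / (1 - sum_k a_k e^{-j omega k}) at frequency omega (rad/sample)."""
    denom = 1.0 - sum(a_k * cmath.exp(-1j * omega * k)
                      for k, a_k in enumerate(a, start=1))
    return 1.0 / denom


# A pole pair at radius r_pole and angle theta gives a_1 = 2 r cos(theta), a_2 = -r^2.
r_pole, theta = 0.9, math.pi / 4
a = [2 * r_pole * math.cos(theta), -r_pole ** 2]
gains = [abs(lpc_frequency_response(a, w)) for w in (0.0, theta, math.pi)]
# The gain at omega = theta is the largest of the three: a single resonance.
```

A vowel would typically need several such pole pairs, one per resonance, which is why $P$ is usually larger than 2 in practice.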
The optimal solution to this problem is (Rabiner and Juang)
$$\mathbf{a} = \begin{pmatrix} a_1 & a_2 & \cdots & a_P \end{pmatrix}^T$$
$$\mathbf{r} = \begin{pmatrix} r_{ss}[1] & r_{ss}[2] & \cdots & r_{ss}[P] \end{pmatrix}^T$$
$$\mathbf{R} = \begin{pmatrix}
r_{ss}[0] & r_{ss}[1] & \cdots & r_{ss}[P-1] \\
r_{ss}[1] & r_{ss}[0] & \cdots & r_{ss}[P-2] \\
\vdots & \vdots & \vdots & \vdots \\
r_{ss}[P-1] & r_{ss}[P-2] & \cdots & r_{ss}[0]
\end{pmatrix}$$
$$\mathbf{a} = \mathbf{R}^{-1} \mathbf{r} \tag{6}$$
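As a small worked case, assuming $P = 2$ and $r_{ss}[0]^2 \neq r_{ss}[1]^2$ so that the matrix is invertible, Equation 6 can be written out by hand:

```latex
\begin{pmatrix} a_1 \\ a_2 \end{pmatrix}
= \begin{pmatrix} r_{ss}[0] & r_{ss}[1] \\ r_{ss}[1] & r_{ss}[0] \end{pmatrix}^{-1}
  \begin{pmatrix} r_{ss}[1] \\ r_{ss}[2] \end{pmatrix}
= \frac{1}{r_{ss}[0]^2 - r_{ss}[1]^2}
  \begin{pmatrix}
    r_{ss}[0]\, r_{ss}[1] - r_{ss}[1]\, r_{ss}[2] \\
    r_{ss}[0]\, r_{ss}[2] - r_{ss}[1]^2
  \end{pmatrix}
```

For example, with the illustrative values $r_{ss}[0] = 2$, $r_{ss}[1] = 1$, $r_{ss}[2] = 0.2$, this gives $a_1 = 1.8 / 3 = 0.6$ and $a_2 = -0.6 / 3 = -0.2$.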
Due to the Toeplitz property of the $\mathbf{R}$ matrix (it
is symmetric with equal diagonal elements), an efficient
algorithm is available for computing $\mathbf{a}$ without
the computational expense of finding $\mathbf{R}^{-1}$. The
Levinson-Durbin algorithm is an iterative method of
computing the predictor coefficients $\mathbf{a}$ (Rabiner
and Juang, p. 115).
Initial Step: $E_0 = r_{ss}[0]$, $i = 1$

for $i = 1$ to $P$.

Steps:
- $k_i = \frac{1}{E_{i-1}} \left( r_{ss}[i] - \sum_{j=1}^{i-1} \alpha_{j,i-1}\, r_{ss}[|i-j|] \right)$
- $\alpha_{j,i} = \alpha_{j,i-1} - k_i\, \alpha_{i-j,i-1}$, $j = 1, \ldots, i-1$
- $\alpha_{i,i} = k_i$
- $E_i = (1 - k_i^2)\, E_{i-1}$

After the final iteration, the predictor coefficients are
$a_j = \alpha_{j,P}$, $j = 1, \ldots, P$.
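The recursion above can be sketched in plain Python as follows. This is a minimal version under stated assumptions: the input list `r` holds $r_{ss}[0], \ldots, r_{ss}[P]$, the function name and return layout are choices made for the example, and no guard is included for a zero prediction error $E_{i-1}$.

```python
def levinson_durbin(r, P):
    """Levinson-Durbin recursion on autocorrelation values r[0], ..., r[P].

    Returns (a, k, E): predictor coefficients a_1..a_P (the alpha_{j,P}),
    reflection coefficients k_1..k_P, and the final prediction error E_P.
    """
    alpha = [0.0] * (P + 1)        # alpha[j] holds alpha_{j,i-1}; index 0 unused
    k = [0.0] * (P + 1)
    E = r[0]                       # E_0
    for i in range(1, P + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k[i] = acc / E
        new_alpha = alpha[:]       # keep alpha_{.,i-1} intact while updating
        for j in range(1, i):
            new_alpha[j] = alpha[j] - k[i] * alpha[i - j]
        new_alpha[i] = k[i]
        alpha = new_alpha
        E = (1.0 - k[i] ** 2) * E
    return alpha[1:], k[1:], E


a, k, E = levinson_durbin([2.0, 1.0, 0.2], P=2)
```

The copy into `new_alpha` matters: updating `alpha` in place would mix coefficients of order $i$ and $i-1$ inside the same pass.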
LPC-based synthesis
It is possible to use the prediction coefficients to
synthesize the original sound by applying $\delta[n]$, the
unit impulse, to the IIR system with lattice coefficients
$k_i$, $i = 1, \ldots, P$ as shown in Figure 3. Applying
$\delta[n]$ to consecutive IIR systems (which represent
consecutive speech segments) yields a longer segment of
synthesized speech.
In this application, lattice filters are used rather than
direct-form filters since the lattice filter coefficients
have magnitude less than one and, conveniently, are
available directly as a result of the Levinson-Durbin
algorithm. If a direct-form implementation is desired
instead, the $\alpha$ coefficients must be factored into second-order
stages with very small gains to yield a more stable
implementation.
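A lattice synthesis filter driven by the reflection coefficients might look like the sketch below. The sign convention is an assumption: it is chosen so that the lattice matches the predictor $\hat{s}[n] = \sum_k a_k s[n-k]$ with $\alpha_{i,i} = k_i$ as in the recursion above, and the figure may draw the signs differently. The function name and the `state` argument are hypothetical.

```python
def lattice_synthesize(k, excitation, state=None):
    """All-pole lattice filter with reflection coefficients k_1..k_P.

    Per sample, with e_P[n] = input and s[n] = e_0[n] = b_0[n]:
        e_{i-1}[n] = e_i[n] + k_i * b_{i-1}[n-1]
        b_i[n]     = b_{i-1}[n-1] - k_i * e_{i-1}[n]
    Returns (output samples, final delay states).
    """
    P = len(k)
    b = list(state) if state is not None else [0.0] * P   # b[i] = b_i[n-1]
    out = []
    for x in excitation:
        e = x
        new_b = [0.0] * (P + 1)
        for i in range(P, 0, -1):
            e = e + k[i - 1] * b[i - 1]
            new_b[i] = b[i - 1] - k[i - 1] * e
        new_b[0] = e
        b = new_b[:P]
        out.append(e)
    return out, b


# Impulse response for k = (0.5, -0.2), equivalent to a = (0.6, -0.2).
h, _ = lattice_synthesize([0.5, -0.2], [1.0, 0.0, 0.0, 0.0])
```

With these coefficients the output agrees with the direct-form recursion $s[n] = 0.6\, s[n-1] - 0.2\, s[n-2] + \delta[n]$, which is a quick way to check the sign convention against a given figure or library.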
When each segment of speech is synthesized in this manner,
two problems occur. First, the synthesized speech is
monotonous, containing no changes in pitch, because the
$\delta[n]$'s, which represent pulses of air from the vocal
cords, occur with fixed periodicity equal to the analysis
segment length; in normal speech, we vary the frequency of
air pulses from our vocal cords to change pitch. Second,
the states of the lattice filter (i.e.,
past samples stored in the delay boxes) are cleared at the
beginning of each segment, causing discontinuity in the
output.
To estimate the pitch, we look at the autocorrelation
coefficients of each segment. A large peak in the
autocorrelation coefficient at lag $l \neq 0$ implies the
speech segment is periodic (or, more often, approximately
periodic) with period $l$. In synthesizing these segments,
we recreate the periodicity by using an impulse train as
input and varying the delay between impulses according to
the pitch period.
If the speech segment does not have a large peak in the
autocorrelation coefficients, then the segment is an
unvoiced signal which has no periodicity. Unvoiced segments
such as consonants are best reconstructed by using noise
instead of an impulse train as input.
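The voiced/unvoiced decision and the matching excitation can be sketched as follows. Several details are assumptions made for illustration and do not come from the text: the peak is judged relative to $r_{ss}[0]$ with a threshold of 0.3, lags below `min_lag` are skipped so the main lobe near $l = 0$ is not mistaken for a pitch peak, and unvoiced segments use Gaussian noise from Python's `random` module.

```python
import random


def estimate_pitch(r, min_lag, threshold=0.3):
    """Return the lag of the largest autocorrelation peak, or None if unvoiced.

    r[l] is the autocorrelation at lag l. Both min_lag and the threshold on
    r[l] / r[0] are illustrative tuning choices.
    """
    if r[0] <= 0.0:
        return None
    best = max(range(min_lag, len(r)), key=lambda lag: r[lag])
    return best if r[best] / r[0] >= threshold else None


def make_excitation(length, pitch_period, seed=0):
    """Impulse train for a voiced segment, white noise for an unvoiced one."""
    if pitch_period is None:
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(length)]
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


# Autocorrelation of a pulse train with period 5 (values for lags 0..12).
r_voiced = [8.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 0.0]
period = estimate_pitch(r_voiced, min_lag=2)
excitation = make_excitation(10, period)
```

This version restarts the impulse train at the beginning of every segment; a fuller implementation would also carry the position of the last impulse across segment boundaries.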
To reduce the discontinuity between segments, do not clear
the states of the IIR model from one segment to the next.
Instead, load the new set of reflection coefficients,
$k_i$, and continue with the lattice filter computation.
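One way to organize this is to keep the delay states in an object that outlives each segment. The class below is a hypothetical sketch; its name, the `load` and `run` methods, and the lattice sign convention (chosen to match $\alpha_{i,i} = k_i$ in the Levinson-Durbin recursion) are assumptions of the example.

```python
class LatticeSynthesizer:
    """All-pole lattice filter whose delay states persist between segments."""

    def __init__(self, order):
        self.k = [0.0] * order
        self.b = [0.0] * order     # b[i] holds b_i[n-1]; load() never clears it

    def load(self, k):
        """Switch to a new segment's reflection coefficients, keeping the states."""
        if len(k) != len(self.b):
            raise ValueError("coefficient count must match the filter order")
        self.k = list(k)

    def run(self, excitation):
        """Filter one segment of excitation and return the synthesized samples."""
        P = len(self.k)
        out = []
        for x in excitation:
            e = x
            new_b = [0.0] * (P + 1)
            for i in range(P, 0, -1):
                e += self.k[i - 1] * self.b[i - 1]
                new_b[i] = self.b[i - 1] - self.k[i - 1] * e
            new_b[0] = e
            self.b = new_b[:P]
            out.append(e)
        return out


synth = LatticeSynthesizer(order=2)
synth.load([0.5, -0.2])
first = synth.run([1.0, 0.0, 0.0])
synth.load([0.3, 0.0])             # new segment, states carried over
second = synth.run([0.0])
```

Here `second` is nonzero even though its excitation is zero, because the filter is still ringing from the previous segment; a filter with cleared states would output zero and produce the discontinuity described above.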