Now that we know how to acquire analog signals for digital processing
(pre-filtering, sampling, and A/D conversion) and how to compute
spectra of discrete-time signals (using the FFT algorithm), let's put
these components together to learn how the spectrogram shown in
Figure 1, which is used to analyze speech, is calculated. The speech
was sampled at a rate of 11.025 kHz and passed through a 16-bit A/D
converter.
Point of interest: Music compact discs (CDs) encode their signals at a
sampling rate of 44.1 kHz. We'll learn the rationale for this
number later. The 11.025 kHz sampling rate for the speech is
1/4 of the CD sampling rate, and was the lowest sampling rate
available on my computer that was commensurate with speech signal
bandwidths.
Problem 1
Looking at Figure 1, the
signal lasted a little over 1.2 seconds. How long was the
sampled signal (in terms of samples)? What was the datarate
during the sampling process in bps (bits per second)?
Assuming the computer storage is organized in terms of bytes
(8-bit quantities), how many bytes of computer memory does
the speech consume?
Solution 1
The number of samples equals $1.2 \times 11025 = 13230$. The datarate is
$11025 \times 16 = 176.4$ kbps. The storage required would be
$26460$ bytes.
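The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the 1.2 s duration is the rounded figure used in the solution rather than the exact length of the recording.

```python
# Values quoted in the text; 1.2 s is the rounded duration used in the solution.
duration_s = 1.2
sample_rate_hz = 11025
bits_per_sample = 16

# Number of samples: duration times sampling rate (rounded to a whole sample).
num_samples = round(duration_s * sample_rate_hz)

# Datarate during sampling: samples per second times bits per sample.
data_rate_bps = sample_rate_hz * bits_per_sample

# Storage: each 16-bit sample occupies two 8-bit bytes.
storage_bytes = num_samples * bits_per_sample // 8

print(num_samples, "samples")
print(data_rate_bps / 1000, "kbps")
print(storage_bytes, "bytes")
```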
The resulting discrete-time signal, shown in the bottom of
Figure 1, clearly changes its
character with time. To display these spectral changes, the
long signal was sectioned into
frames:
comparatively short, contiguous groups of samples.
Conceptually, a Fourier transform of each frame is calculated
using the FFT. Each frame is not so long that significant
signal variations are retained within a frame, but not so short
that we lose the signal's spectral character. Roughly speaking, the
speech signal's spectrum is evaluated over successive time segments
and stacked side by side so that the $x$-axis corresponds to time and
the $y$-axis to frequency, with color indicating the spectral amplitude.
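The sectioning just described can be sketched in Python with NumPy. The function name `frame_spectra`, the frame length of 256, and the two-tone test signal are assumptions made for illustration; they are not taken from the speech example. Frames here are contiguous and unweighted, exactly as in the description above.

```python
import numpy as np

def frame_spectra(x, frame_len):
    """Cut x into contiguous frames and return the FFT magnitude of each.

    Rows of the result index frequency and columns index frame (time),
    matching the spectrogram layout described in the text.
    """
    num_frames = len(x) // frame_len              # a trailing partial frame is dropped
    frames = np.reshape(x[:num_frames * frame_len], (num_frames, frame_len))
    spectra = np.abs(np.fft.rfft(frames, axis=1)) # one spectrum per frame
    return spectra.T

# Hypothetical test signal: one second at 11.025 kHz whose tone jumps
# from 500 Hz to 2000 Hz halfway through.
fs = 11025
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

S = frame_spectra(x, frame_len=256)
bin_hz = fs / 256                                  # frequency spacing of the FFT bins
print(S.shape)
print(S[:, 0].argmax() * bin_hz, S[:, -1].argmax() * bin_hz)
```

Plotting `S` as an image would show the spectral peak moving from one frequency to the other partway along the time axis.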
An important detail emerges when we examine each framed signal
(
Figure 2).
At the frame's edges, the
signal may change very abruptly, a feature not present in the
original signal. A transform of such a segment reveals a
curious oscillation in the spectrum, an artifact directly
related to this sharp amplitude change. A better way to frame
signals for spectrograms is to apply a
window:
Shape the signal values within a frame so that the signal decays
gracefully as it nears the edges. This shaping is accomplished
by multiplying the framed signal by the sequence $w(n)$. In sectioning
the signal, we essentially applied a rectangular window: $w(n) = 1$,
$0 \le n \le N-1$. A much more graceful window is the Hanning window;
it has the cosine shape
$w(n) = \frac{1}{2}\left(1 - \cos\left(\frac{2\pi n}{N}\right)\right)$.
As shown in
Figure 2, this
shaping greatly reduces spurious oscillations in each frame's
spectrum. Considering the spectrum of the Hanning windowed
frame, we find that the oscillations resulting from applying the
rectangular window obscured a formant (the one located at a
little more than half the Nyquist frequency).
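The effect of the window on a single frame can be reproduced numerically. The sketch below uses a synthetic sinusoid rather than speech, and the frame length, tone frequency, and the helper `leakage_db` are all illustrative assumptions. The tone is placed between FFT bins so that the frame edges cut it mid-cycle, which is what produces the abrupt change at the edges.

```python
import numpy as np

N = 256
n = np.arange(N)

# Hanning window as defined in the text (denominator N, not N - 1).
hann = 0.5 * (1 - np.cos(2 * np.pi * n / N))

# Synthetic frame: 10.5 cycles across the frame, so the edges are abrupt.
frame = np.sin(2 * np.pi * 10.5 * n / N)

rect_spec = np.abs(np.fft.rfft(frame))         # rectangular window: frame used as is
hann_spec = np.abs(np.fft.rfft(frame * hann))  # Hanning-windowed frame

def leakage_db(spec, far_bin=60):
    """Spectral level at a bin far from the tone, relative to the peak, in dB."""
    return 20 * np.log10(spec[far_bin] / spec.max())

print("rectangular:", leakage_db(rect_spec), "dB")
print("Hanning:    ", leakage_db(hann_spec), "dB")
```

The level far from the tone is much lower for the Hanning-windowed frame, which is why a weaker spectral feature can be hidden under the rectangular window's oscillations.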
Problem 2
What might be the source of these oscillations? To gain
some insight, what is the length-$2N$ discrete Fourier transform of a
length-$N$ pulse? The pulse emulates the rectangular window, and
certainly has edges. Compare your answer with the length-$2N$
transform of a length-$N$ Hanning window.
Solution 2
The oscillations are due to the boxcar window's Fourier
transform, which equals the sinc function.
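The comparison asked for in the problem can be carried out numerically. In the sketch below, $N = 64$ is an arbitrary choice. For a finite-length discrete-time pulse, the magnitude of the length-$2N$ transform is the ratio $|\sin(\pi k/2)/\sin(\pi k/(2N))|$, which is the discrete-time counterpart of the sinc function named above; the code checks this closed form against the FFT.

```python
import numpy as np

N = 64                      # arbitrary pulse length chosen for illustration
n = np.arange(N)
rect = np.ones(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / N))

# Passing n=2*N zero-pads each window to length 2N before transforming.
R = np.abs(np.fft.fft(rect, n=2 * N))
H = np.abs(np.fft.fft(hann, n=2 * N))

# Closed form for the pulse's transform magnitude at k = 1, ..., 2N - 1.
k = np.arange(1, 2 * N)
closed_form = np.abs(np.sin(np.pi * k / 2) / np.sin(np.pi * k / (2 * N)))

# Relative sidelobe level at an odd index, where the pulse's transform peaks.
print("pulse:  ", R[9] / R[0])
print("Hanning:", H[9] / H[0])
```

The pulse's transform alternates between zero (even $k$) and slowly decaying peaks (odd $k$), while the Hanning window's transform falls off far more quickly.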
If we examine the windowed signal sections in sequence to
see windowing's effect on signal amplitude, we find that we
have managed to amplitude-modulate the signal with the
periodically repeated window (Figure 3). To alleviate this problem,
frames are
overlapped (typically by half a frame duration). This solution
requires more Fourier transform calculations than needed by
rectangular windowing, but the spectra are much better behaved
and spectral changes are much better captured.
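A short numerical check shows why half-frame overlap helps. The sketch assumes the Hanning window defined earlier with $N = 256$; the helper `window_envelope` and the number of frames are illustrative. It adds up the windows as they are laid along the signal, giving the total weight each sample receives.

```python
import numpy as np

N = 256
n = np.arange(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / N))

def window_envelope(hop, num_frames=8):
    """Sum of Hanning windows placed every `hop` samples."""
    env = np.zeros(hop * (num_frames - 1) + N)
    for m in range(num_frames):
        env[m * hop : m * hop + N] += hann
    return env

no_overlap = window_envelope(hop=N)          # contiguous frames
half_overlap = window_envelope(hop=N // 2)   # frames overlapped by half

# Ignore the first and last half frame, which only one window covers.
interior = half_overlap[N // 2 : -(N // 2)]

print(no_overlap.min(), no_overlap.max())
print(interior.min(), interior.max())
```

Without overlap the total weight swings between 0 and 1 once per frame, which is the amplitude modulation described above. With half-frame overlap the weights of adjacent windows add to a constant, so every interior sample is treated equally.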
The speech signal, such as the one shown in the
speech spectrogram, is sectioned into
overlapping, equal-length frames, with a Hanning window applied
to each frame. The spectrum of each frame is calculated and
displayed in the spectrogram with frequency extending vertically,
window time location running horizontally, and spectral
magnitude color-coded.
Figure 4 illustrates these computations.
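The whole computation can be sketched as one short function. This is a minimal illustration under stated assumptions, not the code used to produce the figures: the function name `spectrogram` is hypothetical, the frame length 256 and transform length 512 are the values quoted in Problem 3, the overlap is half a frame, and a synthetic chirp stands in for the speech recording.

```python
import numpy as np

def spectrogram(x, N=256, K=512):
    """Magnitude spectrogram: Hanning-windowed, half-overlapped frames.

    Returns an array with K//2 + 1 rows (frequency) and one column per frame.
    """
    hop = N // 2                                     # half-frame overlap
    n = np.arange(N)
    window = 0.5 * (1 - np.cos(2 * np.pi * n / N))   # Hanning window from the text
    num_frames = 1 + (len(x) - N) // hop
    columns = []
    for m in range(num_frames):
        frame = x[m * hop : m * hop + N] * window
        # n=K makes the FFT routine compute a length-K transform of the frame.
        columns.append(np.abs(np.fft.rfft(frame, n=K)))
    return np.array(columns).T

# Stand-in signal: 13230 samples (about 1.2 s at 11.025 kHz) of a tone
# whose frequency rises steadily from 300 Hz.
fs = 11025
t = np.arange(13230) / fs
x = np.sin(2 * np.pi * (300 * t + 1500 * t ** 2))

S = spectrogram(x)
print(S.shape)
print(S[:, 0].argmax(), S[:, -1].argmax())   # peak bin in the first and last frame
```

Displaying `S` as an image, with rows as frequency and columns as time, gives the layout described above.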
Problem 3
Why the specific values of 256 for $N$ and 512 for $K$? Another
issue: how was the length-512 transform of each length-256 windowed
frame computed?
Solution 3
These numbers are powers of two, so the FFT algorithm can
be exploited with these lengths. To compute a transform
longer than the input signal's duration, we simply
zero-pad the signal.
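Zero-padding can be demonstrated in a few lines. The sketch below uses a random stand-in frame (an assumption made only so the example is self-contained) and shows that appending 256 zeros before a length-512 FFT is the same as asking NumPy's FFT routine for a length-512 transform through its `n` argument.

```python
import numpy as np

N, K = 256, 512
rng = np.random.default_rng(0)
frame = rng.standard_normal(N)       # stand-in for one windowed frame

# Explicit zero-padding: append K - N zeros, then transform.
padded = np.concatenate([frame, np.zeros(K - N)])
X_padded = np.fft.fft(padded)

# The same result, letting the FFT routine do the padding.
X_auto = np.fft.fft(frame, n=K)

# The unpadded length-N transform, for comparison.
X_short = np.fft.fft(frame)

print(np.allclose(X_padded, X_auto))
print(np.allclose(X_padded[::2], X_short))
```

Every other value of the length-512 transform coincides with the length-256 transform, so zero-padding adds no new information; it samples the same underlying spectrum on a finer frequency grid.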