Summary: Spectrograms visually represent the speech signal; the calculation of the spectrogram is briefly explained.

Now that we know how to acquire analog signals for digital processing (pre-filtering, sampling, and A/D conversion) and how to compute spectra of discrete-time signals (using the FFT algorithm), let's put these components together to learn how the spectrogram shown in Figure 1, which is used to analyze speech, is calculated. The speech was sampled at a rate of 11.025 kHz and passed through a 16-bit A/D converter.

Music compact discs (CDs) encode their signals at a
sampling rate of 44.1 kHz. We'll learn the rationale for this
number later. The 11.025 kHz sampling rate for the speech is
1/4 of the CD sampling rate, and was the lowest sampling rate
available on my computer that is commensurate with speech
signal bandwidths.

Looking at Figure 1, the signal lasted a little over 1.2 seconds. How long was the sampled signal (in terms of samples)? What was the data rate during the sampling process in bps (bits per second)? Assuming the computer storage is organized in terms of bytes (8-bit quantities), how many bytes of computer memory does the speech consume?

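Number of samples equals 1.2 × 11025 = 13230. The data rate is 11025 × 16 = 176,400 bps (176.4 kbps). At two bytes per sample, the storage required is 13230 × 2 = 26460 bytes.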
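
The arithmetic for this exercise can be checked with a short Python sketch. The sampling rate and word length come from the text; the duration of exactly 1.2 s is an assumption, since the text only says the signal lasted "a little over 1.2 seconds".

```python
# Values from the text, except duration_s, which is rounded to 1.2 s (assumption).
duration_s = 1.2
sample_rate_hz = 11025
bits_per_sample = 16

num_samples = round(duration_s * sample_rate_hz)         # samples in the recording
data_rate_bps = sample_rate_hz * bits_per_sample         # bits produced per second
storage_bytes = num_samples * (bits_per_sample // 8)     # 8-bit bytes needed

print(num_samples, data_rate_bps, storage_bytes)
```

Changing `duration_s` or `sample_rate_hz` shows how quickly storage grows at higher rates, such as the 44.1 kHz used for CDs.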

Speech Spectrogram |
---|

The resulting discrete-time signal, shown at the bottom of Figure 1, clearly changes its
character with time. To display these spectral changes, the
long signal was sectioned into frames:
comparatively short, contiguous groups of samples.
Conceptually, a Fourier transform of each frame is calculated
using the FFT. Each frame is not so long that significant
signal variations are retained within a frame, but not so short
that we lose the signal's spectral character. Roughly speaking, the speech signal's spectrum is evaluated over successive time segments and stacked side by side so that the horizontal axis corresponds to time and the vertical axis to frequency, with color indicating spectral magnitude.
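
As a minimal sketch of the framing idea, the following Python code sections a signal into contiguous frames and finds the strongest frequency bin in each. The test signal, the frame length of 16, and the helper names `dft` and `split_into_frames` are illustrative assumptions, and a direct DFT sum stands in for the FFT (which computes the same values faster) to keep the example self-contained.

```python
import cmath
import math

def dft(frame):
    """Direct DFT sum; an FFT would return the same values more efficiently."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def split_into_frames(signal, frame_len):
    """Section the signal into contiguous frames, dropping any partial last frame."""
    count = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(count)]

frame_len = 16
# Hypothetical signal whose frequency changes halfway through.
signal = ([math.cos(2 * math.pi * 2 * n / frame_len) for n in range(32)] +
          [math.cos(2 * math.pi * 5 * n / frame_len) for n in range(32)])

peak_bins = []
for frame in split_into_frames(signal, frame_len):
    mags = [abs(v) for v in dft(frame)[:frame_len // 2 + 1]]
    peak_bins.append(mags.index(max(mags)))

print(peak_bins)  # [2, 2, 5, 5]: the spectral peak moves when the signal changes
```

Each frame's spectrum becomes one column of the display, so the change in `peak_bins` is what appears as a change in the spectrogram over time.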

An important detail emerges when we examine each framed signal (Figure 2).

Spectrogram Hanning vs. Rectangular |
---|

What might be the source of these oscillations? To gain
some insight, what is the
length-

The oscillations are due to the boxcar window's Fourier transform, which is a sinc-like function (the periodic, or discrete-time, sinc) with oscillating sidelobes.
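
To make the comparison concrete, the sketch below computes a zero-padded transform of a boxcar window and of a Hanning window and reports the highest sidelobe relative to the main peak. The window length of 32, the transform length of 512, the particular Hanning formula, and the helper names are assumptions chosen for illustration, not values from the text.

```python
import cmath
import math

def dft_magnitude(x, K):
    """Magnitudes of bins 0..K/2 of the length-K DFT of x (x implicitly zero-padded)."""
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * n / K) for n, v in enumerate(x)))
            for k in range(K // 2 + 1)]

def peak_sidelobe_db(window, K=512):
    """Highest sidelobe level in dB relative to the main-lobe peak at bin 0."""
    mag = dft_magnitude(window, K)
    k = 0
    while k + 1 < len(mag) and mag[k + 1] < mag[k]:
        k += 1  # walk down the main lobe to its first minimum
    return 20 * math.log10(max(mag[k:]) / mag[0])

N = 32  # illustrative window length (assumption)
boxcar = [1.0] * N
# One common definition of the Hanning window (assumption).
hanning = [0.5 * (1 - math.cos(2 * math.pi * n / (N - 1))) for n in range(N)]

rect_db = peak_sidelobe_db(boxcar)
hann_db = peak_sidelobe_db(hanning)
print(f"boxcar:  {rect_db:.1f} dB")   # roughly -13 dB
print(f"Hanning: {hann_db:.1f} dB")   # substantially lower than the boxcar
```

The boxcar's large sidelobes are the oscillations seen in the spectrum; the Hanning window's tapered edges suppress them at the cost of a wider main lobe.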

Non-overlapping windows |
---|

If we examine the windowed signal sections in sequence to see windowing's effect on signal amplitude, we find that we have managed to amplitude-modulate the signal with the periodically repeated window (Figure 3). To alleviate this problem, frames are overlapped (typically by half a frame duration). This solution requires more Fourier transform calculations than needed by rectangular windowing, but the spectra are much better behaved and spectral changes are much better captured.
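
The effect of overlapping can be checked numerically. The sketch below adds up the window values that each sample receives, first with non-overlapping frames and then with half-frame overlap. The frame length of 8, the periodic form of the Hanning window, and the function names are assumptions made for illustration.

```python
import math

def hann_periodic(N):
    """Periodic Hanning window (assumed form): 0.5 * (1 - cos(2*pi*n/N))."""
    return [0.5 * (1 - math.cos(2 * math.pi * n / N)) for n in range(N)]

def window_sum(N, hop, total):
    """Total window weight applied to each of `total` samples for a given hop size."""
    w = hann_periodic(N)
    acc = [0.0] * total
    for start in range(0, total - N + 1, hop):
        for n in range(N):
            acc[start + n] += w[n]
    return acc

N, total = 8, 32
no_overlap = window_sum(N, N, total)         # frames placed end to end
half_overlap = window_sum(N, N // 2, total)  # frames overlapped by half a frame

# Ignore the first and last half frame, which only one window covers.
interior = half_overlap[N // 2: total - N // 2]

print(min(no_overlap), max(no_overlap))  # weight swings between 0 and 1
print(min(interior), max(interior))      # weight stays at 1 (to rounding error)
```

With end-to-end frames the weight drops to zero at every frame boundary, which is the amplitude modulation described above; with half-frame overlap, every interior sample is weighted equally.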

The speech signal, such as that shown in the speech spectrogram, is sectioned into overlapping, equal-length frames, with a Hanning window applied to each frame. The spectrum of each frame is calculated and displayed in the spectrogram with frequency extending vertically, window time location running horizontally, and spectral magnitude color-coded. Figure 4 illustrates these computations.
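
The whole procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used to produce the figures: the frame length of 256 and the 11.025 kHz rate come from the text, while the transform length of 512, the two-tone test signal, and the function names `fft` and `spectrogram` are hypothetical choices.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def spectrogram(signal, frame_len=256, fft_len=512):
    """Magnitude spectra of half-overlapped, Hanning-windowed frames (one list per frame)."""
    hop = frame_len // 2
    window = [0.5 * (1 - math.cos(2 * math.pi * n / frame_len)) for n in range(frame_len)]
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        frame += [0.0] * (fft_len - frame_len)  # zero-pad up to the transform length
        columns.append([abs(v) for v in fft(frame)[:fft_len // 2 + 1]])
    return columns

fs = 11025  # sampling rate from the text
# Hypothetical test signal: a 1000 Hz tone followed by a 3000 Hz tone.
signal = [math.sin(2 * math.pi * (1000 if n < 2048 else 3000) * n / fs)
          for n in range(4096)]

spec = spectrogram(signal)
first_peak = spec[0].index(max(spec[0]))
last_peak = spec[-1].index(max(spec[-1]))
print(len(spec), first_peak * fs / 512, last_peak * fs / 512)
```

Each entry of `spec` is one vertical strip of the spectrogram; plotting the strips side by side with magnitude mapped to color gives the display described above.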

Overlapping windows for computing spectrograms |
---|

Why the specific values of 256 for

These numbers are powers of two, so the FFT algorithm can be exploited with these lengths. To compute a transform longer than the input signal's duration, we simply zero-pad the signal.
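
A small sketch of zero-padding follows. It uses a direct DFT sum in place of the FFT so that it stays short (an FFT returns the same values), and the 8-sample frame is made-up data. Doubling the transform length by appending zeros samples the same spectrum twice as densely, so every other bin of the padded transform matches the unpadded one.

```python
import cmath
import math

def dft(x):
    """Direct DFT sum of a sequence."""
    K = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * n / K) for n, v in enumerate(x))
            for k in range(K)]

frame = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]  # hypothetical length-8 frame
unpadded = dft(frame)                        # length-8 transform
padded = dft(frame + [0.0] * len(frame))     # zero-padded to length 16

# Even-indexed bins of the padded transform coincide with the unpadded bins.
max_diff = max(abs(padded[2 * k] - unpadded[k]) for k in range(len(frame)))
print(len(padded), max_diff < 1e-9)  # 16 True
```

The odd-indexed bins are new samples of the same underlying spectrum, which is why zero-padding gives a smoother-looking display without adding information.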
