Human speech is built from two kinds of sound: voiced and unvoiced. For voiced sounds, which include the vowels, our throat acts like a transfer function. Unvoiced sounds are the "noisy" sounds of speech. They are made with the mouth and tongue (as opposed to the throat), such as the "f" sound, the "s" sound, and the "th" sound.
This way of looking at speech as two separable parts is known as the Source Filter Model of Speech or the Linear Separable Equivalent Circuit Model. Mathematically, the speech signal x(t) is described in the time domain as the convolution of the two parts, g(t) and h(t):
$$x(t) = \int_0^t g(\tau)\, h(t - \tau)\, d\tau$$
Since convolution in the time domain is multiplication in the frequency domain, this becomes:
$$X(\omega) = G(\omega)\, H(\omega)$$
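The step from convolution to multiplication can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: discrete-time signals, random stand-ins for g and h, and NumPy's FFT in place of the continuous Fourier transform.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(64)   # arbitrary stand-in for g (assumption)
h = rng.standard_normal(32)   # arbitrary stand-in for h (assumption)

# Time domain: linear convolution, length len(g) + len(h) - 1.
x = np.convolve(g, h)

# Frequency domain: zero-pad both signals to the full output length so
# the circular convolution implied by the DFT equals the linear one.
n = len(x)
X = np.fft.fft(x, n)
GH = np.fft.fft(g, n) * np.fft.fft(h, n)

# Largest disagreement between X and G*H; only rounding error remains.
max_error = np.max(np.abs(X - GH))
print(max_error)
```

The zero-padding matters: without it, the product of the two DFTs corresponds to a circular convolution rather than the linear one written in the integral.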
A familiar mathematical operation separates two multiplied quantities: the logarithm, which turns a product into a sum. Taking the log of the magnitude of both sides of the frequency-domain product gives:
$$\log|X(\omega)| = \log|G(\omega)| + \log|H(\omega)|$$
Computing the inverse Fourier Transform of this equation brings us into the realm of "quefrency."
$$\mathcal{F}^{-1}\{\log|X(\omega)|\} = \mathcal{F}^{-1}\{\log|G(\omega)|\} + \mathcal{F}^{-1}\{\log|H(\omega)|\}$$
Quefrency is the x-axis of the cepstrum, and it is measured in units of time. Typically the region of interest runs from 0 ms to around 10 ms. See Figure 1 below for the full process.
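As a rough illustration of the whole chain, the sketch below builds a toy signal and computes its cepstrum with NumPy. Everything specific here is an assumption made for the example rather than something taken from the text: the 8 kHz sample rate, the decaying pulse train standing in for g, the single damped resonance standing in for h, and the helper name real_cepstrum. The convolution is done circularly so that X = GH holds exactly for the DFT.

```python
import numpy as np

def real_cepstrum(signal):
    """Inverse FFT of the log-magnitude spectrum (hypothetical helper)."""
    spectrum = np.fft.fft(signal)
    log_magnitude = np.log(np.abs(spectrum))
    return np.fft.ifft(log_magnitude).real

fs = 8000          # sample rate in Hz (assumed)
n_samples = 1024
period = 64        # pulse spacing in samples: 8 ms at 8 kHz
n = np.arange(n_samples)

# Toy g: pulse train with geometrically decaying pulse heights, chosen
# so that its spectrum has no exact zeros (log of zero is undefined).
g = np.zeros(n_samples)
pulse_positions = np.arange(0, n_samples, period)
g[pulse_positions] = 0.6 ** np.arange(len(pulse_positions))

# Toy h: a single damped resonance near 700 Hz.
h = 0.8 ** n * np.cos(2 * np.pi * 700 / fs * n)

# x is the circular convolution of g and h, computed through the FFT.
x = np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)).real

c_x = real_cepstrum(x)
c_g = real_cepstrum(g)
c_h = real_cepstrum(h)

# The cepstrum of x should equal the sum of the two separate cepstra.
additivity_error = np.max(np.abs(c_x - (c_g + c_h)))

# Find the largest cepstral value between 2 ms and 10 ms of quefrency.
quefrency_ms = 1000 * n / fs
search = (quefrency_ms >= 2) & (quefrency_ms <= 10)
peak_index = np.flatnonzero(search)[np.argmax(c_x[search])]
peak_ms = quefrency_ms[peak_index]

print(additivity_error, peak_ms)
```

With these toy signals, the cepstrum of x matches the sum of the two individual cepstra to rounding error, and the largest value between 2 ms and 10 ms sits at 8 ms, the 64-sample spacing of the pulses. The contribution of h is concentrated at much lower quefrencies. Real speech frames would normally be windowed first, and a small floor is often added before the log to guard against zeros in the spectrum.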