Modeling the Speech Signal

Module by: Don Johnson

Summary: A model of the human vocal tract.

Figure 1: The vocal tract, shown in cross-section. Air pressure produced by the lungs forces air through the vocal cords, which, when under tension, produce puffs of air that excite resonances in the vocal and nasal cavities. Not shown are the brain and the musculature that control the entire speech production process.
Vocal Tract (vocaltract.png)

Figure 2: The systems model for the vocal tract. The signals l(t), p_T(t), and s(t) are the air pressure provided by the lungs, the periodic pulse output provided by the vocal cords, and the speech output, respectively. Control signals from the brain are shown entering the systems from the top. These all come from the same source, but for modeling purposes we describe them separately since they control different aspects of the speech signal.
Model of the Vocal Tract (sys7.png)

The information contained in the spoken word is conveyed by the speech signal. Because we shall analyze several speech transmission and processing schemes, we need to understand the speech signal's structure -- what's special about the speech signal -- and how we can describe and model speech production. This modeling effort consists of finding a systems description of how relatively unstructured signals, arising from simple sources, are given structure by passing them through an interconnection of systems to yield speech. For speech and for many other situations, system choice is governed by the physics underlying the actual production process. Because the fundamental equation of acoustics -- the wave equation -- applies here and is linear, we can use linear systems in our model with a fair amount of accuracy. The naturalness of linear system models for speech does not extend to other situations. In many cases, the underlying mathematics governed by the physics, biology, and/or chemistry of the problem is nonlinear, leaving linear systems models as approximations. Nonlinear models are far more difficult to understand at the current state of knowledge, and information engineers frequently prefer linear models because they provide a greater level of comfort, but not necessarily a sufficient level of accuracy.

Figure 1 shows the actual speech production system and Figure 2 shows the model speech production system. The characteristics of the model depend on whether you are saying a vowel or a consonant. We concentrate first on the vowel production mechanism. When the vocal cords are placed under tension by the surrounding musculature, air pressure from the lungs causes them to vibrate. To visualize this effect, take a rubber band and hold it in front of your lips. If held open when you blow through it, the air passes through more or less freely; this situation corresponds to "breathing mode." If held tautly and close together, blowing through the opening causes the sides of the rubber band to vibrate. (This effect works best with a wide rubber band.) You can imagine what the airflow is like on the opposite side of the rubber band or the vocal cords. Your lung power is the simple source referred to earlier; it can be modeled as a constant supply of air pressure. The vocal cords respond to this input by vibrating, which means the output of this system is some periodic function.

Exercise 1

Note that the vocal cord system takes a constant input and produces a periodic airflow that corresponds to its output signal. Is this system linear or nonlinear? Justify your answer.

Solution

If the glottis were linear, a constant input (a zero-frequency sinusoid) should yield a constant output. The periodic output indicates nonlinear behavior.
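The reasoning in this solution can be checked numerically: a linear time-invariant system must map a constant (zero-frequency) input to a constant output, scaled by the DC gain. A minimal sketch, using a 3-point moving average as an arbitrary stand-in LTI system (an illustrative choice, not a vocal-cord model):

```python
def moving_average(x):
    """Causal 3-point moving average of a sequence (zero initial conditions).
    This is an LTI system: it satisfies superposition and scaling."""
    padded = [0.0, 0.0] + list(x)
    return [(padded[n] + padded[n + 1] + padded[n + 2]) / 3.0
            for n in range(len(x))]

constant_input = [1.0] * 10
output = moving_average(constant_input)

# After the 2-sample start-up transient, the output settles to a constant,
# unlike the glottis, whose constant input yields a periodic output.
print(output[2:])
```

Because the vocal cords turn a constant input into a periodic one, no system of this (linear) form can model them.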

Singers modify vocal cord tension to change the pitch to produce the desired musical note. Vocal cord tension is governed by a control input to the musculature; in systems models we represent control inputs as signals coming into the top or bottom of the system. Certainly in the case of speech, and in many other cases as well, it is the control input that carries information, impressing it on the system's output. The change of signal structure resulting from varying the control input enables information to be conveyed by the signal, a process generically known as modulation. In singing, musicality is largely conveyed by pitch; in western speech, pitch is much less important. A sentence can be read in a monotone fashion without completely destroying the information expressed by the sentence. However, the difference between a statement and a question is frequently expressed by pitch changes. For example, note the sound differences between "Let's go to the park." and "Let's go to the park?"

For some consonants, the vocal cords vibrate just as in vowels. For example, the so-called nasal sounds "n" and "m" have this property. For others, the vocal cords do not produce a periodic output. Returning to the mechanism, when consonants such as "f" are produced, the vocal cords are placed under much less tension, which results in turbulent flow. The resulting output airflow is quite erratic, so much so that we describe it as noise. We define noise carefully later when we delve into communication problems.

The vocal cords' periodic output can be well described by the periodic pulse train p_T(t), as shown in the periodic pulse signal, with T denoting the pitch period. The spectrum of this signal contains harmonics of the frequency 1/T, known as the pitch frequency or the fundamental frequency F0. The primary difference between adult male and female/prepubescent speech is pitch. Before puberty, pitch frequency for normal speech ranges between 150 and 400 Hz for both males and females. After puberty, the vocal cords of males undergo a physical change, which has the effect of lowering their pitch frequency to the range 80-160 Hz. If we could examine the vocal cord output, we could probably discern whether the speaker was male or female. This difference is also readily apparent in the speech signal itself.
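The harmonic structure described above can be illustrated directly: the spectrum of a periodic pulse train is nonzero only at multiples of the pitch frequency 1/T. A minimal sketch; the sampling rate and 100 Hz pitch below are illustrative assumptions, not values from the text:

```python
import cmath

fs = 8000          # sampling rate in Hz (assumed)
period = 80        # samples per pitch period, so pitch = fs/period = 100 Hz
n_samples = 800    # ten full pitch periods

# One narrow pulse per pitch period
x = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def dft_magnitude(x, k):
    """|X[k]| of the length-N DFT, computed directly from the definition."""
    N = len(x)
    return abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                   for n in range(N)))

# Bin spacing is fs/N = 10 Hz, so harmonics of 100 Hz land on bins 10, 20, ...
print(dft_magnitude(x, 10))   # at the pitch frequency: large (sums all 10 pulses)
print(dft_magnitude(x, 15))   # between harmonics: essentially zero
```

The spectrum is a "line spectrum": energy concentrates at the pitch harmonics and vanishes in between, exactly the pitch lines seen later in Figure 4.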

To simplify our speech modeling effort, we shall assume that the pitch period is constant. With this simplification, we collapse the vocal-cord-lung system into a simple source that produces the periodic pulse signal (Figure 2). The sound pressure signal thus produced enters the mouth behind the tongue, creates acoustic disturbances, and exits primarily through the lips and to some extent through the nose. Speech specialists tend to name the mouth, tongue, teeth, lips, and nasal cavity the vocal tract. The physics governing the sound disturbances produced in the vocal tract and those of an organ pipe are quite similar. Whereas the organ pipe has the simple physical structure of a straight tube, the cross-section of the vocal tract "tube" varies along its length because of the positions of the tongue, teeth, and lips. It is these positions that are controlled by the brain to produce the vowel sounds. Spreading the lips, bringing the teeth together, and bringing the tongue toward the front portion of the roof of the mouth produces the sound "ee." Rounding the lips, spreading the teeth, and positioning the tongue toward the back of the oral cavity produces the sound "oh." These variations result in a linear, time-invariant system that has a frequency response typified by several peaks, as shown in Figure 3.

Figure 3: The ideal frequency responses of the vocal tract as it produces the sounds "oh" and "ee" are shown on the top left and top right, respectively. The spectral peaks are known as formants and are numbered consecutively from low to high frequency. The bottom plots show speech waveforms corresponding to these sounds.
Speech Spectrum (spectrum6.png)

These peaks are known as formants. Thus, speech signal processors would say that the sound "oh" has a higher first formant frequency than the sound "ee," with F2 being much higher during "ee." F2 and F3 (the second and third formants) have more energy in "ee" than in "oh." Rather than serving as a filter that rejects high or low frequencies, the vocal tract serves to shape the spectrum of the vocal cords. In the time domain, we have a periodic signal, the pitch, serving as the input to a linear system. We know that the output—the speech signal we utter and that is heard by others and ourselves—will also be periodic. Example time-domain speech signals are shown in Figure 3, where the periodicity is quite apparent.
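The claim that a periodic input to a linear, time-invariant system yields a periodic output can be verified numerically. A minimal sketch, using an arbitrary one-pole recursive filter as the LTI system (an illustrative stand-in, not a vocal-tract model):

```python
def one_pole(x, a=0.9):
    """y[n] = x[n] + a*y[n-1]: a simple recursive LTI filter
    (the coefficient a = 0.9 is an arbitrary illustrative choice)."""
    y, prev = [], 0.0
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y

period = 50
# Periodic pulse-train input, twenty full periods
x = [1.0 if n % period == 0 else 0.0 for n in range(20 * period)]
y = one_pole(x)

# Once the start-up transient decays, successive output periods are identical.
late = y[15 * period:16 * period]
later = y[16 * period:17 * period]
print(max(abs(u - v) for u, v in zip(late, later)))  # vanishingly small
```

The output repeats with the same period as the input; the filter only reshapes the waveform within each period, just as the vocal tract reshapes the pulse train's spectrum.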

Exercise 2

From the waveform plots shown in Figure 3, determine the pitch period and the pitch frequency.

Solution

In the bottom-left panel, the period is about 0.009 s, which corresponds to a frequency of 111 Hz. The bottom-right panel has a period of about 0.0065 s, corresponding to a frequency of 154 Hz.
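The arithmetic in this solution is just the reciprocal of the measured period:

```python
# Pitch frequency = 1 / pitch period, using the values read from Figure 3.
print(round(1 / 0.009))    # bottom-left waveform: 111 Hz
print(round(1 / 0.0065))   # bottom-right waveform: 154 Hz
```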

Since speech signals are periodic, speech has a Fourier series representation given by a linear circuit's response to a periodic signal. Because the acoustics of the vocal tract are linear, we know that the spectrum of the output equals the product of the pitch signal's spectrum and the vocal tract's frequency response. We thus obtain the fundamental model of speech production.

S(f) = P_T(f) H_V(f)
(1)
Here, H_V(f) is the transfer function of the vocal tract system. The Fourier series for the vocal cords' output, derived previously, is
c_k = A e^(-iπkΔ/T) sin(πkΔ/T) / (πk)
(2)
and is plotted on the top in Figure 4. If we had, for example, a male speaker with about a 110 Hz pitch (T ≈ 9.1 ms) saying the vowel "oh", the spectrum of his speech predicted by our model is shown in Figure 4(b).
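Equations (1) and (2) can be evaluated numerically to produce a model spectrum like Figure 4(b). In the sketch below, only the 110 Hz pitch (T ≈ 9.1 ms) comes from the text; the pulse width Δ and the single-resonance stand-in for H_V(f) are illustrative assumptions:

```python
import cmath, math

A, Delta, T = 1.0, 0.001, 0.0091   # amplitude, pulse width (s), pitch period (s)

def c(k):
    """Fourier coefficient c_k of the periodic pulse train, Equation (2)."""
    if k == 0:
        return A * Delta / T       # limiting value as k -> 0
    x = math.pi * k * Delta / T
    return A * cmath.exp(-1j * x) * math.sin(x) / (math.pi * k)

def H_V(f, f1=500.0, bw=100.0):
    """Toy single-formant transfer function: one resonance at f1 (assumed)."""
    return 1.0 / (1.0 + 1j * (f - f1) / bw)

# Model speech spectrum S(k/T) = c_k * H_V(k/T) at the first few pitch lines
for k in range(1, 6):
    f = k / T
    print(f"harmonic {k}: f = {f:.0f} Hz, |S| = {abs(c(k) * H_V(f)):.4f}")
```

The harmonics near the resonance are boosted while the others are attenuated, which is exactly how a formant emerges from the line spectrum.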

Figure 4: The vocal tract's transfer function, shown as the thin, smooth line, is superimposed on the spectrum of actual male speech corresponding to the sound "oh." The pitch lines, corresponding to harmonics of the pitch frequency, are indicated.
(a) The vocal cords' output spectrum P_T(f). pulse (spectrum3.png)
(b) The vocal tract's transfer function H_V(f) and the speech spectrum. voice spectrum (spectrum7.png)

The model spectrum idealizes the measured spectrum and captures all the important features. The measured spectrum certainly demonstrates what are known as pitch lines, and we realize from our model that they are due to the vocal cords' periodic excitation of the vocal tract. The vocal tract's shaping of the line spectrum is clearly evident, but difficult to discern exactly, especially at the higher frequencies. The model transfer function for the vocal tract makes the formants much more readily evident.

Exercise 3

The Fourier series coefficients for speech are related to the vocal tract's transfer function only at the frequencies k/T, k = 1, 2, ...; see the previous result. Would male or female speech tend to have a more clearly identifiable formant structure when its spectrum is computed? Consider, for example, how the spectrum shown on the right in Figure 4 would change if the pitch were twice as high (about 300 Hz).

Solution

Because males have a lower pitch frequency, the spacing between spectral lines is smaller. This closer spacing more accurately reveals the formant structure. Doubling the pitch frequency to 300 Hz for Figure 4 would amount to removing every other spectral line.

When we speak, pitch and the vocal tract's transfer function are not static; they change according to their control signals to produce speech. Engineers typically display how the speech spectrum changes over time with what is known as a spectrogram (Figure 5). Note how the line spectrum, which indicates how the pitch changes, is visible during the vowels, but not during the consonants (like the "ce" in "Rice").
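A spectrogram is nothing more than the DFT magnitude of successive windowed frames of the signal. A minimal sketch of one such frame-by-frame analysis; the test tone, frame length, and Hann window below are illustrative choices, not the author's processing:

```python
import cmath, math

fs, frame = 8000, 256   # sampling rate (Hz) and frame length (samples), assumed

# Test signal: a tone that jumps from 500 Hz to 1500 Hz halfway through,
# standing in for a speech signal whose spectrum changes over time.
x = [math.sin(2 * math.pi * (500 if n < 4000 else 1500) * n / fs)
     for n in range(8000)]

def frame_spectrum(x, start):
    """DFT magnitudes (bins 0 .. frame/2 - 1) of one Hann-windowed frame."""
    seg = x[start:start + frame]
    # Hann window reduces leakage between neighboring frequency bins
    seg = [s * 0.5 * (1 - math.cos(2 * math.pi * n / frame))
           for n, s in enumerate(seg)]
    return [abs(sum(s * cmath.exp(-2j * cmath.pi * k * n / frame)
                    for n, s in enumerate(seg)))
            for k in range(frame // 2)]

early = frame_spectrum(x, 0)     # frame from the first half of the signal
late = frame_spectrum(x, 6000)   # frame from the second half

print(early.index(max(early)) * fs / frame)  # peak near 500 Hz
print(late.index(max(late)) * fs / frame)    # peak near 1500 Hz
```

Stacking such frame spectra side by side, with magnitude mapped to color, yields exactly the time-frequency picture of Figure 5.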

Figure 5: Displayed is the spectrogram of the author saying "Rice University." Blue indicates low-energy portions of the spectrum, with red indicating the most energetic portions. Below the spectrogram is the time-domain speech signal, where the periodicities can be seen.
spectrogram (spectrum8.png)

The fundamental model for speech indicates how engineers use the physics underlying the signal generation process and exploit its structure to produce a systems model that suppresses the physics while emphasizing how the signal is "constructed." From everyday life, we know that speech contains a wealth of information. We want to determine how to transmit and receive it. Efficient and effective speech transmission requires us to know the signal's properties and its structure (as expressed by the fundamental model of speech production). We see from Figure 5, for example, that speech contains significant energy from zero frequency up to around 5 kHz.

Effective speech transmission systems must be able to cope with signals having this bandwidth. It is interesting that one system that does not support this 5 kHz bandwidth is the telephone: Telephone systems act like a bandpass filter passing energy between about 200 Hz and 3.2 kHz. The most important consequence of this filtering is the removal of high frequency energy. In our sample utterance, the "ce" sound in "Rice" contains most of its energy above 3.2 kHz; this filtering effect is why it is extremely difficult to distinguish the sounds "s" and "f" over the telephone. Try this yourself: Call a friend and determine if they can distinguish between the words "six" and "fix". If you say these words in isolation so that no context provides a hint about which word you are saying, your friend will not be able to tell them apart. Radio does support this bandwidth (see more about AM and FM radio systems).
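The telephone channel's effect can be sketched with a toy bandpass model. The 200 Hz and 3.2 kHz band edges come from the text; the second-order filter shape below is an illustrative assumption (real telephone channels roll off more sharply):

```python
import math

f1, f2 = 200.0, 3200.0
f0 = math.sqrt(f1 * f2)   # geometric center frequency: 800 Hz
Q = f0 / (f2 - f1)        # quality factor placing the -3 dB edges at f1 and f2

def gain(f):
    """Magnitude response of a second-order bandpass at frequency f (Hz)."""
    return 1.0 / math.sqrt(1.0 + (Q * (f / f0 - f0 / f)) ** 2)

print(round(gain(800), 3))    # unity gain at the center frequency
print(round(gain(200), 3))    # 0.707 (-3 dB) at the lower band edge
print(round(gain(4500), 3))   # attenuated above the band, where "s"/"f" energy lies
```

Even this gentle model attenuates the frication energy above 3.2 kHz relative to the passband, hinting at why "six" and "fix" become indistinguishable over a real, sharper telephone channel.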

Efficient speech transmission systems exploit the speech signal's special structure: What makes speech speech? You can conjure many signals that span the same frequencies as speech—car engine sounds, violin music, dog barks—but don't sound at all like speech. We shall learn later that transmission of any 5 kHz bandwidth signal requires about 80 kbps (thousands of bits per second) to transmit digitally. Speech signals can be transmitted using less than 1 kbps because of their special structure. Reducing the "digital bandwidth" so drastically meant that engineers spent many years developing signal processing and coding methods that could capture the special characteristics of speech without destroying how it sounds. If you used a speech transmission system to send a violin sound, it would arrive horribly distorted; speech transmitted the same way would sound fine.
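The 80 kbps figure follows from sampling at twice the 5 kHz bandwidth; the 8 bits per sample used below is a common choice, assumed here rather than stated in the text:

```python
# Back-of-the-envelope check of the 80 kbps figure for a 5 kHz bandwidth signal.
bandwidth_hz = 5000
sample_rate = 2 * bandwidth_hz    # Nyquist rate: 10,000 samples/s
bits_per_sample = 8               # assumed quantization
print(sample_rate * bits_per_sample)  # bits per second
```

Against this baseline, a sub-1 kbps speech coder represents roughly an eighty-fold reduction, achievable only by encoding the model parameters (pitch, formants) rather than the waveform itself.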

Exploiting the special structure of speech requires going beyond the capabilities of analog signal processing systems. Many speech transmission systems work by finding the speaker's pitch and the formant frequencies. Fundamentally, we need to do more than filtering to determine the speech signal's structure; we need to manipulate signals in more ways than are possible with analog systems. Such flexibility is achievable (but not without some loss) with programmable digital systems.
