Speech Perception

Module by: David Lane.

For most of us, listening to speech is an effortless task. Generally speaking, speech perception proceeds through a series of stages in which acoustic cues are extracted and stored in sensory memory and then mapped onto linguistic information. When air from the lungs is pushed into the larynx, across the vocal cords, and into the mouth and nose, different types of sounds are produced. The different qualities of the sounds are represented in formants, which can be pictured on a graph that has time on the x-axis and the pressure under which the air is pushed on the y-axis. Perception of the sound will vary as the frequency with which the air vibrates across time varies. Because vocal tracts vary somewhat between people (just as shoe size or height does), one person's vocal cords may be shorter than another's, or the roof of someone's mouth may be higher than another's, and the end result is that there are individual differences in how various sounds are produced. You probably know someone whose voice is slightly lower or higher in pitch than yours. Pitch is the psychological correlate of the physical acoustic cue of frequency. The more frequently the vibrations of air occur for a particular sound, the higher in pitch it is perceived to be. Less frequent vibrations are perceived as being lower in pitch. When language is the sound being processed, the formants are mapped onto phonemes, which are the smallest units of sound in a language. For example, in English the phonemes in the word "glad" are /g/, /l/, /æ/, and /d/.

The nature of speech, however, has provided researchers of language with a number of puzzles, some of which have been researched for more than forty years.

Note:

To demonstrate one of these problems, click here. The waveform you see shows speech as a function of amplitude, which is measured in decibels (dB), and frequency of the sound waves, measured in hertz (Hz). As the cursor passes over the waveform, you may notice various sections that correspond to the words and individual sounds you hear; for example, you can detect where the word "show" begins and where the word "money" ends. After a bit of experimentation, however, you will notice that it is difficult to pinpoint precisely where one phoneme ends and another begins. Try to find the "th" sound in the word "the", for example. And where is the "uh" sound in "the" located? Often the acoustic features of one sound spread across those of another sound, leading to the problem of linearity: if phonemes were produced one at a time, or linearly, each phoneme would have a single corresponding section in the waveform. As "the" shows, however, speech is not linear.

Another problem that investigators have studied is the problem of invariance. Invariance refers to a particular phoneme having one and only one waveform representation; that is, the phoneme /i/ (the "ee" sound in "me") should have the same amplitude and frequency in "me" as it has in "money". As you can see again, that is not the case; the two differ. The plosives, or stop consonants, such as /b/, /d/, /g/, and /k/, pose particular problems for the invariance assumption.

Note:

To download free sound-processing software to record your own sentences now, in order to see the problems of linearity and invariance in your own speech, click here.

The problems of linearity and invariance are brought about by co-articulation, the influence of the articulation (pronunciation) of one phoneme on that of another phoneme. Because phonemes cannot always be isolated in a spectrogram and can vary from one context to another depending on neighboring phonemes, the speaker's rate of speech, and loudness, segmentation (perceptually identifying one phoneme among a stream of others) seems like a daunting task. Theories and models of speech perception have to account for how segmentation occurs in order to provide an adequate account of speech perception. We will discuss some accounts of speech perception below.

Some clues as to how phonemes are identified come from investigations of the ability to perceive voiced consonants, or consonants during which the vocal cords vibrate. To understand the concept of voicing, say the phoneme /p/, followed by the phoneme /b/, while touching your throat. You will feel the vibration of your vocal cords during /b/ but not during /p/. Both of these phonemes are bilabial; that is, they are produced by pressing the lips together, and are released with a puff of air. Since the difference between these two phonemes that matters in English is their voicing, the ability to perceive voicing accurately is crucial for an adept listener. For example, as the rate of speech increases, listeners are able to shift their criterion of what constitutes a voiceless phoneme. The criterion shift allows them to accept phonemes that are pronounced with shorter voice onset time (VOT) durations, where VOT is the delay between the start of a sound and the onset of vocal cord vibration. Although shifting criteria during the perception of phonemes may be one process that allows accurate identification of phonemes despite changing conditions, what supports the criterion shifts is still a matter of investigation. These skills become highly automatic and effortless, and they are probably acquired and fine-tuned during early childhood, a topic we talk about in infant speech perception.

(Video clips courtesy of the late Peter W. Jusczyk and the Johns Hopkins University).

Is speech special?

In visual perception, people discriminate among colors based on the frequency of light. Low frequencies are perceived as red and high frequencies are perceived as violet.

Figure 1 (spectrum.jpg)

As we move from low to high frequencies, we perceive a continuum of colors from red to violet. Notice that as we move from red to orange, we pass through a middle ground that we call "red orange." Speech sounds lie on a physical continuum as well. For example, an important dimension in speech perception is voice onset time. This refers to the time between the beginning of the pronunciation of the word and the onset of the vibration of the vocal cords. For example, when you say "ba" your vocal cords vibrate right from the start. When you say "pa" your vocal cords do not vibrate until after a short delay. To see this for yourself, put one of your fingers on your throat over your vocal cords and say "ba" and then "pa."

The only difference between the sound "ba" and the sound "pa" is that the voice onset time for "ba" is shorter than the voice onset time for "pa". An important difference between speech perception and visual perception is that we do not hear speech sounds as falling halfway between a "ba" and a "pa." We hear a sound one way or the other. This means that one range of voice onset times is perceived as "ba" and a different range of voice onset times is perceived as "pa". This phenomenon is called categorical perception and is very helpful for understanding speech.
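One way to picture categorical perception is as a step function that maps a continuous voice onset time onto one of two labels. The sketch below is only an illustration: the 25 ms boundary, the function names, and the 20 ms step size are assumptions made for the example, not values reported in this module.

```python
# Hypothetical category boundary in milliseconds: an illustrative
# assumption, not a value taken from this module.
BOUNDARY_MS = 25.0


def label_syllable(vot_ms: float) -> str:
    """Label a stimulus as "ba" or "pa" from its voice onset time (VOT)."""
    return "ba" if vot_ms < BOUNDARY_MS else "pa"


def heard_as_different(vot_a: float, vot_b: float) -> bool:
    """Idealized listener: two sounds differ only if their labels differ."""
    return label_syllable(vot_a) != label_syllable(vot_b)


# Four stimuli separated by equal 20 ms physical steps.
continuum = [0.0, 20.0, 40.0, 60.0]
labels = [label_syllable(vot) for vot in continuum]
print(labels)  # → ['ba', 'ba', 'pa', 'pa']
```

In this idealized picture, `heard_as_different(0.0, 20.0)` is false while `heard_as_different(20.0, 40.0)` is true, even though both pairs are 20 ms apart. Real listeners are less absolute than this step function suggests.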

The sounds "ba" and "pa" differ on the continuous dimension of voice onset time. The sounds "ga" and "da" also differ on a continuous dimension. However, the continuous dimension for these stimuli is more complex than the dimension of voice onset time (it is called the second formant, but that is a little beyond the scope of this text). What is important here is that there is a continuum of sounds from "da" to "ga." The following demonstration uses computer-generated speech sounds. Ten sounds were generated in equal steps from "da" to "ga." The experiment uses the sounds numbered 1, 4, 7, and 10. Sounds 1 and 4 are both heard as "da" whereas Sounds 7 and 10 are heard as "ga." In the task, subjects are presented with a randomly ordered series of sound pairs and asked, for each pair, to judge whether the sounds are the same or different. Since Sounds 1 and 4 are both heard as "da" it should be very hard to tell them apart. Therefore, subjects usually judge these sounds as identical. By contrast, Sound 4 is heard as "da" while Sound 7 is heard as "ga." Since Sound 4 and Sound 7 are on opposite sides of the "categorical boundary" it is easier to hear the difference between these sounds than the difference between Sounds 1 and 4. This occurs even though the physical difference between Sounds 1 and 4 is the same as the difference between Sounds 4 and 7. By similar logic, the difference between Sounds 7 and 10 should be hard to hear.

The results from one subject in this demonstration experiment are shown below and can be interpreted as follows: When the comparison was between Sounds 1 and 4, the subject judged them to be different once and the same 4 times. When the comparison was between Sounds 4 and 7 (which cross the border), the subject correctly judged them to be different 5/5 times. Finally, in comparing Sounds 7 and 10, the subject always judged the sounds to be the same. Thus, the only time this subject heard a difference between sounds that were three steps apart was for Sounds 4 and 7.

Table 1
Sound Pair    Judged different    Judged same
1 vs. 4       1                   4
4 vs. 7       5                   0
7 vs. 10      0                   5
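
The counts in Table 1 can be turned into the proportion of trials on which each pair was judged different, which makes the jump at the category boundary easy to see. This is a minimal sketch; the variable and function names are introduced here for illustration only.

```python
# Counts copied from Table 1: (judged different, judged same), 5 trials each.
table_1 = {
    "1 vs. 4": (1, 4),
    "4 vs. 7": (5, 0),
    "7 vs. 10": (0, 5),
}


def proportion_different(different: int, same: int) -> float:
    """Proportion of trials on which the pair was judged different."""
    return different / (different + same)


proportions = {
    pair: proportion_different(different, same)
    for pair, (different, same) in table_1.items()
}

for pair, proportion in proportions.items():
    print(f"{pair}: judged different on {proportion:.0%} of trials")
# → 1 vs. 4: judged different on 20% of trials
# → 4 vs. 7: judged different on 100% of trials
# → 7 vs. 10: judged different on 0% of trials
```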

Not all results are as clear cut as those shown above. Many people need more time to become familiar with the task than is possible in this demonstration. In any case, you should get a sense of how this kind of experiment works.

The hypothesis that speech is perceptually special has arisen from this phenomenon of categorical perception. Listeners can differentiate between /p/ and /b/; however, distinguishing between different types of /p/ sounds is difficult and, for some, impossible. This pattern is consistent with the pragmatic demands of language; there is a meaning distinction between /p/ and /b/, while the distinction between two variations of /p/ carries no meaning. (There are languages in which two different /p/ sounds are used, and, in such cases, perception would be categorical.)

The first experiment to demonstrate categorical perception was conducted by Liberman, Harris, Hoffman and Griffith (1957), who presented consonant-vowel syllables along a continuum. The consonants were stop consonants, or plosives, /b/, /d/, and /g/, followed by /a/; for example, /ba/. When asked to say whether two syllables were the same or different, the participants reported various forms of /ba/ to be the same, whereas /ba/ and /da/ were easily discriminated.

Another categorical perception task presents two syllables followed by a probe syllable, and participants have to say which of the first two syllables the probe matches. If the first two sounds are from two different categories - for example, /da/ and /ga/ - participants accurately match the probe syllable. If the first two syllables are taken from the same category, however, participants cannot differentiate them well enough to do the matching task, and their performance is at chance.

Does the categorical perception of speech mean that speech is perceived via a specialized speech processor? Kewley-Port and Luce (1984) did not find categorical perception in some non-speech stimuli, indicating that there may be something special about speech.

For there to be a specialized speech processor, categorical perception should occur during the perception of all phonemes. However, Fry, Abramson, Eimas, and Liberman (1962) failed to find categorical perception with a vowel continuum. Vowels and consonants, then, do not behave the same way in this respect. Additionally, chinchillas have been shown to perceive speech categorically, despite their obvious lack of a speech-processing mechanism (Kuhl, 1987).

How is speech perceived?

One theory of how speech is perceived is the Motor Theory of speech perception (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). The motor theory postulates that speech is perceived by reference to how it is produced; that is, when perceiving speech, listeners access their own knowledge of how phonemes are articulated. Articulatory gestures such as rounding or pressing the lips together are units of perception that directly provide the listener with phonetic information. The motor theory can account for the invariance problem; that is, the ways that phonemes are produced and perceived have more in common than the ways they are acoustically represented and perceived.

What would be the evidence that listeners use articulatory features when perceiving speech? Here, an accidental discovery made by two film technicians led to one of the most robust and widely discussed findings in language processing. A researcher, Harry McGurk, was interested in whether auditory or visual modalities are differentially dominant during infants' perceptual development. To find out, he asked his technician to create a film to test which modality captured infants' attention. In this film, an actor pronounced the syllable "ga" while an auditory "ba" was dubbed over the tape. Would babies pay attention to the "ga" or the "ba"? The process of making the film, however, led to a surprising finding about adults. The technician (and others) did not perceive either a "ga" or a "ba". Rather, the technician perceived a "da".

In an experiment that formally tested this observation, McGurk and MacDonald (1976) showed research participants a video of a person saying "ga", a syllable that begins with a velar consonant (one formed in the back of the mouth at the velum), while playing an auditory recording of "ba", a syllable that begins with a bilabial consonant (one formed in the front of the mouth with the two lips). When viewers were asked what they heard, like the film technician, they replied "da". Perceiving a "da" was the result of combining articulatory information from both visually and auditorily presented stimuli.

Note:

You can experience the McGurk effect by clicking here.
(To return to the question Harry McGurk originally asked about infants, neither modality seems to have dominance; infants as young as 5 months old take in the visual and auditory information about words in the same way as adults: both influence perception.)

The McGurk effect has been interpreted as evidence that listeners perceive phonetic gestures, but an alternative explanation based on memory has also been raised. Because perceivers have ample experience with both hearing and seeing people speak, they may have built memories of these events that have subsequently become associated with the phoneme's mental representation, so that when the phoneme is perceived, memories based on the visual information are recalled (Massaro, 1987).

To test this possibility, Fowler and Dekle (1991) introduced research participants to one of two experimental conditions. In one, the participants were presented with either a printed ba or printed ga syllable, while listening to a syllable from the auditory /ba/-/ga/ continuum. In the other, the printed syllables were replaced with their haptic presentations; that is, participants were able to feel how the syllables were being produced. Since there are no previously made associations to how syllables feel when a speaker produces them, by the memory account there should be no McGurk effect. The experimenters found no effect of the printed syllables on the auditory ones, as expected, and they found that the feel of how a syllable is produced affected the perception of the auditory syllables, indicating that articulatory gestures are indeed perceived by listeners.

The TRACE model of speech perception, developed by Jay McClelland and Jeff Elman (1986; Elman & McClelland, 1988), depicts speech perception as a process in which speech units are arranged into levels and interact with each other. There are three levels: features, phonemes, and words. The levels are composed of processing units, or nodes; for example, within the feature level, there are individual nodes that detect voicing.

Nodes that are consistent with each other share excitatory connections; for example, to perceive a /k/ in "cake", the /k/ phoneme and the corresponding featural units share excitatory connections. Nodes that are inconsistent with each other share inhibitory links, and these links connect nodes within the same level. In this example, /k/ would have an inhibitory connection with the vowel sound in "cake", /eI/.

To perceive speech, the featural nodes are activated initially, followed in time by the phoneme and then word nodes. Thus, activation is bottom-up. Activation can also spread top-down, however, and TRACE can model top-down effects such as the fact that context can influence the perception of individual phonemes.
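
The interplay of bottom-up input, within-level inhibition, and top-down feedback can be illustrated with a toy network. The sketch below is not the TRACE implementation: the three nodes (/k/, /g/, and the word "cake"), the update rule, and every parameter value are assumptions chosen only for illustration.

```python
def run_toy_network(k_input, g_input, feedback=0.5, inhibition=0.3,
                    rate=0.2, steps=10):
    """Return final activations (k, g, word) of a three-node toy network."""
    k = g = word = 0.0
    for _ in range(steps):
        # Phoneme nodes: bottom-up input minus inhibition from the rival
        # phoneme. Only /k/ receives top-down support from the word "cake".
        new_k = k + rate * (k_input + feedback * word - inhibition * g - k)
        new_g = g + rate * (g_input - inhibition * k - g)
        # Word node: excited bottom-up by its initial phoneme /k/.
        new_word = word + rate * (k - word)
        # Activations are not allowed to fall below zero.
        k, g, word = max(new_k, 0.0), max(new_g, 0.0), max(new_word, 0.0)
    return k, g, word


# An ambiguous sound gives equal bottom-up input to /k/ and /g/.
k_fb, g_fb, _ = run_toy_network(0.5, 0.5, feedback=0.5)
k_no, g_no, _ = run_toy_network(0.5, 0.5, feedback=0.0)
print(f"with feedback:    /k/={k_fb:.3f}  /g/={g_fb:.3f}")
print(f"without feedback: /k/={k_no:.3f}  /g/={g_no:.3f}")
```

With word-level feedback switched on, the ambiguous sound ends up favoring /k/, because the word node for "cake" feeds activation back to its own phoneme. With feedback set to zero, the two phoneme nodes stay tied.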

Perception of speech can be influenced by contextual information, indicating that perception is not strictly bottom-up but can receive feedback from semantic levels of knowledge. In 1970, Warren and Warren took simple sentences, such as "It was found that the wheel was on the axle", removed the /w/ sound from "wheel", and replaced it with a cough. They found that listeners were unable to detect that the phoneme was missing. They found the same effect with the following sentences as well:

It was found that the *eel was on the shoe.
It was found that the *eel was on the orange.
It was found that the *eel was on the table.

Listeners perceived heel, peel, and meal, respectively. Because the perception of the word with the missing phoneme depends on the last word of the sentence, their finding indicates that perception is highly interactive.

Gating Task:

A task developed to show the effect of context on spoken word recognition is Gating (Grosjean, 1980). In this task, participants are presented with fragments of a word, of gradually increasing duration (for example, in 50-msec increments); for example, t - tr - tre - tress - tresp - trespa. Upon hearing each fragment, the participant makes a guess at what the whole word might be. (Have a go at this gating task yourself.) The point at which the person guesses the whole word is called the isolation point. Gating shows the effect of context on spoken word recognition: there is a time difference between identifying a word in isolation and identifying it in a sentence. The time to identify a word in context is about a fifth of a second, whereas it takes a third of a second in isolation. It is thought that the grammar and meaning of the preceding part of the sentence limit the range of possibilities for the gated word, such that it can be identified sooner in a sentence than on its own. The point at which there is only one possible candidate is called the uniqueness point. The uniqueness point and the isolation point need not correspond: on the one hand, the word may be recognized before there is one remaining candidate, if the context is helpful (i.e., strongly biasing); on the other hand, there may be a delay in isolating the word. There is a third point, called the recognition point. This is the point at which the person is confident in his or her identification of the gated word.

The guesses people make on this task indicate that the perceptual identity of the word is also important to spoken word recognition, even before the context has its effect. In other words, people's early guesses resemble the perceptual aspects of the word and not the contextually signaled candidate.
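
The uniqueness point described for the gating task can be computed against a small lexicon by growing the fragment until only one candidate is left. The sketch below rests on several assumptions: letters stand in for the sounds of the word, the six-word lexicon is made up for the example, and the function name is introduced here.

```python
from typing import List, Optional


def uniqueness_point(word: str, lexicon: List[str]) -> Optional[int]:
    """Return the fragment length at which `word` is the only candidate.

    Returns None if the word never becomes unique in this lexicon.
    """
    for length in range(1, len(word) + 1):
        fragment = word[:length]
        candidates = [entry for entry in lexicon if entry.startswith(fragment)]
        if candidates == [word]:
            return length
    return None


# Hypothetical toy lexicon, chosen so that several words share "tre".
toy_lexicon = ["trespass", "tress", "trestle", "tree", "trend", "table"]

print(uniqueness_point("trespass", toy_lexicon))  # → 5 (the fragment "tresp")
print(uniqueness_point("tree", toy_lexicon))      # → 4 (the fragment "tree")
```

In this toy lexicon, "trespass" becomes unique only at the fragment "tresp". A listener with a strongly biasing context could isolate it earlier, which is why the isolation point and the uniqueness point need not coincide.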

References

  1. Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-368.
  2. Kewley-Port, D., & Luce, P. A. (1984). Time-varying features of initial stop consonants in auditory running spectra: A first report. Perception and Psychophysics, 35, 353-360.
  3. Fry, D. B., Abramson, A. S., Eimas, P. D., & Liberman, A. M. (1962). The identification and discrimination of synthetic vowels. Language and Speech, 5, 171-189.
  4. Kuhl, P. K. (1987). The special mechanisms debate in speech research: Categorization tests on animals and infants. In S. Harnad (Ed.), Categorical perception: The groundwork of cognition (pp. 355-386). Cambridge: Cambridge University Press.
  5. Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
  6. McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
  7. Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: Cross-modal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17, 816-828.
  8. McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
  9. Elman, J. L., & McClelland, J. L. (1988). Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-165.
  10. Warren, R. M., & Warren, R. P. (1970). Auditory illusions and confusions. Scientific American, 223, 30-36.
  11. Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception and Psychophysics, 28, 267-283.
