Summary: An explanation of how individual syllables are analyzed and broken down into vowel sounds and formants.
Interpreting this signal begins with finding an equation that describes it. A convenient way to do that is with an autoregressive model, which estimates the current value of a signal from its previous values. The equation for the model is as follows:
| The Autoregressive Model |
|---|
| $X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t$ |
The model consists of three parts: a constant term, an error or noise term, and the autoregressive summation. The summation expresses the fact that the current value of the signal depends only on its previous values. The variable p is the order of the model. In general, the higher the order, the more closely the model can follow the signal, so as the order grows very large the model approaches an exact representation of the input signal.
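As a rough illustration of fitting such a model, the sketch below estimates the coefficients of an order-p autoregressive model from a sequence of samples. It uses the Levinson-Durbin recursion on the Yule-Walker equations, which is one standard fitting method and not necessarily the one used for the results in this module. The function names `autocorrelation` and `fit_ar`, the synthetic test signal, and its coefficients are all assumptions made for the example.

```python
import random


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a sequence."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def fit_ar(x, p):
    """Estimate AR(p) coefficients with the Levinson-Durbin recursion.

    The mean is removed first, which absorbs the constant term of the
    model. Returns (phi, noise_variance), where phi[i] multiplies the
    sample (i + 1) steps in the past.
    """
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    r = autocorrelation(centered, p)
    phi, err = [], r[0]
    for k in range(1, p + 1):
        # Reflection coefficient for the step from order k-1 to order k.
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / err
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
    return phi, err


# Synthetic check (made-up coefficients): generate a stable AR(2) signal
# and see whether the fit recovers the coefficients that produced it.
random.seed(0)
true_phi = [1.3, -0.7]
x = [0.0, 0.0]
for _ in range(5000):
    x.append(true_phi[0] * x[-1] + true_phi[1] * x[-2] + random.gauss(0.0, 1.0))

phi_hat, noise_var = fit_ar(x, p=2)
print(phi_hat, noise_var)
```

With a few thousand samples the estimates should land close to the coefficients used to generate the signal. A real syllable would need a higher order than 2 to capture several formants.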
This equation is a difference equation, the discrete-time counterpart of a differential equation. In fact, it can be used to find the transfer function for the signal.
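The step from the model to a transfer function can be sketched as follows. The symbol names are assumptions chosen for this derivation: the signal is written as $X_t$, the coefficients as $\varphi_i$, and the noise as $\varepsilon_t$. The constant term is dropped, which amounts to removing the mean of the signal first. Taking the z-transform of both sides and treating the noise as the input gives an all-pole transfer function:

$$
X_t = \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
\;\;\Longrightarrow\;\;
X(z)\left(1 - \sum_{i=1}^{p} \varphi_i z^{-i}\right) = E(z)
\;\;\Longrightarrow\;\;
H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{i=1}^{p} \varphi_i z^{-i}}
$$

The frequency response is obtained by evaluating $H(z)$ on the unit circle, $z = e^{j\omega}$.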
Once you have the transfer function, you merely need to take your enveloped syllables and pass them through it. Taking the frequency response of the transfer function then produces a very nice plot (Figure 1).
Figure 1: Frequency response of the transfer function.
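A frequency response like the one in Figure 1 can be computed directly from the model coefficients. The sketch below assumes an all-pole transfer function with a denominator of one minus the weighted sum of delays, and evaluates it on the unit circle from 0 Hz up to half the sampling rate. The function name `ar_frequency_response`, the 8000 Hz sampling rate, and the single 700 Hz resonance are illustrative assumptions, not values from this module.

```python
import cmath
import math


def ar_frequency_response(phi, fs, n_points=512):
    """Magnitude response (in dB) of H(z) = 1 / (1 - sum_i phi[i] z^-(i+1)).

    Returns (freqs_hz, mags_db), sampled from 0 Hz up to just below fs / 2.
    """
    freqs_hz, mags_db = [], []
    for k in range(n_points):
        f = k * (fs / 2.0) / n_points
        w = 2.0 * math.pi * f / fs  # normalized angular frequency
        denom = 1.0 - sum(c * cmath.exp(-1j * w * (i + 1))
                          for i, c in enumerate(phi))
        freqs_hz.append(f)
        mags_db.append(-20.0 * math.log10(abs(denom)))
    return freqs_hz, mags_db


# Made-up example: one resonance near 700 Hz at an assumed 8 kHz sampling
# rate, built from a complex pole pair of radius 0.95.
fs = 8000.0
radius, f0 = 0.95, 700.0
theta = 2.0 * math.pi * f0 / fs
phi = [2.0 * radius * math.cos(theta), -radius ** 2]

freqs_hz, mags_db = ar_frequency_response(phi, fs)
peak_hz = freqs_hz[mags_db.index(max(mags_db))]
print(peak_hz)
```

The largest value of the response falls close to the resonance that was built into the coefficients, which is the behavior the formant plots rely on.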
This gives us something we can actually interpret. Specifically, you can clearly see the formants of the vowel, that is, the peaks of the frequency response. These peaks are what differentiate vowel sounds from one another. For instance, the following vowel sounds, all from the same person, look clearly different from one another (see Sample Formants).
| Sample Formants |
|---|
| [Figure: frequency responses of different vowel sounds from the same speaker] |
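Reading formants off a plot by eye can be mimicked by searching the frequency response for local maxima. The following is a minimal sketch under that assumption. The function name `find_formants` and the sample values are made up for illustration and are not measurements from the plots above.

```python
def find_formants(freqs_hz, mags_db, max_formants=2):
    """Return the lowest-frequency local maxima as (frequency, magnitude) pairs."""
    peaks = [(freqs_hz[i], mags_db[i])
             for i in range(1, len(mags_db) - 1)
             if mags_db[i - 1] < mags_db[i] >= mags_db[i + 1]]
    return peaks[:max_formants]


# Made-up, coarsely sampled frequency response with two bumps.
freqs_hz = [0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000]
mags_db = [0, 5, 12, 6, 2, 4, 9, 3, 1]

formants = find_formants(freqs_hz, mags_db)
print(formants)  # [(500, 12), (1500, 9)]
```

A response computed from real speech is usually less clean than this, so small ripples may need to be smoothed or thresholded before the peaks are picked.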
Looking at the first two formants, you can see clear differences in both their frequencies and their magnitudes from vowel to vowel. These peak values also differ from person to person, even for the same vowel. For instance, compare the sound ‘a’ (as in cat) for each member of the group (see Speaker Vowel Comparisons).
| Speaker Vowel Comparisons |
|---|
| [Figure: frequency responses of the sound ‘a’ (as in cat) for each member of the group] |
Even though the overall structure of the frequency responses is similar, each speaker's vowel sound has slightly different formants, both in the frequency at which they occur and in the height they attain. So finally, we have some way to analyze our signal. All that remains is the final step: comparing these formants to the formants of the whole group.
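One simple way such a comparison could work is to pick the group member whose stored formants lie closest to the measured ones. The sketch below is only an assumption about that final step, not the method actually used. It compares formant frequencies alone and ignores peak heights, and the speaker names and formant values are invented for the example.

```python
import math

# Hypothetical reference formants (F1, F2 in Hz) for one vowel.
# These numbers are invented for illustration only.
group_formants = {
    "speaker_a": (660.0, 1720.0),
    "speaker_b": (730.0, 1090.0),
    "speaker_c": (600.0, 1900.0),
}


def closest_speaker(measured, references):
    """Return the name whose reference formants are nearest (Euclidean distance)."""
    return min(references, key=lambda name: math.dist(measured, references[name]))


match = closest_speaker((650.0, 1750.0), group_formants)
print(match)  # speaker_a
```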