# Class Tutorial (Analysis of Speech Signal Spectrums Using the L2 Norm)

Module by: Nicholas

Summary: This module describes the course tutorial supplied by the professor.

## VI – Class Tutorial

This section is based on the class tutorial available at http://moodle.csun.edu/file.php/177/VoiceRecognition/node5.html.

For this tutorial, we analyze the sequences shown in Figure 6.1, which shows two recordings of Nicholas saying the word “two”. We denote these sequences as two1 and two2.

Figure 6.1: Two recordings of Nicholas saying the word “two”.

We first compute the L2 norm of the difference of two signals as shown in (6.1).

$$\| f_1 - f_2 \| = \sum_{i}^{\min(N_1, N_2)} \left( f_1(i) - f_2(i) \right)^2 \tag{6.1}$$

We naively cut off the comparison of the two data sequences when the shorter signal ends. The norm of the difference between these two sequences is approximately 15.4. To judge whether this value is large, we compute the energy in the individual signals. The energies in two1 and two2 are approximately 12.0 and 9.3, respectively. The norm of the difference therefore exceeds the energy in either individual signal. This is very large for two signals that produce the same sound (where “same” here means that a human interprets both signals as having the same meaning).
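The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the function names and the short sample lists are hypothetical stand-ins for two1 and two2, and "energy" is assumed to mean the sum of squared samples. Note that (6.1), taken as written, has no square root, so the sketch computes the sum of squared differences.

```python
def energy(signal):
    """Sum of squared samples (assumed definition of 'energy')."""
    return sum(x * x for x in signal)


def truncated_difference_energy(f1, f2):
    """Sum of squared differences over the first min(N1, N2) samples, as in (6.1)."""
    n = min(len(f1), len(f2))  # stop when the shorter signal ends
    return sum((f1[i] - f2[i]) ** 2 for i in range(n))


# Hypothetical short sequences standing in for two1 and two2.
a = [0.0, 0.5, -0.5, 1.0, -1.0]
b = [0.0, -0.5, 0.5, 1.0]

d = truncated_difference_energy(a, b)
print("difference:", d)
print("energy of a:", energy(a))
print("energy of b:", energy(b))
```

Comparing the printed difference with the two energies is the same check the text performs with the values 15.4, 12.0, and 9.3.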

We now compare the first “two” sequence with a scaled copy of itself. Figure 6.2 shows two sequences, two1 and two3, where two3 = 5 * two1. Note the difference in the scales of the y axes. As expected, the difference between the signals is large.

Figure 6.2: Plots showing the sequence “two” spoken by Nicholas, that signal multiplied by 5, and the difference of the two.
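A short worked step shows why the difference is large. Subtracting a signal from five times itself leaves four times the signal, so with the sum-of-squares form of (6.1), and assuming "energy" means the sum of squared samples (the usual definition, which the tutorial does not state explicitly):

$$\sum_i \left( f_1(i) - 5 f_1(i) \right)^2 = \sum_i \left( -4 f_1(i) \right)^2 = 16 \sum_i f_1(i)^2$$

Under that assumption the difference is sixteen times the energy of two1 itself; with the energy of roughly 12.0 reported earlier for two1, that would be roughly 192.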

Were two1 and two3 different recordings of the same person saying the word “two”, we could first make the sequences comparable by normalizing their amplitudes. As suggested in the tutorial, we could normalize by the maximum value in the signal, according to the formula shown in (6.2).

$$\text{normalized data} = \frac{\text{data}}{\max(\text{data})} \tag{6.2}$$

In this case the procedure works perfectly: the L2 norm of the difference between the normalized two1 and the normalized two3 is 0. However, the procedure only works because one signal is exactly a multiple of the other. If the signals were slightly misaligned, or if noise were added to the signal, the energy in the difference signal would again be on the order of the energy in the signal itself. It would not take much noise to corrupt this procedure. If two3 equaled 5*two1 at all points except the maximum, and that point were corrupted so that it equaled 2*5*two1, then the ratio between the normalized two1 and the normalized two3 would be 2 at every point except the maximum, so its average value would be approximately 2.
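The fragility of peak normalization can be reproduced with a small sketch. The sample values below are hypothetical, and the helper names are chosen for illustration only; the code follows (6.2) literally by dividing by the largest sample value, so it assumes that value is positive.

```python
def normalize_by_max(data):
    """Divide every sample by the largest sample value, as in (6.2)."""
    peak = max(data)  # assumed positive for this sketch
    return [x / peak for x in data]


def difference_energy(f1, f2):
    """Sum of squared differences over the overlapping samples."""
    n = min(len(f1), len(f2))
    return sum((f1[i] - f2[i]) ** 2 for i in range(n))


two1 = [0.25, -0.5, 1.0, -0.25, 0.5]  # hypothetical samples
two3 = [5 * x for x in two1]          # exact scaled copy

# Exact multiple: the normalized signals coincide.
clean = difference_energy(normalize_by_max(two1), normalize_by_max(two3))

# Corrupt only the peak of two3 by a factor of 2.
corrupted = list(two3)
peak_index = corrupted.index(max(corrupted))
corrupted[peak_index] *= 2

n1 = normalize_by_max(two1)
n3 = normalize_by_max(corrupted)

# Ratio of normalized two1 to normalized corrupted two3, away from the peak.
ratios = [n1[i] / n3[i] for i in range(len(n1)) if i != peak_index]

print("difference for exact multiple:", clean)
print("ratios after corrupting the peak:", ratios)
```

A single corrupted sample rescales every other sample of the normalized signal, which is why the ratio settles at 2 everywhere except at the peak itself.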

A more robust procedure is to normalize by the energy in the signal, according to the formula shown in (6.3); the subscript 2 denotes the L2 norm.

$$\text{normalized data} = \frac{\text{data}}{\| \text{data} \|_2} \tag{6.3}$$

Though this procedure does not make the comparison robust to alignment issues, it does make it somewhat robust to spurious noise, as long as that noise has zero temporal mean. Again, in our example, where no noise is added and the signals are perfectly aligned, the L2 norm of the difference between the normalized two1 and the normalized two3 is 0.
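A minimal sketch of the normalization in (6.3) follows, again with hypothetical sample values and helper names. It assumes the standard definition of the L2 norm, the square root of the sum of squared samples; dividing by the sum of squares without the root would not cancel a scale factor. The last part shifts one signal by a single sample to illustrate the alignment problem mentioned above.

```python
import math


def l2_norm(data):
    """Square root of the sum of squared samples (standard L2 norm)."""
    return math.sqrt(sum(x * x for x in data))


def normalize_by_l2(data):
    """Divide every sample by the signal's L2 norm, as in (6.3)."""
    norm = l2_norm(data)
    return [x / norm for x in data]


def difference_energy(f1, f2):
    """Sum of squared differences over the overlapping samples."""
    n = min(len(f1), len(f2))
    return sum((f1[i] - f2[i]) ** 2 for i in range(n))


two1 = [3.0, -4.0, 0.0, 0.0]   # hypothetical samples, L2 norm 5
two3 = [5 * x for x in two1]   # exact scaled copy, L2 norm 25

n1 = normalize_by_l2(two1)
n3 = normalize_by_l2(two3)

# Aligned, noise-free case: the normalized signals coincide.
aligned = difference_energy(n1, n3)

# Misaligned case: delay the normalized two3 by one sample.
shifted = [0.0] + n3[:-1]
misaligned = difference_energy(n1, shifted)

print("aligned difference:", aligned)
print("misaligned difference:", misaligned)
```

Each normalized signal has unit energy, so a misaligned difference larger than 1 means the difference again exceeds the energy of the signal itself.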

Comparing the norms as performed above is instructive: it reveals just how adaptable the human brain is. The same person can utter the same phrase while varying the contraction of the diaphragm, the contraction of the intercostal muscles, the spectrum emitted by the vocal cords (changing the pitch), and the shape of the respiratory tract (e.g. the shape of the mouth), and the human brain still easily interprets each utterance as having the same meaning.

For a computer to perform similarly, we will need more sophisticated processing than a comparison of norms.
