<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/cnxml/0.5/DTD/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:bib="http://bibtexml.sf.net/" id="m0049">

  <name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Modeling the Speech Signal</name>

  <metadata xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
  <md:version xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">2.25</md:version>
  <md:created xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">2000/07/26</md:created>
  <md:revised xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">2007/11/30 16:06:26.505 US/Central</md:revised>
  <md:authorlist xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
      <md:author xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="dhj">
      <md:firstname xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Don</md:firstname>
      
      <md:surname xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Johnson</md:surname>
      <md:email xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">dhj@rice.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
    <md:maintainer xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="dhj">
      <md:firstname xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Don</md:firstname>
      
      <md:surname xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Johnson</md:surname>
      <md:email xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">dhj@rice.edu</md:email>
    </md:maintainer>
    <md:maintainer xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="mrshawn">
      <md:firstname xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Shawn</md:firstname>
      
      <md:surname xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Stewart</md:surname>
      <md:email xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">mrshawn@alumni.rice.edu</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
    <md:keyword xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">speech model</md:keyword>
    <md:keyword xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">vocal tract</md:keyword>
  </md:keywordlist>

  <md:abstract xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">A model of the human vocal tract.
</md:abstract>
</metadata>

  <content xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">

    <figure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="vocal">
      <name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Vocal Tract</name>
      <media xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" type="image/png" src="vocaltract.png"/>
      <caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	The vocal tract is shown in cross-section. Air pressure
	produced by the lungs forces air through the vocal cords that,
	when under tension, produce puffs of air that excite
	resonances in the vocal and nasal cavities. What are not shown
	are the brain and the musculature that control the entire
	speech production process.

      </caption>
    </figure>

    <figure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="vocalmodel">
      <name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Model of the Vocal Tract</name>
      <media xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" type="image/png" src="sys7.png"/>
      <caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	The systems model for the vocal tract. The signals
	<m:math>
	  <m:apply>
	    <m:ci type="fn">l</m:ci>
	    <m:ci>t</m:ci>
	  </m:apply>
	</m:math>,
	<m:math>
	  <m:apply>
	    <m:ci type="fn">
	      <m:msub>
		<m:mi>p</m:mi>
		<m:mi>T</m:mi>
	      </m:msub>
	    </m:ci>
	    <m:ci>t</m:ci>
	  </m:apply>
	</m:math>, and
	<m:math>
	  <m:apply>
	    <m:ci type="fn">s</m:ci>
	    <m:ci>t</m:ci>
	  </m:apply>
	</m:math>
	are the air pressure provided by the lungs, the periodic pulse
	output provided by the vocal cords, and the speech output
	respectively. Control signals from the brain are shown as
	entering the systems from the top. Clearly, these come from
	the same source, but for modeling purposes we describe them
	separately since they control different aspects of the speech
	signal.
      </caption>
    </figure>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="information">
      The information contained in the spoken word is conveyed by the
      speech signal. Because we shall analyze several speech
      transmission and processing schemes, we need to understand the
      speech signal's structure -- what's special about the speech
      signal -- and how we can describe and <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">model</emphasis>
      speech production. This modeling effort consists of finding a
      system's description of how relatively unstructured signals,
      arising from simple sources, are given structure by passing them
      through an interconnection of systems to yield speech. For
      speech and for many other situations, system choice is governed
      by the physics underlying the actual production process. Because
      the fundamental equation of acoustics -- the wave equation --
      applies here and is linear, we can use linear systems in our
      model with a fair amount of accuracy. The naturalness of linear
      system models for speech does not extend to other situations. In
      many cases, the underlying mathematics governed by the physics,
      biology, and/or chemistry of the problem are nonlinear, leaving
      linear systems models as approximations. Nonlinear models are
      far more difficult at the current state of knowledge to
      understand, and information engineers frequently prefer linear
      models because they provide a greater level of comfort, but not
      necessarily a sufficient level of accuracy.
    </para>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="speechproduction"><cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="vocal" strength="9"/> shows the actual speech
      production system and <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="vocalmodel" strength="9"/> shows the model speech production system. The characteristics of the
      model depends on whether you are saying a vowel or a consonant. We
      concentrate first on the vowel production mechanism. When the
      vocal cords are placed under tension by the surrounding
      musculature, air pressure from the lungs causes the vocal cords
      to vibrate. To visualize this effect, take a rubber band and
      hold it in front of your lips. If held open when you blow
      through it, the air passes through more or less freely; this
      situation corresponds to "breathing mode". If held tautly and
      close together, blowing through the opening causes the sides of
      the rubber band to vibrate. This effect works best with a wide
      rubber band. You can imagine what the airflow is like on the
      opposite side of the rubber band or the vocal cords. Your lung
      power is the simple source referred to earlier; it can be
      modeled as a constant supply of air pressure. The vocal cords
      respond to this input by vibrating, which means the output of
      this system is some periodic function.
    </para>

    <exercise xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1">
      <problem xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	<para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="probonetext">
	  Note that the vocal cord system takes a constant input and
	  produces a periodic airflow that corresponds to its output
	  signal. Is this system linear or nonlinear? Justify your
	  answer.
	</para>
      </problem>

      <solution xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	<para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="probonesoln">
	  If the glottis were linear, a constant input (a
	  zero-frequency sinusoid) should yield a constant output. The
	  periodic output indicates nonlinear behavior.
	</para>
      </solution>
    </exercise>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="VoiceAsSignal">Singers modify vocal cord tension to change the pitch to produce
      the desired musical note. Vocal cord tension is governed by a
      control input to the musculature; in system's models we
      represent control inputs as signals coming into the top or
      bottom of the system. Certainly in the case of speech and in
      many other cases as well, it is the control input that carries
      information, impressing it on the system's output. The change of
      signal structure resulting from varying the control input
      enables information to be conveyed by the signal, a process
      generically known as <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">modulation</term>. In singing,
      musicality is largely conveyed by pitch; in western speech, pitch
      is much less important.

      <!--<annotation type="teacher"> In some East Asian languages,
      pitch plays a much larger role.  </annotation>--> A sentence can
      be read in a monotone fashion without completely destroying the
      information expressed by the sentence. However, the difference
      between a statement and a question is frequently expressed by
      pitch changes. For example, note the sound differences between
      "Let's go to the park." and "Let's go to the park?";
    </para>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="PhonemePitch">
      For some consonants, the vocal cords vibrate just as in
      vowels. For example, the so-called nasal sounds "n" and "m" have
      this property. For others, the vocal cords do not produce a
      periodic output. Going back to mechanism, when consonants such
      as "f" are produced, the vocal cords are placed under much less
      tension, which results in turbulent flow. <!--<annotation
      type='teacher'>You can emulate this situation with your rubber
      band.</annotation>--> The resulting output airflow is quite
      erratic, so much so that we describe it as being
      <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">noise</term>. We define noise carefully later when we
      delve into communication problems.
    </para>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="VoiceFrequency">
      The vocal cords' periodic output can be well described by the
      periodic pulse train
      <m:math>
	<m:apply>
	  <m:times/>
	  <m:apply>
	    <m:ci type="fn">
	      <m:msub>
		<m:mi>p</m:mi>
		<m:mi>T</m:mi>
	      </m:msub>
	    </m:ci>
	    <m:ci>t</m:ci>
	  </m:apply>
	</m:apply>
      </m:math> as shown in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0042" target="pps" strength="8">the periodic pulse signal</cnxn>, with
      <m:math>
	<m:apply>
	  <m:ci>T</m:ci>
	</m:apply>
      </m:math>
      denoting the pitch period. The <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0042" target="pulsespec" strength="8">spectrum of this signal</cnxn>
      contains harmonics of the frequency
      <m:math>
	<m:apply>
	  <m:divide/>
	  <m:cn>1</m:cn>
	  <m:ci>T</m:ci>
	</m:apply>
      </m:math>, what is known as the <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">pitch frequency</term> or
      the <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">fundamental frequency</term>
      <m:math>
	<m:apply>
	  <m:ci>F0</m:ci>
	</m:apply>
      </m:math>. The primary difference between adult male and
      female/prepubescent speech is pitch. Before puberty, pitch
      frequency for normal speech ranges between 150-400 Hz for
      both males and females. After puberty, the vocal cords of males
      undergo a physical change, which has the effect of lowering
      their pitch frequency to the range 80-160 Hz. If we could
      examine the vocal cord output, we could probably discern whether
      the speaker was male or female. This difference is also readily
      apparent in the speech signal itself.
    </para>


    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="Simplifications">To simplify our speech modeling effort, we shall assume that the
      pitch period is constant. With this simplification, we collapse
      the vocal-cord-lung system as a simple source that produces the
      periodic pulse signal (<cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="vocalmodel" strength="5"/>). The sound pressure signal thus produced enters
      the mouth behind the tongue, creates acoustic disturbances, and
      exits primarily through the lips and to some extent through the
      nose. Speech specialists tend to name the mouth, tongue, teeth,
      lips, and nasal cavity the <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">vocal tract</term>. The physics
      governing the sound disturbances produced in the vocal tract and
      those of an organ pipe are quite similar. Whereas the organ pipe
      has the simple physical structure of a straight tube, the
      cross-section of the vocal tract "tube" varies along its length because
      of the positions of the tongue, teeth, and lips. It is these
      positions that are controlled by the brain to produce the vowel
      sounds.  Spreading the lips, bringing the teeth together, and
      bringing the tongue toward the front portion of the roof of the
      mouth produces the sound "ee." Rounding the lips, spreading the
      teeth, and positioning the tongue toward the back of the oral
      cavity produces the sound "oh." These variations result in a
      linear, time-invariant system that has a frequency response
      typified by several peaks, as shown in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="speech" strength="5"/>.
    </para>

    <figure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="speech">
      <name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Speech Spectrum</name> <media xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" type="image/png" src="spectrum6.png"/> <caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">The ideal frequency response of
      the vocal tract as it produces the sounds "oh" and "ee" are
      shown on the top left and top right, respectively. The spectral
      peaks are known as formants, and are numbered consecutively from
      low to high frequency. The bottom plots show speech waveforms
      corresponding to these sounds. </caption></figure>
    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="Explanation">
      These peaks are known as <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">formants</term>. Thus, speech
      signal processors would say that the sound "oh" has a higher
      first formant frequency than the sound "ee," with
      <m:math>
	<m:ci>F2</m:ci>
      </m:math>
      being much higher during "ee."
      <m:math>
	<m:ci>F2</m:ci>
      </m:math>
      and
      <m:math>
	<m:ci>F3</m:ci>
      </m:math>
      (the second and third formants) have more energy in "ee" than in
      "oh." Rather than serving as a filter, rejecting high or low
      frequencies, the vocal tract serves to <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">shape</emphasis>
      the spectrum of the vocal cords.  In the time domain,
      we have a periodic signal, the pitch, serving as the input to a
      linear system. We know that the output—the speech signal
      we utter and that is heard by others and ourselves—will
      also be periodic. Example time-domain speech signals are shown
      in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="speech" strength="5"/>, where the periodicity
      is quite apparent.
    </para>


    <exercise xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1wall">
      <problem xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	<para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1awall">
	  From the waveform plots shown in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="speech" strength="5"/>, determine the pitch period and the pitch
	  frequency.
	</para>
      </problem>
      <solution xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	<para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1bwall">
	  In the bottom-left panel, the period is about 0.009 s,
	  which equals a frequency of 111 Hz.  The bottom-right
	  panel has a period of about 0.0065 s, a frequency of
	  154 Hz.
	</para>
      </solution>
    </exercise>


    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="VoicePeriodicity">Since speech signals are periodic, speech has a Fourier series
      representation given by a
      <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0044" target="output" strength="5">
	linear circuit's response to a periodic signal</cnxn>.
      Because the acoustics of the vocal tract are linear, we know
      that the spectrum of the output equals the product of the pitch
      signal's spectrum and the vocal tract's frequency response. We
      thus obtain the <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">fundamental model of speech
      production</term>.
      <equation xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="eq1">
	<m:math>
	  <m:apply>
	    <m:eq/>
	    <m:apply>
	      <m:ci type="fn">S</m:ci>
	      <m:ci>f</m:ci>
	    </m:apply>
	    <m:apply>
	      <m:times/>
	      <m:apply>
		<m:ci type="fn">
		  <m:msub>
		    <m:mi>P</m:mi>
		    <m:mi>T</m:mi>
		  </m:msub>
		</m:ci>
		<m:ci>f</m:ci>
	      </m:apply>
	      <m:apply>
		<m:ci type="fn">
		  <m:msub>
		    <m:mi>H</m:mi>
		    <m:mi>V</m:mi>
		  </m:msub>
		</m:ci>
		<m:ci>f</m:ci>
	      </m:apply>
	    </m:apply>
	  </m:apply>
	</m:math>
      </equation>
      Here,
      <m:math>
	<m:apply>
	  <m:ci type="fn">
	    <m:msub>
	      <m:mi>H</m:mi>
	      <m:mi>V</m:mi>
	    </m:msub>
	  </m:ci>
	  <m:ci>f</m:ci>
	</m:apply>
      </m:math>
      is the transfer function of the vocal tract system. The Fourier
      series for the vocal cords' output, derived in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0042" target="expar1" strength="9">this
      equation</cnxn>, is
      <equation xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="eq2">
	<m:math>
	  <m:apply>
	    <m:eq/>
	    <m:ci>
	      <m:msub>
		<m:mi>c</m:mi>
		<m:mi>k</m:mi>
	      </m:msub>
	    </m:ci>
	    <m:apply>
	      <m:times/>
	      <m:ci>A</m:ci>
	      <m:apply>
		<m:exp/>
		<m:apply>
		  <m:minus/>
		  <m:apply>
		    <m:divide/>
		    <m:apply>
		      <m:times/>
		      <m:imaginaryi/>
		      <m:pi/>
		      <m:ci>k</m:ci>
		      <m:cn>Δ</m:cn>
		    </m:apply>
		    <m:ci>T</m:ci>
		  </m:apply>
		</m:apply>
	      </m:apply>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:sin/>
		  <m:apply>
		    <m:divide/>
		    <m:apply>
		      <m:times/>
		      <m:pi/>
		      <m:ci>k</m:ci>
		      <m:cn>Δ</m:cn>
		    </m:apply>
		    <m:ci>T</m:ci>
		  </m:apply>
		</m:apply>
		<m:apply>
		  <m:times/>
		  <m:pi/>
		  <m:ci>k</m:ci>
		</m:apply>
	      </m:apply>
	    </m:apply>
	  </m:apply>
	</m:math>
      </equation>

      and is plotted on the top in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="voicetransfer" strength="7"/>.  If we had, for example, a male speaker with
      about a 110 Hz pitch
      (<m:math>
	<m:apply>
	  <m:approx/>
	  <m:ci>T</m:ci>
	  <m:apply>
	    <m:times/>
	    <m:cn>9.1</m:cn>
	    <m:ci>ms</m:ci>
	  </m:apply>
	</m:apply>
      </m:math>) saying the vowel "oh", the spectrum of his speech
	<emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">predicted by our model</emphasis> is shown
	 in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="voicespect" strength="8"/>.
    </para>

    <figure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="voicetransfer" orient="vertical"><name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">voice spectrum</name>
      <subfigure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="throut">
	<name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">pulse</name>
	<media xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" type="image/png" src="spectrum3.png"/>
	<caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	  The vocal cords' output spectrum
	  <m:math>
	    <m:apply>
	      <m:ci type="fn">
		<m:msub>
		  <m:mi>P</m:mi>
		  <m:mi>T</m:mi>
		</m:msub>
	      </m:ci>
	      <m:ci>f</m:ci>
	    </m:apply>
	  </m:math>.
	</caption>
      </subfigure>
      <subfigure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="voicespect">
	<name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">voice spectrum</name>
	<media xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" type="image/png" src="spectrum7.png"/>
	<caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	  The vocal tract's transfer function,
	  <m:math>
	    <m:apply>
	      <m:ci type="fn">
		<m:msub>
		  <m:mi>H</m:mi>
		  <m:mi>V</m:mi>
		</m:msub>
	      </m:ci>
	      <m:ci>f</m:ci>
	    </m:apply>
	  </m:math> and the speech spectrum.
	</caption>
      </subfigure>
      <caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/"> The vocal tract's transfer function,
	shown as the thin, smooth line, is superimposed on the
	spectrum of actual male speech corresponding to the sound
	"oh."  The pitch lines corresponding to harmonics of the pitch frequency are indicated.
      </caption>
    </figure>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="ModelSpectrum">
      The model spectrum idealizes the measured spectrum, and captures
      all the important features. The measured spectrum certainly
      demonstrates what are known as <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">pitch lines</term>, and we
      realize from our model that they are due to the vocal cord's
      periodic excitation of the vocal tract. The vocal tract's
      shaping of the line spectrum is clearly evident, but difficult
      to discern exactly, especially at the higher frequencies. The model transfer function for the vocal
      tract makes the formants much more readily evident.
    </para>

    <exercise xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1fast">
      <problem xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	<para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1a">
	  The Fourier series coefficients for speech are related to
	  the vocal tract's transfer function only at the frequencies
	  <m:math>
	    <m:apply>
	      <m:divide/>
	      <m:ci>k</m:ci>
	      <m:ci>T</m:ci>
	    </m:apply>
	  </m:math>,
	  <m:math>
	    <m:apply>
	      <m:in/>
	      <m:ci>k</m:ci>
	      <m:set>
		<m:cn>1</m:cn>
		<m:cn>2</m:cn>
		<m:ci>…</m:ci>
	      </m:set>
	    </m:apply>
	  </m:math>; see <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0042" target="pulsespec" strength="4">previous result</cnxn>. Would male or female
	  speech tend to have a more clearly identifiable formant
	  structure when its spectrum is computed? Consider, for
	  example, how the spectrum shown on the right in <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="voicetransfer" strength="7"/> would change if the
	  pitch were twice as high
	  (<m:math>
	    <m:apply>
	      <m:ci><m:mo>≈</m:mo></m:ci>
	      <m:cn>300</m:cn>
	    </m:apply>
	    <m:ci>Hz</m:ci>
	  </m:math>).
	</para>
      </problem>
      <solution xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">
	<para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="exer1b">
	  Because males have a lower pitch frequency, the spacing
	  between spectral lines is smaller. This closer spacing more
	  accurately reveals the formant structure. Doubling the pitch
	  frequency to 300 Hz for <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="voicetransfer" strength="4"/> would amount to removing every other spectral
	  line.
	</para>
      </solution>
    </exercise>


    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="VoicePicture">
      When we speak, pitch and the vocal tract's transfer function are
      not static; they change according to their control signals to
      produce speech. Engineers typically display how the speech
      spectrum changes over time with what is known as a
      <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0505" strength="8">spectrogram</cnxn> <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="spectrogram" strength="5"/>.
    <figure xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="spectrogram">
      <name xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">spectrogram</name> <media xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" type="image/png" src="spectrum8.png"/> <caption xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Displayed is the spectrogram of
      the author saying "Rice University." Blue indicates low
      energy portion of the spectrum, with red indicating the most
      energetic portions. Below the spectrogram is the time-domain
      speech signal, where the periodicities can be seen.
      </caption>
    </figure>
      Note how the line spectrum, which indicates how the pitch
      changes, is visible during the vowels, but not during the
      consonants (like the <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">ce</emphasis> in "Rice").
    </para>
    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="VoiceEnergy">
      The fundamental model for speech indicates how engineers use the
      physics underlying the signal generation process and exploit its
      structure to produce a systems model that suppresses the physics
      while emphasizing how the signal is "constructed."  From
      everyday life, we know that speech contains a wealth of
      information. We want to determine how to transmit and receive
      it. Efficient and effective speech transmission requires us to
      know the signal's properties and its structure (as expressed by
      the fundamental model of speech production). We see from <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" target="spectrogram" strength="5"/>, for example, that speech
      contains significant energy from zero frequency up to around
      5 kHz.
    </para>

    <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="effective">
     <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Effective</emphasis> speech transmission systems must
      be able to cope with signals having this bandwidth. It is
      interesting that one system that does <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">not</emphasis>
      support this 5 kHz bandwidth is the telephone: Telephone
      systems act like a <term xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">bandpass filter</term> passing energy
      between about 200 Hz and 3.2 kHz. The most important
      consequence of this filtering is the removal of high frequency
      energy. In our sample utterance, the "ce" sound in "Rice""
      contains most of its energy above 3.2 kHz; this filtering
      effect is why it is extremely difficult to distinguish the
      sounds "s" and "f" over the telephone.  Try this yourself: Call
      a friend and determine if they can distinguish between the words
      "six" and "fix".  If you say these words in isolation so that no
      context provides a hint about which word you are saying, your
      friend will not be able to tell them apart.  Radio does support
      this bandwidth (see more about <cnxn xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" document="m0518" strength="5">AM and FM radio systems</cnxn>).
   </para>

   <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="efficient">
      <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Efficient</emphasis> speech transmission systems
      exploit the speech signal's special structure: What makes speech
      speech?  You can conjure many signals that span the same
      frequencies as speech—car engine sounds, violin music, dog
      barks—but don't sound at all like speech. We shall learn
      later that transmission of <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">any</emphasis> 5 kHz
      bandwidth signal requires about 80 kbps (thousands of bits
      per second) to transmit digitally. <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">Speech</emphasis>
      signals can be transmitted using less than 1 kbps because
      of its special structure. To reduce the "digital bandwidth" so
      drastically means that engineers spent many years to develop
      signal processing and coding methods that could capture the
      special characteristics of speech without destroying how it
      sounds. If you used a speech transmission system to send a
      violin sound, it would arrive horribly distorted; speech
      transmitted the same way would sound fine.
  </para>

  <para xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="conclude">
      Exploiting the special structure of speech requires going beyond
      the capabilities of analog signal processing systems.  Many
      speech transmission systems work by finding the speaker's pitch
      and the formant frequencies.  Fundamentally, we need to do more
      than filtering to determine the speech signal's structure; we
      need to manipulate signals in more ways than are possible with
      analog systems.  Such flexibility is achievable (but not without
      some loss) with programmable <emphasis xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/">digital</emphasis>
      systems.
    </para>

  </content>
</document>
