Speak and Sing - Time Scaling with WSOLA

Module by: Sam Soundar, Alysha Jeans, Graham Houser, Matt Szalkowski. E-mail the authors

Summary: This module describes the WSOLA algorithm as a method to modify the time scale of a speech signal.

Introduction

Time-scale modification has many applications, ranging from audio-video synchronization in film post-production to voicemail playback. It is the process of speeding up or slowing down the apparent rate of speech without corrupting other characteristics of the signal, such as pitch and voice quality. Resampling is out of the question because it directly modifies pitch, and the loss in voice quality is often significant. To maintain these characteristics, the short-time Fourier transforms of corresponding regions of the original (input) and scaled (output) signals should be very similar. Overlap-add algorithms achieve this by cutting smoothly windowed chunks out of the input signal, repositioning them at the corresponding time indices in the output signal, overlapping the windows to achieve continuity, and adding. WSOLA (waveform similarity overlap-add) differs from other overlap-add algorithms in that it not only maintains local Fourier similarity in a time-scaled fashion but, more importantly, chooses each excised segment to be similar to the segment adjacent to the previously excised segment. This makes WSOLA a robust time-scaling algorithm that works even in the presence of noise or competing voices in the input speech signal.

The Algorithm

The first step is to choose a smooth window, such as a Hann (Hanning) window, for windowing the input signal. Let w(n) be the window. Then establish a time-warp function τ(n) such that, for an index n in the input signal, τ(n) is the corresponding time-scaled index in the output signal.
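As a minimal sketch of these two ingredients, the following Python builds a Hann window and a constant-factor time warp. The function names `hann` and `make_linear_warp` and the scale factor `alpha` are hypothetical, chosen for illustration; the text itself allows any time-warp function, and the linear one is only the simplest case.

```python
import math

def hann(length):
    """Periodic Hann window: a smooth taper that is 0 at the start and 1 in the middle."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def make_linear_warp(alpha):
    """Return tau and its inverse for a constant scale factor alpha (assumed linear warp)."""
    def tau(n):
        # Input index -> time-scaled output index.
        return alpha * n

    def tau_inv(m):
        # Output index -> corresponding input index.
        return m / alpha

    return tau, tau_inv

w = hann(8)
# alpha = 2 maps input index n to output index 2n, so the output is twice as long.
tau, tau_inv = make_linear_warp(2.0)
print(w[0], w[4])
print(tau(100), tau_inv(200))
```

With this convention, a factor above 1 slows the speech down and a factor below 1 speeds it up.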

The input signal is then windowed so that each segment overlaps half of the previous segment. The first windowed segment of the input is copied to the first segment of the output without regard to the time-warp function. Call the location in the input of the last copied segment S1. The algorithm now needs to find the next segment, which it will copy, overlap, and add to the current output signal.
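The half-overlap choice matters because two Hann windows offset by half their length sum to one, so overlapping and adding unmodified segments gives back the original samples. The sketch below checks this on a test tone; the helper name `overlap_add_identity` and the window length of 80 are assumptions made for the example.

```python
import math

def hann(length):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def overlap_add_identity(x, win_len):
    """Cut x into half-overlapping Hann-windowed segments and add them back in place."""
    hop = win_len // 2
    w = hann(win_len)
    y = [0.0] * len(x)
    for start in range(0, len(x) - win_len + 1, hop):
        for k in range(win_len):
            y[start + k] += w[k] * x[start + k]
    return y

x = [math.sin(0.1 * n) for n in range(400)]
y = overlap_add_identity(x, win_len=80)

# Away from the first and last half-window, every sample is covered by
# two windows whose weights sum to 1, so the input is reproduced.
max_err = max(abs(a - b) for a, b in zip(x[40:360], y[40:360]))
print(max_err)
```

Time scaling keeps this overlap-add structure in the output and changes only which input segment is placed at each output position.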

There are several ways to find this next segment. Let S2 be the location in the output where the next segment will be placed. The most obvious method is to copy the segment at τ⁻¹(S2) in the input to S2 in the output. However, this would wreak havoc on the phase synchronicity of the signal. A second method is to copy a phase-synchronous segment, so that overlapping and adding does not cause large phase differences between the two segments. This would maintain Fourier characteristics but would sound very choppy at syllable boundaries in the speech. The WSOLA algorithm instead looks for a segment near τ⁻¹(S2) that is most similar to the segment at τ⁻¹(S2) + length(w). The chosen segment must be near (within a threshold of) the index given by the time-warp function but also similar in Fourier characteristics to the next segment in the input signal. In other words, WSOLA finds a segment near the time-scaled index that is very similar to the next naturally occurring segment in the input signal.
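The phase problem with naive copying can be seen on a pure tone. The sketch below overlaps the tail of one windowed segment with the head of a second segment and compares the energy of the overlap when the second segment is the natural continuation against the case where it starts half a period later. The tone period, window length, and the helper `overlap_energy` are all assumptions made for this illustration.

```python
import math

def hann(length):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

PERIOD = 40          # samples per cycle of the test tone
WIN = 80             # window length; the hop is half of it
HOP = WIN // 2
x = [math.sin(2.0 * math.pi * n / PERIOD) for n in range(400)]
w = hann(WIN)

def overlap_energy(second_start):
    """Energy of the overlap when the segment at input 0 is followed by the one at second_start."""
    total = 0.0
    for k in range(HOP):
        first = w[HOP + k] * x[HOP + k]        # tail of the segment copied from input position 0
        second = w[k] * x[second_start + k]    # head of the next copied segment
        total += (first + second) ** 2
    return total

aligned = overlap_energy(HOP)                   # natural continuation: phases match
misaligned = overlap_energy(HOP + PERIOD // 2)  # half a period off: phases oppose
print(aligned, misaligned)
```

The misaligned overlap partly cancels and loses energy relative to the aligned one, which is the kind of artifact the similarity search is meant to avoid.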

There are many ways to define "most similar." The simplest is the Euclidean distance between the two segments, but computing that distance at every candidate position is computationally expensive, O(N²). The cross-correlation between the next adjacent segment and the region of interest is the natural alternative. Computed in the time domain, the normalized cross-correlation is equally expensive; computed in the frequency domain, it is much faster, O(N log N). The peak of the normalized cross-correlation occurs at the point of highest similarity. The segment corresponding to this point is taken as the next segment, copied to the output signal, overlapped with the existing signal, and added. This is repeated until the entire output signal has been created.
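The whole loop can be sketched as below. This is an illustrative sketch, not the authors' MATLAB implementation, and it makes several assumptions: a constant scale factor `alpha` (so τ⁻¹(S2) = S2 / alpha), a template taken as the segment starting one hop after the previously copied segment (one reading of "the next naturally occurring segment"), and a direct time-domain search over a tolerance `tol` for brevity, where a frequency-domain correlation would be the faster choice for long search regions. All function and parameter names are hypothetical.

```python
import math

def hann(length):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def normalized_correlation(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return dot / norm if norm > 0.0 else 0.0

def wsola(x, alpha, win_len=160, tol=40):
    """Time-scale x by a constant factor alpha; assumes len(x) >= win_len."""
    hop = win_len // 2
    w = hann(win_len)
    out_len = int(alpha * len(x))
    y = [0.0] * out_len
    last_start = len(x) - win_len      # latest input index a segment can start at

    # The first segment is copied straight across, ignoring the time warp.
    for k in range(min(win_len, out_len)):
        y[k] += w[k] * x[k]
    prev = 0                           # S1: input position of the last copied segment

    out_pos = hop                      # S2: output position of the next segment
    while out_pos + win_len <= out_len and prev + hop <= last_start:
        # Template: the segment that naturally follows the last copied one.
        template = x[prev + hop : prev + hop + win_len]
        # Ideal input position from the inverse time warp.
        target = int(round(out_pos / alpha))
        lo = max(0, min(target - tol, last_start))
        hi = max(0, min(target + tol, last_start))
        # Choose the candidate near the target that best matches the template.
        best = max(range(lo, hi + 1),
                   key=lambda c: normalized_correlation(x[c : c + win_len], template))
        for k in range(win_len):
            y[out_pos + k] += w[k] * x[best + k]
        prev = best
        out_pos += hop
    return y

x = [math.sin(2.0 * math.pi * n / 50.0) for n in range(2000)]
slow = wsola(x, alpha=2.0)   # twice as long
fast = wsola(x, alpha=0.5)   # half as long
print(len(slow), len(fast))
```

On this test tone the stretched output keeps the original 50-sample period, since each chosen segment lines up in phase with what was already written. The final few windows of the output may be left unfilled when the input runs out, which a fuller implementation would need to handle.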

This algorithm accepts arbitrary time-warp functions and time-scales a signal while reliably maintaining Fourier characteristics. It is also less computationally expensive than simple upsampling or downsampling for non-integer scaling factors. In fact, sampling-rate modification cannot easily be done for irrational scaling factors, but this algorithm handles those as well.

Implementation

This algorithm was implemented in MATLAB and achieved good results even with arbitrary constant time-warp functions. Constants ranging from 0.1 to 10 were tested with satisfactory results.

Figure 1: Short sample of a speech signal

Figure 2: Time-scaled version of the previous speech signal

Figure 3: Fourier transform of the speech signal

Figure 4: Fourier transform of the time-scaled speech signal
