Skip to content Skip to navigation

OpenStax_CNX

You are here: Home » Content » Digital Song Identification Using Frequency Analysis

Navigation

Recently Viewed

This feature requires Javascript to be enabled.
 

Digital Song Identification Using Frequency Analysis

Module by: Dante Soares, John Thompson, Yilong Yao, Shahzaib Shaheen. E-mail the authors

Summary: This module is the final report of the ELEC 301 project done by the group Curtis Thompson, Dante Soares, Shahzaib Shaheen and Yilong Yao. In this project, we created a Matlab program capable of identifying a noisy segment of a song recorded from a set of songs in a database.

Introduction

Imagine sitting at a café (or “other” public venue) and you hear a song playing on the stereo. You decide that you really like it, but you don’t know the name of the song. There’s a solution for that. Software song identification has been a topic of interest for years. However, it is computationally difficult to tackle this problem using conventional algorithms. Frequency analysis provides for a fast and accurate solution to this problem, and we decided to use this analysis to come up with a fun project idea. The main purpose of our project was to be able to accurately match a noisy song segment with a song in our song library. The company Shazam was our main inspiration and we started out by studying how Shazam works.

The Fingerprint of a Song

Just like how every individual has a unique fingerprint that can be used to distinguish one person from another, our algorithm creates a digital fingerprint for each song that can be used to distinguish two songs. The song’s fingerprint consists of list of time-frequency pairs that uniquely represent all the significant peaks in the song’s spectrogram. To assure accurate matching between two fingerprints, our algorithm needs to take into account the following issues when choosing peaks for the fingerprint:

  • Uniqueness – The fingerprint of each song needs to be unique to that one song. Fingerprints of different songs need to be different enough to be easily distinguished by our scoring algorithm.
  • Sparseness – The computational time of our matched filter depends on the amount of data in each song’s fingerprint. Thus each fingerprint needs to sparse enough for fast results, but still contain enough information to provide accurate matches.
  • Noise Resistance – Song data may contain large amounts of background noise. The fingerprinting algorithm must be able to differentiate between the signal and added noise, storing only the signal information in the fingerprint.

These criteria are all met by identifying major peaks in the song’s spectrogram. The following section describes the fingerprinting algorithm in more detail.

The Fingerprint Finding Algorithm

Filtering and Resampling

After the song data is imported, the data is high-pass filtered using a 30th order filter with a cutoff frequency around 10KHz. Filtering is used because the higher frequencies in songs are more unique to each individual song. The bass and other lower frequencies, however, tends to overshadow these frequencies, thus the filter is used make fingerprint include more high frequencies points. Testing has shown that the algorithm has a much easier time distinguishing songs after they are high-pass filtered.

The signal is then resampled to 8000 samples per second in order to reduce the number of columns in the spectrogram. This will speed up later computations but still leaves enough resolution in the data for accurate results.

The Spectrogram

The spectrogram of the signal is then taken in order to view the frequencies present in each time slice. The spectrogram below is from a 10 second noisy recording.

Figure 1: The effect of the low-pass filter is clearly visible in the spectrogram. However, local maxima in the low frequencies still exist and will still show up in the fingerprint.
A spectrogram of a signal that has been high-pass filtered. Frequencies on the lower half of the spectrogram are now negligible compared to higher frequencies.

Each vertical time slice in the bin is then analyzed for prominent local maxima as described in the next section.

Finding the Local Maxima

In the first time slice, the five greatest local maxima are stored as points in the fingerprint. Then a threshold is created by convolving these five maxima with a Gaussian curve, creating a different value for the threshold at each frequency. An example threshold is shown in the figure below. The threshold is used to spread out the data stored in the fingerprint, since peaks that are close in time and frequency are stored as one point.

Figure 2: The initial threshold, formed by convolving the peaks in the first time slice with a Gaussian curve.
An amplitude versus frequency plot that shows an approximately gaussian peak around the 100th frequency bin.

For each of the remaining time slices, up to five local maxima above the threshold are added to fingerprint. If there are more than five maxima, then the five greatest in amplitude are chosen. The threshold is then updated by adding new Gaussian curves centered at the frequencies of the newly found peaks. Finally the threshold is scaled down so that it decays exponentially over time. The following figure shows how the threshold changes over time.

Figure 3: The threshold increases whenever a new peak is found, since we add a Gaussian that peak’s frequency, and decays exponentially over time.
A plot similar to the spectrogram shown above, but with all visible frequencies made wider by convolution with Gaussians.

The final list of the time and frequencies of the local maxima above the threshold are returned as the song’s fingerprint.

The Resulting Fingerprint

The following is the fingerprint of the sample signal from the examples above.

Figure 4: The fingerprint of the 10 second segment from the previous examples
A 2D stem plot of frequency vs time that shows a 1 (stem) where there is a peak and nothing otherwise.

From the graph, it is easy to see patterns and different notes in the song. Lets see how the algorithm addresses the three issues identified in the first paragraph:

  • Uniqueness – The algorithm only stores the prominent peaks in the spectrogram. Different songs have a different pattern of peaks in frequency and time, thus each song will have a unique fingerprint.
  • Sparseness – The algorithm only picks up at most five peaks per time slice. This limits the number of peaks in the resulting fingerprint. The threshold spreads out the positions of peaks so that the fingerprint is more representational of the data.
  • Noise Resistance – Unless the background noise is loud enough to create peaks greater than the peaks present in the song, then very little noise will show up in the fingerprint. Also, a ten second segment has around 6000 data points, so a matched filter will be able to detect a match between two fingerprints, even with a reasonable amount of added noise.

The next section will detail the process used to compare the fingerprint of the song segment to the fingerprints of the songs in the library.

Matched Filter for Spectrogram Peaks

In order to compare songs, we can generate match scores for them using a matched filter. We wanted a filter capable of taking the spectral peaks information generated by the fingerprint finding algorithm for two different songs and produce a single number that would tell us how much the two songs being compared look alike. We wanted this filter to be as insensitive as possible to noise and produce a score that is independent of the length of each recording.

Our approach to this was completely different from that used by the creators of Shazam, as we did not use Hash tables at all and did not combine the peaks into pairs limited by certain regions, as they did. In the end we still managed to get very good accuracy and decent performance by using a matched filter.

The Matched Filter algorithm

Preparation

Before filtering, we take the lists of spectral peaks that is the output of the landmarks generator algorithm and generate matrices that are the same size as the spectrograms, with the peaks replaced by 1’s in their respective positions and all other points replaced by 0’s. At some point during our project we had the idea of convolving this matrix with a Gaussian curve, in order to allow peaks to match somewhat if they were shifted only slightly. However, we later determined that even a very small Gaussian would worsen our noise resistance, so this idea was dropped. So basically now we have one map for each song that shows the position in time and frequency bins of all peaks. Next we normalize these matrices using their Frobenius norm. This ensures that the final score is normalized. Then we apply the matched filter which basically consists of flipping one of the matrices and convolving them, which is done by zero padding them both to the proper size and multiplying their 2D FFT’s, for speed. The result is a cross correlation matrix, but we still need to extract a single number from it to be our match score.

Extracting Information from the Cross Correlation Matrix

Through much testing, we determined that the most accurate and noise-resistant measure of the match was simply taking the global maximum of the result. Other approaches that we tried, such as taking the trace of the XTX or the sum of the global maxima for each row or column, had much more frequent mismatches. Taking just the global maximum of the whole matrix was simple and extremely effective.

When looking at test results, however, we saw that the score still had a certain dependency on the size of the segments being compared. Through more testing, we determined that this dependency looked approximately like a dependency on the square root of the ratio of the lower number of peaks by the higher number of peaks, when testing with a noiseless fragment of a larger song. This can be seen in this plot:

Figure 5: A plot showing the score of a song fragment that should perfectly match the song it was taken from, seen without correcting the square root dependency mentioned above
A stem plot of score versus number of peaks that shows a square root-like increase as the number of peaks increases.

In the plot above, the original segment has 6915 peaks and the fragment was tested with between 100 and 5000 peaks, in intervals of 100. Since smaller sample sizes usually lead to having fewer peaks, we had to get rid of this dependency. To prevent the square root growth of the scores, the final score is multiplied by the inverse of this square root, yielding a match score that is approximately independent of sample size. This can be seen in the next stem plot, made with the same segments as the first:

Figure 6: The same plot shown before, but with the square root dependency on number of peaks removed
A stem plot of score versus number of peaks that stays constant at 1.

So clearly this allows us to get better match scores with small song segments. After this process, we had a score that was approximately independent of segment size, normalized and could tell apart matches and mismatches, even with lots of noise. All that was left was to test it against different sets of data and set a threshold for distinguishing between matches and non-matches.

Setting a Threshold

The filter’s behavior proved to be very consistent. Perfect matches (trying to match a segment with itself) always got scores of 1. Matching noiseless segments to the whole song usually yielded scores in the upper .8’s or in the .9’s, with a few rare exceptions that could have been caused by a bad choice of segment, such as a segment with a long period of silence, for example. Noisy segments usually gave us low scores such as in the .1’s, but more importantly mismatches were even lower, in the .05’s to .07’s or so. This allowed us to set a threshold for determining when we have a match or not.

During our testing, we considered using a statistical approach to set the threshold. For example, if we wanted a 95% certainty that a song matched, we could require the highest match score to be greater than 1.66*[σ/sqrt(n)] + µ, where σ is the standard deviation, n is the sample size and µ is the mean. However, with our very small sample size, this threshold seemed to yield inaccurate results, so the simple threshold criterion of the highest match having to be at least 1.5 times the second highest in order to be considered a match was used.

Similarities and Differences from Shazam’s Approach

Even though we followed the ideas in the paper by Wang, we still had some significant differences from the approach used by Shazam. We followed the ideas they had for fingerprint creation, to a certain extent, however the company uses hash tables instead of matched filters to perform the comparison. While evidently faster than using a matched filter, hash tables are not covered in ELEC 301. Furthermore, when making a hash, Wang says they combine several points in an area with an anchor point and pair them up combinatorially. This allows the identification of a time offset to be used with the hash tables and makes the algorithm even faster and more robust. Perhaps investigating this would be an interesting extension of the project, if we had more time.

Project Results

The final step in the project was to test the algorithm we had created so we went ahead and conducted a series of tests that would evaluate mostly correctness but also, to some extent, performance.

Testing

First, we wanted to test to make sure that our algorithm was working properly. To do this, we attempted to match short segments of the original song (i.e. “noiseless”, actual copies of the library songs) of approximately ten seconds in length. The table below shows how these original clips matched. The titles from left to right are song segments, and titles running from top to bottom are library songs. We abbreviated them from the original, so they would fit in the matrix. The original names are “Stop this Train”, by John Mayer, “Semi-Charmed Life”, by Third Eye Blind, “I’ve got a Feeling” by Black Eyed Peas, “Love Like Rockets”, by Angels and Airwaves, “Crash Into Me”, by Dave Matthews Band and “Just Another Day in Paradise”, by Phil Vassar.

Figure 7: This matrix shows the match score results of the six noiseless recordings made from fragments of songs in the database, each of them compared to all songs in the database
A matrix that shows the test score results of matching 6 noiseless song segments against a database of 6 songs. Matches have scores really close to 1 and non-matches are closer to 0.1.

The clear matches with highest scores can be seen along the diagonal. Most of these are close to 1, and each match meets our criteria of being 1.5 times greater than the other scores (comparing horizontally.) This was a good test that we were able to use to modify our algorithm and try different techniques. Ultimately, the above results showed that our code was sufficient for our needs.

We then needed to see if our code actually worked with real world (noisy) song segments. Songs were recorded on an iPhone simultaneously with various types of noise as follows: Train- low volume talking, Life- loud recording (clipping), Crash- typing, Rockets- repeating computer error noise, Feeling- Gaussian noise (added in Matlab to wav file), and Paradise- very loud talking. There were two additional songs we used in this test to check for robustness and proper matching. One is a live version of Crash, which includes a lot of crowd noise but does not necessarily have all the identical features of the original Crash fingerprint. The other additional song, “Yellow”, by Coldplay, is a song that is not in our library at all.

Figure 8: This matrix shows the match score results of the six noisy recordings made from fragments of songs in the database, plus a live version of a song in the database and another song entirely not in the database
A matrix that shows the score results of 48 tests, matching 8 song segments against a database of 6 songs. The scores are not as good as the previous matrix, but matches always have higher scores than non-matches.

Again, the clear matches are highlighted in yellow along the diagonal. The above results show that our algorithm can still accurately match the song segments in more realistic conditions. The graph below shows more interesting results.

Figure 9: This plot is a visual representation of the results matrix seen above
A bar plot that represents the matrix above.

Conclusions

As before, the matches in the first six songs (from left to right) are obvious, and Yellow does not show any clear correlation to any library song, as desired, but the live version of Crash presents an interesting question. Do we actually want this song to match? Since we wanted our fingerprinting method to be unique to each song and song segment, we decided it would be best to have a non-match in this scenario. However, if one observes closely, it can be seen that the closest match (though it is definitely not above the 1.5 mark) is, in fact, matching to the original Crash. This emerges as a small feature of our results. This small “match” says that although we may not match any songs in the library, we can tell you that this live version most resembles the original Crash version, which may be a desirable outcome if we were to market this project.

We were amazed that the final filter could perform so well. The idea of completely ignoring amplitude information in the filter came from the paper by Avery Li-Chun Wang, one of Shazam’s developers. As he mentions, discarding amplitude information makes the algorithm more insensitive to equalization. However, this approach also makes it more noise resistant since, since what we do from there on basically consists of counting matching peaks versus non-matching peaks. Any leftover noise will count very little towards the final score, as the number of peaks per area in the spectrogram is limited by the thresholding algorithm and all peaks have the same magnitude in the filter.

About the Team

Team Members

  • Dante Soares is a Junior ECE student at Martel. He is specializing in Computer Engineering.
  • Yilong Yao is a Junior ECE student at Sid Richardson. He is specializing in Computer Engineering.
  • Curtis Thompson is a Junior ECE at Sid Richardson College. He is specializing in Signals and Systems.
  • Shahzaib Shaheen is a Junior ECE at Sid Rich. He is specializing in Photonics.

Special Thanks

We would like to thank Eva Dyer for her help with the algorithm and her feedback on the poster presentation.

Sources

Dan, Ellis. "Robust Landmark-Based Audio Fingerprinting." Lab ROSA. Columbia University, 7 June 2006. Web. 6 Dec. 2009. <http://labrosa.ee.columbia.edu/>. [link]

Dyer, Eva. Personal interview. 13 Nov. 2009.

Fiona, Harvey. "Name That Tune." Scientific American June 2003: 84-6. Print.

Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Bellingham: Society of Photo-Optical Instrumentation Engineers, 2003. Print. SPIE proceedings series.

Content actions

Download module as:

PDF | EPUB (?)

What is an EPUB file?

EPUB is an electronic book format that can be read on a variety of mobile devices.

Downloading to a reading device

For detailed instructions on how to download this content's EPUB to your specific device, click the "(?)" link.

| More downloads ...

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks