Given the difficulties mentioned in the above paragraph, it became quite evident that any voice analysis in the time domain would be extremely impractical. Instead, an analysis of the frequency spectrum of a voice (which remains predominantly unchanged as speech is slightly varied) turned out to be a more viable option. Converting all recordings into the frequency domain (by applying the Discrete Fourier Transform) greatly simplified the process of comparing two recordings. That being said, working in the frequency domain also introduced a new set of issues that required attention.
Due to the nature of human speech, all data pertaining to frequencies above 600 Hz can safely be discarded. Therefore, once a recording is converted into the frequency domain, it can be regarded simply as a vector in 600-dimensional Euclidean space. At this point, a comparison between two vectors can easily be carried out by normalizing the vectors (giving them length 1) and then computing the norm of the difference between the two (of course, the difference between two vectors in R^600 is computed by subtracting componentwise). Unfortunately, exactly which norm to use is not immediately clear.
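The comparison described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the original implementation: the function names `spectrum_vector` and `spectral_distance` are hypothetical, the recording is assumed to be a one-dimensional array of samples, and it is cropped or zero-padded to exactly one second so that each DFT bin is 1 Hz wide and the first 600 bins give the 600 components.

```python
import numpy as np

def spectrum_vector(samples, sample_rate, n_bins=600):
    """Return a unit-length magnitude spectrum with one component per Hz.

    Assumption: sample_rate is an integer number of samples per second and
    is large enough (above about 1200) that at least n_bins bins exist.
    """
    samples = np.asarray(samples, dtype=float)
    # Using n = sample_rate crops or zero-pads the input to one second,
    # which makes the spacing between DFT bins exactly 1 Hz.
    magnitudes = np.abs(np.fft.rfft(samples, n=sample_rate))
    vec = magnitudes[:n_bins]          # discard the higher frequencies
    length = np.linalg.norm(vec)       # Euclidean length of the vector
    return vec / length if length > 0 else vec

def spectral_distance(u, v):
    """Norm of the componentwise difference (Euclidean norm used here)."""
    return np.linalg.norm(u - v)
```

For example, two pure tones at 200 Hz and 300 Hz yield unit vectors with no frequencies in common, so their distance is close to the square root of 2, while any recording compared with itself gives a distance of 0.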
After carefully comparing the Taxicab, Euclidean, and Maximum norms, it became clear that the Euclidean norm most accurately measured the closeness of different frequency spectra. Once the norm was chosen, all that remained was to decide exactly how small the norm of the difference of two vectors had to be in order to conclude that both recordings originated from the same person.
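To make the three candidates concrete, the following sketch computes each norm of the difference of two vectors with NumPy. The helper name `norm_distances` is hypothetical and is meant only to show how the norms differ, not how the comparison was originally carried out.

```python
import numpy as np

def norm_distances(u, v):
    """Return the Taxicab, Euclidean, and Maximum norms of u - v."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return {
        "taxicab": np.linalg.norm(diff, ord=1),         # sum of |components|
        "euclidean": np.linalg.norm(diff, ord=2),       # square root of sum of squares
        "maximum": np.linalg.norm(diff, ord=np.inf),    # largest |component|
    }

# The difference here is (3, -4): Taxicab 7, Euclidean 5, Maximum 4.
print(norm_distances([3.0, 0.0], [0.0, 4.0]))
```

The Taxicab norm accumulates every small discrepancy across all components, the Maximum norm sees only the single worst component, and the Euclidean norm sits between the two.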
Recall that Chebyshev's Inequality implies, in particular, that at least 3/4 of all measurements from the same population fall within 2 standard deviations of the mean. Hence, in response to the problem posed at the end of the previous paragraph, the following solution can be formulated:
By requiring that the norm of the difference fall within 2 standard deviations of the speaker's average voice, we are assured that the algorithm will recognize that voice correctly at least 3/4 of the time.
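One possible reading of this rule is sketched below. It rests on several assumptions that the text does not spell out: the "average voice" is taken to be the renormalized mean of several enrollment vectors from one speaker, the standard deviation is that of the distances from the enrollment vectors to this average, and a recording is accepted whenever its distance is at most the mean distance plus 2 standard deviations (a one-sided test, which accepts everything the two-sided test would). The names `fit_speaker` and `same_speaker` are hypothetical.

```python
import numpy as np

def fit_speaker(enrollment_vectors, k=2.0):
    """Build an average-voice template and an acceptance threshold.

    enrollment_vectors: rows are unit-length spectrum vectors of one speaker.
    """
    X = np.asarray(enrollment_vectors, dtype=float)
    template = X.mean(axis=0)
    # Renormalize so the template has length 1 like the recordings.
    # (Assumes the mean is nonzero, which holds for magnitude spectra.)
    template = template / np.linalg.norm(template)
    dists = np.linalg.norm(X - template, axis=1)   # Euclidean distances
    threshold = dists.mean() + k * dists.std()     # mean + k standard deviations
    return template, threshold

def same_speaker(vector, template, threshold):
    """Accept the recording if it lies close enough to the template."""
    dist = np.linalg.norm(np.asarray(vector, dtype=float) - template)
    return bool(dist <= threshold)

# Tiny 3-dimensional illustration (real vectors would have 600 components).
enrolled = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.8, 0.0, 0.6]]
template, threshold = fit_speaker(enrolled)
print(same_speaker([1.0, 0.0, 0.0], template, threshold))  # an enrolled recording
print(same_speaker([0.0, 1.0, 0.0], template, threshold))  # a very different spectrum
```

Note that the 3/4 figure bounds only how often a genuine recording of the speaker is accepted; it says nothing about how often a different speaker is rejected, which depends on how far apart the speakers' spectra actually are.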