No limitations section is complete without some mention of how to surpass those limitations. The goal of any project is to refine the product until no further refinement is possible. Because this is impossible in practice, we list a set of next steps that we, or others who follow, may be encouraged to take.
To intelligently detect the relative volume of noise in a given sample, one would be best served by creating a statistical filter that recognizes random noise. This filter would, in theory, identify the windows that most resemble random noise. Knowing which windows contain only noise, one could derive the volume level (read: power level) associated with that noise and set the threshold at some point above it. The upper bound of the threshold would be the lowest power value of any window that the statistical filter does not classify as noise.
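The idea above can be sketched in a few lines. This is a minimal illustration, not a tested design: it assumes spectral flatness (the ratio of the geometric to the arithmetic mean of the power spectrum) as the statistical filter, and the names `spectral_flatness`, `noise_threshold_bounds`, `window_len` and `flatness_cutoff` are hypothetical, as is the cutoff value of 0.3.

```python
import numpy as np

def spectral_flatness(windows, eps=1e-12):
    """Flatness of each window's power spectrum; higher values look more like white noise."""
    spectrum = np.abs(np.fft.rfft(windows, axis=1)) ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(spectrum), axis=1))
    arithmetic_mean = np.mean(spectrum, axis=1)
    return geometric_mean / arithmetic_mean

def noise_threshold_bounds(signal, window_len=1024, flatness_cutoff=0.3):
    """Return (lower, upper) bounds for a power threshold, or None if undecidable.

    lower: highest power among windows classified as noise
    upper: lowest power among windows classified as non-noise
    """
    n_windows = len(signal) // window_len
    windows = np.reshape(signal[:n_windows * window_len], (n_windows, window_len))
    power = np.mean(windows ** 2, axis=1)            # mean power per window
    is_noise = spectral_flatness(windows) > flatness_cutoff
    if is_noise.all() or not is_noise.any():
        return None                                  # need both classes to bracket a threshold
    return float(power[is_noise].max()), float(power[~is_noise].min())
```

Any threshold chosen between the two returned values separates the noise-only windows from the rest. If the lower value exceeds the upper one, the flatness test has not separated the classes cleanly and a better filter is needed.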
Threshold detection for specific instruments is more complicated. Our suggestion is to develop some method of correlation or detection as yet unknown to these authors (but likely known to those who research these concepts). Such a method would likely match frequency-domain signals rather than time-domain signals, using some statistical algorithm. In other words, it would matched-filter two frequency-domain representations against each other: a sort of meta-Matched Filter in terms of FFTs.
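As a rough illustration of matching in the frequency domain, the sketch below correlates the normalized magnitude spectra of a signal and a reference recording. It is only one possible reading of the suggestion: the function names are hypothetical, both inputs are assumed to have the same length and sample rate (so that their FFT bins line up), and plain correlation stands in for whatever statistical algorithm would ultimately be chosen.

```python
import numpy as np

def magnitude_spectrum(signal):
    """Hann-windowed magnitude spectrum, mean-removed and scaled to unit norm."""
    windowed = signal * np.hanning(len(signal))
    mag = np.abs(np.fft.rfft(windowed))
    mag = mag - mag.mean()
    norm = np.linalg.norm(mag)
    return mag / norm if norm > 0 else mag

def spectral_match(signal, template):
    """Correlate two magnitude spectra over frequency lags.

    Returns (best_score, lag_in_bins). A score near 1 at lag 0 means the
    two spectra have nearly the same shape, regardless of loudness.
    """
    a = magnitude_spectrum(signal)
    b = magnitude_spectrum(template)
    scores = np.correlate(a, b, mode="full")
    best = int(np.argmax(scores))
    lag = best - (len(b) - 1)        # index len(b) - 1 corresponds to zero lag
    return float(scores[best]), lag
```

Because the spectra are normalized, the score does not depend on how loud either recording is, which is the property a per-instrument threshold would need.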
The computational complexity issue is straightforward to solve. One must simply code the infrastructure to analyze a given signal in several channels, each acting as our entire program now acts. To convert the samples into the frequency domain, one need only take the FFT of each sample.
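One way to picture the multi-channel arrangement is sketched below. It is an assumption-laden illustration rather than a description of the existing program: each "channel" is taken to be one reference recording, every window of the input is transformed with a single batched FFT, and a matrix product scores all channels against all windows at once. The names `frame_signal` and `analyze_channels` are hypothetical.

```python
import numpy as np

def frame_signal(signal, window_len):
    """Cut a 1-D signal into consecutive, non-overlapping windows."""
    n_frames = len(signal) // window_len
    return np.reshape(signal[:n_frames * window_len], (n_frames, window_len))

def analyze_channels(signal, references, window_len=1024):
    """Score every window of `signal` against every reference in one pass.

    references: array of shape (n_channels, window_len), one row per channel.
    Returns an array of shape (n_channels, n_frames) of spectral similarities.
    """
    frames = frame_signal(signal, window_len)
    frame_spectra = np.abs(np.fft.rfft(frames, axis=1))      # FFT of each window
    ref_spectra = np.abs(np.fft.rfft(references, axis=1))    # FFT of each reference
    frame_spectra /= np.linalg.norm(frame_spectra, axis=1, keepdims=True) + 1e-12
    ref_spectra /= np.linalg.norm(ref_spectra, axis=1, keepdims=True) + 1e-12
    return ref_spectra @ frame_spectra.T
```

Taking `np.argmax(scores, axis=0)` on the result gives, for each window, the channel whose spectrum it most resembles.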
The final observed limitation, too, is within our grasp. We briefly attempted a promising method: the Mellin transform. Essentially, once a signal is transformed into the Mellin domain (in practice, by sampling it on an exponential time grid, weighting it by an exponential, and taking a Fourier transform), a dilation of the signal in time corresponds to a mere phase shift of the transformed representation. Thus, phase-shifting the transformed signal and converting back from the Mellin domain changes one note into another (musical modulation). The technique has many more applications than our particular program; recognizing images under dilation comes most immediately to mind.
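The dilation-as-phase-shift property can be demonstrated numerically. The sketch below is a simplified stand-in for a full Mellin transform, not the method the authors attempted: it resamples the signal on an exponential time grid, omits the exponential weighting term, applies a linear phase ramp in the FFT of the resampled signal, and interpolates back to uniform time. The function name and the `n_log` parameter are hypothetical, and the signal is assumed to be near zero at both ends of its time axis so that the circular shift does no harm.

```python
import numpy as np

def dilate_via_log_shift(signal, t, factor, n_log=8192):
    """Approximate f(factor * t) from samples of f(t), with t > 0 increasing.

    With x = ln(t), dilating f in time shifts g(x) = f(exp(x)) by ln(factor),
    and a shift is a linear phase ramp in the Fourier domain.
    """
    x = np.linspace(np.log(t[0]), np.log(t[-1]), n_log)   # uniform grid in log-time
    dx = x[1] - x[0]
    g = np.interp(np.exp(x), t, signal)                   # exponential resampling
    freqs = np.fft.fftfreq(n_log, d=dx)
    # g(x + ln(factor)) has spectrum G(v) * exp(+2j * pi * v * ln(factor))
    phase = np.exp(2j * np.pi * freqs * np.log(factor))
    g_shifted = np.real(np.fft.ifft(np.fft.fft(g) * phase))
    return np.interp(t, np.exp(x), g_shifted)             # back to uniform time
```

With `factor=2.0` the output is the input played twice as fast, that is, raised by an octave and halved in duration; other factors give other intervals.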