Scan a track for specific sounds...

I know there’s controversy on this topic, but I want to edit out some “ahs” and “ums” from a track.

Is it possible to scan the whole track for sections that are similar to a selection, then cycle through and decide to delete or keep on a case-by-case basis?

This would speed up the editing process a lot!

Any suggestions much appreciated.



Probably not. The software doesn’t speak English and it doesn’t know what words are. If those words were spoken very quietly, you might be able to use Analyze > Silence Finder.
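
To give a rough idea of what a silence-finding pass does, here is a minimal sketch in plain Python. It is not Audacity's actual Silence Finder code. The function name `find_quiet_regions`, its parameters, and the default values are all made up for illustration, and it assumes the audio is already loaded as a list of floats in the range -1.0 to 1.0.

```python
import math

def find_quiet_regions(samples, sample_rate, threshold_db=-30.0,
                       min_duration=0.1, window=0.01):
    """Return (start_sec, end_sec) spans whose windowed RMS stays below threshold_db.

    Hypothetical helper: threshold_db is measured relative to full scale (1.0).
    """
    win = max(1, int(window * sample_rate))      # samples per analysis window
    threshold = 10 ** (threshold_db / 20.0)      # convert dB to linear amplitude
    n_windows = len(samples) // win
    regions = []
    start = None                                 # index of first quiet window in a run

    for w in range(n_windows):
        chunk = samples[w * win:(w + 1) * win]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        if rms < threshold:
            if start is None:
                start = w
        elif start is not None:
            # A quiet run just ended; keep it only if it lasted long enough.
            if (w - start) * win / sample_rate >= min_duration:
                regions.append((start * win / sample_rate, w * win / sample_rate))
            start = None

    # Handle a quiet run that extends to the end of the track.
    if start is not None and (n_windows - start) * win / sample_rate >= min_duration:
        regions.append((start * win / sample_rate, n_windows * win / sample_rate))
    return regions
```

Note that this only finds passages that are quiet. It cannot tell a mumbled “um” from a pause or a softly spoken word, so every region it reports would still need checking by ear.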


Yeah, I figured Audacity wouldn’t speak English :). But I was thinking that it might be possible to search a track for similar-looking waveforms… Admittedly I’m an audio noob, but the “ums” this guy speaks have really similar-looking waveforms.

I was hoping for some sort of feature where you could give the program one or two example waveforms you want it to search for, then it would find ones with a similar profile within a certain margin of difference, and allow you to jump from one section of the track to the next so I could manually verify that it’s actually an “um”. Sort of like how spell-check in MS Word jumps you from one (potentially) misspelled word to the next.

Is there anything like that?


Nice idea, but spoken words are much more complex than written words.

Here we can see three words, as both waveforms and spectrograms. The words are “some”, “um” and “room”. Can you see the differences? Can you tell which is which?
If you compared these with recordings of someone else saying the same words, you would probably notice more differences due to the different voice than you can see due to the different words. The position of the bright lines in the spectrum will be different according to the pitch and timbre of the voice, and there will be significant differences due to the characteristics of the microphone, microphone placement, the room in which it has been recorded, background noise, and many other factors. Also, if I make another recording of the same words, there will be some similarities, but many differences between the two recordings.

The difficulty is that whereas with text you have 26 characters, each of which may be upper or lower case, with a voice recording you have an infinite number of frequencies that vary over time in amplitude and harmonic relationships. Some sophisticated software (such as “Dragon NaturallySpeaking”) is able to characterise “phonemes”, compare them to a database of patterns and cross-reference them with linguistic context, but even with that level of sophistication correct identification is still far from perfect.
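
For anyone who wants to experiment anyway, the “search for similar-looking waveforms” idea can be sketched as normalized cross-correlation: slide an example selection along the track and score how closely each position matches it. This is only an illustration in plain Python, not a feature of Audacity. The name `find_similar` and the `threshold` parameter are made up here, and it assumes both the track and the example are lists of float samples at the same sample rate.

```python
import math

def find_similar(track, template, threshold=0.8):
    """Return (offset, score) pairs where the track resembles the template.

    Hypothetical helper: score is the normalized correlation in [-1, 1],
    so it ignores overall loudness but not pitch, timing, or timbre.
    """
    n = len(template)
    t_mean = sum(template) / n
    t = [x - t_mean for x in template]           # remove DC offset from the example
    t_norm = math.sqrt(sum(x * x for x in t))
    hits = []

    for off in range(len(track) - n + 1):
        seg = track[off:off + n]
        s_mean = sum(seg) / n
        s = [x - s_mean for x in seg]
        s_norm = math.sqrt(sum(x * x for x in s))
        if t_norm == 0 or s_norm == 0:
            continue                             # flat segment, correlation undefined
        score = sum(a * b for a, b in zip(t, s)) / (t_norm * s_norm)
        if score >= threshold:
            hits.append((off, score))
    return hits
```

This finds an exact or merely louder/quieter copy of the example, but two spoken “ums” are never sample-for-sample copies of each other, so raw waveform matching like this will miss most of them for the reasons described above. The loop is also far too slow for a full-length track as written, and nearby offsets would need merging into a single candidate before stepping through the results.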