Automatically generate timings for a bouncing ball karaoke

Let’s say I have a waveform called wavAZ and several smaller ones, i.e. wavBD, wavHL, wavPT and wavWY, each of which is somewhat similar to a certain part of wavAZ. How can I find the times at which they occur within wavAZ? What existing tools or techniques can help with this? That is, I want the start and end times tBDs/tBDe, tHLs/tHLe, tPTs/tPTe and tWYs/tWYe.

Just trying to automatically generate timings for a bouncing ball karaoke lyrics thing.

See illustration below:

I don’t know any way to make Audacity do pattern or content recognition. It’s harder than you think. Digital systems don’t do “close.” Unless the match is exact, a comparison system will fail. That means no noise, no room sounds, no digital errors and no competing music.
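To make the "comparison system" idea concrete, here is a minimal sketch in Python (NumPy) of the simplest version: template matching by normalized cross-correlation. It slides a snippet such as wavBD along wavAZ and gives every offset a score from -1 to 1. The function name `find_snippet` is hypothetical, and both signals are assumed to be mono NumPy arrays at the same sample rate. A score near 1 only occurs when the snippet really is a piece of the same recording. A separately performed phrase, backing music or room sound pulls the score down toward the level of chance matches, which is the problem described above.

```python
import numpy as np

def find_snippet(haystack, needle):
    """Return (offset, score) for the best match of `needle` inside `haystack`.

    score is the Pearson correlation between the needle and the haystack
    window at that offset: 1.0 means the same shape (up to gain and DC
    offset), values near 0 mean no resemblance.
    """
    h = np.asarray(haystack, dtype=float)
    t = np.asarray(needle, dtype=float)
    n = len(t)
    if n == 0 or n > len(h):
        raise ValueError("needle must be non-empty and no longer than haystack")

    # Zero-mean, unit-norm template.
    t = t - t.mean()
    t_norm = np.linalg.norm(t)
    if t_norm == 0:
        raise ValueError("needle is silent/constant; nothing to match")
    t = t / t_norm

    # Dot product of the template with every length-n window of the haystack.
    # Because t has zero mean, this equals the dot product with the
    # mean-removed window.
    raw = np.correlate(h, t, mode="valid")

    # Norm of each mean-removed window, computed from running sums.
    c1 = np.concatenate(([0.0], np.cumsum(h)))
    c2 = np.concatenate(([0.0], np.cumsum(h * h)))
    win_sum = c1[n:] - c1[:-n]
    win_sq = c2[n:] - c2[:-n]
    win_norm = np.sqrt(np.maximum(win_sq - win_sum ** 2 / n, 0.0))

    # Silent windows get score 0 instead of dividing by zero.
    scores = np.where(win_norm > 1e-12, raw / np.maximum(win_norm, 1e-12), 0.0)
    best = int(np.argmax(scores))
    return best, float(scores[best])
```

To turn the result into timings, divide by the sample rate: start = offset / sr and end = (offset + len(needle)) / sr. For WAV files, `scipy.io.wavfile.read(path)` returns `(sr, data)`; stereo data would need to be mixed down to mono first. On a full-length song at 44.1 kHz the direct `np.correlate` call can be slow, and an FFT-based correlation (for example SciPy's `scipy.signal.correlate` with `method="fft"`) is the usual speed-up.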

People doing audio editing want us to do this kind of matching:

“Help me automatically find the trumpet sound so I don’t have to go looking for it.”

Not so far.


Aw, shucks! I thought this would be possible given the present state of voice recognition technology.

Can we tackle this with a waveform-analysis approach instead? For example, look for clusters of higher amplitude and check whether each one is a spoken syllable?
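A rough sketch of the amplitude-cluster idea: split the audio into short frames, measure each frame's RMS level, and keep runs of frames that are within some number of dB of the loudest frame. The function name `loud_regions` and its defaults (20 ms frames, a -30 dB threshold, merging gaps of 60 ms or less) are illustrative assumptions, not an Audacity feature. It finds loud regions; it cannot tell whether a region is actually a syllable.

```python
import numpy as np

def loud_regions(samples, sr, frame_ms=20.0, threshold_db=-30.0, min_gap_ms=60.0):
    """Return (start_s, end_s) pairs for runs of frames whose RMS level is
    within `threshold_db` of the loudest frame. Runs separated by gaps of
    `min_gap_ms` or less are merged so one syllable isn't split in two."""
    x = np.asarray(samples, dtype=float)
    frame = max(1, int(sr * frame_ms / 1000))
    n_frames = len(x) // frame
    if n_frames == 0:
        return []
    # One row per frame; leftover samples at the end are ignored.
    frames = x[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return []
    level_db = 20 * np.log10(np.maximum(rms, 1e-12) / peak)
    active = level_db > threshold_db

    # Collect [start_frame, end_frame) runs of active frames.
    runs = []
    start = None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, n_frames])

    # Merge runs separated by short quiet gaps (measured in frames).
    max_gap = int(min_gap_ms / frame_ms)
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)
    return [(s * frame / sr, e * frame / sr) for s, e in merged]
```

On sung lyrics over music, the backing track also raises the level, so a threshold like this mostly works on an isolated vocal.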

That’s the problem. There’s about one company that has “good” speech recognition, and a handful of other companies (including Microsoft) trying to catch up with them. There isn’t yet any really good open-source speech recognition, though this is improving. Speech recognition is very complicated for computers. Perhaps in 10 years’ time there will be a good open-source speech recognition library.

Yes, we can look for clusters of higher-amplitude frequencies. You can see them as bright bits in the spectrogram view of a track (Spectrogram View - Audacity Manual). The difficult part is telling whether a particular cluster of frequencies is someone saying “oh” or “ah”, or the sound of a car horn, or a crow.
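For reference, a spectrogram like the one in Audacity's Spectrogram View is built roughly like this: cut the signal into overlapping windows, apply a window function, and take the magnitude of each window's FFT. This is a generic NumPy sketch, not Audacity's implementation; the function name `spectrogram`, the 1024-sample Hann window and the 256-sample hop are assumptions.

```python
import numpy as np

def spectrogram(samples, sr, win=1024, hop=256):
    """Magnitude spectrogram.

    Returns (times_s, freqs_hz, mags) where mags has one row per time frame
    and one column per frequency bin (0 Hz up to sr / 2).
    """
    x = np.asarray(samples, dtype=float)
    if len(x) < win:
        # Pad very short inputs so there is at least one full frame.
        x = np.pad(x, (0, win - len(x)))
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * window for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    # Time stamp each frame at the centre of its window.
    times = (np.arange(n_frames) * hop + win / 2) / sr
    return times, freqs, mags
```

`np.argmax(mags, axis=1)` gives the strongest frequency bin in each frame, roughly the brightest spot in each column of the display. Deciding whether a pattern of bright spots is “oh”, “ah” or a car horn is the hard part, and this code does not attempt it.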

Someone on the forum made an experimental plug-in that would try to distinguish between vowel sounds and consonants. It mostly worked when used on a good quality voice recording. I’ll post a link if I can find it.

Here: Preliminary "Phoneme finder" toy
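As a loose illustration of the vowel/consonant idea (a classic textbook heuristic, not necessarily what the "Phoneme finder" plug-in does), frame energy can be combined with zero-crossing rate: vowels are loud and cross zero relatively rarely, while hiss-like consonants such as “s” cross zero far more often. The function name `classify_frames`, the 20 ms frame size and both thresholds are assumptions that would need tuning on real recordings.

```python
import numpy as np

def classify_frames(samples, sr, frame_ms=20.0, energy_db=-35.0, zcr_split=0.25):
    """Label each frame 'silence', 'voiced' (vowel-like) or 'unvoiced'
    (hiss-like consonant) from its RMS level and zero-crossing rate."""
    x = np.asarray(samples, dtype=float)
    frame = max(2, int(sr * frame_ms / 1000))
    n = len(x) // frame
    frames = x[: n * frame].reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max() if n else 0.0
    # Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    labels = []
    for r, z in zip(rms, zcr):
        if peak == 0 or 20 * np.log10(max(r, 1e-12) / peak) < energy_db:
            labels.append("silence")
        elif z < zcr_split:
            labels.append("voiced")    # loud, few crossings: vowel-like
        else:
            labels.append("unvoiced")  # loud, many crossings: "s"/"f"-like
    return labels
```

Real voices are messier than the pure tone and white noise such a heuristic is easiest to check against, so expect it to work best on a clean, dry vocal recording.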