Aligning a speech recognition transcript to audio

andrewpcone · January 29, 2012, 1:07am

I need to edit many hours of high-quality, studio-recorded English conversation. I want to run the entire thing through a speech recognizer, generating a transcript (it’s fine if it has a bunch of silly errors). Then I want to align that text with the waveform viewed in Audacity, so I can see which blob corresponds to which utterance.

That way, I could easily jump to a particular phrase by text search (or visual inspection), without having to listen to the whole darn thing. I could also cut out a particular utterance without having to listen to exactly when it starts and stops.

This seems like a pretty natural thing to want. Can Audacity do this? If not, which other tools should I use? I prefer open source, of course, but I’ll use whatever works.

kozikowski · January 29, 2012, 1:31am

This seems like a pretty natural thing to want.

It’s a pretty natural thing for you to want.

First I’ve heard of this. I can’t picture a way to do this that doesn’t involve video. Video is the natural sync of picture and sound. But that gives you pictures of the words, which will not let you Text Search. Audacity does not support Time Of Day and I believe will not export its Event Time outside the program.

Good question.

I’m picturing Karaoke, where the words appear on the screen with a bouncing ball or colored words indicating what you should be singing. I think you can scan through those plate by plate.

Describe it again. You read through the text like a Word document and find a word you want to delete. Then what? You look down slightly and get the event times for that word, write them down, open Audacity and put those times in the timer windows below the editor. Press Delete and play through the show to make sure you didn’t mess something up.

Editing is more than a little touchy-feely and you can get different effects depending on how closely you edit words. A good editor (Academy Awards, etc.) can edit a show so you can’t tell one dialog sentence was shot over the course of two weeks. A bad edit can be detected by a layman in a noisy car.

Koz

kozikowski · January 29, 2012, 1:35am

Movie Scripts do something like this. Well-constructed movie scripts move about a minute per page. A 90-page script…

So you play a half-hour into the show, open to page 29, 30, or 31 and pick it right up. You don’t even need those little post-it note things hanging out.

That’s as close to a real-world connection as I can think of.

Koz

Trebor · January 29, 2012, 3:12pm

Allegedly Google (YouTube) has that feature … https://www.youtube.com/watch?v=kTvHIDKLFqc (I’ve never used it)

steve · January 29, 2012, 6:28pm

You could possibly do it by using the Silence Finder (Analyze menu) to mark each word (some manual adjustment would probably be necessary due to some words running into each other, words with multiple syllables and suchlike). Then export the labels as a text file (File menu), use a spreadsheet program to replace the label text in each label with the words from your transcript, save the output as a text file, and import the modified labels.txt file back into Audacity.
Probably a lot more trouble than it’s worth.
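
If the spreadsheet step is the painful part, it could probably be scripted instead. Here is a rough Python sketch, assuming the exported label file has one label per line in the form start time, end time, label text, separated by tabs, and that the labels and transcript words are in the same order, one word per label. The function name relabel and the sample times are made up for illustration; this is not something built into Audacity.

```python
def relabel(label_text, words):
    """Swap the text of each exported label for the matching transcript word.

    Assumes every non-empty line is "start<TAB>end<TAB>text", as in a
    simple exported label file. If there are more labels than words,
    the extra labels keep their original text.
    """
    rows = [line.split("\t") for line in label_text.splitlines() if line.strip()]
    out = []
    for i, row in enumerate(rows):
        start, end = row[0], row[1]
        old_text = row[2] if len(row) > 2 else ""
        # Use the i-th transcript word when there is one for this label.
        new_text = words[i] if i < len(words) else old_text
        out.append(f"{start}\t{end}\t{new_text}")
    return "\n".join(out) + "\n"


# Illustrative data only: two point labels such as a silence finder might produce.
sample_labels = "0.500000\t0.500000\tS\n1.250000\t1.250000\tS\n"
transcript = "hello world".split()
print(relabel(sample_labels, transcript), end="")
```

To use it on real files, read the exported labels.txt into a string, pass it in along with the list of transcript words, and write the result to a new text file for importing. It does nothing about the manual adjustment mentioned above: if the labels and the words do not line up one to one, the text will drift out of step from the first mismatch onward.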