AI Transcription repeats text over and over

I’m using OpenVINO Whisper transcription with the large model on a newish Windows 11 computer with an Intel UHD Graphics 63 GPU.

Sometimes lines of text are repeated over and over in the output files when I save the labels as .SRT or .VTT files.

I record in stereo, and mixing the two tracks down to mono before transcribing often helps eliminate the problem. Is this a requirement before running transcription? I see that it converts the audio to 16-bit. Should I do that too?

Hey @Wrecks0,

I record in stereo, and mixing the two tracks down to mono before transcribing often helps eliminate the problem. Is this a requirement before running transcription? I see that it converts the audio to 16-bit. Should I do that too?

Interesting. The plugin should be doing this on your behalf (mixing stereo to mono, downsampling to 16 kHz, etc.). It’s odd that you’re seeing different results when you manually mix and render to mono before running Whisper.

The goal is that users shouldn’t have to apply destructive operations (like downsampling) to their audio before running transcription in order to get proper results. If doing so changes your results, then perhaps there is a bug in the plugin.
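For what it’s worth, the mono mixdown and 16 kHz resample described above can be sketched in a few lines of NumPy. The function name and the linear-interpolation resampler here are illustrative only; the plugin’s actual preprocessing may use a higher-quality resampler:

```python
import numpy as np

def prepare_for_whisper(audio, sr, target_sr=16000):
    """Mix a (samples, channels) buffer to mono and resample to 16 kHz,
    mirroring the preprocessing Whisper expects. Hypothetical sketch:
    the real plugin's resampling method may differ."""
    if audio.ndim == 2:
        # Stereo (or multichannel) to mono by averaging channels.
        audio = audio.mean(axis=1)
    if sr != target_sr:
        # Naive linear-interpolation resample to the target rate.
        duration = len(audio) / sr
        n_out = int(round(duration * target_sr))
        t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
        t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(t_out, t_in, audio)
    return audio.astype(np.float32), target_sr

# Example: one second of 44.1 kHz stereo becomes 16,000 mono samples.
stereo = np.random.default_rng(0).standard_normal((44100, 2))
mono, sr = prepare_for_whisper(stereo, 44100)
```

If manually doing this mixdown yourself changes the transcription output, that points to the plugin’s own preprocessing behaving differently than expected.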

Would it be possible for you to file an issue in our GitHub project here (Issues · intel/openvino-plugins-ai-audacity · GitHub)? I’d like to take a deeper look at this one, and it helps to have an issue to track progress against.

Thanks!
Ryan

I will do that tonight when I get home.

I figured out that the problem was with the large model. Something in it was making the transcript repeat over and over again. It was also taking about half an hour to transcribe a 20-minute audio file.

I arrogantly assumed that a larger model would return better results. The regular model gives results that are just as good, and only takes a few minutes to finish.

There are a few different variants of ‘Large’. When you say ‘regular model’, which one are you referring to, ‘base’? There are also ‘small’ and ‘medium’ models, which in general give better results than ‘base’ (although they will also increase processing time).

And it’s very possible that ‘base’ can give results ‘just as good’ as the larger models; it just depends on the audio snippet / speech in question. Larger models are much better at translating other languages to English.

I probably should have posted that while I was at home, where I could check the settings I ended up using, instead of trying to go by memory.