I’m trying to run an interesting experiment. I am testing the use of the Google speech-to-text API to auto-transcribe audio files. The problem is, it only allows clips up to about 15 seconds long, so I have split my audio file into sections. However, when I blindly split an audio file, some words are inevitably split in half. I know it wouldn’t be perfect, but I need some way to automatically split an audio file by voice detection to avoid this issue.
I don’t care if heavy, destructive processing is needed to do this. I only need to know the timecodes, because I could automatically split the original audio by that list of timecodes. For instance, is there some way to (heavily) process the audio until all that remains is mostly voices, the rest being total silence? If so, then I can split by silence detection at whatever time interval I choose – for example, five seconds.
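To make the idea concrete, here is a rough Python sketch of what I imagine the timecode step could look like. It is only an illustration under assumptions of mine: the audio is already decoded to a list of mono samples, a frame counts as "silent" when its RMS level is below 10% of the loudest frame, and a pause must last at least 0.3 seconds. The function names and all of the thresholds are hypothetical, and the sketch does no noise removal at all, so a noisy recording would probably need cleaning or a different threshold first.

```python
import math


def frame_rms(samples, rate, frame_ms=20):
    """Return (list of RMS levels for consecutive frames, frame length in samples)."""
    n = max(1, int(rate * frame_ms / 1000))
    levels = []
    for i in range(0, len(samples), n):
        frame = samples[i:i + n]
        levels.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return levels, n


def silence_midpoints(samples, rate, frame_ms=20, rel_threshold=0.1, min_silence_s=0.3):
    """Return the time (in seconds) at the middle of each sufficiently long quiet run."""
    levels, n = frame_rms(samples, rate, frame_ms)
    if not levels:
        return []
    # Assumption: "quiet" is relative to the loudest frame, so loud and
    # quiet recordings are treated alike. Background noise is NOT handled.
    threshold = rel_threshold * max(levels)
    frame_s = n / rate
    mids, start = [], None
    # The infinite sentinel closes a quiet run that reaches the end of the file.
    for i, level in enumerate(levels + [float("inf")]):
        if level < threshold:
            if start is None:
                start = i
        elif start is not None:
            if (i - start) * frame_s >= min_silence_s:
                mids.append((start + i) / 2 * frame_s)
            start = None
    return mids


def choose_splits(mids, duration, max_chunk_s=15.0):
    """Greedily cut at the latest pause that keeps each chunk within max_chunk_s."""
    splits, last = [], 0.0
    while duration - last > max_chunk_s:
        candidates = [m for m in mids if last < m <= last + max_chunk_s]
        # If no pause is available in range, fall back to a blind cut.
        cut = candidates[-1] if candidates else last + max_chunk_s
        splits.append(cut)
        last = cut
    return splits
```

The list returned by `choose_splits` is in seconds, so each entry would be a point at which to cut the original, unprocessed file. Where no pause is found within the limit, it falls back to the blind cut I am trying to avoid.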
Again, I know such a method wouldn’t be flawless. I am simply looking for a way to split an audio file with a minimum of split words. (It seems to me that any method with some intelligence to it would be better than blindly splitting the audio.) Another important note is that the content of these recordings can be somewhat random – they can be clear or noisy, and they can be loud or quiet.
I hope I’ve communicated the general idea. Any suggestions on how to approach this? Thanks!
(P.S.: I’m using Audacity 2.0.3 on Windows Vista 64-bit.)