I’m wondering whether you might know of a tool or method (other than my hearing) to discriminate between voices in a recording?
I sometimes work with focus groups. During a round of introductions, I get anywhere from 30-120s of clear speech from each participant. These segments could be used to build a discrimination function, e.g., as training data.
Ideally, the discrimination function would run on the audio and create a labeled section (label track) for the times when the function returns a value greater than a settable threshold (it might even have a settable hysteresis value).
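To sketch what I mean by a threshold with hysteresis: given per-frame scores from some matcher (the score function itself is hypothetical here, as are the threshold values), the labeling step might look like this:

```python
def label_regions(scores, frame_s, on_thresh=0.6, off_thresh=0.4):
    """Turn per-frame scores into (start, end) label regions.

    Hysteresis: a region opens when the score rises above on_thresh
    and closes only when it falls below the lower off_thresh, so
    brief dips in the score don't split one region into many.
    frame_s is the duration of one frame in seconds.
    """
    regions = []
    start = None
    for i, s in enumerate(scores):
        t = i * frame_s
        if start is None and s >= on_thresh:
            start = t                      # region opens
        elif start is not None and s < off_thresh:
            regions.append((start, t))     # region closes
            start = None
    if start is not None:                  # still open at end of audio
        regions.append((start, len(scores) * frame_s))
    return regions
```

Each (start, end) pair would then become one entry on the label track.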
Clues or discussion welcomed!
Thanks very much!
David Stubbs
Usability Architects, Inc.
Portland OR US
If you are trying to automatically cut out the quiet/silent bits between the speech, then “Truncate Silence” will do the job … http://www.richardcravy.com/?p=46
“Truncate Silence” can be found in Audacity 1.3 by clicking “Effect”, “Utility”, “Timeline Changer”, “Truncate Silence”.
Select representative samples from those “clear speech” chunks,
label them (as Mary, John, etc.) and compute a good FFT.
Then, when analysing the whole audio, try to find the “best fits” to these spectra.
Maybe this helps…
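As a rough sketch of that idea, assuming NumPy and audio already loaded as a plain float array (the library choice, FFT size and helper names are my assumptions, not anything built into Audacity):

```python
import numpy as np

def fingerprint(samples, n_fft=1024):
    """Average magnitude spectrum of a labelled speech sample.

    Splits the sample into non-overlapping n_fft frames and averages
    their magnitude spectra, giving one 'signature' per speaker.
    """
    n_frames = len(samples) // n_fft
    frames = np.reshape(samples[:n_frames * n_fft], (n_frames, n_fft))
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def best_fit(frame, fingerprints):
    """Label a single frame (length n_fft) by the nearest fingerprint.

    fingerprints is a dict mapping names ("Mary", "John", ...) to
    spectra from fingerprint(); distance here is sum of squared
    differences between magnitude spectra.
    """
    spectrum = np.abs(np.fft.rfft(frame))
    return min(fingerprints,
               key=lambda name: np.sum((spectrum - fingerprints[name]) ** 2))
```

So: build one fingerprint per participant from the introductions, then call best_fit on each frame of the rest of the recording.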
Do you think the method you recommend could be accomplished through the Audacity API? I don’t want to crack into that stuff without some confidence that it might be accomplished there.
The original question sounds like it is asking about “speaker recognition”.
Accurate speaker recognition is an extremely complex task. I know of no tools in Audacity for this purpose; however, you may be able to discern a rough guide by switching the track view to “Spectrum”. For higher definition, increase the FFT size to “most narrow band” in Preferences.
In this screenshot you can see some music (strings) playing up to about 9.5 seconds. From 9.5 seconds to 14 seconds the recording changes to my voice. After 14 seconds it goes back to music.
Another project that may be of interest: the Speech Signal Processing Toolkit (SPTK).
Stevethefiddle, the spectrum above is fascinating!
In spectrum view, my noise background (video projector fan) is clearly going to overwhelm the voices, but in the recording, I can almost “read” the different voice signatures.
Thanks very much, also, for the pointer to the other project.
I found these links to free software via the above Wikipedia link on speaker recognition…
Speech Synthesis and Recognition CSLU Toolkit for Spoken Dialogue Systems. The Centre for Speech and Language Understanding in Oregon has produced an amazing toolkit supporting the construction of spoken dialogue systems. Components for speech recognition, speech synthesis, dialogue management and even a talking head are included. [Windows]
HTK - Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. [Unix]
Hidden Markov Models and the like “sound” good,
if you are preparing some scientific grant proposal…
If I were you, I would first try my approach.
I presume your audio file with the speakers is a kind of Windows PCM WAV file.
First, prepare “fingerprint” spectra for all participants.
Then scan your audio file, taking FFT “chunks” (they may overlap!)
and marking parts as “Mary”, “John”, “Silence”, “Unknown”, etc.,
based on the best spectrum fit.
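A minimal sketch of that scanning step, assuming NumPy, samples already loaded as a float array, and made-up thresholds for “Silence” and “Unknown” (all of those are my assumptions):

```python
import numpy as np

def scan(samples, fingerprints, n_fft=1024, hop=512,
         silence_rms=1e-3, unknown_dist=None):
    """Slide an FFT window (with overlap via hop < n_fft) over the
    audio and label each position by the nearest fingerprint.

    fingerprints maps speaker names to magnitude spectra of length
    n_fft // 2 + 1. Very quiet frames become "Silence"; frames too
    far from every fingerprint become "Unknown" (if unknown_dist
    is set). Returns a list of (sample_offset, label) pairs.
    """
    labels = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = samples[start:start + n_fft]
        if np.sqrt(np.mean(frame ** 2)) < silence_rms:
            labels.append((start, "Silence"))
            continue
        spectrum = np.abs(np.fft.rfft(frame))
        dists = {name: np.sum((spectrum - sig) ** 2)
                 for name, sig in fingerprints.items()}
        name = min(dists, key=dists.get)
        if unknown_dist is not None and dists[name] > unknown_dist:
            name = "Unknown"
        labels.append((start, name))
    return labels
```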
First try the “max” norm: max |Spectrum(f) - Signature(f)| over all “reasonable” f.
Next, try the L2 norm: the sum of the squares of the differences…
Possibly, weight the frequencies using some psycho-acoustic model, as in MP3…
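The two norms, written out on NumPy magnitude-spectrum arrays (the array library is my assumption; here “reasonable” f is just an index range of FFT bins, e.g. those covering the voice band):

```python
import numpy as np

def max_norm(spectrum, signature, lo=0, hi=None):
    """max |Spectrum(f) - Signature(f)| over the chosen bins."""
    return np.max(np.abs(spectrum[lo:hi] - signature[lo:hi]))

def l2_norm(spectrum, signature, lo=0, hi=None):
    """Sum of squared differences over the chosen bins."""
    return np.sum((spectrum[lo:hi] - signature[lo:hi]) ** 2)
```

The psycho-acoustic weighting would just multiply the per-bin differences by a weight array before taking the max or the sum.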