I have an audio file with 2-3 people speaking one after the other. Basically, the first person speaks and the other two people translate into their own languages. Only one person is speaking at any time.
I want to automatically separate out each language or person speaking. Can I separate it?
The problem is that there's no identity in the audio itself. I don't know of any good free way to detect spoken English, for one example.
You might be able to sense the pitch and timbre of the voices and sort out which one is speaking at any one time. You could also use the cyclical pattern for error detection: if you knew that French was always the second language and you missed one, that would get you back on track. That would stop working the minute the order of the voices changed.
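To illustrate the pitch idea, here is a minimal sketch, assuming each voice segment is already isolated and the speakers differ in fundamental frequency. It uses a naive autocorrelation pitch estimator on synthetic tones; the sample rate, the 170 Hz boundary, and the `classify_speaker` helper are all assumptions for illustration, and real speech would need a far more robust estimator (e.g. YIN) than this.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed rate for this synthetic sketch


def estimate_pitch(samples, sample_rate=SAMPLE_RATE, fmin=60, fmax=400):
    """Estimate the fundamental frequency with a naive autocorrelation
    peak search over lags corresponding to fmin..fmax Hz."""
    n = len(samples)
    best_lag, best_score = 0, 0.0
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    for lag in range(lag_min, min(lag_max, n // 2)):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_score, best_lag = score, lag
    return sample_rate / best_lag if best_lag else 0.0


def classify_speaker(samples, boundary_hz=170.0):
    """Crude two-way sort: low-pitched voice vs. high-pitched voice.
    The boundary is a made-up tuning parameter, not a standard value."""
    return "low" if estimate_pitch(samples) < boundary_hz else "high"


def tone(freq, dur=0.25, rate=SAMPLE_RATE):
    """Pure sine tone standing in for a voiced segment."""
    return [math.sin(2 * math.pi * freq * t / rate)
            for t in range(int(dur * rate))]


# Synthetic stand-ins for two voices: 120 Hz and 220 Hz fundamentals.
print(classify_speaker(tone(120)))  # low
print(classify_speaker(tone(220)))  # high
```

This only works when the voices are cleanly separated in pitch; two speakers with overlapping ranges would defeat it, which is part of why this is beyond what Audacity offers built-in.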
That's way beyond regular Audacity and home-baked Macro programming, and may be beyond Nyquist programming.
Although Alexa, Siri, Cortana (and similar) can detect certain spoken words, that is about the extent of speech recognition on PCs in 2021. Even they don't "understand" what is being said, and they have no idea "who" is speaking.
Audacity has no way of knowing when one person stops speaking and another starts speaking, unless there is something physical that can be measured. If, for example, there is a longer pause between people speaking than between the words of one person speaking, then you could use “Label Sounds” to add labels around each part. See: https://manual.audacityteam.org/man/label_sounds.html
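To show what that pause-based approach measures, here is a minimal sketch, assuming you have the raw samples as floats in -1..1. It mimics the idea behind "Label Sounds" (not Audacity's actual implementation): mark a region boundary wherever the signal stays below a level threshold for longer than a minimum silence duration. The threshold and duration values are assumptions you would tune to the recording.

```python
def label_sounds(samples, sample_rate, threshold=0.05, min_silence=0.5):
    """Return (start_sec, end_sec) regions of sound, split wherever the
    signal stays below `threshold` for at least `min_silence` seconds.
    A rough stand-in for the idea behind Audacity's "Label Sounds"."""
    min_gap = int(min_silence * sample_rate)
    regions, start, silent_run = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i          # sound begins here
            silent_run = 0
        else:
            silent_run += 1
            # Close the region once the silence is long enough.
            if start is not None and silent_run >= min_gap:
                regions.append((start / sample_rate,
                                (i - silent_run + 1) / sample_rate))
                start, silent_run = None, 0
    if start is not None:          # sound ran to the end of the file
        regions.append((start / sample_rate, len(samples) / sample_rate))
    return regions


# Synthetic example at 1000 Hz: 0.2 s of sound, 0.8 s of silence, 0.2 s of sound.
clip = [0.5] * 200 + [0.0] * 800 + [0.5] * 200
print(label_sounds(clip, 1000))  # [(0.0, 0.2), (1.0, 1.2)]
```

This only finds *something*, not *someone*: it tells you where each stretch of speech starts and stops, and you would still have to decide which speaker each labeled region belongs to, by hand or by a pattern like the cyclical one described above.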