Removing Individual Voices


I’m frequently tasked to remove the voices of speakers who are being recorded for training purposes. They are recorded using a single PZM type microphone, which is used to record all the speakers in a room (usually 2 or 3 speakers). I know my life would be so much easier if each person had their own individual microphone - however, that isn’t going to happen any time soon. :frowning:

I then take these recordings and have to laboriously edit each one, the result frequently being hours’ worth of a single person’s questions or answers, depending on the requirements. The biggest job of this type I had was 21 DAT tapes on extended play, which took me almost 3 mind-numbing weeks to complete. My bosses think I can do this stuff in minutes, and don’t really accept it when I tell them that editing even a recording lasting 60 minutes is like editing a commercial recording, and takes me at least a day, resulting (usually) in about 10 minutes’ worth of usable “product”.

I’m pretty well used to removing noise with Audacity which I think does a great job. I’ve tried visually recognizing individual voices by the waveforms, but owing to speakers moving in the room, levels can fluctuate so there seems to me to be no “quick fix” - I still have to listen to it all. Does anyone know of any voice analysis / removal techniques I might be able to use to get the job done quicker? My eardrums would be eternally grateful. Thanks.


PS - Apologies for originally posting in “Recording Techniques”.

Individual lapel mics are the way to go.
Separating the voices is as you are doing it now - I don’t think there are any “quick fixes” for this (and yes, I know it is mind numbing work).
For evening out the volume level after you have separated the bits that you want you could try using a “dynamics compressor”. There is a compressor included in the effects menu of Audacity (the version in Audacity 1.3.12 is much better than the one in Audacity 1.2.6). Probably the best compressor for this job is a third party plug-in that you can get from here:
Right click and save from the “plugin source” link.

Instructions for installing plug-ins:

Try this plug-in with the default settings - if you need to increase the effect, increase the first slider (compress amount) up to about 0.8 (don’t go above 1.0).
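For anyone wondering what a dynamics compressor does to the levels, here is a minimal sketch in plain Python. It is not the algorithm used by the Audacity compressor or by the plug-in mentioned above; the function name and the `threshold` and `ratio` parameters are made up for illustration. Real compressors also track a smoothed envelope with attack and release times, which this leaves out.

```python
def compress(samples, threshold=0.5, ratio=4.0):
    """Scale down the part of each sample that exceeds the threshold.

    samples: floats in the range -1.0 to 1.0 (hypothetical test data).
    threshold and ratio are illustrative values, not plug-in settings.
    """
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            # Only the overshoot above the threshold is reduced.
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if s >= 0 else -mag)
    return out


# A quiet talker far from the mic followed by a loud one close to it.
quiet_then_loud = [0.1, -0.2, 0.9, -1.0]
levelled = compress(quiet_then_loud)
```

The quiet samples pass through untouched and the loud ones are pulled down, so the gap between the two talkers shrinks. You would then normally bring the overall level back up.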

Hi Steve,

I’d love them to have individual mics - trouble is my firm are slow to accept change, especially when it comes to buying us stuff we badly need (having said that they did kit us out with Avid, the MOJO, Pro-Tools and a desk - they must have been flush, but I still prefer Audacity for the audio and Womble Edit for the vids!). I could rant forever about this, but reckon this isn’t the place. I might PM you though when I’ve had a few drinks :wink: Another pain is the fact that most of the recordings are “historical” (had a job the other week dating back to '94!), so I’ve pretty much got to grin and bear it and work with what they dump on me.

Thanks for the advice with regard to compression. I’ve tried it a few times with (I think) pretty good results, but have probably been a bit too heavy handed with the settings. I’ll revisit an old job on Monday and try the ones you have suggested.


The universal rule on editing and post production is to allow, on average, ten times the length of the expected show. So producing effects on a 60 minute show is effectively going to take all day – 6-8 hours.

If you know you have impossible edits, it’s much longer.

Producers have been trying to get around this number for centuries. No matter what, when they get to the end of the show and count up the minutes, guess what? 10x or worse.


Hi Koz,

Thanks for replying. I might just print this thread off to show them! :wink:

Guess what I need is some or all of the following:

1/. A time machine.
2/. Some piece of clever trickery whereby I can sweep, train to recognize, then delete the unwanted voice(s).
3/. Understanding from those who are on a lot more money than me.
4/. As a last resort, more patience, a better chair and someone who’ll at least authorize some overtime!


I wish I could give you a magic tool, but you have one of the terrible editing jobs. There’s no way to technically recognize one voice over others in a mix. Human voices have remarkably similar electronic signatures.

And similar to musical instruments, too. No shortage of postings trying to remove instruments or voices from music.

Some of the much more advanced voice conference systems work by having two (or more) microphones and being able to tell direction. Talker A is in the middle and Talker B is half-left around the table because the voices arrive at slightly different times. Those commercial products are just now coming on the market and we’re buying them for our conference rooms.
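To give a feel for the "slightly different times" idea, here is a rough sketch in plain Python that slides one microphone signal against the other and picks the offset where they line up best (a brute-force cross-correlation). This is only an illustration of the principle, not how any commercial conference product works; the function name, the test burst and the 48000 Hz sample rate are all assumptions.

```python
def best_lag(mic_a, mic_b, max_lag):
    """Return how many samples mic_b trails mic_a (negative if it leads)."""
    n = len(mic_a)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


# Hypothetical data: the same short burst reaches mic B 3 samples later.
burst = [1.0, -2.0, 3.0, -1.0]
mic_a = [0.0] * 10 + burst + [0.0] * 10
mic_b = [0.0] * 13 + burst + [0.0] * 7

SAMPLE_RATE = 48000  # Hz, assumed for the example
lag = best_lag(mic_a, mic_b, max_lag=8)
delay_seconds = lag / SAMPLE_RATE
```

With real rooms the echoes smear the match badly, which is a large part of why this is harder than it looks.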

Audacity Voice Remover tool works something like that. It can tell when a voice is in the middle of a stereo show (not left or right) and manage it. But that’s it. That’s all it does.
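The basic centre-cancelling trick can be sketched in a few lines of plain Python: anything that is identical in the left and right channels disappears when you subtract one from the other. This is the classic technique, shown here under the assumption that it is roughly what such tools do; it is not taken from the Audacity source, and the names and sample values are invented.

```python
def remove_centre(left, right):
    """Subtract right from left; material common to both channels cancels."""
    return [l - r for l, r in zip(left, right)]


# Hypothetical stereo mix: voice in the middle, guitar only on the left.
voice = [0.5, -0.3, 0.8]
guitar = [0.1, 0.2, -0.1]
left = [v + g for v, g in zip(voice, guitar)]
right = list(voice)

result = remove_centre(left, right)        # leaves (approximately) the guitar
mono_result = remove_centre(voice, voice)  # a mono recording cancels entirely
```

The last line shows why this is no help for a single-microphone recording: both channels are identical, so everything cancels, not just one talker.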

And none of these tools will do anything for your older interviews.


Hi Koz,

I thought as much. I’d be interested in learning some more about some of those conferencing products you’re evaluating though. If you can’t post details on here for whatever reason, could you PM me with them?


Although I might not be a prolific poster, I’m grateful for this forum and the wealth of experience of the people on here that make it work. My thanks to all.

<<<If you can’t post details on here for whatever reason>>>

No, and the reason is they’re trade secrets. People get multiple hundreds to multiple thousands of dollars for machines that can do this.

We have six of these things as part of much larger videoconferencing systems…

That one’s $1200 (US) and it uses the technology for echo cancellation. It can’t single out one person and ignore everybody else – or it does, but internally.

We toyed with this technology with Radio Shack microphones and multi-channel sound mixers but, while fascinating, it wasn’t useful. For example, the farther apart the microphones are, the better, right? More separation gives you better location data. It also kills you with room echoes, which are now different between the microphones. Also, for simple calculations (and this stuff is anything but simple) we tried to use only the voice tones that a telephone uses, 300 to 3000 Hz.

Cool. Now all you have to do is figure out a way to separate people exactly sideways to both microphones. Those people’s voices show up exactly in time with each other.
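The sideways problem is easy to check with a bit of geometry. Here is a small sketch in plain Python, assuming two microphones one metre apart and a speed of sound of about 343 m/s; the positions and the function name are made up for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def arrival_difference(source, mic_a, mic_b):
    """Seconds by which the sound reaches mic_b later than mic_a."""
    return (math.dist(source, mic_b) - math.dist(source, mic_a)) / SPEED_OF_SOUND


# Hypothetical layout in metres: mics on the x axis, one metre apart.
mic_a = (-0.5, 0.0)
mic_b = (0.5, 0.0)

near_centre = arrival_difference((0.0, 1.0), mic_a, mic_b)
far_centre = arrival_difference((0.0, 3.0), mic_a, mic_b)
off_to_side = arrival_difference((1.0, 1.0), mic_a, mic_b)
```

Both talkers on the centre line give a difference of zero, so timing alone cannot tell them apart, while the talker off to one side gives a clear non-zero value.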

Well, let’s use three microphones, right?

Do you see where this is going? You’re up to hundreds of dollars of development costs and you haven’t even got a working model yet – even with felt-tip pen on coffee-stained napkin, the traditional start of good design ideas.

I need a strong cuppa.


Interesting bit of kit, though as you say, unable (as is everything at present) to do the job. Although the conversations are recorded with each participant’s knowledge, we can’t compel them to wear a microphone. With some of the stuff they’re talking about they’d probably end up strangling themselves with the wire! Hey ho - I’ll just crack on with what I’ve got. I have a sign that I sometimes tape to my back when I have my cans on that says “Do NOT Disturb - I am very, VERY Busy!”. Guess I’ll make another one that says “Look - You’ll Just Have To Wait!” :wink: - with or without expletives.

I don’t know about a strong cuppa. After a week’s worth of headbanging frustration I generally opt for Vodka :wink: Iechyd Da!



I haven’t had to do this yet, but I have thought the problem through. OK, you have a monophonic recording of several speakers and have to separate the voices. This is a bionics problem which could probably be sorted out with software which mimicked the frequency responses of the human vocal tract, or just some electronics and mechanical tinkering. Also, each person’s larynx will produce different tonal qualities. These are the things to mimic: each separate speaker’s tonal resonances.

Now there are the several resonance chambers in the vocal system such as the throat, nasal cavity and mouth. If you can build these in various common types and make them somewhat adjustable then they would start to mimic the individual speakers’ resonant signatures. Run a small loudspeaker into each one and place a microphone where the mouth would be. Each of the several would then “band pass” the voice of the speaker which most closely matched it and attenuate the rest. A simple negative feedback of the resulting output from one set of chambers with the input should start the separation processes. More resonance chambers for a little more filtering and then subtracting the output of the other voices should get you there.

If you have doubts about the effectiveness of the various vocal tract filters just pinch your nose while you are talking and listen to the difference in sound.

From the “sound of it” I would think of developing this at home and telling no one at work. Then do the work at home by contract, just because it is easier to concentrate.


Hi Dinosaur,

Yep, it sounds simple enough on paper :wink: Thing is with the work that I do, even if I were clever enough to develop such a “voice-recognition / voice removal” gizmo, it would have to meet certain “standards” and have an extremely limited number of customers (fewer than 50 - and they’d only buy it once and expect it to last about 300 years), which means I probably wouldn’t make enough money to buy a half-decent TV dinner for two!

So … it’s either the time-machine or I just get my head down and crack on with it :wink:

Thanks for everyone’s thoughts,


I will miss Walter Disney and Mel Blanc.

If your comment is not relevant to the subject of the topic being discussed (as is the case here), please don’t post it. Off-topic comments are a distraction and a waste of everyone’s time.