Removing background music at the driver level

Something like a reverse karaoke mode: a program that runs in the background of the OS, removes the music, and leaves the vocals (and preferably other sounds) intact.

I have no experience with audio programming at all. What would be the best approach to build such software? Is it even doable? Could I modify some of Audacity’s source code to achieve this? Or are there other APIs available?

Is it even doable?

In general, no.

That is the holy grail, isn’t it? Start with a ratty MP3 download and remix it into your own production, splitting off all the instruments, sections and voices.

The karaoke tool works by one single clever trick: lead vocals usually appear in the dead center of a stereo show. Simple subtraction of one channel from the other usually sucks out the vocal, leaving much (but not all) of the orchestration. It also leaves a mono show, not stereo. That’s important, because people immediately want to go charging off and use that work for production. It doesn’t work like that.

Bass and drums, usually produced in the middle of a stereo show, go too.
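The subtraction trick can be sketched in a few lines, assuming floating-point audio in NumPy arrays (the names `remove_center`, `stereo`, and so on are mine for illustration, not from any library):

```python
import numpy as np

def remove_center(stereo: np.ndarray) -> np.ndarray:
    """Classic karaoke trick: subtract Right from Left.
    Anything panned dead center (usually the lead vocal, but
    also bass and drums) cancels out. The result is one mono
    channel, not stereo."""
    left, right = stereo[:, 0], stereo[:, 1]
    return left - right

# Toy demonstration: a "vocal" panned dead center plus an
# "instrument" panned hard left.
t = np.linspace(0, 1, 44100)
vocal = np.sin(2 * np.pi * 220 * t)       # appears equally in L and R
instrument = np.sin(2 * np.pi * 440 * t)  # appears in L only
stereo = np.stack([vocal + instrument, vocal], axis=1)

mono = remove_center(stereo)  # the vocal cancels; the instrument survives
```

Note that anything not identical in both channels, such as stereo reverb on the vocal, survives the subtraction, which is why the trick only works well on some songs.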

Audacity 2.1.2 has Effect > Vocal Reduction and Isolation. That one tries to use positioning and volume clues to figure out which instrument or voice is which. Tools that depend entirely on content clues fall apart immediately in the face of MP3. MP3, without putting too fine a point on it, remixes a show for good efficiency. If positioning clues happen to be part of the mix, you lose.

So that’s what we can do. See if that tool does anything for you.


I’m just now paying attention. You want this thing to run in real time. Audacity doesn’t do anything in real time save Record, Play and some Timers.


Yes that’s the problem.

Is it possible to filter all the audio in the driver level with the vocal isolation tool of audacity?

In spite of our best efforts, Audacity still doesn’t run in real time. It’s a popular feature request. I myself, with these fingers, have run into this problem. I need Audacity to filter XXX in real time.

Nope. Sorry.

By the way, one of the serious problems when you start digging into these tools is delay. You think cellphone delays are a problem? Getting into and out of a computer can take enough time to plan a vacation or clean out the vegetable bin.


You said vocals appear on the central channel. What if only the central channel is kept and other channels are cut off on driver level? Is that possible?

You said vocals appear on the central channel.

Nice try.

A good, undistorted stereo musical show, as a general rule, has two independent signals; Left and Right. If everything happens correctly, Left and Right stay left and right from the production all the way up to your ears. This gives you the ability to predict with uncanny accuracy that the violins are going to appear on the left, French horns on the right.

As a general rule, vocals appear stereo center, which means the voice appears equally in both left and right signals. That’s it. There is no “center channel.” It’s convenient in music production to also put the drums and bass equally on left and right. Nobody can tell where bass is coming from and drums are distracting when they wander left to right.

Nobody can ever leave well enough alone, so producers like to either add big concert hall stereo effects, or actually shoot vocals in a big concert hall. This mucks up the idea that the vocals appear equally in left and right. You can get a nice deep effect by making the voice slightly different left to right.

So that’s what you have. Left and Right signals with vocal, bass and drums equally in both. Fancy productions can “leak” vocal differences between left and right.

All these variations give you the YouTube instructions for how to remove vocals from a song…that turn out to only work on one song. There must be thousands of those videos. There is no simple technique for vocal isolation. Any way you do it requires the software to “know” content, a much more serious affair than just inverting Right and smashing Left and Right together, which is how the videos do it.


There is a newbie test where you set someone the task of Vocal Isolation based on Vocal Removal working really well. This is akin to sending the new kid at camp to find the keys to the oar locks.

You can do the karaoke trick of reversing, say Right, and then adding Left and Right together to get the song without the vocal (assume you have a song that works really well—and there are some that do). Now that you have the vocal-free song, it’s natural to assume you can use those parts to suppress the music and isolate the vocal. You go ahead. We’ll wait.

Somebody in the past got through a screen-full of calculations trying to figure the spells.

The killer is you did not get the vocal removal from the stereo show. You got it from the mono mix—a different show—and there is no way to reverse the process.
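A toy calculation makes the trap concrete. The amplitudes here are made up for illustration: a vocal `v` panned center, an instrument `a` hard left, an instrument `b` hard right:

```python
v, a, b = 1.0, 0.8, 0.5    # made-up amplitudes

L = v + a                  # Left  = vocal + left-only instrument
R = v + b                  # Right = vocal + right-only instrument

karaoke = L - R            # the "vocal-free" result: really a - b, and mono
mono = (L + R) / 2         # a stereo-to-mono mixdown: a different show

vocal_guess = mono - karaoke   # the naive "isolation" attempt
print(round(vocal_guess, 2))   # 1.35, not the vocal's 1.0
```

The subtraction removed the vocal from a signal that never contained the same instrument balance as the mono mix, so the music leaks right back into the "isolated" vocal.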

You can do this the way cellphones do it. They try to recognize your voice, to “know” content. I’m amazed they work at all.


The exact reverse of “simple vocal removal” is “stereo to mono”.
The result is that the center keeps the same volume whereas left and right are reduced by a mere 6 dB.
Obviously, this approach doesn’t work.
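A back-of-the-envelope check on that 6 dB figure (the amplitudes are illustrative, not from any real recording):

```python
import math

center = 1.0     # a center-panned vocal: equal amplitude in L and R
hard_left = 1.0  # an instrument panned hard left: present in L only

# Stereo-to-mono mixdown: M = (L + R) / 2
mono_center = (center + center) / 2   # 1.0: full volume survives
mono_left = (hard_left + 0.0) / 2     # 0.5: merely halved

drop_db = 20 * math.log10(mono_left / hard_left)
print(round(drop_db, 1))   # -6.0: audible, but nowhere near removed
```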
My “Vocal Reduction and Isolation” tool reduces the sides by infinity (0.0 linearly), but sounds that are only halfway away from the center are logically still at -6 dB (0.5 linearly).
It achieves this by using a Fast Fourier Transform.
Thus, your driver-based effect would have to do the same. The latency is determined by the window size (8192 samples in the Audacity effect).
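For scale, the latency implied by that window size is easy to work out, assuming a standard 44.1 kHz sample rate:

```python
window = 8192        # FFT window size, as quoted for the Audacity effect
sample_rate = 44100  # samples per second (CD-quality audio)

latency_ms = 1000 * window / sample_rate
print(round(latency_ms))   # 186 ms before the first processed sample emerges
```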
Is it doable?
Of course it is, if you know how to program at the driver level. The main problem is how to get the information from the device manufacturer…
You’d be better off doing this in an open-source audio player and writing an add-on for it (e.g. for Foobar 2000).

My “Vocal Reduction and Isolation” tool…

It’s not simple cancellation any more. The tool has to start knowing content such as where is the sound coming from, how far from stereo middle it is, etc. and then make intelligent decisions, build sound envelopes, based on the results.

Impressive, but it doesn’t make a very good two-minute YouTube video. We gotta work on that.


left and right are reduced by a mere 6 dB.

I was admiring the flowers and clouds and semi-thinking about this. 6 dB is a piffle compared to the full range of sounds and production, but it is half. Half the voltage in the presentation signal just vanished. Could you not use that as the cue for further processing? Sense the parts of the show that dip 6 dB and assume they are the ones that need further processing?

Or is that what you’re already doing and I’m late to the ball.


Indeed, I’m just doing that:
The audio is first transformed into side (L-R) and mid (L+R).
In the short-time Fourier transform, the side magnitude is divided by the mid magnitude, for all of the 4096 frequency bands.
The resulting array of values is then remapped into a weighting for the mid channel, and the mid spectrum is multiplied by those weights, still bin by bin.
The reconstructed mid channel is now indeed the center with a magnitude proportional to the distance from the center-panned position.
Subtracting the center from L and R respectively gives the stereo audio without center.
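Those steps can be roughly sketched with SciPy’s STFT. To be clear about what is assumed: the `1 - ratio` remapping below is my stand-in for the real tool’s weighting curve, and all the names are mine:

```python
import numpy as np
from scipy.signal import stft, istft

def extract_center(left, right, fs=44100, nperseg=8192):
    """Sketch of the mid/side spectral method described above.
    An 8192-sample window gives ~4096 frequency bands."""
    mid = left + right    # center-panned content adds up
    side = left - right   # center-panned content cancels

    _, _, MID = stft(mid, fs=fs, nperseg=nperseg)
    _, _, SIDE = stft(side, fs=fs, nperseg=nperseg)

    # Side-to-mid magnitude ratio per bin: ~0 for dead-center
    # sounds, growing the further a sound sits off-center.
    eps = 1e-12
    ratio = np.abs(SIDE) / (np.abs(MID) + eps)

    # Remap the ratio into a weight for the mid spectrum
    # (1 at dead center, falling toward 0 off-center) and
    # apply it bin by bin. The real remapping is fancier.
    weight = np.clip(1.0 - ratio, 0.0, 1.0)
    CENTER = MID * weight

    _, center = istft(CENTER, fs=fs, nperseg=nperseg)
    center = center[: len(left)] / 2.0  # undo the (L + R) doubling

    # Subtracting the estimated center from L and R leaves
    # the stereo audio without the center.
    return center, left - center, right - center
```

As a sanity check on the math: for a fully center-panned signal (left identical to right), the side spectrum is zero, every weight is 1, and the function returns the signal itself as the center.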