method of extracting TV show music

OK, so I’m trying to work out a process for extracting the background music from a TV show. The shows in question are How It’s Made and Scrapheap Challenge. If I go through a couple of episodes I should be able to find 6 or so copies of one of the songs, each with other sounds mixed into it because they come from different episodes.

So basically I’ll have several audio clips where the music is exactly the same in each clip but the sound effects and voices are different in each one (they reuse songs a lot). I imagine I could extract the music almost like how some people do it if they have stereo sound. I am curious if anyone has any ideas on how to compare all the clips of the song to keep what is similar between them (background music) but remove the parts that are not the same in each clip (voices and sound effects).

It would be cool if I could figure out a way to do it.

In theory, that could probably be done with FFT, but I doubt there are any existing applications.

The most basic vocal-removal technique uses simple sample-by-sample subtraction (no FFT). In this situation it would be difficult to get the samples exactly aligned, and then you’d have an equation with more than one “solution”. You could find the difference between two or more clips, but since both clips might contain unwanted sound, you would not know how much of that difference is due to the sound that you don’t want.
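
To make the alignment and subtraction problem concrete, here is a minimal Python sketch (standard library only). Everything in it is an assumption for illustration: the "music" is seeded random noise, the "voices" are sine waves, and `best_lag` and `subtract_aligned` are made-up helper names. It lines two clips up by brute-force cross-correlation and then subtracts them. The music cancels, but what is left is the difference of the two voices, with nothing to tell you which part came from which clip.

```python
import math
import random

def best_lag(ref, other, max_lag):
    """Return the shift of `other` relative to `ref` with the highest cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best, best_score = lag, score
    return best

def subtract_aligned(ref, other, lag):
    """Subtract `other` (shifted by `lag`) from `ref`; only the overlapping part is returned."""
    out = []
    for i, r in enumerate(ref):
        j = i + lag
        if 0 <= j < len(other):
            out.append(r - other[j])
    return out

# Hypothetical test signals (assumptions, not real audio).
rng = random.Random(0)
n = 400
delay = 7  # clip B starts 7 samples late
music = [rng.gauss(0.0, 1.0) for _ in range(n)]
voice_a = [0.3 * math.sin(2 * math.pi * 3 * i / n) for i in range(n)]
voice_b = [0.3 * math.sin(2 * math.pi * 11 * i / n) for i in range(n)]

clip_a = [m + v for m, v in zip(music, voice_a)]
clip_b = [0.0] * delay + [m + v for m, v in zip(music, voice_b)]

lag = best_lag(clip_a, clip_b, max_lag=20)
residual = subtract_aligned(clip_a, clip_b, lag)

# The music is gone, but the residual is voice_a - voice_b, not either voice alone.
expected = [a - b for a, b in zip(voice_a, voice_b)]
worst = max(abs(r - e) for r, e in zip(residual, expected))
print("estimated lag:", lag)
print("largest deviation from (voice_a - voice_b):", worst)
```

With real recordings the offset may not be a whole number of samples, and the music may have been re-encoded or level-adjusted per episode, so the cancellation would be far less clean than in this toy case.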

As a sort-of related example, if you record yourself saying or singing the same thing twice, subtraction sounds identical to addition!!! (The sound of the difference is not the same as the difference in sound.) Your situation isn’t quite as bad, since presumably the music is digitally identical (before mixing with other sounds).
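
A quick way to convince yourself of the "subtraction sounds like addition" point is to compare signal power. The sketch below is only an illustration under a strong assumption: two independent noise signals stand in for two separate takes of the same performance (real takes are not pure noise, but they are similarly uncorrelated at the sample level). The sum and the difference come out with almost the same power, while subtracting a digitally identical copy gives exact silence.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20000

# Assumption: independent noise as a stand-in for two separately recorded takes.
take_1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
take_2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

def power(signal):
    """Mean squared sample value."""
    return sum(x * x for x in signal) / len(signal)

sum_power = power([a + b for a, b in zip(take_1, take_2)])
diff_power = power([a - b for a, b in zip(take_1, take_2)])

# A digitally identical copy, by contrast, cancels completely.
identical_diff_power = power([a - a for a in take_1])

print("power of take_1 + take_2:", sum_power)
print("power of take_1 - take_2:", diff_power)
print("power of take_1 - take_1:", identical_diff_power)
```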

More-advanced solutions (including center-channel “isolation”) use FFT to analyze the moment-to-moment frequency content. That’s more like what you want to do.
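
One way the frequency-domain idea could be applied to several clips is sketched below. This is not how any existing tool does it, just a guess at a workable approach: transform the same (already aligned) frame from every clip, and for each frequency bin keep the median of the real and imaginary parts across clips. If the music is identical everywhere and the extra sounds land in different bins in different clips, the median tends to throw the extras away. The helper names (`dft`, `idft`, `common_spectrum`, `tone`) and the test tones are hypothetical, and a naive single-frame DFT is used only to keep the example self-contained.

```python
import cmath
import math
from statistics import median

def dft(frame):
    """Naive discrete Fourier transform (slow, but fine for a short demo frame)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            for k in range(n)]

def idft(bins):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(bins)
    return [(sum(b * cmath.exp(2j * math.pi * k * i / n) for k, b in enumerate(bins)) / n).real
            for i in range(n)]

def common_spectrum(frames):
    """Per-bin median (real and imaginary parts separately) across aligned frames."""
    spectra = [dft(f) for f in frames]
    out = []
    for k in range(len(spectra[0])):
        re = median(s[k].real for s in spectra)
        im = median(s[k].imag for s in spectra)
        out.append(complex(re, im))
    return out

n = 64

def tone(bin_index, amplitude):
    """Sine wave that completes exactly bin_index cycles in the frame."""
    return [amplitude * math.sin(2 * math.pi * bin_index * i / n) for i in range(n)]

# Assumed test data: the same "music" in every clip, a different extra tone in each.
music = [a + b for a, b in zip(tone(3, 1.0), tone(7, 0.5))]
clips = [[m + e for m, e in zip(music, tone(extra_bin, 0.8))] for extra_bin in (10, 15, 20)]

recovered = idft(common_spectrum(clips))
error = max(abs(r - m) for r, m in zip(recovered, music))
print("largest difference between recovered frame and the music:", error)
```

Real audio would need short overlapping windowed frames (an STFT) with overlap-add on the way back, sample-accurate alignment first, and enough clips that most of them are clean in any given bin. Speech and effects that overlap in the same bins across most clips would still leak through.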

My newest voice removal uses STFT (or FFT) to preserve the stereo character of the audio.
https://forum.audacityteam.org/t/karaoke-rotation-panning-more/30112/1
You could - in principle - apply the effect on different versions of the song and then mix and render them (with properly set gain of course).
Or you go the other way and mix all left channels and all right channels together, re-combine them to a stereo track and do the center removal then.
There’s also the possibility to crossfade between the tracks to gather the best parts of each version.
You can first try the normal voice removal to see how much of the unwanted sound is removed.
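
As a rough illustration of why mixing several versions with a properly set gain helps even before any center removal, here is a small Python sketch. The assumptions are mine: six already aligned clips, identical music in each, and independent noise standing in for the differing voices and effects. With a gain of 1/N per clip the music comes through at its original level, while the unwanted sound drops to roughly 1/N of its power.

```python
import math
import random

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 5000
num_clips = 6  # "6 or so copies", as in the original question

# Assumed test data: identical music, independent unwanted sound per clip.
music = [math.sin(2 * math.pi * i / 50) for i in range(n)]
extras = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(num_clips)]
clips = [[m + e for m, e in zip(music, extra)] for extra in extras]

# Mix all clips with equal gain so the music keeps its original level.
gain = 1.0 / num_clips
mix = [gain * sum(samples) for samples in zip(*clips)]

def power(signal):
    """Mean squared sample value."""
    return sum(x * x for x in signal) / len(signal)

single_noise = power(extras[0])
mix_noise = power([x - m for x, m in zip(mix, music)])

print("unwanted power in one clip:", single_noise)
print("unwanted power in the mix: ", mix_noise)
print("improvement factor:        ", single_noise / mix_noise)
```

Six clips only buy about 8 dB this way, so plain mixing is more of a starting point for the other steps than a full solution.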