Removing Music

I’m assuming the spoken word is mono but all the music stereo.

My idea was to make an envelope (control track) to silence all the (stereo) music using amplitude modulation then truncate silence, leaving speech only.

To make the envelope (control) track …

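Split a copy of the podcast to dual mono, invert one track, then mix the two tracks together. The mono voice is identical in both channels so it cancels out, leaving only the stereo (left-minus-right) part of the music.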
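Inverting one channel and mixing it with the other amounts to subtracting right from left. A minimal Python sketch of that step, using made-up sine tones in place of real audio (the name side_signal and the test signals are only for illustration):

```python
import math

def side_signal(left, right):
    """Invert the right channel and mix it with the left, i.e. L + (-R)."""
    return [l - r for l, r in zip(left, right)]

# Synthetic example: a mono "voice" (identical in both channels)
# followed by stereo "music" (different in each channel).
n = 100
voice = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
music_l = [0.5 * math.sin(2 * math.pi * 7 * i / n) for i in range(n)]
music_r = [0.5 * math.sin(2 * math.pi * 11 * i / n) for i in range(n)]

left = voice + music_l    # voice section, then music section
right = voice + music_r

side = side_signal(left, right)
voice_peak = max(abs(x) for x in side[:n])   # mono part cancels
music_peak = max(abs(x) for x in side[n:])   # stereo part survives
print(voice_peak, music_peak)
```

The voice section comes out as exact zeros only because the two channels are sample-for-sample identical here; with real recordings the cancellation is likely to be less clean.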

Full-wave-rectify, low-pass-filter and heavily amplify this envelope (control) track, letting it clip, so it is a square wave: 0 when speech, 1 when music.
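Roughly, in Python, the rectify / low-pass / amplify chain could look like the sketch below. The moving average stands in for the low-pass filter, and the window and gain values are arbitrary guesses that would need tuning on real audio:

```python
def control_track(side, window=50, gain=100.0):
    """Turn the voice-cancelled signal into a 0..1 control track."""
    # Full-wave rectify: fold the negative half of the wave upwards.
    rectified = [abs(x) for x in side]

    # Crude low-pass filter: running average over the last `window` samples.
    smoothed = []
    total = 0.0
    for i, x in enumerate(rectified):
        total += x
        if i >= window:
            total -= rectified[i - window]
        smoothed.append(total / window)

    # Amplify hard and clip to the range 0..1, which squares the envelope off.
    return [max(0.0, min(1.0, gain * s)) for s in smoothed]

# Synthetic input: 200 samples of silence (speech cancelled out),
# then 200 samples of leftover stereo music.
side = [0.0] * 200 + [0.5 if i % 2 == 0 else -0.5 for i in range(200)]
control = control_track(side)
print(control[0], control[-1])
```

Because the average only looks backwards, the control track rises a little late at the start of the music and falls a little late at the end.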

Flip this envelope (control) track so it is 1 when voice, 0 when music (that is, 1 minus the envelope, not a polarity invert, which would give -1 during music). Then, using amplitude modulation, use this envelope (control) track to silence the music in a copy of the podcast, and use truncate silence to remove the large areas of silence where the music was.
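A toy Python sketch of this last stage, assuming the control track is already a clean 0/1 signal. Dropping every sample where the gate is closed is only a crude stand-in for a real truncate-silence effect, and gate_and_truncate is a hypothetical name:

```python
def gate_and_truncate(audio, control):
    """Silence the music with the flipped control track, then cut the gaps."""
    # Flip the control track: 1 when voice, 0 when music.
    gate = [1.0 - c for c in control]

    # Amplitude modulation: multiply the audio by the gate.
    gated = [a * g for a, g in zip(audio, gate)]

    # Stand-in for truncate silence: keep only samples where the gate is open.
    return [s for s, g in zip(gated, gate) if g > 0.0]

audio = [0.3, -0.2, 0.9, 0.8, 0.1]     # voice, voice, music, music, voice
control = [0.0, 0.0, 1.0, 1.0, 0.0]    # 1 when music
speech_only = gate_and_truncate(audio, control)
print(speech_only)  # [0.3, -0.2, 0.1]
```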

I didn’t say it would be easy, just that it may be possible.

[ Afterthought: where the mono DJ voice overlaps with the stereo music (e.g. talking over an intro) that speech will be lost, similarly any mono bits in the music will not be silenced ]