Music Source Separation

One thing that is needed in Audacity is the ability to separate audio sources from a given track. There is already a tool that does this, and Audacity could take inspiration from it to build something similar: Facebook's Demucs.

Demucs separates the elements of an audio track into four different stems (bass, drums, vocals and other). It works with a deep neural network trained on a dataset of songs whose isolated stems are known: whatever the model recognizes as bass is written to a new "bass" track, whatever it recognizes as voice goes onto the "vocals" track, and so on.

The problem with Demucs is that it is very slow, in large part because the whole pipeline that runs the algorithm and performs the separation is written in Python.

Source separation can also be very handy for isolating noise, especially noise originating from old LPs or from the soundtracks of old 16 mm movies.

What could be added to Audacity is a similar technique, but implemented in native code (even assembly) rather than something as slow as Python, and separating into more tracks, such as "hum", "noise", etc. There are three different ways to achieve this.

  1. The "normal" model-based way: using a trained model/database to recognize the different elements of the audio (implemented in native code, of course, not Python or another slow language).

  2. Using the image generated by the spectrogram instead of a database of models. This technique seems to me to be more accurate. For that, Audacity's spectrogram would need to become as good and as fast as the one in iZotope RX, and would need tools such as a magic wand to select specific parts of the audio, individual harmonics, and so on.

As far as voices are concerned, the image generated by the spectrogram is good enough for us to visually distinguish what is a voice and what is noise, hum, bass, etc. Human voices have a specific, easily recognizable shape in the spectrogram.

iZotope RX: voice inside an audio track

By the way, the iZotope RX spectrogram settings used above are:
Adjustable STFT
FFT size: auto
Window: Hamming
Color map: Cyan to Orange
Frequency scale: Mel
Amplitude range (low, dB): -120
Amplitude range (high, dB): 0
High-quality rendering: checked (enabled)
Enable reassignment: unchecked (disabled)
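As a rough illustration of what those settings amount to, here is a minimal NumPy sketch of a Hamming-windowed STFT spectrogram with the listed -120..0 dB amplitude range; the mel formula shown is only for the frequency axis, and the FFT size and hop are arbitrary choices since "auto" is an iZotope-internal behavior:

```python
# Minimal sketch, assuming NumPy only. Window, dB range and mel scale follow
# the settings listed above; FFT size and hop length are illustrative guesses.
import numpy as np

def spectrogram_db(x, sr, n_fft=1024, hop=256):
    """Magnitude spectrogram in dB, Hamming-windowed, clipped to [-120, 0]."""
    window = np.hamming(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    mag = np.array(frames).T                       # (freq_bins, time_frames)
    mag_db = 20 * np.log10(mag / (mag.max() + 1e-12) + 1e-12)
    return np.clip(mag_db, -120.0, 0.0)

def hz_to_mel(f):
    """Mel frequency scale (O'Shaughnessy formula), used for the y axis."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)                 # 1 s, 440 Hz test tone
spec = spectrogram_db(tone, sr)
peak_bin = spec.mean(axis=1).argmax()
print(peak_bin * sr / 1024)                        # ~440 Hz, the tone frequency
```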

Audacity: voice inside an audio track

What can be improved is making Audacity's output as accurate (and even identical) as the iZotope RX output in the images above. Or even making the Audacity UI more similar to iZotope's.

Separation from the generated images seems less complicated, since we can actually "see" where the voice is, and with simple image-detection algorithms we can split voices from the other elements as well. Working on images is also handy because we can equalize the generated image, enhancing it and making the elements easier to identify. (Especially when dealing with very quiet sounds, which are represented as dark areas in the spectrum: if a sound is barely audible, equalization can boost it so we can isolate it more reliably.)
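The image-equalization idea can be sketched very simply: map the dB range to brightness and apply a gamma curve that lifts the dark (quiet) regions. The gamma value and the assumed -120..0 dB input range are illustrative choices:

```python
# Hedged sketch: brightening dark (quiet) regions of a dB spectrogram image
# with a simple contrast stretch, so low-level sounds become visible before
# any image-based detection runs. `spec_db` values are assumed in [-120, 0].
import numpy as np

def enhance(spec_db, gamma=0.5):
    """Map [-120, 0] dB to [0, 1] brightness, then apply gamma < 1 to lift
    quiet regions (similar in spirit to equalizing the image histogram)."""
    norm = (spec_db + 120.0) / 120.0       # 0 = silence, 1 = full scale
    return np.clip(norm, 0.0, 1.0) ** gamma

quiet = np.full((4, 4), -100.0)            # barely audible region
loud = np.zeros((4, 4))                    # full-scale region
print(enhance(quiet).mean(), enhance(loud).mean())
```

After the stretch, the quiet region's brightness more than doubles while full-scale content is untouched, which is exactly the behavior wanted before running detection.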

The technique is similar to OCR, where the pieces of an image associated with a word, phrase or letter can be properly identified.
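A minimal version of that OCR-like step is connected-component detection: threshold the spectrogram image and group bright time-frequency cells into blobs, each blob being a candidate element (a word, a hum line, a drum hit). This is only a sketch of the idea in plain Python/NumPy, not any existing Audacity code:

```python
# Hedged sketch: find connected bright blobs in a spectrogram image via
# thresholding plus breadth-first search over 4-connected neighbours.
import numpy as np
from collections import deque

def find_blobs(image, threshold):
    """Return a list of blobs; each blob is a list of (row, col) cells."""
    mask = image > threshold
    seen = np.zeros_like(mask, dtype=bool)
    blobs = []
    for r in range(mask.shape[0]):
        for c in range(mask.shape[1]):
            if mask[r, c] and not seen[r, c]:
                blob, queue = [], deque([(r, c)])
                seen[r, c] = True
                while queue:                       # BFS over 4-neighbours
                    y, x = queue.popleft()
                    blob.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                blobs.append(blob)
    return blobs

img = np.zeros((6, 8))
img[1, 1:4] = 1.0                                  # one horizontal streak
img[4, 5:7] = 1.0                                  # a second, separate streak
print(len(find_blobs(img, 0.5)))                   # 2 separate blobs
```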

Also, once we identify the different elements, we can split everything present in a given audio track and put each element on a separate track: voices, drums, bass, hums, noise, guitar, trumpet, and so on.
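Once time-frequency bins are labelled (for instance by the blob detection above), each element can be rebuilt on its own track by masking the STFT and inverting it. The sketch below uses non-overlapping rectangular frames, where inversion is exact; a real implementation would use overlapping windows. The frequency split and test tones are illustrative:

```python
# Hedged sketch: separate labelled time-frequency regions into per-element
# tracks by applying boolean masks to the STFT and inverting frame by frame.
import numpy as np

def split_by_mask(x, n_fft, masks):
    """masks: dict name -> boolean (freq_bins, n_frames) array.
    Returns dict name -> separated signal (the future 'bass', 'vocals' tracks)."""
    n_frames = len(x) // n_fft
    spec = np.array([np.fft.rfft(x[i*n_fft:(i+1)*n_fft])
                     for i in range(n_frames)]).T
    out = {}
    for name, mask in masks.items():
        frames = np.fft.irfft(spec * mask, n=n_fft, axis=0)  # per-frame inverse
        out[name] = frames.T.reshape(-1)
    return out

sr, n_fft = 8000, 256
t = np.arange(sr) / sr
mix = np.sin(2*np.pi*250*t) + np.sin(2*np.pi*2000*t)  # "bass" + "voice" tones
n_frames = sr // n_fft
bins = n_fft // 2 + 1
low = np.zeros((bins, n_frames), dtype=bool)
low[:16, :] = True                                 # bins below ~500 Hz
masks = {"bass": low, "voice": ~low}
parts = split_by_mask(mix, n_fft, masks)
print(sorted(parts))                               # ['bass', 'voice']
```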

By separating the elements we can also reconstruct damaged audio. For example, suppose we manage to separate a voice track. The generated image may contain the harmonics of a given word at different volumes: say we identify the word "house" and its fundamental, around 300 Hz, is louder than the corresponding harmonic in the 2 kHz band. To reconstruct the word properly, we can simply identify all of its harmonics and adjust their volumes toward their average, enhancing specific harmonics in a given band and producing a more intelligible word. And once we have rebuilt some of the harmonics, we can also regenerate harmonics of the voice that were cut off at higher or lower frequencies, for example.
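The harmonic-leveling step can be sketched on a single spectrum: given an already-identified fundamental, pull each harmonic's magnitude to the average while keeping its phase. The function name, bin positions, and magnitudes below are all illustrative, not an existing API:

```python
# Hedged sketch of the harmonic-repair idea: scale each harmonic of a known
# fundamental toward the average harmonic magnitude, preserving phase.
import numpy as np

def level_harmonics(spectrum, f0_bin, n_harmonics):
    """Adjust the magnitude of each harmonic bin of a complex rFFT `spectrum`
    to the average magnitude of the first `n_harmonics` harmonics."""
    out = spectrum.copy()
    bins = [f0_bin * k for k in range(1, n_harmonics + 1)]
    target = np.mean([abs(spectrum[b]) for b in bins])
    for b in bins:
        mag = abs(spectrum[b])
        if mag > 1e-12:
            out[b] *= target / mag                 # adjust volume, keep phase
    return out

n = 1024
spec = np.zeros(n // 2 + 1, dtype=complex)
spec[10], spec[20], spec[30] = 1.0, 0.1, 0.4       # uneven harmonics of bin 10
fixed = level_harmonics(spec, 10, 3)
print(abs(fixed[10]), abs(fixed[20]), abs(fixed[30]))  # all leveled to 0.5
```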

Note: the "OCR" technique can also be handy for the reverse operation with voices. Once we identify a voice track, we can split the words into phonemes and store them in a database. From there we can associate each phoneme (or word, etc.) with a given ASCII (or Unicode) character or text, so we can do something similar to deepfake audio, recreating new voice tracks from plain text. This technique is also handy when parts of the audio are missing and we need to rebuild them from a predefined database (or even create whole phonemes or new voices with "signatures" derived from a previously built voice database).
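The phoneme-database direction can be sketched as a plain mapping from text keys to audio snippets, rebuilt by concatenation. The phoneme labels and sine-tone "snippets" here are made up for illustration; a real system would store separated recordings and smooth the joins:

```python
# Hedged sketch: a toy phoneme database keyed by text, used to resynthesize
# audio from plain text by concatenating stored snippets.
import numpy as np

phoneme_db = {}   # text key -> audio snippet (float samples)

def register(key, snippet):
    phoneme_db[key] = np.asarray(snippet, dtype=float)

def synthesize(keys):
    """Concatenate stored snippets for a sequence of phoneme keys."""
    return np.concatenate([phoneme_db[k] for k in keys])

sr = 8000
t = np.arange(sr // 10) / sr                       # 100 ms per snippet
register("HH", np.sin(2*np.pi*300*t))              # stand-in for /h/
register("AW", np.sin(2*np.pi*500*t))              # stand-in for /aw/
register("S",  np.sin(2*np.pi*4000*t))             # stand-in for /s/
house = synthesize(["HH", "AW", "S"])              # rebuild the word "house"
print(len(house))                                  # 3 snippets of 800 samples
```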

  3. A hybrid technique: creating a method that combines the model database with the "OCR"-style detection technique described above.

Fwiw, there is a GSoC project adding machine learning source separation: