Noise reduction design

Hi everyone,

I am interested in using the great noise reduction feature outside the Audacity framework, in order to use it in real time and to allow myself more flexibility (this is why I am less interested in using the feature through Audacity scripting).
Currently, I am looking into 2 options to achieve that goal:

  1. Extract the core functionality from noise_reduction.cpp and remove unnecessary dependencies.
  2. Reimplement the noise reduction feature in Python.

Regarding option 1, I have started reading the code but it is hard to untangle. I understand that EffectNoiseReduction::Worker::ReduceNoise actually reduces the noise and that EffectNoiseReduction::Worker::GatherStatistics is used to create the noise profile, but in order for those functions to work I must first call other methods that are responsible for initialization, saving the signals into buffers, and so on. Has anyone tried that before who can give me tips on how to do it?

Regarding option 2, I already have basic code working, but it is not working well enough, probably because of implementation differences. I want to make sure I understand Audacity’s technique perfectly (I base my understanding on the C++ code and https://wiki.audacityteam.org/wiki/How_Audacity_Noise_Reduction_Works#artifacts); a rough Python sketch of these steps follows the list:

  1. Get the noise profile by performing an STFT (window size = 2048, overlap/add step = 512, FFT size = 2048, Hann window with coeff = 0.5) and extracting the mean magnitude for each frequency band (1023 bands total).
  2. Calculate: noise_threshold[band] = mean_magnitude[band] * sensitivity. The default sensitivity value is 6, which is then multiplied by log_e(10); this seems to give a very high threshold. What am I missing here?
  3. Perform an STFT on the signal you wish to denoise (same parameters as in step 1).
  4. Create a mask [dimensions: bands x time channels]: for each band, inspect the overlapping time channels and check whether the third greatest magnitude is greater than noise_threshold. The confusing part is which overlapping time channels I should inspect. The default value in the code is mNWindowsToExamine=5, but if the STFT parameters above are correct there are 7 overlapping channels.
  5. Multiply the mask by the gain; the default value is -12 dB. I guessed the conversion is gain_in_db = 10*log10(gain_amplitude).
  6. Smooth the mask in time. This is the most confusing part: the code says the attack time is 20 ms and the release time is 100 ms. Does the attack time mean how much to smooth backward in time and the release how much to smooth forward in time? If so, how is the smoothing done exactly, and what window or compression algorithm should I use?
  7. Smooth the mask in frequency: smooth mask values by averaging neighboring bands geometrically. The default number of bands is 3 in each direction.
  8. Apply the mask to the signal’s STFT.
  9. Perform an iSTFT (same parameters as in step 1).
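
To make that concrete, here is a rough Python/NumPy sketch of how I currently understand these steps. To be clear, this is only my reading and not the actual Audacity implementation: the dB conventions in steps 2 and 5, the attack/release smoothing in step 6, and the edge handling in step 4 are all my own assumptions.

```python
# Sketch of steps 1-9 above (my assumptions, NOT the real Audacity code).
import numpy as np
from scipy.signal import stft, istft

FS = 48000
NFFT = 2048                     # window size = FFT size (step 1)
HOP = 512                       # overlap/add step
NOVERLAP = NFFT - HOP

def noise_profile(noise, fs=FS):
    """Step 1: mean magnitude per frequency band of a noise-only segment."""
    _, _, Z = stft(noise, fs=fs, window='hann', nperseg=NFFT,
                   noverlap=NOVERLAP, nfft=NFFT)
    return np.mean(np.abs(Z), axis=1)                 # shape: (bands,)

def reduce_noise(signal, profile, fs=FS,
                 sensitivity_db=6.0,    # step 2 (treated as a dB offset: a guess)
                 gain_db=-12.0,         # step 5
                 n_examine=5,           # step 4: time windows examined per band
                 freq_smooth_bands=3):  # step 7
    # Step 2: threshold per band. I use the amplitude convention 10**(dB/20);
    # it might be the power convention 10**(dB/10) instead.
    threshold = profile * 10 ** (sensitivity_db / 20.0)

    # Step 3: STFT of the noisy signal (same parameters as step 1).
    _, _, Z = stft(signal, fs=fs, window='hann', nperseg=NFFT,
                   noverlap=NOVERLAP, nfft=NFFT)
    mag = np.abs(Z)
    n_frames = mag.shape[1]

    # Step 4: keep a band/time cell if the 3rd greatest magnitude among the
    # neighboring time windows exceeds the threshold.
    keep = np.zeros_like(mag, dtype=bool)
    half = n_examine // 2
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        neighbors = np.sort(mag[:, lo:hi], axis=1)
        third = neighbors[:, -3] if hi - lo >= 3 else neighbors[:, 0]
        keep[:, t] = third > threshold

    # Step 5: gain floor for suppressed cells (again the amplitude convention).
    floor = 10 ** (gain_db / 20.0)
    gain = np.where(keep, 1.0, floor)

    # Step 6: time smoothing. Here a simple per-band one-pole follower with a
    # 20 ms attack and 100 ms release; this is the part I am least sure about.
    frame_dt = HOP / fs
    attack, release = np.exp(-frame_dt / 0.020), np.exp(-frame_dt / 0.100)
    for t in range(1, n_frames):
        prev, cur = gain[:, t - 1], gain[:, t]
        coeff = np.where(cur > prev, attack, release)
        gain[:, t] = coeff * prev + (1.0 - coeff) * cur

    # Step 7: frequency smoothing, geometric mean over +/- freq_smooth_bands.
    log_gain = np.log(gain)
    kernel = np.ones(2 * freq_smooth_bands + 1) / (2 * freq_smooth_bands + 1)
    for t in range(n_frames):
        log_gain[:, t] = np.convolve(log_gain[:, t], kernel, mode='same')
    gain = np.exp(log_gain)

    # Steps 8-9: apply the mask and invert the STFT.
    _, out = istft(Z * gain, fs=fs, window='hann', nperseg=NFFT,
                   noverlap=NOVERLAP, nfft=NFFT)
    return out
```

The intent is that reduce_noise(noisy, noise_profile(noise_only_segment)) reproduces the effect; any corrections to the steps above would be very welcome.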

Thanks in advance.

It might be useful to spend a little more time in the Philosophy/Metaphor step. What’s noise? The Audacity developers neatly side-stepped a whole bunch of programming by asking the user where the noise was (the Profile step). After that, Noise Reduction comes down to subtracting the stated noise from the show with the least damage possible. I know I make that sound so easy, but defining the problem is a Big Deal.

If you have no profile, you will have to make one and that is not going to happen in real time. There are two methods I know of. One inspects the whole show or a major part of it and builds a profile from the quietest parts. It takes a while to do that and stops working if the show is unstable. It was remarkably effective and it did its thing while I was off making coffee.

The other method is more brute force, used by laptops and cellphones. Any sound present unchanged for a set time (several seconds) is treated as noise and removed. This gives you noise reduction that refuses to pass music. It’s voice-only. And it’s not real time. It needs a certain time window to sample and analyze the sound.

Which method were you planning on using?

There is another brute-force method that does work in real time. Noise Gate. Any sound below a certain volume doesn’t go through. Full Stop. There are sticky adjustments to keep it from chopping off beginnings and ends of words, but that does work. There was a commercial Noise Gate product (Keypex) that lasted until too many people produced comedy routines about what it sounded like when it was badly adjusted.

You can eliminate many of its shortcomings by look-ahead processing…which makes it not real time.

Koz

Koz, thanks a lot for your quick and detailed response.

As a first step, I wish to remove hiss noise coming from air conditioning units in order to clear a signal of speech.

I wasn’t clear enough regarding what I meant by real time: I am interested in soft real time, where a delay of up to a few hundred ms is acceptable.
I am planning to use the first method: I will find a quiet zone at the start of the signal and use it as the noise profile. As the signal advances in time, I would refresh the noise profile based on more recent samples. Obviously, this method only works if the profile refresh time is smaller than the time it takes the noise to change, but I believe it should work against the air-conditioning hiss.

I will find a quiet zone at the start of the signal

So you already have a sound file. That’s a concept shift. My idea of “Real Time” is applying Noise Reduction as I talk. There isn’t a quiet zone until I stop speaking. If you have an existing sound file, then quite a few techniques open up.

So your real goal is on-the-fly Adaptive Noise Reduction in Post Production. That’s not a dreadful idea, but it can all happen while you’re making coffee. You’re designing an improvement of the first method. Chew on everything and correct where needed.

If you really want to tempt fate, make it recursive. Do a gentle correction at first and then fold back to the beginning and do it again.

We should note that ACX doesn’t like distractions and they don’t like audible processing. If you have a background that changes over the course of the reading, that might not be welcome.

There’s a standing joke that many of the correction tools work really well if you don’t need them. I accidentally discovered a technique of using very gentle Noise Reduction and very gentle Noise Gate one after the other. You get very serious background noise reduction with very little voice damage.

Koz

That’s how Audacity does “real-time preview”, though I think for this effect you will probably need more than a few hundred ms.

So the assumption is that from “track start” to “quiet zone”, the background noise matches the noise in “quiet zone”.

So you’re updating the noise profile, each time a “quiet zone” is found?

As you’re looking to remove fairly constant noise, I assume that the new noise profile will “refine” rather than “replace” the original noise profile.

You are losing me a bit here. Surely you can only refresh / update the noise profile when the effect encounters a “quiet zone”, which for a speech recording could be several seconds, or for music could be several minutes after the first “quiet zone”?

We could make the assumption that the noise is changing slowly, or not at all, in which case it is safe to assume that the noise after a “quiet zone” is similar to the noise in the quiet zone. We then apply the updated noise profile from the time that the profile has been calculated (which is a little way after the samples that are currently having noise reduction applied), and keep using that noise sample until the next profile update.

The first noise profile is problematic as we don’t know when the first “quiet zone” will be.
One possible solution to that is that we use a generic noise profile (perhaps low amplitude pink noise) until the first “quiet zone” is found. This approach would allow the effect to work with a known, constant delay. For example, the effect could look a constant half second ahead for “quiet zones”, and would use the “generic profile” until the first “quiet zone” is encountered. In practice, as most recordings start with a bit of a lead-in, the first “quiet zone” would be right at the start, so the generic noise profile can be updated straight away.
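
For illustration, one way such a “generic profile” could be produced is by spectrally shaping white noise into low-amplitude pink-ish (1/f) noise and feeding that to the profile step until a real quiet zone turns up. The amplitude and length below are arbitrary guesses, not values from Audacity.

```python
# Sketch of a "generic profile" source: low-amplitude pink-ish noise made by
# shaping white noise with a 1/sqrt(f) spectral envelope (values are guesses).
import numpy as np

def generic_pink_noise(fs=48000, seconds=1.0, rms=0.001):
    n = int(fs * seconds)
    spectrum = np.fft.rfft(np.random.randn(n))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    shaping = np.ones_like(freqs)
    shaping[1:] = 1.0 / np.sqrt(freqs[1:])             # -3 dB/octave: pink noise
    pink = np.fft.irfft(spectrum * shaping, n)
    return pink * rms / np.sqrt(np.mean(pink ** 2))    # normalize to target RMS
```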

Not exactly: when I start my system it searches for the first quiet zone. This is an initialization phase which can take a few seconds; meanwhile, the filtering isn’t active and the output is the noisy sound without any change or delay. After the initialization phase I would use the obtained noise profile to filter the output, which could cause a delay of up to a few hundred ms. As time progresses, I would update the noise profile based on the received signal samples (which I always save in a cyclic buffer).
To summarize (a rough sketch of this loop follows the list):

  1. Initialization step - find a quiet zone for the noise profile; no delay, no filtering; lasts a few seconds.
  2. Filter step - delay of up to a few hundred ms; filters the noise; lasts until refresh_timer expires. While filtering, save the incoming noisy data into a cyclic buffer.
  3. Refresh step - search the cyclic buffer for a new quiet zone and update the noise profile.
  4. Go back to step 2.
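
Here is a skeleton of that loop in Python. It assumes the noise_profile / reduce_noise sketch from my first post plus a find_quiet_zone helper (a simple RMS search over the buffer); the block size, buffer length, refresh interval and the my_noise_reduction module name are all placeholders, and I am glossing over proper overlap handling at block boundaries.

```python
# Skeleton of the initialize / filter / refresh loop above (placeholders only).
import collections
import numpy as np
# Hypothetical module holding the sketches from this thread:
from my_noise_reduction import noise_profile, reduce_noise, find_quiet_zone

FS = 48000
BLOCK = 24000                 # ~0.5 s of audio per iteration (placeholder)
REFRESH_S = 10.0              # how often to look for a fresher quiet zone
BUFFER_S = 5.0                # length of the cyclic buffer

def run(read_block, write_block):
    ring = collections.deque(maxlen=int(BUFFER_S * FS))   # cyclic buffer
    profile = None
    since_refresh = 0

    while True:
        block = read_block(BLOCK)          # caller-supplied audio source
        if block is None:
            break
        ring.extend(block)

        if profile is None:
            # Step 1: initialization. Pass audio through unfiltered until a
            # quiet zone shows up in the buffered audio.
            quiet = find_quiet_zone(np.array(ring), FS)
            if quiet is not None:
                profile = noise_profile(quiet, FS)
            write_block(block)
            continue

        # Step 2: filter (this is where the few-hundred-ms latency lives).
        write_block(reduce_noise(block, profile, FS))
        since_refresh += len(block)

        # Step 3: refresh the profile from the cyclic buffer, then back to 2.
        if since_refresh >= REFRESH_S * FS:
            quiet = find_quiet_zone(np.array(ring), FS)
            if quiet is not None:
                profile = noise_profile(quiet, FS)   # or blend with the old one
            since_refresh = 0
```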

Hi steve, thanks a lot for your detailed response.

Why? The look-ahead buffer must have enough information to perform a very short STFT. Given that fft_size = 2048 samples and the overlap/add step = 512 samples, I need fft_size + 3*overlap/add = 3584 samples to calculate a single time frame; with my sampling rate of 48000 Hz, that delay is ~75 ms. I am not sure how large the time smoothing is (the documentation mentions a release time of 100 ms, but I am not sure what that means; see point 6 in my first post), but let’s assume it is not larger than 300 ms. So my total delay would be 300 ms + 75 ms + process_time (<100 ms) = 475 ms.
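
The same budget, spelled out (the 300 ms smoothing bound and the 100 ms processing allowance are my guesses):

```python
# Latency budget for the look-ahead (smoothing and processing bounds are guesses).
FS = 48000
FFT_SIZE, HOP = 2048, 512
stft_ms = 1000.0 * (FFT_SIZE + 3 * HOP) / FS   # 3584 samples -> ~75 ms
smoothing_ms = 300.0                           # assumed bound on time smoothing
processing_ms = 100.0                          # assumed bound on compute time
print(stft_ms + smoothing_ms + processing_ms)  # ~475 ms total
```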

Yes. I assume that the noise changes slowly.


Correct.

That’s the exact assumption I am making.

Regarding the first noise profile, my solution right now is not to filter at all until I find a quiet zone.

I guess it comes down to what we mean by “a few” :wink:
One or two hundred ms may be pushing it - I was thinking in the region of around half a second (500 ms), which is in the same ballpark as your estimate above.


How will you define “quiet zone”?

I agree that 0.5 s is more than a few :slight_smile: But I think my estimate is something of a worst case, because I couldn’t figure out exactly what the smoothing algorithm is. I hope someone can clarify that for me so we can have a more realistic delay estimate.

With a simple energy detector (RMS), during the initialization phase, which could take more than a few seconds, I would look for a ~400 ms window with minimal energy relative to the other windows. This window would be the first quiet zone.
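
Something along these lines (the 400 ms window, the 100 ms hop and the requirement of a few seconds of buffered audio are my own choices):

```python
# Sketch of the quiet-zone search: slide a ~400 ms window over the buffered
# audio and return the stretch with the lowest RMS (all values are guesses).
import numpy as np

def find_quiet_zone(audio, fs, win_s=0.4, hop_s=0.1, min_total_s=3.0):
    if len(audio) < min_total_s * fs:
        return None                          # not enough buffered audio yet
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = list(range(0, len(audio) - win + 1, hop))
    rms = [np.sqrt(np.mean(audio[s:s + win] ** 2)) for s in starts]
    best = starts[int(np.argmin(rms))]
    return audio[best:best + win]
```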

RMS detection should be pretty quick. Even Nyquist can get the RMS of a 5 minute audio track almost instantly.
But if you’re doing that, why not scan the entire selection for RMS, then create the noise profile from say the lowest 1 percentile? Doing so would greatly simplify the code.

The reason the initialization phase could take some time isn’t the processing time of RMS; it’s that I need enough samples to be able to extract a "good" quiet zone.

If I understand you correctly, you suggest extracting the windows (let’s say ~100 ms) with the lowest RMS values (the lowest 1%), joining them together to create a longer noise signal (or a longer "quiet zone"), and then using that to create a noise profile. This sounds like a good idea. The only difficulty is how to join the different parts while avoiding big phase jumps between them, which could lead to some sort of distortion.

Anyway, before I can do any of that I need to make my code work for the simple case where the noise profile is given to me and I apply the noise reduction process, not in real time. This leads me back to the original problem of figuring out how noise reduction works exactly (see my first post).

We’re both in the same boat there. It’s not easy to read. When I get time I’d like to do a similar thing (automatic noise profile and real time preview) for Audacity, but as I’m relatively inexperienced in C++, I’m expecting it to be a big job.

I assume that the new noise profile will “refine” rather than “replace” the original noise profile.

If you don’t do that, you could get pumping background noise which would not be welcome. On the other hand, making the noise reduction much better at the end of the show than at the beginning might not be welcome, either. ACX likes chapter beginnings and ends to match. That’s why I suggested multi-pass.

How will you define “quiet zone”?

Which folds back to “What’s Noise?” If you find quiet sound with very similar characteristics to the performance, it could be part of the show. Figuring that out could be very desirable. That could eliminate many of the shortcomings of the more brute-force methods.

Koz

I agree that refining the noise profile is better than replacing it with a new one. What do you mean by multi-pass, and how does it answer the requirement of soft real time?

So right now I define a quiet zone with a simple energy detector (RMS), as I described a few posts back. It is true that this method could accidentally select a frame containing quiet speech (plus noise) instead of just noise, but that should rarely happen if the noise is stationary enough and I search for a quiet zone in a buffer that spans enough time.

How rude would it be if I tried to contact one of the code contributors? :slight_smile:

That depends what you say to them :smiley:
The person to try and contact is Paul Licameli. He wrote the current version of Noise Reduction. One way to contact him is via the Audacity developers’ mailing list; see: https://www.audacityteam.org/contact/mailing-lists/