Automatic Cutting and Normalizing of a Concert Recording

Hello,

from a concert I have a recording that consists mainly of:

  1. Songs performed by our choir (mid-volume)
  2. Announcements by the conductor (low-volume)
  3. Hand clapping (applause - high volume, broad spectrum)
  4. Background noise from the audience (“silence”, very low volume)

I’d now like to
A) Identify and mark the songs with labels
B) Normalize across all songs such that the loudest sections use full range while maintaining the loudness differences between the songs
C) Adjust the loudness between the remaining different parts, that is: reduce the volume of applause, boost the volume of announcements, and lower the volume of the “silent” parts (some sort of intelligent automatic gain control …).

So far, so good. However, since I have such recordings every once in a while and they comprise several hours of raw material and 25+ songs, a little automatic help in processing would be highly appreciated.

That’s now where Nyquist comes in:
As the parts of the recording have quite different characteristics in terms of loudness and spectrum, I wondered whether it would be possible to have some scripts to
identify the parts of the recording, label them, and do the other processing as well.

The idea would e.g. be to

  • provide different samples of each category to a Nyquist script to have it extract characteristics in terms of spectrum and amplitude and determine thresholds,
  • have a script that identifies the sections according to the thresholds and provides labels for each identified section which could then still be edited manually,
  • have a script to perform the loudness adjustment (or do it manually).

As for identification, I thought that maybe running an FFT once every second could be a starting point, with further refinement at the borders of the segments.

Being fairly new to Nyquist, I wonder though,

  • whether this sounds feasible at all? I wouldn’t mind long run-times for the automatic part (the PC has time, I don’t …).
  • how I could get the different samples into a Nyquist script to differentiate between the categories (as far as I understand, I can only send the selected audio of a track to Nyquist)?
  • how to best continue once I have labelled sections: I probably have to split the tracks somehow according to category so that I can normalize across the songs only. How do I manage this, and how do I ensure smooth loudness transitions when jumping from one section (e.g. clapping) to another (e.g. a new song)?
  • whether there is any possibility for a Nyquist script to generate or edit the envelope control points of a track in Audacity?

Any comments, code snippets, pointers, etc. are highly appreciated.
Thanks

Stefan_

Is there a goal?
Koz

Many of the steps that you identify are feasible. Developing a plug-in script to handle all of the steps automatically would be extremely complex.
I think that a good approach would be to separate this into small individual tasks and approach each task separately.

Audacity has a feature called “Chains” that allows a sequence (chain) of effects to be run one after the other. Chains support Nyquist plug-ins and built-in effects. Other third-party plug-ins are not currently supported by Chains. Audacity Manual


FFT in Nyquist is possible but is quite difficult and quite slow.
Finding audio above or below a specified level (either peak or RMS) is reasonably easy (see “Sound Finder” and “Silence Finder”).
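
For example, here is a minimal sketch of that idea for a mono track (the -30 dB threshold and the 0.1 second resolution are just arbitrary starting values to experiment with):

;; Minimal sketch: find sections above an RMS threshold and return
;; them as Audacity labels. Mono selection assumed.
(setf threshold (db-to-linear -30))
(setf step (round (/ (snd-srate s) 10)))  ; ~0.1 second blocks
(setf contour (snd-avg (mult s s) step step op-average)) ; mean square
(setf labels '())
(setf in-sound nil)
(setf start 0)
(do ((val (snd-fetch contour) (snd-fetch contour))
     (time 0 (+ time 0.1)))
    ((not val))
  (cond ((and (not in-sound) (> (sqrt val) threshold))
         (setf in-sound t)
         (setf start time))
        ((and in-sound (<= (sqrt val) threshold))
         (setf in-sound nil)
         (setf labels (cons (list start time "sound") labels)))))
;; close a section that runs to the end of the selection
(if in-sound
    (setf labels (cons (list start (/ len (snd-srate s)) "sound") labels)))
;; returning a list of (start end "text") entries creates a label track
(if labels (reverse labels) "Nothing found above the threshold")

Returned from the Nyquist prompt (or an analyze plug-in), that list appears as an ordinary label track that can then be edited manually.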


That is correct, though there is a rudimentary clipboard-like feature that allows passing data from one run to the next, or from one plug-in to another. It is not really suitable for passing large amounts of audio because the data is held in RAM, but it can be useful for passing lists of parameters. This feature is a global variable called *scratch* which can be set to a value of any data type (integer, float, array, list, string …). It is possible to pass a sound via *scratch*, but doing so is quite quirky, tricky to get right, and can consume huge amounts of RAM. For very short sections of audio it is easier to convert the audio to an array, store that in *scratch*, then convert the array data back into a sound.
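
A minimal sketch of how that looks in practice (the stored values are arbitrary examples):

;; First run (e.g. an analysis pass) stores some parameters:
(setf *scratch* (list -30.0 0.1))  ; e.g. threshold in dB, block size in seconds
;; A later run, or a different plug-in, reads them back:
(if (and (boundp '*scratch*) (listp *scratch*))
    (format nil "threshold: ~a dB, block: ~a s" (car *scratch*) (cadr *scratch*))
    "no stored parameters found")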

It is also possible for Nyquist to read or write files directly to disk. WAV and plain text files are supported, plus some other file types. The main limitation here is that there is no file browser for Nyquist plug-ins and it is difficult to reliably determine the current directory, so it is usually necessary to manually enter the fully qualified path and file name of the file being read or written. An example of writing a file to disk can be found in “Sample Data Export”: Audacity Manual
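
A minimal sketch of writing a plain text file (the path below is a made-up example and has to be adjusted):

;; Write one label line as plain text to disk.
(setf path "/home/stefan/labels.txt")  ; hypothetical fully qualified path
(setf fp (open path :direction :output))
(if fp
    (progn
      (format fp "~a ~a ~a~%" 1.5 3.2 "applause") ; start end text
      (close fp)
      "File written.")
    "Could not open file for writing.")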


Unfortunately Nyquist has no direct access to envelope control points. Envelope control points are a feature specific to Audacity and are stored as XML data in the .AUP file. This data is not passed to Nyquist when the plug-in is called, and there is no built-in mechanism to pass envelope points back from Nyquist to Audacity.
It would be possible (but difficult) to read a .AUP file and extract the envelope data, or conversely to read a .AUP file, add envelope data, and write it back to the file, but this would require a lot of string manipulation which can be very cumbersome in Nyquist. (Nyquist is much better at handling sounds than it is at handling text).

Amplifying sounds can be achieved quite easily. An example of this can be found in “Adjustable Fade” Audacity Manual and a more sophisticated example in “Text Envelope” Missing features - Audacity Support
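
The underlying mechanism is simply multiplying the sound by a control signal. A minimal sketch (all breakpoint values are arbitrary examples):

;; Scale the selection with a piecewise-linear gain envelope.
;; In a plug-in, pwlv times run from 0 (selection start) to
;; 1 (selection end). Example: duck the first 20% (say, applause)
;; to 0.3, then ramp back up to full level.
(mult s (pwlv 0.3       ; gain at selection start
              0.2 0.3   ; hold 0.3 until 20% in
              0.3 1.0   ; ramp up to unity by 30% in
              1.0 1.0)) ; stay at unity to the end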


I think the way that I would approach this would first be to look at how to identify the different sections of the recording.
You have already described how the sections can be identified based on the amplitude level.
Asking a plug-in to make a judgement call about which part of a recording is applause and which part is loud music would be quite difficult, but I assume it is very easy to spot by eye.
You have said that the choir music is medium volume and that the applause is high volume with a broad spectrum.
Because of the spectrum differences, it will probably be possible to exaggerate the difference in amplitude by looking at a specific part of the spectrum. For example, if you band pass the audio to allow frequencies between 2 kHz and 5 kHz, does that make the applause stand out more clearly?

(highpass8 (lowpass8 s 5000) 2000) ; band-pass: keep roughly 2 kHz to 5 kHz

high pass filters: Nyquist Functions
Nyquist manual index page: Index
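
To turn that into a measurement rather than a listening test, you could compare the smoothed level of the band-passed signal with that of the full signal. A rough sketch (block size and the small offset are arbitrary):

;; Ratio of 2-5 kHz energy to total energy, per ~0.1 s block.
;; Where the ratio approaches 1, most of the energy is in the
;; "applause band". The 0.001 offset avoids division by zero.
(setf step (round (/ (snd-srate s) 10)))
(setf band (highpass8 (lowpass8 s 5000) 2000))
(setf band-level (snd-avg (s-abs band) step step op-average))
(setf full-level (snd-avg (s-abs s) step step op-average))
;; returned as a low sample rate sound, so the result can be
;; inspected as a waveform in the track
(mult band-level (recip (sum 0.001 full-level)))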

I think that this could all be done in Nyquist.
However, it is a rather ambitious project; it involves almost everything that makes up modern DSP.
There is little you can do if part of a song or an announcement is buried in the applause, unless you have a wide stereo image to work with.
As a first experiment, you could try to “abuse” the Auto Duck effect as a quasi-compressor with very long attack and release times. To do this, you have to duplicate your track in order to have an affected signal and a control signal.
It is perhaps best if you supply a sample with the transition from one song to the other (via a file sharing site).

One good measure of applause activity could be autocorrelation, because applause is mainly uncorrelated noise.
There are practically infinite possibilities; it is difficult to judge which features will fit best.
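
As an illustration of the autocorrelation idea, a block-wise lag-1 coefficient can be computed by pulling the samples into Lisp arrays. This is slow for long selections, and the block size and the 0.5 decision threshold below are arbitrary:

;; Lag-1 autocorrelation per block: near 0 = noise-like (applause),
;; near 1 = tonal (choir, speech).
(setf blocksize 2048)
(setf noisy 0)
(setf total 0)
(do ((block (snd-fetch-array s blocksize blocksize)
            (snd-fetch-array s blocksize blocksize)))
    ((not block))
  (let ((sum01 0.0) (sum00 0.0) (n (length block)))
    (dotimes (i (1- n))
      (setf sum01 (+ sum01 (* (aref block i) (aref block (1+ i)))))
      (setf sum00 (+ sum00 (* (aref block i) (aref block i)))))
    (setf total (1+ total))
    (if (and (> sum00 0) (< (/ sum01 sum00) 0.5))
        (setf noisy (1+ noisy)))))
(format nil "~a of ~a blocks look noise-like" noisy total)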

Still not good to get buried in technical minutiae without a goal, as much fun as that is.

If it’s for broadcast, the transmitter compressors are going to take a lot of those technical decisions away from you. For casual listening/posting to YouTube, Chris’s Compressor is not bad for overall evening out and taming wild volume swings. That was its design goal.

True, splitting portions of the show into individual clips for processing may be required, but nobody is saying with any certainty that the process is going to be successful. Most times it’s not. See: Robert J. H.

Koz

Thanks a lot for all of your comments. I’ll try out your suggestions.

As for the goal:

There are two aspects:

  1. The postprocessed complete audio file will be added to a corresponding movie to replace the audio track as recorded by a camcorder. The audio quality of my field recorder is simply superior to that of the camcorder. However, the subjective experience would be even better with some volume adjustments being made as described before.
  2. The isolated songs would additionally be extracted and burnt on CD / Audio-DVD.
    There are no professional ambitions, but the better the audio quality, the more I appreciate it (even though it may sound poor to “professional ears”).

After some further experiments in Audacity I have to admit that an automatic distinction between songs, voice and silence is probably not feasible (the differences in spectrum and volume are often not significant enough). However, applause clearly has very high frequency components found nowhere else in the recording. For example, a high-pass filter with a cut-off frequency of 18 kHz and a 36 dB per octave roll-off produces an output that clearly indicates the applause sections.
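
As a first Nyquist experiment, this is roughly the detection signal I now get in the Nyquist prompt (highpass8 has a 48 dB per octave slope, even steeper than the filter I used manually, which should only help; the scaling is just for easier visual inspection):

;; Detection signal for applause, based on the observation above.
(setf step (round (/ (snd-srate s) 10)))  ; 0.1 s resolution
(setf level (snd-avg (s-abs (highpass8 s 18000)) step step op-peak))
;; "level" could now be thresholded and turned into labels with the
;; same loop as in the RMS sketch earlier in the thread
(scale 10 level)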

So far for now.
Best regards

Stefan_

If the applause is significantly louder (higher peak level) than the rest of the audio, then you may be able to use this “pop mute” effect: http://wiki.audacityteam.org/wiki/Nyquist_Effect_Plug-ins#Pop_Mute

Here’s a sample which ducks the music according to the zero crossing rate:

It is:
original intro → zero crossing rate (as a sine tone) → original multiplied by the inverse ZCR.
That’s one of those features I’ve mentioned.
One would naturally set a threshold for the ZCR, e.g. everything above 0.5 is applause.
The same applies in the time domain, i.e. only long-lasting high-ZCR regions are gated.
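
For reference, the block-wise ZCR can be computed along these lines (the block size is an arbitrary choice):

;; Zero crossing rate per block. A value of 1.0 would mean a sign
;; change at every sample; broadband applause scores much higher
;; than choir sound.
(setf blocksize 1024)
(setf max-zcr 0.0)
(do ((block (snd-fetch-array s blocksize blocksize)
            (snd-fetch-array s blocksize blocksize)))
    ((not block))
  (let ((crossings 0) (n (length block)) (zcr 0.0))
    (dotimes (i (1- n))
      (if (< (* (aref block i) (aref block (1+ i))) 0)
          (setf crossings (1+ crossings))))
    (setf zcr (/ (float crossings) n))
    (if (> zcr max-zcr) (setf max-zcr zcr))))
(format nil "maximum block ZCR: ~a" max-zcr)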
There are some other features that could be combined with this one:

  • auto correlation
  • linear prediction error
  • spectrum flatness, flux, centroid etc.
    Your high-pass energy proposal is of course also a feature.

It is actually an “applause activity detection” coding task.