Filter Recorded Audio into three constant levels

On the Audacity Discord, I was told to ask here about a Nyquist script that could do this.

I want to filter a vocal audio recording through Audacity so that there are three constant levels, replacing the original recording with a continuous beep sound:

  1. If it’s quieter than a certain low-end threshold, it’s silenced completely.
  2. If it’s louder than a certain high-end threshold, it’s a constant loud beep.
  3. Everything in between these two thresholds is a constant medium beep, between silence and the loud beep.

I can use this data to automate crude lip-syncing for a Blender 3D animation. I can use an audio file to drive keyframes: silence for “closed mouth”, medium volume for “small open mouth”, and loud volume for “big open mouth”. Then I can play the unedited recording along with the mouth flaps I’ve made from this. Can this be done with a Nyquist script?

I tried asking ChatGPT for help and got this result:

As an AI language model, I cannot write a script in Audacity. However, I can provide you with a general outline of the steps you can take to achieve the desired result.

Open the vocal recording in Audacity and select the entire track.
Go to “Effect” and select “Normalize.” In the “Normalize” dialog box, check the “Remove DC offset” option and set the “Normalize maximum amplitude” to a desired level that is considered as “loud” in your case.
Go to “Effect” again and select “Noise Reduction.” In the “Noise Reduction” dialog box, click “Get Noise Profile” and select a portion of the track where there is no vocal or desired tone.
With the entire track still selected, go back to “Effect” and select “Nyquist Prompt.” In the “Nyquist Prompt” dialog box, enter the following code:

; Define the high and low thresholds for silence and constant tone respectively
(setq low-threshold -50) ; Adjust this value as per your preference
(setq high-threshold 0) ; Adjust this value as per your preference

; Define the tone frequencies for medium and constant tones respectively
(setq medium-tone-frequency 1000) ; Adjust this value as per your preference
(setq constant-tone-frequency 2000) ; Adjust this value as per your preference

; Define the silence and constant tone duration in seconds respectively
(setq silence-duration 0.5) ; Adjust this value as per your preference
(setq constant-tone-duration 0.5) ; Adjust this value as per your preference

; Analyze each sample and apply desired tone accordingly
(do ((i 0 (+ i 1)))
    ((= i (sound-length s)))
  (let ((sample (aref s i)))
    (cond ((<= sample low-threshold) ; Silence
           (fill-array s i (min (+ i (/ (round (* silence-duration srate)) 2)) (sound-length s)) 0))
          ((>= sample high-threshold) ; Constant tone
           (fill-array s i (min (+ i (/ (round (* constant-tone-duration srate)) 2)) (sound-length s)) (sin (* i 2 pi constant-tone-frequency/srate))))
          (t ; Medium tone
           (fill-array s i (min (+ i (/ (round (* silence-duration srate)) 2)) (sound-length s)) (sin (* i 2 pi medium-tone-frequency/srate)))))))

; Return the modified sound
s

Click "OK" to apply the script to the selected track. The script should now modify the audio track by adding constant tones in between silence and constant beeps based on the threshold values you have set.

I followed the steps, but the AI's Nyquist script wouldn't work. I just got a prompt saying "Nyquist returned the value 0.5"

It’s possible, using Steve’s envelope follower plugin, to create a control signal that matches the envelope* of the speech.
speech envelope control track.png
Then that control track could be used to move the mouth.

[* I would bandpass the speech 200 Hz to 2 kHz to remove sibilance, otherwise the mouth will open on “ess” and “eff” sounds.]
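Something like this in the Nyquist Prompt would do it. This is just a sketch: the 200 Hz and 2 kHz corner frequencies are only the values suggested above, and highpass8 / lowpass8 are Nyquist’s 8th-order Butterworth filters:

;; Sketch: band-limit the speech to roughly 200 Hz - 2 kHz before
;; envelope following, so "ess" and "eff" sounds don't open the mouth.
(lowpass8 (highpass8 *track* 200) 2000)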

I appreciate your reply and I tried this, but it isn’t giving the result that I want.

I need the waveform to be flat, constant amounts. I’m lip-syncing for a PS1-style character that shifts from one mouth shape to another by moving the texture coordinates. So I want to be fed values from the audio that snap to 0.0, 0.5, and 1.0 (0.0 being silence, 0.5 being medium sounds, and 1.0 being loud sounds). The problem with using my unedited voice recordings is that the waveform slides up and down (resulting in the mouth texture on the face just sliding up and down); it doesn’t snap to these values.

In regard to the ChatGPT script: I tried working it out with the AI, but it keeps giving me a script that outputs a “Nyquist returned the value:0.500000” prompt.

Hi. That was me. I’m not on Discord very often, and I find Discord pretty bad for coherent discussion, whereas on this forum you get your own thread to discuss your topic without interruption.

The forum is also much better for displaying code. Use the “</>” button to enter “code tags” that look like this:
[code][/code]
then enter the code between the [code] and [/code] tags so that it looks like this:
[code]
Code goes here.
Line indentation works inside a code block.
[/code]
and it is displayed on the forum like this:

Code goes here.
    Line indentation works inside a code block.



OK, I think that can be done, but we need to frame the specification more precisely before we start.

If you zoom in a long way on an audio track, you will see the waveform, like this:

First Track000.png
Notice that the waveform is going up and down, passing through zero thousands of times per second.
Obviously you don’t want your high/mid/low tones to be switching thousands of times per second.

Zooming out a bit, we see something like this.
The picture below shows a female voice saying the words “expecting her mother to be somewhere near”:

First Track001.png
We can now see the “shape” of the audio, which is what I think you want to track.
To get the “shape”, we need to be looking at some kind of “average” level, by stepping through the waveform in small blocks.
I would guess that we would need blocks of about 1/10th of a second, so we are tracking the average level in 0.1 second steps.

A common way to measure an “average” level of a waveform is to measure the “RMS” level. This is a good way to measure the average level as it takes care of the fact that the audio has positive amplitude (above the central “0.0” line in the track) and negative amplitude (below the central line).
“RMS” stands for “Root Mean Square” - if you’re not good at maths, then it’s probably sufficient to just think of it as a special kind of “average”.
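Written out, for one block of N samples x_1 … x_N, the RMS is:

\[ \text{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2} \]

that is: square every sample, take the mean of the squares, then take the square root.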

For our initial tests, we will run commands in the Nyquist Prompt.

In modern Nyquist, the selected audio can be accessed from the symbol *TRACK* (Nyquist is a “case insensitive” programming language, so *TRACK* may also be written as *track*).
Nyquist also has a command for calculating RMS: Nyquist Functions
So let’s try that, using *TRACK* as the audio (a selection in a mono audio track) and 10 as the RATE (measuring the selected audio in blocks of 1/10th of a second - that is, 10 blocks per second):

(rms *track* 10)

When I apply this code (via the Nyquist Prompt) to the audio shown above, the result is this:

First Track002.png
😲 It’s gone!

Well actually, no. It’s still there but it is now much shorter.
The RMS function with RATE set to 10, takes blocks of 1/10th of a second, calculates the RMS (“average”) for that block, and returns one sample for that block. Each 1/10 second is now represented as a single sample.
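If you would rather see the numbers than the waveform, here is a quick sketch (assuming a mono selection) that uses snd-fetch to read the first ten RMS block values and returns them as text rather than audio:

;; Sketch: show the first ten RMS block values as text.
;; snd-fetch reads the next sample from a sound, or NIL at the end.
(let ((env (rms *track* 10))
      (text ""))
  (dotimes (i 10)
    (let ((val (snd-fetch env)))
      (if val
          (setf text (format nil "~a~a~%" text val)))))
  text)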

Let’s zoom in really close:

First Track003.png
Now we can see that the waveform is following the “shape” of the audio. Perhaps not enough detail, so let’s try increasing the RATE (smaller blocks).

(rms *track* 1000)

This will measure 1000 blocks per second, with each block being 1/1000th second (1 ms):

First Track004.png
Hmm, perhaps the block size is a bit too small now - you probably don’t want it to be as wiggly as that.

Have a play with this code, and work out a suitable value for RATE.

Once we’ve decided on that, the next thing will be to decide exactly what we mean by “high” and “low”.
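Just to show roughly where this is heading, here is an unpolished sketch of the eventual quantizing step. The thresholds (0.05 and 0.2) are placeholder values - deciding the real ones is exactly the part we still have to work out - and a mono selection is assumed:

;; Rough sketch only - placeholder thresholds, mono selection assumed.
(setf rate 10)           ; RMS blocks per second (the RATE from above)
(setf low-thresh 0.05)   ; below this -> 0.0 (placeholder value)
(setf high-thresh 0.2)   ; above this -> 1.0 (placeholder value)

(let* ((env (rms *track* rate))                 ; the "average" level
       (n (truncate (* rate (get-duration 1)))) ; number of blocks
       (vals (snd-fetch-array env n n)))        ; blocks as an array
  ;; Snap each block's RMS value to 0.0, 0.5 or 1.0.
  (dotimes (i (length vals))
    (setf (aref vals i)
          (cond ((< (aref vals i) low-thresh) 0.0)
                ((> (aref vals i) high-thresh) 1.0)
                (t 0.5))))
  ;; Rebuild a control signal ('rate' samples per second) from the values.
  (snd-from-array 0 rate vals))

The track will again look very short, because the result has only ‘rate’ samples per second, but each sample is now exactly 0.0, 0.5 or 1.0.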

That’s a big problem with ChatGPT; it gives very confident answers even when it’s outputting complete rubbish. The code that it has produced “looks” like Nyquist, but it’s actually garbage.

Can this be done automatically? Analyze the sound clip to find the highest and lowest levels, then (perhaps) use those as the high and low, add/subtract 20%, and leave everything in between as the middle sound.

Basically “yes”, but we will still need to work out percentages. The divisions could be made adjustable (set their values with slider controls), for example:

;control highthresh "High threshold %" int "" 60 50 100
;control lowthresh "Low threshold %" int "" 40 0 50
;control gain "Level adjust dB" int "" 0 -12 12

(format nil "high: ~a   low: ~a   gain: ~a" highthresh lowthresh gain)

More info about control “widgets” here: Missing features - Audacity Support
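As a sketch of how those sliders might eventually be used (this assumes the ;control lines above, a mono selection, and the RATE of 10 from earlier), the thresholds could be taken as percentages of the loudest RMS block:

;; Sketch: thresholds as percentages of the loudest RMS block value.
;; HIGHTHRESH and LOWTHRESH come from the slider controls above.
(let* ((peak-rms (peak (rms *track* 10) ny:all))
       (high (* peak-rms (/ highthresh 100.0)))
       (low (* peak-rms (/ lowthresh 100.0))))
  (format nil "Peak RMS: ~a~%High threshold: ~a~%Low threshold: ~a"
          peak-rms high low))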


You will also need to ensure that the voice recording maintains a fairly constant level. If you’re using “text to speech” then this is not a problem because synthesised speech will be at a constant level, but if it’s a real voice you may need to even out the level with a compressor.