So I’m trying to scramble a segment of music (without lyrics) in order to make it sound unpredictable. I have come across a research paper which has the following method for doing so (or at least this is the pre-processing bit before it’s scrambled):
“The original digitized file had its DC level set to zero, after which the envelope contour was extracted (absolute value smoothed with a 20 ms window and peak normalized to 1). A copy of the envelope was gated at 0.1 of peak threshold to identify ‘low-amplitude’ time intervals, another copy was gated at 0.2 of peak amplitude to identify ‘high-amplitude’ time intervals, and the rest of the time intervals were classified as ‘midamplitude.’”
Firstly, apologies if any of these questions are naive as I’m not familiar with the terminology at all! From what I’ve read, I believe DC is 0 as long as the waveform is not ‘off-centre’ and I understand there’s an envelope tool in the tool bar though I’m not sure how exactly to go about smoothing the absolute value and normalising the peak to 1? Also, how would I go about identifying the peak threshold - I’m not sure what this even means? I’ve read that the Noise Gate or Pop Mute plug-ins may help with the actual gating bit, or the Limiter effect, but I’m unclear as to what the 0.1 corresponds to.
If anyone could offer a non-techie any enlightenment, that would be much appreciated! I have tried to read around but feel like I’m going in circles sometimes!
[In Schwarzenegger voice]: “Get to da choppa” … https://youtu.be/PE_5mJYzV8A?t=72
[64-bit plug-ins are currently not compatible with Audacity on Windows, but there are 32-bit choppers/slicers out there].
I assume the “window” is a moving average (otherwise it wouldn’t be “smooth”), i.e. 20 ms at 44.1 kHz would be a moving average over 882 samples (0.020 × 44,100). That doesn’t really leave you with “audio”, just a low-frequency signal, and maybe that’s what is used for the “gating”?
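To make the moving-average idea concrete, here is a minimal Python sketch (outside Audacity, standard library only). It assumes the paper's "20 ms window" is a plain trailing moving average of the absolute sample values, which is only a guess at their method; the function name `smoothed_envelope` is made up for illustration.

```python
def smoothed_envelope(samples, sample_rate, window_seconds=0.02):
    """Trailing moving average of |x| over roughly window_seconds.

    Assumption: the "20 ms window" is a simple moving average.
    """
    # 0.02 s at 44100 Hz gives 882 samples.
    window = max(1, round(window_seconds * sample_rate))
    rectified = [abs(x) for x in samples]  # "absolute value" step

    envelope = []
    running_sum = 0.0
    for i, value in enumerate(rectified):
        running_sum += value
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= rectified[i - window]
        # Divide by the number of samples actually inside the window so far.
        envelope.append(running_sum / min(i + 1, window))
    return envelope


if __name__ == "__main__":
    # A full-scale square wave has a constant envelope of 1.0.
    square = [1.0, -1.0] * 2000
    env = smoothed_envelope(square, 44100)
    print(min(env), max(env))
```

The result has one value per input sample, so it can be compared against the original track sample by sample.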
Normalizing means re-scaling so the peak equals 1.0. So of course you have to read through the data to find the current peak first. Then find the multiplication factor that makes the peak 1.0 and multiply all of the samples by that same value. “Internally” Audacity uses floating-point audio where 1.0 represents 0 dB, so Audacity’s Normalize or Amplify effects can do that (but you don’t normally “see” those raw floating-point numerical values).
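Those two steps (find the peak, then multiply everything by one factor) look like this in a small Python sketch. It works on a plain list of floating-point samples; `normalize_peak` is a hypothetical helper name, not an Audacity function.

```python
def normalize_peak(samples, target=1.0):
    """Scale all samples by one factor so the largest |sample| equals target."""
    if not samples:
        return []
    # Step 1: read through the data to find the current peak.
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    # Step 2: one multiplication factor applied to every sample.
    gain = target / peak
    return [x * gain for x in samples]


if __name__ == "__main__":
    print(normalize_peak([0.1, -0.5, 0.25]))
```

Because every sample gets the same gain, the shape of the waveform is unchanged; only its overall level moves.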
I believe DC is 0 as long as the waveform is not ‘off-centre’
DC Offset is a little magic. You can have a profound inequality between up and down blue waves as long as everything settles to absolute zero when the performer shuts up. I can identify a news presenter just by the oddity of his waves. Neither he nor his broadcast are “broken.”
normalising the peak to 1
That’s not too difficult. Effect > Amplify > OK. Amplify’s default target is 0dB or 100%. It will keep boosting volume until something in your selection hits maximum. Note that it’s not going to surgically line up all your wave peaks. It’s not a compressor or non-linear processor. It just reaches over (digitally) and turns the volume up.
(I’m using Audacity 2.3.2)
You might be best on 2.4.2 so we’re all talking the same words.
I assume that means that it has been corrected for any “DC offset” (see: https://manual.audacityteam.org/man/dc_offset.html)
Normally that would not need doing unless the recording hardware is faulty, but it may be necessary with some audio downloaded from the Internet (there are a lot of very poor recordings on the Internet) or if recorded using a computer’s on-board sound card (these are frequently very poor quality).
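For illustration, one common way to set the DC level to zero is to subtract the average sample value from every sample. The Python sketch below assumes that is what the paper means; it is not taken from the paper or from Audacity's source, and `remove_dc_offset` is a made-up name.

```python
def remove_dc_offset(samples):
    """Centre the waveform on zero by subtracting its mean value."""
    if not samples:
        return []
    offset = sum(samples) / len(samples)  # the constant (DC) component
    return [x - offset for x in samples]


if __name__ == "__main__":
    # A wave that swings between 0.5 and 1.5 is "off-centre" by 1.0.
    print(remove_dc_offset([0.5, 1.5, 0.5, 1.5]))
```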
This means that they calculated the amplitude over time (the overall shape of the blue waveform in an Audacity track).
“Normalizing to 1” means that they amplified it so that the highest peak is the full track height (“full scale”).
I assume that they are using a scale of 0 to 1, like the vertical scale at the left end of an Audacity track. 0 = silence, 1 = full track height. Because they are only looking at the amplitude they are only using the half of the track that is above zero.
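On that 0 to 1 scale, the "gating" step can be read as comparing each envelope value with the two thresholds. The Python sketch below follows my reading of the quoted paragraph (below 0.1 is "low", above 0.2 is "high", anything in between is "mid"); the exact handling of values that sit right on a threshold is an assumption, and both function names are hypothetical.

```python
def classify_envelope(envelope, low_threshold=0.1, high_threshold=0.2):
    """Label each value of a peak-normalized envelope as low, mid or high."""
    labels = []
    for level in envelope:
        if level < low_threshold:
            labels.append("low")
        elif level > high_threshold:
            labels.append("high")
        else:
            labels.append("mid")
    return labels


def label_runs(labels):
    """Collapse per-sample labels into (label, start, end) runs, end exclusive."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


if __name__ == "__main__":
    env = [0.0, 0.05, 0.15, 0.5, 1.0]
    print(label_runs(classify_envelope(env)))
```

Dividing the start and end indices by the envelope's sample rate turns each run into a time interval in seconds.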
You can get the normalized envelope from a mono track in Audacity by running this code in the Nyquist Prompt effect (see: https://manual.audacityteam.org/man/nyquist_prompt.html)
Text that follows a semicolon “;” is a comment. Comments are ignored, but I’ve provided them to explain what each line does.
The very first comment tells Audacity to treat this code as “version 4” code.
;version 4
(setf interval 0.02) ;20 ms in seconds
;convert to samples
(setf step (round (* interval *sound-srate*)))
;follow the peak level in 20 ms intervals
(setf envelope (snd-avg *track* step step op-peak))
;get the absolute peak level
(setf peak (peak envelope ny:all))
;normalize so that the peak is 1
(setf envelope (mult (/ peak) envelope))
;force back to the original sample rate
(force-srate *sound-srate* envelope)
Thank you all so much for your helpful (and fast!) responses.
It looks like the first step is to download the 2.4.2 version of Audacity and then I’ll look at implementing what all of you guys have said with regards to the technical aspects.
Yep, that is indeed the paper… good detective work!
What we’re trying to do is not exactly the same though, as we don’t want to insert speech anywhere; we simply want to scramble the melody of the music so that it doesn’t sound predictable (we basically want it to sound like unpredictable ‘noise’). I first tried reversing the song but found that it was still fairly regular, so I’m trying to scramble the reversed version now. Before posting, I had a go at scrambling in a more simplified way: I segmented the track into regular intervals (350 ms, as stated in the paper), exported the segments as multiple files, then recombined them into a 6 s track in an order dictated by a random number sequence generator (so avoiding all the technical stuff to do with gating and the rest of it). Would you mind having a listen to the attached clips and letting me know what you think? Unfortunately, I don’t have any prizes if you can correctly guess the song!
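The simplified approach described above (cut into fixed 350 ms pieces, then reassemble in random order) can be sketched in a few lines of Python. This is only an illustration of that manual procedure, not the paper's method; `scramble_segments` and its `seed` parameter are hypothetical names, and the samples are assumed to be a plain list for one mono channel.

```python
import random


def scramble_segments(samples, sample_rate, segment_seconds=0.35, seed=None):
    """Cut samples into fixed-length segments and rejoin them in random order."""
    # 0.35 s at 44100 Hz gives 15435 samples per segment.
    segment_length = max(1, round(segment_seconds * sample_rate))
    segments = [
        samples[i:i + segment_length]
        for i in range(0, len(samples), segment_length)
    ]
    rng = random.Random(seed)  # a fixed seed makes the order repeatable
    rng.shuffle(segments)
    # Flatten the shuffled segments back into one list of samples.
    return [x for segment in segments for x in segment]


if __name__ == "__main__":
    ramp = list(range(100))
    print(scramble_segments(ramp, sample_rate=100, segment_seconds=0.1, seed=1))
```

Hard cuts like these can click where two segments meet, so a very short fade at each join may be worth trying if the clicks are distracting.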
If you want the result to be musical, the modifications should be synchronised to the beat of the music.
350 ms was chosen as a maximum because it approximately matches the length of the shortest words (it/we/my).
If your project has nothing to do with words you can ignore the 350 ms limit.
Do you need the scrambling to preserve the pitch? If not …