Remove verbal content but not voice qualities?


Using Audacity 2.0.5 on Windows, obtained via the .exe installer.

I’m conducting a research study (as a social psychologist) in which we want participants to watch and listen to people appearing in video (with audio) clips. In one condition, we want the participants to hear vocal qualities (e.g., female/male voice, pitch) but not be able to decipher what the person is saying. Essentially, we want to garble the content of speech but preserve the tempo and pitch in each video clip. In my field, this is referred to as content-filtering. Other researchers who have done this end up with audio clips that sound sort of like the Charlie Brown teacher.

I’ve been trying to figure out a relatively simple way to do this manipulation in Audacity, but I lack any audio engineering experience. I tried the Wahwah effect, but the result is too echo-y and you can still hear what the person is saying. Does anyone have any ideas? Or which effect should I adjust to achieve this?

I realize that we will not be able to preserve all voice qualities when we apply a filter but I’d appreciate any ideas to investigate. I spoke to a friend with audio engineering experience and he suggested that this is a fairly specialized request (as most folks are trying to remove voices altogether or improve the audio to make voices clearer).

Thanks in advance for any ideas,

You might try Effect > Vocoder. That tool combines two sounds and can give the “talking guitar” effect and other combinations. Try that with two people speaking the same words with the same inflection. The result should be completely unintelligible, but have the same vocal qualities. Also try the same voice twice with a delay (time shift tool — two sideways black arrows). I did that by accident recently and I got what you want. Somebody speaking with a sharp loss of intelligibility.
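The "same voice twice with a delay" idea can be sketched outside Audacity in a few lines of plain Python. This is only an illustration of the mixing step: the function name `mix_with_delay`, the `gain` parameter, and the idea of holding samples in a list are assumptions made for the example, not anything Audacity exposes, and the delay that actually destroys intelligibility has to be found by ear.

```python
def mix_with_delay(samples, delay, gain=0.5):
    """Mix a signal with a copy of itself shifted later by `delay` samples.

    `gain` scales the sum so two full-scale copies cannot clip.
    The result is `delay` samples longer than the input.
    """
    out = []
    for i in range(len(samples) + delay):
        dry = samples[i] if i < len(samples) else 0.0   # original track
        wet = samples[i - delay] if i >= delay else 0.0  # time-shifted copy
        out.append(gain * (dry + wet))
    return out


# A single click shows up twice, `delay` samples apart.
print(mix_with_delay([1.0, 0.0, 0.0, 0.0], delay=2))
# prints [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
```

In Audacity itself the equivalent is to duplicate the track, drag the copy with the Time Shift tool, and mix the two down.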

You could also split the sound into high and low frequencies and then recombine them off time.


That’s rather hard to do, at least if it should still sound like the original.
Here’s a simple example, based on partial reversion of the sound:

The window size is 16384 samples with 50 % overlap.
Here’s the code:

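;;; One frame (*fr* samples) of a cosine-shaped oscillator at FREQ Hz
;;; (sine table with a phase offset of -90 degrees)
(defun cosine (freq)
  (abs-env (hzosc (snd-pwl 0 *sr* (list 0 freq (1- *fr*) freq *fr*)) *table* -90)))
;;; Iterator class: each :next call returns one processed spectral frame
(setf snd (send class :new '(sound)))
(send snd :answer :isnew '(snd) '((setq sound snd)))
(send snd :answer :next '() '(
  (let ((temp (snd-fft sound *fr* *hp* *win*)))
        ;; snd-fft returns nil at the end of the sound
        (if temp
            (snd-samples (mult *phase* (snd-from-array 0 *sr* temp)) ny:all)))))
;;; Rebuild the sound from the modified frames
(defun scramble (sig)
  (snd-ifft 0 *sr* (send snd :new sig) *hp* *win*))
;;; Globals: sample rate, frame size, hop size
(psetq *sr* *sound-srate* *fr* 16384 *hp* 8192)
;;; Oscillator at half the sample rate: the sign alternates from sample to sample
(setf *phase* (cosine (/ *sr* 2.0)))
;;; Square root of a raised-cosine window, used for both FFT and inverse FFT
(setf *win* (s-sqrt (sum 0.5 (mult 0.5 (cosine (/ *sr* *fr*))))))
(multichan-expand #'scramble s)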

Copy the code to the Nyquist Prompt (Effect menu).
You can change the frame or hop size (fr, hp respectively).
I’ve not included a proper length correction (needed because of the FFT size). Whether this is necessary depends on how you eventually merge the audio and video content.
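For readers who want to see the idea without the FFT machinery, here is a rough time-domain sketch in plain Python. It assumes that the net effect of the Nyquist snippet is to reverse each windowed frame in time and overlap-add the frames again; that reading, the function names `sqrt_hann` and `scramble`, and the `reverse` switch are assumptions made for this illustration, and the output will not match the Nyquist code sample for sample.

```python
import math


def sqrt_hann(n):
    """Square root of a periodic Hann window of length n."""
    return [math.sqrt(0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
            for i in range(n)]


def scramble(samples, frame=16384, hop=8192, reverse=True):
    """Window each frame, optionally reverse it, and overlap-add the frames.

    The window is applied before and after the reversal, so with
    hop == frame / 2 and reverse=False the overlapping frames add
    back up to the original signal (except for the first `hop` samples).
    """
    win = sqrt_hann(frame)
    padded = list(samples) + [0.0] * frame      # room for the last frame
    out = [0.0] * len(padded)
    for start in range(0, len(samples), hop):
        chunk = [padded[start + i] * win[i] for i in range(frame)]
        if reverse:
            chunk.reverse()                     # play this frame backwards
        for i in range(frame):
            out[start + i] += chunk[i] * win[i]
    return out[:len(samples)]                   # trim to the input length
```

Trimming the result to the input length is one simple way to handle the length correction mentioned above; whether that is appropriate depends on how the audio is lined up with the video.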

Partial Reversion?

Sorry, that’s maybe a bad wording…
The sound is frame-wise reversed and re-mixed.
There are about 5 frames per second.
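The "about 5 frames per second" figure follows from the hop size. A small Python check, assuming a 44.1 kHz project rate (the snippet itself reads the rate from the selected track); `frame_stats` is a name made up for this example:

```python
def frame_stats(sample_rate, frame, hop):
    """Return (frames per second, fraction of each frame that overlaps the next)."""
    frames_per_second = sample_rate / hop
    overlap = 1.0 - hop / frame
    return frames_per_second, overlap


fps, overlap = frame_stats(44100, 16384, 8192)
print(round(fps, 2), overlap)
# prints 5.38 0.5
```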

Try it on this clip.

This clip was designed for left-right stereo testing, but you can use the first two segments and mix them down to plain mono for application of the effect. Or you can just use the third segment which is already two-channel mono. Beware the fourth segment which is intentionally damaged.


Koz, do you mean my code-example?
It gives this:

That should work. It went from a clear voice to a clear voice you can’t understand. I don’t have anything like the expression of a pro announcer, so if one of those is available, the effect should be even more pronounced.

If the frame size in that code is changed from 16384 to 1024 and hop size changed from 8192 to 256 it produces a laryngitis effect :slight_smile: : …

Would it be possible to have the hop size progressively increase or decrease throughout the selected audio, somewhat similar to “sliding timescale tempo shift”?
[If a lot of time and effort is required to do this, don’t bother; it’s just an idea for a novelty effect.]

Do you want to simulate a slow motion kick into the groin?
Should theoretically be possible.
The inverse FFT is a bit stubborn though. It does not allow a sliding hop size.
Perhaps with some pre-/postprocessing, e.g. non-linear resampling.
You could start with a sliding hop size in the fft call (i.e. time to frequency domain).
For this purpose, we introduce a different hop variable for the FFT
For instance:

(setf *s-hp*(quantize (abs-env (control-srate-abs 1 
   (pwlv 20  1000  2000 1000))) 1))
(snd-display *s-hp*)

The last line is only to display the first values of this “sliding hop variable”
s-hp has 1000 values ascending from 20 to 2000.
The new “Next” procedure would be:

(send snd :answer :next '() '(
  (let ((temp (snd-fft sound *fr* (truncate (snd-fetch *s-hp*)) *win*)))
        (if temp 
            (snd-samples (mult *phase*  (snd-from-array 0 *sr* temp)) ny:all)))))

The value 1000 is arbitrary, it should be calculated properly because it represents the number of frames that are taken from the input sound.
It is clear that it is easy to calculate the number of frames for a constant hop size, just number of samples divided by the constant. You can try to solve this problem for our sliding hop case.
It is probably somewhat the average + quantization error (since we need integer values).
It is a bit early for calculus… :smiley:
And zero padding and phase concerns are not treated yet, sigh.
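A back-of-the-envelope version of that calculation, sketched in plain Python rather than Nyquist. It assumes the hop size ramps linearly from the first to the last value, so the mean hop is simply the average of the two; the function name `sliding_hops` is hypothetical, and zero padding and phase are still ignored.

```python
def sliding_hops(total_samples, first_hop, last_hop):
    """Estimate the frame count for a linearly sliding hop size.

    With a linear ramp the mean hop is (first_hop + last_hop) / 2, so the
    frame count is roughly total_samples / mean_hop. The integer hops are
    rounded, so their sum can miss total_samples by up to half a sample
    per frame.
    """
    mean_hop = (first_hop + last_hop) / 2.0
    n_frames = max(2, round(total_samples / mean_hop))
    step = (last_hop - first_hop) / (n_frames - 1)
    hops = [round(first_hop + step * k) for k in range(n_frames)]
    return n_frames, hops


# A selection of 1,010,000 samples with hops sliding from 20 to 2000
# needs about 1000 frames.
n_frames, hops = sliding_hops(1010000, 20, 2000)
print(n_frames, hops[0], hops[-1])
# prints 1000 20 2000
```

So the arbitrary value of 1000 in the example would happen to fit a selection of roughly 1,010,000 samples; for any other length the count has to be recomputed.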

The tempo shouldn’t slow down; the sound should just become progressively more distorted, progressively more comb-like …

Hi Trebor,
“Laryngitis” as effect name doesn’t seem to be much of an ear catcher, at least not for us foreigners.
How about “Hangover Effect”? It is just how I speak after a week of carnival. :smiley:
As I thought, there are many problems for a variable implementation.
The first step that I described above works so far; however, the tempo changes accordingly (as expected).
Currently, the sliding time/pitch shift effect has to be used to correct this.
Here’s a comparison between normal pitch-shift, laryngitis and laryngitis down-shift (has a crying mood to it)

So, our aim is for the second effect, isn’t it?
I actually didn’t want to make my own time scaling algorithm…

Have you tried a simple cross fade? This does not sound too bad actually.
We could of course also process the sound with different hop-sizes and then blend from one to the other, similar to what you did manually, but with some overlap.
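The blending idea can be sketched as a plain linear cross-fade between two already processed versions of the same clip. This Python snippet is only an illustration: the name `crossfade` is made up, the two inputs are assumed to be sample lists that are already time-aligned, and a linear ramp is just one possible fade shape.

```python
def crossfade(a, b):
    """Blend from signal `a` to signal `b` over their common length.

    The output starts as pure `a` and ends as pure `b`; in between the
    two are mixed with linearly changing weights.
    """
    n = min(len(a), len(b))
    if n < 2:
        return list(a[:n])
    out = []
    for i in range(n):
        t = i / (n - 1)                 # 0.0 at the start, 1.0 at the end
        out.append((1.0 - t) * a[i] + t * b[i])
    return out


print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
# prints [1.0, 0.5, 0.0]
```

Here `a` and `b` would be the same selection processed with two different hop sizes.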

The laryngitis effect only occurs with a window size of 1024 and a hop size of around 256 at a sample rate of 44.1 kHz.

The sliding hop-size effect I had in mind would produce a range of effects over the selected audio, not just the laryngitis one.

If the hop size could be progressively varied, one could rapidly explore all the possible effects.

When the hop size is close to the window size the result is tremolo, almost chopped pulses.
When the hop size gets down below about a quarter of the window size, the result is very like a comb filter, or a flanger if the comb varies during the selected audio. With the larger window sizes there is an obvious echo effect.

So there’s a lot of effects potential just by changing two variables: the window and the hop sizes.
[OK, three variables if we include the sample rate.]

Initially I thought it would be a simple matter of introducing a for-next loop somewhere to progressively change the hop size, but seamlessly joining all the little bits is going to be difficult.

The concept was just a notion. Like I said, if it requires a lot of time and effort to produce a sliding hop-size version, don’t bother; I’ll just experiment with the existing code by inserting different values into it.

You could try this “Reverso” plug-in effect: Reverso