Implementing a VAD in Audacity

mszlazak · June 6, 2009, 5:44pm

Currently, I have the fortune or misfortune of learning just enough about audio, audio file formats and voice activity detection (VAD) to try to create a VAD in Audacity.
I want to experiment with VADs on .wav or .mpg files and see how various algorithms perform.
The first VAD algorithm I’d like to test is described here:

http://figment.cse.usf.edu/~sfefilat/data/papers/WeBT5.3.pdf

Suggestions, pointers and help would be appreciated.

waxcylinder · June 6, 2009, 6:40pm

In Audacity 1.3 the developers have kindly added Sound Activated Recording.

Go to Edit > Preferences > Sound Activated Recording

WC

mszlazak · June 6, 2009, 11:04pm

Sound Activated Recording looks like it triggers on sound versus silence.
This is not what a VAD does. A VAD triggers on human voice versus non-voice, and it should work in low-SNR environments.

steve · June 7, 2009, 10:09am

Link not working.

How much programming experience do you have?

mszlazak · June 7, 2009, 3:01pm

FYI: Definition of VAD.

http://en.wikipedia.org/wiki/Voice_Activity_Detection

I checked the link in my first post today and it works, but here is the title of the paper in case you are having problems. This paper’s method looks simple.

“Combining Speech Energy and Edge Information for Fast and Efficient Voice Activity Detection in Noisy Environments”

My programming experience is minimal or hobbyist and for the past decade has been restricted mostly to Javascript.
None in C/C++, but I can learn it to create a VAD module that works with Audacity.

steve · June 7, 2009, 3:35pm

Tried the link again and it’s working now - must have just been a glitch.

Without a lot of C programming experience I think that you will be struggling to make any headway into building Voice Activity Detection directly into Audacity (as a means of starting / stopping recording).

However, if you are thinking more of non-realtime processing of recorded data for detecting areas of speech within a track, then I think you may have more success (though still very challenging). Audacity natively supports both LADSPA and Nyquist plug-ins. Without C/C++ experience, the Nyquist programming language is by far the easier. Nyquist is based on XLisp, and apart from the over-abundance of parentheses (you will need a text editor with parentheses matching) is a relatively easy language.

You can find information about Nyquist here: http://audacityteam.org/wiki/index.php?title=Nyquist_Basics:_The_Audacity_Nyquist_Prompt

mszlazak · June 7, 2009, 5:50pm

Yes, it would be for non-realtime processing.
I’d like to display the before and after VAD waveforms.
I’ll check out Nyquist.
Thanks for the tip Steve.

steve · June 7, 2009, 6:47pm

That’s the easy bit. You just select the track before processing and press Ctrl+D . That will create a duplicate track. Process the copy and not the original and you have before and after.

mszlazak · June 7, 2009, 11:13pm

Steve, I appreciate the pointers and help.
I have a comment on the Nyquist example given in the page. It seems to me that an audio demo could have been used instead of the text “Hello.”
Instead, I went here, http://audacityteam.org/help/nyquist, then generated a tone and tried transforming it with the following inside the Nyquist Prompt.
(mult (ramp) s)

Second, is my .ny file in the Plug-Ins folder supposed to show up in Audacity’s “Effect” drop down list?
I saved the following .ny file but “My Fade In” doesn’t show up on the drop down list.

;nyquist plug-in
;version 1
;type process
;name "My Fade In"
;action "My Fading In..."
(mult (ramp) s)

Thanks again.

steve · June 8, 2009, 11:14pm

I think that Audacity is fussy about file names.
Rather than naming the file “My Fade-in.ny” name it “myfadein.ny”

Also, you need to restart Audacity for new plug-ins to be found.

Other than that, your script should work.

Alonshow · August 4, 2016, 11:00pm

I know this is a very old thread, but I’d like to know if you had any success with your project. It is just what I am looking for, it would save me an awful lot of time.

kozikowski · August 5, 2016, 12:14am

I show no account activity since 2009.

Did you try any of those programming links or Nyquist tools? We could use a few good Nyquist programmers.

I can guarantee your popularity simply by being able to code in Nyquist.

Let us know.

Koz

Alonshow · August 5, 2016, 3:17am

Really, will I get laid and all?

Seriously, I’m not sure if what you propose is very realistic. Off the top of my head it requires me to learn the Nyquist language, its development tools and environment, the basics of the Audacity code architecture, the VAD theory, and a suitable algorithm. Only then would I be able to start programming a plugin, with all the coding, testing and management that involves. The whole thing sounds like it would take months, maybe years. It seems more sensible to look for a program that is already created. I know that such programs exist, what I don’t know is whether they are available to the public, since VAD is mainly useful for big companies.

Robert_J_H · August 5, 2016, 12:46pm

It’s not as bad as that…
Nyquist is an interpreted language, i.e. the source code is at the same time the execution code–a plain text file or the content of the Nyquist prompt.

I’m pretty sure that I could write a VAD plug-in in a couple of hours.
However, it will be an offline algorithm and not real-time.
Do you have any special requirements?

Robert

Alonshow · August 6, 2016, 1:02am

Wow, that’s so generous! I don’t want to abuse your generosity. As far as I know, only two Audacity users have ever expressed interest in this, and the other one hasn’t been active in seven years. Still, I answer your question in case you want to do it anyway:

I don’t think so. I don’t need it to be real time, I just want to process my recordings, so offline is fine. Some of the recordings have only my own voice, some have the voices of several people. Some have a lot of noise in the background, some have a silent background. I’ve used different recorders, so I have several formats, including wma, mp3, amr, and aac. But any format would do, of course, because I can always convert between formats.

Needless to say, if you decide to do it I would be happy to assist you in any way I can. In any case, thank you for your interest!

androclus · August 19, 2016, 12:02am

okay, #3 here.

i record a brilliant lecturer, and post the lectures/dialogues online for free.

but unfortunately the surroundings are less than ideal (refrigerators, chimes, birds, airplanes, garbage trucks and street sweepers, coffee pot, etc. etc. etc.)

there are obvious recording strategies i have taken (better mikes – especially dynamic – and placed closer, etc.)

but then in editing the recording in audacity, to clean up and boost the signal-noise ratio for listeners (who will often be listening in their cars, without headphones, and in other less-than-ideal listening environments such as coffee shops), i also often use effects such as dynamic compression (the 3rd party one detailed at https://theaudacitytopodcast.com/chriss-dynamic-compressor-plugin-for-audacity/), noise reduction, low-cut / high-pass filters, and even simple de/amplification.

however, as far as i can tell, these tools are all based in various ways on amplitude and frequency. i would like something that would (again, not in real-time) simply reduce to 0 amplitude (silence) all sections which did not have voice detected. THEN, once that was done, any effects/filters which i applied (like those listed above) would obviously work MUCH better, because all the intervening junk (between voice segments) would be gone. (of course, the junk is still there DURING the speech segments too, but that is a different issue, and i can deal with it).

i myself had thought of programming something in nyquist (i do love the elegance of lisp’s, and emacs can make the matching parens colored), but i am super busy with tons of other projects. but it does sound (if i could find the time) like it would be a great learning experience, and a wonderful way to learn about audio. but then again, if someone programmed a Nyquist VAD already, i wouldn’t complain.

please let me know if anyone is still working on one.

steve · August 19, 2016, 12:48pm

That’s the hard part. Your computer has no idea whether the audio data is a voice, or a TV, or car horn, or probably even a spreadsheet. All it sees is “data”.
Assuming that the data is a valid audio signal, we can analyze certain properties quite easily. Peak amplitude is one of the easiest to detect. Approximation of the frequency spectrum is more difficult but possible. Automatically detecting whether a voice is a “live recording” or a TV show is virtually impossible.

A simple approach is to use a Noise Gate. This operates on peak level, so the assumption is that if the peak level is above a specified threshold, then the voice is present.
There is a Nyquist Noise Gate available here: Missing features - Audacity Support
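To make the peak-level idea concrete, here is a minimal sketch in Python rather than Nyquist. It is not the code of the Noise Gate plug-in mentioned above; the function name `noise_gate`, the 20 ms frame length and the 0.05 threshold are all assumptions chosen for illustration. It silences every frame whose peak level stays below the threshold.

```python
def noise_gate(samples, sample_rate, threshold=0.05, frame_ms=20):
    """Zero every frame whose peak absolute level is below `threshold`.

    `samples` is a list of floats in the range -1.0..1.0.
    The threshold and frame length are illustrative guesses, not tuned values.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    out = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        peak = max(abs(x) for x in frame)
        if peak >= threshold:
            out.extend(frame)                 # loud enough: keep as-is
        else:
            out.extend([0.0] * len(frame))    # too quiet: replace with silence
    return out
```

A real gate would normally add attack, hold and release times so that the quiet beginnings and endings of words are not clipped; this sketch leaves those out.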

This Noise Gate could be modified to better identify voices by pre-filtering the audio so as to reduce frequencies that are outside of the (main) range of voices, for example, with a 300 to 3000 Hz band-pass filter.

(highpass2 (lowpass2 signal 3000) 300)  ;"signal" is the audio to be filtered.
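For anyone who wants to try the pre-filtering idea outside Audacity, the following is a rough Python sketch. Note the swap: Nyquist's highpass2 and lowpass2 are second-order filters, whereas this sketch uses simple first-order (RC-style) filters, so the roll-off is gentler. The function names `lowpass`, `highpass` and `voice_band` are made up for this example.

```python
import math

def lowpass(samples, sample_rate, cutoff_hz):
    """First-order low-pass filter (a simplification of Nyquist's lowpass2)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

def highpass(samples, sample_rate, cutoff_hz):
    """First-order high-pass filter (a simplification of Nyquist's highpass2)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, y, prev_x = [], 0.0, 0.0
    for x in samples:
        y = alpha * (y + x - prev_x)
        prev_x = x
        out.append(y)
    return out

def voice_band(samples, sample_rate, low_hz=300.0, high_hz=3000.0):
    """Keep roughly the 300 to 3000 Hz band before level detection."""
    return highpass(lowpass(samples, sample_rate, high_hz), sample_rate, low_hz)
```

The filtered signal would only be used for deciding where the gate opens; the gate itself should still be applied to the original, unfiltered audio.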

Robert_J_H · August 19, 2016, 2:54pm

Sorry for not replying back (to Alonshow).
I haven’t forgotten the project but as you (androclus) say, we are all busy in one or another way.

I’ve accumulated some code snippets in order to extract some audio features, such as:

zero crossing rate
fundamental frequency
RMS/Peak/Crest
linear prediction error
Spectral features

It might be worthwhile to follow an established standard, such as ITU-T G.729 Annex B (if I don’t err), at least as one algorithm choice.

The spectrum of possible algorithms is very wide, from simple energy/ZCR processing to something that is almost speaker recognition.
More sophisticated algorithms do often require sample data (with all segments properly labelled as voiced/unvoiced/noise/silence) and training.
With or without that, finding the proper threshold for the feature vectors is the crucial part of any VAD.
Robert
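Two of the features listed above, zero crossing rate and RMS/Peak/Crest, are cheap to compute per frame. The following Python sketch is not Robert's code; the function name `frame_features` and the dictionary layout are assumptions made for illustration. It deliberately contains no thresholds, since choosing those is the hard part.

```python
import math

def frame_features(frame):
    """Return simple per-frame features for a non-empty list of float samples."""
    n = len(frame)
    # Count sign changes between neighbouring samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1) if n > 1 else 0.0
    rms = math.sqrt(sum(x * x for x in frame) / n)
    peak = max(abs(x) for x in frame)
    # Crest factor: peak relative to RMS (left at 0.0 for a silent frame).
    crest = peak / rms if rms > 0 else 0.0
    return {"zcr": zcr, "rms": rms, "peak": peak, "crest": crest}
```

A simple energy/ZCR detector would compute these for consecutive frames of, say, 10 to 30 ms and compare them against thresholds estimated from a stretch of the recording known to contain only background noise.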

Alonshow · August 20, 2016, 11:10pm

I created another thread in the Adobe Audition forums in the hope that Audition could provide something similar to what we are looking for: https://forums.adobe.com/message/8951906. Unfortunately, it doesn’t. Still, I got some interesting replies. I’ll try to summarize them (and hope I got them right):

This is an easy operation when the signal to noise ratio is high. Both Audacity and Audition provide simple tools which will detect the signals with an amplitude above a certain level. One of those tools is called Noise Gate.
Unfortunately that solution doesn’t work when the levels of the noise are similar to the levels of the voice. In that case, the operation becomes much more complex. Still, plenty of software exists that performs this kind of operation. However, it’s not clear whether that kind of software exists in a form that we can use for the purpose described here, i.e., processing an audio file and providing an output with the sections of that audio file that contain speech.
An example of that kind of software would be CMU Sphinx: http://cmusphinx.sourceforge.net/. This is an open source toolkit for speech recognition developed at Carnegie Mellon University. It seems to provide, among many other things, the kind of functionality we’re looking for. I have asked in their forum about the possibility of using it, but the reply I received and the information I have found so far are way beyond my very limited skills: https://sourceforge.net/p/cmusphinx/discussion/help/thread/e31404b4/.
There are several public papers that discuss how this kind of software operates, for example: http://www.ece.umd.edu/merit/archives/merit2011/merit_fair11_reports/report_Kola.pdf. These papers only provide the theory, though, not a practical implementation.

It looks like the operation of VAD depends heavily on the kind of audio being processed. I have created a sample recording which might help in identifying what I’m looking for. It’s a 14-minute-long recording of a person who talks in his sleep. In this sample he talks for the first 8-9 seconds of the recording. The rest of the recording is noise, which in some parts is louder than the voice segment. The spectrogram view of the first few seconds shows a clearly recognizable pattern of human voice:
Voice spectrogram.png
On the other hand, the spectrogram of one of the loud noise segments shows a very different pattern (or lack thereof):
Noise spectrogram.png
Hope this helps.
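One possible way to put a number on that kind of difference, assuming the voice pattern comes mainly from the regular pitch of voiced speech, is a periodicity score based on normalized autocorrelation. This is only a Python sketch under that assumption, it has not been tried on the recording above, and the name `periodicity` and the 80 to 400 Hz pitch range are illustrative choices.

```python
def periodicity(frame, sample_rate, min_hz=80, max_hz=400):
    """Return the highest normalized autocorrelation within the pitch lag range.

    Values near 1.0 suggest a strongly periodic (pitched) frame,
    values near 0.0 suggest a noise-like frame.
    """
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, corr / energy)
    return best
```

Because it is normalized by the frame's own energy, the score does not depend on loudness, which matters when the noise is louder than the voice. Periodic noises such as hum or chimes would also score high, so it could only ever be one feature among several.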