http://forum.audacityteam.org/viewtopic ... 42&t=72160
I have been working off and on at this. I am a narrator. My intention is to identify boundaries between speech segments in vocal tracks, then classify speech segments into broad classes like vowels, pauses, stops, sibilants, then apply effects selectively to segments.
A larger ambition might be a one-pass combined speech cleanup tool that could also eliminate many mouth crackles, without muddy results elsewhere, by selectively low-pass filtering regions. That is more important, I suppose, for unaccompanied voice than for all vocal tracks. But meanwhile I think I have accomplished a passable de-esser.
Wheel reinvention perhaps, and I have little notion how other such software proceeds, but hey it's educational to try naively.
Problems:
1) Examine FFT data and calculate certain statistics that can identify speech sound boundaries.
2) Use those or other statistics to identify sounds as sibilant.
3) Apply certain effects selectively to sibilants.
Current approach:
1) Compute the spectral standard deviation and make a boundary wherever the absolute value of its second derivative (or that of its logarithm) exceeds some threshold; then refine the boundaries to zero crossings. (I have also tried the mean, the median and other things. Any of them might be good enough for de-essing alone, given the right values, but I was also trying to separate stops from vowels without too many extra boundaries. Mixed success with the different criteria.)
2) Identify a segment as sibilant if the average value over it exceeds some threshold.
3) I simply de-amplify by some fixed factor. A more sophisticated approach (not yet tried) might make the factor a function of RMS, so that softer sounds are changed less.
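As a rough illustration of steps 1 to 3, here is a minimal sketch in Python with NumPy. It is a sketch under assumptions, not a tested tool: the function names, the frame and hop sizes, the Hann window, the rule for merging nearby boundaries, and every threshold are placeholders picked for the example. It also assumes that "the average value" in step 2 means the mean spectral standard deviation over the segment, and that the input is a mono float array at least one frame long.

```python
import numpy as np

def spectral_spread(x, sr, frame=1024, hop=256):
    """Per-frame spectral standard deviation (Hz) about the spectral centroid."""
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    n_frames = 1 + (len(x) - frame) // hop
    spread = np.empty(n_frames)
    for i in range(n_frames):
        power = np.abs(np.fft.rfft(x[i * hop:i * hop + frame] * window)) ** 2
        total = power.sum() + 1e-12          # avoid dividing by zero on silence
        centroid = (freqs * power).sum() / total
        spread[i] = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    return spread

def find_boundaries(spread, d2_threshold=0.5, min_gap=4):
    """Frame indices where |second difference of log(spread)| exceeds the threshold."""
    d2 = np.abs(np.diff(np.log(spread + 1e-9), n=2))   # d2[k] is centred on frame k + 1
    cuts = []
    for k in np.flatnonzero(d2 > d2_threshold) + 1:
        if cuts and k - cuts[-1] <= min_gap:
            if d2[k - 1] > d2[cuts[-1] - 1]:            # keep the stronger of two close hits
                cuts[-1] = int(k)
        else:
            cuts.append(int(k))
    return cuts

def snap_to_zero_crossing(x, pos, search=512):
    """Move a sample position to the closest sign change within +/- search samples."""
    lo, hi = max(pos - search, 1), min(pos + search, len(x) - 1)
    idx = np.arange(lo, hi)
    crossings = idx[np.signbit(x[idx - 1]) != np.signbit(x[idx])]
    if len(crossings) == 0:
        return pos                                       # nothing found: leave it alone
    return int(crossings[np.argmin(np.abs(crossings - pos))])

def deess(x, sr, spread_threshold=1000.0, gain=0.5, frame=1024, hop=256):
    """Segment x, flag segments with a high mean spread, and scale them by `gain`."""
    spread = spectral_spread(x, sr, frame, hop)
    edges = [0] + find_boundaries(spread) + [len(spread)]
    y = x.copy()
    segments = []
    for a, b in zip(edges[:-1], edges[1:]):
        start = 0 if a == 0 else snap_to_zero_crossing(x, a * hop + frame // 2)
        end = len(x) if b == len(spread) else snap_to_zero_crossing(x, b * hop + frame // 2)
        sibilant = bool(spread[a:b].mean() > spread_threshold)
        if sibilant:
            y[start:end] *= gain                         # fixed-factor de-amplification
        segments.append((start, end, sibilant))
    return y, segments
```

Calling `y, segments = deess(x, sr)` returns the processed audio plus a list of `(start, end, is_sibilant)` sample ranges, which is handy for checking the boundaries against a spectrogram. The thresholds would need tuning on real speech, and the gain here is a hard switch at each boundary.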
- I also do this: identify the peak frequency in the spectrum and notch it; then repeat. That fixes the occasional s sounds that come out with a painful whistle somewhere between 5 and 8 kHz, which is evident when you look at the spectrogram. To my ears, this treatment does not noticeably affect the quality of other sibilant sounds, so I apply it indiscriminately.
- Also: crossfading the effect might be desirable. I've written a bit to do that, but my experience tells me I can get away without it, provided I solve part 1 precisely enough.
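The two extras above (notch the spectral peak and repeat, and crossfade the effect in and out) could be sketched as follows in Python with NumPy. This is only an illustration under assumptions: the notch is done in the FFT domain with a Gaussian-shaped dip, which is one possible way to notch and not necessarily the one used here, and the names `notch_peaks` and `crossfade_effect`, the notch width and depth, the number of repeats, and the fade length are all made up for the example. The peak search covers the whole spectrum, so on real material it may be worth limiting it to the band where the whistle lives.

```python
import numpy as np

def notch_peaks(segment, sr, n_notches=2, width_hz=300.0, depth_db=-18.0):
    """Repeatedly find the strongest spectral peak and carve a dip around it."""
    spectrum = np.fft.rfft(segment)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    floor = 10.0 ** (depth_db / 20.0)            # linear gain at the notch centre
    peaks = []
    for _ in range(n_notches):
        peak = freqs[np.argmax(np.abs(spectrum))]
        # Gaussian dip: gain equals `floor` at the peak and tends to 1 away from it
        dip = np.exp(-0.5 * ((freqs - peak) / (width_hz / 2.0)) ** 2)
        spectrum = spectrum * (1.0 - (1.0 - floor) * dip)
        peaks.append(float(peak))
    return np.fft.irfft(spectrum, n=len(segment)), peaks

def crossfade_effect(dry, wet, start, end, fade=256):
    """Blend `wet` into `dry` over [start, end) using raised-cosine ramps."""
    mix = np.zeros(len(dry))
    mix[start:end] = 1.0
    fade = min(fade, (end - start) // 2)         # ramps must fit inside the region
    if fade > 0:
        ramp = 0.5 - 0.5 * np.cos(np.pi * (np.arange(fade) + 0.5) / fade)
        mix[start:start + fade] = ramp           # fade the effect in
        mix[end - fade:end] = ramp[::-1]         # and back out
    return (1.0 - mix) * dry + mix * wet
```

To combine them, one could copy the track, replace `wet[start:end]` with the output of `notch_peaks(dry[start:end], sr)`, and then call `crossfade_effect(dry, wet, start, end)`. If the same peak is still the strongest after the first pass, the second pass simply deepens that notch.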
Who is curious to share code and make suggestions? Or tell me not to waste time and just use this or that package.