I have been working off and on at this. I am a narrator. My intention is to identify boundaries between speech segments in vocal tracks, classify the segments into broad classes such as vowels, pauses, stops, and sibilants, and then apply effects selectively to certain segments.
A larger ambition might be a one-pass combined speech-cleanup tool that could also eliminate many mouth crackles, by low-pass filtering selected regions, without muddying the results elsewhere. That matters more, I suppose, for unaccompanied voice than for vocal tracks generally. But meanwhile I think I have accomplished a passable de-esser.
Wheel reinvention perhaps, and I have little notion how other such software proceeds, but hey it’s educational to try naively.
1. Examine FFT data and calculate statistics that can identify boundaries between speech sounds.
2. Use those or other statistics to identify sounds as sibilant.
3. Apply certain effects selectively to the sibilants.
My provisional solutions (of course I continue experimenting):
Compute the spectral standard deviation, and place boundaries where the absolute value of its second derivative (or of the second derivative of its logarithm) exceeds some threshold; then refine the boundaries to zero crossings. (I have also tried the mean, the median, and other statistics. Any of them might be good enough for de-essing alone, given the right settings, but I was also trying to separate stops from vowels without too many extra boundaries, with mixed success across the various criteria.)
Identify a sound as sibilant when the average value of that statistic over the segment exceeds some threshold.
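The two steps above can be sketched as follows. The code I am actually writing is Nyquist; this is Python with numpy purely for illustration, every name is my own, and since the posts do not pin down the exact statistic, "spectral standard deviation" is interpreted here as the magnitude-weighted spread of the spectrum:

```python
import numpy as np

def spectral_spread(frames):
    """Magnitude-weighted standard deviation of frequency (in bins) for each
    frame: one way to quantify how dispersed a spectrum is, independent of level."""
    mags = np.abs(np.fft.rfft(frames, axis=1))
    bins = np.arange(mags.shape[1])
    total = mags.sum(axis=1)
    centroid = (mags * bins).sum(axis=1) / total
    dev = bins[None, :] - centroid[:, None]
    return np.sqrt((mags * dev ** 2).sum(axis=1) / total)

def boundaries(stat, threshold):
    """Frames where the |second difference| of the statistic exceeds threshold."""
    d2 = np.diff(stat, n=2)
    return np.flatnonzero(np.abs(d2) > threshold) + 1  # +1: centre of 3-frame stencil

# Toy "recording": 20 frames of noise (sibilant-like, dispersed spectrum)
# followed by 20 frames of a pure tone (vowel-like, concentrated spectrum).
rng = np.random.default_rng(0)
flen = 256
noise = rng.standard_normal((20, flen))
tone = np.tile(np.sin(2 * np.pi * 32 * np.arange(flen) / flen), (20, 1))
frames = np.vstack([noise, tone])

stat = spectral_spread(frames)
bounds = boundaries(stat, threshold=stat.std())  # spikes at the noise/tone join
sibilant_like = stat > 20.0                      # crude per-frame "S detector"
```

A real signal would still need the zero-crossing refinement and a threshold tuned per recording; the point is only that a dispersion statistic separates noisy frames from tonal ones, and its second difference spikes at the join.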
Do I simply de-amplify by some fixed factor? A more sophisticated approach (not yet tried) might make the factor a function of RMS, changing softer sounds less.
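That untried RMS-dependent factor could follow an ordinary compressor law: no change below a threshold, progressively more reduction above it. A hypothetical sketch (the threshold, the 4:1 ratio, and the name sibilant_gain are all my inventions):

```python
import numpy as np

def sibilant_gain(seg, thresh=0.1, ratio=4.0):
    """Gain factor for one sibilant segment. Segments whose RMS is below
    `thresh` pass unchanged; louder ones are turned down, and the louder
    they are the more they are turned down (a `ratio`:1 compressor law)."""
    rms = np.sqrt(np.mean(np.asarray(seg) ** 2))
    if rms <= thresh:
        return 1.0
    # Above threshold the output level rises at 1/ratio the input rate (in dB),
    # which works out to this power law on the linear amplitudes:
    return (thresh / rms) ** (1.0 - 1.0 / ratio)

soft = sibilant_gain(0.05 * np.ones(100))   # below threshold: untouched
loud = sibilant_gain(0.40 * np.ones(100))   # above threshold: reduced
```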
I also do this: identify the peak frequency in the spectrum and notch it; then repeat. That fixes the occasional s sound that comes out with a painful whistle somewhere between 5 and 8 kHz, which is evident when you look at the spectrogram. To my ears this treatment does not noticeably affect the quality of other sibilant sounds, so I apply it indiscriminately.
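The notch-the-peak-and-repeat idea can be illustrated crudely in the FFT domain. This Python sketch zeros a narrow band of bins rather than using a proper notch filter, and its parameters (band limits, notch width) are illustrative guesses, not values from this thread:

```python
import numpy as np

def notch_peaks(x, fs, n_notches=1, lo=5000.0, hi=8000.0, width_hz=100.0):
    """Find the strongest spectral peak between `lo` and `hi` Hz, zero a
    narrow band of FFT bins around it, and repeat `n_notches` times."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    in_band = (freqs >= lo) & (freqs <= hi)
    for _ in range(n_notches):
        peak = freqs[np.argmax(np.abs(spec) * in_band)]
        spec[np.abs(freqs - peak) <= width_hz / 2] = 0.0
    return np.fft.irfft(spec, n=len(x))

fs = 44100
t = np.arange(4410) / fs                      # 0.1 s
ess = np.sin(2 * np.pi * 6000 * t)            # painful whistle at 6 kHz
voice = 0.1 * np.sin(2 * np.pi * 1000 * t)    # stand-in for everything else
cleaned = notch_peaks(ess + voice, fs)        # whistle gone, 1 kHz untouched
```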
Also: perhaps crossfading the effect in and out would be desirable. I have written a bit of code to do that, but experience tells me I can get away without it, provided I solve part 1 precisely enough.
Another neat trick I am using is to return the difference between the fixed and original versions rather than the fixed version itself. Then I duplicate a track, fix one copy, and listen to the combination, or solo the original again; if I don't like a fix, I can fix the fix by silencing or fading part of the diff track. Then I mix down when all is done.
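The arithmetic behind that diff-track trick is simple but worth spelling out: the original plus the difference reconstructs the fixed version exactly, so silencing a region of the diff track reverts the fix in just that region. A toy numpy illustration (not the thread's code):

```python
import numpy as np

rng = np.random.default_rng(1)
original = rng.standard_normal(1000)
processed = 0.5 * original            # stand-in for any "fix"

diff = processed - original           # the effect returns this, not `processed`

# On a duplicate track, original + diff plays back as the fixed version.
# Silencing part of the diff track reverts the fix in just that region:
diff[:500] = 0.0
mixed = original + diff               # first half original, second half fixed
```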
Who is curious to share code and make suggestions? Or tell me not to waste time and just use this or that package.
Unfortunately that is often the way. I wish we were able to persuade people who download plug-ins to provide feedback, as it is a great help for plug-in developers.
I found your previous work very interesting, and it’s an area that I am interested in, though as I wrote, from a different perspective.
No harm in that - it can be very educational, and you may even come up with a "better wheel".
The way that most de-essers work is much simpler than that. Usually they are just dynamic compressors that operate on a fairly narrow high-frequency band. It will be interesting to see if your approach works better (a "better wheel").
One advantage of using a compressor as the basis of the processing is that the attack and release times, in effect, "fade" the effect in and out, so that precise alignment to zero crossings is unnecessary. However, I realise that you may want zero crossing detection for other aspects of your voice processing project.
What I think could be interesting would be to use your "sibilant detection" method, and then apply a more conventional "dynamic compression" to the detected sibilants. If the "S detector" works well, then potentially it could provide a de-esser that avoids the "dulling" of other sounds that can occur with simpler effects.
In order to operate on one specific frequency band, the audio is usually split into three parts by frequency. The low pass and high pass parts are left unprocessed, while the band pass part is processed through the compressor.
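That three-way split could be sketched as below. A real de-esser would use crossover filters (e.g. biquads); to keep this Python sketch self-contained it splits in the FFT domain instead, where the three bands sum back to the input exactly, and the "compression" is reduced to a fixed attenuation of the middle band:

```python
import numpy as np

def split_bands(x, fs, lo=2000.0, hi=8000.0):
    """Split x into low / mid / high bands that sum exactly back to x."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    low = np.where(freqs < lo, spec, 0)
    mid = np.where((freqs >= lo) & (freqs < hi), spec, 0)
    high = np.where(freqs >= hi, spec, 0)
    return tuple(np.fft.irfft(b, n=len(x)) for b in (low, mid, high))

fs = 44100
t = np.arange(4410) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

low, mid, high = split_bands(x, fs)
# Leave low and high untouched; process only the middle band
# (a fixed attenuation here, a compressor in a real de-esser):
deessed = low + 0.3 * mid + high
```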
I started working on a de-esser some time ago - I’ll have a look and dig out the code if you’re interested.
Sorry about the delay - it was a long time ago that I was working on a de-esser and it took me a while to find the stuff.
Attached is one of the files that I found. Note that this is experimental, badly written, and probably does not work very well but it demonstrates the basic idea of compressing a high frequency band. deesser.ny (2.48 KB)
I have not yet tried it, nor figured out every detail of the code, but I gather there are options either to apply the effect or first to see a graph of the control signal while experimenting with the frequency settings. The complicated thing I put on the PlugIns board earlier similarly had an option to display a non-sound, and a great many experimental dials. Its ultimate output was only labels, not an effect.
Surely your one page is less complicated than all of my stuff. I might discover that my calculations from FFT data are not worth it, but I don't really know yet. I am trying to graph curves whose levels or slopes bear some relation to what we would subjectively identify as the boundaries between speech sounds. Absolutely sharp divisions might not exist, yet sibilants at least seem well marked to my eyes in spectrogram view. They are noisy sounds with dispersed spectra, unlike vowels and stops. Spectral standard deviation (not yet in the version I shared earlier) is one way to quantify that dispersedness of the timbre, independently of the amplitude. Though f and th are the noisiest sounds, they do not tend to harshness. There must be some imprecision in my boundaries, since snd-fft must skip between windows if I am to have acceptable performance, but I can still catch the brief aspirations of t and k, which is good because I think those sounds often need treatment too.
I understand your method for applying the effect is to cut the sound into three bands with lowpass, highpass, and bandpass filters, make a control signal based on the amplitude of the middle band, de-amplify the middle band, then put the pieces back together. Do other de-essers do that? As I said, I was simply de-amplifying sibilant slices by a constant factor, but also identifying any whistling frequency and notching it. Some harsh esses have a white stripe in the spectrogram that you can simply see, and this works to eliminate it. In fact I wrote a standalone effect for just that fix on any selected region, and it was useful.
I also see that your default boundary between the modified middle band and the unchanged high frequency band is 8 kHz.
My productions via the Audiobook Creation Exchange ultimately get sold through Audible.
I noticed something about downloaded titles from Audible. If I play them and capture the waveforms in Audacity, the spectrograms look neatly truncated just at 8 kHz in the "4" (highest quality) format. But do the same with the free sample excerpts available with any title, and the cutoff is 10 kHz! Is either truncation perceptible to sharper, younger ears than mine?
If I highpass my own speech at 8 kHz with a severe 48 dB per octave rolloff, the discarded part is barely audible whistling to my slightly damaged ears. It sounds like the calls of waxwings. But does it make a subtle difference in the crispness of speech?
That was just one of the more readable parts of a series of experiments. With careful tuning I found that it could be very successful, especially with cases that had severe whistling sibilance. The major problem is to “tune it” correctly, but that is where your work with phonemes looks really interesting.
The ones that I’ve looked at use a similar approach, though it may be handled with FFT rather than biquad filters. The key part is the use of compression on the frequencies that need to be reduced. Using compression solves the problem of "fading" the effect in and out: a little "lookahead" smoothly reduces the gain of the required frequency band in time to catch the sibilance, then smoothly releases the gain back to unity as the sibilance passes.
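The attack/release/lookahead behaviour described above might be modelled like this: a one-pole gain smoother fed by a boolean detector signal, with the lookahead implemented by shifting the detector earlier in time (a toy Python model, not how any particular de-esser is actually written):

```python
import numpy as np

def smooth_gain(detector, lookahead, attack, release, floor=0.3):
    """Turn a boolean per-sample 'sibilance here' signal into a smooth gain
    curve. `attack`/`release` are one-pole smoothing coefficients in (0, 1),
    larger meaning slower; `lookahead` starts the gain reduction early."""
    early = np.roll(detector, -lookahead)
    early[-lookahead:] = False                  # don't wrap around the end
    target = np.where(early | detector, floor, 1.0)
    gain = np.empty(len(target))
    g = 1.0
    for i, tgt in enumerate(target):
        coeff = attack if tgt < g else release  # fall fast, recover slowly
        g = coeff * g + (1.0 - coeff) * tgt
        gain[i] = g
    return gain

detector = np.zeros(1000, dtype=bool)
detector[400:600] = True                        # pretend the S-detector fired here
gain = smooth_gain(detector, lookahead=50, attack=0.9, release=0.99)
```

The gain is already well on its way down before sample 400, stays near the floor through the detected region, and eases back to unity afterwards, so no hard edges or zero-crossing alignment are needed.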
Perhaps this veers into another topic, but anyway, here is the first phrase of Audible’s demo of four file formats, as captured by Audacity’s “Stereo Mix” (which seems to add a faint amount of noise) and as displayed with the defaults of a 256-sample Hanning window. What appears to be a bit of signal above the cutoff frequencies might only be a mathematical artifact of those display choices.
I see no difference between the first two, “AM Radio” and “FM Radio” qualities. Did Audible make a mistake?
I suppose the free samples are all in the highest quality, while the default download is format 4. I do not hear the difference between those two.
Do .mp3 files work by storing spectral data and playing back with an inverse FFT in real time? That is what these pictures suggest to me.
Curious, the technical guidelines at acx.com say this:
Format and Method
Audiobooks should be recorded in 16 bit / 44.1 kHz wav file format, which is considered CD quality and is best for archiving. Once you have fully produced your audio file it should be saved as a 192kbps mp3. This is the format that you will upload to ACX.
But what is distributed to customers in the best quality has a lower sample rate and bitrate than that.
I notice when I reimport .mp3 data into Audacity and look at spectrograms, I see a similar truncation of high frequencies, but the cutoff frequencies are higher and variable.
I don’t archive as .wav files when I can just keep my .aup projects.
Personally, I don’t like to archive as Audacity Projects - it’s a great format for Audacity to work in, but to me it seems rather precarious for archiving. Also, it is not an uncommon question on this forum: “How do I automate conversion of “big number” of projects into “…” format?” Answer: “sorry, but you can’t”.
Yes, I noticed that. I guess that they are using the 192kbps mp3 as their archive version. I don’t know why they don’t make that available for customers to purchase - I presume they think it is too big a download.
Because it’s not a single file: an Audacity project is a set of files - a project manager file (the one with the .aup extension) and a folder with lots of little sound clips - and some projects can additionally have references to external files (depending on your Preferences settings). Moving or renaming any of these can sometimes cause problems for the project unless you know exactly what you are doing.
In addition to waxcylinder’s response, RIFF and FLAC files are remarkably robust. Storage media is not infallible, and data corruption is a possibility (especially on optical media - the jury is still out on ageing flash media and SSD). If there is a small amount of corruption within WAV file audio data, it will often do no more harm than cause a slight glitch in the sound. If the WAV file header is corrupted then it is often possible to salvage the data by reading as RAW. On the other hand, just a small amount of data corruption to an Audacity project is quite likely to render the project unreadable, and then you are in the nightmare situation of trying to stitch together thousands of randomly numbered data fragments (Missing features - Audacity Support).
Regarding the right frequency bands for steve’s band-stop de-essing… I repeated the trick of capturing some sound from audiobooks in Audacity, to get a hint about what more professional producers than I are doing.
There is one audiobook whose story and performance I enjoyed a lot, though curiously I noted a distinct lack of good de-essing, at least early in the first chapter where one s was really painful.
But the curious thing I noticed in the spectrogram was that all the non-sibilant sounds seemed to have the frequencies between 5 and 7 kHz very attenuated, but not so the sibilants.
Was some studio using software that does something like what I am attempting, applying a certain equalization only to selected phonemes?
Curious that 5 kHz is the cutoff for Audible’s “AM Radio” quality files. Could it be that only sibilants suffer noticeably from that?
Perhaps the essing was worse in the original recording, and they tried to fix it by simply filtering out the band from 5 to 7 kHz, reducing the sibilance “a bit” but also substantially attenuating that frequency band where there was no sibilance?