Speech segmentation (not recognition!)

Update 23 April 5 PM EDT: Here it is, with an explanation of the many controls.
Update 26 April 11 PM EDT: Fixed a performance problem. Now it should scale to selections of a few minutes in length.

I’m developing an experimental Nyquist plug-in to take recorded speech and put labels around consonants and vowels. Preliminary work shows promise. The tool will have a dialog with lots of sliders for tuning parameters. I haven’t discovered the best tunings.

The goal is only segmentation, not recognition: putting boundaries between speech sounds (“phones”), not identification of those sounds.

As a later goal I might implement CRUDE recognition distinguishing vowels, stops, sibilants, and pauses, and apply effects selectively to different segments. But first I need reliable segmentation.

Who’s curious to play with it too?

Try the default settings first on some speech, then read the explanations, then play around.

  • Action: Make labels, or draw curves that can help you choose better parameters. To draw curves, it is best to apply the effect to a muted duplicate of the track. A “sound” graphs the data for you; you can view it in linear scale with Waveform or logarithmic scale with Waveform (dB).

The next six controls determine the function that is computed.

  • Discard: lets you exclude high frequencies from consideration. This has a noticeable effect on the expensive inner loop of the computation, so discarding more also reduces the work done per frame.

Percentile: For each FFT frame, find the frequency below which this fixed fraction of the total power of the spectrum lies.
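
The per-frame computation can be sketched in plain Python (this is illustrative, not the plug-in's Nyquist code; the function name and the half-spectrum bin convention are my assumptions):

```python
def percentile_frequency(power, sample_rate, fraction):
    """Return the frequency below which `fraction` of the total power lies.
    `power` is one FFT frame's power spectrum (bins 0 .. N/2), so bin k
    corresponds to frequency k * sample_rate / (2 * (len(power) - 1))."""
    bin_hz = sample_rate / (2.0 * (len(power) - 1))
    target = fraction * sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:        # first bin where the running sum
            return k * bin_hz        # reaches the requested fraction
    return (len(power) - 1) * bin_hz
```

For example, a frame whose power is concentrated in a single bin returns that bin's frequency regardless of the fraction chosen.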

FFT window length, skip length, window type: Familiar to users of snd-fft. A longer window distinguishes finer gradations of frequency, but at the expense of less precise detection of changes in time.
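
The trade-off is just arithmetic: bin spacing is the sample rate divided by the window length, while the time step between frames is the skip length divided by the sample rate. A quick worked example (the specific numbers are mine, not the plug-in's defaults):

```python
sample_rate = 44100.0
window = 2048       # FFT window length in samples
skip = 512          # hop between successive frames in samples

freq_resolution = sample_rate / window   # Hz between adjacent FFT bins
time_step = skip / sample_rate           # seconds between frames

# Doubling `window` halves the bin spacing, but each frame then averages
# over twice as much signal, blurring the timing of spectral changes.
```

With these numbers the bins are about 21.5 Hz apart and frames arrive roughly every 11.6 ms.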

“Smoothing window,” if at least twice the skip length, applies a convolution after everything else is computed. It smooths the logarithm, that is, the curve as it appears in waveform-dB view, and its width can vary independently of the FFT window. (Increasing the FFT window increases the resolution of the vertical scale, reducing the problem of sudden steps in the low end of log-frequency, but it loses precise time resolution and adds computational expense; a simple post-processing convolution is a cheaper alternative.) This might remove extraneous boundaries that are detected with lower sensitivity thresholds.
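
In the simplest case the convolution is a centered moving average over the per-frame log values. A minimal Python sketch (my own simplification, assuming a rectangular kernel of odd width; the plug-in's actual kernel may differ):

```python
def smooth(values, width):
    """Centered moving average of `width` frames, shrinking the window
    near the ends so the output has the same length as the input."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out
```

An isolated spike gets spread across its neighbors, which is exactly what suppresses one-frame glitches that would otherwise trigger spurious boundaries.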

The next controls choose a criterion for finding boundaries in the function.

  • Derivatives: When finding boundaries or labels, use either the first derivative (rate of change) or the second derivative (rate of change of the rate of change) of the (smoothed) logarithm.

Threshold: Expressed in octaves per second (for one derivative) or octaves per second squared (for two). Find where the absolute value of the derivative makes rising crossings of this threshold. Much larger numbers are needed for useful thresholds with two derivatives.
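
A sketch of the derivative-and-threshold step in plain Python (again illustrative; I assume `curve` holds the smoothed log2-frequency per frame and `dt` is the frame spacing in seconds, with simple finite differences standing in for whatever the plug-in actually uses):

```python
def rising_crossings(curve, threshold, dt, order=1):
    """Frame indices where |derivative| first rises above `threshold`.
    `order` = 1 or 2 selects first or second finite difference."""
    deriv = curve
    for _ in range(order):
        deriv = [(b - a) / dt for a, b in zip(deriv, deriv[1:])]
    hits = []
    above = False
    for i, d in enumerate(deriv):
        if abs(d) >= threshold:
            if not above:          # only the rising edge counts,
                hits.append(i)     # not every frame above threshold
            above = True
        else:
            above = False
    return hits
```

A sustained swing above the threshold thus yields a single boundary, not one per frame.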

Multiples: when drawing Boundaries, draw lines of several heights to show triggerings at multiples of the threshold. Best viewed as Waveform. This indicates how strongly marked the boundaries are, and it can help in choosing a threshold: set a low value with many multiples, then see which level is just sensitive enough to catch the intended boundaries while avoiding extraneous ones.
Minimum length of labels: controls the discarding of boundaries that come too close together. Label boundaries are then refined to zero crossings. (The Boundaries view does not have zero-crossing refinement applied.)
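
One simple way to enforce such a minimum is a greedy left-to-right pass; this sketch is my guess at the behavior, not the plug-in's code, and it omits the zero-crossing refinement:

```python
def enforce_min_length(boundaries, min_len):
    """Keep each boundary (in seconds) only if it follows the previously
    kept one by at least `min_len` seconds."""
    kept = []
    for b in sorted(boundaries):
        if not kept or b - kept[-1] >= min_len:
            kept.append(b)
    return kept
```

So with a 100 ms minimum, a cluster of near-coincident detections collapses to its earliest member.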

Vertical scale: what frequency corresponds to 1.0 in Graph view.
Segmentation.ny (17.8 KB)
Segmentation.ny (17.4 KB)

I’m interested in “beat detection” for music, so we may have some cross-over of interest.

I believe there is already a “Beat Finder” in the Analyze menu, but I haven’t tried it or read its code.

I will update the original post of this thread with an attachment when I’ve cleaned the code up enough to share, which should be soon. I will likely update the attachment often. Older experimental versions may not be worth keeping up when the fixes are minor.