(convolve sound response)
Convolves two signals. The first can be any length, but the computation time per sample and the total space required are proportional to the length of response.

I presume the result has the sample rate of sound.

Does the first sample of the result hold the sum of the response values multiplied by the corresponding first samples of sound? Then each time in sound would correspond, in the result, to a weighted sum over that time and FOLLOWING times.

But it would be nicer to have each time correspond to a weighted sum CENTERED on that time.

Now, what Nyquist Prompt experiments can I devise to answer my question?

Here is one simple test for starters… it seems that yes, I must shift the start time of the answer. Look at where the 1 is in sound and where the 0 between the 1 and the -1 falls in the answer.

``````
(let* ((response (snd-from-array 0 44100 (vector 1 0 -1)))
       (sound (snd-from-array 0 44100 (vector 0 0 0 1 0 0 0)))
       (convolution (convolve sound response)))
  (print (snd-samples convolution ny:all))
  (print (snd-t0 convolution))
  nil)
``````

The behavior seems to be:

The result has as many samples as signal and response together; the last sample of the result is always zero. The start time of the result equals that of sound. The start time of response can be varied with no effect at all on the results.

At least that is so when signal and response have the same sample rate. I am not interested in making them differ.
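The observed behavior matches textbook convolution plus one padding sample. Here is a minimal sketch in pure Python (standing in for the Nyquist call; the arithmetic is the same) that shows both the n + m - 1 result length and the shift asked about above:

```python
# Direct (textbook) convolution: y[k] = sum_i x[i] * h[k - i].
# Convolving a 7-sample signal with a 3-sample response gives
# 7 + 3 - 1 = 9 samples, and the output is shifted, not centered.

def convolve_direct(x, h):
    n, m = len(x), len(h)
    y = [0.0] * (n + m - 1)
    for k in range(n + m - 1):
        for i in range(max(0, k - m + 1), min(k + 1, n)):
            y[k] += x[i] * h[k - i]
    return y

signal = [0, 0, 0, 1, 0, 0, 0]    # impulse at index 3
response = [1, 0, -1]

print(convolve_direct(signal, response))
# The impulse at index 3 produces a copy of the response starting
# at index 3: [0, 0, 0, 1, 0, -1, 0, 0, 0]
```

The impulse lands at the start of the response copy, which is why the start time of the answer has to be shifted back if you want each output time to be a weighted sum centered on the input time.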

It seems you’ve finally found the last sample from snd-inverse.
The length of a convolution is normally (n + m) - 1. The last sample is meaningless.
Aside from this flaw, the function is ordinary and does the right kind of calculation (right order, right weighting).
The response is limited to 100000 samples (like a table).
However, this takes ages to calculate.

It’s just computing a moving weighted sum of the samples of sound, right? Surely not worse than computing the FFT with a window equal in length to response, which computes much more?

Direct convolution is only faster for responses under about 64 taps. In all other cases, FFT is preferred.
The FFT must be taken of both signals, with the window length equal to (n + m) - 1 (zero-padded).
Convolution then equals multiplication of the two spectra, bin by bin.
For longer sequences, the STFT is used with a corresponding overlap.
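The zero-padding recipe can be sketched like this. A naive O(N²) DFT stands in for a real FFT (the algebra is identical), and the function names are my own illustration:

```python
# FFT-style convolution: zero-pad both sequences to length n + m - 1,
# transform, multiply bin by bin, transform back.
import cmath

def dft(x, inverse=False):
    N = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
               for n in range(N)) for k in range(N)]
    if inverse:
        out = [v / N for v in out]
    return out

def convolve_via_dft(x, h):
    L = len(x) + len(h) - 1           # pad to avoid circular wrap-around
    X = dft(list(x) + [0] * (L - len(x)))
    H = dft(list(h) + [0] * (L - len(h)))
    y = dft([a * b for a, b in zip(X, H)], inverse=True)
    return [round(v.real, 9) for v in y]

print(convolve_via_dft([0, 0, 0, 1, 0, 0, 0], [1, 0, -1]))
# [0, 0, 0, 1, 0, -1, 0, 0, 0] -- identical to direct convolution
```

Without the padding to (n + m) - 1, the multiplication of spectra would compute a circular convolution, with the tail wrapping back onto the head.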

I mean to use a response of perhaps 1024 samples.

Do you use it to identify certain sounds ([auto-]correlation), or for an FIR filter?

Did you look at my post about the experiments in speech segmentation, where I use an effect to draw graphs?

I want to try a convolution to get a smoothing of the first or second derivative of those curves and then try putting boundaries where that triggers certain positive and negative levels and see if it works well enough. I think I need a convolution to eliminate excessive boundaries.

I could make those curves smoother by first taking the FFT with a wider window, which will make the graph resolve the vertical scale more finely, so I get gentle slopes instead of stair-steps. But that may be too expensive in time, and this post-processing less so.

Why not a lowpass filter?
Or the simpler snd-avg, with an average over a few samples and an advance of 1 sample.
I believe I’ve posted a simple lowpass filter example in the convolution DSP topic in this subforum.
It also contains a window function (Hamming or Blackman, I don’t remember) that you’ve not implemented so far (along with your FFT experiment).
1024 taps is rather a long filter for smoothing; I think a tenth of that will work.
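As a quick illustration of the snd-avg suggestion (in Python rather than Nyquist, just to show the arithmetic): a moving average is convolution with a rectangular response whose taps sum to 1, advanced one sample at a time.

```python
def moving_average(x, width):
    # rectangular kernel whose taps sum to 1, advance of 1 sample;
    # output has len(x) - width + 1 points (no edge padding)
    return [sum(x[k:k + width]) / width for k in range(len(x) - width + 1)]

stair_steps = [0, 0, 4, 4, 4, 0, 0]
print(moving_average(stair_steps, 4))    # -> [2.0, 3.0, 3.0, 2.0]
```

The abrupt stair-step becomes a gentle ramp, which is exactly the smoothing effect wanted here, at the cost of only `width` additions per output sample.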

I studied some math decades ago but I’m not up on the dsp jargon.

By “taps” do you mean samples?

snd-avg sounds like the special case of convolution with a rectangular function. Perhaps I want the greater generality. It’s also true that the derivative of a convolution is convolution with the derivative of the response, so I want first to define a smooth response function that integrates to 1, then take its derivative, and convolve so I do two steps in one and save some time.
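That identity, that the derivative of a convolution equals convolution with the derivative of the response, can be checked numerically. A minimal sketch under my own assumptions (a Gaussian-shaped kernel normalized to sum to 1; none of this is the poster's actual code): differentiate the kernel once, then a single convolution yields a smoothed derivative in one pass.

```python
import math

def gaussian_kernel(sigma, radius):
    # smooth response whose taps sum to 1
    g = [math.exp(-0.5 * (t / sigma) ** 2) for t in range(-radius, radius + 1)]
    s = sum(g)
    return [v / s for v in g]

def kernel_derivative(g):
    # central-difference derivative of the kernel itself
    return [(g[i + 1] - g[i - 1]) / 2.0 for i in range(1, len(g) - 1)]

def convolve_full(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for k in range(len(y)):
        for i in range(max(0, k - len(h) + 1), min(k + 1, len(x))):
            y[k] += x[i] * h[k - i]
    return y

# one convolution with the differentiated kernel = smoothed derivative
step = [0.0] * 20 + [1.0] * 20            # a sharp boundary at sample 20
dg = kernel_derivative(gaussian_kernel(sigma=2.0, radius=6))
smoothed_slope = convolve_full(step, dg)
# the sharp edge becomes one gentle bump instead of a single spike
print(max(smoothed_slope))
```

The point of doing it this way is that the spiky raw derivative never has to be computed at all: smoothing and differentiation collapse into one pass over the signal.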

I don’t think a lowpass filter would be appropriate, because my signal is not really a sound but some strange thing derived from a sound by other steps. Does a lowpass filter work by means of FFT, operations on the spectrum, then inverse FFT anyway? Wouldn’t that be more calculation than I need?

Lowpass filters are used in all kinds of data gathering and analysis.
Your graphs don’t mean anything to me; I can’t see them, and therefore I can’t help you in this direction.

Well, I may need more education in the relevant math. I can describe the function I am computing from a sound:

Do an FFT for some choice of length, step, and window.
Find the “integral” of the power spectrum. That is, first divide the DC and Nyquist coefficients by 2, then form the partial sums of the sequence of squares.
Find the frequency at which those partial sums reach some fixed percentile, such as the median, of the total power; that frequency is a multiple of the reciprocal of the window length. Or take a weighted sum of several such percentiles.

Now that is a function with a sample rate equal to the sound’s sample rate divided by the FFT step.
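The steps above can be sketched in Python. The function name and framing are my own illustration (a naive DFT supplies the power spectrum; the real computation happens in Nyquist):

```python
import cmath

def spectral_percentile(frame, rate, pct=0.5):
    # power spectrum of one frame, DC and Nyquist halved, partial sums,
    # then the frequency where cumulative power reaches `pct` of the total
    n = len(frame)                        # assumed even
    spec = []
    for k in range(n // 2 + 1):           # real signal: bins 0 .. n/2
        c = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
        spec.append(abs(c) ** 2)
    spec[0] /= 2                          # halve DC ...
    spec[-1] /= 2                         # ... and Nyquist
    total = sum(spec)
    cum = 0.0
    for k, p in enumerate(spec):
        cum += p
        if cum >= pct * total:
            return k * rate / n           # bin index -> Hz
    return rate / 2

# a pure tone in bin 5 of a 64-sample frame at 6400 Hz
import math
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(spectral_percentile(frame, rate=6400))    # -> 500.0
```

For a pure tone all the power sits in one bin, so every percentile returns that bin’s frequency; for speech frames the quartiles spread out, which is what the plotted curves show.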

What I’m experimenting with is taking the log of that frequency-valued function, then taking either the first or second derivative, then taking the absolute value, then triggering where that crosses some threshold. The units of the threshold value would therefore be octaves per second, or octaves per second squared.

Maybe this is all bungling amateur pseudoscience, but some of my experiments suggest this may do a decent job of segmenting speech. But there are extraneous large derivative values when the frequency is lower, because the log frequency then makes larger discrete jumps and its derivative has repeated spikes, and so I get extraneous boundaries. I want to try some kind of smoothing. Making the FFT window wider increases the expense of the earlier part of the calculation. Applying a smoothing to my function, while keeping the FFT window as narrow as possible, might be better.

I hardly intend something that works in real time, but something that will be decent for examining one recorded phrase or sentence at a time.

So you want to build up a kind of power spectral density function.
But instead of equally spaced bins, you want one that is in octaves.
I believe 1/3 and 1/7 octave are quite common for sonograms.
The first part is clear so far; white noise should have a flat spectrum with the same energy at all frequencies.
I am not sure I understand the rest.
I gather the spectrum is a kind of histogram, and that’s where the integration comes in, i.e. to determine how much power (or its square root) is contained within a certain band.
I’ve seen that your plug-in gives values for 25, 50 and 75%. What do those actually mean for the graph? How much power is below that quantile, or what?
I imagine you want to end up with graphs (or at least the values) for the different formants of speech, in order to analyse them afterwards.
I’ve read an interesting article about the interpretation of such sonograms, where the differentiation between the phonemes is explained quite well.
But if your output is not similar, those instructions will not be of much use, I fear.

I think you misinterpret me. For each FFT frame, I find some fixed percentile of the power spectrum. No logarithms are involved there. That defines a frequency.

That frequency varies from frame to frame, defining a frequency-valued function of time. I’m trying to detect boundaries of phones by some criterion applied to the rate of change of that function. I take a logarithm of that function before differentiating once or twice, because I thought it a good idea to make the criterion independent of pitch level. So it’s changes of the musical step, not of the Hz, that I really test.

Then I take the absolute value, then trigger at crossings above a cutoff value. So the cutoff value has dimensions of log frequency per second, or per second squared. And instead of “log frequency” I can say “octaves” with an appropriate choice of units.
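The log-difference-threshold pipeline just described can be sketched in a few lines of Python. The function name, track values, and threshold are hypothetical, chosen only to show the mechanics:

```python
# log of the percentile-frequency track, first difference, absolute
# value, then a boundary wherever that crosses the threshold upward
import math

def boundaries(freqs, frame_rate, threshold):
    # log2 makes the criterion pitch-independent: units are octaves/second
    logf = [math.log2(f) for f in freqs]
    rate_of_change = [abs(b - a) * frame_rate for a, b in zip(logf, logf[1:])]
    marks = []
    prev_above = False
    for i, v in enumerate(rate_of_change):
        above = v > threshold
        if above and not prev_above:
            marks.append(i)               # upward crossing -> boundary
        prev_above = above
    return marks

track = [200, 200, 200, 400, 400, 400, 100, 100]   # Hz, hypothetical
print(boundaries(track, frame_rate=100, threshold=50.0))   # -> [2, 5]
```

An octave jump and a two-octave drop each trigger exactly one boundary, while the flat stretches trigger none; repeated spikes within one transition would collapse into a single upward crossing only after the smoothing discussed above.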

I also made a Nyquist “effect” that replaces a sound with a waveform that graphs my frequency-valued function, so I can trick Audacity into presenting it visually to me. I can examine that function to get a better idea how to devise my criterion. Labelling phones is my goal, and graphing the function is an aid to that goal.

In one of the examples I posted, I showed a sound waveform and in parallel tracks, the results of my graphing applied to three duplicates of the sound, with three different quartiles shown for illustration. I can choose any percentile I like for one graph, it’s a parameter of my plug-in. The three curves look similar for the example but suggest that the first quartile, say, might be better than the third for finding some sounds and worse for others. Some weighted combination of the percentiles might therefore be a useful function to plot instead, but I have not added that capability.

I think the proper linguist’s term is phone, not phoneme. When a native English speaker says “tot,” the two t’s are articulated differently, one aspirated, one not. They are different “phones.” Whether different phones are one phoneme depends on language. English does not contrast words by the aspiration of voiceless stops, but Hindi does. Hindi has a writing system that distinguishes those two t’s with different letters, but English has no need for such writing distinctions. So we say aspiration is a phonemic distinction for Hindi and not for English.

Ok, maybe phonemes are one step too far when dealing with the speech recording itself.
Your example “tot” has three phones but only two phonemes, namely exactly the same as in the German “Tod/tot”.
But the analysis does not yet involve the handling of different languages and the transcription into graphemes. So we will stick to phones for the time being.

I don’t think that I’ve misinterpreted you.
You essentially want different curves that describe the frequency change within certain bands or boundaries, ideally for each formant and for the remaining upper regions with the noisy/unvoiced phones (e.g. “sss”).
Modern analytic systems work with different filter banks (similar to loudness curves) to weight certain frequencies.
It is also not unusual to combine several short-time Fourier transforms of different window lengths (multi-rate).
There are a lot of possible ways to pursue this, such as cepstrum/liftering, LPC analysis, etc.

“tot” may actually have four “phones” if you count the aspiration of the first t as like a brief h. In fact my tool so far seems very good at segmenting out such brief aspirations. When I graph the percentile-of-power-spectrum function, then the sibilants, and aspirations which are rather like them, stand up like towers.

When a stop consonant is in intervocalic context, as in connected speech, “a tot” or “ago,” it seems the closure of the tongue to palate that precedes the aspiration also splits out pretty well from the vowel before. Sometimes.

Not all sibilants are equally white-noise-like. Looking at spectrograms, I see that f and th seem to have a more evenly spread spectrum, s has more concentration in 4-5 kHz, and sh has a concentration lower than that. German ch concentrates yet lower and is harder for me to separate, but I hope to apply this mostly to my own English speech.

Maybe you could modify your code so that it returns a sound that varies in frequency and amplitude, in order to make the result “accessible”.
I know, that’s quite selfish…
Maybe you can play the original sound in the left channel and the percentile in the right channel.
But that’s only a suggestion for the future, when the effect is more evolved.

The “graphs” are not much to listen to: weird crackly noise. They are ultimately only a computational middle step, and a visual aid to suggest better means to the end of better segmentation. I’d like to be able to neatly pick out my vowels and consonants without zooming in and out.

Next goal after that might be crude phone recognition, good enough to distinguish stops and sibilants from vowels, and to automate the application of certain clean-up procedures that I now do by hand as part of a labor-intensive cleanup of the crackles in my speech.

Certain phone boundaries appear to be strongly marked by this criterion and are easily labelled. Not bad, I dare say, for just a few days’ work, a few pages of code, and only a bit of superficial dabbling in the relevant math.