Cross-correlation is a reasonable starting point, but it has its drawbacks.
Let’s take a simple example:
The search pattern is roughly the first 18000 samples, about 0.8 seconds at 22.05 kHz.
We will now search for it using convolution.
Convolution is the same as correlation when the pattern is reversed.
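To see why reversing the pattern makes the two operations equal, here is a short sketch. The symbols are only notation for this sketch and do not appear in the Nyquist code: p[0..N-1] is the pattern (N = 18000 here), and p~[m] = p[N-1-m] is its reversed copy.

```latex
\begin{align*}
(s * \tilde{p})[n] &= \sum_{k} s[k]\,\tilde{p}[n-k]
                    = \sum_{k} s[k]\,p[N-1-n+k] \\
                   &= \sum_{j=0}^{N-1} s[n-N+1+j]\,p[j]
\end{align*}
```

The last sum is the correlation of the pattern with the N samples of s that end at sample n. That is why the peak appears at the end of a match rather than at its start.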
Here is the snippet to do it:
;; Store the pattern (the first 18000 samples) as an array
(setf pattern (snd-samples s 18000))
;; Reverse the array in place
(do* ((i 0 (1+ i))
      (j (1- (length pattern)) (1- j))
      temp)
     ((or (= j (1- i)) (= i j)))
  (setf temp (aref pattern i))
  (setf (aref pattern i) (aref pattern j))
  (setf (aref pattern j) temp))
;; Back to a sound
(setf pattern (snd-from-array 0 *sound-srate* pattern))
;; Cross-correlation by convolution
(setf result (convolve s pattern))
;; The peak is way too high (thousands of times), so normalize
;; by the peak found in the first 36000 samples
(setf result (scale (/ 1.0 (peak result 36000)) result))
;; Once the threshold of 0.3 is detected,
;; the mask lasts for 18000 samples.
;; Since the peak is at the end of the pattern,
;; everything is shifted back by the pattern length
(setf mask (extract-abs (/ 18000 *sound-srate*) 3600
                        (snd-oneshot result 0.3 (/ 18000 *sound-srate*))))
;; Mask the original
(mult s mask)
Copy the code into the Nyquist Prompt and press OK.
(Of course, you have to import the sample file above first.)
The returned sound should now contain only the occurrences of “Audacity”.
The threshold of 0.3 is somewhat arbitrary.
It is probably something that should be set by a plug-in control.
Also, the marked regions do not contain only “Audacity”, because the word is sometimes spoken longer and sometimes shorter.
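As a rough sketch of what such a control could look like, the fragment below shows a plug-in header. The plug-in name, the action text, the variable name threshold and the 0.0 to 1.0 range are my assumptions for illustration. I use version 3 so that the track is still available as s, as in the code above. It is only a fragment: the rest of the code stays unchanged, and only the literal 0.3 is replaced by the control variable.

```lisp
;nyquist plug-in
;version 3
;type process
;name "Find Pattern..."
;action "Searching for pattern..."
;; Hypothetical control: creates the variable THRESHOLD (default 0.3)
;control threshold "Detection threshold" real "" 0.3 0.0 1.0

;; ... pattern extraction, reversal, convolution and normalization as before ...

;; Same mask as before, but using the control instead of the fixed 0.3
(setf mask (extract-abs (/ 18000 *sound-srate*) 3600
                        (snd-oneshot result threshold (/ 18000 *sound-srate*))))
```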
Thus a perfect pattern matching algorithm has to work in two dimensions.
The code above is admittedly slow because I have not averaged the samples first.
The more averaging we do, the worse the result gets. However, you can try downsampling, taking the RMS values, or whatever you want.