Understanding yin

Paul_L · April 19, 2013, 3:08pm

The next thing I’m trying out is yin. The following seems to work to estimate the fundamental frequency of something a few cycles in length. But I am not sure of the best inputs for yin. I gave it the bottom and top notes of the piano, but what’s best for the last argument? Is the last argument a number of samples or do I misunderstand?

How few cycles can I select and still get good answers?

I want to develop it into a tool that will label a whole number of cycles of a spoken vowel. Ever looked at the waveform of “ah”? You can have a dozen zero crossings per cycle! So if I select a piece of a vowel and hit Z, it doesn’t always give me a whole number of cycles. Hit shift-space, and you don’t always hear a smooth sound or the same note as was spoken.

This is why I’ve wished for Nyquist plug-ins that could make Audacity simply adjust the selection boundaries for you. Making labels that you have to delete later is not quite so convenient.

;nyquist plug-in
;version 1
;type analyze
;categories "http://lv2plug.in/ns/lv2core#AnalyserPlugin"
;name "Label Cycles..."
;action "Labelling..."

;control debug-me "Debug Me" int "" 10 1 10000

(setq twelfth-root-2 (expt 2.0 (/ 12.0)))
(setq note-0 (/ 440 (expt twelfth-root-2 69)))

;;; takes the output of yin, returns Hz or nil
(defun find-best-frequency (y)
  (let* ((notes (aref y 0))
	 (confidences (aref y 1))
	 best-confidence
	 best-note)
    (do ((note (snd-fetch notes) (snd-fetch notes))
	 (confidence (snd-fetch confidences) (snd-fetch confidences)))
	((or (not note) (not confidence)))
      ;; beware the IND values that sometimes appear in confidences;
      ;; plusp should screen them out
      (if (plusp confidence)
	  (if (or (not best-confidence) (< confidence best-confidence))
	      (setq best-confidence confidence
		    best-note note))))
    (and best-note
	 ;; convert to Hz
	 (* note-0 (expt twelfth-root-2 best-note)))))

(defun label-cycles (snd)
  (let* ((len (snd-length snd ny:all))
	 (y (yin snd 21 108 (/ len debug-me)))
	 (freq (find-best-frequency y)))
    freq))

(label-cycles s)
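
As a quick sanity check on the step-to-Hz math above, here is a minimal sketch that prints the hand-rolled conversion next to Nyquist's built-in step-to-hz for the bottom piano note, A4, and the top piano note. It assumes it is run from the Nyquist Prompt with debug output visible; my-step-to-hz is a made-up helper name, not part of the plug-in.

```lisp
;; Same constants as in the plug-in above.
(setq twelfth-root-2 (expt 2.0 (/ 12.0)))
(setq note-0 (/ 440 (expt twelfth-root-2 69)))

;; my-step-to-hz is a hypothetical name used only in this sketch.
(defun my-step-to-hz (step)
  (* note-0 (expt twelfth-root-2 step)))

;; Print both conversions side by side for steps 21, 69 and 108.
(dolist (step '(21 69 108))
  (format t "step ~A: ~A Hz (hand-rolled), ~A Hz (step-to-hz)~%"
          step (my-step-to-hz step) (step-to-hz step)))

"Done; see the debug output."
```

If the two columns agree to within rounding, the hand-rolled constants can be dropped in favor of step-to-hz.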

steve · April 19, 2013, 4:04pm

It depends on how periodic the waveform is. For highly periodic waveforms, 10 or so cycles will probably be enough. For waveforms that are less periodic you are likely to need more.

The step size is in samples. You may need to experiment for optimum results, but I found that about 4x the lowest frequency worked well.
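
One possible reading of that rule of thumb, and it is only an assumption, is "about four periods of the lowest frequency, counted in samples". A rough sketch of that arithmetic follows; the sample rate of 44100 Hz is assumed for illustration, and all variable names are invented for this example.

```lisp
;; Assumed values for illustration; in a plug-in, srate would come
;; from (snd-srate snd).
(setq srate 44100.0)
(setq low-step 21)                      ; bottom note of the piano
(setq low-hz (step-to-hz low-step))     ; roughly 27.5 Hz
(setq period-samples (/ srate low-hz))  ; samples in one cycle at low-hz

;; Four periods of the lowest frequency, rounded down to whole samples.
(setq stepsize (truncate (* 4 period-samples)))
(print stepsize)                        ; several thousand samples
```

With a step that large, a selection only a few vowel cycles long may be shorter than one step, so raising the lowest step (for speech, something well above 21) is probably worth experimenting with.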

Paul_L · April 19, 2013, 4:07pm

Does the rest of my math look good?

Paul_L · April 20, 2013, 2:58am

Here is a fairly final draft. It seems to work to pick zero crossings selectively so I get cycles of my voiced speech sounds. I also developed my technique of zero-crossing-finding using an array obtained from a sound made by snd-inverse.

;nyquist plug-in
;version 1
;type analyze
;categories "http://lv2plug.in/ns/lv2core#AnalyserPlugin"
;name "Label Cycles..."
;action "Labelling..."

;control debug-me "Debug Me" int "" 10 1 10000

(defun make-zero-crossing-finder (snd)
  (let* ((srate (snd-srate snd))
	 (blip (snd-from-array 0 srate (vector srate)))
	 (blips (trigger snd blip))
	 (count-of-crossings-snd (integrate blips))
	 (inv-snd (snd-inverse count-of-crossings-snd 0 1))
	 (inv-array (snd-fetch-array inv-snd (snd-length inv-snd ny:all) 1))
	 (inv-array-len (length inv-array))
	 (count-of-crossings-array
	  (snd-fetch-array (snd-copy count-of-crossings-snd)
			   (snd-length count-of-crossings-snd ny:all) 1))
	 (count-of-crossings-len (length count-of-crossings-array)))
    ;; global time
    ;; find the nearest crossing to either side
    ;; or return nil if out of bounds
    #'(lambda (time)
	(let ((index (truncate (* time srate))))
	  (and (>= index 0)
	       (< index count-of-crossings-len)
	       (let ((num (truncate
			   (aref count-of-crossings-array index))))
		 (and (>= num 0)
		      (< num inv-array-len)
		      (let*
			  ((t2 (aref inv-array num))
			   (t1 (if (plusp num)
				   (aref inv-array (1- num))
				   t2))
			   (diff2 (abs (- time t2)))
			   (diff1 (abs (- time t1))))
			(if (< diff1 diff2) t1 t2)))))))))

;;; takes the output of yin, returns Hz or nil
(defun find-best-frequency (y)
  (let* ((notes (aref y 0))
	 (confidences (aref y 1))
	 best-confidence
	 best-note)
    (do ((note (snd-fetch notes) (snd-fetch notes))
	 (confidence (snd-fetch confidences) (snd-fetch confidences)))
	((or (not note) (not confidence)))
      ;; beware the IND values sometimes in confidences,
      ;;plusp should screen them out
      (if (plusp confidence)
	  (if (or (not best-confidence) (< confidence best-confidence))
	      (setq best-confidence confidence
		    best-note note))))
    (and best-note (step-to-hz best-note))))

(defun label-cycles (snd)
  (let* ((srate (snd-srate snd))
	 (len (snd-length snd ny:all))
	 (seconds (/ len srate))
	 (y (yin snd 21 108 (/ len debug-me)))
	 (freq (find-best-frequency y))
	 (text ""))
    ;;(text (format nil "~A Hz" freq)))
    (if (not freq)
	"Can't find the fundamental"
	(let*
	    ((period (/ freq))
	     (finder (make-zero-crossing-finder snd))
	     (prev-zero (funcall finder 0.0))
	     results)
	  (if (numberp prev-zero)
	      (do* ((next-zero (funcall finder (+ prev-zero period))
			       (funcall finder (+ prev-zero period))))
		  ((not (and (numberp next-zero) (< prev-zero next-zero))))
		(setq results (cons (list prev-zero next-zero text) results)
		      prev-zero next-zero)))
	  ;; what to do at the right boundary, if it rises to zero and
	  ;; does not cross?  Make one more label?
	  (if (and (numberp prev-zero)
		   ;; improve this criterion for good-enough?
		   (> (- seconds prev-zero) (* .9 period)))
	     (setq results (cons (list prev-zero seconds text) results)))
	  results))))

(or
 (label-cycles s)
 "Can't find zeroes")

steve · April 20, 2013, 2:38pm

Seems OK to me.

Paul_L · April 21, 2013, 4:20pm

What I learned today: You can feed yin one of your vowels, but don’t give it a sibilant! Even with a stepsize parameter that works well with a vowel of similar length, yin given a sibilant starts spinning the CPU and eating memory. I walked away for a minute and came back and saw “Nyquist did not return audio.”

A caution for me if my crude “speech recognition” goes anywhere. Be sure I can identify the noisier speech segments by other means first.

steve · April 25, 2013, 3:25am

Did you mean “INF” (rather than “IND”)?
If so, then you need to be careful with “plusp” because it only catches -inf and not +inf.
If you get the same results on Windows (see below), then checking for equality looks like the best bet.

(setq test (linear-to-db 0))
(print test) ; returns -inf on Linux

(if (numberp test) ; returns "Is number" 
  (print "Is number")
  (print "Is not number"))

(if (= test test) ; returns "Is not number"
  (print "Is number")
  (print "Is not number"))

(if (plusp test) ; returns "Is not number"
  (print "Is number")
  (print "Is not number"))

(if (minusp test) ; returns "Is number"
  (print "Is number")
  (print "Is not number"))

(setq test (- (linear-to-db 0)))
(print test) ; returns inf on Linux

(if (numberp test) ; returns "Is number" 
  (print "Is number")
  (print "Is not number"))

(if (= test test) ; returns "Is not number"
  (print "Is number")
  (print "Is not number"))

(if (plusp test) ; returns "Is number"
  (print "Is number")
  (print "Is not number"))

(if (minusp test) ; returns "Is not number"
  (print "Is number")
  (print "Is not number"))

Paul_L · April 25, 2013, 6:35pm

I do mean #-1.IND which I did see when I experimented with yin in the Nyquist prompt. I don’t know what IND means as opposed to INF but plusp does give false for it.

Besides which, if yin ever really does report infinity as a confidence value, then there might be no harm in treating that as a number greater than all finite numbers. I want to use the frequency for the smallest confidence that is a number or an infinity.

steve · April 25, 2013, 7:01pm

Google is a wonderful thing.
#-1.IND is how Windows represents a NaN (not a number).
A useful property of a NaN is that it does not have equality with anything, even itself, so (/= x x) is a good way to test.
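
A minimal sketch of how that test could be wrapped up for screening confidence values, assuming the self-inequality property holds on the platform in question. The names nan-p and usable-confidence-p are hypothetical helpers, not built-in Nyquist functions.

```lisp
;; nan-p is a hypothetical helper; it relies on the property that
;; a NaN is not equal to itself.
(defun nan-p (x)
  (/= x x))

;; A confidence value is treated as usable if it is a number,
;; is not a NaN, and is positive.
(defun usable-confidence-p (c)
  (and (numberp c)
       (not (nan-p c))
       (plusp c)))

(print (usable-confidence-p 0.05))   ; T
(print (usable-confidence-p -1.0))   ; NIL
```

In find-best-frequency, (usable-confidence-p confidence) could stand in for the bare (plusp confidence) test.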

Robert_J_H · April 25, 2013, 10:21pm

The best way is to eliminate the -#ind values right in the Sound itself.

;; create sound with all undesired values + -1 0 1 
(setf sig (s-log (snd-from-array 0 1 (vector 
  -1 0 (- (linear-to-db 0))(/ 2.718281828)1 2.718281828 ))))
(snd-display sig) (terpri)
;; set positive infinity + -#ind to 2
 (setf sig2 (s-min 2 sig))
(snd-display sig2)(terpri)
;; set negative infinity + -#ind to -2
 (setf sig3 (s-max  -2  sig))
(snd-display sig3) (terpri)
;; set positive NANs to 3, negative infinity to -2
(snd-display (s-max -2 (s-min 3 sig)))

For the confidence values that YIN returns, it is enough to use ‘(s-min 1 (aref yin-result 1))’, since all values are positive and 1 means non-periodic.

Paul_L · April 27, 2013, 3:15pm

I don’t understand the usefulness of that. My sound contained no illegal values. It was just a recording of voice. Yin returned two arrays in which some of the frequencies were legal values but far from the correct answer, and the corresponding confidence values were non-numbers.

Robert_J_H · April 27, 2013, 4:27pm

Yin returns an array of sounds, not two arrays.
My method sets all illegal values to a confidence of 1, which means non-periodic.

Paul_L · April 27, 2013, 6:37pm

I misspoke, technically, but of course the “sounds” from yin are disguised sequences of numbers, practically arrays in other form, and that’s how I think of them. They are not “sounds” in the commonplace sense of the word.

I understand you now. This method does not prevent bad results coming out of yin; it only filters them out. This is not a thing to apply to the inputs.

steve · April 27, 2013, 8:19pm

But they are sounds in the technical sense of the word, in that they have a sample rate, start time, stop time and logical-stop.

Robert_J_H · April 27, 2013, 8:42pm

… and you can profit from their structure as sounds.
You can use them as envelopes and control signals.
You can use the first one as the cut-off frequency in any variable filter (lp, hp, eq-band etc.) and the second one to attenuate noisy parts of a signal.
Furthermore, sounds use 4 bytes per sample whereas arrays use 14.