I’m tinkering with writing a piano tuning program for my own personal use. Part of it is a real-time pitch detector whose aim is to identify which note on the piano is being played and to compare its frequency to a previously computed target frequency. My current approach and results are below. I’ve read a little about smoothing and windowing functions, which may help improve the accuracy, but I’m pretty clueless about which smoother and window would be best for this application.
- I’m currently using the Windows waveform API (e.g., waveInOpen()) to collect buffers of size 2^14 (1 channel, 44.1kHz, 16 bits, Samson CO1U microphone).
- I keep the 4 most recent 2^14 buffers in a 2^16 buffer. As each new buffer arrives, I discard the oldest one and shift the others to the left. This gives 44100 / 2^14 (about 2.7) pitch estimates per second.
- I then run an FFT on the current 2^16 buffer.
- I convert the output of the FFT into power spectra at two granularities. The finer one uses 12000 bins, which gives 1 cent resolution from 10 Hz to 10 kHz. I didn’t think such a fine resolution was useful for the HPS step below, so I also create a power spectrum with 200 bins. The idea behind this number was that it gives about 2 bins per piano key. Maybe this should be exactly 88 bins centered on each equal-tempered frequency, or maybe it should be an exact integer multiple of 88…any thoughts?
- I’m currently using the harmonic product spectrum (HPS) approach to pitch estimation. For this, you build 4 more 200-bin power spectra in which each frequency is divided by 2, 3, 4, and 5 respectively before its power is added to the spectrum. Then you multiply the original 200-bin power spectrum by the 4 others and scan the result for the biggest peak. The idea is that after dividing by 2, 3, 4, and 5 the harmonic peaks in the signal line up with the fundamental, so the product has a large spike at the real frequency.
- The biggest peak in the 200 bin case gives me a small frequency range to search in the 12000 bin case for 1 cent pitch resolution.
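To make the last three steps concrete, here is a rough Python sketch of the binning, HPS, and refinement stages. It is only an illustration, not my actual code: the function names (`log_bin`, `bin_center`, `rebin`, `hps_peak`, `refine`) are made up, the FFT power spectrum is assumed to arrive as a plain list indexed by FFT bin, and the bins are assumed to be logarithmically spaced between 10 Hz and 10 kHz (which is what a constant resolution in cents implies).

```python
import math

F_LO, F_HI = 10.0, 10000.0  # frequency range covered by the log-spaced bins


def log_bin(freq, n_bins):
    """Index of the log-spaced bin containing freq, or None if out of range."""
    if not F_LO <= freq < F_HI:
        return None
    index = int(n_bins * math.log(freq / F_LO) / math.log(F_HI / F_LO))
    return min(index, n_bins - 1)  # guard against rounding at the top edge


def bin_center(index, n_bins):
    """Geometric center frequency of a log-spaced bin."""
    return F_LO * (F_HI / F_LO) ** ((index + 0.5) / n_bins)


def rebin(fft_power, sample_rate, fft_size, n_bins, divisor=1):
    """Sum linear FFT bin powers into log-spaced bins after dividing each
    FFT bin frequency by `divisor` (divisor=1 gives the plain spectrum)."""
    out = [0.0] * n_bins
    for k in range(1, len(fft_power)):  # skip the DC bin
        i = log_bin(k * sample_rate / fft_size / divisor, n_bins)
        if i is not None:
            out[i] += fft_power[k]
    return out


def hps_peak(fft_power, sample_rate, fft_size, n_bins=200, n_harmonics=5):
    """Multiply the plain spectrum by its /2 .. /n_harmonics versions and
    return the index of the bin with the largest product."""
    product = [1.0] * n_bins
    for h in range(1, n_harmonics + 1):
        spectrum = rebin(fft_power, sample_rate, fft_size, n_bins, divisor=h)
        product = [p * s for p, s in zip(product, spectrum)]
    return max(range(n_bins), key=product.__getitem__)


def refine(fft_power, sample_rate, fft_size, coarse_bin,
           coarse_bins=200, fine_bins=12000):
    """Search only the fine bins that lie inside the chosen coarse bin and
    return the center frequency of the strongest one."""
    fine = rebin(fft_power, sample_rate, fft_size, fine_bins)
    per_coarse = fine_bins // coarse_bins
    start = coarse_bin * per_coarse
    best = max(range(start, start + per_coarse), key=fine.__getitem__)
    return bin_center(best, fine_bins)


if __name__ == "__main__":
    rate, size = 44100, 2 ** 16
    base = round(440.0 * size / rate)  # FFT bin nearest 440 Hz
    power = [0.0] * (size // 2 + 1)
    for h in range(1, 6):  # synthetic note: 5 equal harmonics
        power[h * base] = 1.0
    coarse = hps_peak(power, rate, size)
    print(coarse, refine(power, rate, size, coarse))
```

The synthetic spectrum at the bottom is idealized (five clean harmonics and nothing else), so it says nothing about how the method behaves on a real piano string.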
This seems to work okay for notes in the middle of the keyboard. For the lowest notes, however, it behaves as follows. Say I play C1. It may first identify the note as G2 (the third harmonic of C1), then a couple of estimates later as C2, and a couple of estimates after that it settles on the correct key, C1. I have some spectra of this process that I can post if that would help. The upper notes have some problems as well. I wouldn’t expect the peaks to have much width in the 200-bin case, but they still do. Is this the so-called variance problem, which might improve if a windowing function is applied to the original signal to prevent a sudden truncation?
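For reference, this is the kind of windowing I have in mind, sketched in Python. The Hann window is just one common choice that I picked as an example, not necessarily the best one for this job. The helper names (`hann`, `power_spectrum`, `leakage_fraction`) are made up, and a slow direct DFT on a short frame stands in for the real 2^16-point FFT so that the snippet has no dependencies.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def power_spectrum(samples):
    """Power in bins 0 .. n/2, computed with a direct O(n^2) DFT."""
    n = len(samples)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        out.append(abs(acc) ** 2)
    return out


def leakage_fraction(power, halfwidth=2):
    """Fraction of the total power lying more than `halfwidth` bins away
    from the strongest bin."""
    peak = max(range(len(power)), key=power.__getitem__)
    lo, hi = max(0, peak - halfwidth), peak + halfwidth + 1
    return 1.0 - sum(power[lo:hi]) / sum(power)


if __name__ == "__main__":
    n = 256
    # 20.5 cycles per frame: the tone falls exactly between two bins,
    # so the frame is cut off mid-cycle.
    tone = [math.sin(2 * math.pi * 20.5 * i / n) for i in range(n)]
    windowed = [s * w for s, w in zip(tone, hann(n))]
    print("no window:", leakage_fraction(power_spectrum(tone)))
    print("Hann     :", leakage_fraction(power_spectrum(windowed)))
```

The frame is multiplied by the window before the transform, which tapers both ends to zero instead of truncating them abruptly. The trade-off is a somewhat wider main peak in exchange for much less power smeared into distant bins.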
Also, I originally tried to use WASAPI but didn’t have any luck. pAudioClient->Initialize always failed in exclusive mode (I made multiple attempts to get 1 channel, 44.1 kHz…). In shared mode, none of the functions failed, but the data returned by pCaptureClient->GetBuffer() didn’t look like a PCM signal. What format is this data supposed to be in?
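One possibility I still need to check: in shared mode the stream uses the format reported by pAudioClient->GetMixFormat(), and that is often 32-bit IEEE float rather than 16-bit integer PCM. If so, a dumped buffer could be tested with something like the Python sketch below. It assumes little-endian float32 samples in the range -1.0 to 1.0 and ignores channel interleaving; the function name `float32_to_int16` is made up for illustration.

```python
import struct


def float32_to_int16(raw):
    """Reinterpret raw bytes as little-endian float32 samples (assumed
    format) and scale them to the 16-bit integer range."""
    count = len(raw) // 4
    floats = struct.unpack("<%df" % count, raw[:count * 4])
    out = []
    for x in floats:
        x = max(-1.0, min(1.0, x))  # clip anything outside [-1, 1]
        out.append(int(round(x * 32767)))
    return out
```

If the converted samples look like a sensible waveform, the float guess is probably right. If the mix format has 2 channels, the samples would also need de-interleaving before they go into the FFT.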
All ideas for improvement appreciated!
Todd