01-24-08 - 1

Some notes on accurate pitch detection in sounds :

You have a sound buffer in time space buf[i]. You captured the last N samples. Obviously using more samples gives you more pitch precision, but longer latency (less time resolution). This is a Heisenberg-uncertainty kind of tradeoff. For accurate pitch detection, you have to eat some temporal delay, around 1 second, and use 16k-32k samples. (pitch resolution will be like 1/N - but if your sounds are shorter than N samples, using too long a buffer will give you lots of nasty noise from partials)

Obviously an FFT is the first place to start. If you have N samples, there will be N/2 frequencies in the spectrum. The bottom frequency is 0, and the top frequency is (samples per second)/2. The resolution is thus (samples per second)/N Hz. So for example if you have 44 kHz sampled data and you do an FFT with 44k bins, your resolution is 1 Hz. That sucks bad so obviously we need to do more. Note that you're doing an FFT of only a portion of the sound, which is technically a "Short Time Fourier Transform" (STFT).

Before you FFT you want to apply a window to the buffer. The reason is that if you just FFT a raw sine wave, if it happens to line up exactly on one of the FFT bins you will get a nice single peak, but if it's between two bins, it will spread out in this evil spectrum. If you multiply the data by a window, then the spectrum will be the *convolution* of the original spectrum by the spectrum of the window. Nice smooth windows have Fourier transforms that are sinc-like (a single narrow main lobe with small sidelobes), which concentrates the result around the true peak. In any case a "Hann" window (simple 0-1 cosine) works great.
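
For concreteness, here's a minimal sketch of the windowing step (hann_window / apply_window are my names, not from any library):

```python
import math

def hann_window(n):
    # Simple 0-1 raised cosine: w[i] = 0.5 * (1 - cos(2*pi*i/(n-1)))
    # Goes to zero at both edges of the buffer.
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1))) for i in range(n)]

def apply_window(buf):
    # Multiply the time-domain buffer by the window before the FFT.
    return [b * w for b, w in zip(buf, hann_window(len(buf)))]
```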

Once you have the FFT you create a power spectrum with the sum of the squares of the real and imaginary parts. You now have N/2 power samples.
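
In code, with a naive O(N^2) DFT standing in for a real FFT library (same N/2 power bins, just slower):

```python
import cmath
import math

def power_spectrum(buf):
    # Naive DFT; any FFT library gives the same bins much faster.
    # Power at bin k = re^2 + im^2 of the k'th frequency component.
    n = len(buf)
    spec = []
    for k in range(n // 2):
        s = sum(buf[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(s.real * s.real + s.imag * s.imag)
    return spec
```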

Another note on the FFT. Your input buffer had N samples, but there's a funny thing you can do to get more resolution. Do your FFT with M samples (M > N) and just pad the buffer up with (M-N) zeros. Since you windowed your buffer it already goes to zero on the edges, so you can just tack on a bunch of zeros. This effectively improves your FFT res to 1/M instead of 1/N. There's some weirdism here I don't quite get; you're not really improving your resolution at all, you're just getting more bins in the FFT and the FFT of the zero-padded data is really just some interpolation of the non-padded data. Also the PARSHL guys do this thing where they shift the buffer so the zeros are in the middle; I'm not really sure why you would do that, applying a constant shift to the buffer before FFT is equivalent to applying a constant phase shift to the output of the FFT. Anyhoo.
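
The padding itself is trivial; sketched here just to make the bin-spacing point concrete (zero_pad is my name):

```python
def zero_pad(windowed, m):
    # The window already takes the signal to zero at the edges, so
    # appending (M - N) zeros adds no discontinuity. An M-point FFT of
    # this buffer has bin spacing (sample rate)/M instead of (sample rate)/N.
    return list(windowed) + [0.0] * (m - len(windowed))
```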

A practical issue you're going to have to address is noise. Normal computer mics are just awful. I found that the noise is primarily low frequency. There are two hacky things you can do. One is set a threshold intensity for the power spectrum; anything below that is treated as zero. Another is a threshold frequency; any bins below that frequency are treated as zero (more generally you could set a region of interest, and ignore all entries outside that region). A more general thing that I found to work nicely is just to sample the noise; tell the user to shut up and read the mic for a while and average those readings. Then scale that up by 1.5 and set that as your noise base. Subtract that noise base off all future power spectrums.
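
A sketch of the sampled-noise-base idea, with the 1.5 scale factor from above (function names are mine):

```python
def noise_floor(silent_spectra, scale=1.5):
    # silent_spectra: several power spectra captured while the user is quiet.
    # Average them per-bin, then scale up to get a conservative noise base.
    n = len(silent_spectra[0])
    avg = [sum(s[k] for s in silent_spectra) / len(silent_spectra) for k in range(n)]
    return [scale * a for a in avg]

def subtract_floor(power, floor):
    # Anything at or below the noise base is treated as zero.
    return [max(0.0, p - f) for p, f in zip(power, floor)]
```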

So, now you can just find the peaks in the power spectrum and those are your principal frequencies. I'm working on pretty clean sounds so I have a simple heuristic (just find the one sample of highest intensity). The audio compression algorithms have complicated heuristics to track peaks coherently across frames.

Once you have a peak bin index you can refine its location. You want to try to make a guess at where the true center of the peak is, which may not be at an integer bin location. Since we windowed, we know that a pure sine will turn into a sinc, so we want to just fit a sinc near the peak and use the location of the sinc. In practice, you want to only use a few samples near the peak so that you don't run into other peaks when you do the fit. Also in practice you don't need to actually fit a sinc, you can fit anything with a smooth peak, such as a quadratic or a Gaussian. Finding the center of a Gaussian fit is super easy. For an array of data[] the center of the Gaussian fit is Sum{ i * data[i] } / Sum{ data[i] }
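
The centroid refinement in code (refine_peak is my name; half_width controls how many bins around the peak get used, so you don't pull in neighboring peaks):

```python
def refine_peak(power, peak, half_width=2):
    # Centroid Sum{ i * data[i] } / Sum{ data[i] } over a few bins
    # around the integer peak; for a smooth symmetric bump this lands
    # on the true (fractional) center.
    lo = max(0, peak - half_width)
    hi = min(len(power), peak + half_width + 1)
    num = sum(i * power[i] for i in range(lo, hi))
    den = sum(power[i] for i in range(lo, hi))
    return num / den
```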

Okay, the next thing you need to deal with are the harmonics. Most sounds have harmonics, which means if the fundamental frequency is at F, there will also be peaks at 2F, 3F, 4F, etc. There are a few issues here. One is that with naive peak tracking the peak could jump around between the harmonics. That's easy to deal with by treating your frequencies in "octaveless" space, that is treat frequency as a toroidal value where 2F is equivalent to F. The other is a little more subtle, which is that the spacing between the harmonics can actually be more reliable than the frequencies of the harmonics themselves. Some of the frequency algorithms use the FFT of the FFT, which will give you the spacing between the harmonics, there's this whole cepstrum thing, bleh I don't do any of that.

I should take a second to remind you that musical notes are geometrically related. Octave frequencies are 2* apart, and each of the 12 notes is *(2^(1/12)) from its neighbors. Standard A4 is 440 Hz, so the A two octaves down is 110 Hz, and all other notes you can find from there. You can put a frequency into "note space" by doing 12*log2( freq / 110 ). That will be 0-11 in the fundamental octave, or you can do a mod(12).
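
The note-space conversion in code (name is mine):

```python
import math

def freq_to_note_space(freq, ref_a=110.0):
    # 12 * log2(f / A) = semitones above the reference A; mod 12 folds
    # everything into one octave (the "octaveless" space).
    return (12.0 * math.log2(freq / ref_a)) % 12.0
```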

One thing you can do with the harmonics is wrap your FFT to improve accuracy. For one thing, the higher harmonics have more frequency accuracy. If you divide your spectrum up into octaves, each octave gets twice as many bins as the one below it. The bottom octaves only get 1-2 bins so you can't really measure notes at all there. The top octave gets roughly half the bins of the whole FFT. Actually this is one of the really shitty properties of the FFT, that most of your resolution is wasted up in super high frequency land where nothing good happens. Most music and voice is in the 100 - 1000 Hz range. Anyway, to take advantage of this you want to use the highest harmonic you can. The other thing is to take the highest octave in the FFT spectrum, and bilinear filter from the lower octaves and add them onto the highest octave. By adding them up the little errors in each harmonic average out and you get more accuracy.
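
A sketch of my reading of the octave-folding idea: accumulate each lower octave onto the top-octave bins, with linear interpolation for the fractional bin positions (the details here are my guesses, not a canonical algorithm):

```python
def fold_octaves(power):
    # power: the N/2 power-spectrum bins. The top octave is bins [n/2, n)
    # where n = len(power). Each top-octave bin b also picks up the
    # (fractional) bins b/2, b/4, ... from the lower octaves; errors in
    # the individual harmonics average out in the sum.
    n = len(power)
    folded = []
    for b in range(n // 2, n):
        total = power[b]
        x = b / 2.0
        while x >= 1.0:
            i = int(x)
            frac = x - i
            upper = power[i + 1] if i + 1 < n else 0.0
            total += (1.0 - frac) * power[i] + frac * upper
            x /= 2.0
        folded.append(total)
    return folded
```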

Okay, at this point we should already have a really nice pitch estimate, but now that we have a really good estimate we can refine it. We've been working on the spectrum but now we're going to go back and work on the time-sampled wave data. A peak of frequency F in the spectrum means that the samples have a periodic component of period P = (samples per second) / F. (rounded to an integer number of samples). (one quick note : you can improve accuracy here by resampling your input buffer to give more samples per second; that's equivalent to working with a fractional number of samples; in theory it would be best to make all your code work with non-integer bin sampling).

So we have a good estimate of the period of the data (P) which came from the frequency spectrum work. Now we can search around to see if there's a better P by trying P+1,P-1, etc. There are two standard ways to measure the period of the samples. Note that these measures go nuts if you allow P to roam globally (because it's garbage for small P and also the periodicity is itself periodic in P), so you need this very good seed value to start the search, then you just look for a local max.

1. Autocorrelation. This is the linear correlation of samples[t] with samples[t-P]. You're basically shifting and multiplying by yourself. Find the P with highest correlation. This thing sucks balls. The problem is that it is very highly affected by noise or other signals at lower frequencies. Basically for it to work ok, you need the autocorrelation of samples[t] with samples[t-2*P] and samples[t-3*P] to all be very similar. In fact they are not if there are big low frequency waves running through the data. Now in theory you should be able to fix this by running a high pass filter over your data with a frequency that's like F/2 to get rid of all the low frequencies and just leave the stuff near F. Hell maybe you could just do a bandpass around F and then autocorrelate on that. But there's no point to muck around here cuz there's a better way :

2. Noll "Maximum Likelihood". I prefer to call this "Coherent Average". Basically what we want to do is to look at the waveform in a periodic window of length P. You put your waveform in this window by wrapping on the period and averaging all the values that go in the periodic window. Any components which are periodic at length P will add up coherently; all other components will add destructively. This is actually a cool way to look at your waveform too. It's sort of like a tuned oscilloscope. If you have your oscilloscope tuned to the right frequency, the waveform will appear to stand still. If you tune it out of phase the waveform on the scope will move; here it will destructively add and go to zero. To find the best P, select the periodic window with the highest rms (root mean square).
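
A sketch of the coherent average and the local period search around the seed (names are mine):

```python
import math

def coherent_average(samples, period):
    # Wrap the buffer on the period and average; components periodic at
    # this period add coherently, everything else cancels toward zero.
    acc = [0.0] * period
    cnt = [0] * period
    for t, s in enumerate(samples):
        acc[t % period] += s
        cnt[t % period] += 1
    return [a / c for a, c in zip(acc, cnt)]

def periodicity_score(samples, period):
    # RMS of the coherent average: biggest when the period matches.
    avg = coherent_average(samples, period)
    return math.sqrt(sum(v * v for v in avg) / period)

def refine_period(samples, p_seed, radius=2):
    # Local search only -- the score goes nuts if P roams globally,
    # so we rely on the good seed from the spectrum stage.
    return max(range(p_seed - radius, p_seed + radius + 1),
               key=lambda p: periodicity_score(samples, p))
```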

There's another approach to doing this refinement on the frequency, and that's to use some kind of bandpass filter response. What we want to do is take the sample array and apply a bandpass centered on some probe frequency F. We then measure the total power output by the bandpass, and that tells us how close we are to being tuned to the actual frequency in the sound. Then you can search around with F using a binary search or whatever to maximize the power output and thus find your true frequency. The nice thing is we can explore fractional F without writing a bunch of code to do correct non-integer bin sampling.

The coolest bandpass for this purpose is the "Goertzel algorithm" which is a really fast simple filter nicely peaked at the probe. One page out there actually talks about Goertzel as a bandpass. I dunno, maybe there are actually more appropriate bandpass filters for this purpose; you want one that is kind of Gaussian shaped, something with a single point max, not a flat top one.
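
A sketch of Goertzel used as a tunable bandpass, with a simple grid probe standing in for the binary search (names are mine):

```python
import math

def goertzel_power(samples, freq, sample_rate):
    # Goertzel recurrence: a sharp single-bin bandpass at the probe
    # frequency; works fine at fractional (non-bin) frequencies.
    w = 2.0 * math.pi * freq / sample_rate
    coeff = 2.0 * math.cos(w)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Squared magnitude of the filter state after the whole block.
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def refine_freq(samples, sample_rate, f_lo, f_hi, steps=40):
    # Probe frequencies between two bracketing guesses and keep the one
    # with the most power; a binary / golden-section search works too.
    best_f, best_p = f_lo, -1.0
    for i in range(steps + 1):
        f = f_lo + (f_hi - f_lo) * i / steps
        p = goertzel_power(samples, f, sample_rate)
        if p > best_p:
            best_f, best_p = f, p
    return best_f
```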

BTW I like to think of this as a damped resonant harmonic oscillator; you're banging it with a bunch of forces, and then you measure how much it's moving at the end. It will only really get moving if you're banging it on resonance, so then you tune it to go searching around to find the resonance where it responds most. Looking at it this way it's really also just like a radio receiver. You're tuning your radio oscillator around the frequency that you think you should get a signal and you look at how much power you're receiving, and tune to maximize the output.

With these methods you should be able to measure a pitch to around 0.01 Hz (@ 110 Hz).

BTW there are more things you can do using frame-to-frame information, such as tracking how the peak is moving from frame to frame with phase information (see the "phase vocoder"). That could theoretically give you even more accuracy but I'm not yet doing any of that. When we converted to a power spectrum we threw away all the phase information in the FFT which is good information that we could use. Basically if you measured a peak at frequency F1 and phase P1 in the first frame, and then you see F2 and P2 in the following frame, you can figure out that that's one wave of a linearly varying frequency F + dF*t with a single phase.

Another note : frequency-space pitch detection is very accurate & robust for high frequencies, but breaks down badly for low frequencies. Really any octave below a guitar's low E at 82 Hz gets pretty ugly, because the spacing between notes becomes close to or smaller than one FFT bin. On the other hand, time-domain pitch detection is really ugly for very high frequencies: the period is only a few samples, so the quantization due to the sampling rate becomes very hard to deal with, and the period is very inaccurate and sensitive to noise. Conversely, time-domain detection is quite good for low frequency pitches (as long as the frequency isn't so low that you have too few periods in your buffer; with a 1 second buffer you can't really work below 8 Hz or so).

There are other obvious things you could do that I'm not bothering with. One is dynamic resolution. When you detect that you have a nice long run at a steady pitch (when a string is sounding) you can use a big window which will give you better pitch resolution. If you use a big window all the time though it would give you horrible latency and bad noise from the transients and attack. So if you detect that the signal is highly varying in the big buffer you shrink your buffer down so that during times of rapid change you can still respond quickly, and you just have less accuracy during that time.
