How does autotune work?
October 22, 2007 1:48 PM

How does autotune work?

My best guess is that it does a Fourier analysis to guess the fundamental of the incoming sound, finds the closest pitch in the scale you've set it to, and does rate adjustment of a ring buffer to try to make it match.
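
As a rough illustration of the "rate adjustment of a ring buffer" part of that guess, here is a minimal Python sketch that reads a circular buffer at a fractional rate using linear interpolation. The name `read_at_rate` and its parameters are made up for this example, and it leaves out something a live pitch shifter would also need: handling the read pointer drifting relative to the write pointer (usually done by crossfading).

```python
import math

def read_at_rate(buffer, ratio, n_out, start=0.0):
    """Read n_out samples from a circular buffer, advancing the read
    position by `ratio` input samples per output sample.

    Hypothetical helper for illustration only: there is no write pointer
    and no crossfade, so this is not a complete real-time shifter.
    """
    size = len(buffer)
    out = []
    pos = start
    for _ in range(n_out):
        base = math.floor(pos)
        frac = pos - base                 # fractional part of the read position
        i = int(base) % size              # wrap around the ring
        nxt = buffer[(i + 1) % size]
        # Linear interpolation between the two neighbouring samples.
        out.append((1.0 - frac) * buffer[i] + frac * nxt)
        pos += ratio
    return out
```

Reading with a ratio above 1 raises the pitch and a ratio below 1 lowers it, e.g. `read_at_rate(buf, 1.5, len(buf))` plays the buffer contents a fifth higher.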

Answers with as much technical/mathematical detail as possible would be great, as would a "dummy with basic understanding of DSP" breakdown.
posted by phrontist to Technology (7 answers total) 7 users marked this as a favorite
Response by poster: To clarify, I specifically want to know how it does it live.
posted by phrontist at 1:50 PM on October 22, 2007

Best answer: This Auto-Tune manual [pdf] says some things about it that might help, though it isn't very technical. See "How Auto-Tune Detects Pitch" and "How Auto-Tune Corrects Pitch."
posted by wemayfreeze at 1:56 PM on October 22, 2007

I've run the Autotune plugin "live", with my soundcard's ASIO latency set to around 100ms. Whatever they're doing, their algorithm is fast as hell.

As far as their rackmount module goes, 30-50 ms is more than enough time for an embedded system (or an FPGA with an embedded DSP) to process the sound. I'd guess their DSP runs at a fairly high clock frequency (100 MHz at least), and the sampling rate is only around 48 kHz. Now consider that in a live performance there's already line delay from the mics, mixers, etc., so another 50 ms isn't going to matter.
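
To put numbers on that, here is a small Python sketch of the arithmetic. The 100 MHz clock and 48 kHz sample rate are the guesses from this comment, not known specs of any Antares hardware, and the function names are made up for the example.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Clock cycles available for each incoming audio sample."""
    return clock_hz / sample_rate_hz

def samples_in_window(window_ms, sample_rate_hz):
    """Number of audio samples that arrive during a latency window."""
    return int(round(window_ms / 1000.0 * sample_rate_hz))

# Guessed figures: 100 MHz clock, 48 kHz audio, 50 ms latency budget.
budget = cycles_per_sample(100e6, 48_000)   # roughly 2083 cycles per sample
window = samples_in_window(50, 48_000)      # 2400 samples in 50 ms
```

Under those assumptions the processor has on the order of two thousand cycles to spend on every sample, and a 50 ms window holds 2400 samples to analyze.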

Your algorithm description makes sense to me, but then again, I have no idea how Antares does it, and I don't have enough DSP knowledge to answer that.
posted by spiderskull at 2:21 PM on October 22, 2007

You wouldn't need an FFT, just slide a chunk of signal against itself and total the absolute value of the differences. The minimum value for that happens when one cycle of the pitch lines up with itself. That time delay over one is the pitch.
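
What this describes is usually called the average magnitude difference function (AMDF). A minimal Python sketch, assuming a mono list of float samples: the name `amdf_pitch`, the 80-1000 Hz search range, and the `tolerance` rule are choices made for this example. The rule takes the first dip that comes close to the global minimum rather than the global minimum itself, because lags of two or three periods line up about as well as one period and can otherwise produce octave errors. The pitch is the sample rate divided by the chosen lag, i.e. one over the time delay.

```python
import math

def amdf_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0, tolerance=0.1):
    """Estimate pitch in Hz by sliding the signal against itself.

    Illustrative sketch only; fmin, fmax and tolerance are assumed values.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = int(sample_rate / fmin)
    n = len(samples) - max_lag            # samples compared at every lag
    if n <= 0:
        raise ValueError("need more samples than the longest lag")

    lags = range(min_lag, max_lag + 1)
    # Total absolute difference between the chunk and its shifted copy.
    scores = [sum(abs(samples[i] - samples[i + lag]) for i in range(n))
              for lag in lags]

    lo, hi = min(scores), max(scores)
    threshold = lo + tolerance * (hi - lo)
    # First lag that gets close to the global minimum...
    k = next(i for i, s in enumerate(scores) if s <= threshold)
    # ...then walk down to the bottom of that dip.
    while k + 1 < len(scores) and scores[k + 1] < scores[k]:
        k += 1
    return sample_rate / lags[k]
```

The resolution is limited to whole-sample lags, so at low sample rates the estimate can be off by a couple of Hz; interpolating around the minimum would tighten that.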
posted by StickyCarpet at 2:55 PM on October 22, 2007

Uh. one over that.
posted by StickyCarpet at 2:56 PM on October 22, 2007


Best answer: My guess is that it's "find the pitch, round it off to the nearest permitted unit, pitch-shift to correct". You can find pitch-shifting algorithms on the net if you google for them; it's not complicated.
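
The "round it off" step can be sketched in a few lines of Python. This assumes 12-tone equal temperament with A4 = 440 Hz and uses C major as the example set of permitted notes; the function name `snap_to_scale` is made up here, and nothing in it is taken from Auto-Tune itself.

```python
import math

A4_HZ = 440.0                       # assumed tuning reference
C_MAJOR = (0, 2, 4, 5, 7, 9, 11)    # permitted pitch classes, C = 0

def snap_to_scale(freq_hz, scale=C_MAJOR):
    """Return (target_hz, ratio): the nearest permitted note and the
    pitch-shift ratio needed to reach it from freq_hz."""
    # Continuous MIDI note number of the detected pitch (A4 = 69).
    midi = 69 + 12 * math.log2(freq_hz / A4_HZ)
    center = round(midi)
    # Permitted notes within an octave either side of the detected pitch.
    candidates = [n for n in range(center - 12, center + 13) if n % 12 in scale]
    target = min(candidates, key=lambda n: abs(n - midi))
    target_hz = A4_HZ * 2 ** ((target - 69) / 12)
    return target_hz, target_hz / freq_hz
```

For example, `snap_to_scale(450.0)` picks A4 at 440 Hz and returns a ratio slightly below 1, which is the factor the pitch shifter would then apply.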

The latency doesn't have to be very large either, since they can do it using fairly short frames of data; e.g. a 1024-point FFT gives you roughly 25 ms of latency and a bin width of about 50 Hz. You can then interpolate between bins to get a precision of a few Hz, on the assumption that the tones being corrected are relatively pure.
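
To check those figures: a frame of N samples at sample rate fs takes N/fs seconds to fill, and the bins are fs/N apart, so 1024 points at 48 kHz works out to about 21.3 ms and 46.9 Hz. Below is a Python sketch of that arithmetic plus one common way to interpolate between bins, a parabola fitted through the log magnitudes around the peak of a Hann-windowed frame. The interpolation method and all names are assumptions for this example; the naive DFT is only there to keep the sketch dependency-free, and a real implementation would use an FFT.

```python
import cmath
import math

def frame_stats(n_fft, sample_rate):
    """Return (frame length in ms, bin width in Hz)."""
    return n_fft / sample_rate * 1000.0, sample_rate / n_fft

def dft_magnitude(samples, k):
    """Magnitude of DFT bin k (naive O(N) sum, for illustration)."""
    n = len(samples)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(samples)))

def interpolated_peak_hz(samples, sample_rate, k_lo, k_hi):
    """Find the strongest bin in [k_lo, k_hi] (k_lo >= 1) and refine it
    with a parabola through the log magnitudes of its neighbours."""
    n = len(samples)
    # Hann window to reduce leakage before measuring bin magnitudes.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(samples)]
    mags = {k: dft_magnitude(windowed, k) for k in range(k_lo - 1, k_hi + 2)}
    k = max(range(k_lo, k_hi + 1), key=mags.get)
    a, b, c = (math.log(mags[j]) for j in (k - 1, k, k + 1))
    offset = 0.5 * (a - c) / (a - 2 * b + c)   # peak position in bins, relative to k
    return (k + offset) * sample_rate / n
```

With a clean 440 Hz tone in a 1024-sample frame at 48 kHz, the raw peak bin sits about 18 Hz away from the true frequency, while the interpolated estimate lands within a few Hz.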

I would assume there's some filtering going on so that only the inaccurate tones (a voice) and their harmonics are shifted and everything else is left alone.

Considering the power of a modern CPU or FPGA, this is quite the trivial task. People build radars using FPGAs now and they process data coming in at 2GS/s while audio is only 48 or 96kS/s.
posted by polyglot at 5:16 PM on October 22, 2007
