Digital Sound: Variable Speeds with same pitch: How?
May 28, 2009 2:44 PM   Subscribe

many programs and devices have a function that slows down or speeds up audio without changing the pitch. my rockheaded layman's theory of how this must be done is that it divides the whole audio track into something like audio pixels that are so short that they can be shortened without actually removing any of the detail you're hearing; you just hear each small detail for a shorter amount of time (likewise with slower speed: it lengthens the time you hear each of those little "audio pixels", so that the whole thing plays twice as slow).

my question for the hive is:

whether anyone knows what that process is called, or any of the names it goes by;

what a more sophisticated explanation of how that effect functions would be (or 'an accurate explanation', as the case may be; I think my theory is pretty sound, but again, it's the version a four-year-old would understand rather than an examination of the finer points of whatever the process actually consists of);

and lastly, whether there are any interesting articles, resources, or discussions about it, especially regarding its introduction to the sound production/editing world and its early days in use.
posted by candyhammer to Media & Arts (13 answers total) 5 users marked this as a favorite
time stretching
posted by gyusan at 2:52 PM on May 28, 2009

The common term is stretching.
posted by saeculorum at 2:54 PM on May 28, 2009

Best answer: There are several ways of doing it. You can use a fast Fourier transform to break a sound into blocks of single-frequency sine waves, and then stretch the duration of those blocks. You can also use granular resynthesis, breaking the sample down into lots of overlapping chunks and then moving them closer together or further apart. A widely known early example is "The Rockafeller Skank" by Fatboy Slim.
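A minimal sketch of the granular idea in Python/numpy (grain and hop sizes here are arbitrary choices, and amplitude compensation for the overlapping grains is glossed over): read grains from the input at one spacing, write them to the output at another, so each grain still plays back at its original rate and the pitch is unchanged.

```python
import numpy as np

def granular_stretch(x, stretch, grain=2048, hop=512):
    """Granular time-stretch: cut x into overlapping windowed grains,
    read them `hop` samples apart, write them `hop * stretch` apart.
    Each grain plays at its original rate, so pitch is preserved.
    (Sketch only: overlap amplitude normalization is omitted.)"""
    window = np.hanning(grain)          # taper each grain's edges
    out_hop = int(round(hop * stretch)) # output spacing between grains
    n_grains = (len(x) - grain) // hop
    out = np.zeros(n_grains * out_hop + grain)
    for i in range(n_grains):
        g = window * x[i * hop : i * hop + grain]
        out[i * out_hop : i * out_hop + grain] += g  # overlap-add
    return out
```

With `stretch=2.0` the output is roughly twice as long; the slight phase mismatch where grains overlap is exactly the "graininess" audible in heavily stretched tracks.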
posted by scose at 2:56 PM on May 28, 2009

Your fundamental idea is incorrect. All time stretching and compression involves Fourier transforms.
posted by Chocolate Pickle at 3:04 PM on May 28, 2009

Chocolate Pickle is incorrect. The Fatboy Slim track that scose mentions was indeed done via granular resynthesis. In fact you can actually hear the grains (your "audio pixels") in the most slowed-down sections.
posted by flabdablet at 3:41 PM on May 28, 2009

Not sure if it's helpful, but here's an example in a Nine Inch Nails remix called Erased, Over, Out. (Ignore the video) He's repeating "erase me" with the pitch unmodified, but the duration stretched to an extreme... You can hear some of the choppiness in the sound that is introduced as an artifact of the stretching algorithm.
posted by knave at 3:43 PM on May 28, 2009

Wikipedia has a nice page with some more details and links.
posted by Z303 at 3:51 PM on May 28, 2009

It's a complicated process because the goal only makes sense in the context of human perception— in an abstract sense there's no notion of speeding something up without changing its pitch. In practice, you have to divide the qualities of the sound into the parts that are perceived by humans as pitch, and the parts that are perceived by humans as rhythm/sequence/etc., then do your manipulation, then recombine them. You can do this in the frequency domain, using Fourier transforms (or, I suppose, other transforms); you can also do this in the time domain, by breaking up the audio into carefully chosen chunks (a few cycles of the waveform) and repeating or dropping them.

(Basically, this is what scose said, in a bit more detail.)
posted by hattifattener at 3:52 PM on May 28, 2009

Best answer: You can think of sound in two ways: either as a single 'waveform', the kind you would see on an oscilloscope, or as a combination of lots of different frequencies, like you would see on a level meter.

If you have a pulse code modulated (PCM) sound file, you basically have a sequence of values that 'draw' the waveform. A 44.1 kHz sampled waveform has 44,100 samples per second. If you had one sample that was 'all the way on' followed by another sample that was 'all the way off', and then repeating, you would hear (if your ears could handle it) a tone around 22 kHz. If you had 50 samples 'all the way on' and 50 samples 'all the way off' (and repeat), you would hear roughly a concert A (about 441 Hz). If those samples moved up and down in a sinusoidal way, rather than jumping between all off and all on, you would hear just that note with no distortion.

And, if you took that sound sample you just created (a 440 Hz sine wave) and graphed it on a level meter, you would see just one bar for the single note.

Now suppose your sound file contained just 10 seconds of the 440 Hz sine wave. It would be really easy to shorten or extend it, right? If you wanted it 50% shorter, you would just generate the 440 Hz sine wave for half as long; to double the length, generate it for twice as long.
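The pure-tone case really is that simple. A quick sketch (the `sine` helper is hypothetical, and a 44.1 kHz sample rate is assumed):

```python
import numpy as np

def sine(freq, seconds, sr=44100):
    """Generate a pure tone: the same pitch at any duration."""
    t = np.arange(int(seconds * sr)) / sr
    return np.sin(2 * np.pi * freq * t)

ten_sec = sine(440, 10.0)   # the original clip
five_sec = sine(440, 5.0)   # "stretched" to half the length, still 440 Hz
```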

Now, let's say you had two tones overlaid on top of each other. When you looked at it on the level meter, you would see two bars. To shorten or extend the sample, you would just need to generate those tones for a longer or shorter time.

And that's usually how stretching and compressing sound works: you use a fast Fourier transform to convert the audio into a bunch of 'bars', and then by shortening the time those bars are 'there', you shorten the time the sound sample plays.

Another way to think about it: when you look at a spectrogram of a sound, you're looking at the output of those 'bars' at each moment in time. The Y axis is frequency, the X axis is time, and the color represents the intensity of that frequency at that moment. If you want to change how long a sound plays without changing the pitch, you just stretch or squish the spectrogram along the time axis, just as you would an image in Photoshop.
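The image analogy can be sketched directly (a toy magnitude spectrogram only; note that turning a spectrogram back into audio also needs the phase information it discards, so this shows the stretch, not a full resynthesis):

```python
import numpy as np

def spectrogram(x, frame=1024, hop=256):
    """Magnitude spectrogram: one FFT 'bar chart' column per frame."""
    w = np.hanning(frame)
    cols = [np.abs(np.fft.rfft(x[i:i + frame] * w))
            for i in range(0, len(x) - frame, hop)]
    return np.array(cols).T   # rows = frequency, columns = time

def stretch_columns(spec, factor):
    """Stretch the time axis like resizing an image: resample columns."""
    n = spec.shape[1]
    idx = np.minimum((np.arange(int(n * factor)) / factor).astype(int),
                     n - 1)
    return spec[:, idx]       # duplicate or drop columns as needed
```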

Oh, and that means there actually is another kind of 'sound pixel' that's just like a pixel in a picture: the intensity of the sound at a specific time and a specific frequency in a spectrogram.

You can also do granular resynthesis, as other people mentioned, which is (as far as I know) just taking chunks of a sound out or repeating them, just like taking big strips out of a photograph and removing some of them, or duplicating them. It would look very different from a regular stretch.
posted by delmoi at 4:07 PM on May 28, 2009 [3 favorites]

Oh also, I should point out that I might have gotten something wrong in the above post. That's just my understanding of how it works.
posted by delmoi at 4:07 PM on May 28, 2009

Best answer: I just recently took a course in music informatics, and one of our homework assignments involved writing software to do just this. There are multiple ways to do it, but the technique we used involves taking a short-time fast Fourier transform, the result of which can be (mostly) visualized in a spectrogram. A Fourier transform decomposes a segment of a waveform (like an audio waveform) into a sum of pure sine waves at all the different frequencies (pitches), each with a particular amplitude (volume) and phase (the phase part, i.e., where in the oscillation the wave starts from, is what won't show up in a spectrogram). The fast Fourier transform (FFT) is an efficient algorithm for the discrete Fourier transform, which gives only a finite number of frequencies, but it works quite well. Your idea of breaking the sound up into short snippets (frames) is part of the basic idea: that's the "short-time" part of this method.

The difficulty is that with very short frames, you lose the ability to get some of the frequencies. The solution to this problem is to have your frames overlap a bit. So your frame length is actually larger than your hop length. There's also some finicky bits with the way the frame starts and stops and also with the way you try to reconstruct the sound when you have overlapping frames. These are both fixed by "windowing" the waveform in each frame before taking the FFT. This gives a higher amplitude to the waveform in the center of the frame and tapers off the amplitude to zero at the edges of the frame. If you pick a good window, then overlapping frames taper off just right so that when you add them back up you get the original amplitude.
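That "adds back up to the original amplitude" property can be checked numerically: a "periodic" Hann window at 50% overlap sums to exactly 1 away from the edges (frame and hop sizes below are arbitrary):

```python
import numpy as np

frame, hop = 1024, 512   # 50% overlap
# "periodic" Hann window (endpoint excluded), the variant that tiles exactly
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(frame) / frame))

# overlap-add the window itself across several hops
total = np.zeros(frame + 8 * hop)
for i in range(9):
    total[i * hop : i * hop + frame] += w

# away from the edges, every sample is covered by exactly two windows
middle = total[frame:-frame]
```

The trigonometric reason: w(n) + w(n + frame/2) collapses to 0.5(1 - cos θ) + 0.5(1 + cos θ) = 1 for every sample.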

Another problem occurs if you ignore the phases. If you just do as I described and then try to "speed up" the audio by dropping every other frame, you get a jarring jump at the beginning of each frame caused by the unnatural shift in phase. So over the top of your sped-up sound, you hear a metallic-sounding hum with a pitch corresponding to the frame rate. You can get this same effect without speeding things up by zeroing out (I think) all the phases before reconstructing the sound. It's sometimes called "robotization", and when you hear it, you'll definitely recognize it from early electronic music, sound effects, etc. Anyway, to fix the robotization problem, you can store, instead of the phases, the difference in phases between the frames.

An algorithm that does what I've just described is called a "phase vocoder" (if I remember correctly), and it's actually not that hard to implement. I can send you some R code if you're curious. Once you've got a short-time FFT with the phase differences instead of the phases stored, you can do all sorts of fun stuff, like dropping frames to make it go faster, duplicating frames or inserting linear combinations of adjacent frames to make it go slower, even just playing the same frame over and over to make the sound just "stop" and hold steady. You can reverse the sound and play it backwards (okay, that you can do without a phase vocoder), slow it down with the vocoder and then speed it back up the normal way to increase the pitch without increasing the tempo, or do the opposite to drop the pitch without changing the tempo. I'm particularly fond of the effect you get when you randomize all the frames in an audio clip, resulting in a bizarre frantic mish-mash of sounds that almost, but not quite, sound like the sort of thing that you were sampling.
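The phase-vocoder recipe described above is compact enough to sketch in Python/numpy (not ErWenn's R code; frame and hop sizes are arbitrary choices, and the fiddly amplitude normalization for the window overlap is glossed over):

```python
import numpy as np

def phase_vocoder(x, stretch, frame=2048, hop=512):
    """Time-stretch x by `stretch` (2.0 = twice as long) without changing
    pitch: take a short-time FFT, then resynthesize the frames at a new
    hop size, propagating the measured phase *differences* between frames.
    (Sketch only: overlap amplitude normalization is omitted.)"""
    w = np.hanning(frame)
    syn_hop = int(round(hop * stretch))
    # analysis: overlapping windowed FFT frames at the analysis hop
    frames = [np.fft.rfft(w * x[i:i + frame])
              for i in range(0, len(x) - frame, hop)]
    # phase each bin is *expected* to advance per analysis hop
    expected = 2 * np.pi * np.arange(frame // 2 + 1) * hop / frame
    out = np.zeros(len(frames) * syn_hop + frame)
    phase = np.angle(frames[0])
    prev = frames[0]
    for n, F in enumerate(frames):
        if n > 0:
            # deviation of the measured phase advance from the expected one
            d = np.angle(F) - np.angle(prev) - expected
            d -= 2 * np.pi * np.round(d / (2 * np.pi))  # wrap to [-pi, pi]
            # re-advance the running phase at the synthesis hop rate
            phase += (expected + d) * (syn_hop / hop)
            prev = F
        grain = np.fft.irfft(np.abs(F) * np.exp(1j * phase))
        out[n * syn_hop : n * syn_hop + frame] += w * grain
    return out
```

Dropping or duplicating frames, freezing on one frame, or shuffling frames, as described above, are all variations on the same loop.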

Anyway, much of this repeats what's already been said, but at the very least, I hope it helps you put some names to the concepts. Also, I know there are many software tools (free and otherwise) that can do stretching, pitch-shifting, etc. (I think there's a WinAmp plug-in), but I'm pretty sure most of them use a different technique.
posted by ErWenn at 8:52 PM on May 28, 2009 [1 favorite]

As it happens, I have a tarball sitting on my desktop that builds a program that will do what you want. It has 3 dependencies:

* libsndfile, to read and write various sound formats with a uniform API
* libsamplerate, which I think is used to do pitch shifts
* fftw, a Fourier transform library used to do the time stretch.

Other people have mentioned how it can be done, the program's called RubberBand and you might see how it compares to what's been suggested.
posted by pwnguin at 12:41 AM on May 29, 2009

See also the work of Curtis Roads.
posted by Dean King at 9:24 AM on May 29, 2009

This thread is closed to new comments.