How to measure a person's average pitch?
August 18, 2009 11:38 PM

I have some recordings of people saying a short phrase. How can I use these recordings to measure the average pitch of each speaker's voice?

For a research project, I want to identify how high-pitched or low-pitched different people's voices are based on some short recordings. Unfortunately, I know very little about anything audio/video related and am clumsily trying to make sense of the "export spectrum" graphs that Audacity produces. Is there an accepted technique people use to measure the average pitch of audio recordings?

Also: if anybody with more knowledge about audio thinks that average pitch is a terrible measure to consider, I'd love to learn why. I'm pretty far from my research comfort zone with these recordings.
posted by eisenkr to Technology (10 answers total) 3 users marked this as a favorite
Best answer: Personally, I would download Praat and use that to look at pitch tracks and the like, but if you're familiar with Audacity, then stick to what you know.

While pitch is interesting, I don't know what it buys you to compare averages, especially between speakers. People use pitch changes to convey all sorts of things, but a flattened out average doesn't tell you anything about what people are doing, how they're doing it, or how often they're doing it.

Are your speakers all saying the same phrase? That could be really cool if so! If you want to stay focused on pitch, you could take a look at the overall contours, compare them, and see what's interesting. Maybe pick out an individual word/part that has a pitch reset and compare differences.

Now, if you're not set on pitch, you could do some really interesting comparisons of a specific vowel across speakers. Or you could measure rhythm or stressed syllable length. Or look at voicing of normally devoiced consonants. It's endless.

Maybe if you tell us what field of study you're in and what your general research goals are, we could give some more specific advice for what you're ultimately trying to do? Either way, happy to help you out if I can!
posted by iamkimiam at 12:10 AM on August 19, 2009

Response by poster: Thanks for the response, iamkimiam. I'll try to provide a little bit more detail below.

1. The speakers are all saying the same short phrase (the phrase is "Hi Dan") and the recordings are all about 650ms long (sd = 164ms).

2. Without going into too much detail in a public forum about what this in-progress research is (if you're really curious, me-fi mail me), I'm planning to study how listeners characterize the people who recorded these very short audio clips. Although my main hypotheses have nothing to do with pitch, I'm guessing that some reviewers are going to want to see whether measurable aspects of the recordings (e.g., pitch, length) are driving the results.

Given this extra information, if you have any other suggestions about how to measure pitch, or about what types of variables people usually measure for speech, I'd love to hear them. I primarily study things like dyadic perceptions/interactions/reactions and publish in social psychology/applied psychology journals, so using recordings is still new to me.
posted by eisenkr at 12:36 AM on August 19, 2009

You'll want to read up on the Fourier transform, available in lots of math packages such as MATLAB or Mathematica.
posted by fatllama at 12:47 AM on August 19, 2009

Best answer: Take the Fourier transform of your audio, which will move it from the time domain to the frequency domain.

In the frequency domain, it may be that the frequency of the highest peak (i.e. the one contributing the most energy) is what you want. If that doesn't seem right, you could take a weighted mean of your major peaks. But keep in mind that the mean is often a pretty bad definition of "average" for this sort of analysis; the median is often far more stable and informative.
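Sketched in Python with NumPy, if that helps (a synthetic 220 Hz tone stands in for a real recording; the function name is just illustrative):

```python
import numpy as np

def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest spectral peak."""
    # Real FFT: move the signal from the time domain to the frequency domain.
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    # Skip the DC component at index 0, then take the highest peak.
    return freqs[1 + np.argmax(spectrum[1:])]

# Synthetic example: one second of a 220 Hz sine wave at 44.1 kHz.
rate = 44100
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 220 * t)
print(round(dominant_frequency(tone, rate)))  # → 220
```

For a real voice clip you'd load the samples from a WAV file first; the strongest peak may well be a harmonic rather than the fundamental, which is why the weighted-mean and median ideas above are worth keeping in mind.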
posted by Netzapper at 1:33 AM on August 19, 2009

Best answer: I know Audacity has a simple frequency spectrum analyzer, but in my noodling about I've never gotten it to give me a decent output. From this link, it looks like it plots the power density spectrum and some other autocorrelation functions. The link notes that
The Enhanced Autocorrelation function is very good at identifying the pitch of a note.
so that may be what you're looking for.

What I'd do is separate the sound file into windows (say, each syllable of "Hi Dan", or each audibly noticeable pitch-change) and find the Autocorrelation spectrum of each window separately. That should give you two or three different "pitches" that you can average.
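For instance, here's a rough Python/NumPy sketch of per-window pitch estimation by plain autocorrelation (note: this is ordinary autocorrelation, not Audacity's Enhanced Autocorrelation, and the two synthetic windows stand in for real syllables):

```python
import numpy as np

def pitch_autocorr(window, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the pitch (Hz) of one window via plain autocorrelation."""
    window = window - window.mean()
    # Autocorrelation via full cross-correlation; keep non-negative lags.
    ac = np.correlate(window, window, mode="full")[len(window) - 1:]
    # Only search lags corresponding to plausible speaking pitches.
    lo = int(sample_rate / fmax)
    hi = int(sample_rate / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sample_rate / lag

# Two synthetic 50 ms "syllables" at different pitches, then the average.
rate = 8000
t = np.arange(int(0.05 * rate)) / rate
windows = [np.sin(2 * np.pi * f * t) for f in (120.0, 180.0)]
pitches = [pitch_autocorr(w, rate) for w in windows]
print(pitches, np.mean(pitches))
```

The estimates land within a couple of Hz of the true 120 and 180 because the lag is quantized to whole samples; real speech windows are messier, which is part of why Audacity's enhanced variant exists.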
posted by muddgirl at 7:05 AM on August 19, 2009

I am not a scientist, but it seems to me that reducing your data to a single number before doing any comparisons is a mistake. I would convert your samples (via FFT) into frequency-spectrum graphs and use those graphs visually to see if any patterns emerge before going into numerical mining. I'm picturing something akin to sparklines, but with frequency-domain data.
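Something like this Python/NumPy sketch (synthetic tones stand in for the recordings; feed each resulting (freqs, magnitude) pair to whatever plotting tool you like as one small multiple per speaker):

```python
import numpy as np

def norm_spectrum(samples, sample_rate):
    """Normalized magnitude spectrum, ready to plot sparkline-style."""
    mag = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs, mag / mag.max()

# Two hypothetical speakers, stand-ins for the real clips.
rate = 16000
t = np.arange(rate) / rate
clips = {"speaker_a": np.sin(2 * np.pi * 110 * t),
         "speaker_b": np.sin(2 * np.pi * 220 * t)}

# One normalized spectrum per speaker; eyeballing these side by side
# is the "sparklines with frequency-domain data" idea.
spectra = {name: norm_spectrum(clip, rate) for name, clip in clips.items()}
for name, (freqs, mag) in spectra.items():
    print(name, freqs[np.argmax(mag)])  # strongest frequency, ≈110 and ≈220 Hz
```

Normalizing each spectrum to its own maximum keeps louder recordings from dominating the visual comparison.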
posted by chairface at 9:33 AM on August 19, 2009

As fatllama pointed out, MATLAB would be a great tool to use for this - especially if you have a large number of recordings.

Once you import the recordings into MATLAB, you'll be able to easily extract all the statistics you want from them. Also, there are tonnes of information already out there about doing speech processing in MATLAB if you want to get into more complex analysis.

Of course, if you don't have any sort of programming background, the learning curve might be a little steep.
posted by toftflin at 10:53 AM on August 19, 2009

Autotune it.
posted by Sebmojo at 7:02 PM on August 19, 2009

Seriously, you can do this with a piano. I don't know if it meets scientific standards, but if you have a good ear you can listen to someone talk and then play notes on the piano until it matches what you hear. If you can't do it, a musician friend with a good ear can. Then you could write: "The pitch of this sample was judged to center around E-flat'" or something similar.

I would guess these kinds of judgements would be closer to the mark than something automated. For instance, someone says "Hi, Dan" and has a tiny squeak on the first word. You'll be able to hear that the rest of the phrase is at a level pitch, but the autoanalysis would average the whole thing and come out too high.
posted by argybarg at 7:24 AM on August 20, 2009

You should (if you haven't already) read up on formants. Pitch is primarily observed by measuring the lowest frequency peak, whereas vowel sounds are determined by the formant frequencies (the largest-amplitude spectral peaks). There are already well-established average ranges of all these frequencies for adult and child males and females. Pitch is mostly determined by the speaker's vocal folds, while the formants depend on the geometry of the vocal tract.
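A quick Python/NumPy illustration of that distinction, using a crude synthetic "vowel" (the frequencies and amplitudes are made up for the example):

```python
import numpy as np

rate = 16000
t = np.arange(rate) / rate
# A crude "vowel": harmonics of a 150 Hz fundamental, with the second
# harmonic strongest, as a formant near 300 Hz might make it.
signal = (0.5 * np.sin(2 * np.pi * 150 * t)
          + 1.0 * np.sin(2 * np.pi * 300 * t)
          + 0.3 * np.sin(2 * np.pi * 450 * t))

mag = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)

strongest = freqs[np.argmax(mag)]       # formant-boosted peak, ≈300 Hz
peaks = freqs[mag > 0.25 * mag.max()]   # all prominent peaks
fundamental = peaks.min()               # lowest peak, ≈150 Hz: the pitch
print(strongest, fundamental)
```

The largest peak here is a harmonic shaped by the "formant", not the pitch; the lowest prominent peak is what tracks the fundamental.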
posted by achmorrison at 8:38 PM on August 20, 2009 [1 favorite]

This thread is closed to new comments.