Looking for realtime vocal feature tracking algorithms
March 23, 2011 12:20 PM

I'm looking for realtime vocal feature tracking algorithms that I can buy or implement. I'd like to be able to track pitch, volume, formants, vibrato, and as much more low-level info as I can get, as close to real-time as possible. I don't need speech recognition.

Other features I'd love to be able to track might include (but aren't limited to) occurrences of consonant & vowel sounds, breath sounds, syllable boundaries, estimates about whether it's a male/female/child's voice, measures of vocal qualities like nasal or gravelly tones, etc.

I know I can't have everything I want, but I'd love to find state of the art techniques that I can use in my own software, whether by implementing published algorithms or by buying VST plugins or the like. I'm sure there's a lot of great technology locked up in Autotune, Melodyne and such; I'd consider buying that kind of software if there are ways to get realtime feature data (not audio) out of it and into MAX/MSP.

Please point me to published research and experimental or commercial software in this area!
posted by moonmilk to Technology (10 answers total) 11 users marked this as a favorite
You can use EchoNest data in Max/MSP.
posted by mkb at 12:57 PM on March 23, 2011

If you are handy with code at all, the csound programming language will run inside a csound~ max object, and has excellent facilities for feature extraction, including phase vocoding (the feature extraction technique used in autotune), as well as many others that are hard to find elsewhere.
posted by idiopath at 1:02 PM on March 23, 2011

Also there are many people using csound for feature extraction to make "auto-accompaniment" style systems in the tradition of George Lewis. Of particular note is Ma++ Ingalls, who has shared his code with the csound mailing list (scroll down to Matt's post; the file is called "CLAIRE_06.csd").
posted by idiopath at 1:09 PM on March 23, 2011

Response by poster: EchoNest API is intriguing, mkb, but unfortunately it's not available in real time.

idiopath, csound sounds promising, and I'm up for learning it. I welcome more csound-specific recommendations!
posted by moonmilk at 1:25 PM on March 23, 2011

Best answer: Some of the csound tools for spectral feature extraction are:

the pvs opcodes (phase vocoder analysis, like that used by autotune - I think it is just a dft but there could be more to it?).
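On the "is it just a dft?" question: as the technique is generally described, a phase vocoder takes a DFT of overlapping windowed frames and then uses each bin's phase change between successive frames to refine that bin's frequency estimate. The sketch below illustrates that idea in plain Python. It is not csound's pvs implementation, and the function names, frame size, and hop size are assumptions chosen for illustration.

```python
import cmath
import math


def windowed_dft(frame):
    """Hann-window a frame and return DFT bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
            for k in range(n // 2 + 1)]


def peak_frequency(signal, sample_rate, frame_size=256, hop=64):
    """Estimate the strongest partial's frequency from two overlapping frames."""
    first = windowed_dft(signal[:frame_size])
    second = windowed_dft(signal[hop:hop + frame_size])
    k = max(range(len(first)), key=lambda b: abs(first[b]))  # loudest bin
    # Phase advance a sinusoid exactly at the bin centre would show over one hop.
    expected = 2 * math.pi * k * hop / frame_size
    deviation = cmath.phase(second[k]) - cmath.phase(first[k]) - expected
    deviation = (deviation + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    # Convert the leftover phase into a fractional bin offset.
    true_bin = k + deviation * frame_size / (2 * math.pi * hop)
    return true_bin * sample_rate / frame_size


# Usage: a 445 Hz test tone at 8 kHz falls between bins 14 (437.5 Hz) and 15.
tone = [math.sin(2 * math.pi * 445 * i / 8000) for i in range(320)]
estimate = peak_frequency(tone, 8000)
```

With a 256-sample frame at 8 kHz the bins are 31.25 Hz apart, so the phase step is what lets the estimate land between bin centres. A real-time version would use an FFT and track many bins per frame.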

the spectrum opcode for exponentially spaced spectrum analysis (as opposed to the linear spacing of the standard dft used in most algorithms), which drives the finely tweakable specptrk opcode for deriving the fundamental pitch. specptrk can be customized to be more accurate for a particular instrument or voice (it even has an option to display its internal list of "candidate fundamentals", besides the one that it decided was the best fit).

There is also the much simpler "pitch" pitch tracker (which uses a similar approach to spectrum but seems a more "plug and play" type alternative with fewer finicky options), and pitchamdf, which uses a method called the "average magnitude difference function".
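For anyone curious what an average magnitude difference function looks like, here is a minimal sketch in plain Python. It shows the general textbook idea (compare the frame with a delayed copy of itself and look for the delay where the difference is smallest), not the code inside pitchamdf. The function name, the 80-500 Hz search range, and the 10% threshold for picking the first dip are all assumptions made for illustration.

```python
import math


def amdf_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the fundamental of one frame with a basic AMDF search."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    if len(frame) <= max_lag:
        raise ValueError("frame too short for the requested fmin")
    n = len(frame) - max_lag  # use the same number of terms for every lag
    scores = []
    for lag in range(min_lag, max_lag + 1):
        diff = sum(abs(frame[i] - frame[i + lag]) for i in range(n))
        scores.append(diff / n)
    # Take the first dip near the global minimum instead of the global
    # minimum itself, to avoid locking onto two or three periods at once.
    low, high = min(scores), max(scores)
    threshold = low + 0.1 * (high - low)
    idx = next(i for i, s in enumerate(scores) if s <= threshold)
    while idx + 1 < len(scores) and scores[idx + 1] < scores[idx]:
        idx += 1  # walk down to the bottom of that dip
    return sample_rate / (min_lag + idx)


# Usage: a 200 Hz sine at 8 kHz has a period of exactly 40 samples.
frame = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(400)]
pitch_hz = amdf_pitch(frame, 8000)
```

The result is quantized to whole-sample lags, so practical trackers usually interpolate around the dip and smooth the estimates over time.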

there are also the lp opcodes (linear predictive coding), ats ("Analysis, Transformation and Synthesis"), loris, and pvoc (phase vocoder, mostly duplicated by pvs I think), but those require analysis of a static sound file, and are not meant to be realtime tools.

There of course are the standard set of amplitude followers etc. but that kind of thing is probably easy enough to do in max as well.
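As a rough illustration of what an amplitude follower does, here is a minimal one-pole envelope follower in plain Python. This is a generic sketch, not the code behind any particular csound opcode or max object; the function name and the attack and release times are assumptions.

```python
import math


def envelope_follower(samples, sample_rate, attack_ms=5.0, release_ms=50.0):
    """Track the amplitude envelope with separate attack and release smoothing."""
    attack = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    env = 0.0
    out = []
    for x in samples:
        level = abs(x)  # rectify the input
        # Rise quickly when the signal gets louder, fall slowly when it fades.
        coeff = attack if level > env else release
        env = coeff * env + (1.0 - coeff) * level
        out.append(env)
    return out


# Usage: one second of full-scale signal followed by half a second of silence.
signal = [1.0] * 1000 + [0.0] * 500
envelope = envelope_follower(signal, 1000)
```

Shorter attack and release times follow the voice more closely but let more ripple through at low pitches.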

I have heard people talking about doing amazing things with these tools, like isolating an instrument based on knowing its stereo position in a sound file, but I haven't done a whole lot with feature extraction myself; I just know about it from following the mailing list and observing the conversations there.

The interesting, and frustrating, thing with csound is that it is very accessible as a research tool for implementing experimental algorithms, so many of the opcodes were PhD projects of various contributors. This means that there are many exciting experimental things that show up there first, and many half-finished or suboptimal algorithms with various overlaps of functionality between them. These are the ones I found while doing a quick run through the manual to remind myself of them; I may have left some important ones out.

If you already mostly get max, the underlying model (control signals, audio signals, data flow processing) is identical; it is just represented textually in csound instead of visually as in max. For some purposes this makes things much easier; for others it makes for a steeper learning curve at the very least.
posted by idiopath at 2:06 PM on March 23, 2011 [1 favorite]

Best answer: I forgot to include the tl;dr

the "pvs" family of opcodes, the "spec" family, and "pitch"/"pitchamdf" seem like the best bets - I have used pvs quite a bit and it is great for real time grabbing and mutating of spectrums but I have not done much with feature extraction.

Also, on a pragmatic side, many of these algorithms are CPU bound in terms of realtime performance, so you may get better results if instead of using the embedded csound~ object you use OSC and run separate algorithms (they each have upsides and downsides) on separate computers.
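To give a feel for how little is involved in shipping feature data between machines, here is a sketch of packing a float-only OSC message by hand, following the OSC 1.0 layout (null-padded address, type tag string, big-endian 32-bit floats). The "/voice/pitch" address, the host, and the port in the commented usage are hypothetical; the receiving patch would have to be set up to listen for them.

```python
import struct


def _osc_string(text):
    """Encode an OSC string: ASCII, null-terminated, padded to 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_message(address, *values):
    """Build an OSC message whose arguments are all 32-bit floats."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", v) for v in values)
    return _osc_string(address) + _osc_string(type_tags) + payload


packet = osc_message("/voice/pitch", 440.0, 0.5)

# Sending it over UDP (host and port here are placeholders):
#   import socket
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.sendto(packet, ("192.168.0.10", 7400))
```

Each analysis machine can send its own addresses this way, and the receiving side only has to route on the address string.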

Also, it occurs to me just now that there is also sphinx, which is open source, so I would be surprised if there were any difficulty accessing its internal data, and though it is focused on speech recognition, many of the problems of speech recognition and music feature extraction are identical. I have heard of sphinx and it looks very active (google summer of code project and all), but I have no first hand knowledge of it.
posted by idiopath at 2:15 PM on March 23, 2011

Response by poster: Thanks for all the great info, idiopath! And you guessed right early on: I'm working on something in the auto-accompaniment vein.

In case it's helpful to anyone else, here are some recommendations I've received from outside metafilter:

MAX/MSP externals by Miller Puckette, David Zicarelli, etc: http://crca.ucsd.edu/~tapel/software.html

MAX/MSP externals by Tristan Jehan: http://web.media.mit.edu/~tristan/maxmsp.html
posted by moonmilk at 2:35 PM on March 23, 2011

That's pretty funny; Tristan Jehan is the co-founder of EchoNest!
posted by mkb at 5:35 PM on March 23, 2011

UCL's Psychology and Language Sciences dept. has some useful software for working with speech/voice.

(I really need to take a closer look into csound)
posted by MrFTBN at 4:14 PM on March 24, 2011

Response by poster: Interesting speech software - thanks, MrFTBN!
posted by moonmilk at 2:40 PM on March 26, 2011
