Tracking individual voices.
February 1, 2006 2:32 AM Subscribe
Does software (or hardware) exist to identify individuals from a given set by voice print? If so, can it identify multiple individuals simultaneously?
I'm not particularly interested in being able to have a computer understand what is being said, but rather if it can sort the voices. Another similar thought is if there is a way to take a room full of people conversing, and break it into "tracks" by person.
Now, separating out multiple speakers is a whole different ball game. Imagine you have two sets of numbers, corresponding to intensity over time:
alice: [0,100,-100,800,-10,20]
bob: [100,8,32,-10,-1000,16]
Imagine Alice and Bob are speaking over each other. Your microphone ends up picking up the sample-by-sample sum:
alice+bob: [100,108,-68,790,-1010,36]
What you're asking for is to extract Alice and Bob's set of numbers, given only Alice+Bob. Now, Alice+Bob contains some hints about the source signals, but they're not great.
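To make the mixing concrete, here's a minimal sketch (mine, not from the thread) of what the single microphone actually records: just the element-wise sum of the two voices, with no record of who contributed what.

```python
# Two toy "intensity over time" tracks, as in the example above.
alice = [0, 100, -100, 800, -10, 20]
bob = [100, 8, 32, -10, -1000, 16]

# A lone microphone hears only the sample-by-sample sum -- the
# individual tracks are gone, which is what makes separation hard.
mixture = [a + b for a, b in zip(alice, bob)]
print(mixture)  # [100, 108, -68, 790, -1010, 36]
```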
How do we do it? Speech is just absurdly redundant, and we aggressively search for channels that aren't being masked by other noise sources. You can chop out high frequencies, or low frequencies, or e-v-e-n s-m-a-l-l chunks of time in between words, and you'll still get the message. In fact, visual and auditory pathways interact in the brain -- near thalamic structures like the Lateral Geniculate Nucleus (LGN) -- so that what we see of people's lips moving actually informs / subtly alters how we hear what they say. This is how we're able to disambiguate voices amongst a large room of people conversing -- effectively, we train our auditory system to look for signals that correlate strongly with the lip motions we're seeing.
Could this be done w/o optical assistance, and with just a really really good algorithm? Possibly. A computer could be very highly trained to look for the formants from a particular speaker, and for formant chains that correspond with that speaker's language. Particularly with multiple microphones some distance from each other, a system might be able to disambiguate each speaker individually. But reconstructing the track would be messy -- you'd need to resynthesize the rest of the lost signal from the few "cues" that were actually extracted.
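As an illustration of why multiple microphones help: in the idealized case where you have as many mics as speakers and the mixing is linear and *known*, separation reduces to solving a linear system. The mixing weights below are my own made-up assumption; real rooms add delays, echoes, and an unknown mixing matrix, which is why blind source separation techniques (e.g. ICA) exist.

```python
import numpy as np

# The two toy tracks from the example above.
alice = np.array([0.0, 100.0, -100.0, 800.0, -10.0, 20.0])
bob = np.array([100.0, 8.0, 32.0, -10.0, -1000.0, 16.0])
sources = np.vstack([alice, bob])

# Hypothetical mixing: each mic hears a different weighted blend.
A = np.array([[1.0, 0.6],   # mic 1 sits closer to Alice
              [0.4, 1.0]])  # mic 2 sits closer to Bob
mics = A @ sources

# With the mixing known and invertible, the tracks come right back.
recovered = np.linalg.inv(A) @ mics
print(np.allclose(recovered, sources))  # True
```

With one microphone the system is underdetermined (one equation, two unknowns per sample), which is exactly the cocktail-party problem described above.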
This is feasible; Praat (the audio manipulation framework of the gods) has a full PSOLA-based analysis and resynthesis engine. It's really cool.
Ultimately, it's probable that systems like you describe have actually been built, that they struggle mightily with cocktail parties but have experienced a revolution with the increase in computing power, and that you'll never ever ever see one because they're classified beyond belief :)
posted by effugas at 3:10 AM on February 1, 2006