What should I use to convert speech to text?
June 22, 2006 3:02 PM

I need to convert speech on wav files into text and I don't want to do it via manually transcribing them. Do you have any suggestions or stories from experience?

Oh, and either Mac or PC solutions are fine, too. Thanks!
posted by stevis to Computers & Internet (19 answers total)
Yes, give up and do it manually, or hire someone. Voice-to-text barely works well enough when the software is trained to a particular voice, that voice speaks into a quality microphone from a close distance, and the vocabulary and syntax are standard written style. What you want remains, in my extensive experience, a pipe dream.
posted by fourcheesemac at 3:14 PM on June 22, 2006

fourcheesemac is right, the solution you seek does not yet exist. When manually transcribing, it will help to use some audio speed-control software such as Express Scribe (Windows/Mac). An alternative is to use a service bureau that specialises in such work.
posted by cbrody at 3:25 PM on June 22, 2006

Hire a transcriber. Really, there is no other way, and transcribers will still have trouble with getting all of the text down for interviews.
posted by blahblahblah at 3:25 PM on June 22, 2006

Get voice recognition software. Take a couple hours to train it to your voice. Then put on headphones and as you listen to your files, speak the words at the same time. Then the voice recognition software should get it maybe 95% right. That's only one word wrong for every 20!
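
To put the 95% figure in perspective, here is a minimal sketch of the arithmetic. The function names and the 9,000-word example (roughly an hour of speech at an assumed 150 words per minute) are illustrative assumptions, not figures from this thread.

```python
def words_per_error(accuracy_percent: int) -> float:
    """Average number of words between errors at a given per-word accuracy."""
    return 100 / (100 - accuracy_percent)


def expected_errors(word_count: int, accuracy_percent: int) -> float:
    """Expected number of misrecognized words in a transcript."""
    return word_count * (100 - accuracy_percent) / 100


# 95% accuracy means one error per 20 words on average.
print(words_per_error(95))

# Hypothetical: an hour of speech at an assumed 150 words per minute.
print(expected_errors(150 * 60, 95))
```

So even at 95%, an hour-long recording under those assumptions would still leave a few hundred words to correct by hand.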
posted by CrazyJoel at 3:45 PM on June 22, 2006

CrazyJoel - no wonder they call you CrazyJoel, you're crazy!
posted by thilmony at 3:53 PM on June 22, 2006

Well, it should save him from typing most of the words. It's a pretty good solution, I think.
posted by CrazyJoel at 4:00 PM on June 22, 2006

Go to the Mechanical Turk. Offer a modest fee (say, ten bucks) per 1/2 hour or so of speech. Publish transcribing guidelines to have the work done consistently. Wait a while. Give the output a quick pass through, then pay the folks.
posted by gage at 5:40 PM on June 22, 2006

Rather than going the Turk route, there are a number of specialised offshore transcription services.

I've used one before but can't remember the name and I'm currently at home. I remember it was charged at 1 US cent per word. I sent them an mp3 file, they emailed the transcription back within an hour. Bloody useful. A google for "transcription" should sort you out.
posted by blag at 5:53 PM on June 22, 2006

Transcription services are not cheap. idictate.com charges 1 cent per word, which is around $25 an hour. They mail you back the text after around 1 hour. At that price, you do better hiring a temp. You can find temps for less than $25 per hour for sure. Try craigslist.
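
The per-word versus per-hour comparison can be sketched as below. This is only the arithmetic under stated assumptions: the function names are hypothetical, and the post does not say whether "an hour" means an hour of audio or an hour of typing, so the code just shows what word count the $25 figure implies.

```python
def service_cost_dollars(word_count: int, cents_per_word: int = 1) -> float:
    """Cost of a per-word transcription service, in dollars."""
    return word_count * cents_per_word / 100


def implied_words_per_hour(dollars_per_hour: int, cents_per_word: int = 1) -> float:
    """Word count per hour at which a per-word price equals an hourly price."""
    return dollars_per_hour * 100 / cents_per_word


# At 1 cent per word, $25 an hour corresponds to 2,500 words per hour.
print(implied_words_per_hour(25))
print(service_cost_dollars(2500))
```

If your recordings hold more words per hour than that, the per-word service costs more than $25 for each hour; if fewer, it costs less.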

The only voice recognition software that has a chance of handling such a difficult task is Dragon NaturallySpeaking.

Fourcheesemac is right. Dragon needs recordings of a clearly spoken, neutral-accent voice recorded in a silent room with a good ($50-$100) microphone. Compare your recordings with the example recordings at emicrophones, to see if you stand a chance.
posted by gmarceau at 8:51 PM on June 22, 2006

There is no automated solution that does not require extensive training (for example, Dragon NaturallySpeaking).

I've found four instances of this question coming up before (by searching Yahoo for text and speech, with a few added terms to try and clean up the mess): Please add to this list if you can. If I can remember, I'll check in a few days and add it to the wiki (anyone else is welcome to do the grunt work if they want, :P).
posted by Chuckles at 9:52 PM on June 22, 2006

Best answer: Almost any naturally occurring speech you record, unless it is solo dictation, will contain overlapping sections, incomplete sentences and words, many placeholders and pause markers, and lots of other good stuff. Believe me, if it were at all possible, the people in my profession would know about it (let's just say we are obsessive transcribers of naturally occurring discourse).

There is no way to do this unless you have an absurdly controlled environment and expensive technology and time to train the system and endless time to correct all the errors. In the end, there is no point to it. Software can make life easier -- slow speech down without changing the pitch, allow you to loop sections repeatedly, etc. There are foot-operated computer-based transcribing systems too, which add a lot of speed -- your hands don't leave the keyboard.

But it is a total pipe dream to imagine taking a human brain and ears and typing fingers out of the process. We are probably at least a decade from having anything even close that works.

People have experimented with this in controlled settings, with every conversant wearing a high quality lavalier mic, etc., low-reflectivity rooms, limited vocabularies, etc. It is still impossible for reasons that have to do with how much more complex natural social speech is than dictation, among other things. "Naturally Speaking" is a complete hyperbole. Just try "speaking naturally" into your Dragon system, not as if you were speaking for the purpose of generating a clear written document. Good luck correcting the typos.

Can. Not. Be. Done.
posted by fourcheesemac at 11:10 PM on June 22, 2006

posted by fourcheesemac at 11:11 PM on June 22, 2006

I don't think it's a total pipedream. Voice recognition software is put to work in tons of hospitals where doctors create medical reports. And without transcriptionists.
posted by CrazyJoel at 4:01 AM on June 23, 2006

CrazyJoel, I am intimately familiar with what is done in hospitals. I've seen every step of the process. Doctors speak into dedicated voice recorders with excellent microphones held a few inches from their mouths, slowly and using a limited vocabulary already trained into the system. Those recorders are sent downstairs to transcription where they are uploaded and VR'd in Dragon NaturallySpeaking (in almost every setting I have ever seen), and then hand corrected by transcriptionists, because errors in these things can be the basis for huge lawsuits. Nowhere in medicine, that I am aware of, is the process completely automated. The rish would be too high.

That is a far cry from auto-transcribing naturally occurring conversational discourse from a recording. By "pipedream," I mean about a decade out, at least. Not impossible, but not yet possible.
posted by fourcheesemac at 10:50 AM on June 23, 2006

risk, sorry
posted by fourcheesemac at 10:51 AM on June 23, 2006

Automatic transcription is far beyond even human capabilities, unless the transcribing machine is given the ability to interrupt a conversation and say "I'm sorry, could you repeat that." Even then, human conversation only requires that you understand the meaning; understanding each individual word is not that important. Hell, the mapping from spoken language to written language isn't even 1:1 (remember the Budweiser whazzup commercial).

So sure, voice recognition will improve a lot, but if you consider the full scope of the problem.. Well, I don't like to say impossible, but close enough.
posted by Chuckles at 11:28 AM on June 23, 2006

Chuckles has it, again. Are you a linguist? Transcription of oral discourse involves a very complex conventionalization of a huge range of oral discourse features into writing, and generally ignores many other oral discourse features necessary to natural conversational communication (except for linguists who work on this subject, and who try to formalize the representation of oral discourse features in writing as a scientific goal). It is not a matter of recognizing words, alone. Dictation typically involves language removed from a conversational context and largely devoid of features that make ordinary language intelligible. A human mind has to process the conversion. The rule descriptions required to do a systematic transcription of even very standardized oral discourse events -- let's say a board meeting or a trial hearing -- are so complex that no one has yet fully described them. They may not be describable. They involve very gradient acoustic and gestural phenomena as well as the discrete grammatical concepts of phoneme, word, etc. These gradient and non-audible features *replace* words and context-free referential language in most natural oral communication. It is amazing how much reconstruction of meaning we do when we transcribe audio of natural speech "faithfully" by recognizing features of the discourse that are not discoverable by a discrete analysis, that do not lie in the lexical domain, etc. The processing power required to replace a normal human brain making sense of spoken conversation is enormous. The rules are barely understood. I think "pipe dream" is no exaggeration.
posted by fourcheesemac at 6:00 PM on June 23, 2006

Electrical engineer, actually. Normally my spelling and punctuation make that obvious :)

I am kind of interested, in an amateur way, by these high level signal processing problems.. I keep hearing highly optimistic predictions of what will be possible with biometrics, for example. I don't really believe the claims, but I'm not sure, so..

Of course a transcription system doesn't have to be perfect. Maybe, if the computer is smart enough, you won't mind if it asks you to repeat a phrase. But, when I thought about that, I realised I often fail to understand things, but I don't always ask for it to be repeated. Understanding normally comes eventually anyway..
posted by Chuckles at 7:43 PM on June 23, 2006

Right, because we have context sensitivity to a million variables that allow us to infer meanings that would otherwise be lost or unclear or ambiguous, etc. when we engage in normal oral conversational discourse, interviews, or even formal speech events. This is basically a problem of artificial intelligence at a fairly high level. It sounds like you, like a linguist, have the proper awe of language and are not enslaved by the hegemony of the written word when you think about it. Speaking (or signing) "normally" -- in social discourse -- is one of the most computationally complex things humans do, and all of us do it all the time without ever needing to be taught. We use massive processing power to understand speech as well. The variables to which we attend are so diverse and complex that describing them is a project for the ages, not the moment.
posted by fourcheesemac at 1:01 AM on June 24, 2006
