How to record and transcribe all conversations
July 9, 2013 8:00 PM

Suppose I would like to automatically record and transcribe all conversations that happen around me. Is this technically feasible? What products and services would you use to accomplish this goal?

What I envision is something like this: in the morning when I get up, I attach some small device to my clothes or pack it in my bag. When a conversation takes place in my vicinity, a voice-activated microphone records it and saves it digitally. After the conversation is over, the recording is saved in a file annotated with date and time and perhaps other metadata (GPS?). When I hit a wifi network, the file is uploaded to the cloud and transcribed with rough accuracy, and the transcript is saved and emailed to me.
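As a rough illustration of the "annotated with date and time and perhaps other metadata" step, here is a minimal Python sketch that writes a small JSON file next to each recording. The field names (`audio_file`, `started_at`, `gps`) and the helper names are made up for illustration; nothing here comes from a particular recorder or app.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def sidecar_for(audio_path, started_at, latitude=None, longitude=None):
    """Build a metadata record for one recording (field names are hypothetical)."""
    record = {
        "audio_file": Path(audio_path).name,
        # Store the start time in UTC so files from different places sort consistently.
        "started_at": started_at.astimezone(timezone.utc).isoformat(),
    }
    # GPS is optional; only include it when both coordinates are known.
    if latitude is not None and longitude is not None:
        record["gps"] = {"lat": latitude, "lon": longitude}
    return record


def write_sidecar(audio_path, record):
    """Write the record next to the audio file as <name>.json and return that path."""
    out = Path(audio_path).with_suffix(".json")
    out.write_text(json.dumps(record, indent=2))
    return out
```

Keeping the metadata in a separate file means the audio itself never has to be rewritten, whatever format the recorder produces.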

I don't need all the details to work as I've described; that's more like an ideal. What I want is advice on how to implement some rough approximation of what I'm describing without heroic expense.

Possibly relevant information:
  • Cost is certainly an object. I'd like to do this for not too much cash outlay. I imagine buying the recording equipment could run $100 or so, and the computing maybe $1/day. Solutions in this cost range would be great.
  • I have an iPhone, so iPhone apps or accessories are fair game.
  • Technically involved solutions are fine; I have time and can get help.
I do not need advice as to whether this is a cool idea, or advice as to its legality.

posted by grobstein to Technology (13 answers total) 8 users marked this as a favorite
I guess the difficulty and expense of implementing your idea depends largely upon how automatic you want the process to be.

There are quite a few free and low-cost iPhone apps which will record audio at various bitrates; you could clear out some memory on your phone and run such an app all day, then, at the end of the day, transfer the file to a computer and run a voice-detection algorithm on it to save only the relevant parts of the recording.
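To get a feel for how much memory "all day" takes at a given bitrate, here is a back-of-the-envelope sketch. The 16-hour day and the three bitrates are assumptions chosen for illustration, and the arithmetic assumes constant-bitrate audio.

```python
def recording_size_mb(hours, bitrate_kbps):
    """Approximate size in megabytes (10**6 bytes) of a constant-bitrate recording."""
    bits = bitrate_kbps * 1000 * hours * 3600  # kbps -> bits/s, hours -> seconds
    return bits / 8 / 1_000_000                # bits -> bytes -> MB


for kbps in (32, 64, 128):
    print(f"{kbps} kbps for 16 h: {recording_size_mb(16, kbps):.0f} MB")
```

At 64 kbps a 16-hour day works out to roughly 461 MB, so the size of the recording is probably not what limits you.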

I am not aware of any currently-existing apps which would do automatic uploads of these files when your phone connects to a Wi-Fi network, but it seems like something along these lines could be coded relatively easily.

Per your request, I won't offer an opinion as to whether or not this is a wise course of action.
posted by Juffo-Wup at 8:17 PM on July 9, 2013

Heard buffers the past 5 minutes, but you have to tell it to commit that to memory.
posted by disillusioned at 8:20 PM on July 9, 2013 [1 favorite]

Oh, I'm sorry - I forgot to address the transcription part of your question.

Unless your transcription process involves hiring a human being to listen to the audio file and transcribe it, the quality/intelligibility of the transcribed text will be... minimal, at best.

Unlimited-vocabulary speech-to-text systems do not work well, no matter how much money you pay for them. This is a "hard" problem in computer science and is not likely to be "solved" soon.

The degree to which this is (or is not) a problem for you will depend upon your intended use case for the data.
posted by Juffo-Wup at 8:20 PM on July 9, 2013 [4 favorites]

Response by poster: "Unlimited-vocabulary speech-to-text systems do not work well, no matter how much money you pay for them. This is a 'hard' problem in computer science and is not likely to be 'solved' soon."


I would like automatic transcription with no human in the loop. I find the voicemail transcripts produced by Google Voice to be good enough to be useful -- that sort of quality would be great. (I'm open in principle to low-cost human transcription options, but I don't think it would be economical for every conversation.)
posted by grobstein at 8:36 PM on July 9, 2013

Response by poster: Note also that perfect 24-hour coverage is not necessary -- what I'm looking for is something close to total coverage, with something close to total automation.
posted by grobstein at 8:37 PM on July 9, 2013

Ohh, if Google Voice type transcripts are good enough, you might look at CMU Sphinx / VoxForge / etc.

Otherwise, I suspect Amazon Mechanical Turk is probably the cheapest human-involved option available.
posted by Juffo-Wup at 8:48 PM on July 9, 2013

How are you defining "all conversations" and "around me"? And how unobtrusive does your rig need to be? (These are rhetorical questions, things for you to think about, they don't necessarily need an answer here.)

Because one potential technical consideration is that many microphones are directional, and all microphones are subject to the inverse-square law as it applies to acoustics, where a doubling of distance between the sound source and the microphone decreases the sound intensity at the microphone by a factor of 4 (a drop of about 6 dB). IOW, being even a couple of feet away from the conversation you're trying to record wouldn't give you enough level for a usable recording. Especially if you're attempting this with, say, your iPhone in your coat pocket.
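The inverse-square relationship can be put into numbers with a short Python sketch. It assumes an idealized point source in a free field (no walls, no pocket, no background noise), so real-world losses will generally be worse. The function names are just for illustration.

```python
import math


def intensity_ratio(near, far):
    """Fraction of sound intensity left when moving from `near` to `far`
    (same units for both) away from a point source in a free field."""
    return (near / far) ** 2


def level_change_db(near, far):
    """Change in sound level, in dB, for the same move (negative means quieter)."""
    return -20 * math.log10(far / near)


for feet in (2, 4, 8):
    print(f"1 ft -> {feet} ft: {intensity_ratio(1, feet):.3f} of the intensity, "
          f"{level_change_db(1, feet):.1f} dB")
```

Each doubling of distance leaves a quarter of the intensity, which is about 6 dB less level at the microphone.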

And of course, cloth can absorb sound, and human bodies can absorb or block sound waves, and other physical things in the environment can absorb or reflect or diffuse or block sound waves.

So if this is for like an art-project thing where you're just looking to capture the vibe of conversations you pass on the street, an iPhone app or the "concert taper" trick of clipping two small mics to the brim of a ball cap connected to a small digital recorder in your pocket would probably be fine.

But if you genuinely want (for whatever reasons) a clear record of all conversations that happen in a 10 foot radius & 360 degrees around you, that's kind of a whole other ball of wax.
posted by soundguy99 at 9:04 PM on July 9, 2013 [1 favorite]

First of all, an app + a smartphone in your pocket would not work. Their mics are not set up for collecting conversation at a distance or through the layers of cloth that would be present.

I don't know of a "package" which covers all of your requirements. There are components available which you can use.

1. Call recording: What you would need is specialized equipment which can record and send out the conversation using a SIM card. Someone needs to call the SIM card to listen to the call. It's available on Amazon; just search for "audio recorder surveillance" ...

2. Ability to call the above number and record the call: there are multiple web sites which can record incoming/outgoing calls on a number. Just Google for them, pick one.

3. Transcribing the calls: As others have said, you can use one of the software packages mentioned above for transcription. I hope you don't need instant transcription. If you have some programming chops, or can pay a programmer, you can have scripts written to download the call recordings from the above websites and run them through the transcription software. Google has a paid API which can be used for transcription.
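A script of the kind described in step 3 might look roughly like this Python sketch. It only covers the "run every downloaded recording through the transcription software" half. The `transcribe` argument is a placeholder for whatever engine you end up using, and the `*.wav` pattern is an assumption about how the recordings are saved.

```python
from pathlib import Path


def transcribe_folder(folder, transcribe, pattern="*.wav"):
    """Run `transcribe(path) -> str` on each recording in `folder` that has no
    transcript yet, and save the text next to it as <name>.txt.

    `transcribe` is a stand-in: wrap your chosen speech-to-text tool in a
    function that takes a file path and returns the transcript as a string.
    """
    written = []
    for audio in sorted(Path(folder).glob(pattern)):
        out = audio.with_suffix(".txt")
        if out.exists():
            continue  # already transcribed on an earlier run
        out.write_text(transcribe(audio))
        written.append(out)
    return written
```

Because finished files are skipped, the script can be run repeatedly (say, nightly) without redoing old recordings.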
posted by TheLittlePrince at 10:05 PM on July 9, 2013

Smartphones draw too much power to make them practical for day-long recording. Maybe one of these USB thingies since they can just be left on continuously. Could be combined with location data captured with some other app on a smartphone.

The USB thingie could be plugged into a PC occasionally to process the audio data (Microsoft Speech API is the only free recognizer I'm aware of that works out-of-the-box, can't vouch for its quality).

Now for something to detect speech and split utterances, that's not a solved problem in the context of multiple speakers in a noisy environment. Maybe one could search GitHub for "voice activity detection".
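For a sense of what the simplest kind of voice activity detection looks like, here is a naive energy-threshold sketch in Python. It is only a baseline: it marks any stretch that is loud enough, so it cannot tell speech from traffic or separate overlapping speakers. The frame length and threshold are values you would have to tune by hand, and the function names are made up.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of consecutive non-overlapping frames (a trailing partial frame is dropped)."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_segments(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of frames whose RMS is >= threshold."""
    segments = []
    start = None
    for n, rms in enumerate(frame_rms(samples, frame_len)):
        if rms >= threshold and start is None:
            start = n * frame_len                    # a loud run begins
        elif rms < threshold and start is not None:
            segments.append((start, n * frame_len))  # the run ended at this frame
            start = None
    if start is not None:
        # The recording ended while still loud; close the run at the last full frame.
        segments.append((start, (len(samples) // frame_len) * frame_len))
    return segments
```

With samples read from a WAV file, the returned index pairs can be used to cut out the quiet stretches before anything is transcribed.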
posted by RobotVoodooPower at 10:26 PM on July 9, 2013

I have an Olympus voice recorder that has a mode to only record when noise levels hit a certain threshold. That plus a butterfly mic would work fairly well for recording just your own speech. Last I checked, it was one of the top five on Amazon. Btw, this was kind of an arcane mode, buried deep in the user manual.

The real challenge is to consider when on earth you'll find the time to listen to all of the things you talked about all day!
posted by oceanjesse at 11:28 PM on July 9, 2013

I did some pretty exhaustive research on this question for a local attorney who had an enormous collection of taped conversations from a local police officer, and I could find no speech recognition program that provided acceptable transcription quality. Apparently, many police officers routinely carry voice-activated digital recording devices, so that aspect of the problem is easily solved. The best results we found were from Dragon NaturallySpeaking, which would accept MP3 files as input and was the most effective at identifying the actual speech components of the recordings, but it still had such massive error rates as to be unacceptable to my client. We looked at a number of other products on Windows and Linux and found them to be even worse than Dragon for this purpose. I did this research in 2010, so maybe there are newer programs that are more effective, but I suspect not.
posted by Lame_username at 3:40 AM on July 10, 2013 [1 favorite]

Mod note: As the OP mentions, the legality of this undertaking is not a topic under consideration here.
posted by goodnewsfortheinsane (staff) at 7:17 AM on July 10, 2013

Another thing to consider is that any automated transcription will not be able to distinguish individual speakers and break out the transcription like a play.

I worked in speech recognition for 12 years until recently, and anything computer-based will give you terrible results for "out in the world" situations. Siri, Google Now, and Google voicemail transcriptions all rely on the mic being close to the speaker's mouth.

You can do some experiments yourself to see the kind of quality degradation you'll get. Go outside to an area you feel is representative of the kind of place you'll want to do the real recordings. Put your phone in speakerphone mode, make a call to a Google Voice number, and leave a message. Repeat this a number of times at various distances. Use the same script each time but include an identifier at the start, like how far away you are. For an even more realistic test, do this in a place where there are multiple conversations happening at once.

To measure accuracy, see how many words Google Voice missed from the script.
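One way to put a number on this is a word error rate: the fewest word substitutions, insertions, and deletions needed to turn the transcript back into your script, divided by the number of words in the script. A minimal Python sketch follows. It lowercases both texts but does not strip punctuation, so clean that up first if the transcripts include it.

```python
def word_errors(reference, hypothesis):
    """Fewest word substitutions, insertions and deletions needed to turn
    `hypothesis` into `reference` (edit distance counted in words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # word missing from the transcript
                           cur[j - 1] + 1,            # extra word in the transcript
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Errors divided by the number of words in the reference script."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference script is empty")
    return word_errors(reference, hypothesis) / n
```

Scoring the same script at each distance gives you a curve of error rate against distance, which is easier to compare than eyeballing the transcripts.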

Ambient noise, distance from the mic, and overlapping conversations will cause high failure rates.
posted by reddot at 7:17 AM on July 10, 2013 [2 favorites]
