Video Transcription
August 9, 2024 7:58 AM

I have a bunch of hour-long Zoom video interviews to transcribe. How?

Video timecodes are needed - not too granular - just a timecode every 2 minutes or so would be fine.

Efficient page layout for readability - Some software breaks up the dialogue into tiny sentence fragments meaning you can't skim it quickly. I want the transcript to look more like paragraphs so it's very readable.

My footage is about 40 one-hour Zoom interviews, recorded on Zoom with full permission. Some are already recorded, some still to come.

The videos are 2 people speaking (interview)... so if the software can break it down by person, that would be awesome. But not absolutely necessary.

The transcript is needed for video editing purposes, and the transcript won't be published, so spelling or formatting errors aren't a big deal. As long as it's readable by the editors so they can make rough choices of what parts of the video to keep.

I can pay... but free would be awesome!

What should I use? Thanks!
posted by nouvelle-personne to Computers & Internet (11 answers total) 4 users marked this as a favorite
 
I'm not positive either will be exactly what you need because I haven't used them with recorded Zoom videos, but check out Descript and Fireflies.ai. They both work great for audio, and in Descript there's definitely an option to export the transcript as linear plaintext.
posted by seemoorglass at 8:24 AM on August 9


I found TranscribeMe to be excellent with Zoom files.
posted by jacobean at 8:31 AM on August 9


I've used Otter.ai for this exact purpose, and it works quite well. I believe there's a "free trial" level.

You'll have to download the Zoom audio file, but that's very easy to do.
posted by Dr. Wu at 8:53 AM on August 9 [2 favorites]


My partner John has been using Word to transcribe Zoom meeting recordings with good results.
posted by olopua at 10:05 AM on August 9


For the recordings "to come", you can turn on Zoom's transcription service.
posted by garbanzilla at 12:12 PM on August 9 [1 favorite]


Trint also does this quite well and I believe they have a free trial.
posted by Mender at 1:05 PM on August 9


I've used Whisper for transcribing Zoom calls (and other things, like podcasts), and I'm quite impressed with it.

Here's a few seconds of output from a video:
00:00.000 --> 00:13.000
Welcome back to the ECA live session to create another sample model for the ECA library.

00:13.000 --> 00:28.000
Today we want to build a fairly simple notifications example where we want to send emails to all users of a specific role.

00:28.000 --> 00:37.000
Now I remember that we had done something fairly similar already but not quite so.

00:37.000 --> 00:56.000
So what this demo also includes how to use an existing model, clone that, make it a new one and then modify it towards your needs that are slightly different to the original example.
(It automatically outputs .txt, .srt, and .vtt for me.)

MeFite brainwane wrote a great blog post about using Whisper.

Whisper is open source and free, and you can install it on your local computer.
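
Since the original question asks for paragraph-style text with a timecode every 2 minutes or so, here is a minimal sketch of how the .vtt output could be regrouped that way. It assumes cue timing lines shaped like the sample above (hours optional); the function names parse_vtt and to_paragraphs and the 120-second default are made up for illustration and are not part of Whisper.

```python
import re

# Matches WebVTT cue timing lines such as "00:13.000 --> 00:28.000".
# The hours field is optional, as in the sample output above.
CUE_RE = re.compile(r"^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+\S+")


def parse_vtt(vtt_text):
    """Return a list of (start_seconds, text) tuples, one per cue."""
    cues = []
    start, lines = None, []
    for raw in vtt_text.splitlines() + [""]:  # trailing "" flushes the last cue
        line = raw.strip()
        match = CUE_RE.match(line)
        if match or not line:
            # A new timing line or a blank line ends the cue collected so far.
            if start is not None and lines:
                cues.append((start, " ".join(lines)))
            start, lines = None, []
            if match:
                h, m, s, ms = match.groups()
                start = int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
        elif start is not None:
            lines.append(line)
    return cues


def to_paragraphs(cues, interval=120.0):
    """Merge cues into paragraphs, starting a new one about every `interval` seconds."""
    paragraphs = []  # each item is [start_seconds, [cue texts]]
    for start, text in cues:
        if not paragraphs or start - paragraphs[-1][0] >= interval:
            paragraphs.append([start, [text]])
        else:
            paragraphs[-1][1].append(text)
    out = []
    for start, texts in paragraphs:
        total = int(start)
        stamp = f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
        out.append(f"[{stamp}] " + " ".join(texts))
    return "\n\n".join(out)
```

To try it, read a .vtt file into a string and print to_paragraphs(parse_vtt(text)). Each paragraph starts with the timecode of its first cue, so the stamps land roughly, not exactly, every 2 minutes.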
posted by kristi at 3:37 PM on August 9


I've used alllll the tools but Turboscribe has my heart - affordable, just works, and can cope with my accent (unlike Otter).
posted by socky_puppy at 6:16 PM on August 9


It's too bad that you couldn't have used Zoom's built-in transcription option. If you already have or are willing to pay for access to Office 365, the online version of Word does a decent job of making an initial transcript. It still needs to be read and edited but that is probably the case with nearly every service or tool if your transcripts have any jargon or context-specific language. I particularly like that Word has a few different formatting options - my recent interview projects haven't needed timecodes and Word will happily omit them if you select the appropriate option.

If you're considering free tools (or any tool, really), you may want to pay attention to their terms of service if you're working with confidential data or promised your participants confidentiality. My employer provides access to Office 365 and that includes a confidentiality agreement with Microsoft so that was another reason why we used that specific tool to create our initial transcripts.
posted by ElKevbo at 12:34 PM on August 10


I had this problem yesterday, and there are several apps you can try. A few minutes are always free:

https://app.notta.ai/

https://app.maestra.ai/

https://otter.ai/signup

This was on Hacker News yesterday: https://otranscribe.com/
I don't really get it. I think it helps you transcribe by hand.

Free is tricky. In the end I ran this Python script (it uses Vosk) from the Linux shell to get it done for free. The result was not great, but good enough for my purpose:

import sys
import wave
import json
from vosk import Model, KaldiRecognizer

def transcribe_audio(audio_path, model_path):
    # The check below only accepts uncompressed 16-bit mono PCM WAV input.
    wf = wave.open(audio_path, "rb")

    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getframerate() not in [8000, 16000, 32000, 44100, 48000]:
        print("Audio file must be WAV format mono PCM.")
        wf.close()
        return None

    model = Model(model_path)
    recognizer = KaldiRecognizer(model, wf.getframerate())
    recognizer.SetWords(True)  # include per-word timestamps in each result

    results = []

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        # Keep only finalized utterances. Partial results have no 'text'
        # key, so collecting them only added empty strings.
        if recognizer.AcceptWaveform(data):
            results.append(json.loads(recognizer.Result()))

    results.append(json.loads(recognizer.FinalResult()))
    wf.close()

    # Combine the non-empty results into a single string
    transcript = " ".join(res["text"] for res in results if res.get("text"))
    return transcript

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 transcribe.py AUDIO_PATH MODEL_PATH")
        sys.exit(1)

    audio_path = sys.argv[1]
    model_path = sys.argv[2]

    transcript = transcribe_audio(audio_path, model_path)
    if transcript is None:
        sys.exit(1)
    print("Transcription: ", transcript)
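
The script above prints one long string, but because it calls SetWords(True), each finalized result should also carry per-word timings. Below is a rough sketch of how those could be turned into blocks with a timecode every 2 minutes or so. It assumes each decoded result may hold a "result" list of entries with "word" and "start" keys, which is my understanding of Vosk's output and worth checking against your own results. The function names words_to_blocks and format_blocks are invented for this example.

```python
def words_to_blocks(results, interval=120.0):
    """Group word entries into blocks spanning roughly `interval` seconds.

    `results` is a list of dicts decoded from recognizer.Result() and
    recognizer.FinalResult(). Assumption: with SetWords(True), each dict
    may contain a "result" list of {"word", "start", "end", "conf"} entries.
    """
    blocks = []  # list of (block_start_seconds, [words])
    for res in results:
        for entry in res.get("result", []):
            # Open a new block once the current one is `interval` seconds old.
            if not blocks or entry["start"] - blocks[-1][0] >= interval:
                blocks.append((entry["start"], []))
            blocks[-1][1].append(entry["word"])
    return blocks


def format_blocks(blocks):
    """Render blocks as paragraphs prefixed with an [MM:SS] timecode."""
    lines = []
    for start, words in blocks:
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] " + " ".join(words))
    return "\n\n".join(lines)
```

Inside transcribe_audio, the results list could be passed to words_to_blocks in place of the final join. For recordings longer than an hour the minutes field simply keeps counting past 59.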

posted by maloqueiro at 2:42 AM on August 12


At my job we use rev.ai for this exact purpose. It ends up being maybe a buck for the type of video you describe. It's an API that requires some amount of programming chops to use (or at least, a willingness to learn curl), so it may not be ideal for you. But it does speaker identification and will even let you pay extra for things like translation or sentiment analysis. We use the results in WebVTT format, but I think they support others like plaintext and their own JSON-based one.
posted by axiom at 10:31 AM on August 23

