Transcription. Lots and lots of transcription.
August 1, 2016 4:50 AM

I have scanned copies of 60 documents that I need transcribed. It's 1,318 pages in total, single-space typed, with each page a separate JPG. I usually self-transcribe stuff like this, but this is a bit more than I want to take on. What are my options for getting this done without breaking the bank?

I've tried some OCR software, but I'm either not getting great results, or the work involved in cleaning up the file afterwards is substantial.

Someone suggested Amazon MT. If you've used it for a task like this, what's the quality like, and what's the normal pay per page?

Other options?
posted by NotMyselfRightNow to Computers & Internet (14 answers total) 10 users marked this as a favorite
 
fiverr?
posted by alon at 5:54 AM on August 1, 2016


I've had really great results using Google Cloud Vision for OCRing documents. You can see their pricing at the bottom of the page. Right now the first 1000 units are free.
posted by dilaudid at 5:55 AM on August 1, 2016 [2 favorites]


dilaudid, is there an easy means of feeding Google Cloud Vision a few hundred images, without writing up your own code?

I, too, need to OCR some stuff, and it did a pretty good job on the sample image that I gave it -- but I am just a humble fisherman and can't write anything of my own.
posted by wenestvedt at 6:10 AM on August 1, 2016


The full version of Adobe Acrobat will do this. They have a free trial that might get you through all your pages.
posted by Confess, Fletch at 6:30 AM on August 1, 2016 [2 favorites]


Acrobat's OCR is good. I've also found tesseract to be a very high quality OCR package. You'll have to be comfortable with command line stuff.
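
For anyone wondering what the command line route looks like, here is a minimal sketch in Python. It assumes tesseract is installed and on your PATH, and that all the pages sit in one folder as .jpg files. The function names (build_commands, run_batch) and the folder names are made up for illustration; the only tesseract usage it relies on is the basic "tesseract imagename outputbase -l lang" form.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(image_dir, out_dir, lang="eng"):
    """Return one tesseract command per .jpg file, sorted by file name."""
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    commands = []
    for image in sorted(image_dir.glob("*.jpg")):
        # tesseract adds the ".txt" extension to the output base itself
        out_base = out_dir / image.stem
        commands.append(["tesseract", str(image), str(out_base), "-l", lang])
    return commands


def run_batch(image_dir, out_dir, lang="eng"):
    """Run tesseract on every .jpg in image_dir; return the number of pages processed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = build_commands(image_dir, out_dir, lang)
    for cmd in commands:
        # check=True stops the batch on the first page that fails
        subprocess.run(cmd, check=True)
    return len(commands)
```

Calling run_batch("scans", "text") would write one text file per page into the "text" folder (both folder names are placeholders). The "*.jpg" pattern is case-sensitive on Linux, so files named .JPG or .jpeg would need their own pattern.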
posted by dis_integration at 8:00 AM on August 1, 2016


I don't know what OCR software you tried out, but in this software category you pretty much get what you pay for. OmniPage or ABBYY FineReader will give you better results than cheaper products or online services. That said, I work with OCR'd documents all the time and there's always a time element involved in setting up the process and cleaning up afterwards. How fast you can get through that depends on the quality of your scans, amount of formatting, etc. There are always going to be misrecognized characters, word spaces it thinks are tabs, line breaks that you don't want, etc. If your scans are crisp and minimally formatted, then with the right software, good computing power, etc., I'd think it should be possible to process everything at a pace of about 100 pages an hour, but that's still a fair time commitment.

I can't answer your question directly, but I think you can use my extremely ballpark guesstimate above to figure out how to price the project if you want to farm it out. There are basically two possibilities: somebody with experience doing this sort of work, and good software/hardware, will get it done faster but expect a higher hourly rate. For the same flat rate you might get someone who has lesser skills/equipment who will take significantly longer but who doesn't care because they're willing to work for what works out to be a pretty low hourly rate. Would you be able to budget $0.20-$0.25/page for the project? Because I think you're talking a bare minimum of 1-2 days' work for the right person.
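
To put rough numbers on that, here is a small Python sketch that plugs the 1,318 pages into the ballpark figures above (about 100 pages an hour, $0.20-$0.25 a page). The function name and the 8-hour working day are assumptions for illustration, and the throughput is only the guesstimate from this answer, not a measured rate.

```python
def ocr_job_estimate(pages, pages_per_hour=100, rate_low=0.20, rate_high=0.25,
                     hours_per_day=8):
    """Estimate processing time and flat-rate cost for an OCR cleanup job."""
    hours = pages / pages_per_hour
    days = hours / hours_per_day      # assumes an 8-hour working day
    cost_low = pages * rate_low
    cost_high = pages * rate_high
    return hours, days, cost_low, cost_high


hours, days, cost_low, cost_high = ocr_job_estimate(1318)
print(f"{hours:.1f} hours (~{days:.1f} working days)")
print(f"${cost_low:.2f} to ${cost_high:.2f} total")
# Effective hourly rate for someone who really does hit 100 pages an hour
print(f"${cost_low / hours:.2f} to ${cost_high / hours:.2f} per hour")
```

At these assumptions the job comes to roughly 13 hours and about $264 to $330, which is an effective $20 to $25 an hour; someone working at half that pace would be earning half that hourly rate for the same flat fee.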
posted by drlith at 8:04 AM on August 1, 2016 [1 favorite]


1-2 days? That's a bit more than 80 pages an hour. Let's say there are 250 words per page, times 1,318 pages, for 329,500 words. At 40 words per minute, that's 8,237.5 minutes, about 137.3 hours, or a bit more than 17 eight-hour days.
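
The retyping arithmetic can be redone for other page densities or typing speeds with a few lines of Python. This is just a sketch: the function name is made up, and the 8-hour day is an assumption used to turn hours into days.

```python
def retyping_estimate(pages, words_per_page=250, words_per_minute=40,
                      hours_per_day=8):
    """Estimate how long it would take to retype a set of pages by hand."""
    words = pages * words_per_page
    minutes = words / words_per_minute
    hours = minutes / 60
    days = hours / hours_per_day      # assumes an 8-hour working day
    return words, minutes, hours, days


words, minutes, hours, days = retyping_estimate(1318)
print(f"{words:,} words, {minutes:,.1f} minutes, "
      f"{hours:.1f} hours, {days:.1f} working days")
```

Changing words_per_page or words_per_minute shows how sensitive the total is to those two guesses.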

How word-dense are the pages? And how is the formatting? OCR can do some funky things with formatting, if you find a good solution there.

Also, what do you want in the end: replicated formatting from these documents, or just having searchable text? Or somewhere between?

This site cites "Special Report on the Business Support Services Industry, Brenner Information Group" and finds that rates range from $15 to $75 per hour. The average is about $30 an hour, and legal and medical transcription rates are slightly higher.
posted by filthy light thief at 8:23 AM on August 1, 2016


I assume he is talking about outsourcing the OCR, and so I was talking about how much labor is typically involved in processing OCR, not having someone retype it. That's why I used the word "process," and talked about OCR software. It takes really, REALLY bad copy to make retyping faster than OCR for a large project.
posted by drlith at 8:50 AM on August 1, 2016 [1 favorite]


What about source text with a lot of acronyms, dates, and other weird text that will not match a standard dictionary? Does that skew the equation of whether re-typing beats OCR?
posted by wenestvedt at 9:50 AM on August 1, 2016


What OCR software have you tried? I find that Nuance's OCR technology (OmniPage or Power PDF) works much better than Adobe Acrobat's or ABBYY's FineReader (I currently use the latest versions of Power PDF and Acrobat Pro; my experience with ABBYY is from several years ago).
posted by odin53 at 10:01 AM on August 1, 2016


ABBYY FineReader is my OCR of choice for image files, but it tends to work better with clear scans at 300 DPI or greater.
posted by JJ86 at 10:43 AM on August 1, 2016 [1 favorite]


Tesseract works pretty well on JPGs of 200 DPI and over; possibly even better than 300 DPI mono scans that were the recommended standard years back.

If these aren't too confidential I'd be happy to run 'em as a batch. Would take me 5 minutes at most.
posted by scruss at 12:38 PM on August 1, 2016


http://transcribblers.com/
posted by Grimp0teuthis at 7:27 PM on August 1, 2016


I convert scans to text all the time. Spreadsheets, catalogs, lists, marketing docs, legal docs etc. The key is always the source file - dpi, general clarity, font used, age of the original document and more. Without seeing the original, it is not possible to answer the question accurately as the quality of the source is what is in question.

Can you post a sample page or pages for evaluation? From there, it can be determined what may be the best option for you. Some OCR software packages handle various file types and layouts better than others. Also, sometimes using two or more programs in concert will achieve a better result than attempting to knuckle it through with a single program.
posted by lampshade at 7:37 PM on August 1, 2016

