Best practices for transcribing a typed manuscript?
November 11, 2015 7:11 AM

I have about 500 digital photos, each of one page from a hand-typed, 1940s-era text. How do I best transcribe it as part of an effort to share it digitally, and then include it in my own project?

I am writing about my grandpa's WWII history. I borrowed a unique book via ILL that I could only keep for a few weeks, so I photographed every page. Now I'm typing these up into text files on my computer, one file per photo; my eventual intent is to make an EPub file out of it so other people can read it, and then to include bits of it in my paper.

I am asking because I know I must be doing it wrong. Like, I am probably being inconsistent in my formatting (e.g., how do I handle words that break in the original text, which any software could handle better? how do I put in footnotes?), and I can't believe that I won't need to reformat it all later.

Is there standard practice for doing this kind of transcription? What search terms should I use, or resources should I read?

Thanks in advance!
Response by poster: FWIW, right now I am using BBEdit to make text files; the images are poor quality and inconsistent JPEGs, so OCR is right out.

Here is the current version of my paper, which is slated for lots of additions Real Soon Now:

(My thanks to all who wore a uniform on this Veterans Day.)
Is this a book that you can make available? If it was published postwar in the US, it will still be in copyright.

One of the right ways of doing this is to generate the book in an archival format, such as TEI Lite. An easier way to key this might be to use AsciiDoc: it represents complex documents as plain text (as a typewriter would do), and has ways of representing footnotes. It can be converted to pretty much any other format using Pandoc.
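As a rough sketch of what keying in AsciiDoc could look like, the Python below stitches per-page transcriptions into one AsciiDoc string, marks each page boundary with a comment line, and shows the inline footnote macro. The function names, the title, and the sample sentences are made up for illustration; only the `footnote:[...]` macro, the `= Title` line, and `//` comment lines are AsciiDoc syntax.

```python
def footnote(text):
    """Return an AsciiDoc inline footnote macro for the given note text.

    Assumption: the note text contains no ']' character.
    """
    return "footnote:[" + text + "]"


def assemble(title, pages):
    """Join page texts into one AsciiDoc document string.

    `pages` is a list of strings, one per transcribed page, in order.
    """
    parts = ["= " + title, ""]
    for number, body in enumerate(pages, start=1):
        parts.append("// page " + str(number))  # comment line, not rendered
        parts.append(body.strip())
        parts.append("")  # blank line ends the paragraph
    return "\n".join(parts)


# Hypothetical sample text, not taken from the actual book.
sample = assemble(
    "Unit History",
    [
        "The battalion landed at dawn." + footnote("Date uncertain in original."),
        "It moved inland the next day.",
    ],
)
print(sample)
```

One limitation of this sketch: a paragraph that runs across a page boundary gets split in two by the blank line, so those joins would still need fixing by hand.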

Are the line breaks at hyphen-
ated words? Some archivists may insist on keeping hyphenation; others are less picky. If the text is clearly running paragraphs and these breaks are natural line breaks, I would tend not to keep them.

OCR might be better than you think. If the pictures are the equivalent of about 200 dpi or better (so ~1700 x 2200 px) and have a bit of contrast, you might be in luck. There will be a lot of proof correction anyway.
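The 200 dpi figure is just pixels divided by inches, so it is easy to check a photo against it. A minimal sketch, assuming a letter-size page (8.5 x 11 in, which is what 1700 x 2200 px at 200 dpi implies) that fills the whole frame; `effective_dpi` is a made-up helper name:

```python
def effective_dpi(width_px, height_px, page_width_in=8.5, page_height_in=11.0):
    """Estimate resolution, assuming the page fills the whole frame."""
    # Take the smaller of the two ratios so the estimate is conservative.
    return min(width_px / page_width_in, height_px / page_height_in)


print(effective_dpi(1700, 2200))  # 200.0
print(effective_dpi(1200, 1600))  # a smaller photo, for comparison
```

If the page only fills part of the photo, the real figure is lower, so measure the page area in pixels rather than the full image.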
Response by poster: It's a government publication -- an Army unit history -- so it should automatically be in the public domain…right?

Are the line breaks at hyphen-
ated words?

Yep. Which is why I am asking myself whether I should simply type everything out "properly" (with reasonable footnotes), or try to replicate that Olde Timey appearance.

Really, I guess that's the big question: do I try to capture the text as "cleanly" as possible, or do I try to preserve the old layout, warts, broken words, and all?
I'd try OCRing your page images. Surely correcting text that was incorrectly OCRed will be easier than retyping an entire book by hand. Personally, I'd clean up the weird line breaks and make the text as clean as you can.
There are a couple of tools that I would try before you resign yourself to typing in 500 pages of text by hand. Start by giving OCR a stab. It keeps getting better all the time, and current software can sometimes yield usable results with JPG photographs. Obviously you will have to go through by hand and fix a lot of stuff, but if you can even get decent results for a portion of your pages it's going to save you a TON of time. I use ABBYY FineReader Pro in my workflow all the time, including things like phone camera pictures of birth certificates, and it can handle a certain amount of blurriness and skewed angles and so on. I think it's edged out OmniPage over the past decade as the best OCR tool on the market. (You can import multiple one-page files and have it process them all together. I'm not sure I'd do all 500 at once, but batches of 25-50 pages shouldn't choke it.)

If the image quality is just so poor that you can't get OCR to work, I'd personally use speech recognition software to enter the text, rather than typing, unless you can type for long periods of time at 100+ WPM and have wrists of steel. I use Windows' baked-in speech-to-text with a $25 headset microphone. It sounds like you're in a Mac environment, so I would give its Dictation feature a try, but Dragon Dictate may be worth the investment.
Response by poster: Trying not to thread-sit buuuuut... I don't mind the manual typing because it gives me a chance to read the thing slowly! :7)
I've done this before with a slim out-of-print book: 80 pages or so. 500 is a lot. Break up your transcribing into short sessions. Your wrists and attention span will thank you. And keep the formatting: 1) it's nice to have for archive-minded people and 2) it's good for keeping your place in the text. It's easy enough to make a second edition with normal formatting.
P.S. That was more than a dozen years ago. If I had to do it now I would OCR it. You will spend lots of time enjoying the text as you fix scannos.
I am asking because I know I must be doing it wrong.

I've taken a quick look at your link, and you're not doing it wrong. There may be some choices you want to change; there may be some things you could do better (or just differently); but you're definitely not doing it wrong.

Like, I am probably being inconsistent in my formatting (e.g., how do I handle words that break in the original text, which any software could handle better? how do I put in footnotes?), and I can't believe that I won't need to reformat it all later.

You MIGHT need to reformat it later, but that's okay, isn't it? The main thing is to get it into a format that you can then work with. If you're willing to put in the time to type up 500 pages of text, you can surely spare a few hours in a year or a decade if you decide you'd like to change the formatting. Just make a copy and work on the copy. Then, none of your effort on the original is wasted.

scruss's link to TEI Lite offers some great things to think about in section 2, A Short Example. It says, "This particular encoding represents a set of choices or priorities" - including exactly the things you're asking about, like hyphenation.

The answer is, these are choices that depend on your ultimate goal. If you're most interested in producing a readable text, I would suggest removing hyphenation. If you're most interested in carefully preserving every visual aspect of the page, you'd probably want to keep the hyphens. But again, if you choose to reformat after typing it up, you could pretty easily remove the hyphens then with some search and replace (although I would do it word by word rather than using Replace All).
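If you'd rather script the word-by-word pass than do it in an editor, something like this Python sketch would work. It lists every word broken at a line end so you can decide each one, then rejoins one at a time. The function names and sample sentence are invented for illustration, and it assumes each break is typed as a hyphen immediately followed by a newline.

```python
import re

# A run of word characters ending in a hyphen at the end of a line,
# followed by more word characters at the start of the next line.
BROKEN_WORD = re.compile(r"(\w+)-\n(\w+)")


def find_broken_words(text):
    """Return (first_half, second_half) pairs for each line-end hyphen."""
    return BROKEN_WORD.findall(text)


def join_broken_word(text, first, second, keep_hyphen=False):
    """Rejoin the first occurrence of one broken word, leaving others alone."""
    joined = first + ("-" if keep_hyphen else "") + second
    return text.replace(first + "-\n" + second, joined, 1)


# Hypothetical sample, not from the actual book.
sample = "The regi-\nment moved to a well-\nknown crossing.\n"
print(find_broken_words(sample))  # [('regi', 'ment'), ('well', 'known')]

fixed = join_broken_word(sample, "regi", "ment")
fixed = join_broken_word(fixed, "well", "known", keep_hyphen=True)
print(fixed)
```

The `keep_hyphen` switch is the reason not to use Replace All: "regiment" loses its hyphen, but "well-known" needs to keep it.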


* I think your current approach is great (although I'm not a trained archivist, just someone who likes to read old documents)
* I think that TEI document can offer a lot of guidance on your specific questions
* changing the formatting later on is a much smaller project, once you've got the text all typed in
