Getting Access to Books Which Aren't Available Digitally
July 4, 2018 12:45 PM Subscribe
I've been thinking lately about books which aren't available in an electronic/digital format, whether they're older titles, things like RPG/game books for which distribution was print-only, and similar situations. I'd love suggestions for ways to get some of these in an accessible electronic format.
The traditional approach for this kind of thing is to scan a book myself and then hope there aren't too many OCR errors. Similarly, I know about services like this one which will do the scanning/OCR for me. Finally, of course, I can try to contact publishers and/or authors, but that can be difficult to impossible.
I'd love some suggestions for ways to make this whole process a little easier. I'm open to legal ways to get things I haven't considered. If I could pay someone a small fee to fix OCR errors for me, I'd be open to that.
Best answer: The scan services are pretty good. (I haven't used them, but I've been in contact with people who have.) It's a simple process that gets easier when done in bulk - good scanning machinery ranges from pricey to very expensive, but once you have it, it's easy to set up a system to do a lot of scanning. If you're ordering, make sure they scan at 300 dpi or higher; ideally, you want most of the scans to be black and white, and only get color scans where needed, because that keeps the file size manageable.
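To put rough numbers on the file-size point, here is a back-of-the-envelope sketch in Python. The 8.5 x 11 inch page size is an assumption for illustration, and the figures are for uncompressed pixel data only; real scans are compressed and come out much smaller, so treat this as a way to see the ratio rather than a prediction of actual file sizes.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size (US letter); the 300 dpi figure comes from the advice above.
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0
DPI = 300

bw_bytes = raw_page_bytes(PAGE_W_IN, PAGE_H_IN, DPI, 1)      # 1 bit per pixel
color_bytes = raw_page_bytes(PAGE_W_IN, PAGE_H_IN, DPI, 24)  # 24-bit color

print(f"black and white: {bw_bytes / 1_000_000:.2f} MB per page")
print(f"24-bit color:    {color_bytes / 1_000_000:.2f} MB per page")
print(f"color is {color_bytes // bw_bytes}x larger before compression")
```

Multiply by a few hundred pages and the case for asking for black-and-white scans wherever color isn't needed becomes fairly clear.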
OCR software these days is pretty good - Acrobat's OCR is fine for most purposes, although it'd have problems with exotic fonts or anytime there's text over a picture. For practical use, most gaming books will convert just fine with auto-OCR; a few words here or there would have a typo (OCR-o? We don't have a good word for those.), but you'd be able to search the PDF for most words that you'd need to find.
Paying to fix OCR errors gets expensive, because that's nitpicky proofreading work. Even if you find a student willing to do it for minimum wage, it's a lot of hours for very small results. That said, it's work I love doing, and I have practice with scanning print-only RPG books; I'd be happy to give specific advice for how to proceed, and might be available for some work (depends on what kinds of books and my schedule and so on).
Best OCR software on the market is ABBYY Finereader, if you want to convert them yourself, or you're asking someone else what they use. Acrobat and Google are about on par with each other for OCR errors; both are terrific with normal fonts in normal book layouts but start having problems with script fonts (like many headings in RPG books), complex tables (like character sheets), and very tiny text.
OCR so you can search a PDF for keywords is easy, and you don't need proofreading for most documents. OCR so you can convert to Word or something else and re-create the book takes a lot more time and effort, and requires both error-proofing and advanced formatting skills - the auto-layout resulting from OCR is always barely usable.
Suggestion - One to three books: Look into scanning for yourself with standard office tech. More than that: Hire the scanning out, or invest in a serious scanning production system. I'm happy to talk about the details.
posted by ErisLordFreedom at 1:22 PM on July 4, 2018 [3 favorites]
Response by poster: Thanks! :)
Rhaomi, I wish I could use a comics reader, or read comics at all, for that matter. As someone totally blind, I can't read the non-OCR scans myself. If I could this would probably be much easier all around.
posted by Alensin at 1:33 PM on July 4, 2018
My bad, I missed your tags!
If it helps any, a lot of the books in Open Library are also available as DAISY audiobooks, which include a text component. I can't speak to the quality of the OCR there since access is limited to verified patrons with government-issued decryption keys, but the breadth of material covered is pretty amazing, including many e-books that are only available in DAISY format.
posted by Rhaomi at 3:49 PM on July 4, 2018 [1 favorite]
Best answer: I can't speak to the quality of the OCR there since access is limited to verified patrons with government-issued decryption keys
I can. It's OK, not amazing. Getting access to the DAISY files through them can be a pain because you need a BARD Reader code from the National Library Service for the Blind and Physically Handicapped (NLS). I used to work at Open Library; it's definitely worth trying to go this route, and let me know if I can help you get ahold of someone.
There is a similar service in Canada that is better (you might want to contact them anyhow because they are resourceful, also these guys). If you are a student you can often get your school's accessibility office to request this sort of stuff from the publisher. As I'm sure you know, people are (finally) getting smarter about accessibility, and it's also getting a lot quicker to actually scan the stuff, so sometimes a DIY solution (if there's a scanning center near you or if you could partner with a public library) might be the way to go.
Since you are what library people call "print-disabled," it should be 100% legal to format-shift basically anything into a format that is readable by you, without negative legal implications. Apologies if some of this is stuff you already know.
posted by jessamyn at 7:00 AM on July 5, 2018 [4 favorites]
A company called Dolphin makes software called EasyConverter - it can create both MP3 and DAISY from OCR material (it has OCR built in, but also accepts PDF, Word and other files). It can also output the OCR as text and Braille. The single-user license is around $700 US, but there's a free trial version.
Quite a few years ago I worked as an editor on an OCR conversion project - as ErisLordFreedom mentions, it's pretty fiddly manual work to fully clean up OCR conversion errors. However (and this dates me a bit), the work I did was in WordPerfect 5.1 and the company I worked for had built a whole series of cleanup macros that we would run to fix the bulk of the garbage OCR output in the text files, but a second, manual line-by-line editing pass was required to fine-tune the files, which we compared to the print copies. Good times.
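For a sense of what a first automated cleanup pass like that might look like today, here is a minimal sketch in Python. It is not a reconstruction of those WordPerfect macros; the rule table is hypothetical and covers only a few typical OCR confusions. A real project would build its rules from the errors its own scans produce, and the manual pass against the print copy would still be needed.

```python
import re

# Illustrative rules only (hypothetical, not from any real project).
# Order matters: character fixes run before lines are unwrapped.
RULES = [
    (re.compile("\ufb01"), "fi"),                # "fi" ligature character
    (re.compile("\ufb02"), "fl"),                # "fl" ligature character
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),       # word split across a line break
                                                 # (also joins real hyphenated compounds)
    (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),  # zero inside a lowercase word
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # one inside a lowercase word
    (re.compile(r"(?<!\n)\n(?!\n)"), " "),       # unwrap lines, keep blank-line breaks
    (re.compile(r"[ \t]{2,}"), " "),             # collapse repeated spaces
]


def clean_ocr(text):
    """Apply every rule in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def suspicious_words(text):
    """Return words that still mix letters and digits, for the manual pass."""
    return [
        word
        for word in re.findall(r"\w+", text)
        if re.search(r"[A-Za-z]", word) and re.search(r"\d", word)
    ]


sample = (
    "The wiz-\nard casts a spe11 on the g0blin.\n"
    "It  \ufb02ees.\n\nNext page: roll 2d6."
)
cleaned = clean_ocr(sample)
print(cleaned)
print(suspicious_words(cleaned))
```

The second function is the useful part for proofreading: it doesn't fix anything, it just lists what a human should look at. In game books it will also flag legitimate dice notation, so the list is a starting point rather than a list of errors.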
In terms of OCR to PDF, if you want to use a vendor, just make sure they understand accessible PDF markup (too many don't), but that's something you probably already knew.
posted by mandolin conspiracy at 8:31 AM on July 5, 2018 [1 favorite]
I also missed that you want this done because you can't look at the pages; in that case, the OCR quality matters a lot more.
The most streamlined, quick version, without the expense of FineReader and the time/money for extensive proofing, would be:
1) Get good quality scans. Most of the paid services are fine for this; just make sure they're scanning for OCR-ability. Tell them NOT to scan in color unless it's absolutely necessary - you don't mind dropping pale color shading behind text.
2) OCR with something like Acrobat or Google Docs - it's great for normal text; will botch strange-font headers, and there'll be layout problems with charts/tables/complex columns.
3) Export to Word. Have someone who's quick with Word formatting clean up the basic structure and convert it to all-one-column text.
Next depends on your software. If you're reading PDFs, have them structure the Word doc with a TOC and headers as needed, include alt descriptions of pictures instead of the pics themselves, and probably change any footnotes to end-of-chapter notes. Then convert the Word doc back into a PDF, which will look ugly but you won't care; it'll be readable. I have some practice with accessibility markup in PDFs and would be happy to help with that part.
If you can read things other than PDFs, format the Word doc for that - I don't know the processes involved, but they're probably similar. It probably wants extensive use of styles.
If the reader is just basic and isn't going to pay attention to styles or headings - grab the Word document text, and throw it into Notepad (or Notepad++ or whatever), copy that, and paste back into Word. That strips out all the formatting; OCR errors will be easier to spot, and anything that wasn't noticeable will either be very obvious or gone because it didn't convert. (That'll lose any text that was a "text box" instead of inline text.)
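If someone prefers to script the strip-all-formatting step rather than round-trip through Notepad, a rough Python sketch follows. It relies only on the fact that a .docx file is a zip archive whose main text lives in word/document.xml; the function name is made up for illustration, and the handling is deliberately minimal (headers, footnotes and comments are ignored, and text-box content may come out duplicated or out of place, so the results won't exactly match the Notepad method).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_plain_text(docx_file):
    """Return the body text of a .docx, one line per paragraph, no formatting."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W + "p"):
        pieces = []
        for node in paragraph.iter():
            if node.tag == W + "t":        # a run of literal text
                pieces.append(node.text or "")
            elif node.tag == W + "tab":    # tab character
                pieces.append("\t")
            elif node.tag == W + "br":     # manual line break
                pieces.append("\n")
        lines.append("".join(pieces))
    return "\n".join(lines)


# Stand-in archive holding only the part this function reads. It is NOT a
# complete .docx that Word would open; it just demonstrates the extraction.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Chapter One</w:t></w:r></w:p>"
    '<w:p><w:r><w:t xml:space="preserve">Roll </w:t></w:r>'
    "<w:r><w:rPr><w:i/></w:rPr><w:t>2d6</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("word/document.xml", DEMO_XML)
buffer.seek(0)

plain = docx_to_plain_text(buffer)
print(plain)
```

With a real file you would pass its path instead of the in-memory buffer, then save the returned string as a .txt file.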
Feel free to memail me for more details, including resources that may have already scanned some of the books you want. There's a huge debate in the legal world about whether it's okay to create digital files of abandoned-but-still-copyrighted publications (Authors Guild v. Google was the key case); the issue falls more solidly into "it's legal" when the purpose is accessibility.
posted by ErisLordFreedom at 10:09 AM on July 5, 2018 [2 favorites]
This thread is closed to new comments.
posted by Rhaomi at 1:16 PM on July 4, 2018 [1 favorite]