

How can I convert a scanned pdf to searchable text?
May 6, 2004 12:23 AM

I need to convert a scanned pdf to searchable text, without printing it out and scanning it back in using OCR. Also, I'd like a cheap or free solution since I'm not likely to use it often ever again.
posted by nomis to Technology (17 answers total)
 
Google: "convert a scanned pdf to searchable text" --> Planet PDF --> "To convert a scanned PDF Image file to a searchable, editable PDF Normal file, use Acrobat Capture or the Paper Capture Online service offered by Adobe." --> Paper Capture Free Trial.
posted by Jairus at 12:28 AM on May 6, 2004


Erm... I'm using Reader 6.0 on XP and it has a search facility built in...
posted by twine42 at 1:32 AM on May 6, 2004


That won't help, twine42; that of course only works for PDFs that contain the text as text rather than as an image. (But don't feel bad, I just on-preview-deleted a post assuming a text PDF too...)
posted by fvw at 1:40 AM on May 6, 2004


Thanks Jairus, that's put me on the right track. It seems substandard googling was my problem.
posted by nomis at 1:48 AM on May 6, 2004


if it's text as an image then you're pretty much f**ked anyway - you could convince an OCR program to take a direct feed, but I've never seen an OCR program I've been really impressed with.

For the record, I hate people who do text as graphics - in any medium. There's rarely a need for it. Websites are the worst for this...

*Wanders off to stab some graphic designers*
posted by twine42 at 1:50 AM on May 6, 2004


twine42: Yeah, the OCR quality is an issue, but it'll make my life a whole lot easier to get even a modicum of searchability.
posted by nomis at 2:21 AM on May 6, 2004


Fairy Nuff. ;)
posted by twine42 at 9:30 AM on May 6, 2004


nomis, if it's nothing confidential you can send it to me and I'll run it through Adobe Acrobat. The trial version might try to stamp a watermark on it or something...
posted by vito90 at 9:39 AM on May 6, 2004


if it's text as an image then you're pretty much f**ked anyway

I'm not very techy, but that's what I thought too. Is there really any program (readily available) that will "recognize" letters on an image and convert them to text? I thought it was basically impossible, although I guess the US Post Office has technology that can do it with handwritten zip-codes on letters.
posted by Shane at 10:17 AM on May 6, 2004


I guess the US Post Office has technology that can do it with handwritten zip-codes on letters.

Isn't that technology called, you know, "people"? I'm pretty sure that's how the Royal Mail do it here at least, and I can't see any system reading most people's handwriting with much success.
posted by reklaw at 11:38 AM on May 6, 2004


I believe that ABBYY FineReader can read images and turn them into text, but there will be errors unless the image is really, really crisp.
posted by chaz at 11:53 AM on May 6, 2004


Isn't that technology called, you know, "people"?

No, it is not. It is very specifically technology that, without human aid, deciphers a handwritten zip-code and sorts the mail accordingly.

I know, it absolutely blows my mind too.
posted by Shane at 1:04 PM on May 6, 2004


I guess the US Post Office has technology that can do it with handwritten zip-codes on letters.

They do. It works because the addresses are from a limited domain (addresses, not essays or equations or foreign languages, just as voice recognition would work better with "yes or no?" than "what's your name?"). The problem is much, much harder for arbitrary text. I believe most OCR packages would work -- to the extent that they do work -- from a PDF, although you might have to convert it to another image format. (I.e. you can run OCR directly on the image without having to lose quality by printing and scanning again.)

I don't know offhand about different OCR packages, but you can do that research on your own.
OCR mistakes will be rampant, so unless you are willing to go through and correct it or use a fuzzy-type search, you're going to have problems. In my experience, OCR is only a good solution when it's the only solution. I'd rather type tens of pages than OCR them if I need a decent quality copy. It's probably faster.
posted by callmejay at 1:18 PM on May 6, 2004
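
A minimal sketch of the "fuzzy-type search" callmejay mentions, assuming the OCR output is already available as plain text. It uses only Python's standard-library difflib; the function name fuzzy_find, the 0.75 threshold, and the sample sentence are made up for illustration and are not taken from any OCR package.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, text, threshold=0.75):
    """Return (word_index, matched_text, score) for every run of words
    in `text` that is at least `threshold` similar to `query`."""
    words = text.split()
    n = len(query.split())  # compare against windows of the same word count
    q = query.lower()
    hits = []
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        score = SequenceMatcher(None, q, window.lower()).ratio()
        if score >= threshold:
            hits.append((i, window, score))
    return hits


# Hypothetical OCR output with typical digit-for-letter mistakes.
ocr_text = "The qu1ck brown f0x jumps over the 1azy dog"
hits = fuzzy_find("quick", ocr_text)
for index, matched, score in hits:
    print(index, matched, round(score, 2))
```

An exact search for "quick" would miss the misread word here; the similarity threshold is the knob that trades missed hits against false ones, and it would need tuning against real OCR output.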


vito90: thank you for your offer, however I have now got access to Acrobat 5, which will perform the web-based conversion I need.

callmejay: I'd rather type tens of pages
Unfortunately this is nearly 400 pages, so typing isn't really an attractive solution.

Thanks everybody for the advice.
posted by nomis at 5:52 PM on May 6, 2004


nomis, it looks like I'm late to the party, but since this thread is searchable, here's my 2¢.

ABBYY FineReader Pro does an excellent job if you are willing to go through five or six pages to "train" the OCR. This is worth it for long documents. You can then correct the mistakes in FineReader and export to PDF. I have limited experience with other OCR programs, but what experience I do have leads me to believe FineReader is one of the best. You can export to a variety of formats, including .doc and .txt. Additionally, you can save the recognized document with the image, making it searchable without OCR errors showing up in what you read. In other words, you see the image but search the text. Makes big files, but useful.
posted by Grod at 7:03 PM on May 6, 2004
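
The "see the image but search the text" idea Grod describes can be modelled roughly as below. This is only an illustrative sketch, not how FineReader or the PDF format actually stores things: the Page class, the search function, the file names, and the sample text are all hypothetical. Each page keeps the untouched scan for display and the (possibly imperfect) OCR text purely for lookup.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    image_file: str  # what the reader sees (hypothetical file name)
    ocr_text: str    # hidden text used only for searching


def search(pages, term):
    """Return the numbers of the pages whose hidden OCR text contains `term`."""
    term = term.lower()
    return [p.number for p in pages if term in p.ocr_text.lower()]


# Made-up sample data; note the OCR slip "po1icy" on page 1.
pages = [
    Page(1, "scan-001.png", "Chapter one: the po1icy was adopted"),
    Page(2, "scan-002.png", "The budget was approved in March"),
]

print(search(pages, "budget"))
print(search(pages, "policy"))
```

The second search finds nothing because of the OCR slip, but the page image itself still reads correctly: recognition errors cost you search hits rather than legibility.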


Is there really any program (readily available) that will "recognize" letters on an image and convert them to text?

Uh, yeah, I read OCR scanned ebooks all the time.
posted by abcde at 11:29 PM on May 6, 2004


Thanks Grod, I had a look at the ABBYY website and it looks good, but there seem to be some server errors keeping me out of the 'buy' and 'download' sections of the site. Out of interest, do you happen to know roughly what it costs?

... you see the image but search the text...
See, that sounds ideal; the more I look into this, the more uses I think of for it!
posted by nomis at 11:54 PM on May 6, 2004


This thread is closed to new comments.