When I receive a pdf document, how do I know if it's searchable?
December 4, 2015 4:20 PM

I obtain pdf documents many ways: website downloads, via email, subscriptions, etc. And I use Copernic Desktop search. Some pdf documents simply are not searchable, even though they contain plenty of text. When I find these, I upload them to Google Drive which makes them searchable using Google Drive search. Is there a way to know immediately from the pdf itself if it is searchable?

Of course, I could just put all my pdf docs into Evernote or onto Google Drive to allow searching for words, but I don't want to. When I do find a document that isn't searchable via a desktop search agent like Copernic or others, I put it on Google Drive, but not knowing in advance if the document is searchable is frustrating.

Is there a document property I can immediately see that would help? Or is there another way around this problem?

posted by Rad_Boy to Computers & Internet (7 answers total)
a non-searchable file is one that contains only images (pdfs can contain text and images; sometimes the images are images of text).

as a heuristic, adobe checks if fonts are present (since these are needed for text, but not for images).

one way you can do that, on linux at least, is with pdffonts (other answers in that link include alternative approaches that i guess would work on windows, like using the adobe pdf viewer).
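
The font heuristic above could be scripted roughly like this. It is only a sketch: it assumes the pdffonts tool from poppler-utils (or xpdf) is installed and on the PATH, and that its output is a header line, a line of dashes, then one row per font. The function names are made up for illustration. Fonts being present only suggests there is a text layer; it does not prove every page is searchable.

```python
import subprocess
import sys


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows that follow the dashed separator line."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("---"):
            # Every non-blank line after the separator describes one font.
            return sum(1 for ln in lines[i + 1:] if ln.strip())
    return 0  # no separator found, so no font table to read


def probably_searchable(pdf_path: str) -> bool:
    """Run pdffonts (assumed installed) and report whether any fonts exist."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return count_fonts(result.stdout) > 0


if __name__ == "__main__" and len(sys.argv) > 1:
    for path in sys.argv[1:]:
        verdict = "has fonts" if probably_searchable(path) else "no fonts (likely image-only)"
        print(f"{path}: {verdict}")
```

Saved under a name of your choosing, it would be run with one or more pdf paths as arguments.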
posted by andrewcooke at 4:32 PM on December 4, 2015

What andrewcooke said - "non searchable" PDFs have images of text instead of actual text, and software like Google Drive and Evernote is performing OCR on the document to pull readable text out and index that for searching.

You could probably also use file size as an extremely rough indication - the bigger the file (for a given number of pages), the image-heavier it is and (assuming the documents don't have actual photos) therefore the more likely that it's images-as-text.

Fwiw scanners usually produce PDFs with images-as-text, as opposed to PDFs generated directly from a Word or LaTeX document.
posted by snap, crackle and pop at 4:40 PM on December 4, 2015

"Fwiw scanners usually produce PDFs with images-as-text, as opposed to PDFs generated directly from a Word or LaTeX document."

True, but a PDF with images-as-text can still be searchable. The scanner or Acrobat can OCR the image and include the resulting text in the PDF. Whether this is the case for any particular PDF will depend on how it was generated.
posted by zachlipton at 5:02 PM on December 4, 2015

A quick and dirty way to check if there's searchable text is to try to highlight some text.
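
A scripted version of the same check might look like the sketch below: instead of trying to select text by hand, it asks pdftotext to extract whatever text layer exists and looks at how much comes back. This assumes pdftotext (from poppler-utils or xpdf) is installed; the function names and the 20-character threshold are arbitrary choices for illustration, not anything from a standard.

```python
import subprocess
import sys


def has_real_text(extracted: str, min_chars: int = 20) -> bool:
    """True if the extracted text has at least min_chars letters or digits.

    Image-only pdfs typically yield nothing but page-break characters.
    """
    letters = sum(ch.isalnum() for ch in extracted)
    return letters >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext (assumed installed); '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    for path in sys.argv[1:]:
        status = "searchable" if has_real_text(extract_text(path)) else "needs OCR"
        print(f"{path}: {status}")
```

Because it looks at the actual text layer, this also catches scanned pdfs that have already been OCR'd, which a fonts-only or file-size check can miss.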
posted by crazy with stars at 5:56 PM on December 4, 2015 [5 favorites]

Acrobat can apparently do this.

Acrobat can also OCR multiple files at a time, which is quite handy.
posted by acidic at 7:08 PM on December 4, 2015

nthing crazy with stars. That is what I do with documents to tell if I need to OCR them or not. With the tool set I have enabled in Acrobat it's easiest to just try to highlight or select some text.
posted by MonsieurBon at 7:09 PM on December 4, 2015

There are ways, under both Windows and OS X, to set up a watch folder and do OCR automatically on files newly added to the folder.
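
One possible shape for such a watch folder, as a rough sketch only: it polls a folder and hands each new pdf to the ocrmypdf command-line tool, which is an assumption here (the comment above does not name any particular tool). The function names and the polling interval are invented for illustration, and the output folder should be different from the watched one so results are not picked up again.

```python
import subprocess
import sys
import time
from pathlib import Path


def new_pdfs(folder: Path, seen: set) -> list:
    """Return pdf files in folder whose names are not in seen, sorted by name."""
    return sorted(p for p in folder.glob("*.pdf") if p.name not in seen)


def watch(folder: Path, out_dir: Path, interval: float = 10.0) -> None:
    """Poll folder forever and OCR each newly added pdf into out_dir."""
    seen = set()
    out_dir.mkdir(parents=True, exist_ok=True)
    while True:
        for pdf in new_pdfs(folder, seen):
            seen.add(pdf.name)
            # --skip-text leaves pages that already have a text layer alone.
            subprocess.run(
                ["ocrmypdf", "--skip-text", str(pdf), str(out_dir / pdf.name)],
                check=False,  # keep watching even if one file fails
            )
        time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) > 2:
    watch(Path(sys.argv[1]), Path(sys.argv[2]))
```

A simple poll like this can catch a file while it is still being copied in, so a real setup would probably want to wait until the file size stops changing before running OCR.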
posted by megatherium at 7:38 PM on December 4, 2015
