Search multiple PDF files
July 11, 2010 10:26 AM

I have a library of several thousand PDF books, organized by directory (C:\Lib\Science, C:\Lib\History, etc.). I would like to be able to search across part or all of the library: for example, find all occurrences of "Charles Dickens" in the PDFs in the C:\Lib\Literature directory. What (freeware) Windows software can do this, along with other features?
posted by stbalbach to Computers & Internet (17 answers total) 1 user marked this as a favorite
 
Best answer: Google Desktop Search?
Windows Live Search?
posted by claudius at 10:40 AM on July 11, 2010


Sorry, should be Windows Search. I haven't tested it with PDFs, but it seems like something that it should do...
posted by claudius at 10:41 AM on July 11, 2010


Yep, current versions of Windows will deep-search the content of PDFs just fine.
posted by Rendus at 10:44 AM on July 11, 2010


Best answer: Adobe Acrobat will search in directories. I have the full version; not sure if the free version does it too.
posted by prenominal at 11:36 AM on July 11, 2010


I'm using a piece of software called Mendeley Desktop that sorts pdfs with an iTunes like interface and has deep search. I really, really like it. It's maybe more for journal articles, rather than books, but I find it quite helpful. On the downside, it's in beta.
posted by Made of Star Stuff at 12:15 PM on July 11, 2010


Response by poster: Adobe Acrobat will search in directories.

Hey, you're right. There is a "Search" and a "Find"; I'd never looked at the "Find" option (under "Files" in Acrobat Reader 9). It has the option to search all files within a directory. Thanks!

Windows Search

I have a new install of Win7 and never really looked at Windows Search before. I just changed it to index all the book library directories and to include the file contents (by default it indexes only metadata), and moved the index database off the system drive, since it's an SSD. It took a long time to build the index, which covers over 15k items. It seems to find some things, but it doesn't find long strings in quotes, even though I know they are there.

Mendeley Desktop

Thanks, I'll look into it. I wonder if it can handle 15k+ book-length PDFs. But it's the sort of application I'm looking for.

Google Desktop Search

Working on this.. seems to take a long time to index and configuration is limited .. will post if it works.
posted by stbalbach at 2:33 PM on July 11, 2010


Response by poster: Re: Google Desktop Search..

After loading and indexing, a sample search found 5 hits. The same sample search using Adobe Acrobat (a direct search, with no index) found over 50 hits. So Google Search appears to be useless. Same with Windows Search. I wonder why they are so broken? Anyway, the problem with searching with Adobe directly is that it takes hours to complete a search. What I need is a full-text index of the PDFs that actually works.
posted by stbalbach at 5:18 PM on July 11, 2010


Foxit Reader can do this too. It's smaller than Acrobat Reader and starts faster; last time I looked, though, printing was much slower.
posted by flabdablet at 7:04 PM on July 11, 2010


If you're searching these things often enough to want indexing, you might also care to try Zotero.
posted by flabdablet at 7:06 PM on July 11, 2010


Response by poster: Zotero uses pdftotext to extract the text from the PDF, and Google Desktop uses the exact same utility. I'm beginning to think that pdftotext doesn't work for PDFs that are images (scans), and that this is why Google Desktop is not picking up as many hits as it should, since many/most of my PDFs are image scans.
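
One way to test that suspicion is to run pdftotext on each file and see whether any real text comes out. The following is only a rough sketch, assuming the pdftotext command-line tool is installed and on the PATH. The function names and the 50-characters-per-page threshold are made up for illustration and would need tuning on a real library.

```python
import subprocess

def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text ('-' sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def looks_image_only(text: str, min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF has no usable text layer.

    pdftotext ends each page with a form feed, so counting form feeds
    approximates the page count. The threshold is an assumption.
    """
    pages = max(text.count("\f"), 1)
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / pages < min_chars_per_page
```

A scanned book typically comes back as little more than a run of form feeds, so `looks_image_only(extract_text(path))` would flag it as needing OCR.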

Re: Foxit .. I have Win7 64-bit with a multi-core i7 and 6 gigs, so Adobe Acrobat starts up instantly. The searching, though, is a long slog: an hour to search through a hundred PDFs. BUT Adobe does appear to search image-based PDFs; perhaps it is doing OCR in real time? Normally things on this machine happen so fast they seem instant, so it's doing some serious work searching those PDFs. I'd like to get the OCR done one time and save the text in an index that can be quickly searched multiple times in the future.

Maybe I need to find a utility to OCR all my PDFs, then index with Google Desktop or Windows Search.
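
For what it's worth, the "extract once, search many times" idea can be sketched in a few lines. This is a toy in-memory index, not how Google Desktop or Windows Search work internally; the class and function names are invented for the example, and it assumes the text of each PDF has already been extracted (by OCR or otherwise).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

class PhraseIndex:
    """Toy positional index: build once, then answer phrase queries quickly."""

    def __init__(self):
        # word -> {document name -> [positions of the word in that document]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, name, text):
        for pos, word in enumerate(tokenize(text)):
            self.postings[word][name].append(pos)

    def search(self, phrase):
        """Return the sorted names of documents containing the exact phrase."""
        words = tokenize(phrase)
        if not words:
            return []
        hits = []
        for name, starts in self.postings.get(words[0], {}).items():
            for start in starts:
                # Every following word must sit at the next position.
                if all(start + i in self.postings.get(w, {}).get(name, ())
                       for i, w in enumerate(words)):
                    hits.append(name)
                    break
        return sorted(hits)
```

Because word positions are stored, a quoted phrase such as "Charles Dickens" matches only documents where the two words are adjacent. A real setup would also need to save the index to disk between runs.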
posted by stbalbach at 7:52 PM on July 11, 2010


Best answer: many/most of my PDFs are image scans

Then you will definitely need them OCR'd before they become searchable.

If you scanned them yourself, you might well find that the software that came with your scanner includes OCR; or if you have the full version of Acrobat, I believe that it can add an OCR'd text layer to a scanned PDF if you ask it nicely. This might help.
posted by flabdablet at 8:18 PM on July 11, 2010 [1 favorite]


Response by poster: Yes indeed, that appears to be the problem. I've searched around and it looks like there are commercial solutions: Adobe or PaperPort (OmniPage). Thanks for the utility. Since my library keeps growing, I'll have to figure out a way to automate finding and OCR'ing incoming image PDFs, and that will help. This is a much more difficult problem than I expected.
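
The "finding incoming PDFs" half of that automation could look something like the sketch below. It is only an outline under assumptions: the manifest file (a JSON list of paths already handled) and both function names are hypothetical, and the OCR step itself is left to whatever tool ends up being used.

```python
import json
from pathlib import Path

def _load_seen(manifest_path):
    """Read the set of already-processed paths, or an empty set if none yet."""
    manifest = Path(manifest_path)
    return set(json.loads(manifest.read_text())) if manifest.exists() else set()

def find_new_pdfs(library_root, manifest_path):
    """Return PDFs under library_root that are not yet listed in the manifest."""
    seen = _load_seen(manifest_path)
    return sorted(
        str(p) for p in Path(library_root).rglob("*.pdf")
        if str(p) not in seen
    )

def mark_processed(paths, manifest_path):
    """Record paths as handled so later runs skip them."""
    seen = _load_seen(manifest_path)
    seen.update(paths)
    Path(manifest_path).write_text(json.dumps(sorted(seen)))
```

Run on a schedule, `find_new_pdfs(r"C:\Lib", "processed.json")` would list only the files added since the last pass; each one can then be checked for a text layer, OCR'd if needed, and passed to `mark_processed`.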
posted by stbalbach at 8:40 PM on July 11, 2010


flabdablet's link has some useful information. However, I'm reading through the various steps in that fairly lengthy process, and wondering if it wouldn't be easier to just batch-OCR all the files and ignore errors, rather than detect the image-only ones, then separate them into folders, etc.?
posted by prenominal at 9:46 PM on July 11, 2010


Never mind, at the end of that page, I see that the author says "Given the caveats above, you might just want to OCR everything" and has another link with instructions for Batch OCR.
posted by prenominal at 9:49 PM on July 11, 2010


Response by poster: New problem: Google Desktop Search only indexes the first 70,000 characters of any document. So it's not a full-text index. Microsoft Search seems to fare no better (actually worse).
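
To illustrate why a cap like that loses hits in book-length files, here is a tiny sketch. The function name is made up, and it simply models the cap as cutting the text off at a fixed character count, which is an assumption about how the limit behaves.

```python
def count_hits(text, phrase, limit=None):
    """Count case-insensitive occurrences of phrase.

    If limit is given, only the first `limit` characters are searched,
    mimicking an indexer that stops reading after that many characters.
    """
    searchable = text if limit is None else text[:limit]
    return searchable.lower().count(phrase.lower())
```

Any occurrence that starts after the 70,000th character is invisible to the capped search: `count_hits(book_text, "Charles Dickens", limit=70_000)` can return far fewer hits than the same call without a limit.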

I might need to find a business-class Document Management System.
posted by stbalbach at 9:28 AM on July 12, 2010


Best answer: Acrobat will also do the indexing, after you OCR all the docs in your directories. The instructions are in the "searching and indexing" section of the help file.
posted by prenominal at 5:33 PM on July 12, 2010


Response by poster: Yes I found that out. Acrobat Pro (not Standard) has a "catalog" feature that will index a directory of PDFs. Or individual PDFs can have an index embedded in each. Luckily I have an old copy of Acrobat 6.0 so it's not too costly to upgrade to 9.0 Pro.
posted by stbalbach at 6:47 PM on July 12, 2010

