Solutions for searching a large collection of PDFs over the web
April 25, 2013 10:50 AM

I have a lot of PDFs (50GB) and some EPUB documents. I would like to find a way to search through them quickly and accurately, in a way that is web accessible. I would like the results to display highlighted matches from inside each document.

I feel there has to be an obvious and easy answer, but I haven't found it yet. Preferably I would like to be able to search on filename alone, or alternatively on the content.

On my Mac I can search my PDFs, but I am not 100% thrilled with the way matches are shown, and it's not available over the net.

Lucene/Solr seems like it could do this, but I haven't figured out how to set it up. I tried ownCloud, and it's OK, but I would really like to be able to get at the list of matches before searching around inside an entire PDF file.

dtSearch can do this, but it's $1000, and I would like to spend less. It doesn't have to be free, although that would be good.

I gave the search service on the Amazon cloud a try, but it was less a finished product than something to develop around, and given my parameters it would be expensive with the way they charge.

Does anyone know of a good, high-quality search engine that I can host somewhere, that is good at searching PDFs, can display highlights, and can possibly also search inside MOBI and EPUB files?
posted by digividal to Computers & Internet (5 answers total) 8 users marked this as a favorite
Best answer: DocFetcher seems to be good at searching PDFs and displaying the highlighted text, although it doesn't have MOBI/EPUB support yet (it's supposed to be added eventually).
posted by bluecore at 11:39 AM on April 25, 2013

Best answer: This looks like a good installer for Solr on Windows. Depending on what you're trying to do, you're at the border between a basic user tool and some software development. Lucene is a development tool: flexible, but it can be a lot of work.
posted by sammyo at 11:43 AM on April 25, 2013

Best answer: Lucene removed out-of-the-box PDF support (at least, I think that's what happened). You'd need to integrate PDFBox with Lucene: use PDFBox to extract the text, and Lucene to index it.

It's not a turnkey solution; you'd need to write code, an interface, etc. to create the indices and pull results out. A competent programmer should be able to do it easily enough. The time to process the PDFs will probably be longer than the time to write the indexing code, and the front end is your call. I've done it; it isn't hard.
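
For a feel of what "extract the text, then index it" involves, here is a minimal sketch in Python using only the standard library. It is a toy in-memory inverted index standing in for Lucene, not Lucene's or PDFBox's API. The class name `TinyIndex` and its methods are hypothetical, and it assumes the text of each PDF or EPUB has already been pulled out by a separate extractor such as PDFBox.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


class TinyIndex:
    """Toy in-memory inverted index (hypothetical; a stand-in for Lucene)."""

    def __init__(self):
        self.texts = {}                    # filename -> extracted text
        self.postings = defaultdict(set)   # lowercased term -> filenames

    def add(self, filename, text):
        # `text` is assumed to come from your own PDF/EPUB extractor.
        self.texts[filename] = text
        for term in TOKEN.findall(text.lower()):
            self.postings[term].add(filename)

    def search_content(self, query):
        # AND semantics: a file must contain every term in the query.
        terms = TOKEN.findall(query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

    def search_filename(self, query):
        # Case-insensitive substring match on the filename only.
        q = query.lower()
        return sorted(name for name in self.texts if q in name.lower())

    def snippet(self, filename, query, width=30):
        # Excerpt around the first matching term, with the match wrapped
        # in ** ** as a plain-text highlight. Empty string if no match.
        text = self.texts[filename]
        terms = TOKEN.findall(query.lower())
        for m in TOKEN.finditer(text):
            if m.group().lower() in terms:
                start = max(0, m.start() - width)
                end = min(len(text), m.end() + width)
                return (text[start:m.start()] + "**" + m.group() + "**"
                        + text[m.end():end])
        return ""


# Usage with made-up documents (the filenames and text are illustrative).
index = TinyIndex()
index.add("solr-guide.pdf", "Solr builds on Lucene to provide full-text search.")
index.add("notes.epub", "Remember to back up the index before upgrading.")

for name in index.search_content("lucene search"):
    print(name, "->", index.snippet(name, "lucene"))
print(index.search_filename("notes"))
```

A real setup would persist the index to disk and add ranking, stemming, and phrase queries, which is the work Lucene or Solr does for you. The sketch only shows how filename search, content search, and highlighted snippets fit together once the text is extracted.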
posted by k5.user at 12:01 PM on April 25, 2013

Response by poster: @bluecore
DocFetcher looks pretty awesome and I am installing it now, but I do need something that can be web accessible.

I will give that a chance

I will try to see if I can figure some of that out :)
posted by digividal at 1:06 PM on April 25, 2013

You've got a Mac? Try DEVONthink with the built-in web server. I find its search/AI to be very good, though I only have about 16GB of content (PDFs, web archives and various other things).
posted by Brian Puccio at 4:22 PM on April 25, 2013
