Search Engines with Better Indices
July 25, 2004 12:38 PM

Does anyone know of any search engines that will also index non-plaintext documents (e.g. PDFs and PowerPoint files)? Google has a search appliance, but we're looking to spend less than $2K.
posted by hobbes to Computers & Internet (15 answers total)
 
PDFs -- Google will do this for you if you throw them up in a web directory. Of course, if these are private documents, then obviously that's a different story.

I wonder if something like this would work, too:

find . -type f -exec sh -c 'strings "$1" | grep -n searchterm' sh {} \;
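
A rough Python take on the same strings-then-grep idea, in case a shell isn't handy. The function names and the 4-character minimum are assumptions made for this sketch, and the match is case-insensitive, unlike plain grep. It only finds text that is stored uncompressed in the file; many PDFs compress their text streams, so expect misses.

```python
import re
from pathlib import Path

# Runs of 4+ printable ASCII bytes, roughly what strings(1) prints by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_strings(data: bytes) -> list:
    """Return the printable ASCII runs found in raw bytes."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def search_tree(root: str, term: str) -> list:
    """Return (path, matching string) pairs for every file under root.

    Matching is case-insensitive (an assumption of this sketch).
    """
    hits = []
    needle = term.lower()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        for text in printable_strings(path.read_bytes()):
            if needle in text.lower():
                hits.append((str(path), text))
    return hits


if __name__ == "__main__":
    for path, text in search_tree(".", "searchterm"):
        print(f"{path}: {text}")
```

This reads every file in full on every search, so it is only reasonable for a small directory tree.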
posted by weston at 12:52 PM on July 25, 2004


If you can convert them to plain text (using something like pdftotext), you can feed them into swish-e, which will happily index them. Then it's a matter of writing a script (Perl or otherwise; plenty of examples come with swish-e anyway) to query the swish-e index.

Same goes for Word documents (see: abiword); I imagine there are similar converters for PowerPoint documents as well.
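
To make the convert, index, query flow concrete, here is a toy inverted index in Python. It is only a stand-in for what an indexer does; it is not swish-e's index format or API. It assumes the text has already been extracted (for example, one .txt file per document), and the file names in the demo are made up.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return dict(index)


def query(index: dict, terms: str) -> list:
    """Return the sorted names of documents containing every query word."""
    words = tokenize(terms)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


if __name__ == "__main__":
    # Hypothetical extracted text, keyed by a made-up file name.
    docs = {
        "report.pdf.txt": "Quarterly sales report for the widget division",
        "slides.ppt.txt": "Widget roadmap and sales targets",
    }
    index = build_index(docs)
    print(query(index, "widget sales"))
    print(query(index, "roadmap"))
```

A real indexer adds ranking, stemming, and an on-disk format, which is the part you would be getting from swish-e.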

As a separate suggestion, I have heard good things about ht://dig.
posted by John Shaft at 2:03 PM on July 25, 2004


A while back, I tried several of the open-source search engines, like ht://dig and SWISH-E... The clear winner, at the time, was ASPSeek (no relation to ASP, the scripting language). It's extremely fast and stable, and drop-dead easy to install and configure. (And it works with PDFs. No word on PPT, though.)

The last release is excellent and stable, but you should know that development has stopped on ASPSeek. If that concerns you, look into mnoGoSearch, which spawned the codebase for the ASPSeek project. Or DataParkSearch, which is another descendant of mnoGoSearch.

I don't recommend ht://dig or swish-e.
posted by waxpancake at 2:28 PM on July 25, 2004


Could you expand on your objections to htdig and swish-e, waxpancake? I was about to suggest them and I'm curious why you don't like them.
posted by i_am_joe's_spleen at 2:46 PM on July 25, 2004


If you can convert them to plain text (using something like pdftotext), you can feed them into swish-e,

or just use "strings" (I'm assuming you've got UNIX available to you, either via Mac OS X, Linux, or for Windows, Cygwin) to create these files and feed them to swish-e. This is much more time efficient than the "find" command suggestion I'd made above.
posted by weston at 4:58 PM on July 25, 2004


Innerprise ES.Net Search, although it's in beta and not quite ready for prime time.
posted by falconred at 4:59 PM on July 25, 2004


A couple of others are Spy-CD and dtSearch. Neither is free, but both seem to fit your 'less than $2K' criterion.
posted by bachelor#3 at 5:56 PM on July 25, 2004


I'm under the impression that Microsoft's Index Server supports some or all of the proprietary binary formats you want to index, and I think it comes with an API to extend it further. The search functionality is terrible, but at least it can index the data.

I wouldn't recommend Index Server, but I also don't recommend keeping your data in proprietary, difficult-to-index formats, and people do that anyway.
posted by majick at 5:58 PM on July 25, 2004


I'm a fan of Glimpse/WebGlimpse.
posted by mrbill at 8:43 PM on July 25, 2004


Maybe phpDig?

Requires a bit of tweaking, but I've set it up and was pretty impressed:

PhpDig indexes HTML and text files by itself.
PhpDig can index PDF, MS-Word and MS-Excel files if you install external binaries on the spidering machine for this purpose.
PhpDig is configured to use the catdoc, xls2csv and pstotext programs.
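
The external-binary approach boils down to picking a converter by file extension and capturing its output. A minimal Python sketch of that dispatch follows; it is not phpDig's code or configuration. It assumes catdoc, xls2csv and pstotext are on the PATH and that each writes the extracted text to standard output when given a file name. The CONVERTERS table and function names are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Extension -> external converter (assumed to print text to stdout).
CONVERTERS = {
    ".doc": ["catdoc"],
    ".xls": ["xls2csv"],
    ".pdf": ["pstotext"],
}


def converter_command(path: str) -> Optional[list]:
    """Return the argv list for converting path, or None if no converter applies."""
    tool = CONVERTERS.get(Path(path).suffix.lower())
    return tool + [path] if tool else None


def extract_text(path: str, run=subprocess.run) -> str:
    """Return plain text for path, shelling out to a converter when one is registered.

    Files with no registered converter are read directly as text.
    The run parameter exists so the subprocess call can be swapped out in tests.
    """
    cmd = converter_command(path)
    if cmd is None:
        return Path(path).read_text(errors="replace")
    done = run(cmd, capture_output=True, text=True, check=True)
    return done.stdout
```

Adding another format is then a one-line change to the table, provided a command-line converter exists for it.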

posted by RavinDave at 9:42 PM on July 25, 2004


Alkaline works for me; it allows filters (a PDF one is included) to be added in a custom manner. It's more suited to Unix than Windows servers, though.
posted by BigCalm at 1:24 AM on July 26, 2004


Looks like Atomz does index PDF and PPT; however, you have to sit through a sales call to find out what it costs.
posted by nakedcodemonkey at 1:42 AM on July 26, 2004


Lucene will do this, and it is free and open source. Of course, you may have to work at it some.
posted by pissfactory at 4:23 AM on July 26, 2004


From the Apache Lucene FAQ:
"FAQ 12. How do I index other document types such as PDF and Word?

...You need to provide a parser or extractor for every document type you want to index. Hopefully future releases of Lucene will have this functionality."
Doesn't sound like it has the necessary parsers. Or perhaps the FAQ is out of date, or there are third-party parsers out there?
posted by majick at 6:15 AM on July 26, 2004


There are third-party plugins for Lucene.

You would need reasonable Java-fu, though.
posted by i_am_joe's_spleen at 12:47 PM on July 26, 2004

