How to index a password-protected site for searching?
January 17, 2007 1:37 PM   Subscribe

How do I add search to a password protected site?

I'm making a website that is a mishmash of documents, pdfs and html documents, compiling papers that people have submitted. However, I need to password protect the site while keeping it searchable. Is there a way to do this with something like Google without having the results made available to the general public? Any suggestions for a service program that can index the documents within the passworded protected part?
posted by perpetualstroll to Computers & Internet (8 answers total) 2 users marked this as a favorite
 
To block Google's crawlers, as well as other bots, check out this link. As far as password-protecting the site goes, that depends on the type of webserver it's hosted on, e.g. Apache or IIS.
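On Apache, for example, the usual approach is HTTP Basic Auth via an .htaccess file. A minimal sketch (the paths and realm name here are just examples; adjust for your server):

```apache
# .htaccess in the directory you want to protect (Apache)
AuthType Basic
AuthName "Submitted Papers"
# Keep the password file outside the web root; this path is an example
AuthUserFile /home/example/.htpasswd
Require valid-user
```

You'd create the password file with something like `htpasswd -c /home/example/.htpasswd someuser`.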

good luck
posted by ronmexico at 1:46 PM on January 17, 2007


You could add your own search engine. I use this one. I have it set up to index the site every morning at 4AM.
posted by Steven C. Den Beste at 1:57 PM on January 17, 2007


Reading your question more closely: depending on how much access you have to the webserver, you could implement an OSS search engine like Zilverline, which is based on Lucene. I've had great success implementing it internally to do full-text searches on our webservers, though that's primarily for internal use only. I also tested IBM's offering but wasn't as happy with it.
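For the curious: the core of what Lucene (and so Zilverline) does under the hood is build an inverted index mapping terms to documents. A toy sketch in Python, not Zilverline's actual API, just to show the idea:

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep only simple word tokens."""
    return [w for w in text.lower().split() if w.isalnum()]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Example documents (hypothetical filenames)
docs = {
    "paper1.html": "a study of golden retriever behavior",
    "paper2.html": "retriever breeds and the golden ratio",
}
index = build_index(docs)
```

A real engine adds tokenization for PDFs and Word files, relevance ranking, and an on-disk index format, but the lookup structure is the same.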
posted by ronmexico at 1:59 PM on January 17, 2007


Response by poster: Do Zilverline / Perlfect search also index .docs and .pdfs?
posted by perpetualstroll at 2:11 PM on January 17, 2007


Yes. PDF, Word, txt, Java, CHM, and HTML are supported, as well as zip and rar files. There could be more, but I'm having trouble finding the full list.
posted by ronmexico at 2:25 PM on January 17, 2007


That page for Perlfect Search says:
Can index PDF files (requires pdftotext, which is part of xpdf) and MS-Word files (requires antiword).
I don't know if this is important to you: Perlfect Search does not keep track of contiguity relationships, so you can search for multiple words, but you cannot search for multi-word phrases. In other words, if you search for "Golden Retriever", it won't restrict itself to cases where those two words appear together; it will turn up every page where both words appear, whether next to each other or not.
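The difference is easy to see side by side: an AND search matches both pages below, while a true phrase search would match only the first. A toy Python sketch (not Perlfect's code; the page names are made up):

```python
def words(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()

def and_search(pages, query):
    """Pages containing every query word, anywhere on the page."""
    terms = words(query)
    return [p for p, text in pages.items()
            if all(t in words(text) for t in terms)]

def phrase_search(pages, query):
    """Pages where the query words appear consecutively, in order."""
    terms = words(query)
    hits = []
    for p, text in pages.items():
        ws = words(text)
        if any(ws[i:i + len(terms)] == terms
               for i in range(len(ws) - len(terms) + 1)):
            hits.append(p)
    return hits

# Both pages contain "golden" and "retriever"; only the first has the phrase
pages = {
    "a.html": "training your golden retriever puppy",
    "b.html": "the retriever fetched the golden ball",
}
```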

On the other hand, it's database-driven, so search results are blazingly fast. (I have it set up to index my site and rebuild the database every morning at 4 AM.) My server only has a 300 MHz CPU, but results still come back almost instantly.

It's been solid as a rock; I've never had any trouble with it. (I'm using version 3.20, which is five years old. Never had any reason to upgrade it.)
posted by Steven C. Den Beste at 2:57 PM on January 17, 2007


Something to keep in mind...one way to *bypass* some password-protected websites is to view them through Google's cached pages...would that be something you're concerned about?
posted by edjusted at 9:18 PM on January 17, 2007


Response by poster: Yes, the site has to be protected from Google indexing it, but I figure that could be addressed through the robots.txt file.
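For reference, that would look something like the sketch below. robots.txt keeps well-behaved bots out, and a noarchive robots meta tag additionally asks search engines not to serve a cached copy of a page (relevant to the cache-bypass concern above, since robots.txt alone doesn't remove pages already cached):

```text
# robots.txt at the site root: disallow all compliant crawlers
User-agent: *
Disallow: /

<!-- and in the <head> of each HTML page -->
<meta name="robots" content="noindex, noarchive">
```

Note that robots.txt is advisory only; the password protection is what actually keeps content private.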
posted by perpetualstroll at 5:53 AM on January 18, 2007


This thread is closed to new comments.