How can I use php to index a site using keywords?
April 6, 2010 7:05 AM

Looking for information on developing/finding a php script to generate a Site Index (A - Z index).

My boss wants an A-Z index of our site of approximately 200 pages. He wants as many common words/descriptions of services in the index as possible. Obviously this would be a nightmare to generate and maintain manually.

I was thinking of an automated solution along these lines: put keywords/tags on all pages to be indexed, run a script that reads every page on the site and deposits the keywords/tags and their respective URLs into a database, then build the A-Z index with a script that draws the page from the db. I'm most familiar with PHP and would like to use it to spider the site.

I'm OK on all fronts except the spidering script. Can anyone steer me in the right direction with this idea?
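
(For reference, the part I'm OK on — drawing the A-Z page from the db — would look roughly like this; the site_index table of keyword/url pairs is just a sketch, not an existing schema.)
<?php
$db = new PDO('mysql:host=localhost;dbname=site', 'user', 'password');
$rows = $db->query('SELECT keyword, url FROM site_index ORDER BY keyword')
           ->fetchAll(PDO::FETCH_ASSOC);

$letter = '';
foreach ($rows as $row) {
  $first = strtoupper($row['keyword'][0]);
  if ($first !== $letter) {
    $letter = $first;
    echo "<h2>$letter</h2>\n"; // start a new letter heading when the first letter changes
  }
  echo '<a href="' . htmlspecialchars($row['url']) . '">'
     . htmlspecialchars($row['keyword']) . "</a><br>\n";
}
?>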
posted by lyam to Computers & Internet (7 answers total) 1 user marked this as a favorite
 
Would this script be running on the same server as the site? Are the pages static? If so, you can forget spidering and just walk the directory tree and index every file; see, for example, the opendir()/readdir() examples in the PHP manual. The upside is that you don't have to bother parsing the contents for links, just keywords. The downside is that you have to reconstruct the URL from the filename, so this only works if you have a mostly static site. You also have to be careful not to index files that might exist in the directory but aren't accessible through the web, such as those disallowed by server configuration (e.g. .htaccess).
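
For example, a bare-bones, single-directory sketch of that opendir()/readdir() approach (the document root, site URL, and file extensions below are placeholders, not anything from your site):
<?php
$doc_root = '/var/www/html';        // placeholder document root
$base_url = 'http://www.example.com'; // placeholder site URL

$dh = opendir($doc_root);
while (($file = readdir($dh)) !== false) {
  // Skip dot files (covers '.', '..' and .htaccess) and anything that isn't a page
  if ($file[0] === '.' || !preg_match('/\.(php|html?)$/i', $file)) {
    continue;
  }
  // Reconstruct the public URL from the filename under the doc root
  $url = $base_url . '/' . $file;
  echo $url . "\n"; // index the keywords for this page/URL here
}
closedir($dh);
?>
Descending into sub-directories would just take a recursive wrapper, along the lines of the snippets below.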
posted by Rhomboid at 7:42 AM on April 6, 2010


I don't know much text indexing, but in general you'll want to use a recursive function to do the actual spidering; something like
<?php

function spider($directory_path) {
  $contents = scandir($directory_path);
  foreach ($contents as $file) {
    // Skip the '.' and '..' entries or the recursion will never terminate
    if ($file === '.' || $file === '..') {
      continue;
    }
    $new_path = $directory_path . '/' . $file;
    if (is_dir($new_path)) {
      spider($new_path); // recurse into sub-directories
    }
    else {
      //Index $new_path here
    }
  }
}
?>
You'd probably want to run a recursive script like that from the command line to avoid request timeout issues, and if you're indexing a *huge* amount of content you might need to chunk it out to avoid memory issues.

You might also look into ht://dig (it's been years since I used it, but at the time it was pretty good) or Apache Solr (haven't used it, but I have heard very good things about it). I'm not sure what kind of functionality either of those might provide to generate a simple alphabetical keyword index, but it would probably be worth digging through the docs a little to see if you can save yourself some work.

On preview: Good points by Rhomboid about having to construct URLs and be careful about omitting restricted files. You could implement a similar recursive function that starts with a URL instead of a local path. A slapdash sketch:
function spider($url, &$visited = array()) {
  if (!isset($visited[$url])) {
    $visited[$url] = true; // remember the URL so it's only indexed once
    $html = file_get_contents($url);
    // ...Index $html here...
    // Naive link extraction; a real crawler would also resolve relative URLs
    preg_match_all('/href="([^"]+)"/i', $html, $matches);
    foreach ($matches[1] as $link) {
      spider($link, $visited);
    }
  }
}

posted by usonian at 7:56 AM on April 6, 2010


(Oh, and you'd probably also want to add some logic to follow only links to your own domain; otherwise you'd wind up trying to index the entire internet.)
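
A minimal check might look like this (example.com standing in for your own domain):
function is_internal_link($link) {
  $host = parse_url($link, PHP_URL_HOST);
  // Relative links have no host; absolute links must point at your own domain
  return $host === null || $host === 'www.example.com';
}
...and then only recurse when it passes: if (is_internal_link($link)) { spider($link, $visited); }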
posted by usonian at 7:57 AM on April 6, 2010 [1 favorite]


Yes to all questions. Thanks for the lead. I think I'll try to use those functions, along with possibly parse_str(?), to seek out specific keyword variables, populate an array, and write the array into the db when finished. Or possibly write them into the db at each successful index. Does this sound reasonable?
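
For the db step I'm imagining something like this (the site_index table and the connection details are made up):
<?php
$db = new PDO('mysql:host=localhost;dbname=site', 'user', 'password');
$stmt = $db->prepare('INSERT INTO site_index (keyword, url) VALUES (?, ?)');

// $keywords is the array built while parsing: each entry a keyword and the page it came from
foreach ($keywords as $row) {
  $stmt->execute(array($row['keyword'], $row['url']));
}
?>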
posted by lyam at 8:04 AM on April 6, 2010


on "should have" previewed, excellent thoughts usonian. Thanks to you both!
posted by lyam at 8:05 AM on April 6, 2010


OK, researching your suggestions I found this: PHP's DirectoryIterator!

So far, I've been able to successfully step through the entire directory "touching" all files I want parsed. Next step, parsing for a given variable!
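
Something as simple as this handles the "touching" part (the path is just a placeholder):
<?php
foreach (new DirectoryIterator('/var/www/html') as $item) {
  if ($item->isFile()) {
    echo $item->getPathname() . "\n"; // parse this file for keywords later
  }
}
?>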

Thanks again!
posted by lyam at 12:23 PM on April 7, 2010


Just wanted to give a final update. I ended up using the DirectoryIterator linked above in a recursive function where recursion is triggered by a sub-directory. The function seeks out specific files containing the content to be indexed. Once found, I dump the page into a string, use a regex to find a particular tag containing my keywords, parse the keywords into an array along with the page they are located on, then ultimately dump the whole thing into a db. It takes around 5 to 10 seconds to refresh the index. Now I have a db that can be used as the data source for the page clients will view to access the site index. It's updated in a flash and doesn't require tedious tracking down of content/page changes.
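
In case it helps anyone later, the rough shape of the function (the meta-keywords regex, file extensions, and array layout here are illustrative, not the exact production code):
<?php
function build_index($dir, array &$rows = array()) {
  foreach (new DirectoryIterator($dir) as $item) {
    if ($item->isDot()) {
      continue; // skip '.' and '..'
    }
    if ($item->isDir()) {
      build_index($item->getPathname(), $rows); // recursion triggered by a sub-directory
    } elseif (preg_match('/\.(php|html?)$/i', $item->getFilename())) {
      $html = file_get_contents($item->getPathname());
      // Pull the keyword list out of the tag that holds it (tag format assumed here)
      if (preg_match('/<meta name="keywords" content="([^"]*)"/i', $html, $m)) {
        foreach (explode(',', $m[1]) as $keyword) {
          // Convert the path to a public URL before writing to the db
          $rows[] = array('keyword' => trim($keyword), 'page' => $item->getPathname());
        }
      }
    }
  }
  return $rows; // dump these rows into the db in one pass
}
?>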

Thanks all for the valuable leads!
posted by lyam at 8:29 AM on April 15, 2010

