Build a web bot?
December 21, 2005 4:27 AM

Can I build/run a web bot?

My company uses an outside vendor to do some web crawling for us, and they return the results to us, which we use as part of our workflow. They look for keywords, and if the page shows up with the keywords, then it appears in our results daily.

More and more, they aren't doing a very good job. So I'd like to get away from them, and possibly do our own in-house web bot.

Any suggestions on where to begin, or prepackaged options that might help?
posted by benjh to Computers & Internet (10 answers total)
 
Both Google and Yahoo have web APIs now, so setting that up is pretty trivial if you want to harness the power of a search engine. 'Course, you could also probably use Google's web alerts... or, I swear Yahoo lets you subscribe to a search as an RSS feed now. I think.

At any rate, if you want your own web bot, it's not so hard if you know a scripting language already.

Take a look at cURL and how it integrates into a scripting language like Perl or PHP.
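
For a rough idea of the fetch-then-match-keywords loop, here is a minimal sketch in Python using only the standard library (urllib rather than cURL, which is a substitution). The function names, the User-Agent string, and the whole-word matching rule are assumptions made for illustration, not anything the vendor or a particular library prescribes.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its body as text (needs network access)."""
    # The User-Agent value is a made-up placeholder; use one that identifies you.
    req = urllib.request.Request(
        url, headers={"User-Agent": "example-keyword-bot/0.1"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def matching_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for kw in keywords:
        if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE):
            found.append(kw)
    return found
```

A daily job could then call `matching_keywords(fetch(url), keywords)` for each URL on a watch list and report the pages where the result is non-empty.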
posted by ph00dz at 5:04 AM on December 21, 2005


Mechanize can do it for ya! And you can do yourself a favor by learning Python. It's the most pleasurable programming language to deal with, hands down.
posted by Mach5 at 5:50 AM on December 21, 2005


Something like Google to RSS might do the trick
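
If the search results do arrive as an RSS feed, filtering them for a keyword is a few lines. This is only a sketch assuming a plain RSS 2.0 layout (`item` elements with `title`, `link`, and `description`); the function name is hypothetical and a real feed may differ.

```python
import xml.etree.ElementTree as ET


def items_matching(rss_xml, keyword):
    """Return (title, link) pairs for feed items mentioning keyword."""
    root = ET.fromstring(rss_xml)
    needle = keyword.lower()
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        desc = item.findtext("description", default="")
        # Case-insensitive substring match on title plus description.
        if needle in (title + " " + desc).lower():
            hits.append((title, link))
    return hits
```

The feed text itself would come from whatever HTTP fetch you already use; this function only handles the parsing and filtering.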
posted by Sharcho at 6:31 AM on December 21, 2005


If you need RSS, there are so many free crawlers out there, it's silly.

If you're just crawling flat HTML, and you need to parse it as text and pull out your essential data, then just install PHP.

PHP's fopen() function can be passed a URL just as easily as it normally receives a file name. Once you've got the text from the URL, you can prune, parse, mangle, and mash to your heart's content.
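
The "parse it as text" step could look something like the sketch below. It is written in Python rather than PHP and uses the standard library's `html.parser`; the class and function names are made up for illustration, and skipping `<script>` and `<style>` contents is an assumption about what counts as essential data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Reduce an HTML document to its visible text, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Running keyword checks against the output of `html_to_text` rather than the raw markup avoids false hits on tag names, attributes, and inline scripts.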
posted by thanotopsis at 6:41 AM on December 21, 2005


The Database of Web Robots has been around for years and lists the details on several dozen known bots, including source/binary availability and programming language. A link to the list sorted by type is http://www.robotstxt.org/wc/active/html/type.html.

The rest of the site has good background information on bot building. Do make sure your bot, pre-packaged or new, follows the robots.txt exclusion protocol or you will quickly encounter woe from many webmasters. It is a big deal to a lot of people. Not following the protocol is considered site abuse, which quickly leads to IP range lockouts and flaming public announcements.
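
As a sketch of what honoring robots.txt can look like in practice, Python's standard library ships `urllib.robotparser`. The helper name, the sample rules, and the bot name below are hypothetical; a live bot would load the real file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
allowed = rules.can_fetch("example-keyword-bot", "http://example.com/index.html")
blocked = rules.can_fetch("example-keyword-bot", "http://example.com/private/report.html")
```

Here `allowed` comes out true and `blocked` false, so the bot would fetch the first URL and skip the second. Checking `can_fetch` before every request is the whole job.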
posted by mdevore at 8:32 AM on December 21, 2005


websearch.alexa.com might be useful; they let you search their entire crawl database for dollars.
posted by AaronRaphael at 8:57 AM on December 21, 2005


It's not enough just to follow the robots.txt exclusion protocol; follow the full robot guidelines.
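
One politeness practice commonly recommended for crawlers, beyond robots.txt, is not hitting the same server in rapid succession. A minimal per-host throttle might look like the sketch below; the class name and the 5-second default are assumptions, not figures taken from the guidelines.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last_hit = {}      # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        last = self._last_hit.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_hit[host] = self._clock()
        return waited
```

Calling `throttle.wait(url)` right before each fetch spaces out requests to any one host while letting requests to different hosts proceed without delay.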
posted by mendel at 9:14 AM on December 21, 2005


Unless you are talking about a very specific and limited set of sites/domains that you want to spider, you almost certainly don't want to do it yourself -- use the Google/Yahoo/Alexa APIs already mentioned. It takes a LOT of time, CPU, bandwidth, and storage to spider any significant part of "the internet." You would have to be insane to try to do this yourself in the current day and age, rather than utilizing the engines already out there.
posted by Rhomboid at 10:41 AM on December 21, 2005


Although the web design is circa early nineties, fravia's section on Bot writing, bot trapping and bot wars on his searchlores site is quite interesting. Perhaps not exactly what you're looking for, but chock-full of information and pointers.
posted by splice at 12:04 PM on December 21, 2005


O'Reilly has a book on this very topic: Spidering Hacks.
posted by mmascolino at 1:37 PM on December 21, 2005

