Build a web bot?
December 21, 2005 4:27 AM
Can I build/run a web bot?
My company uses an outside vendor to do some web crawling for us, and they return the results to us, which we use as part of our workflow. They look for keywords, and if the page shows up with the keywords, then it appears in our results daily.
More and more, they aren't doing a very good job. So I'd like to get away from them, and possibly do our own in-house web bot.
Any suggestions on where to begin, or prepackaged options that might help?
Mechanize can do it for ya! And you can do yourself a favor by learning Python. It's the most pleasurable programming language to deal with, hands down.
posted by Mach5 at 5:50 AM on December 21, 2005
If you need RSS, there are so many free crawlers out there, it's silly.
If you're just crawling flat HTML, and you need to parse it as text and pull out your essential data, then just install PHP.
PHP's fopen() function can be passed a URL just as easily as it normally receives a file name. Once you've got the text from the URL, you can prune, parse, mangle, and mash to your heart's content.
posted by thanotopsis at 6:41 AM on December 21, 2005
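For readers who take the Python route suggested earlier in the thread, the same fetch-then-parse idea might look roughly like the sketch below. It uses only the standard library. The function names, the sample keywords, and the user agent string are made up for illustration, and plain substring matching is an assumption about what "looking for keywords" means in the asker's workflow.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def matching_keywords(html, keywords):
    """Return the keywords that occur (case-insensitively) in the page text."""
    text = html_to_text(html).lower()
    return [kw for kw in keywords if kw.lower() in text]


def fetch(url, user_agent="example-keyword-bot/0.1"):
    """Download a page and decode it. The user agent is a placeholder."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A daily run would then amount to calling `matching_keywords(fetch(url), keywords)` for each URL on the watch list and reporting the pages that return a non-empty list.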
The Database of Web Robots has been around for years and lists the details on several dozen known bots, including source/binary availability and programming language. A link to the list sorted by type is http://www.robotstxt.org/wc/active/html/type.html.
The rest of the site has good background information on bot building. Do make sure your bot, pre-packaged or new, follows the robots.txt exclusion protocol or you will quickly encounter woe from many webmasters. It is a big deal to a lot of people. Not following the protocol is considered site abuse, which quickly leads to IP range lockouts and flaming public announcements.
posted by mdevore at 8:32 AM on December 21, 2005
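As a rough illustration of honoring robots.txt, Python's standard library ships a parser for the exclusion protocol. In the sketch below, the bot name, the sample rules, and the example.com URLs are all hypothetical. A real bot would load each site's live file with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name; a real bot should use its own identifying string.
USER_AGENT = "example-keyword-bot"

# Made-up rules standing in for a site's /robots.txt file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent=USER_AGENT):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(allowed(rules, "https://example.com/index.html"))
print(allowed(rules, "https://example.com/private/report.html"))
```

The check belongs in front of every request, so that a disallowed URL is skipped before any connection is made.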
websearch.alexa.com might be useful; they let you search their entire crawl database for dollars.
posted by AaronRaphael at 8:57 AM on December 21, 2005
It's not enough just to follow the robots.txt exclusion protocol; follow the full robot guidelines.
posted by mendel at 9:14 AM on December 21, 2005
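One point commonly associated with robot guidelines is not hammering any single server. A minimal sketch of a per-host delay follows, assuming a fixed minimum gap between requests to the same host. The class name and the 5-second default are illustrative choices and are not taken from the guidelines themselves.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforces a minimum delay between successive requests to one host."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if this URL's host was hit too recently.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record when the request is actually allowed to go out.
        self._last_request[host] = now + waited
        return waited
```

Calling `throttle.wait(url)` immediately before each fetch keeps requests to any one site spaced out, while requests to different hosts proceed without delay.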
Unless you are talking about a very specific and limited set of sites/domains that you want to spider, you almost certainly don't want to do it yourself -- use the Google/Yahoo/Alexa APIs already mentioned. It takes a LOT of time, CPU, bandwidth, and storage to spider any significant part of "the internet." You would have to be insane to try to do this yourself in the current day and age, rather than utilizing the engines already out there.
posted by Rhomboid at 10:41 AM on December 21, 2005
Although the web design is circa early nineties, fravia's section on Bot writing, bot trapping and bot wars on his searchlores site is quite interesting. Perhaps not exactly what you're looking for, but chock-full of information and pointers.
posted by splice at 12:04 PM on December 21, 2005
O'Reilly has a book on this very topic: Spidering Hacks.
posted by mmascolino at 1:37 PM on December 21, 2005
This thread is closed to new comments.
At any rate, if you want your own web bot, it's not so hard if you know a scripting language already.
Take a look at cURL and how it integrates into a scripting language like Perl or PHP.
posted by ph00dz at 5:04 AM on December 21, 2005