Screen scraping etiquette
June 30, 2008 2:03 PM
I'm looking to start a 20k+ request screen scraping project. What sort of guidelines (in addition to those I already plan to implement) do I need to follow to avoid having the hounds sent out after me?
A corporate site has a collection of about 20k freely available pages that I'd like to download for a personal database. This database wouldn't be for anything but personal use. Their robots.txt lists only their sitemap URL and User-agent: *. Their terms of service are the standard rigmarole. Only two points make me think they'd have a problem with me scraping them:
"Other than connecting to Service Provider's servers by http requests using a Web browser, you may not attempt to gain access to Service Provider's servers by any means - including, without limitation, by using administrator passwords or by masquerading as an administrator while using the Service or otherwise. "
and
"You may not in any way make commercial or other unauthorized use, by publication, re-transmission, distribution, performance, caching, or otherwise, of material obtained through the Service, except as permitted by the Copyright Act or other law or as expressly permitted in writing by this Agreement, Service Provider or the Service." emphasis added.
So, while they want you to access the service with a web browser, there is no specific prohibition against automated methods, crawlers, spiders, robots etc. My other concern is the "caching" which I technically would be doing. Since Google has indexed and cached their site, it seems that this isn't really a problem though.
I've never scraped that many pages before, but my plan is to be nice about it: put in a 15-30 second delay between requests (I don't care if it takes a long time), only run the script during off-peak hours, and set the user-agent to announce myself and give my contact info.
They don't have a "webmaster" e-mail address, only feedback. Should I bother sending them an email to "ask permission"? Also, might I be better off not announcing myself with the user-agent?
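For what it's worth, here is a rough sketch of the plan described above (randomized 15-30 second delay, off-peak hours only, a user-agent that announces who I am). It is only an illustration under assumptions: the user-agent string, the contact address, the 1am-6am off-peak window, and the function names are all made up for the example, and it uses nothing beyond the Python standard library.

```python
import random
import time
import urllib.request
from datetime import datetime

# Hypothetical values: replace with your own name/contact and the
# hours that are actually quiet for the site you are fetching from.
USER_AGENT = "personal-archive-script/0.1 (contact: you@example.com)"
OFF_PEAK_START_HOUR = 1   # inclusive, assumed
OFF_PEAK_END_HOUR = 6     # exclusive, assumed


def is_off_peak(now):
    """Return True if `now` falls inside the assumed off-peak window."""
    return OFF_PEAK_START_HOUR <= now.hour < OFF_PEAK_END_HOUR


def next_delay(rng=random):
    """Pick a random pause between 15 and 30 seconds."""
    return rng.uniform(15, 30)


def build_request(url):
    """Build a request that announces the script via its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def scrape(urls, fetch=urllib.request.urlopen, sleep=time.sleep,
           clock=datetime.now):
    """Fetch each URL in turn, waiting for off-peak hours and pausing
    between requests. Returns a dict mapping URL to raw response bytes."""
    pages = {}
    for url in urls:
        # Outside the off-peak window, check again every ten minutes.
        while not is_off_peak(clock()):
            sleep(600)
        with fetch(build_request(url), timeout=30) as response:
            pages[url] = response.read()
        # Pause before the next request so the server is never hammered.
        sleep(next_delay())
    return pages
```

Calling scrape() with a list of page URLs would return the raw bytes of each page. The fetch, sleep and clock arguments are only there so the loop can be tried out with stand-ins before pointing it at the real site.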
posted by NormandyJack to computers & internet (11 comments total)
This idea might conflict with others' scruples, but I feel my answer is the best balance of ethics and utility.
posted by cellphone at 2:11 PM on June 30, 2008