Screen scraping etiquette
June 30, 2008 2:03 PM

Looking to start a 20k+ request screen scraping project, what sort of guidelines (in addition to those I plan to implement) do I need to follow to avoid having the hounds sent out after me?

A corporate site has a collection of ~20k freely available pages that I'd like to download for a personal database. This database wouldn't be for anything but personal use. Their robots.txt lists only their sitemap URL and User-agent: *. Their terms of service are the standard rigmarole. Only two points make me think they'd have a problem with me scraping them:

"Other than connecting to Service Provider's servers by http requests using a Web browser, you may not attempt to gain access to Service Provider's servers by any means - including, without limitation, by using administrator passwords or by masquerading as an administrator while using the Service or otherwise."


"You may not in any way make commercial or other unauthorized use, by publication, re-transmission, distribution, performance, *caching*, or otherwise, of material obtained through the Service, except as permitted by the Copyright Act or other law or as expressly permitted in writing by this Agreement, Service Provider or the Service." (emphasis added)

So, while they want you to access the service with a web browser, there is no specific prohibition against automated methods: crawlers, spiders, robots, etc. My other concern is the "caching," which I technically would be doing. Since Google has indexed and cached their site, though, it seems that this isn't really a problem.

I've never scraped that many pages before, but my plan is to be nice about it: put in a 15-30 second delay between requests (I don't care if it takes a long time), only run the script during off-peak hours, and set the user-agent to announce myself and my contact info.

They don't have a "webmaster" e-mail address, only feedback. Should I bother sending them an email to "ask permission"? Also, might I be better off not announcing myself with the user-agent?
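Concretely, the scraper I have in mind is roughly this; the URL list, user-agent string, and contact address are placeholders, not the real site:

```python
import random
import time
import urllib.request

# Placeholder values -- the real site, page list, and contact differ.
USER_AGENT = "PersonalArchiveBot/0.1 (contact: me@example.com)"
DELAY_RANGE = (15, 30)  # seconds to wait between requests

def fetch(url):
    """Fetch one page, announcing who we are via the user-agent."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

def scrape(urls, delay_range=DELAY_RANGE):
    """Fetch each URL with a polite pause; returns pages saved."""
    saved = 0
    for url in urls:
        try:
            page = fetch(url)
            saved += 1  # ...write `page` to disk / database here...
        except Exception as exc:
            print(f"skipping {url}: {exc}")  # log and move on; no instant retry
        time.sleep(random.uniform(*delay_range))
    return saved
```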
posted by NormandyJack to Computers & Internet (11 answers total) 3 users marked this as a favorite
For what it's worth, that sounds like a pretty standard terms of service. How can you be expected not to cache pages? There's nothing really illicit about it as long as you don't deny service to others or republish the data. I'd do it via Tor, space out the requests to be nice, and let it run for a couple of days. Just use a browser's user-agent as well.

This idea might conflict with others' scruples, but I feel my answer is the best balance of ethics and utility.
posted by cellphone at 2:11 PM on June 30, 2008

Don't email them, just go ahead and do it. You're being more than nice enough already by doing it during off-peak hours and spacing it out. But don't announce yourself with the user-agent, just in case they're insane.
posted by equalpants at 2:18 PM on June 30, 2008

I'd say hold off on providing your info (use a default user-agent) until (if) they complain. You're already being nice about it, spreading the requests and all, so I think you're in "ask for forgiveness, not permission" territory.
posted by inigo2 at 2:26 PM on June 30, 2008

I'll echo the "just go ahead and run it off-hours" comments. Either they're going to limit you very, very quickly, or they won't at all. At least, that was my experience when extensively screen-scraping to gather a corpus for a class project; they didn't take kindly to our assault and locked us down pretty fast. So try it, see if it works - it probably will - and be 'nicer' only if you have to be.
posted by Tomorrowful at 2:36 PM on June 30, 2008

Tomorrowful has it. Throttle your requests and do it during off-peak hours. If they are unhappy with your visits, there won't be any emails or conversation; they'll just block your IP. This is one of those instances where "don't ask, don't tell" is actually a good policy.

Residents of my house have done this repeatedly with 20K+ record scrapes from a variety of sources and this has consistently worked to achieve, err, the desired end.
posted by DarlingBri at 3:04 PM on June 30, 2008

I have scraped some very large sites in the past, and usually do something under a hit a minute. The only time I have been threatened was when I used a non-standard user agent (one that had my email address in it).

I would discourage doing that.
posted by SirStan at 4:31 PM on June 30, 2008

From the moral PoV, it's pretty clear that they don't want you to grab their content by automated means. You can twist your interpretation of their ToS however you like, but don't kid yourself: you will be connecting using something other than a web browser, and you will be making use of the content in a way they will see as being contrary to their copyright.

Having said that, your plan sounds OK and more respectful of their server load and bandwidth than they would be if the situation was reversed. Just remember that their concern isn't with the load on their servers, it's with controlling their data. Because of this, I'd fake a real browser ID and not give them your name / contact details. If they catch on and want that info, they can hassle your ISP for it. Depending on your ISP & the relevant laws, it's a layer of protection between you and them.

I do a daily screen scrape of around 200~400 pages from a site which is actively and overtly hostile against such activities - javascript randomisation and encryption of all the data on the pages, limiting requests to a handful per day from any one IP address, both the site owner & data supply companies actively pursuing federal court legal action against people even suspected of running scrapers for personal use only, etc. A bit of lateral thinking led to a way that gets around even this level of corporate paranoia and protectionism to grab ~300 pages/hr - but I'm afraid that if I explain it they'll notice, close that loophole, and set their QCs on me...
posted by Pinback at 5:32 PM on June 30, 2008

As a guy who has run a bunch of websites, I suspect that a 30-second delay between requests will lead to absolutely nobody noticing; you'll be below the noise level.

However, be aware that if you are only running the script for 12 hours of every day (to stay off-peak), and are letting 30 seconds elapse between each page request, you'll take the better part of two weeks to get everything: 20,000 requests × 30 seconds ≈ 167 hours, or about 14 twelve-hour days.

Also, will you be pulling down any referenced graphics as well? If so, you may have more than 20,000 files to pull down, which will make this take longer.
posted by jenkinsEar at 6:09 PM on June 30, 2008

I'd use a non-generic user-agent (e.g. RoboBot 1.0). If they block that, then you know they have a problem with it, and you can decide what to do from there (either mask the UA or announce your intentions). Odds are they won't notice.

(Incidentally, if you do decide to mask the UA, I suggest using GoogleBot)
posted by meta_eli at 6:47 PM on June 30, 2008

My suggestions, from doing crawling and scraping for a dot com:

- pick a user agent that says you're a bot.

You're about to make 20k requests, and if they look at their logs at all, they're going to notice 20k requests from the same IP. So you might as well signal that you're on the up and up by using a realistic user-agent.

- do no more than one request every 5 seconds. For a big site, a request rate this low is not going to trip their bandwidth meter or peg their server's CPU.
posted by zippy at 9:05 PM on June 30, 2008

A couple of extra tips, from someone who's both scraped lots of data and defended a website from scrapers:

Delay is good. Random delay is better. If the requests don't show up like clockwork every N seconds they're harder to identify.

Be very, very certain your scraper does not have some failure mode where it re-fetches the request immediately if the request fails. That quickly leads to misery.
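A rough sketch of both of those points in Python (the delay and retry numbers here are just illustrative, and `fetch` is whatever function actually downloads a page):

```python
import random
import time

def polite_fetch(fetch, url, base_delay=20.0, jitter=10.0, max_tries=3):
    """Call fetch(url), pausing a randomized delay first and backing
    off exponentially on failure instead of re-fetching immediately."""
    for attempt in range(max_tries):
        # Randomized pause so requests don't arrive like clockwork.
        time.sleep(base_delay + random.uniform(0, jitter))
        try:
            return fetch(url)
        except Exception:
            # Wait longer after each failure; never hammer the server.
            time.sleep(base_delay * 2 ** attempt)
    return None  # give up on this page after max_tries failures
```

Calling this once per URL gives you both the jitter between requests and a failure mode that slows down rather than speeds up.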

You're more likely to succeed if you emulate an MSIE user-agent string. If you want to be polite and don't care if you're caught, by all means put your email address in the user-agent.

For particularly unfriendly sites I've needed to scrape, I've gotten pages via Tor. It's much slower and less reliable, but now your requests are coming from a bunch of different IPs.
posted by Nelson at 8:51 AM on July 1, 2008 [1 favorite]
