<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Screen scraping etiquette</title>
	<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette/</link>
	<description>Comments on Ask MetaFilter post Screen scraping etiquette</description>
	<pubDate>Mon, 30 Jun 2008 14:11:48 -0800</pubDate>
	<lastBuildDate>Mon, 30 Jun 2008 14:11:48 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: Screen scraping etiquette</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette</link>	
		<description>I&apos;m looking to start a 20k+ request screen-scraping project. What sort of guidelines (in addition to those I plan to implement) do I need to follow to avoid having the hounds sent out after me? &lt;br /&gt;&lt;br /&gt; A corporate site has a collection of about 20k freely available pages that I&apos;d like to download for a personal database. This database wouldn&apos;t be for anything but personal use. Their robots.txt lists only their sitemap URL and User-agent: *. Their terms of service are the standard rigmarole. Only two points make me think they&apos;d have a problem with me scraping them: &lt;br&gt;
&lt;br&gt;
&quot;Other than connecting to Service Provider&apos;s servers by http requests using a Web browser, you may not attempt to gain access to Service Provider&apos;s servers by any means - including, without limitation, by using administrator passwords or by masquerading as an administrator while using the Service or otherwise. &quot;&lt;br&gt;
&lt;br&gt;
and &lt;br&gt;
&lt;br&gt;
&quot;You may not in any way make commercial or other unauthorized use, by publication, re-transmission, distribution, performance, &lt;em&gt;caching&lt;/em&gt;, or otherwise, of material obtained through the Service, except as permitted by the Copyright Act or other law or as expressly permitted in writing by this Agreement, Service Provider or the Service.&quot; emphasis added.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
So, while they want you to access the service with a web browser, there is no specific prohibition against automated methods, crawlers, spiders, robots etc.  My other concern is the &quot;caching&quot; which I technically would be doing. Since Google has indexed and cached their site, it seems that this isn&apos;t really a problem though. &lt;br&gt;
&lt;br&gt;
I&apos;ve never scraped that many pages before, but my plan was to be nice about it: put in a 15-30 sec delay between requests (I don&apos;t care if it takes a long time), only run the script during off-peak hours, and set the user-agent to announce myself and my contact info. &lt;br&gt;
&lt;br&gt;
They don&apos;t have a &quot;webmaster&quot; e-mail address, only feedback. Should I bother sending them an email to &quot;ask permission&quot;? Also, might I be better off not announcing myself with the user-agent?</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2008:site.95415</guid>
		<pubDate>Mon, 30 Jun 2008 14:03:58 -0800</pubDate>
		<dc:creator>NormandyJack</dc:creator>
		
			<category>programming</category>
		
			<category>screenscraping</category>
		
			<category>internet</category>
		
	</item> <item>
		<title>By: cellphone</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1392863</link>	
		<description>For what it&apos;s worth, those sound like pretty standard terms of service. How can you be expected to not cache pages? There&apos;s nothing really illicit about it as long as you don&apos;t deny service to others or republish the data. I&apos;d do it via Tor, space out the requests to be nice, and let it run for a couple of days. Just use the user-agent for a browser, as well.&lt;br&gt;
&lt;br&gt;
This idea might conflict with others&apos; scruples, but I feel my answer is the best balance of ethics and utility.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1392863</guid>
		<pubDate>Mon, 30 Jun 2008 14:11:48 -0800</pubDate>
		<dc:creator>cellphone</dc:creator>
	</item><item>
		<title>By: equalpants</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1392872</link>	
		<description>Don&apos;t email them, just go ahead and do it.  You&apos;re being more than nice enough already by doing it during off-peak hours and spacing it out.  But don&apos;t announce yourself with the user-agent, just in case they&apos;re insane.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1392872</guid>
		<pubDate>Mon, 30 Jun 2008 14:18:00 -0800</pubDate>
		<dc:creator>equalpants</dc:creator>
	</item><item>
		<title>By: inigo2</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1392884</link>	
		<description>I&apos;d say hold off on providing your info (use a default user-agent) until (if) they complain. You&apos;re already being nice about it, spreading the requests and all, so I think you&apos;re in &quot;ask for forgiveness, not permission&quot; territory.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1392884</guid>
		<pubDate>Mon, 30 Jun 2008 14:26:22 -0800</pubDate>
		<dc:creator>inigo2</dc:creator>
	</item><item>
		<title>By: Tomorrowful</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1392897</link>	
		<description>I&apos;ll echo the &quot;just go ahead and run it off-hours&quot; comments. Either they&apos;re going to limit you very, very quickly, or they won&apos;t at all. At least, that was my experience when extensively screen-scraping del.icio.us to gather a corpus for a class project; they didn&apos;t take kindly to our assault and locked us down pretty fast. So try it, see if it works - it probably will - and be &apos;nicer&apos; only if you have to be.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1392897</guid>
		<pubDate>Mon, 30 Jun 2008 14:36:17 -0800</pubDate>
		<dc:creator>Tomorrowful</dc:creator>
	</item><item>
		<title>By: DarlingBri</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1392935</link>	
		<description>&lt;strong&gt;Tomorrowful&lt;/strong&gt; has it. Throttle your requests and do it during off-peak hours. If they are unhappy with your visits, there won&apos;t be any emails or conversation; they&apos;ll just block your IP. This is one of those instances where &quot;don&apos;t ask, don&apos;t tell&quot; is actually a &lt;em&gt;good&lt;/em&gt; policy. &lt;br&gt;
&lt;br&gt;
Residents of my house have done this repeatedly with 20K+ record scrapes from a variety of sources and this has consistently worked to achieve, err, the desired end.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1392935</guid>
		<pubDate>Mon, 30 Jun 2008 15:04:08 -0800</pubDate>
		<dc:creator>DarlingBri</dc:creator>
	</item><item>
		<title>By: SirStan</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1393022</link>	
		<description>I have scraped some very large sites in the past -- and usually do something under a hit a minute.  The only time I have been threatened was when I used a non-standard user agent (that had my email address in it).  &lt;br&gt;
&lt;br&gt;
I would discourage doing that.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1393022</guid>
		<pubDate>Mon, 30 Jun 2008 16:31:40 -0800</pubDate>
		<dc:creator>SirStan</dc:creator>
	</item><item>
		<title>By: Pinback</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1393067</link>	
		<description>From the moral PoV, it&apos;s pretty clear that they don&apos;t want you to grab their content by automated means. You can twist your interpretation of their ToS however you like, but don&apos;t kid yourself - you &lt;em&gt;will&lt;/em&gt; be connecting using other than a web browser, and you &lt;em&gt;will&lt;/em&gt; be making use of the content in a way &lt;u&gt;&lt;i&gt;they will see&lt;/i&gt;&lt;/u&gt; as being contrary to their copyright.&lt;br&gt;
&lt;br&gt;
Having said that, your plan sounds OK and more respectful of their server load and bandwidth than they would be if the situation was reversed. Just remember that their concern &lt;strong&gt;&lt;em&gt;isn&apos;t&lt;/em&gt;&lt;/strong&gt; with the load on their servers, it&apos;s with controlling their data. Because of this, I&apos;d fake a real browser ID and not give them your name / contact details. If they catch on and want that info, they can hassle your ISP for it. Depending on your ISP &amp;amp; the relevant laws, it&apos;s a layer of protection between you and them.&lt;br&gt;
&lt;br&gt;
I do a daily screen scrape of around 200~400 pages from a site which is actively and overtly hostile against such activities - javascript randomisation and encryption of all the data on the pages, limiting requests to a handful per day from any one IP address, both the site owner &amp;amp; data supply companies actively pursuing federal court legal action against people even &lt;strong&gt;&lt;em&gt;suspected&lt;/em&gt;&lt;/strong&gt; of running scrapers for personal use only, etc. A bit of lateral thinking led to a way that gets around even this level of corporate paranoia and protectionism to grab ~300 pages/hr - but I&apos;m afraid that if I explain it they&apos;ll notice, close that loophole, and set their QCs on me...</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1393067</guid>
		<pubDate>Mon, 30 Jun 2008 17:32:57 -0800</pubDate>
		<dc:creator>Pinback</dc:creator>
	</item><item>
		<title>By: jenkinsEar</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1393095</link>	
		<description>As a guy who has run a bunch of websites, I suspect that a 30 second delay between requests will lead to absolutely nobody noticing; you&apos;ll be below the noise level.&lt;br&gt;
&lt;br&gt;
However, be aware that if you are only running the script for 12 hours of every day (to be non-peak), and are letting 30 seconds elapse between each page request, you&apos;ll take the better part of two weeks in order to get everything.&lt;br&gt;
&lt;br&gt;
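As a rough check of that estimate, here is a small sketch. It counts only the waiting time between requests and assumes exactly 20,000 pages; the function name is made up for illustration.&lt;br&gt;

```python
def crawl_days(pages, delay_seconds, hours_per_day):
    """Estimate how many calendar days a throttled crawl needs.

    Only the pauses between requests are counted; download time is ignored.
    """
    total_hours = pages * delay_seconds / 3600
    return total_hours / hours_per_day


# Assumed figures from this thread: 20,000 pages, 30 s apart, 12 h per day.
days = crawl_days(20_000, 30, 12)
print(f"about {days:.1f} days")  # prints "about 13.9 days"
```

Dropping the delay to 15 seconds would halve that figure.&lt;br&gt;
&lt;br&gt;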
Also, will you be pulling down any referenced graphics as well? If so, you may have more than 20,000 files to pull down, which will make this take longer.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1393095</guid>
		<pubDate>Mon, 30 Jun 2008 18:09:09 -0800</pubDate>
		<dc:creator>jenkinsEar</dc:creator>
	</item><item>
		<title>By: meta_eli</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1393125</link>	
		<description>I&apos;d use a non-generic user-agent (e.g. RoboBot 1.0). If they block that, then you know they have a problem with it and can decide what to do from there (either mask the UA or announce your intentions). Odds are they won&apos;t notice.&lt;br&gt;
&lt;br&gt;
(Incidentally, if you do decide to mask the UA, I suggest using GoogleBot)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1393125</guid>
		<pubDate>Mon, 30 Jun 2008 18:47:47 -0800</pubDate>
		<dc:creator>meta_eli</dc:creator>
	</item><item>
		<title>By: zippy</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1393219</link>	
		<description>My suggestions, from doing crawling and scraping for a dot com:&lt;br&gt;
&lt;br&gt;
- pick a user agent that says you&apos;re a bot. &lt;br&gt;
&lt;br&gt;
You&apos;re about to make 20k requests, and if they look at their logs at all, they&apos;re going to notice 20k requests from the same IP. So you might as well signal that you&apos;re on the up and up by using an honest user-agent.&lt;br&gt;
&lt;br&gt;
- do no more than 1 request every 5 seconds. For a big site, a request rate this low is not going to trip their bandwidth meter or peg their server&apos;s cpu.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1393219</guid>
		<pubDate>Mon, 30 Jun 2008 21:05:07 -0800</pubDate>
		<dc:creator>zippy</dc:creator>
	</item><item>
		<title>By: Nelson</title>
		<link>http://ask.metafilter.com/95415/Screen-scraping-etiquette#1393507</link>	
		<description>A couple of extra tips, from someone who&apos;s both scraped lots of data and defended a website from scrapers:&lt;br&gt;
&lt;br&gt;
Delay is good. Random delay is better. If the requests don&apos;t show up like clockwork every N seconds they&apos;re harder to identify.&lt;br&gt;
&lt;br&gt;
Be very, very certain your scraper does not have some failure mode where it re-fetches the request immediately if the request fails. That quickly leads to misery.&lt;br&gt;
&lt;br&gt;
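A minimal sketch of the two points above (random delay, and no immediate re-fetch after a failure). The names, the 15-30 second range and the backoff numbers are assumptions for illustration, and fetch is a hypothetical callable you supply yourself.&lt;br&gt;

```python
import random
import time


def polite_delay(base=15.0, jitter=15.0):
    """Return a randomised pause between base and base + jitter seconds."""
    return base + random.uniform(0.0, jitter)


def backoff_delay(failures, base=30.0, cap=3600.0):
    """Return a pause that doubles with each consecutive failure, up to cap."""
    return min(cap, base * 2 ** (failures - 1))


def crawl(urls, fetch, sleep=time.sleep, max_failures=5):
    """Fetch each URL, always pausing before the next request or retry.

    fetch is any callable taking a URL and returning the page body; it is
    expected to raise OSError on failure (urllib.error.URLError is a
    subclass of OSError).
    """
    results = {}
    for url in urls:
        failures = 0
        while True:
            try:
                results[url] = fetch(url)
            except OSError:
                failures += 1
                # Wait longer after every failure; never retry straight away.
                sleep(backoff_delay(failures))
                if failures == max_failures:
                    results[url] = None  # give up on this URL
                    break
            else:
                # Success: pause a random amount so requests are not clockwork.
                sleep(polite_delay())
                break
    return results
```

Passing sleep in as a parameter makes it easy to dry-run the loop without actually waiting.&lt;br&gt;
&lt;br&gt;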
You&apos;re more likely to succeed if you emulate an MSIE user-agent string. If you want to be polite and don&apos;t care if you&apos;re caught, by all means put your email address in the user-agent.&lt;br&gt;
&lt;br&gt;
For particularly unfriendly sites I&apos;ve needed to scrape, I&apos;ve gotten pages via Tor. It&apos;s much slower and less reliable, but now your requests are coming from a bunch of different IPs.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.95415-1393507</guid>
		<pubDate>Tue, 01 Jul 2008 08:51:56 -0800</pubDate>
		<dc:creator>Nelson</dc:creator>
	</item>
	</channel>
</rss>
