Website scraping vs TOS
October 16, 2015 5:59 PM

I'm working on a personal project to teach myself a new programming skill, specifically scraping website data. The project will analyze some data that I can get from a publicly available website, and I think others in my industry would be interested in the results. Before I get too far into this, I want to make sure I'm not violating any of the site's terms of service, and find out whether this is a generally acceptable practice.

To get this data, I need to load a web page with certain parameters in the url, scrape the data, rinse and repeat...about 1500 times. Additionally, I'd like to publish the results of this project on a public facing and free website. I've identified a couple parts of the TOS that seem relevant:

You agree not to access (or attempt to access) the Website by any means other than through the interface that is provided by [the company], unless you have been specifically allowed to do so in a separate agreement with [the company]. You agree that you will not engage in any activity that interferes with or disrupts the Website (or the servers and networks which are connected to the Website). Unless you have been specifically permitted to do so in a separate agreement with [the company], you agree that you will not reproduce, duplicate, copy, sell, trade or resell the Website for any purpose. You agree that you are solely responsible for (and that [the company] has no responsibility to you or to any third party for) any breach of your obligations under this Agreement and for the consequences (including any loss or damage which [the company] may suffer) of any such breach.

And...

[The] Website is for your personal, non-commercial use only, if you have no other agreement with [the company].

I think I'm good for the following reasons:
- I am accessing the data through a web browser, like any other user would. In fact, I could accomplish exactly what I am doing manually, although it would be very slow, which is why I have chosen to write a program to do it.
- I plan on limiting my requests to 1/second, which should not interfere with or disrupt the website. This is a site with well over 100 million users per month; my 1500 requests are a drop in the bucket.
- When I publish my project, it will be freely available to the public, i.e., I'm not using the data commercially. I also plan on citing the website as a source for my data.

Am I missing anything here? There are a couple of other sites where I may need to do something similar, but this one will generate a relatively large number of requests in a short period of time, which is why I'm concerned about it in particular.
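
A minimal sketch of the loop described above, assuming Python. The endpoint `https://example.com/search` and the `page` parameter are hypothetical stand-ins, since the real site's URL scheme isn't given here. The `fetch` callable is passed in so the pacing logic can be tried without touching any server.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://example.com/search"  # hypothetical; stands in for the real site


def build_urls(base_url, param_sets):
    """Return one URL per parameter dict, e.g. {'page': 3} -> '...?page=3'."""
    return [f"{base_url}?{urlencode(params)}" for params in param_sets]


def scrape_all(urls, fetch, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, starting at most one request per min_interval seconds."""
    results = {}
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the previous request began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[url] = fetch(url)
    return results
```

At one request per second, 1500 requests would take about 25 minutes in total.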
posted by noneuclidean to Computers & Internet (8 answers total) 3 users marked this as a favorite
 
Best answer: You could always ask.

About the only thing that sets me off here is the 1/second parameter you are talking about.

I'd block this faster than you can imagine. I use throttling on my sites, so if the same IP makes more than 30 requests in a minute you'll get a "You're doing this too much!" message. I'd have to actually check the numbers to see exactly where my threshold is, but it's there primarily to prevent DoS attacks and scrapers.
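
A rough sketch of this kind of per-IP throttle, assuming Python and a sliding one-minute window. It is not the configuration of any real site: the 30-per-minute limit is the figure mentioned above, and the class and method names are made up for illustration.

```python
import time
from collections import defaultdict, deque


class IpThrottle:
    """Sliding-window limiter: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit=30, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # caller would answer with a "too many requests" message
        recent.append(now)
        return True
```

Under these assumptions, a client sending one request per second would be refused on its 31st request, which is why the 1/second figure stands out.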

Otherwise I think you'll be OK.

Things can get wonky though. The Aaron Swartz case revolved around him accessing data he was allowed to access (just not in the way they wanted it accessed). The Computer Fraud and Abuse Act is pretty damn vague.
posted by cjorgensen at 6:23 PM on October 16, 2015


I think it really depends on the nature of the data. If whoever owns this website considers the data to be something they own, then they may well come after you if you scrape it and post it publicly. No matter if you may be technically in the right, you don't want a lawsuit on your hands.

Can you provide a little more info about what type of data you are dealing with?
posted by ssg at 8:46 PM on October 16, 2015


Best answer: IANAL...

But...

What you say you are going to do does sound like a violation of the terms "you will not reproduce, duplicate, copy, sell, trade or resell". I read that as a list of ORs, not a list of ANDs, meaning any one of them is enough, and you intend to both copy/duplicate and reproduce (i.e., share) the data.

Now, it's possible that this is boilerplate legal wording, and that the site wouldn't actually mind if you did this, but you won't know this unless you ask permission.

It's my understanding that the law is vague enough about things like this that if they wanted to bring a criminal complaint against you, they might be able to (as well as a civil complaint).

You might want to ask them before doing this.

Also, have you thought about using curl, wget, or perhaps a python (or other language) library to do this work? You might want to explore such options.
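
For the Python route, here is a minimal sketch using only the standard library's `urllib`. The URL, query parameters and User-Agent string are placeholders, not anything taken from the site in question.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def make_request(base_url, params,
                 user_agent="personal-research-script (contact: you@example.com)"):
    """Build a GET request with query parameters and an identifying User-Agent."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(request, timeout=10):
    """Send the request and return the body decoded as text (UTF-8 assumed)."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Something like `fetch_html(make_request("https://example.com/data", {"id": 42}))` would then return the page source for parsing.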

I agree you might want to get an agreement with the website before continuing, or perhaps consult with an actual IP lawyer (although I imagine such a consultation would cost more than this personal project can justify).
posted by el io at 10:02 PM on October 16, 2015


i think you are probably violating terms, but i would do it anyway (reading once per minute for a day). they probably won't notice. if they notice, they probably won't care. if they do notice and care, the worst that is likely to happen is that they ask you to take it down, and you've still learnt from the project.

unless you're purposefully bending the description this doesn't sound like aaron s territory.

on the other hand, if the idea that you might have to take it down causes you pain, then that's an indicator that this wasn't just a fun thing to learn that others might enjoy, but something with bigger, more commercial-like ideas, and you shouldn't be doing it.
posted by andrewcooke at 3:53 AM on October 17, 2015 [2 favorites]


For what it's worth, this guy got sued by Facebook for using a crawler to access the site in violation of the TOS.
posted by DB Cooper at 4:11 PM on October 17, 2015


Almost got sued. Accidentally a word out.
posted by DB Cooper at 4:31 PM on October 17, 2015


Use a random number for scraping intervals - some distribution resembling a human being instead of 1 second intervals. Does a person click at 1 second intervals? No, only a machine.
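
One way to sketch this in Python, assuming a fixed base delay plus uniform random jitter. The distribution and bounds are illustrative guesses, not a measured model of how a person clicks.

```python
import random


def jittered_delays(n, base=1.0, jitter=2.0, rng=None):
    """Return n delays in seconds, each `base` plus a uniform random extra in [0, jitter]."""
    rng = rng or random.Random()
    return [base + rng.uniform(0.0, jitter) for _ in range(n)]
```

Each value would be passed to `time.sleep` between consecutive requests.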
posted by oceanjesse at 7:37 AM on October 18, 2015


Response by poster: Thanks for the feedback everyone. Looks like it may not be wise to proceed with the full scraping and publication that I was originally intending. Contacting them is a good idea. If they say yes, then awesome, and if they say no, I can still run a smaller, by-hand analysis for my own personal edification. And hey, I already have working code, so at least I've learned something from that part of the project.
posted by noneuclidean at 2:55 PM on October 18, 2015

