Does anybody have experience selling raw data?

Does anybody here run a business selling raw data (formatted and presentable, of course)? I know a few sites that do it (, for example). I would like to know the legal and ethical things to keep in mind while collecting (scraping???) and selling data.
"data" can mean a lot of things. What kind of data is this question about?
posted by jrockway at 10:15 PM on October 24, 2009

This is a confusing question. Perhaps you should clarify what you are asking?
posted by killdevil at 11:21 PM on October 24, 2009

It sounds like he's talking about scraping data off the web and selling it in table format. Kind of an esoteric profession, but that's kinda what Fetch Technologies does.
posted by delmoi at 1:24 AM on October 25, 2009 [1 favorite]

The nature of the data is what matters: is it data people want? Without knowing what specific informational niche you are attempting to fill, the generic term "data" is not very helpful. Anyone can use off-the-shelf software to spider the web for raw data, re-package it in various formats, and try to sell it. But yes, depending on what the data is and how it has been used, re-selling it without a license can be considered illegal. All of that is moot, though, unless you have a specific audience in mind who is really looking for the data you hope to supply. Most of the high-end information that exists (scientific, technical, medical, financial, legal, etc.) is already, from a business perspective, handled by major companies, and the good stuff is hidden behind a firewall in paid subscription models. So that stuff is not accessible unless you pay for it, and re-packaging it for re-sale is definitely, definitely illegal.
posted by HP LaserJet P10006 at 1:32 AM on October 25, 2009

If you go to, you will see that the data the OP is referring to is of the nature described by delmoi.
posted by !Jim at 3:02 AM on October 25, 2009

Yeah, you really need to be more specific. The source of the data is the key thing with regard to legal and ethical issues. If it's made available under one of the open-source licenses which specifically allows commercial use, you should be fine (but in that case, anyone who *pays* you for it is an idiot). If not, you're almost certainly violating copyright (and generally being a parasite).

And if you're actually *scraping* data from other sites (an inherently unreliable technique, and an even bigger red flag that you're violating copyright), you are basically in the same camp as spammers, domain squatters, and Wikipedia mirrors. (Okay, the last is technically legal under Wikipedia's license, but it still hurts everyone except the owner of the mirror.)
posted by ixohoxi at 5:35 AM on October 25, 2009

Response by poster: OK, not sure why the question wasn't clear.

I speak of any data that can be collected, cleaned, repackaged (in various formats) and sold, for profit. Most data cannot be copyrighted - this means, the data itself can't be copyrighted, as per my understanding. For example, I can sell baseball scores data, nobody "owns" that data. The only thing that I (or anybody else) can claim copyright to, is the "format" of the data. That is my understanding.

I am not violating copyright, at least in many cases, if I scrape data (for example, sports scores, lists of Nobel Prize winners, etc.). I am only being a parasite, as you put it. Also, in some cases, the only way to get data is to scrape it. A lot of legitimate companies do it, and a lot of data visualization experts get their data this way. If you pick up any book on data visualization, you'll see at least one chapter on scraping, and how to do it in a decent/meaningful/respectful way. If you want, I can give examples.
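
As a rough sketch of what "respectful" can mean in practice: check a site's robots.txt before fetching anything, and honor any crawl delay it asks for. The robots.txt rules, the "example-bot" name, and the example.com URLs below are all made up for illustration; the parsing itself is done by Python's standard urllib.robotparser module. This only covers scraping etiquette and says nothing about whether re-selling the result is legal.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real scraper would download this from the target site first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# "example-bot" and the URLs are assumptions for illustration only.
print(may_fetch(parser, "example-bot", "https://example.com/scores/2009"))
print(may_fetch(parser, "example-bot", "https://example.com/private/data"))

# Seconds the site asks crawlers to wait between requests (None if unset).
print(parser.crawl_delay("example-bot"))
```

A scraper built on this would skip any URL for which may_fetch returns False, and sleep for the crawl delay between requests.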

Also, what do you think search engines do? They save entire websites on their servers (cache them, I mean). If you think scraping is bad, then no search engine can exist.
posted by raghuram at 6:03 AM on October 25, 2009

No first-hand experience in this realm, but I did have an interesting conversation about it on YCombinator regarding the legal framework around copyright. The parent post was specifically about AggData. Nobody involved in the conversation was a lawyer, but it was interesting.
posted by cschneid at 10:30 AM on October 25, 2009

Most data cannot be copyrighted - this means, the data itself can't be copyrighted, as per my understanding.

Not quite. Under U.S. law the basic facts are not copyrightable; however, the compilation of them is, if it is considered sufficiently novel. This prevents copying of substantial portions of a database in some instances [1].

Under EU law, however (Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases) [2], databases are afforded protections that extend beyond this, including sui generis database rights [3] that allow the owner to object to copying of a substantial portion of a database even if it is extracted and reconstructed.

posted by tallus at 12:05 PM on October 25, 2009

I'm not a lawyer but I did sell high-end data for information companies.

It's a fine line between "raw" data and re-packaged data.

You can definitely be sued in some instances: context is everything.

For instance, if one were to extract a lot of data from a proprietary database (say, a "Mergers and Acquisitions" database sold by Thomson Financial) and attempt to re-sell it, there is a good chance one would be threatened with legal action.

Some data (medical and legal) is mandated by law to be free and accessible, although people re-package that stuff and re-sell it as well (such as in abstract databases). Search engines are not a good example. Most blue-chip information is hidden behind a firewall in proprietary (subscription) databases: STM journals, industry newsletters, etc.

People like to pretend the internet has leveled the playing field for information: but the opposite is actually the case. If you are doing certain kinds of research, the information you need is often only accessible if your corporate (or academic) librarian subscribes to what you need.
posted by HP LaserJet P10006 at 7:02 PM on October 25, 2009

