Can I build a Craigslist Crawler?
August 27, 2009 3:01 PM

Can I safely and legally build a Craigslist crawler? If so, help me figure out how! See inside for details...

I want to have a crawler built to scan Craigslist in all U.S. cities. I'm launching a free online marketplace for the re-sale of niche products, and want to be able to contact Craigslist sellers to inform them about posting ads on my site. Is it legal? And if so, does anybody know what Craigslist has put in place to block this type of thing? Does CL have a daily e-mail limit? Do they change their parameters frequently so that crawlers have to be modified?

Hopefully, this doesn't raise anybody's evil-spammer-red-flags. I hope to never send an e-mail to anyone who wouldn't be interested! I'm just trying to reach a very targeted audience of sellers, and offering them a (free) alternative to CL. If there's something un-kosher about that, you can also let me know why. Nicely please :)

Thanks MeFites!
posted by wetpaint to Computers & Internet (15 answers total) 1 user marked this as a favorite
 
Kind of *previously*.
posted by edmz at 3:05 PM on August 27, 2009


Craigslist, for some reason, is extremely averse to any third parties doing basically anything to it. This month's Wired has a good story on Craigslist, including their policy of putting up roadblocks whenever someone tries to do anything like this.

See this thread (posted yesterday): http://ask.metafilter.com/131234/Craigslist-automated-notifications-would-be-a-good-idea
posted by Hargrimm at 3:06 PM on August 27, 2009


It sounds spammy as you've described it. Perhaps I'm reading this wrong, but you want to automate gathering the email addresses of people who have posted classified ads on Craigslist so that you can send them an email advertising your services. Unless I've massively misread your intentions, that's just plain spamming.
posted by jamaro at 3:07 PM on August 27, 2009 [1 favorite]


I assume you've read the recent CL AskMe about why there aren't more apps that run sort of on top of CL? I know that's not what you're planning specifically, but it might be worth reading the sorts of things that the CL people do try to thwart.
posted by jessamyn at 3:08 PM on August 27, 2009


Well, you may notice that some CL posts explicitly say "it's NOT ok to contact this poster with services or other commercial interests" - if a post says that, and you contact that poster advertising your website, that would pretty much make you a spammer.
posted by Mike1024 at 3:09 PM on August 27, 2009


thirding or fourthing spammer. how would you know whether or not you're emailing a client who is interested without emailing them in the first place?

plus, it sounds like you are deliberately trying to steal some of the CL clientbase, which definitely seems like a cheap shortcut.
posted by Think_Long at 3:13 PM on August 27, 2009


"I'm launching a free online marketplace for the re-sale of niche products"

Ignoring the issues of (1) spam, and (2) technical roadblocks thrown up by Craigslist, how, exactly, do you think this business would be successful, if Craigslist has already become a "free online marketplace for the re-sale of niche products"?

I'd be asking that question first, before any of these ancillary ones.
posted by dfriedman at 3:15 PM on August 27, 2009


I did a series of analyses of Craigslist m4m ads based on my own crawl of their data. It was quite easy; just a cron job that did a wget of the RSS file for the forum I was interested in. No one contacted me before, during, or after the project to tell me what I was doing was wrong. Just follow the usual rules of politeness for scrapers and they'll probably never notice. I didn't even bother masquerading my User-Agent.
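
For anyone curious what the analysis side of that might look like once the feed file is already saved to disk, here is a minimal sketch in Python. It assumes a plain RSS 2.0 layout; the sample feed, the example.org links, and the parse_items helper are all made up for illustration, and a real feed may be structured differently (namespaces, RDF-style RSS 1.0, and so on).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a feed file saved by the cron job.
# The layout (plain RSS 2.0) is an assumption, not a real site's feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>example listings</title>
    <item>
      <title>Vintage synthesizer - $400</title>
      <link>http://example.org/post/1</link>
    </item>
    <item>
      <title>Road bike, 54cm</title>
      <link>http://example.org/post/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(feed_text):
    """Return a list of (title, link) tuples, one per <item> element."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


for title, link in parse_items(SAMPLE_FEED):
    print(title, link)
```

This only reads text that has already been downloaded, so it says nothing about whether fetching the feed is permitted in the first place.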

That being said, emailing a bunch of Craigslist advertisers to say "try my service!" sounds like spam. Please don't do that without explicit, informed opt-in on the part of the folks you are emailing.
posted by Nelson at 3:29 PM on August 27, 2009


And if so, does anybody know what Craigslist has put in place to block this type of thing?

/robots.txt on every site you are crawling should be your first stop for any web spidering rules. http://www.robotstxt.org/robotstxt.html should provide you some basic information about how it is used.

It looks like http://www.craigslist.org/robots.txt disallows crawlers from many, if not all, of the sub-sections you would want to crawl.
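
If you want to check rules like that programmatically, Python's standard library includes a robots.txt parser. The sketch below is only an illustration: the rules in SAMPLE_ROBOTS, the "examplebot" user agent, and the example.org URLs are invented and are not the contents of any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search/
Disallow: /reply
"""


def allowed(robots_text, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(SAMPLE_ROBOTS, "examplebot", "http://example.org/search/sss"))
print(allowed(SAMPLE_ROBOTS, "examplebot", "http://example.org/about/"))
```

With these made-up rules the first URL is disallowed and the second is allowed. Passing this check only means the crawl respects robots.txt; it does not settle anything about a site's terms of use.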
posted by prak at 3:53 PM on August 27, 2009


I know someone who crawled a popular classified site for an academic research project. She wasn't doing anything remotely commercial in nature. She didn't hear anything from them at the time, but they traced her IP address, and several years later she was served with a court summons. She was able to get a lawyer and negotiate a settlement.

Whether you're legally right or wrong, don't do this if you can't afford a lawyer.
posted by miyabo at 3:58 PM on August 27, 2009


If I'm reading it correctly, it's against the Craigslist Terms of Use to do this. From section 7r-u:
Additionally, you agree not to:

r) contact anyone who has asked not to be contacted, or make unsolicited
contact with anyone for any commercial purpose;

s) "stalk" or otherwise harass anyone;

t) collect personal data about other users for commercial or unlawful
purposes;

u) use automated means, including spiders, robots, crawlers, data mining
tools, or the like to download data from the Service - unless expressly
permitted by craigslist;
posted by bluefly at 4:11 PM on August 27, 2009 [1 favorite]


bluefly for the win.

For the OP, the aforementioned Wired article has a lot of excellent information about craigslist. There's nary a business plan that currently can compete with free, minimal, and popular. Think about why people use CL to begin with instead of a newspaper's site or classifieds or eBay. Good luck with your venture - but build up your own client base.
posted by chrisinseoul at 8:07 PM on August 27, 2009


As bluefly noted, any automated crawler is a violation of their terms of use, and as miyabo noted, implementing such a thing can have serious consequences. If you are indeed "just trying to reach a very targeted audience of sellers," such as folks selling couches, then you won't need to scrape all posts from a category and it should be feasible to hire a bunch of humans to search for the applicable posts and initiate a dialog. However, I doubt you'd have much success with this approach, as craigslist sellers want to hear from buyers, not people suggesting an alternate marketplace. If I received such a response I would politely tell you to fuck off, just as I do to everyone who calls me at work offering something I didn't ask for.

A better approach would be to build a site that does what craigslist does, only better, with the goal of slowly building up a userbase that can compete with theirs. Considering that they got in on the "ground floor" of the internet explosion and that their primary attraction is that large local userbase, this is a tall order akin to writing a new operating system that can compete with Microsoft. You might as well hire a room full of skilled humans to search craigslist, buy the targeted items below market value, and build a site to resell them.
posted by waxboy at 8:14 PM on August 27, 2009


Adhering to the robots.txt policy of a site is nowadays something that is only done by "nice" people. robots.txt is an internet convention and has afaik no current legal standing. However, violating it is considered extremely bad manners and usually annoys the people running the site, and they may try to stop you if they can.

T&Cs - given that you don't have to sign up to access the site, it's a question for the lawyers whether you have accepted the T&Cs at all.

Unfortunately, it's pretty much trivial to build a bot that cannot be detected.

Most people that want to offer some kind of service based on CL will be spotted because of the volume of traffic that an interesting service must invariably generate.

From the sounds of it, your usage would be relatively low.

That said, there would be very few cases, if any, where what you plan would not be considered spam.

Maybe you could consider a solution based on their RSS feed instead of scraping?

Or, just place an ad in the same section.
posted by w.fugawe at 11:31 PM on August 27, 2009


I would try to get the information out of the RSS feed using Yahoo pipes and regular expressions.
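
Yahoo Pipes aside, the same idea (pull the items out of a feed, keep the ones whose titles match a regular expression) can be sketched in a few lines of Python. Everything here is an assumption for illustration: the sample feed, its plain RSS 2.0 layout, the matching_titles helper, and the couch/sofa pattern.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed text; a real feed may use a different structure.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>example furniture listings</title>
    <item><title>Leather couch, good condition</title></item>
    <item><title>Mountain bike</title></item>
    <item><title>SOFA bed - $50</title></item>
  </channel>
</rss>"""


def matching_titles(feed_text, pattern):
    """Return the item titles that match the regular expression, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    root = ET.fromstring(feed_text)
    titles = (item.findtext("title", default="") for item in root.iter("item"))
    return [title for title in titles if regex.search(title)]


print(matching_titles(SAMPLE_FEED, r"\b(couch|sofa)\b"))
```

The word-boundary anchors keep the pattern from matching inside longer words; loosen or tighten the pattern to taste.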
posted by Akeem at 6:01 AM on August 28, 2009

