scrape web content from okcupid?
September 9, 2012 10:48 AM

How do I extract content from a website like OKCupid for personal use?

I basically want to spider the site and extract info from my matches into a table. This is purely for my own personal nerdy fun (i.e. doing some sort of clustering on the matches they give me, trying my own search algorithms, etc.), and hopefully not illegal.

I don't know anything about web programming, so I don't even know where to start searching. For example, something about the match results page means the web scraping tools I've found online can only lift one profile at a time. I'm not sure about the terminology for any of this. Any advice or directions would be great! Thanks!
posted by ribboncake to Computers & Internet (3 answers total) 3 users marked this as a favorite
 
You could try using a web automation testing framework to simulate your input to the website and save the resulting web page as a text file for parsing. Look at things like Watir.

This may be the only way to get the correct output, since the website relies on things like JavaScript and cookies to generate content dynamically from information not included in a simple HTTP request.
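To illustrate just the cookie half of that: a plain HTTP client can be told to remember and resend session cookies, roughly as sketched below in Python's standard library. This is only a sketch and it will not run any JavaScript, which is why a browser automation tool may still be needed. The helper names and the user agent string are made up for the example.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Placeholder user agent; some sites reject the default Python one.
    opener.addheaders = [("User-Agent", "personal-scraper-sketch/0.1")]
    return opener, jar


def fetch_text(opener, url, timeout=30):
    """Fetch a URL through the opener and decode the body as text."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Every request made through the same opener shares the same cookie jar, so a login made with it would carry over to later page fetches, assuming the site's login is a plain form post.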

It is also a somewhat nontrivial programming task, as well as an exercise in frustration - lots of trial and error, things breaking for no apparent reason - and a really depressing idea overall.
posted by Dr Dracator at 11:18 AM on September 9, 2012


If you're going to do your own clustering and write custom search algorithms, I'll assume you're a competent programmer (albeit not a web programmer).

The HTML you see is the combination of an initial HTML file, plus revisions generated by executing JavaScript on the local machine based on additional background communication with the server. I think the right way to go here is to read and understand the JavaScript, figure out the semantics of its background communication, and then scrape that communication directly.

This does require learning a bit about web programming, but on the bright side, once you're parsing the background communications, you'll be scraping an explicitly machine-readable format rather than trying to interpret a loosely specified markup that creates the appearance on human readability.
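As a rough illustration of why the machine-readable route is easier: assuming the background response is JSON shaped like the made-up sample below, flattening it into a table takes only a few lines of Python. The field names (`matches`, `username`, `age`, `match_pct`) are hypothetical and not OKCupid's actual format; the real ones would have to be read off the actual traffic.

```python
import csv
import io
import json

# Hypothetical payload, invented for this sketch.
SAMPLE = """
{"matches": [
  {"username": "alice", "age": 29, "match_pct": 91},
  {"username": "bob", "age": 34, "match_pct": 78}
]}
"""

# Columns to keep, in table order (also hypothetical).
FIELDS = ["username", "age", "match_pct"]


def matches_to_rows(payload_text):
    """Parse the JSON text and keep only the chosen fields per match."""
    data = json.loads(payload_text)
    return [{f: m.get(f) for f in FIELDS} for m in data.get("matches", [])]


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(rows_to_csv(matches_to_rows(SAMPLE)))
```

The resulting CSV text can be written to a file and opened in a spreadsheet or loaded into whatever clustering code you write.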

Here are some helpful google queries:

- "AJAX" is the technique that sites like OKCupid use to rewrite your web pages on the local machine.

- "JSON" or "XML" are the formats commonly used for that background communication.

- "jQuery" is a popular JavaScript library, probably verging on the standard JavaScript library. Reading its API will help you understand the common operations and patterns in AJAX sites, although I can't find any evidence that OKCupid uses jQuery. (OKCupid have a really neat in-house tech stack: search for TAME, their C++ concurrency library, and OKWS, their web server, which includes a custom templating language.)
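
To make the AJAX idea concrete, here is a minimal sketch of rebuilding one such background request from a script instead of from the browser. The URL and query parameters are placeholders I made up, not OKCupid's real endpoint; the real ones would come from watching the browser's network traffic. The `X-Requested-With` header is what jQuery attaches to same-origin AJAX calls, and whether any given server checks for it is an assumption.

```python
import urllib.parse
import urllib.request


def build_ajax_request(base_url, params):
    """Build (but do not send) a GET request that looks like an AJAX call."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{base_url}?{query}")
    # Ask for JSON rather than an HTML page.
    request.add_header("Accept", "application/json")
    # Header jQuery adds to same-origin AJAX calls.
    request.add_header("X-Requested-With", "XMLHttpRequest")
    return request


# Placeholder endpoint and parameters, for illustration only.
req = build_ajax_request("https://example.com/matches", {"page": 2, "count": 25})
print(req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` and feeding the response body to a JSON or XML parser.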
posted by d. z. wang at 11:23 AM on September 9, 2012 [2 favorites]


and of course I meant "creates the appearance of human readability"
posted by d. z. wang at 11:23 AM on September 9, 2012

