How do I screen scrape a changing website?
August 18, 2012 3:09 PM

How do I automatically save a constantly updated webpage every time it changes, so that I can parse the page later with a program? Note that I am trying to save someone else's website, which is being updated, because I want to extract their data. The screen scrapers I have seen don't do what I want, and I am not sure what this feature is called, if it exists.

I understand that I can use a packet sniffer to intercept and decode the incoming traffic, but it seems much simpler if I could just save the HTML and process it later.

Are there any open source packages that might make this simpler?

If programming is necessary, I would prefer Ruby, C#, Python, or Java. Or, put another way, anything but JavaScript.
posted by santogold to Computers & Internet (4 answers total) 1 user marked this as a favorite
 
Use a cron job with wget (PHP example) and set it to run every 5, 3, or 1 minutes, depending on how often the URL updates. Transform the downloaded file to strip the unwanted HTML. Create a checksum for each file and compare it with the previous file's checksum to tell whether the latest download is any different.
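
For the checksum step, here is a minimal sketch in Python (one of the languages the poster listed). It assumes the page bytes have already been downloaded, for example by wget, and the names save_if_changed, out_dir, state_file, and stamp are made up for illustration. It keeps a snapshot only when the SHA-256 digest differs from the previous one.

```python
import hashlib
from pathlib import Path


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def save_if_changed(data: bytes, out_dir: Path, state_file: Path, stamp: str) -> bool:
    """Write data to out_dir/<stamp>.html only if it differs from the last snapshot.

    state_file holds the digest of the most recently saved snapshot.
    Returns True if a new snapshot was written, False otherwise.
    """
    digest = checksum(data)
    previous = state_file.read_text().strip() if state_file.exists() else ""
    if digest == previous:
        return False  # same content as last time, nothing to save
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{stamp}.html").write_bytes(data)
    state_file.write_text(digest)
    return True
```

If the page contains timestamps, ad markup, or other noise that changes on every request, the digest will change every time, so the "remove unwanted HTML" step probably needs to run before the checksum is computed.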
posted by Foci for Analysis at 3:32 PM on August 18, 2012


Response by poster: Good answer, Foci, except that I expect the URL might update every second, and I don't want to miss even one update if possible. I really don't want to hammer on their web server like this either. (98%+ of the page does not change; it is sending JSON updates.)
posted by santogold at 3:35 PM on August 18, 2012


Trying to understand, is this on your server or their server?

Unless it's on a server you control, or they have some mechanism for notifying you of a changed version, you will have to poll the server.

The trick to avoid excessive bandwidth usage and expensive whole-page comparisons is to send the If-Modified-Since request header. If the page hasn't changed, the server replies 304 Not Modified with no body, so the request is effectively a HEAD; if it has changed, you get the full page as with a normal GET.
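
A rough Python sketch of that conditional request, using only the standard library. This is not the code from the project mentioned in this comment; the function names are invented for illustration, and it assumes the server actually sends a Last-Modified header and honors If-Modified-Since.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def build_conditional_request(url: str, last_modified: Optional[str]) -> urllib.request.Request:
    """Build a GET request that carries If-Modified-Since when a previous value is known."""
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_modified(
    url: str, last_modified: Optional[str], timeout: float = 10.0
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_last_modified), or (None, last_modified) if nothing changed."""
    req = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # 200 OK: the page changed, so remember its new Last-Modified value.
            return resp.read(), resp.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # 304 Not Modified: no body was sent, keep the old timestamp.
            return None, last_modified
        raise
```

A polling loop would call fetch_if_modified repeatedly, pass back the Last-Modified value it got last time, and save the body only when it is not None. Note that Last-Modified has one-second resolution, so this alone may not catch several updates within the same second.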

I wrote a funny little thing to get the latest Courage Wolf to inline in a personal project. Feel free to use the source code as a reference. Take a look at the checkForNewerWolf() method, right after it creates the URLConnection.
posted by vsync at 3:42 PM on August 18, 2012 [1 favorite]


If the page has a Last-Modified or ETag header that changes appropriately, that might be an even easier way to detect updates.
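
As a sketch of what that check could look like in Python (Perl would work just as well): send a HEAD request, pull out the two headers, and compare them with what you saw last time. The helper names are hypothetical, and this assumes the server answers HEAD requests and sends at least one of the two headers.

```python
import urllib.request
from typing import Mapping, Optional, Tuple

# (ETag, Last-Modified) as seen on one response; either may be missing.
Validators = Tuple[Optional[str], Optional[str]]


def extract_validators(headers: Mapping[str, str]) -> Validators:
    """Pick the ETag and Last-Modified values out of a set of response headers."""
    return headers.get("ETag"), headers.get("Last-Modified")


def has_changed(previous: Optional[Validators], current: Validators) -> bool:
    """Report a change when the headers differ from the previously seen ones.

    If the server sends neither header we cannot tell, so we assume a change.
    """
    if current == (None, None):
        return True
    return previous != current


def head_validators(url: str, timeout: float = 10.0) -> Validators:
    """Issue a HEAD request (no body is transferred) and return the headers of interest."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_validators(resp.headers)
```

Only when has_changed returns True would you follow up with a full GET and save the page.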

If you're ok with Perl, Perl has a bunch of modules that are very handy for this sort of thing, like LWP::RobotUA and WWW::Mechanize, that take care of a bunch of random details that are easy to get wrong when you're doing it 'by hand'.

On preview:

it is sending JSON updates

Can you pick apart their JavaScript, or observe it with Firebug or something, and simply retrieve the JSON data directly without pulling down the rest of the page? Easier for you to parse and less load for their webserver, if you can do it that way.
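
Assuming you do find the URL their JavaScript is requesting, something along these lines might be enough in Python. The endpoint URL and the shape of the JSON are unknown here, so fetch_json takes whatever URL you discover, and append_if_new (a made-up helper) just logs each distinct payload as one line of a JSON Lines file for later parsing.

```python
import json
import urllib.request
from pathlib import Path
from typing import Any, Optional


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """Request the JSON endpoint directly, skipping the surrounding HTML page."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def append_if_new(update: Any, log_path: Path, last_line: Optional[str]) -> str:
    """Append the update to log_path unless it equals the previously logged one.

    Keys are sorted so that the same data always serializes to the same line.
    Returns the serialized line, to be passed back in as last_line next time.
    """
    line = json.dumps(update, sort_keys=True)
    if line != last_line:
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")
    return line
```

You would call fetch_json in a loop with a polite delay, feeding each result to append_if_new. If the site pushes updates over a long-lived connection rather than repeated requests, this simple polling approach would not apply as written.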
posted by hattifattener at 3:42 PM on August 18, 2012 [1 favorite]


This thread is closed to new comments.