How do I screen scrape a changing website?
August 18, 2012 3:09 PM
How do I automatically save a webpage that is constantly being updated, every time it changes, so that I can parse it later with a program? Note that I am trying to save someone else's website as it is being updated; I want to extract their data. The screen scrapers I have seen don't do what I want, and I am not sure what this feature is called, if it exists.
I understand that I can use a packet sniffer to intercept and decode the incoming traffic, but it seems much simpler to just save the HTML and process it later.
Are there any open source packages that might make this simpler?
If programming is necessary, I would prefer Ruby, C#, Python, or Java. Or, put another way, anything but JavaScript.
Response by poster: Good answer, Foci, except that I expect the URL might update every second, and I don't want to miss even one update if possible. I really don't want to hammer on their webserver like this either. (98%+ of the page does not change; it is sending JSON updates.)
posted by santogold at 3:35 PM on August 18, 2012
Trying to understand, is this on your server or their server?
Unless it's on a server you control, or they have some mechanism for notifying you of a changed version, you will have to poll the server.
The trick to avoid excessive bandwidth usage and expensive whole-page comparisons is to use the If-Modified-Since request header. If the server honors it and the page hasn't changed, it replies 304 Not Modified with no body, so the request is effectively a HEAD; if the page has changed, you get a normal GET response with the new content.
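A minimal Python sketch of that conditional request, assuming the server sends a Last-Modified header and honors If-Modified-Since. The names build_request and fetch_if_modified are made up for illustration.

```python
import urllib.error
import urllib.request


def build_request(url, last_modified=None):
    """Attach If-Modified-Since when we kept a Last-Modified value from an earlier fetch."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_modified(url, last_modified=None, timeout=10):
    """Return (body, last_modified); body is None when the server answers 304."""
    request = build_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # 200: the page changed (or this is the first fetch)
            return response.read(), response.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: urllib reports this as an HTTPError
            return None, last_modified
        raise
```

Keep the returned Last-Modified value and pass it back in on the next call; a body of None means there is nothing new to save.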
I wrote a funny little thing to get the latest Courage Wolf to inline in a personal project. Feel free to use the source code as a reference. Take a look at the checkForNewerWolf() method, right after it creates the URLConnection.
posted by vsync at 3:42 PM on August 18, 2012 [1 favorite]
If the page has a Last-Modified or ETag header that changes appropriately, that might be an even easier way to detect updates.
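One way to find out whether the page offers either header is a single HEAD request. This is only a sketch; validators and head_validators are hypothetical helper names, and some servers answer HEAD differently from GET.

```python
import urllib.request


def validators(headers):
    """Pick out the cache validators, if any, from a mapping of response headers."""
    return {
        name: headers.get(name)
        for name in ("ETag", "Last-Modified")
        if headers.get(name)
    }


def head_validators(url, timeout=10):
    """Send a HEAD request (no body is transferred) and return the validators found."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return validators(response.headers)
```

An empty result suggests the server gives you nothing to compare against, so you would have to fall back to comparing the content itself.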
If you're OK with Perl, it has a bunch of modules that are very handy for this sort of thing, like LWP::RobotUA and WWW::Mechanize, that take care of a bunch of random details that are easy to get wrong when you're doing it 'by hand'.
On preview:
"it is sending JSON updates"
Can you pick apart their JavaScript, or observe it with Firebug or something, and simply retrieve the JSON data directly without pulling down the rest of the page? Easier for you to parse and less load for their webserver, if you can do it that way.
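If the JSON can be fetched directly, saving each distinct payload for later parsing might look roughly like this Python sketch. The endpoint URL, the output directory, and the names save_if_new and poll are all hypothetical; the real request has to be found by watching the page's traffic. Plain polling can still miss an update that comes and goes between two requests.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def save_if_new(payload, out_dir, last_digest=None, now=None):
    """Write payload to a timestamped file unless it matches the previous payload.

    Returns (digest, path); path is None when nothing was written.
    """
    digest = hashlib.sha256(payload).hexdigest()
    if digest == last_digest:
        return digest, None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp plus a short hash keeps file names unique and sortable
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    path = out_dir / f"{stamp}-{digest[:12]}.json"
    path.write_bytes(payload)
    return digest, path


def poll(url, out_dir, interval_seconds=1.0):
    """Fetch url forever, keeping one file per distinct payload."""
    last_digest = None
    while True:
        with urllib.request.urlopen(url, timeout=10) as response:
            payload = response.read()
        last_digest, path = save_if_new(payload, out_dir, last_digest)
        if path:
            print("saved", path)
        time.sleep(interval_seconds)  # raise this to go easier on their server


# Hypothetical usage:
# poll("https://example.com/updates.json", "snapshots")
```

Hashing the raw bytes avoids parsing anything at capture time; the saved files can be loaded with a JSON parser later.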
posted by hattifattener at 3:42 PM on August 18, 2012 [1 favorite]
This thread is closed to new comments.
posted by Foci for Analysis at 3:32 PM on August 18, 2012