How can I track and log all the changes to a website?
December 2, 2008 10:17 AM

How can I track and log all the changes to a website? Ideally, I would like a program for Linux (Ubuntu) or XP that I could get to automatically check a particular website each day and log any changes for my records. In a perfect world, this program would be able to save and archive the website (and the pages that link from it) each time there is a change.

I currently use Specto on Ubuntu to notify me if any changes are made, but then I have to manually save the changes and record the date.

The website is a job postings website that my union has to monitor in order to verify that all new hires can be correlated to a real job posting, and that any employees on layoff with recall rights are contacted regarding new jobs.
posted by kaudio to Computers & Internet (6 answers total) 3 users marked this as a favorite
 
You can use a version control system like Subversion to track the changes.
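A rough sketch of what a daily snapshot into Subversion might look like (the URL, paths, and repository setup here are placeholders, not details from the actual site):

    cd ~/site-archive                                        # an existing Subversion working copy
    wget -q -r -l 1 -nH http://example.com/postings/         # placeholder URL; grab the page and what it links to
    svn add --force . > /dev/null                            # pick up any files that are new this time
    svn commit -m "Automatic snapshot $(date +%F)"           # one dated commit per run

Run that from cron once a day and the Subversion history becomes your change log.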
posted by mkb at 10:27 AM on December 2, 2008


Best answer: Your question would be easier to answer if you specified what kinds of changes you're interested in. Are you interested in text-only changes? What if they change only the markup? Or an image?

On any Linux you can use a combination of cron (to schedule the crawl), and either curl or wget to download the website every day. Download the website to a datestamped directory, and then for text comparisons you can use 'diff'. For comparing images or other non-text data you can use md5sum.
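Very roughly, that could look like this (every URL, path, and time below is a placeholder):

    #!/bin/sh
    # Fetch the site into a datestamped directory and diff it against yesterday's copy.
    TODAY=$(date +%F)
    YESTERDAY=$(date -d yesterday +%F)
    BASE="$HOME/site-archive"
    mkdir -p "$BASE/$TODAY"
    wget -q -r -l 1 -nH -P "$BASE/$TODAY" http://example.com/postings/
    # Text comparison; md5sum can be used the same way for images and other binaries.
    diff -ru "$BASE/$YESTERDAY" "$BASE/$TODAY" >> "$BASE/changes-$TODAY.log"

and a crontab entry to run it every morning at 6:00:

    0 6 * * * /home/you/bin/check-site.sh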

I can't really fill in the details without knowing more about what you need, but it shouldn't be hard to figure out. The above solution is pretty good, but if you trust the webserver serving you the pages, a more correct approach would probably be to use a combination of HTTP HEAD requests and the Last-Modified header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
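For that approach, something like the following (placeholder URL again) pulls out the header you'd log and compare from day to day:

    # -I makes curl send a HEAD request instead of a GET
    curl -sI http://example.com/postings/ | grep -i '^Last-Modified:'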
posted by doteatop at 10:33 AM on December 2, 2008


Best answer: This looks like it gets you pretty close: WebSec.

"Web Secretary is a web page change monitoring software. It will detect changes based on content analysis, making sure that it's not just HTML that changed, but actual content. You can tell it what to ignore in the page (hit counters and such), and it can mail you the document with the changes highlighted or load the highlighted page in a browser.

Web Secretary is actually a suite of two Perl scripts called websec and webdiff. websec retrieves web pages and email them to you based on a URL list that you provide. webdiff compares two web pages (current and archive) and creates a new page based on the current page but with all the differences highlighted using a predefined color."
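Assuming the packaged websec command picks up the URL list you've configured (see its documentation for the exact setup; that part is not shown here), running the daily check from cron could be as simple as:

    0 6 * * * websec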
posted by jquinby at 10:38 AM on December 2, 2008


You could use httrack to download the website and rdiff-backup to compare it to the previous version, and to note and save changes. (Naturally, you'd want to write a cron job script to do this.)
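A rough sketch of that combination (placeholder URL and paths; check both tools' man pages for the options you actually want):

    httrack "http://example.com/postings/" -O "$HOME/site-mirror"   # mirror the site locally
    rdiff-backup "$HOME/site-mirror" "$HOME/site-backup"            # keep dated, diffable increments of the mirror

Put those two lines in a script and point a cron job at it.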
posted by Zed_Lopez at 11:08 AM on December 2, 2008


If you use wget -N, it'll use timestamping and only download files that are new or have changed.

Cron that however frequently you need to.
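For example, a crontab entry along these lines (placeholder URL and path; -r -l 1 also grabs the pages the postings page links to):

    0 6 * * * wget -q -N -r -l 1 -nH -P /home/you/site-archive http://example.com/postings/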
posted by pompomtom at 2:31 PM on December 2, 2008


Response by poster: Lots of good ideas, thank you to everyone who responded. I'll be spending the next week reading about cron scripts and such. (I'm new to Linux.)

More details: The website that I want to track is this one. When there are job postings, they will appear on this page as links to the actual posting, so those are the changes I'm looking for. Ideally, I'd like to be able to download those job postings too, so I can keep a record.
posted by kaudio at 5:27 PM on December 2, 2008


This thread is closed to new comments.