How can I monitor changes to an online PDF simply and for free?
April 3, 2007 10:49 PM

How can I get an alert when an online PDF changes? Simple solutions will carry the day. The closer to free, the better.

It's part of my job as a reporter to know when local companies decide to lay off workers. Announcements of those events are posted in my state, California, in online PDF files here:

In an ideal world, I'll make friends with the nice people who update these files. They'll call me when there's a relevant change. This will happen eventually.

For now, I'm a geek and want to receive an email every time an update occurs. I have vague plans to use UNIX tools to fetch the files (wget/curl) on a schedule (cron), extract their contents to text files (ps2ascii), search for relevant changes (diff | grep), and shunt the results to my mail (mail).

There must be an easier way. Any ideas?
posted by mfitz to Computers & Internet (8 answers total)
Fetch the files on a schedule with wget/curl and cron, like you planned. Then get the md5 hash of each file by running 'md5 filename' at a UNIX prompt. If the md5 value changes, your file has changed. If not, it hasn't.
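
A minimal sketch of that check in Python, assuming the PDF has already been downloaded by the scheduled job. The function names and the state-file idea are hypothetical, added only for illustration; the only thing taken from the standard library is hashlib's MD5. It saves the last digest in a small text file and reports whether the new one differs.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(pdf_path, state_path):
    """Compare the current digest with the one saved on the previous run.

    Returns True when the digest differs from the saved one, or when no
    digest has been saved yet. The new digest is recorded either way.
    """
    current = md5_of_file(pdf_path)
    state = Path(state_path)
    previous = state.read_text().strip() if state.exists() else None
    state.write_text(current + "\n")
    return current != previous
```

Note that the very first run reports a change, because there is nothing saved to compare against.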
posted by suedehead at 11:26 PM on April 3, 2007

See this previous question on how to create an RSS feed from a page without one. Specifically, this comment seems most appropriate.
posted by philomathoholic at 11:29 PM on April 3, 2007

Also this other question is best-answered in a similar make-your-own-rss-feed way.

This question seems to be about getting RSS changes emailed to you (via Google, but not necessarily Gmail, as I understand it). If your UNIX tricks don't work, maybe these will.
posted by philomathoholic at 11:37 PM on April 3, 2007

Is the nature of the updates that you're only interested in knowing when a change occurs, and, when it does, you'd want to look at the whole document?

Or is there value in seeing the diff?

If it's the latter, I think your proposal (basically -- I'd use pdftohtml -xml) is the easy way. If it's the former, well, maybe or another of the zillion web page change notification services are either simple enough or smart enough that you could give them the URL of a PDF. (If they're in-between, they're probably trying to be clever with the HTML.)
posted by Zed_Lopez at 11:51 PM on April 3, 2007

Internet Explorer allows you to subscribe to a page as a Favorite, check it on a specific schedule, and e-mail you if there are changes. This feature was meant to check web pages, but you can try pointing it at a PDF instead. It's super easy on the Mac version of IE, but a little more hidden on the PC.

For PC IE6: First load the PDF you're interested in with your browser. Then create a favorite. Choose "organize favorites" and select the new favorited PDF. Select the checkbox for "Make available offline" and then choose properties. The Schedule tab allows you to choose when to check the PDF. The Download tab allows you to e-mail yourself when the file changes.

My other idea, if that doesn't work, is to use Amazon's Mechanical Turk. Get someone to manually check the PDFs for you once or twice a day.
posted by Jeff Howard at 1:16 AM on April 4, 2007

I think you and suedehead are right on. Set up a little script that pulls down the file with wget then checks the md5 hash.

If the hash is different, have the script run pdftotext on both versions, run diff on the two text files, and email you the changes.

Then use cron to run the script every hour (or twice a day, or whatever).
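
A rough sketch of the extract, diff, and email steps in Python, offered as one possible shape and not a finished script. It assumes the pdftotext command-line tool is installed and on the PATH; the function names, the subject line, and the addresses you pass in are all hypothetical.

```python
import difflib
import subprocess
from email.message import EmailMessage


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text.

    Assumes pdftotext is installed; "-" asks it to write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def text_changes(old_text, new_text):
    """Return a unified diff of the two texts as one string ("" if equal)."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    return "\n".join(diff)


def build_alert(changes, sender, recipient):
    """Wrap the diff in an email message; sending it is left to the caller."""
    msg = EmailMessage()
    msg["Subject"] = "PDF changed"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(changes)
    return msg
```

If the machine runs a local mail server, something like smtplib.SMTP("localhost").send_message(msg) would deliver the message, but that setup is an assumption. An hourly schedule would be a crontab entry along the lines of `0 * * * * python3 /path/to/check_pdf.py`, where the script path is a placeholder.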
posted by chrisamiller at 10:49 AM on April 4, 2007

Unless there's some reason storing the most recently seen version of this (modestly-sized) PDF is a hardship, md5 seems to me like a waste in this case... just use cmp.

(Don't get me wrong -- using md5 or another hashing algorithm to test for changes to large documents is a great technique. But I think it'd add complication without added value in this case.)
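
For comparison, a byte-for-byte check in the spirit of cmp might look like this in Python. It swaps in the standard library's filecmp module in place of the cmp command itself, and the function name and file paths are hypothetical. It assumes the previous download is kept on disk next to the fresh one.

```python
import filecmp
import os


def files_differ(new_path, saved_path):
    """Return True if there is no saved copy or the two files' bytes differ.

    shallow=False makes filecmp compare file contents, not just
    size and modification time.
    """
    if not os.path.exists(saved_path):
        return True
    return not filecmp.cmp(new_path, saved_path, shallow=False)
```

When this returns True, the caller would diff the two versions and then replace the saved copy with the new download, so the next run compares against the latest file.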
posted by Zed_Lopez at 12:27 PM on April 4, 2007

Response by poster: Thank you! These are all great ideas, fully none of which I would have come to on my own.

Because the changes are important to me (there's value in seeing the diff), I think I'm going to follow the strategy suggested by suedehead and Chris, but perhaps using cmp instead of md5 as Zed suggests.

I'll post the final results here when I get around to working it out. I'm a total hack, so it might take a while! :-)

Jeff - I love your Mechanical Turk idea, but it takes some of the fun out of it for me!

Phil - Thanks for the pointers.

Thanks again, everyone.
posted by mfitz at 8:39 PM on April 4, 2007
