Doing Stuff Repeatedly
July 9, 2009 9:40 AM Subscribe
I need to write a script of some sort to grab content from a series of web pages. How should I go about it?
posted by Ziggy Zaga to computers & internet (12 answers total) 3 users marked this as a favorite
Here's the deal: I have a textbook that is available in electronic form online. The problem is that each page is a JPEG that is served up individually by a PHP script. I click on a link, and it takes me to another web page with the image embedded on it. I want to download all of the pages and run OCR on them so I can search the book, do copy/paste for notes, etc.
I could sit there and do this manually, but there are 600 pages or so, and I want to use this as an opportunity to learn a new skill.
I have tried web spiders (HTTrack) and they did not work. The PHP script serves up every file with the exact same filename. The image embedded on the page also doesn't have a *.jpg extension, just some random PHP stuff.
I have tried to use Selenium IDE, but apparently it (a) can't do anything incrementally without the aid of another scripting language, and (b) can't actually download content.
I tried FlashGot's gallery feature, but I think the PHP script checks the Referer header or something, because I only end up downloading the placeholder image that warns against hotlinking.
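If it really is a Referer check, I imagine the fix is just to claim the viewer page as the referrer when grabbing the image. A sketch of what I mean in Python (the URLs here are made up, modeled on the ones below):

```python
import urllib.request

def image_request(image_url, viewer_url):
    """Build a request for the JPEG that claims the book's viewer page
    as the referrer, so the server doesn't send the anti-hotlinking
    placeholder instead of the real page image."""
    return urllib.request.Request(image_url, headers={"Referer": viewer_url})

req = image_request(
    "http://www.blah.edu/students/showpage.php?pg=f&id=456",
    "http://www.blah.edu/students/view.php?pg=f&id=456",
)
# urllib.request.urlopen(req).read() would then fetch the actual JPEG
# (network call left out here).
```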
So it seems I need some way of actually walking Firefox through the steps.
Here's the pseudocode of what I'm trying to do:
- Go to URL http://www.blah.edu/students/stuff.php?eh=123
- Go to URL http://www.blah.edu/students/view.php?pg=f&id=456
- Save http://www.blah.edu/students/showpage.php?pg=f&id=456 (this is actually an image) with filename page456.jpg.
- Go back to step 2, increment the id number to 457 and repeat.
What scripting language would best facilitate this? I'm on Linux but can use Windows if I must.