Doing Stuff Repeatedly
July 9, 2009 9:40 AM
I need to write a script of some sort to grab content from a series of web pages. How should I go about it?
Here's the deal: I have a textbook that is available in electronic form online. The problem is that each page is a JPEG served individually by a PHP script: I click a link and it takes me to another web page with the image embedded in it. I want to download all of the pages and run OCR on them so I can search the book, copy and paste for notes, etc.
I can sit there and do this manually, but there are 600 pages or so and I want to use this as an opportunity to learn a new skill.
I have tried web spiders (HTTrack) and they did not work. The PHP script serves up each file with the exact same filename. The URL of the image embedded on the page also has no *.jpg extension; it points at a PHP script with some query parameters.
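The missing .jpg extension need not matter by itself: a JPEG file is recognized by its first three bytes (FF D8 FF), not by its URL or filename. As a minimal sketch, assuming a POSIX shell with od and tr available, a helper like the hypothetical is_jpeg below could check whatever a download tool actually saved:

```shell
# Succeed if the file named by $1 starts with the JPEG signature FF D8 FF.
# is_jpeg is a made-up helper name, not part of any tool mentioned above.
is_jpeg() {
    # Dump the first 3 bytes as hex and strip the spaces od puts between them.
    sig=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    [ "$sig" = "ffd8ff" ]
}
```

Something like `is_jpeg page456.jpg && echo "looks like a JPEG"` would then confirm a download before it goes to OCR. Note that this only checks the file type; an anti-hotlinking placeholder could itself be a valid JPEG.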
I have tried Selenium-IDE, but apparently it (a) can't do anything incrementally without the aid of another scripting language and (b) can't actually download content.
I tried FlashGot's gallery feature but I think the PHP script interferes with the referrer or something because I only end up downloading the placeholder that warns against hotlinking.
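One way to spot which downloads came back as the hotlink placeholder is to compare each file against one known copy of the placeholder. This is only a sketch, and it assumes the server sends byte-for-byte identical placeholder data every time, which is not confirmed here. The function name same_as and the file names in the usage note are made up for illustration:

```shell
# Print the name of every file that is byte-for-byte identical to the
# reference file given as the first argument.
same_as() {
    ref=$1
    shift
    for f in "$@"; do
        # cmp -s is silent and exits 0 only when the two files match exactly.
        if cmp -s "$ref" "$f"; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}
```

With one placeholder saved as, say, placeholder.jpg, running `same_as placeholder.jpg page*.jpg` would list the pages that still need to be fetched properly.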
So it seems I need some way of actually walking Firefox through the steps.
Here's the pseudocode of what I'm trying to do:
1. Go to URL http://www.blah.edu/students/stuff.php?eh=123
2. Go to URL http://www.blah.edu/students/view.php?pg=f&id=456
3. Save http://www.blah.edu/students/showpage.php?pg=f&id=456 (this is actually an image) with filename page456.jpg.
4. Go back to step 2, increment the id number to 457, and repeat.
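The steps above could be sketched as a shell script around wget, without driving Firefox at all. This is a rough sketch, not a tested solution for this site. It assumes the server only needs a session cookie and a Referer header to hand over the image, which may not be true. The URLs and ids come from the pseudocode; the cookie file name, the function names, and the one-second delay are my own assumptions.

```shell
#!/bin/sh
# Sketch only: URLs are taken from the pseudocode above; everything else
# (cookie file name, function names, delay) is assumed for illustration.

BASE='http://www.blah.edu/students'
COOKIES='cookies.txt'   # hypothetical file used to keep session cookies

# Build the URL of the viewer page (step 2) for a given id.
view_url() { printf '%s/view.php?pg=f&id=%s\n' "$BASE" "$1"; }

# Build the URL of the image itself (step 3) for a given id.
image_url() { printf '%s/showpage.php?pg=f&id=%s\n' "$BASE" "$1"; }

# Fetch ids $1 through $2, saving each image as page<id>.jpg.
fetch_pages() {
    first=$1
    last=$2
    # Step 1: open the entry page once so any session cookie gets stored.
    wget -q -O /dev/null --save-cookies "$COOKIES" --keep-session-cookies \
        "$BASE/stuff.php?eh=123"
    id=$first
    while [ "$id" -le "$last" ]; do
        # Step 2: visit the viewer page, as the browser would.
        wget -q -O /dev/null --load-cookies "$COOKIES" \
            --save-cookies "$COOKIES" --keep-session-cookies \
            "$(view_url "$id")"
        # Step 3: fetch the image, presenting the viewer page as the referrer.
        wget -q -O "page$id.jpg" --load-cookies "$COOKIES" \
            --referer="$(view_url "$id")" "$(image_url "$id")"
        id=$((id + 1))   # Step 4: increment the id and repeat.
        sleep 1          # Assumed courtesy delay between requests.
    done
}
```

Calling `fetch_pages 456 460` after the definitions would try five pages first, which is a cheap way to find out whether the cookie and referrer are enough before going after all 600 or so.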
What scripting language would best facilitate this? I'm on Linux but can use Windows if I must.
Thanks
posted by Ziggy Zaga to computers & internet (12 comments total)
If you can go directly to the showpage.php part and have it download the image, a shell or Perl script might be quicker. I'm not so great on my Perl (so I'll probably be mixing syntax with PHP), but something like:
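#!/usr/bin/perl
# Assumed upper bound: about 600 pages starting at id 456. Set $last to the real final id.
my $last = 456 + 599;
for (my $i = 456; $i <= $last; $i++) {
    # Quote the URL so the shell does not treat the & as "run in background".
    `wget -O $i.jpg 'http://www.blah.edu/students/showpage.php?pg=f&id=${i}'`;
}
I'm sure someone can correct my script and make it work for you. If not, consider it an exercise in learning Perl!
posted by jgunsch at 9:52 AM on July 9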