Doing Stuff Repeatedly
July 9, 2009 9:40 AM
I need to write a script of some sort to grab content from a series of web pages. How should I go about it?
Here's the deal: I have a textbook that is available in electronic form online. The problem is that each page is a JPEG served up individually by a PHP script. When I click on a link, it takes me to another web page with the image embedded on it. I want to download all of the pages and run OCR on them so I can search the book, copy/paste for notes, etc.
I can sit there and do this manually, but there are 600 pages or so, and I want to use this as an opportunity to learn a new skill.
I have tried web spiders (HTTrack) and they did not work. The PHP script serves up each file with the exact same filename. The image embedded on the page also does not have a *.jpg extension, but some random PHP stuff.
I have tried Selenium IDE, but apparently it (a) can't do anything incrementally without the aid of another scripting language, and (b) is unable to actually download content.
I tried FlashGot's gallery feature, but I think the PHP script is checking the referrer or something, because I only end up downloading the placeholder that warns against hotlinking.
So it seems I need some way of actually walking Firefox through the steps.
Here's the pseudocode of what I'm trying to do:
- Go to URL http://www.blah.edu/students/stuff.php?eh=123
- Go to URL http://www.blah.edu/students/view.php?pg=f&id=456
- Save http://www.blah.edu/students/showpage.php?pg=f&id=456 (this is actually an image) with filename page456.jpg.
- Go back to step 2, increment the id number to 457 and repeat.
What scripting language would best facilitate this? I'm on Linux but can use Windows if I must.
Thanks
If you can go directly to the showpage.php part and have it download the image, a shell or Perl script might be quicker. I'm not so great on my Perl (so I'll probably be mixing syntax with PHP), but something like:
I'm sure someone can correct my script and make it work for you. If not, consider it an exercise in learning Perl!
posted by jgunsch at 9:52 AM on July 9, 2009
Wow, I don't think AskMefi liked me using < in my for loop. The for line should be more like:
for ($i = 456; $i < 1014; $i++)
posted by jgunsch at 9:53 AM on July 9, 2009
Response by poster: Going directly to the page isn't an option; I'm pretty sure the page is looking for a referrer at the very least.
posted by Ziggy Zaga at 10:29 AM on July 9, 2009
The Perl WWW::Mechanize module should be able to handle that.
posted by ellenaim at 10:38 AM on July 9, 2009
You might take a look at Ruby and Watir for scripting interaction with a browser. It can carry out commands (clicking, entering text) just as a human would. It's not too difficult to pick up, either...check out the first tutorial.
posted by tanminivan at 10:39 AM on July 9, 2009
You can pretty easily do this with Perl and LWP (link is to CC-licensed text).
posted by namewithoutwords at 10:46 AM on July 9, 2009
I have had good experience scraping pages for MP3s with Python's urllib module, which comes standard with Python. urllib could easily do the same magic with .jpgs.
posted by baxter_ilion at 10:55 AM on July 9, 2009
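For example, here is a minimal sketch of that download loop using only the standard library (this uses Python 3's urllib.request; the URLs and the 456-1013 id range are placeholders taken from the question and jgunsch's loop, and it assumes the server's hotlink check only looks at the Referer header):

import urllib.request

# Walk the id range from the question's pseudocode (placeholders; adjust as needed).
for page_id in range(456, 1014):
    img_url = "http://www.blah.edu/students/showpage.php?pg=f&id=%d" % page_id
    req = urllib.request.Request(img_url)
    # Pretend we arrived from the viewer page so the referrer check passes.
    req.add_header("Referer",
                   "http://www.blah.edu/students/view.php?pg=f&id=%d" % page_id)
    # Save the raw JPEG bytes as page456.jpg, page457.jpg, ...
    with urllib.request.urlopen(req) as resp:
        with open("page%d.jpg" % page_id, "wb") as f:
            f.write(resp.read())

If the site also wants a session cookie, this bare-bones version won't be enough on its own.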
Another thing that might complicate matters is whether you have to log in to access the textbook.
posted by baxter_ilion at 10:56 AM on July 9, 2009
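If a login is required, one way to handle it is a cookie-aware opener that posts the login form once and then reuses the session for every page request; a hypothetical sketch (the login URL and form field names below are invented for illustration):

import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Keep cookies across requests so the session from the login survives.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical login form; the URL and field names are made up.
form = urllib.parse.urlencode({"username": "me", "password": "secret"}).encode()
opener.open("http://www.blah.edu/students/login.php", form)

# Later opener.open() calls send the session cookie automatically.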
If you prefer Ruby, Mechanize or Hpricot may be able to help.
posted by chrisamiller at 11:29 AM on July 9, 2009
This is a script partly stolen from another page, partly based on my suggestion earlier of PHP with cURL. Haven't tested it, but it should work (or be at least a good place to start):
<?php
for ($i = 1; $i < 100; $i++) { // adjust to the id range you need
    // The URL that actually serves the JPEG
    $ch = curl_init('http://www.blah.edu/students/showpage.php?pg=f&id=' . $i);
    // Set the referer to the viewer page so the hotlink check passes
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.blah.edu/students/view.php?pg=f&id=' . $i);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookiefile.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $pagedata = curl_exec($ch);
    curl_close($ch);
    // Write the raw image bytes out as 1.jpg, 2.jpg, ...
    $fp = fopen($i . ".jpg", "w+");
    fwrite($fp, $pagedata);
    fclose($fp);
}
?>
posted by jgunsch at 11:38 AM on July 9, 2009 [1 favorite]
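One note on running it: the script assumes a PHP build with the cURL extension enabled, and it can be run straight from the command line, e.g. php grab.php (grab.php being whatever name you save it under), so a full web server isn't strictly required.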
jgunsch has what I would recommend. Hopefully you have a local PHP server set up.
If it still doesn't work, maybe try something like this?
posted by clorox at 1:06 PM on July 9, 2009
Response by poster: I'm currently installing a virtual machine to set up a local PHP server, so that's probably the way I'll go.
Thanks for the code; I was soliciting ideas more than straight solutions but it's appreciated nonetheless.
posted by Ziggy Zaga at 3:07 PM on July 9, 2009
This thread is closed to new comments.