Doing Stuff Repeatedly
July 9, 2009 9:40 AM   Subscribe

I need to write a script of some sort to grab content from a series of web pages. How should I go about it?

Here's the deal: I have a textbook that is available in electronic form online. The problem is that each page is a JPEG served up individually by a PHP script. I click a link and it takes me to another web page with the image embedded in it. I want to download all of the pages and run OCR on them so I can search the book, copy and paste notes, etc.

I can sit there and do this manually, but there are 600 pages or so, and I want to use this as an opportunity to learn a new skill.

I have tried web spiders (HTTrack) and they did not work: the PHP script serves up each file with the exact same filename, and the image embedded on the page doesn't have a *.jpg extension, just some random PHP stuff.

I have tried to use Selenium-IDE, but apparently it (a) can't do anything incrementally without the aid of another scripting language, and (b) can't actually download content.

I tried FlashGot's gallery feature, but I think the PHP script checks the referrer or something, because I only end up downloading the placeholder image that warns against hotlinking.

So it seems I need some way of actually walking Firefox through the steps.

Here's the pseudocode of what I'm trying to do:

1. Go to URL
2. Go to URL
3. Save (this is actually an image) with filename page456.jpg
4. Go back to step 2, increment the id number to 457, and repeat

What scripting language would best facilitate this? I'm on Linux but can use Windows if I must.

posted by Ziggy Zaga to Computers & Internet (12 answers total)
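The numbered steps above map directly onto a simple fetch loop. Here is a minimal Python sketch, assuming a hypothetical URL pattern (`showpage.php?id=N`) and the 456–1013 page range mentioned in the thread; the real site's query string will differ:

```python
import urllib.request

# Hypothetical URL pattern -- the real PHP viewer's address isn't given in the thread.
BASE = "http://example.com/showpage.php?id="

def page_url(page_id):
    """Build the viewer URL for a given page number."""
    return BASE + str(page_id)

def page_filename(page_id):
    """Local filename for a saved page image, e.g. page456.jpg."""
    return "page%d.jpg" % page_id

def fetch_all(start=456, stop=1014):
    """Walk the id range, fetching each page and saving it to disk."""
    for i in range(start, stop):
        data = urllib.request.urlopen(page_url(i)).read()
        with open(page_filename(i), "wb") as f:
            f.write(data)

if __name__ == "__main__":
    fetch_all()
```

This plain version will only work if the server doesn't check referrers or cookies; the answers below address those wrinkles.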
If you have to actually *go* to the page to be able to get the image, I would recommend a combination of PHP, PHP's cURL functions to fetch the page, and preg_match with regular expressions to parse the page and find your image URL.

If you can go directly to the showpage.php part and have it download the image, a shell or Perl script might be quicker. I'm not so great on my Perl (so I'll probably be mixing syntax with PHP), but something like:
for ($i = 456; $i < 1014; $i++) {
    `wget -O $i.jpg "URL$i"`;   # URL = the showpage.php address
}
I'm sure someone can correct my script and make it work for you. If not, consider it an exercise in learning Perl!
posted by jgunsch at 9:52 AM on July 9, 2009
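The "fetch the page, then parse out the image URL" step jgunsch describes with preg_match can also be done without regexes, using Python's stdlib html.parser. A sketch (the sample markup is invented; the real viewer page will differ):

```python
from html.parser import HTMLParser

class ImgFinder(HTMLParser):
    """Collect the src attribute of every <img> tag on a page."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.srcs.append(value)

def find_image_urls(html):
    """Return the src of every image on the page, in document order."""
    finder = ImgFinder()
    finder.feed(html)
    return finder.srcs

# Invented sample page to show the shape of the output.
sample = '<html><body><img src="showpage.php?img=456"></body></html>'
```

Once you have the image URL out of the viewer page, you can fetch it with the same library and save it to disk.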

Wow, I don't think AskMefi liked me using < in my for loop. The for line should be more like:
for ($i = 456; $i < 1014; $i++)

posted by jgunsch at 9:53 AM on July 9, 2009

Response by poster: "If you can go directly to the showpage.php part and have it download the image, a shell or Perl script might be quicker. I'm not so great on my Perl (so I'll probably be mixing syntax with PHP), but something like:"

Going directly to the page isn't an option; I'm pretty sure the page is looking for a referrer at the very least.
posted by Ziggy Zaga at 10:29 AM on July 9, 2009
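If the server really does check the referrer, the header can be supplied explicitly from Python's stdlib. A minimal sketch with invented URLs (the real viewer and image addresses aren't given in the thread), and a made-up User-Agent string in case the script filters on that too:

```python
import urllib.request

# Invented URLs for illustration; substitute the real viewer page and image address.
viewer_url = "http://example.com/viewpage.php?id=456"
image_url = "http://example.com/showpage.php?id=456"

# Send the viewer page as the Referer so the PHP script thinks we clicked through,
# and use a browser-like User-Agent as extra insurance.
req = urllib.request.Request(image_url, headers={
    "Referer": viewer_url,
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
})

# Uncomment to actually fetch and save the image:
# with open("page456.jpg", "wb") as f:
#     f.write(urllib.request.urlopen(req).read())
```

This defeats a simple referrer check; if the site also validates session cookies, see the login note further down.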

The Perl WWW::Mechanize module should be able to handle that.
posted by ellenaim at 10:38 AM on July 9, 2009

You might take a look at Ruby and Watir for scripting interaction with a browser. It can carry out commands (clicking, entering text) just as a human would. It's not too difficult to pick up, either; check out the first tutorial.
posted by tanminivan at 10:39 AM on July 9, 2009

You can pretty easily do this with Perl and LWP (link is to a CC-licensed text).
posted by namewithoutwords at 10:46 AM on July 9, 2009

I have had good experience scraping pages for mp3s with Python's urllib module, which comes standard with Python. urllib could easily do the same magic with JPEGs.
posted by baxter_ilion at 10:55 AM on July 9, 2009

Another thing that might complicate matters is whether you have to log in to access the textbook.
posted by baxter_ilion at 10:56 AM on July 9, 2009
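If a login is required, the stdlib can carry the session cookie across requests. A sketch assuming a form-based login; the form field names and the login URL here are invented and will differ on the real site:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Keep cookies across requests so a login session survives into the page fetches.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def login(opener, login_url, username, password):
    """POST invented form fields; the real site's field names will differ."""
    form = urllib.parse.urlencode({"user": username, "pass": password}).encode()
    return opener.open(login_url, data=form)

# After calling login(), every subsequent opener.open(...) reuses the session
# cookie, so the page-image fetch loop can go through this same opener.
```

The same opener can also send a Referer header per request, so cookie and referrer handling combine cleanly.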

If you prefer Ruby, Mechanize or Hpricot may be able to help.
posted by chrisamiller at 11:29 AM on July 9, 2009

This is a script partly stolen from another page, partly based on my suggestion earlier of PHP with cURL. I haven't tested it, but it should work (or at least be a good place to start):

for ($i = 1; $i < 100; $i++) { // adjust to the page numbers you need

    // This is the URL you're trying to load
    $ch = curl_init('URL' . $i);

    // This sets the referer
    curl_setopt($ch, CURLOPT_REFERER, 'URL' . $i);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookiefile.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);

    $pagedata = curl_exec($ch);

    // Write the fetched image to disk as N.jpg
    $openposter = fopen($i . ".jpg", "w+");
    fwrite($openposter, $pagedata);
    fclose($openposter);

    curl_close($ch);
}

posted by jgunsch at 11:38 AM on July 9, 2009

jgunsch has what I would recommend. Hopefully you have a local PHP server set up.

If it still doesn't work, maybe try something like this?
posted by clorox at 1:06 PM on July 9, 2009

Response by poster: I'm currently installing a virtual machine to set up a local PHP server so that's probably the way I'll go.

Thanks for the code; I was soliciting ideas more than straight solutions but it's appreciated nonetheless.
posted by Ziggy Zaga at 3:07 PM on July 9, 2009
