Trying to download a whole batch of PDFs.
January 23, 2009 3:20 PM

How to download a lot of PDFs? My wife is attending a conference, and the conference provides PDF downloads of all the papers, but only as individual papers, on individual pages. With a twist.

I saw previous questions about this sort of issue (http://ask.metafilter.com/82531/I-want-to-download-all-files-from-a-page-but-theres-a-catch), but there's a different catch here.

First, you visit the site:
http://a.org/hawaii and log in

This brings you to:
http://b.c.com/b/2009am/webprogram/meeting.html

On this page, there are links to various session pages:
http://b.c.com/b/2009am/webprogram/Session1689.html

If you click on a specific paper to be presented at that session, you get a page with this format:
http://b.c.com/b/2009am/webprogram/Paper3829.html

This is the actual download page, and the link to the PDF looks like this:
http://b.c.com/b/2009am/recordingredirect.cgi/id/648

There are two issues:
1. The whole process is really touchy about logging in; it frequently seems to forget my credentials.
2. The CGI redirect seems to bork things up for tools like wget and HTTrack.

With wget I get a "port error" message, and with HTTrack I get a "blocked by robots.txt" message.

Any ideas?

Thanks!

RedDot
posted by reddot to Computers & Internet (5 answers total) 2 users marked this as a favorite
 
wget lets you change your User Agent (-U / --user-agent) and also supply basic login credentials (--http-user / --http-password, if the site uses HTTP auth). In theory you should be able to use the mirroring option to leech the site along with any links it branches off to. Changing the user agent alone may not get wget past robots.txt, though; the reliable way is to tell it to ignore robots explicitly with -e robots=off.
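Something along these lines might be a starting point (untested, and I'm guessing at how the a.org login actually works; the form field names, cookie file, and user agent string below are placeholders you'd need to adjust):

    # Step 1: log in once and save the session cookie. This assumes the login
    # at a.org is a plain form post; the "username"/"password" field names are
    # guesses and would need to match the real form.
    wget --save-cookies=cookies.txt --keep-session-cookies \
         --post-data='username=YOURNAME&password=YOURPASS' \
         http://a.org/hawaii

    # Step 2: recursively grab the program pages using that cookie, ignoring
    # robots.txt and reporting a browser-ish user agent.
    wget --load-cookies=cookies.txt -e robots=off \
         -U "Mozilla/5.0" -r -l 3 \
         http://b.c.com/b/2009am/webprogram/meeting.html

If the recordingredirect.cgi downloads come out without a .pdf extension, newer wget builds have --content-disposition, which names files from the server's headers. And if the login turns out to be HTTP auth rather than a form, skip step 1 and put --http-user / --http-password on the second command instead.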

This is meant to be a push in the right direction as opposed to the best answer you are going to get.
posted by dinx2582 at 3:29 PM on January 23, 2009


Oh, sorry, I missed a sentence or two and didn't notice you'd already tried wget. I don't know the answer to the port error issue.
posted by dinx2582 at 3:29 PM on January 23, 2009


Firefox Scrapbook. Capture Page As.... Set custom filter to PDF. Set depth to 3 or so.
posted by gregoreo at 4:12 PM on January 23, 2009


(Scrapbook will give a prompt that lets you pause the download process and review the current list. If not already filtered out, be sure to uncheck any links to "logout", etc. Such links can kill the session for tools like wget.)
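For the wget route, the rough equivalent is rejecting anything that looks like a logout link; the patterns below are only guesses at what those links might be named on this site:

    # Skip session-killing links while mirroring; "logout*" / "signout*"
    # are guesses at the link names, and cookies.txt is the saved login
    # cookie from the earlier wget suggestion.
    wget --load-cookies=cookies.txt -e robots=off -r -l 3 \
         -R "logout*,signout*" \
         http://b.c.com/b/2009am/webprogram/meeting.html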
posted by gregoreo at 5:19 PM on January 23, 2009


Response by poster: Thanks for the suggestions. Due to the MetaFilter site outage and my travelling, I haven't had a chance to review this since asking. I'll check out Scrapbook.
posted by reddot at 4:20 PM on February 16, 2009


This thread is closed to new comments.