Trying to download a whole batch of PDFs.
January 23, 2009 3:20 PM Subscribe
How to download a lot of PDFs? My wife is attending a conference, and the conference provides PDF downloads of all the papers, but only as individual papers, on individual pages. With a twist.
I saw the previous questions about this sort of issue (http://ask.metafilter.com/82531/I-want-to-download-all-files-from-a-page-but-theres-a-catch), but there's a different catch here.
First, you visit the site:
http://a.org/hawaii and log in
This brings you to:
http://b.c.com/b/2009am/webprogram/meeting.html
On this page, there are links to various session pages:
http://b.c.com/b/2009am/webprogram/Session1689.html
If you click on a specific paper to be presented at the specific session, you get a page with this format:
http://b.c.com/b/2009am/webprogram/Paper3829.html
On this page, which is the actual page from which you can download the PDF, the link to download the PDF is like this:
http://b.c.com/b/2009am/recordingredirect.cgi/id/648
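For anyone scripting this by hand, the page structure above can be mined with a couple of short helpers. This is only a sketch: the regexes assume the exact link formats quoted above, and the base URL is taken from the question.

```python
import re

def session_ids(meeting_html):
    # Session pages are linked as SessionNNNN.html (format assumed from
    # the question); collect the numeric ids, deduplicated and sorted.
    return sorted(set(re.findall(r'Session(\d+)\.html', meeting_html)), key=int)

def pdf_urls(paper_html):
    # The actual PDF link on a Paper page looks like
    # /b/2009am/recordingredirect.cgi/id/NNN; assuming the links are
    # site-relative, make them absolute against the host from the question.
    base = "http://b.c.com"
    return [base + path for path in
            re.findall(r'(/b/2009am/recordingredirect\.cgi/id/\d+)', paper_html)]
```

Feed `session_ids` the meeting.html source to enumerate sessions, then `pdf_urls` each Paper page to collect the download links.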
There are two issues:
1. The whole process is touchy about logging in and frequently seems to forget the credentials.
2. The CGI script seems to break tools like wget and HTTrack: wget gives a "port error" message, and HTTrack reports "blocked by robots.txt".
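On issue 1, here is a sketch of how a stdlib Python crawler can hold on to session cookies between requests (something wget only does with --save-cookies/--load-cookies). Note that urllib also never consults robots.txt, which sidesteps the HTTrack complaint. Nothing here is site-specific, and the browser-like User-Agent is only a guess at why logins get "forgotten".

```python
import http.cookiejar
import urllib.request

def make_session_opener():
    """Build an opener that carries session cookies across requests,
    so a login on one page survives the jump to the program pages."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Some sites reject Python's default User-Agent; spoofing a
    # browser-like one is a guess at the flaky-login behaviour.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar

# Usage (URL from the question; the login form's fields are unknown,
# so the actual login POST is left out):
# opener, jar = make_session_opener()
# html = opener.open("http://b.c.com/b/2009am/webprogram/meeting.html").read()
```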
Any ideas?
Thanks!
RedDot
This is meant to be a push in the right direction rather than the best answer you're going to get. Sorry, I missed a sentence or two and didn't notice you'd already tried wget; I don't know the answer to the port error issue.
posted by dinx2582 at 3:29 PM on January 23, 2009
Firefox Scrapbook. Capture Page As.... Set custom filter to PDF. Set depth to 3 or so.
posted by gregoreo at 4:12 PM on January 23, 2009
(Scrapbook will give a prompt that lets you pause the download process and review the current list. If not already filtered out, be sure to uncheck any links to "logout", etc. Such links can kill the session for tools like wget.)
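The caution about "logout" links applies to any scripted crawl too. A trivial filter to run over collected links before following them (the keyword list is a guess at what such links typically contain):

```python
# Keywords guessed from typical conference sites; adjust as needed.
SESSION_KILLERS = ("logout", "signout", "log-out")

def safe_links(urls):
    """Drop links that would end the authenticated session mid-crawl."""
    return [u for u in urls
            if not any(k in u.lower() for k in SESSION_KILLERS)]
```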
posted by gregoreo at 5:19 PM on January 23, 2009
Response by poster: Thanks for the suggestions. Due to the metafilter site outage and my travelling, I haven't had a chance to review this since asking. I'll check out Scrapbook.
posted by reddot at 4:20 PM on February 16, 2009
This thread is closed to new comments.