Any Perl gurus up to fix a bug in Warrick?
February 23, 2009 11:01 PM

For the good of the Underdogs, does anyone know how to fix a bug in Warrick?

First of all, I asked the guy who wrote it, and he's unable to help. I'm not a Perl guru and this is a bit out of my league, so I turn to the community...

I'm using Warrick for a rebuild of HOTU. I've gotten 2,000 URLs back over the past 4 days, with around 14,000 left.

There seems to be a bug with Warrick's retrieval of Yahoo's cached pages, which is a shame because the Yahoo cache seems to be the most complete.

The error seems to happen because Yahoo serves an interstitial warning page before any cached result. This throws off the cache recovery and kicks out an error.

Does anyone know how to fix this, hack or otherwise? For example, could Warrick assume that every cached Yahoo page needs to be clicked through?

Example of the warning page

Text of the error in Warrick.
!! The yahoo repo has a cached url for [] -> [;_ylu=X3oDMTByNXFlNTgyBGNvbG8DZQRwb3MDMTg1BHNlYwNzcgR2dGlkAw--/SIG=18ovkfijd/EXP=1235201461/**http%3A//]
Request generated an error (410) for [;_ylu=X3oDMTByNXFlNTgyBGNvbG8DZQRwb3MDMTg1BHNlYwNzcgR2dGlkAw--/SIG=18ovkfijd/EXP=1235201461/**http%3A//] on try 1 of 5.
Sleeping for 5 minutes before trying again...

I can provide more information, partial recoveries, etcetera if it's needed...
posted by Lord_Pall to Technology (10 answers total) 2 users marked this as a favorite
It looks reasonably straightforward — Warrick just needs to be taught to recognize Yahoo's warning page and click through, right? I could take a whack at it in a couple of days if you haven't found a better solution by then.

(I wouldn't really consider this a bug in Warrick; it's just a situation it doesn't handle…)
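
To make the click-through idea concrete, here is a minimal sketch in Python rather than Warrick's Perl. It is not Warrick's code: `click_through_target`, `warning_marker` and `link_hint` are hypothetical names, and the text that identifies Yahoo's warning page and the shape of its forward link are assumptions that would have to be read off a real interstitial.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def click_through_target(html, warning_marker, link_hint):
    """Return the forward link if `html` looks like the warning page.

    warning_marker: text assumed to appear only on the interstitial.
    link_hint: substring assumed to identify the "continue" link.
    Returns None when the page is not recognized as an interstitial
    or no matching link is found.
    """
    if warning_marker not in html:
        return None  # ordinary page, nothing to click through
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.links:
        if link_hint in href:
            return href
    return None
```

The caller would then request the returned link as if the server had sent a redirect, and treat `None` as "use the page as-is".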
posted by hattifattener at 12:36 AM on February 24, 2009

Error 410 (Gone) means that something was there, but isn't now. That link pulls up a cached page for me. You should leave it running; when it retries, it will hit a different server, which might have it.

How much room is that taking up? Distributing queries using no clobber & Dropbox might work.
posted by Pronoiac at 1:57 AM on February 24, 2009

Response by poster: That's an interesting idea, actually. Distributing this and having a shared Dropbox could work.

I'm not sure how big the whole shebang is at the moment, but I'm on the verge of paying for Dropbox anyway, so as long as it's under 50 gigs, it doesn't matter.

I'll check the size this evening.
posted by Lord_Pall at 2:37 AM on February 24, 2009

From my 5 seconds of looking, it appears you only have to chop off the first URL chunk, then decode all of the URL-encoded stuff (%3A, etc.). Either that, or like hattifattener suggests, recognize the 410 and scrape for the forward link on the page.
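
A rough sketch of the chop-and-decode approach, in Python rather than Warrick's Perl. It assumes the `/**` seen in the error text marks where the tracking prefix ends and the encoded target begins; `extract_target` is a hypothetical name, and the hosts in the usage line are placeholders, not real Yahoo URLs.

```python
from urllib.parse import unquote

MARKER = "/**"  # assumed separator between tracking prefix and target


def extract_target(wrapped_url):
    """Return the percent-decoded URL that follows the '/**' marker.

    Returns None if the marker is missing or nothing follows it.
    """
    _prefix, sep, encoded = wrapped_url.partition(MARKER)
    if not sep or not encoded:
        return None
    return unquote(encoded)  # turns %3A back into ':' and so on


wrapped = "http://rds.example.com/_ylt=abc/SIG=18ovkfijd/EXP=1235201461/**http%3A//cache.example.com/page%3Fid=1"
print(extract_target(wrapped))
```

This only helps if the decoded target is itself fetchable; if Yahoo still answers with the warning page, scraping the forward link is the fallback.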
posted by rhizome at 9:08 AM on February 24, 2009

Response by poster: Unfortunately, I don't know Perl. I should probably learn it, but I sorta disagree with the concept of a language that looks like modem line noise.

In this case, I genuinely don't know how to fix the URL truncation you're describing.
posted by Lord_Pall at 10:57 AM on February 24, 2009

hattifattener might, given an email and suitable compensation.
posted by rhizome at 12:51 PM on February 24, 2009

If Warrick's stalling on those pages and not reading the response properly, try changing the line in WebRepos/ from
"my $num_tries = 5;"
to, uh, with my habits:
# (initials) - if responses aren't being parsed, don't ask again
# my $num_tries = 5;
my $num_tries = 1;
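
For anyone wondering what that setting controls, here is a sketch in Python (not Warrick's actual Perl) of a retry loop shaped like the "try 1 of 5 ... Sleeping for 5 minutes" messages in the question. `fetch_with_retries` and its parameters are hypothetical names, and `fetch` is assumed to return a `(status, body)` pair.

```python
import time


def fetch_with_retries(fetch, url, num_tries=5, wait_seconds=300, sleep=time.sleep):
    """Call fetch(url) up to num_tries times, sleeping between failures.

    Returns (status, body, attempts_used). Any status other than 200,
    including 410, counts as a failure and triggers another try.
    """
    status = None
    for attempt in range(1, num_tries + 1):
        status, body = fetch(url)
        if status == 200:
            return status, body, attempt
        if attempt < num_tries:
            sleep(wait_seconds)  # no sleep after the final try
    return status, None, num_tries
```

With `num_tries=1` the loop makes a single request and never sleeps, which is the effect the edit above is after: a URL that keeps returning 410 costs one request instead of five tries with five-minute waits in between.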

Using curl, there are different responses if you use -L/--location to follow redirects or not. That might be relevant.
posted by Pronoiac at 4:54 PM on February 24, 2009

I wrote something to make warrick understand that click-through page as if it were a redirect, but now I can't get Y! to give me one of those pages for testing. Send me an email and I'll send you the modified warrick so you can give it a try.
posted by hattifattener at 12:00 AM on February 26, 2009 [1 favorite]


Could I get a copy of the updated warrick, plz?
posted by Pronoiac at 8:23 PM on February 28, 2009

Distributing queries using no clobber & Dropbox might work.

Incidentally, at the moment, the above won't work. I say this having checked out the code & having made some unrelated modifications.

Also, the author seems to appreciate patches.
posted by Pronoiac at 5:11 PM on November 10, 2009
