Can I use a web scraper to get real URLs via shortened URLs?
April 21, 2009 10:54 PM Subscribe
Are there any open source web scrapers that I can use to get original URLs from shortened URLs (e.g. bit.ly, is.gd, tinyurl, etc.)?
I'm interested in scraping Digg submission histories for an easy way to look at the websites certain Diggers link to. Unfortunately, the history page links to a Digg page, which then often links to a digg.com-shortened URL via DiggBar. Is there any way to get through these clicks and scrape for the original URLs?
It's not a scraper, as such, but if you are writing one you might look at the code from the LongURL mobile expander. It's a Greasemonkey script that expands these types of URLs and displays the destination URL in a Firefox tooltip.
posted by fireoyster at 11:44 PM on April 21, 2009
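Worth noting the mechanism behind expanders like that one: the shortener answers with an HTTP 3xx redirect, so any HTTP client that follows redirects can recover the destination. A minimal Python sketch of the same idea (the `expand_url` helper is my own naming, not part of any tool mentioned here):

```python
# Sketch: expand a shortened URL by letting urllib follow its redirects.
# Pass any bit.ly / is.gd / tinyurl style link.
import urllib.request

def expand_url(short_url, timeout=10):
    """Return the final URL a shortened link resolves to."""
    # urlopen follows 3xx redirects automatically; the response object's
    # .url attribute holds the address of the final, non-redirect response.
    with urllib.request.urlopen(short_url, timeout=timeout) as resp:
        return resp.url
```

Shorteners that bounce you via a meta-refresh or JavaScript instead of an HTTP redirect won't resolve this way, but the major ones use plain 301s.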
Response by poster: Cool, that helps with getting the destination URL from the shortened URL--thanks!
But what about getting even the shortened URL from the submission history page in the first place (e.g. http://digg.com/users/MrBabyMan/history/submissions)? It looks like the submission page just goes to the story's Digg.com page...unless I'm missing something? (new at this, if it's not totally obvious...) Is there a way to just pull all the destination URLs straight off the submission history page?
btw I'm not actually interested in MrBabyMan...just using as an example.
posted by alohaliz at 11:55 PM on April 21, 2009
On Unix systems:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::UserAgent;

    my @urls = qw(http://xrl.us/bepvnq http://xrl.us/bepvnq);
    my @locations;
    open my $OUT, ">", "long_urls.txt";
    my $ua = LWP::UserAgent->new;
    foreach my $url (@urls) {
        # get() follows the redirect; previous() is the intermediate
        # 3xx response, whose Location header is the original URL
        my $resp = $ua->get($url);
        push @locations, $resp->previous->header('location');
    }
    print $OUT "$_\n" for @locations;

On Windows, grab Strawberry Perl and run the script from the DOS window with

    perl scrape.pl

in the directory containing the script.

To do what you want with the submission history, install WWW::Mechanize from CPAN (in the DOS prompt):

    cpan WWW::Mechanize

and then run the utility script that comes with it:

    mech-dump -links http://the.submission.page.com > unedited_links

Then edit that file down to only the links you're interested in, insert them into the @urls list in the script above (replacing the two example urls), and you should be good.

Of course there are ways to automate that more, but that'll do you to some extent.
posted by singingfish at 1:06 AM on April 22, 2009
I see the problem you have there. That is a web scraping problem. I don't know of anything that will handle it out of the box. You would have to parse the HTML and pull out the links you care about, follow them, and then parse those pages to get the real URL. That's a tall order if you're not comfortable with scripting yourself. If you are feeling brave, WWW::Mechanize is, indeed, a good Perl library to handle it. You might take a look at its find_link method, which might get you what you need.
posted by MasterShake at 2:20 AM on April 22, 2009
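The "parse the HTML and pull out the links you care about" step can be done with nothing but the standard library. A hedged Python sketch (class and function names are my own, and narrowing the result down to just the Digg submission links would be site-specific work on top of this):

```python
# Sketch: collect the href of every anchor tag in an HTML page, the first
# half of the scraping job described above. Feed it the fetched page source.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Accumulate href values from <a> tags in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links
```

Each extracted link would then be fed to the redirect-following step to get the real destination URL.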
There's also WWW::Mechanize::Shell which helps you write mechanize scripts from the command line.
posted by singingfish at 4:55 AM on April 22, 2009
This thread is closed to new comments.