Link Report
October 10, 2010 2:14 PM

Looking for software which will give me a report of every site I've ever linked to on my blog.

I've had a blog for many years and written many posts. I'd love to be able to run some kind of tool which will tell me which sites I've linked to and how often I've done so. In other words, it would scan all posts on my site & ideally give me back something like this:
nytimes.com - 117
washingtonpost.com - 33
cnet.com - 4
And so forth. (This is a purely hypothetical example, of course.) I'm looking for something simple and (hopefully) free. I'm using Windows XP, in case that matters, but have access to a Mac if need be. Also in case it matters, my site is on the Soapblox platform.

Thanks for your help!
posted by Conrad Cornelius o'Donald o'Dell to Technology (2 answers total)
Counting link frequency is pretty trivial (a couple of lines) in any of the major scripting languages. The hard part is getting the blog content into a form that can be easily read -- does Soapblox offer anything like a 'post export' or 'database dump' where you can get all your content in one file? Even one post per file would be fine. If not, you'd probably have to spider the site, which is also pretty easy with wget or HTTrack; see the sketch below.
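For example (just a sketch -- www.example.com stands in for your actual blog address, and you may want to add --wait=1 to go easy on the server), a recursive wget mirror would pull the whole site down into a local directory:
wget --mirror --no-parent http://www.example.com/
HTTrack does the same job with a point-and-click interface if you'd rather stay off the command line.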
posted by Rhomboid at 12:43 AM on October 11, 2010


Best answer: I took a look at some of the blogs listed on the Soapblox site, and based on that I came up with the following perl commands. The first will pull down all blog posts from the site into files named after the post number (postnum.html). This requires the LWP perl module; see below. You probably want to run this in an empty directory created just for the purpose, so you don't pollute your home dir with a bunch of files. Replace www.example.com with your blog's front page. Note that this assumes your blog follows the URL patterns of /main/nnn for entry listing pages and /diary/nnn for each entry. (Even though the blogs I checked had SEO-friendly URLs with the post title after the /nnn part, just /diary/nnn worked fine as well, similar to how MetaFilter works.)
perl -MLWP::Simple -e '$| = 1; $u = "http://www.example.com"; for($i = 0; @m = get("$u/main/$i") =~ m!<a\s+[^>]*href="/diary/(\d+)[^"]*"!sig; $i++) { print("$_ "), getstore("$u/diary/$_", "$_.html") for (grep { !$seen{$_}++ } @m); }'
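If you'd rather have something you can read and tweak, here's the same download logic spelled out as a script (a sketch with the same assumptions about the /main/nnn and /diary/nnn URL patterns; save it as, say, fetch_posts.pl and run "perl fetch_posts.pl"):
use strict;
use warnings;
use LWP::Simple qw(get getstore);

my $base = "http://www.example.com";    # your blog's front page

my %seen;
for (my $page = 0; ; $page++) {
    # each /main/nnn page lists a batch of entries; stop when one can't be fetched
    my $listing = get("$base/main/$page") or last;
    # grab every /diary/nnn link on the listing page
    my @ids = $listing =~ m!<a\s+[^>]*href="/diary/(\d+)[^"]*"!sig;
    last unless @ids;
    for my $id (grep { !$seen{$_}++ } @ids) {
        print "$id ";
        getstore("$base/diary/$id", "$id.html");    # save the entry as nnn.html
    }
}
print "\n";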
The next command will read all the .html files in the current directory and generate a report of link frequency. These examples all use redirection to send the output to a file (>report.txt), but you can modify that to suit. This one requires the HTML::LinkExtor perl module to be installed first:
perl -MHTML::LinkExtor -0777 -ne 'HTML::LinkExtor->new(sub { $h{$2}++ if (shift eq "a" && +{@_}->{href} =~ m@https?://(www\.)?([^/]+)@i); })->parse($_); }{ print "$_ - $h{$_}\n" for(sort { $h{$b} <=> $h{$a} } keys %h)' *.html >report.txt
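And the same reporting logic as a readable script, if that's easier to adapt (again just a sketch; save it as, say, report_links.pl and run "perl report_links.pl *.html > report.txt"):
use strict;
use warnings;
use HTML::LinkExtor;

my %count;
for my $file (@ARGV) {
    open my $fh, "<", $file or next;
    local $/;                                   # slurp the whole file
    my $html = <$fh>;
    close $fh;
    # have the parser hand us every <a href="..."> and tally the host part
    HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq "a" && defined $attr{href};
        $count{$2}++ if $attr{href} =~ m@https?://(www\.)?([^/]+)@i;
    })->parse($html);
}

# most-linked hosts first
print "$_ - $count{$_}\n" for sort { $count{$b} <=> $count{$a} } keys %count;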
Note that this looks at the page as a whole, so if you have a blogroll, for example, those links will be counted on every entry and skew your stats. To process just the entry and not the rest of the page, try the following instead. It assumes your layout is the same as the ones I looked at, keying off <h1 class="diaryTitle"> for the start of the entry and <td>Tags: for the end. You can change those if you need to.
perl -MHTML::LinkExtor -0777 -ne 's!^.+<h1 class="diaryTitle">(.+)<td>Tags:.+$!$1!si; HTML::LinkExtor->new(sub { $h{$2}++ if (shift eq "a" && +{@_}->{href} =~ m@https?://(www\.)?([^/]+)@i); })->parse($_); }{ print "$_ - $h{$_}\n" for(sort { $h{$b} <=> $h{$a} } keys %h)' *.html >report.txt
If you have trouble installing modules, here's a version of the above that uses regexps instead of a parser. It might miss a link or two here and there if they're coded strangely (e.g. single quotes instead of double quotes around the href attribute, or no quotes at all).
perl -ne '$h{$2}++ while(m@<a [^>]*href="https?://(www\.)?([^/"]+)@ig); }{ print "$_ - $h{$_}\n" for(sort { $h{$b} <=> $h{$a} } keys %h)' *.html >report.txt
And again a version that counts only the body of the entry, same caveats as above:
perl -0777 -ne 's!^.+<h1 class="diaryTitle">(.+)<td>Tags:.+$!$1!si; $h{$2}++ while(m@<a [^>]*href="https?://(www\.)?([^/"]+)@ig); }{ print "$_ - $h{$_}\n" for(sort { $h{$b} <=> $h{$a} } keys %h)' *.html >report.txt
Notes on perl for Windows and installing modules:

There are several different flavors of perl for Windows, all free: ActivePerl by ActiveState, Strawberry perl, and Cygwin perl. Cygwin is a set of ports of many common unix utilities for Windows. If you plan to do any other scripting or development, it's probably best to do it all with Cygwin, as the Cygwin tools are all "POSIX-ized" to act like *nix. Strawberry and ActivePerl, on the other hand, are 'native win32' perls, which stand alone and act more like regular Windows binaries without the POSIX emulation. As for installing modules, both Cygwin and Strawberry perl support installing directly from CPAN using the same methods as on *nix, i.e. run "cpan foo" to install the module foo, though you should check out a simplified interface called cpanminus. ActivePerl has no compiler, so you have to use their binary package system to install modules.
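For example, on Cygwin or Strawberry the module installs would look something like this (run from a command prompt; module names are case-sensitive):
cpan LWP::Simple HTML::LinkExtor
Or, if you'd rather use cpanminus, install it once and then use cpanm for everything else:
cpan App::cpanminus
cpanm LWP::Simple HTML::LinkExtor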

Note that Strawberry perl comes out of the box with several common non-core modules installed, including LWP, which makes it probably the best choice if the previous paragraph made no sense to you: you won't have to install any modules at all as long as you use the regexp version of the reporting command.
posted by Rhomboid at 5:37 AM on October 11, 2010 [1 favorite]


This thread is closed to new comments.