Dealing with HTML Parsing Misery?
October 26, 2005 3:01 PM

I need to write something (in Perl) to shorten the text of a link if it's a URL, but only a URL. I've played with a variety of regexps and banged my head against HTML::Parser, but I've gotten no love. Help!

As far as regexps go, I've done OK with /\(.*?)\< \/a\>/gi for getting the link text in question out, and modifying it isn't a problem.

The problem arises when I try to put it back. All I've been able to do is either replace only part of the original link OR get stuck in an endless loop. I've been escaping my > and < signs, but that doesn't seem to help at all. For examples of what I've been trying (where in each $z is a copy of the original $1 from the first regexp, and $modtxt is the text I want to replace it with), s#">$z\< #$modtxt\# will replace the text, but munges the tag so the initial A tag isn't closed before the ending tag. s#">$z\< #\\>$modtxt\< #, on the other hand, gets stuck in an endless loop.

I've been googling and banging my head against this for several days, and while I think I must be overlooking something really simple, I can't figure out what it is. Thus, I turn to Ask.Me for assistance.

(btw, getting those regexps through was a surprisingly difficult undertaking)
posted by Captain_Tenille to Computers & Internet (25 answers total)
 
AAAARGH. I swear, the regexps looked ok in preview.

These are the correct regular expressions:

First one: /\<a.*href=.*\>(.*?)\<\/a\>/gi

Second one: s/">$z\</$modtxt\</

Third one: s/">$z\</\"\>$modtxt\</

Hopefully they make it out of live preview.
posted by Captain_Tenille at 3:07 PM on October 26, 2005


When I'm messing with perl like this, I wrap everything possible in the whole string in parentheses, like /(^.*)(match regexp)(.*$)/, and then stick it back together with $1.$changed.$3 at the end. (I do it this way because I usually can't figure out the right way.)
posted by smackfu at 3:10 PM on October 26, 2005


At first glance, you may want to use other delimiter characters for your regexp, e.g. s!foo!bar!gi. If you use a character which isn't likely to appear in the URL (now or later), you save some grief and gain readability.
posted by kcm at 3:11 PM on October 26, 2005


kcm: tried that already. Still gets stuck in the endless loop.
posted by Captain_Tenille at 3:17 PM on October 26, 2005


I don't know Perl, but based on what I'd do in sed, maybe you could pull out all three parts of the search regexp like

/(\<a.*href=.*\>)(.*?)(\<\/a\>)/gi

then build the final string with $1$whatever$3 instead of doing a search-and-replace.

On preview: what smackfu said. Plus, this actually saves work (the regexp match has already gone to the trouble of searching your text; why search it again for the replace?)
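
A minimal sketch of the capture-and-rebuild idea, in Perl rather than sed. The sample link and the shortened text are made up for illustration, and I've tightened the pattern to [^>]* so the opening tag can't run past its own closing bracket; treat it as one way to do it, not the way.

```perl
use strict;
use warnings;

# Hypothetical input: one link whose text is a URL.
my $html = '<p><a href="http://example.com/a/b/c.html">http://example.com/a/b/c.html</a></p>';

my $rebuilt = $html;
# Capture everything: text before the link, the opening tag, the link
# text, the closing tag, and whatever follows.
if ($html =~ m{\A(.*?)(<a\b[^>]*href=[^>]*>)(.*?)(</a>)(.*)\z}is) {
    my ($before, $open, $text, $close, $after) = ($1, $2, $3, $4, $5);
    my $whatever = 'example.com/...';    # stand-in for the shortened text
    # Glue the pieces back together instead of running a second search.
    $rebuilt = $before . $open . $whatever . $close . $after;
}
print "$rebuilt\n";
```

This only handles the first link in the string; for several links you'd loop, or use s///g with the same groups.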
posted by flabdablet at 3:21 PM on October 26, 2005


It's not necessary to escape angle brackets -- they have no special significance in regular expressions.

It's possible that $z (the copy of $1) contains metacharacters with significance in regular expressions -- but the nuances of how metacharacters are treated when interpolated via a variable are kind of mind-bending, and I don't have them at the top of my mind. Perhaps you could try:

my $z = quotemeta($1);

If you're positive that $z is an exact copy of $1 (and doesn't itself contain metacharacters), try:

TO CAPTURE:

m{<a.*href=[^>]+>\s*(.+?)\s*</a>}ig;

TO REPLACE:

s{>\s*$z\s*<}{>$modtxt<};

It isn't safe to assume that a quote mark (let alone a double-quote mark) will always appear before the first closing angle bracket, so I don't think it's required in your replacement pattern.
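
To make the metacharacter point concrete, here is a small sketch. The URL is invented; it just needs a ? and a + in it, which is common enough in real link text. \Q...\E is the inline form of quotemeta().

```perl
use strict;
use warnings;

# Hypothetical link whose text contains regex metacharacters (. ? +).
my $link   = '<a href="http://example.com/search?q=a+b">http://example.com/search?q=a+b</a>';
my $z      = 'http://example.com/search?q=a+b';   # copy of the captured link text
my $modtxt = 'example.com/...';

# Raw interpolation: "h?" means "optional h" and "a+" means "one or more a",
# so the pattern no longer matches the literal text.
my $raw = $link;
my $raw_hits = ($raw =~ s{>\s*$z\s*<}{>$modtxt<});

# Quoted interpolation: every character in $z is matched literally.
my $quoted = $link;
my $quoted_hits = ($quoted =~ s{>\s*\Q$z\E\s*<}{>$modtxt<});

print "raw:    ", ($raw_hits    ? "replaced" : "no match"), "\n";
print "quoted: ", ($quoted_hits ? "replaced" : "no match"), "\n";
print "$quoted\n";
```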
posted by macrone at 3:21 PM on October 26, 2005


Use XML::Parser and XML::Writer and stop worrying about brackets. (I haven't used HTML::Parser, but if it's anything like XML::Parser, it's definitely the way to go.)
posted by callmejay at 3:31 PM on October 26, 2005


If you end up using XSLT, here's an identity template that'll do it:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">

<xsl:template match="text()[ancestor::a][contains(., '://')]">
<xsl:value-of select="substring(.,1,10)"/>
</xsl:template>

<xsl:template match="*|@*">
<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

</xsl:stylesheet>
This one will deal with the case of extra tags in the hyperlink too, eg, <a href="..."><b>http://chance-to-advertise-my-site-in-code.com</b></a>
posted by holloway at 3:38 PM on October 26, 2005


I'm a bit confused by what you mean by "endless loop", but the only thing you should need to escape is the / character, and only if you use the default delimiters. If you use alternate delimiters, as kcm and your original post suggest, not even that. Using the default slash delimiters:

$_ = '<A href="http://metafilter.com">METAFILTER IS THE BEST!</A>';

if (/<a.*href=.*>(.*?)<\/a>/gi) {
    my $modtxt = my $z = $1;
    $modtxt =~ s/ IS THE BEST//; # for example
    s/">$z</">$modtxt</;
}

print $_;


On preview, I'm repeating macrone a bit.
posted by Ogre Lawless at 3:50 PM on October 26, 2005


Looking over Ogre Lawless's comment and mine again, I think you should also make sure that all the wildcard matches are non-greedy. Any ".*" is liable to eat up much more text than intended, which could lead to nested matches and replacements, which I suppose could lead to some kind of evil recursion.

To restate my patterns:

TO CAPTURE:

m{<a.+?href=[^>]+>\s*(.+?)\s*</a>}igs;

(There's going to be at least a space after the <a, but you don't want to match into another link. You also want to match across linebreaks, thus the "s" modifier.)

TO REPLACE:

s{>\s*$z\s*<}{>$modtxt<};
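
A quick sketch of the capture pattern run over two links, one of them broken across lines. The sample HTML is made up. Collecting the captures in list context with /g shows that each link's text is picked up separately instead of one match swallowing both.

```perl
use strict;
use warnings;

# Hypothetical input: the first link is split across lines.
my $html = <<'END';
<p><a
  href="http://example.com/one">
  http://example.com/one
</a> and <a href="http://example.com/two">Two</a></p>
END

# In list context, /g returns every capture; /s lets . match newlines.
my @texts = $html =~ m{<a.+?href=[^>]+>\s*(.+?)\s*</a>}igs;

print "[$_]\n" for @texts;
```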
posted by macrone at 4:05 PM on October 26, 2005


How about you show us your whole code? That might help. The thing about the endless loop doesn't seem apparent from what you've given us.

The best way to get Perl help, as much as I love Ask, is PerlMonks.
posted by AmbroseChapel at 4:13 PM on October 26, 2005


Captain_Tenille, can you give an example of the sort of input you are expecting, and the resulting output that you'd like?

You say that you are trying to "shorten the text of a link if it's a URL, but only a URL". Maybe I'm reading this the wrong way, but I take that to mean that you'd like to change link text like "http://www.somesite.com/dir/file.htm" into something like "somesite.com" or "file.htm", while leaving links with text like "My File" unchanged.

Is this what you are trying to accomplish?
posted by woj at 4:23 PM on October 26, 2005


woj: that's pretty much exactly what I'm trying to accomplish.

If anyone wants to see the code in question, what I've been working on can be seen here: linktest.pl. I would just post it, but I need to go watch my daughter for a bit and don't feel like futzing with formatting.
posted by Captain_Tenille at 4:27 PM on October 26, 2005


OK here's something:


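#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole file so links that span lines are matched too.
open( my $fh, '<', "/usr/ambrose/file.html" ) || die "$!";
my $html = do { local $/; <$fh> };
close( $fh );

# /e evaluates the replacement as code, so munge() runs once per link.
$html =~ s{(<a [^>]+>)([^< ]+)</a>}
          {$1 . munge($2) . '</a>'}egsi;

print $html;

# Truncate link text only if it is a URL longer than 32 characters.
sub munge {
    my $tag_contents = shift();
    if ( $tag_contents =~ m|^http(s)?://|
        && length( $tag_contents ) > 32 )
    {
        $tag_contents = substr( $tag_contents, 0, 32 ) . '...';
    }
    return $tag_contents;
}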



where the HTML file in question looks like this:


<a href="http://www.yahoo.com/">http://www.yahoo.com/</a>
Short URL as link text
<a href="http://www.yahoo.com/">Click here</a>
non-URL as link text
<a href="http://www.yahoo.com/foo/bar/baz/quux/">http://www.yahoo.com/foo/bar/baz/quux/</a>
long URL as link text


What it should do is: ignore link content which isn't a URL, ignore URLs if they're less than 32 chars long, and change the ones which are longer into the first 32 chars, plus '...' to show you've truncated them.

How's that?

It outputs this on the test file:


<a href="http://www.yahoo.com/">http://www.yahoo.com/</a>
Short URL as link text
<a href="http://www.yahoo.com/">Click here</a>
non-URL as link text
<a href="http://www.yahoo.com/foo/bar/baz/quux/">http://www.yahoo.com/foo/bar/baz...</a>
long URL as link text

posted by AmbroseChapel at 4:46 PM on October 26, 2005


AmbroseChapel: Unfortunately, your script also ignores link text that itself contains tags. You should change:

s{(<a [^>]+>)([^< ]+)</a>}

to:

s{(<a [^>]+>)(.+?)</a>}

Otherwise, I think your approach is best: to execute code in the replace pattern, rather than running two regexes across the same data.
posted by macrone at 4:54 PM on October 26, 2005


On preview: Yeah, what AmbroseChapel said. But here it is anyway. Oh, and this just modifies your code, but whatever.

I think your problem is more in how you're using the while loop. I'd skip it and use the "e" modifier on the s/// operator. That lets you put an evaluated expression inside the replacement (which lets you run whatever code you like to generate the replacement). Then, the standard "g" modifier on the s/// will just make all of your replacements for you without "re-finding" them. Try this out:

$link =~ s/(<a.+?href[^>]+>)\s*(.+?)\s*(<\/a>)/$1 . &shortenMe($2) . $3/gei;

sub shortenMe {
    my $input = shift;
    return $input if $input !~ m#(^http://|^ftp://)#;

    my @tok = split '/', $input;

    my $protocol = shift @tok;
    shift @tok; #off into space
    my $domain = shift @tok;
    my $remainder = shift @tok;
    my $modtxt = "$protocol//$domain/...";
    return $modtxt;
}
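
For what it's worth, here is one way a while loop around a /g match can spin forever. This is a guess at the failure mode, not a reading of linktest.pl: it assumes the loop body edits the same string the loop condition is matching. Modifying the string resets pos(), so the next /g match starts from the beginning and finds the first link again. The pass counter is only there so the sketch terminates.

```perl
use strict;
use warnings;

# Hypothetical input with a single link.
my $html = '<a href="http://example.com/">http://example.com/</a>';

my $passes = 0;
while ($html =~ m{<a [^>]+>(.*?)</a>}g) {
    my $text = $1;
    last if ++$passes >= 5;              # safety cap for the demo
    # Any successful edit of $html resets pos($html), so the match
    # above restarts at offset 0 and finds the same link again.
    $html =~ s{>\Q$text\E<}{>$text<};
}
print "loop body ran $passes times for one link\n";
```

With s///ge there is no loop to re-enter, which is why the single substitution above sidesteps the problem.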
posted by whatnotever at 4:59 PM on October 26, 2005


This might work for you, with a small amount of tweaking. Mine will use the host name from the href as the new link text, and uses Regexp::Common:


use Regexp::Common qw /URI/;
use warnings;
use strict;

open(my $html, "<test.html");
while (<$html>) {
m#(<a\s+href.*>)\s*(\S+)\s*(</a\s*>)#mi;
my ($opentag, $linktext, $closetag) = ($1, $2, $3);

if ($linktext =~ /$RE{URI}{HTTP}{-keep}/) {
my $host = $3;
print "Found URL as link text...\n";
print "\tNew link is \'$opentag$host$closetag\'\n";
}
}


The file "test.html" looks pretty much like AmbroseChapel's.
posted by woj at 5:20 PM on October 26, 2005


Oh yeah, and I wanted to mention that you can use Regexp::Common to match on any type of URI, not just http.
posted by woj at 5:22 PM on October 26, 2005


Sorry to keep posting, but I just re-read your code, and if you change the if statement in my example to look like this, then it does what your script intends:


if ($linktext =~ /$RE{URI}{HTTP}{-keep}/) {
my ($proto, $host) = ($2,$3);
print "Found URL as link text...\n";
print "\tOld link was \'$opentag$linktext$closetag\'\n";
print "\tNew link is \'$opentag $proto://$host... $closetag\'\n";
}

posted by woj at 5:29 PM on October 26, 2005


Unfortunately, your script also ignores link text that itself contains tags.

But we don't need to change those ones!

I mean, you're right of course, but there's no URL which needs to be shortened which will be missed, is there?

Mind you, what happens if the link text contains something like:

http://www.blah.com/ <b>I love this site!</b>

then we're in trouble...
posted by AmbroseChapel at 5:32 PM on October 26, 2005


Mind you, what happens if the link text contains something like:

http://www.blah.com/ <b>I love this site!</b>

then we're in trouble...


Perhaps not the most elegant solution in the world, but this works even with links as ugly as...

<a href="http://ask.metafilter.com/mefi/26171"><em>check out</em> http://ask.metafilter.com/mefi/26171 <strong>for more info</strong></a>


use Regexp::Common qw /URI/;
use warnings;
use strict;

open(my $html, "<test.html");

while (<$html>) {

m#(<a\s+href.*?>)(.*?)(</a\s*>)#i;
my ($opentag, $linktext, $closetag) = ($1, $2, $3);
my $replace = $opentag;
foreach my $chunk (split /\s+/,$linktext) {
if ($chunk=~/$RE{URI}{HTTP}{-keep}/ ){
my ($proto, $host) = ($2,$3);
$replace.=" $proto://$host... ";
}
else {
$replace.=" $chunk";
}

}
print "Replacement is $replace\n";
}


Sorry, I don't feel like adding in a bunch of nbsp's to indent it correctly.
posted by woj at 6:07 PM on October 26, 2005


Somewhat off-topic but I'm interested in this part:

open(my $html, "<test.html");
while (<$html>){
}

That wouldn't work if the tag was split across lines, would it? Say you had

<a
href="foo">


for instance?
posted by AmbroseChapel at 6:18 PM on October 26, 2005


True, I guess you'd have to slurp in the file and use multi-line matching. I was just using the standard input record separator for simplicity's sake when I was testing it out. Honestly, I didn't realize that you could put newlines within the tags without choking the browser. :)
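
A sketch of the slurp approach, for what it's worth. It reads from an in-memory string instead of test.html so it runs on its own, and the sample markup is invented. Localizing $/ to undef makes <$fh> return the whole file at once, and \s in the pattern matches the newline inside the tag.

```perl
use strict;
use warnings;

# Hypothetical file contents with the tag split across lines.
my $source = qq{<a\nhref="foo">\nhttp://example.com/some/page\n</a>\n};

# Open the string as a file so the sketch needs nothing on disk.
open(my $fh, '<', \$source) or die "Can't open in-memory file: $!";
my $html = do { local $/; <$fh> };    # undef $/ means "read everything"
close($fh);

my ($opentag, $linktext, $closetag);
if ($html =~ m#(<a\s+href.*?>)\s*(\S+)\s*(</a\s*>)#si) {
    ($opentag, $linktext, $closetag) = ($1, $2, $3);
    print "Link text is '$linktext'\n";
}
```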
posted by woj at 6:23 PM on October 26, 2005


And while I'm nit-picking, this:

m#(<a\s+href.*>)\s*(\S+)\s*(</a\s*>)#mi;

is going to give problematic results if there are two a tags on the same line, because of the ".*" being greedy.

<a href="http://foo.com/">foo</a>, <a href="http://bar.com/">bar</a>


for instance will get you everything up to the second "bar" in $1.
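
Here is that case as a runnable sketch, using the pattern and the sample line above. The only change in the second pattern is swapping .* for [^>]*, which is one way (not the only one) to keep the opening tag from running past its own bracket.

```perl
use strict;
use warnings;

my $line = '<a href="http://foo.com/">foo</a>, <a href="http://bar.com/">bar</a>';

# Greedy .* runs to the last > that still lets the rest of the pattern match.
my ($greedy_open, $greedy_text) = $line =~ m#(<a\s+href.*>)\s*(\S+)\s*(</a\s*>)#mi;

# [^>]* cannot cross a closing bracket, so it stops at the first tag.
my ($tight_open, $tight_text) = $line =~ m#(<a\s+href[^>]*>)\s*(\S+)\s*(</a\s*>)#mi;

print "greedy: $greedy_open | $greedy_text\n";
print "tight:  $tight_open | $tight_text\n";
```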

I didn't realize that you could put newlines within the tags without choking the browser. :)


Any whitespace is legal, including returns. I've been bitten before...
posted by AmbroseChapel at 6:27 PM on October 26, 2005


Any whitespace is legal, including returns. I've been bitten before...


I actually don't work with html documents too often, so that is good to know. The missing ? was a typo on my part, but yes, there would be a problem with two tags on the line. And with images as links, assuming he doesn't want to break those, and probably a bunch of other problems I can't think of right now... I think I'm gonna sit out the rest of this one. Good luck.
posted by woj at 7:02 PM on October 26, 2005

