<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Dealing with HTML Parsing Misery?</title>
	<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery/</link>
	<description>Comments on Ask MetaFilter post Dealing with HTML Parsing Misery?</description>
	<pubDate>Wed, 26 Oct 2005 15:07:49 -0800</pubDate>
	<lastBuildDate>Wed, 26 Oct 2005 15:07:49 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: Dealing with HTML Parsing Misery?</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery</link>	
		<description>I need to write something (in perl) to shorten the text of a link if it&apos;s a URL, but only a URL. I&apos;ve played with a variety of regexps and banged my head against HTML::Parser, but I&apos;ve gotten no love. Help! &lt;br /&gt;&lt;br /&gt; As far as regexps go, I&apos;ve done OK with &lt;tt&gt;/\&amp;lt;a .*href=.*\&amp;gt;(.*?)\&amp;lt;\/a\&amp;gt;/gi&lt;/tt&gt; for getting the link text in question out, and modifying it isn&apos;t a problem.&lt;br&gt;
&lt;br&gt;
The problem arises when I try to put it back. All I&apos;ve been able to do is either replace only part of the original link or get stuck in an endless loop. I&apos;ve been escaping my &amp;gt; and &amp;lt; signs, but that doesn&apos;t seem to help at all. For examples of what I&apos;ve been trying (where in each $z is a copy of the original $1 from the first regexp, and $modtxt is the text I want to replace it with): &lt;tt&gt;s#&quot;&amp;gt;$z\&amp;lt;#$modtxt\&amp;lt;#&lt;/tt&gt; will replace the text, but munges the tag so the initial A tag isn&apos;t closed before the ending tag. &lt;tt&gt;s#&quot;&amp;gt;$z\&amp;lt;#\&quot;\&amp;gt;$modtxt\&amp;lt;#&lt;/tt&gt;, on the other hand, gets stuck in an endless loop.&lt;br&gt;
&lt;br&gt;
I&apos;ve been googling and banging my head against this for several days, and while I think I must be overlooking something really simple, I can&apos;t figure out what it is. Thus, I turn to Ask.Me for assistance.&lt;br&gt;
&lt;br&gt;
(btw, getting those regexps through was a surprisingly difficult undertaking)</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2005:site.26171</guid>
		<pubDate>Wed, 26 Oct 2005 15:01:48 -0800</pubDate>
		<dc:creator>Captain_Tenille</dc:creator>
		
			<category>perl</category>
		
			<category>regexp</category>
		
			<category>html</category>
		
	</item> <item>
		<title>By: Captain_Tenille</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412939</link>	
		<description>AAAARGH. I swear, the regexps looked ok in preview.&lt;br&gt;
&lt;br&gt;
These are the correct regular expressions:&lt;br&gt;
&lt;br&gt;
First one: /\&amp;lt;a.*href=.*\&amp;gt;(.*?)\&amp;lt;\/a\&amp;gt;/gi&lt;br&gt;
&lt;br&gt;
Second one: s/&quot;&amp;gt;$z\&amp;lt;/$modtxt\&amp;lt;/&lt;br&gt;
&lt;br&gt;
Third one: s/&quot;&amp;gt;$z\&amp;lt;/\&quot;\&amp;gt;$modtxt\&amp;lt;/&lt;br&gt;
&lt;br&gt;
Hopefully they make it out of live preview.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412939</guid>
		<pubDate>Wed, 26 Oct 2005 15:07:49 -0800</pubDate>
		<dc:creator>Captain_Tenille</dc:creator>
	</item><item>
		<title>By: smackfu</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412947</link>	
		<description>When I&apos;m messing with perl like this, I wrap everything possible in the whole string in parentheses (like /(^.*)(match regexp)(.*$)/), and then stick it back together with $1.$changed.$3 at the end.  (I do it this way because I usually can&apos;t figure out the right way.)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412947</guid>
		<pubDate>Wed, 26 Oct 2005 15:10:59 -0800</pubDate>
		<dc:creator>smackfu</dc:creator>
	</item><item>
		<title>By: kcm</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412949</link>	
		<description>You may want to use other delimiter characters for your regexp, at first glance: s!foo!bar!gi, e.g.  If you use a character which isn&apos;t likely to appear in the URL (now or later), you save some grief and gain readability.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412949</guid>
		<pubDate>Wed, 26 Oct 2005 15:11:43 -0800</pubDate>
		<dc:creator>kcm</dc:creator>
	</item><item>
		<title>By: Captain_Tenille</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412955</link>	
		<description>kcm: tried that already. Still gets stuck in the endless loop.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412955</guid>
		<pubDate>Wed, 26 Oct 2005 15:17:48 -0800</pubDate>
		<dc:creator>Captain_Tenille</dc:creator>
	</item><item>
		<title>By: flabdablet</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412958</link>	
		<description>I don&apos;t know Perl, but based on what I&apos;d do in sed, maybe you could pull out all three parts of the search regexp like&lt;br&gt;
&lt;br&gt;
 /(\&amp;lt;a.*href=.*\&amp;gt;)(.*?)(\&amp;lt;\/a\&amp;gt;)/gi&lt;br&gt;
&lt;br&gt;
then build the final string with $1$whatever$3 instead of doing a search-and-replace.&lt;br&gt;
&lt;br&gt;
On preview: what smackfu said.  Plus, this actually saves work (the regexp match has already gone to the trouble of searching your text; why search it again for the replace?)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412958</guid>
		<pubDate>Wed, 26 Oct 2005 15:21:16 -0800</pubDate>
		<dc:creator>flabdablet</dc:creator>
	</item><item>
		<title>By: macrone</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412959</link>	
		<description>It&apos;s not necessary to escape angle brackets -- they have no special significance in regular expressions.&lt;br&gt;
&lt;br&gt;
It&apos;s possible that $z (the copy of $1) contains metacharacters with significance in regular expressions -- but the nuances of how metacharacters are treated when interpolated via a variable are kind of mind-bending, and I don&apos;t have them at the top of my mind. Perhaps you could try:&lt;br&gt;
&lt;br&gt;
    my $z = quotemeta($1);&lt;br&gt;
&lt;br&gt;
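As a minimal sketch of why quoting matters here (the URL and the surrounding text are made-up examples, not taken from the original script): a "?" in the link text acts as a regex quantifier when $z is interpolated unquoted, so the pattern may no longer match its own source text.&lt;br&gt;
&lt;br&gt;
```perl
use strict;
use warnings;

# Hypothetical link text; "?" and "." are regex metacharacters.
my $z    = 'http://example.com/page?id=1';
my $text = 'see http://example.com/page?id=1 for details';

# Unquoted: "e?" means "an optional e", so the literal "?" is never matched.
my $unquoted = ($text =~ /$z/) ? 'yes' : 'no';

# Quoted with \Q...\E (the inline form of quotemeta): every character is literal.
my $quoted = ($text =~ /\Q$z\E/) ? 'yes' : 'no';

print "unquoted match: $unquoted\n";   # prints "unquoted match: no"
print "quoted match: $quoted\n";       # prints "quoted match: yes"
```
&lt;br&gt;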
If you&apos;re positive that $z is an exact copy of $1 (and doesn&apos;t itself contain metacharacters), try:&lt;br&gt;
&lt;br&gt;
TO CAPTURE:&lt;br&gt;
&lt;br&gt;
m{&amp;lt;a.*href=[^&amp;gt;]+&amp;gt;\s*(.+?)\s*&amp;lt;/a&amp;gt;}ig;&lt;br&gt;
&lt;br&gt;
TO REPLACE:&lt;br&gt;
&lt;br&gt;
s{&amp;gt;\s*$z\s*&amp;lt;}{&amp;gt;$modtxt&amp;lt;};&lt;br&gt;
&lt;br&gt;
It isn&apos;t safe to assume that a quote mark (let alone a double-quote mark) will always appear before the first closing angle bracket, so I don&apos;t think it&apos;s required in your replacement pattern.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412959</guid>
		<pubDate>Wed, 26 Oct 2005 15:21:49 -0800</pubDate>
		<dc:creator>macrone</dc:creator>
	</item><item>
		<title>By: callmejay</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412977</link>	
		<description>Use XML::Parser and XML::Writer and stop worrying about brackets.  (I haven&apos;t used HTML::Parser, but if it&apos;s anything like XML::Parser, it&apos;s definitely the way to go.)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412977</guid>
		<pubDate>Wed, 26 Oct 2005 15:31:09 -0800</pubDate>
		<dc:creator>callmejay</dc:creator>
	</item><item>
		<title>By: holloway</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#412985</link>	
		<description>If you end up using XSLT, here&apos;s an identity template that&apos;ll do it,&lt;br&gt;
&lt;blockquote&gt;&lt;tt&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&amp;gt;&lt;br&gt;
&amp;lt;xsl:stylesheet	version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot; xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;br&gt;
&lt;br&gt;
	&amp;lt;xsl:template match=&quot;text()[ancestor::a][contains(., &apos;://&apos;)]&quot;&amp;gt;&lt;br&gt;
	   	&amp;lt;xsl:value-of select=&quot;substring(.,1,10)&quot;/&amp;gt;&lt;br&gt;
	&amp;lt;/xsl:template&amp;gt;&lt;br&gt;
&lt;br&gt;
	&amp;lt;xsl:template match=&quot;@*|node()&quot;&amp;gt;&lt;br&gt;
		&amp;lt;xsl:copy&amp;gt;&amp;lt;xsl:apply-templates select=&quot;@*|node()&quot;/&amp;gt;&amp;lt;/xsl:copy&amp;gt;&lt;br&gt;
	&amp;lt;/xsl:template&amp;gt;&lt;br&gt;
&lt;br&gt;
&amp;lt;/xsl:stylesheet&amp;gt;&lt;/tt&gt;&lt;/blockquote&gt;This one will deal with the case of extra tags in the hyperlink too, eg, &amp;lt;a href=&quot;...&quot;&amp;gt;&lt;b&gt;&amp;lt;b&amp;gt;&lt;/b&gt;http://chance-to-advertise-my-site-in-code.com&lt;b&gt;&amp;lt;/b&amp;gt;&lt;/b&gt;&amp;lt;/a&amp;gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-412985</guid>
		<pubDate>Wed, 26 Oct 2005 15:38:35 -0800</pubDate>
		<dc:creator>holloway</dc:creator>
	</item><item>
		<title>By: Ogre Lawless</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413011</link>	
		<description>I&apos;m a bit confused by what you mean by &quot;endless loop&quot;, but the only thing you should need to escape is the / character, and only if you use the default delimiters.  If you use other delimiters, as kcm and your original post suggest, not even that.  Using the default slashes:&lt;br&gt;
&lt;br&gt;
$_ = &apos;&amp;lt;A href=&quot;http://metafilter.com&quot;&amp;gt;METAFILTER IS THE BEST!&amp;lt;/A&amp;gt;&apos;;&lt;br&gt;
&lt;br&gt;
if (/&amp;lt;a.*href=.*&amp;gt;(.*?)&amp;lt;\/a&amp;gt;/gi) {&lt;br&gt;
 my $modtxt = my $z = $1;&lt;br&gt;
 $modtxt =~ s/ IS THE BEST//; # for example&lt;br&gt;
 s/&quot;&amp;gt;$z&amp;lt;/&quot;&amp;gt;$modtxt&amp;lt;/;&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
print $_;&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
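The point about escaping only the delimiter can be sketched like this (a rough illustration; the URL and the variable names are assumptions, not part of the code above):&lt;br&gt;
&lt;br&gt;
```perl
use strict;
use warnings;

my $url = 'http://metafilter.com/path';   # hypothetical input

# Default delimiters: every "/" inside the pattern must be escaped.
(my $with_slashes = $url) =~ s/^http:\/\///;

# Brace delimiters: the same pattern needs no escaping at all.
(my $with_braces = $url) =~ s{^http://}{};

print "$with_slashes\n";   # prints "metafilter.com/path"
print "$with_braces\n";    # prints "metafilter.com/path"
```
&lt;br&gt;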
on preview, I&apos;m repeating macrone a bit.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413011</guid>
		<pubDate>Wed, 26 Oct 2005 15:50:56 -0800</pubDate>
		<dc:creator>Ogre Lawless</dc:creator>
	</item><item>
		<title>By: macrone</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413030</link>	
		<description>Looking over &lt;b&gt;Ogre Lawless&lt;/b&gt;&apos;s comment and mine again, I think you should also make sure that all the wildcard matches are non-greedy. Any &quot;.*&quot; is liable to eat up much more text than intended, which could lead to nested matches and replacements, which I suppose could lead to some kind of evil recursion.&lt;br&gt;
&lt;br&gt;
To restate my patterns:&lt;br&gt;
&lt;br&gt;
TO CAPTURE:&lt;br&gt;
&lt;br&gt;
m{&amp;lt;a.+?href=[^&amp;gt;]+&amp;gt;\s*(.+?)\s*&amp;lt;/a&amp;gt;}igs;&lt;br&gt;
&lt;br&gt;
(There&apos;s going to be at least a space after the &amp;lt;a, but you don&apos;t want to match into another link. You also want to match across linebreaks, thus the &quot;s&quot; modifier.)&lt;br&gt;
&lt;br&gt;
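A small sketch of what the "s" modifier changes, assuming some link text that wraps across two lines (the string is invented for illustration):&lt;br&gt;
&lt;br&gt;
```perl
use strict;
use warnings;

# Hypothetical text that wraps across two lines.
my $text = "http://example.com/\nlong/path";

# Without /s, "." does not match the newline, so the match fails.
my $without_s = ($text =~ m{com/.long}) ? 'match' : 'no match';

# With /s, "." also matches the newline.
my $with_s = ($text =~ m{com/.long}s) ? 'match' : 'no match';

print "without /s: $without_s\n";   # prints "without /s: no match"
print "with /s: $with_s\n";         # prints "with /s: match"
```
&lt;br&gt;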
TO REPLACE:&lt;br&gt;
&lt;br&gt;
s{&amp;gt;\s*$z\s*&amp;lt;}{&amp;gt;$modtxt&amp;lt;};</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413030</guid>
		<pubDate>Wed, 26 Oct 2005 16:05:15 -0800</pubDate>
		<dc:creator>macrone</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413036</link>	
		<description>How about you show us your whole code? That might help. The thing about the endless loop doesn&apos;t seem apparent from what you&apos;ve given us.&lt;br&gt;
&lt;br&gt;
The best way to get Perl help, as much as I love Ask, is &lt;a href=&quot;http://perlmonks.org/?node=Seekers%20of%20Perl%20Wisdom&quot;&gt;PerlMonks&lt;/a&gt;.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413036</guid>
		<pubDate>Wed, 26 Oct 2005 16:13:12 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413043</link>	
		<description>&lt;strong&gt;Captain_Tenille&lt;/strong&gt;, can you give an example of the sort of input you are expecting, and the resulting output that you&apos;d like?&lt;br&gt;
&lt;br&gt;
You say that you are trying to &quot;&lt;em&gt;shorten the text of a link if it&apos;s a URL, but only a URL&lt;/em&gt;&quot;. Maybe I&apos;m reading this the wrong way, but I take that to mean that you&apos;d like to change link text like &quot;&lt;a href=&quot;http://www.somesite.com/dir/file.htm&quot;&gt;http://www.somesite.com/dir/file.htm&lt;/a&gt;&quot;  into something like &quot;&lt;a href=&quot;http://www.somesite.com/dir/file.htm&quot;&gt;somesite.com&lt;/a&gt;&quot;  or &quot;&lt;a href=&quot;http://www.somesite.com/dir/file.htm&quot;&gt;file.htm&lt;/a&gt;&quot;, while leaving links with text like &quot;&lt;a href=&quot;http://www.somesite.com/dir/file.htm&quot;&gt;My File&lt;/a&gt;&quot; unchanged.&lt;br&gt;
&lt;br&gt;
Is this what you are trying to accomplish?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413043</guid>
		<pubDate>Wed, 26 Oct 2005 16:23:11 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item><item>
		<title>By: Captain_Tenille</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413044</link>	
		<description>woj: that&apos;s pretty much exactly what I&apos;m trying to accomplish.&lt;br&gt;
&lt;br&gt;
If anyone wants to see the code in question, what I&apos;ve been working on can be seen here: &lt;a href=&quot;http://home.satanosphere.com/cam-pics/linktest.pl&quot;&gt;linktest.pl&lt;/a&gt;. I would just post it, but I need to go watch my daughter for a bit and don&apos;t feel like futzing with formatting.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413044</guid>
		<pubDate>Wed, 26 Oct 2005 16:27:59 -0800</pubDate>
		<dc:creator>Captain_Tenille</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413056</link>	
		<description>OK here&apos;s something:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
#!/usr/bin/perl&lt;br&gt;
undef $/;&lt;br&gt;
open( HTMLFILE, &quot;/usr/ambrose/file.html&quot; ) || die &quot;$!&quot;;&lt;br&gt;
my $html = &amp;lt;HTMLFILE&amp;gt;;&lt;br&gt;
close( HTMLFILE );&lt;br&gt;
&lt;br&gt;
$html =~ s{(&amp;lt;a [^&amp;gt;]+&amp;gt;)([^&amp;lt; ]+)&amp;lt;/a&amp;gt;}&lt;br&gt;
         {$1 . munge($2) . &apos;&amp;lt;/a&amp;gt;&apos;}egsi;&lt;br&gt;
&lt;br&gt;
print $html;&lt;br&gt;
&lt;br&gt;
sub munge {&lt;br&gt;
    my $tag_contents = shift();&lt;br&gt;
    if ( $tag_contents =~ m|^http(s)?://|&lt;br&gt;
        &amp;amp;&amp;amp; length( $tag_contents ) &amp;gt; 32 )&lt;br&gt;
    {&lt;br&gt;
        $tag_contents = substr( $tag_contents, 0, 32 ) . &apos;...&apos;;&lt;br&gt;
    }&lt;br&gt;
    return $tag_contents;&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
where the HTML file in question looks like this:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
&amp;lt;a href=&quot;http://www.yahoo.com/&quot;&amp;gt;http://www.yahoo.com/&amp;lt;/a&amp;gt;&lt;br&gt;
Short URL as link text&lt;br&gt;
&amp;lt;a href=&quot;http://www.yahoo.com/&quot;&amp;gt;Click here&amp;lt;/a&amp;gt;&lt;br&gt;
non-URL as link text&lt;br&gt;
&amp;lt;a href=&quot;http://www.yahoo.com/foo/bar/baz/quux/&quot;&amp;gt;http://www.yahoo.com/foo/bar/baz/quux/&amp;lt;/a&amp;gt; &lt;br&gt;
long URL as link text&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
What it should do is: ignore link content which isn&apos;t a URL, ignore URLs if they&apos;re 32 chars long or shorter, and change the ones which are longer into the first 32 chars, plus &apos;...&apos; to show you&apos;ve truncated them.&lt;br&gt;
&lt;br&gt;
How&apos;s that?&lt;br&gt;
&lt;br&gt;
It outputs this on the test file:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
&amp;lt;a href=&quot;http://www.yahoo.com/&quot;&amp;gt;http://www.yahoo.com/&amp;lt;/a&amp;gt;&lt;br&gt;
Short URL as link text&lt;br&gt;
&amp;lt;a href=&quot;http://www.yahoo.com/&quot;&amp;gt;Click here&amp;lt;/a&amp;gt;&lt;br&gt;
non-URL as link text&lt;br&gt;
&amp;lt;a href=&quot;http://www.yahoo.com/foo/bar/baz/quux/&quot;&amp;gt;http://www.yahoo.com/foo/bar/baz...&amp;lt;/a&amp;gt; &lt;br&gt;
long URL as link text&lt;br&gt;
&lt;/code&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413056</guid>
		<pubDate>Wed, 26 Oct 2005 16:46:57 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: macrone</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413063</link>	
		<description>&lt;b&gt;AmbroseChapel&lt;/b&gt;: Unfortunately, your script also ignores link text that itself contains tags. You should change:&lt;br&gt;
&lt;br&gt;
s{(&amp;lt;a [^&amp;gt;]+&amp;gt;)([^&amp;lt; ]+)&amp;lt;/a&amp;gt;}&lt;br&gt;
&lt;br&gt;
to:&lt;br&gt;
&lt;br&gt;
s{(&amp;lt;a [^&amp;gt;]+&amp;gt;)(.+?)&amp;lt;/a&amp;gt;}&lt;br&gt;
&lt;br&gt;
Otherwise, I think your approach is best: to execute code in the replace pattern, rather than running two regexes across the same data.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413063</guid>
		<pubDate>Wed, 26 Oct 2005 16:54:37 -0800</pubDate>
		<dc:creator>macrone</dc:creator>
	</item><item>
		<title>By: whatnotever</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413070</link>	
		<description>On preview:  Yeah, what AmbroseChapel said.  But here it is anyway.  Oh, and this just modifies your code, but whatever.&lt;br&gt;
&lt;br&gt;
I think your problem is more in how you&apos;re using the while loop.  I&apos;d skip it and use the &quot;e&quot; modifier on the s/// operator.  That lets you put an evaluated expression inside the replacement (which lets you run whatever code you like to generate the replacement).  Then, the standard &quot;g&quot; modifier on the s/// will just make all of your replacements for you without &quot;re-finding&quot; them.  Try this out:&lt;br&gt;
&lt;br&gt;
$link =~ s/(&amp;lt;a.+?href[^&amp;gt;]+&amp;gt;)\s*(.+?)\s*(&amp;lt;\/a&amp;gt;)/$1 . &amp;amp;shortenMe($2) . $3/gei;&lt;br&gt;
&lt;br&gt;
sub shortenMe {&lt;br&gt;
	my $input = shift;&lt;br&gt;
	return $input if $input !~ m#(^http://|^ftp://)#;&lt;br&gt;
&lt;br&gt;
	my @tok = split &apos;/&apos;, $input;&lt;br&gt;
&lt;br&gt;
	my $protocol = shift @tok;&lt;br&gt;
	shift @tok; #off into space&lt;br&gt;
	my $domain = shift @tok;&lt;br&gt;
	my $remainder =  shift @tok;&lt;br&gt;
	my $modtxt = &quot;$protocol//$domain/...&quot;;&lt;br&gt;
	return $modtxt;&lt;br&gt;
}</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413070</guid>
		<pubDate>Wed, 26 Oct 2005 16:59:46 -0800</pubDate>
		<dc:creator>whatnotever</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413091</link>	
		<description>This might work for you, with a small amount of tweaking. Mine will use the host name from the href as the new link text, and uses &lt;a href=&quot;http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI/http.pm&quot;&gt;Regexp::Common&lt;/a&gt;:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
use Regexp::Common qw /URI/;&lt;br&gt;
use warnings;&lt;br&gt;
use strict;&lt;br&gt;
&lt;br&gt;
open(my $html, &quot;&amp;lt;test.html&quot;);&lt;br&gt;
&lt;br&gt;
while (&amp;lt;$html&amp;gt;) {&lt;br&gt;
 m#(&amp;lt;a\s+href.*&amp;gt;)\s*(\S+)\s*(&amp;lt;/a\s*&amp;gt;)#mi;&lt;br&gt;
     my ($opentag, $linktext, $closetag) = ($1, $2, $3);&lt;br&gt;
&lt;br&gt;
       if ($linktext =~ /$RE{URI}{HTTP}{-keep}/) {&lt;br&gt;
              my $host = $3;&lt;br&gt;
              print &quot;Found URL as link text...\n&quot;;&lt;br&gt;
              print &quot;\tNew link is \&apos;$opentag$host$closetag\&apos;\n&quot;;&lt;br&gt;
       }&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
The file &quot;test.html&quot; looks pretty much like AmbroseChapel&apos;s.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413091</guid>
		<pubDate>Wed, 26 Oct 2005 17:20:25 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413096</link>	
		<description>Oh yeah, and I wanted to mention that you can use Regexp::Common to match on &lt;em&gt;&lt;strong&gt;any&lt;/strong&gt;&lt;/em&gt; type of URI, not just http.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413096</guid>
		<pubDate>Wed, 26 Oct 2005 17:22:11 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413103</link>	
		<description>Sorry to keep posting, but I just re-read your code, and if you change the if statement in my example to look like this, then it does what your script intends:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
    if ($linktext =~ /$RE{URI}{HTTP}{-keep}/) {&lt;br&gt;
        my ($proto, $host) = ($2,$3);&lt;br&gt;
        print &quot;Found URL as link text...\n&quot;;&lt;br&gt;
        print &quot;\tOld link was \&apos;$opentag$linktext$closetag\&apos;\n&quot;;&lt;br&gt;
        print &quot;\tNew link is \&apos;$opentag $proto://$host... $closetag\&apos;\n&quot;;&lt;br&gt;
    }&lt;br&gt;
&lt;/code&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413103</guid>
		<pubDate>Wed, 26 Oct 2005 17:29:55 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413104</link>	
		<description>&lt;em&gt;Unfortunately, your script also ignores link text that itself contains tags.&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
But we don&apos;t need to change those ones!&lt;br&gt;
&lt;br&gt;
I mean, you&apos;re right of course, but there&apos;s no URL which needs to be shortened which will be missed, is there?&lt;br&gt;
&lt;br&gt;
Mind you, what happens if the link text contains something like:&lt;br&gt;
&lt;br&gt;
http://www.blah.com/ &amp;lt;b&amp;gt;I love this site!&amp;lt;/b&amp;gt;&lt;br&gt;
&lt;br&gt;
then we&apos;re in trouble...</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413104</guid>
		<pubDate>Wed, 26 Oct 2005 17:32:09 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413145</link>	
		<description>&lt;em&gt;Mind you, what happens if the link text contains something like:&lt;br&gt;
&lt;br&gt;
http://www.blah.com/ &amp;lt;b&amp;gt;I love this site!&amp;lt;/b&amp;gt;&lt;br&gt;
&lt;br&gt;
then we&apos;re in trouble...&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
Perhaps not the most elegant solution in the world, but this works even with links as ugly as...&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;&amp;lt;a href=&quot;http://ask.metafilter.com/mefi/26171&quot;&amp;gt;&amp;lt;em&amp;gt;check out&amp;lt;/em&amp;gt; http://ask.metafilter.com/mefi/26171 &amp;lt;strong&amp;gt;for more info&amp;lt;/strong&amp;gt;&amp;lt;/a&amp;gt;&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
use Regexp::Common qw /URI/;&lt;br&gt;
use warnings;&lt;br&gt;
use strict;&lt;br&gt;
&lt;br&gt;
open(my $html, &quot;&amp;lt;test.html&quot;);&lt;br&gt;
&lt;br&gt;
while (&amp;lt;$html&amp;gt;) {&lt;br&gt;
&lt;br&gt;
    m#(&amp;lt;a\s+href.*?&amp;gt;)(.*?)(&amp;lt;/a\s*&amp;gt;)#i;&lt;br&gt;
    my ($opentag, $linktext, $closetag) = ($1, $2, $3);&lt;br&gt;
    my $replace = $opentag;&lt;br&gt;
    foreach my $chunk (split /\s+/,$linktext) {&lt;br&gt;
        if ($chunk=~/$RE{URI}{HTTP}{-keep}/ ){&lt;br&gt;
            my ($proto, $host) = ($2,$3);&lt;br&gt;
            $replace.=&quot; $proto://$host... &quot;;&lt;br&gt;
        }&lt;br&gt;
        else {&lt;br&gt;
            $replace.=&quot; $chunk&quot;;&lt;br&gt;
        }&lt;br&gt;
&lt;br&gt;
    }&lt;br&gt;
    print &quot;Replacement is $replace\n&quot;;&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
&lt;small&gt;Sorry, I don&apos;t feel like adding in a bunch of nbsp&apos;s to indent it correctly. &lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413145</guid>
		<pubDate>Wed, 26 Oct 2005 18:07:42 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413155</link>	
		<description>Somewhat off-topic but I&apos;m interested in this part:&lt;br&gt;
&lt;code&gt;&lt;br&gt;
open(my $html, &quot;&amp;lt;test.html&quot;);&lt;br&gt;
while (&amp;lt;$html&amp;gt;){&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
That wouldn&apos;t work if the tag was split across lines, would it? Say you had&lt;br&gt;
&lt;code&gt;&lt;br&gt;
&amp;lt;a &lt;br&gt;
href=&quot;foo&quot;&amp;gt;&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
for instance?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413155</guid>
		<pubDate>Wed, 26 Oct 2005 18:18:09 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413161</link>	
		<description>True, I guess you&apos;d have to slurp in the file and use multi-line matching. I was just using the standard input record separator for simplicity&apos;s sake when I was testing it out. Honestly, I didn&apos;t realize that you could put newlines within the tags without choking the browser. :)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413161</guid>
		<pubDate>Wed, 26 Oct 2005 18:23:01 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413169</link>	
		<description>And while I&apos;m nit-picking, this:&lt;br&gt;
&lt;br&gt;
&lt;code&gt; m#(&amp;lt;a\s+href.*&amp;gt;)\s*(\S+)\s*(&amp;lt;/a\s*&amp;gt;)#mi;&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
is going to give problematic results if there are two &lt;strong&gt;a&lt;/strong&gt; tags on the same line, because of the &lt;strong&gt;&quot;.*&quot;&lt;/strong&gt; being greedy.&lt;br&gt;
&lt;code&gt;&lt;br&gt;
&amp;lt;a href=&quot;http://foo.com/&quot;&amp;gt;foo&amp;lt;/a&amp;gt;, &amp;lt;a href=&quot;http://bar.com/&quot;&amp;gt;bar&amp;lt;/a&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
for instance will get you everything up to the second &lt;strong&gt;&quot;bar&quot;&lt;/strong&gt; in &lt;strong&gt;$1&lt;/strong&gt;.&lt;br&gt;
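&lt;br&gt;
The same greedy behaviour shows up on a simpler pattern. This is only a sketch: quoted strings stand in for the two tags, and the sample line is an assumption made to keep the example short.&lt;br&gt;
&lt;br&gt;
```perl
use strict;
use warnings;

# Two quoted strings on one line, standing in for two links on one line.
my $line = 'say "foo", then "bar"';

# Greedy: ".*" runs to the LAST closing quote on the line.
my ($greedy) = $line =~ /"(.*)"/;

# Non-greedy: ".*?" stops at the FIRST closing quote.
my ($lazy) = $line =~ /"(.*?)"/;

print "greedy: $greedy\n";   # prints: greedy: foo", then "bar
print "lazy: $lazy\n";       # prints: lazy: foo
```
&lt;br&gt;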
&lt;em&gt;&lt;br&gt;
I didn&apos;t realize that you could put newlines within the tags without choking the browser. :)&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
Any whitespace is legal, including returns. I&apos;ve been bitten before...</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413169</guid>
		<pubDate>Wed, 26 Oct 2005 18:27:54 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: woj</title>
		<link>http://ask.metafilter.com/26171/Dealing-with-HTML-Parsing-Misery#413192</link>	
		<description>&lt;em&gt;Any whitespace is legal, including returns. I&apos;ve been bitten before...&lt;br&gt;
&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
I actually don&apos;t work with html documents too often, so that is good to know. The missing ? was a typo on my part, but yes, there would be a problem with two tags on the line. And with images as links, assuming he doesn&apos;t want to break those, and probably a bunch of other problems I can&apos;t think of right now... I think I&apos;m gonna sit out the rest of this one. Good luck.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2005:site.26171-413192</guid>
		<pubDate>Wed, 26 Oct 2005 19:02:12 -0800</pubDate>
		<dc:creator>woj</dc:creator>
	</item>
	</channel>
</rss>
