Dealing with HTML Parsing Misery?
October 26, 2005 3:01 PM
I need to write something (in Perl) to shorten the text of a link, but only when that text is itself a URL. I've played with a variety of regexps and banged my head against HTML::Parser, but I've gotten no love. Help!
As far as regexps go, I've done OK with /\<a.*href=.*\>(.*?)\< \/a\>/gi for getting the link text in question out, and modifying it isn't a problem.
The problem arises when I try to put it back. All I've been able to do is either replace only part of the original link or get stuck in an endless loop. I've been escaping my > and < signs, but that doesn't seem to help at all. For examples of what I've been trying (where in each, $z is a copy of the original $1 from the first regexp and $modtxt is the text I want to put in its place): s/">$z\</$modtxt\</ will replace the text, but munges the tag so the initial A tag isn't closed before the ending tag.
s/">$z\</\"\>$modtxt\</, on the other hand, gets stuck in an endless loop.
I've been googling and banging my head against this for several days, and while I think I must be overlooking something really simple, I can't figure out what it is. Thus, I turn to Ask.Me for assistance.
(btw, getting those regexps through was a surprisingly difficult undertaking)
posted by Captain_Tenille to computers & internet (25 comments total)
These are the correct regular expressions:
First one: /\<a.*href=.*\>(.*?)\< \/a\>/gi
Second one: s/">$z\</$modtxt\</
Third one: s/">$z\</\"\>$modtxt\</
Hopefully they make it out of live preview.
posted by Captain_Tenille at 3:07 PM on October 26, 2005
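One way to sidestep the put-it-back step entirely is to do the extraction and the replacement in a single s///ge pass, capturing the opening and closing tags and writing them back untouched. What follows is only a minimal sketch under several assumptions, not a definitive fix: the link text contains no nested tags, "is a URL" means the text starts with http://, https:// or ftp://, and the names shorten_text, shorten_link_text and $max (assumed to be greater than 3) are hypothetical helpers invented for illustration. It also uses a tighter pattern than the ones posted above.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $text down to at most $max characters,
# marking the cut with "...". Assumes $max > 3.
sub shorten_text {
    my ($text, $max) = @_;
    return $text if length($text) <= $max;
    return substr($text, 0, $max - 3) . '...';
}

# Rewrite link text in one pass. The opening tag and the closing tag are
# captured and emitted unchanged, so only the text between them can change.
sub shorten_link_text {
    my ($html, $max) = @_;
    $html =~ s{(<a\b[^>]*>)([^<]*)(</a\s*>)}{
        # Copy the captures first: the match below would overwrite $1..$3.
        my ($open, $text, $close) = ($1, $2, $3);
        # Assumption: text counts as a URL if it starts with a known scheme.
        $text = shorten_text($text, $max)
            if $text =~ m!^\s*(?:https?|ftp)://!i;
        $open . $text . $close;
    }gie;
    return $html;
}

my $html = 'See <a href="http://example.com/a/very/long/path/index.html">'
         . 'http://example.com/a/very/long/path/index.html</a> and '
         . '<a href="http://example.com/">the home page</a>.';
print shorten_link_text($html, 20), "\n";
```

Because the substitution runs once over the string with /g and never re-scans its own output, there is no separate "find the old text again" step to loop on, and the href attribute is never part of what gets rewritten. Links whose text contains other tags are simply left alone by this pattern; handling those reliably is where a real parser such as HTML::Parser would be the safer choice.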