A simple problem with a complicated solution
February 8, 2011 8:05 AM
In ColdFusion 7, how do I find every URL in a string containing HTML and replace each one with a root-relative version?
posted by The Winsome Parker Lewis to computers & internet (9 answers total)
I have a string containing the full HTML of a web page. At bare minimum, I want to find the values of every href="[x]" and src="[x]" attribute. Better still if it can grab things like preloaded JS rollover URLs (img.src = '[x]'; is one of a million possible examples). Some URLs will be absolute, some relative, and some already root-relative. I want to format all of them (if they're internal to this site, anyway) as root-relative, without removing them from their positions in the HTML string.
What's the best way to do this? I think I need to start by finding every URL, then filter out all the ones that already start with a slash (because those are already root-relative). From there I can figure out how to transform the remaining ones. But I can't figure out how to get to that point. Some sort of regular-expression-based loop?
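To make the "transform the remaining ones" step concrete, here is a rough sketch of the logic. It is Python rather than CFML and is meant only to illustrate the idea; `to_root_relative`, `page_url`, and `site_hosts` are made-up names, and `www.example.com` is a placeholder host. It assumes the URL of the page the HTML came from is known, since a relative URL can't be resolved without it.

```python
from urllib.parse import urljoin, urlsplit


def to_root_relative(url, page_url, site_hosts):
    """Return a root-relative form of url, or url unchanged if it is
    external, uses a non-HTTP scheme, or is empty / fragment-only.

    page_url   -- absolute URL of the page the HTML came from (assumed known)
    site_hosts -- set of lowercase host names that count as "this site"
    """
    # Leave empty values and same-page anchors such as "#top" alone.
    if not url or url.startswith("#"):
        return url

    # Leave mailto:, javascript:, data: and similar schemes alone.
    scheme = urlsplit(url).scheme.lower()
    if scheme and scheme not in ("http", "https"):
        return url

    # Resolve relative and root-relative URLs against the page's own URL.
    absolute = urlsplit(urljoin(page_url, url))

    # External links (including protocol-relative ones) stay as written.
    if absolute.hostname not in site_hosts:
        return url

    # Rebuild the URL without scheme and host.
    result = absolute.path or "/"
    if absolute.query:
        result += "?" + absolute.query
    if absolute.fragment:
        result += "#" + absolute.fragment
    return result


if __name__ == "__main__":
    page = "http://www.example.com/products/widgets/index.cfm"  # placeholder
    hosts = {"www.example.com"}                                  # placeholder
    for sample in ("../img/logo.gif", "http://www.example.com/about.cfm",
                   "/css/site.css", "http://other.org/x.js"):
        print(sample, "->", to_root_relative(sample, page, hosts))
```

Under this approach, URLs that are already root-relative pass through unchanged, so they don't strictly need to be filtered out first.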
The pitfalls of parsing HTML with regular expressions are many and well-documented. I'm not sure that this is complicated enough to fall under that umbrella, since I don't need to map the whole DOM. But I need my code to be flexible enough to handle tag attributes that occur in any order, with or without spaces around the equals sign, with single quotes, double quotes, or no quotes at all. Yeah, I have no control over the HTML and can't guarantee it'll always be nicely formatted.
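For illustration, the matching requirements above (spaces around the equals sign; single, double, or no quotes; any attribute order) can be covered by a single pattern with one alternative per quoting style. This is a minimal sketch in Python, not ColdFusion, and the pattern would need to be adapted and tested against whatever regex engine is actually used. `ATTR_URL`, `rewrite_urls`, and the `fix` callback are hypothetical names; `fix` stands in for whatever function converts one URL to its root-relative form.

```python
import re

# Matches href/src followed by '=', then a double-quoted, single-quoted,
# or unquoted value. The lookbehind keeps names like "data-href" or
# "xsrc" from matching.
ATTR_URL = re.compile(
    r"""
    ((?<![\w-])(?:href|src)\s*=\s*)   # group 1: name, equals sign, whitespace
    (?:
        "([^"]*)"                     # group 2: double-quoted value
      | '([^']*)'                     # group 3: single-quoted value
      | ([^\s"'>]+)                   # group 4: unquoted value
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def rewrite_urls(html, fix):
    """Replace each href/src value in html with fix(value), in place,
    keeping the original quoting style and everything else untouched."""

    def replace_match(match):
        prefix = match.group(1)
        if match.group(2) is not None:
            return prefix + '"' + fix(match.group(2)) + '"'
        if match.group(3) is not None:
            return prefix + "'" + fix(match.group(3)) + "'"
        return prefix + fix(match.group(4))

    return ATTR_URL.sub(replace_match, html)


if __name__ == "__main__":
    sample = "<a href=\"a.html\">x</a><IMG SRC = 'b.gif'><a class=k href=c.html>"
    # Stand-in transformation: a real one would compute the root-relative URL.
    print(rewrite_urls(sample, lambda url: "/root/" + url))
```

Because the pattern does not check that it is inside a tag, it will also pick up script assignments such as `img.src = 'x.gif'`, which happens to cover the rollover case, but it can equally match `src = ...` in comments or visible text. That trade-off is the usual cost of using a regex instead of a parser.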