A simple problem with a complicated solution
February 8, 2011 8:05 AM
In ColdFusion 7, how do I find every URL in a string containing HTML, and replace them with root-relative versions?
I have a string containing the full HTML of a web page. At bare minimum, I want to find the values of every href="[x]" and src="[x]" attribute. Better still if it can grab things like preloaded JS rollover URLs (img.src = '[x]'; is one of a million possible examples). Some URLs will be absolute, some relative, and some already root-relative. I want to format all of them (if they're internal to this site, anyway) as root-relative, without removing them from their positions in the HTML string.
What's the best way to do this? I think I need to start by finding every URL, then filter out all the ones that start with a slash (because those are already root-relative). From there I can figure out how to transform the remaining ones. But I can't figure out how to get to that point. Some sort of regular expressions-based loop?
The pitfalls of parsing HTML with regular expressions are many and well-documented. I'm not sure that this is complicated enough to fall under that umbrella, since I don't need to map the whole DOM. But I need my code to be flexible enough to handle tag attributes that occur in any order, with or without spaces around the equals sign, with single quotes, double quotes, or no quotes at all. Yeah, I have no control over the HTML and can't guarantee it'll always be nicely formatted.
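As a rough illustration of the matching requirements above (any attribute order, optional spaces around the equals sign, double quotes, single quotes, or no quotes), here is a minimal sketch. It is written in Python rather than CF7, and the names `ATTR_RE` and `find_urls` are made up for the example. It only finds the values; it does not rewrite them, and it will also pick up things like `img.src = '...'` in inline JavaScript.

```python
import re

# href, src, or action, optional whitespace around "=", then a value that is
# double-quoted, single-quoted, or bare (unquoted).
ATTR_RE = re.compile(
    r"""\b(href|src|action)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

def find_urls(html):
    """Return (attribute, url) pairs in the order they appear in the string."""
    pairs = []
    for m in ATTR_RE.finditer(html):
        # Exactly one of groups 2-4 is set, depending on the quoting style.
        value = m.group(2) or m.group(3) or m.group(4) or ""
        pairs.append((m.group(1).lower(), value))
    return pairs

sample = """<a href = "/about.cfm">x</a><img src='img/a.gif'><form action=post.cfm>"""
print(find_urls(sample))
```

A pattern like this still knows nothing about comments or script blocks, so it is only a starting point for markup you cannot control.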
Response by poster: I'm not real experienced with regex. It looks like your code will catch src="" but not href="", so how would you account for that?
ColdFusion 8 introduced a function called REMatch(), which I believe is the equivalent to the match() method you referenced. Unfortunately, I don't have access to that in CF7.
posted by The Winsome Parker Lewis at 9:19 AM on February 8, 2011
Regex is the way to go. Use two, in case the attributes in HTML don't occur in order. You'll use regexes a lot over your career; they are worth investing in learning. Google "regex buddy", he will be your best friend in creating the match statements. I use him to this day to write mine in seconds.
posted by bprater at 12:07 PM on February 8, 2011
Depending on your requirements, jQuery may work for you, too.
posted by bprater at 12:08 PM on February 8, 2011
Response by poster: Thanks for the suggestion, bprater, I'll check out RegexBuddy. I need to run the transformation completely server-side, so jQuery is out. I still haven't gotten a conclusive answer in this thread but I'm trying a few more things on my own. If I find a solution I'll post it here for posterity.
posted by The Winsome Parker Lewis at 12:16 PM on February 8, 2011
The tool I would use for this is the Perl module HTML::SimpleLinkExtor.
posted by AmbroseChapel at 1:18 PM on February 8, 2011
Example code (assuming you've got Perl and the module installed):
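#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Pass the page's URL as the base so relative links come back absolute
my $extor = HTML::SimpleLinkExtor->new(
    'http://ask.metafilter.com/177880/A-simple-problem-with-a-complicated-solution'
);
$extor->parse_file('/users/foo/extract.html');

# links() returns every link found (anchors, images, CSS, and so on)
my @all_links = $extor->links;
foreach (@all_links) {
    print $_, "\n";
}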
The 'extract.html' is this page, saved to my computer. The base HREF is passed in so that it can be used as the base to make the URLs absolute. It returns 129 absolute links of various kinds (CSS, favicon, image as well as regular links) from this page.
posted by AmbroseChapel at 1:35 PM on February 8, 2011
A lot of this depends on whether the HTML you're passing in is straightforward, or contains various edge conditions.
For instance, do you have to deal with the possibility that there might be sections of the HTML source commented out? How about CDATA sections?
If the HTML is totally arbitrary and you can't rule out things like that (and there are probably worse/weirder things than that, although those are the two that I have personally run into when running naive regexes on HTML), then you need to stop thinking "regex" and think "parser" instead.
If the input HTML is predictable (i.e. you're generating it yourself and know what it will and won't contain) then you might be fine just using a regex. But if you're pulling it from somewhere outside your own control I'd think about something like tagsoup to preprocess the HTML into XML which you can then parse out. It still is not 100% predictable in every circumstance (because even tagsoup, which is pretty clever, can be tricked by bad enough HTML), but it's a lot better than trying to write some ridiculous regex.
posted by Kadin2048 at 4:04 PM on February 9, 2011
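To make the "think parser" point concrete, here is a minimal sketch using the tolerant HTML parser from Python's standard library. It is not tagsoup and not CF7, and the names `LinkCollector` and `collect_links` are invented for the example. Because the parser reports comments separately from start tags, a link that has been commented out never reaches the collector.

```python
from html.parser import HTMLParser

URL_ATTRS = {"href", "src", "action"}

class LinkCollector(HTMLParser):
    """Collects (tag, attribute, value) for URL-bearing attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in URL_ATTRS and value is not None:
                self.links.append((tag, name, value))

def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

sample = '<!-- <a href="old.cfm">gone</a> --><a href=new.cfm>x</a><img SRC = "a.gif">'
print(collect_links(sample))
```

This only extracts the links. Writing changed URLs back into the original string is a separate step, and badly broken markup can still confuse any parser.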
Response by poster: I finally came up with the code I needed. Since I hate threads asking for programming help that never get resolved, I wanted to come back and share my results. It was a lot more complicated than I anticipated... Here's what I came up with, processing a variable called htmlstring. You'll notice I expanded the HTML attributes I was looking in to include action="[x]" (for forms). Big apologies in advance if AskMe chokes on all the code I'm about to post... it's HTML encoded and wrapped in <pre> tags, but of course if there is a problem, I can't edit the post to fix it. *crosses fingers*
---------------------------------------------------
<!--- The REFindAll() function is courtesy of http://www.cflib.org/udf/REFindAll --->
<cffunction name="REFindAll" output="true" returnType="struct">
    <cfargument name="regex" type="string" required="yes">
    <cfargument name="text" type="string" required="yes">
    <cfset var results=structNew()>
    <cfset var pos=1>
    <cfset var subex="">
    <cfset var done=false>
    <cfset results.len=arraynew(1)>
    <cfset results.pos=arraynew(1)>
    <cfloop condition="not done">
        <cfset subex=reFind(arguments.regex, arguments.text, pos, true)>
        <cfif subex.len[1] is 0>
            <cfset done=true>
        <cfelse>
            <cfset arrayappend(results.len, subex.len[1])>
            <cfset arrayappend(results.pos, subex.pos[1])>
            <cfset pos=subex.pos[1]+subex.len[1]>
        </cfif>
    </cfloop>
    <cfif arraylen(results.len) is 0>
        <cfset arrayappend(results.len, 0)>
        <cfset arrayappend(results.pos, 0)>
    </cfif>
    <cfreturn results>
</cffunction>

<!--- Function takes an attribute, returns a struct containing a URL and L/R character offsets --->
<cffunction name="extractUrlAndOffsets">
    <!--- Feed this function an href="X", src="X", or action="X" attribute --->
    <cfargument name="htmlattr" type="string" required="yes">
    <!--- Find the equals sign and note its location so we don't get lost --->
    <cfset offsetleft=Find(Chr(61), htmlattr)>
    <!--- Trim variable to the equals sign (using the offset we just found) --->
    <cfset htmlattr=Right(htmlattr, Len(htmlattr) - offsetleft)>
    <!--- Strip away non-URL characters from the left and increment that offset accordingly --->
    <cfloop condition="ListFindNoCase(' ,"",'',#Chr(10)#,#Chr(13)#', Left(htmlattr, 1)) neq 0">
        <cfset htmlattr=Right(htmlattr, Len(htmlattr) - 1)>
        <cfset offsetleft=offsetleft + 1>
    </cfloop>
    <!--- Now nothing precedes the URL but count any junk that follows it in another offset var --->
    <cfset offsetright=Len(htmlattr) - FindOneOf(' ,"",'',#Chr(10)#,#Chr(13)#>', htmlattr) + 1>
    <!--- In some cases the offset will equal the length of the htmlattr var, so compensate --->
    <cfif offsetright neq Len(htmlattr)>
        <cftry>
            <!--- Strip away everything right of and including the offset, leaving only the URL --->
            <cfset htmlattr=Left(htmlattr, Len(htmlattr) - offsetright)>
            <!--- If a malformed link throws an exception, notify someone and skip this link --->
            <cfcatch>
                <!--- Returning a non-struct data type will cause this link to be ignored --->
                <cfreturn false>
            </cfcatch>
        </cftry>
    <cfelse>
        <!--- Correct the offset value to zero, and don't modify the htmlattr string at all --->
        <cfset offsetright=0>
    </cfif>
    <!--- Create a structure for the function to return --->
    <cfset targetdata=structNew()>
    <!--- Load the two variables into the structure --->
    <cfset StructInsert(targetdata, 'url', htmlattr, 'true')>
    <cfset StructInsert(targetdata, 'offsetleft', offsetleft, 'true')>
    <cfset StructInsert(targetdata, 'offsetright', offsetright, 'true')>
    <cfreturn targetdata>
</cffunction>

<!--- Function determines if a URL is for a file in this site, returns true or false --->
<cffunction name="isTranslationNeeded" returntype="boolean">
    <!--- Feed this function a URL to check --->
    <cfargument name="urltocheck" type="string" required="yes">
    <!--- Is it an absolute URL, and is this domain not included somewhere in it? --->
    <cfif Left(urltocheck, 4) eq 'http' and not FindNoCase('http://www.mydomain.com', urltocheck)>
        <!--- The URL points off-site, so don't touch it --->
        <cfreturn false>
    <cfelse>
        <!--- This URL qualifies for translation --->
        <cfreturn true>
    </cfif>
</cffunction>

<!--- Function converts any URL to root-relative format --->
<cffunction name="makeUrlRootRelative" returntype="string">
    <!--- Feed this function a URL to format --->
    <cfargument name="formaturl" type="string" required="yes">
    <!--- If it begins with a slash, it's already root-relative; skip to the end --->
    <cfif Left(formaturl, 1) neq '/'>
        <!--- If it's an absolute URL, make it root-relative --->
        <cfif Left(formaturl, 4) eq 'http'>
            <!--- Remove the protocol and domain, up to the first slash --->
            <cfset formaturl=ReplaceNoCase(formaturl, 'http://www.mydomain.com', '')>
        </cfif>
        <!--- If a link starts with a hash, prepend the root-relative path to the current page --->
        <cfif Left(formaturl, 1) eq '##'>
            <!--- First add the query string, if there is one set --->
            <cfif IsDefined("CGI.QUERY_STRING") and CGI.QUERY_STRING neq ''>
                <cfset formaturl=Insert('?'&CGI.QUERY_STRING, formaturl, 0)>
            </cfif>
            <!--- Precede the hash with the current filename to preserve anchors --->
            <cfset formaturl=Insert(CGI.SCRIPT_NAME, formaturl, 0)>
        </cfif>
        <!--- If the URL still isn't root-relative, make it root-relative --->
        <cfif Left(formaturl, 1) neq '/'>
            <cfset formaturl=Insert(GetDirectoryFromPath(CGI.SCRIPT_NAME), formaturl, 0)>
        </cfif>
        <!--- We need to convert relative URLs that use directory traversal --->
        <!--- At this point they will be formatted like this: /dir/dir/../../file.cfm --->
        <!--- We will delete each occurrence of ../ as well as the directory preceding it --->
        <!--- But first find every instance of ./ (without the second dot) and delete it --->
        <!--- FindNoCase() takes the substring first, then the string to search --->
        <cfif FindNoCase('./', formaturl)>
            <!--- Remove every /./ from the URL, to catch every ./ not at the very beginning --->
            <cfset formaturl=ReplaceNoCase(formaturl, '/./', '/', 'all')>
            <!--- Find and delete ./ from the very beginning of the URL if it exists --->
            <cfif Left(formaturl, 2) eq './'>
                <cfset formaturl=Right(formaturl, Len(formaturl) - 1)>
            </cfif>
            <!--- Now that ./ has been dealt with, handle ../ directory traversal --->
            <cfloop condition="FindNoCase('../', formaturl)">
                <!--- Where in the string does the ../ occur? --->
                <cfset dotdotslashposition=FindNoCase('../', formaturl)>
                <!--- Remove the ../ first --->
                <cfset formaturl=RemoveChars(formaturl, dotdotslashposition, 3)>
                <!--- We'll now use dotdotslashposition as a pointer, move it back two chars --->
                <!--- The reason for this is to put it before the slash that preceded the ../ --->
                <cfset dotdotslashposition=dotdotslashposition - 2>
                <!--- If ../ occurs more than the number of directories, an exception is thrown --->
                <cftry>
                    <!--- Now delete one char at a time from here until another slash is found --->
                    <cfloop condition="Mid(formaturl, dotdotslashposition, 1) neq '/'">
                        <cfset formaturl=RemoveChars(formaturl, dotdotslashposition, 1)>
                        <cfset dotdotslashposition=dotdotslashposition - 1>
                    </cfloop>
                    <!--- Now we have two slashes in a row, so delete one of them to finish --->
                    <cfset formaturl=RemoveChars(formaturl, dotdotslashposition, 1)>
                    <!--- If too many ../ appear, don't display an error; return string as-is --->
                    <cfcatch><!--- Required tag in a cftry statement, just leave it empty ---></cfcatch>
                </cftry>
            </cfloop>
        </cfif>
    </cfif>
    <!--- The variable should definitely contain a root-relative URL at this point --->
    <cfreturn formaturl>
</cffunction>

<!--- This regex should grab every HREF, SRC, and ACTION attribute in an HTML string --->
<cfset regex='(?:(?:\shref)|(?:\ssrc)|(?:\saction))=([\x22\x27]?)(\S+)\1'>
<!--- Build a struct with positional and length data of every match in the HTML string --->
<cfset target="#REFindAll(regex, htmlstring)#">
<!--- The pointer will help us keep our place in the HTML string as it's manipulated --->
<cfset pointer=0>
<!--- Begin processing each URL discovered --->
<cfloop index="x" from="1" to="#ArrayLen(target.pos)#">
    <!--- Isolate this substring from the HTML string, to analyze and transform --->
    <cfset href=Mid(htmlstring, target.pos[x] + pointer, target.len[x])>
    <!--- Common JavaScript references, ignore regex results containing them (they're not URLs) --->
    <cfif not FindNoCase('document.', href) and not FindNoCase(';', href)>
        <cfset item=extractUrlAndOffsets(href)>
        <!--- If item var doesn't contain a struct, the link was malformed; skip it and move on --->
        <!--- Also skip any links that don't qualify for translation --->
        <cfif IsStruct(item) and isTranslationNeeded(item.url)>
            <!--- Convert the URL to root-relative form --->
            <cfset thisurl=makeUrlRootRelative(item.url)>
            <!--- Remove the old URL to be replaced from the HTML string at this location --->
            <cfset htmlstring=RemoveChars(htmlstring, target.pos[x] + item.offsetleft + pointer, target.len[x] - item.offsetleft - item.offsetright)>
            <!--- Insert the new URL in the same place, finally... we're almost done! --->
            <cfset htmlstring=Insert(thisurl, htmlstring, target.pos[x] + item.offsetleft + pointer - 1)>
            <!--- Since the two URL lengths will differ, update the pointer for the next one --->
            <cfset pointer = pointer - ((target.len[x] - item.offsetleft - item.offsetright) - Len(thisurl))>
        </cfif>
    </cfif>
</cfloop>
<!--- Every URL pointing to this site in the HTML string is now formatted as root-relative --->
posted by The Winsome Parker Lewis at 12:49 PM on February 15, 2011
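For comparison, the URL-normalising step (stripping the site's own domain, expanding `#anchor` links against the current page, and collapsing `./` and `../`) can be sketched with a standard URL library. This is Python, not CF7, and not the code posted above. `to_root_relative`, `SITE_HOST`, and `page_url` are names invented for the example, and `www.mydomain.com` is the same placeholder domain used in the post.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

SITE_HOST = "www.mydomain.com"  # placeholder host, as in the posted code

def to_root_relative(url, page_url):
    """Return url in root-relative form if it points at SITE_HOST, else unchanged."""
    # urljoin resolves relative paths, "./", "../" and "#fragment" against the page.
    absolute = urljoin(page_url, url)
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https") or parts.netloc.lower() != SITE_HOST:
        # Off-site link or a non-HTTP scheme such as mailto: so leave it alone.
        return url
    # Drop scheme and host; keep path, query string, and fragment.
    return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))

page = "http://www.mydomain.com/dir/sub/page.cfm?id=1"
for candidate in ["../img/a.gif", "#top", "http://www.mydomain.com/x.cfm",
                  "/already.cfm", "http://example.com/y"]:
    print(candidate, "->", to_root_relative(candidate, page))
```

The design choice here is to resolve everything to an absolute URL first and then strip the host, which avoids walking the string by hand to remove each `../`.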
results = string.match(/src\s*=\s*"([^"]*)"/);
...and then parse through the results array? The captured group will only contain the full url itself. Then you could loop through the original string matching the original extracted URL and replace it in place with your new relative URL.
Or maybe I'm under-thinking the situation.
posted by xax at 9:00 AM on February 8, 2011
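The "replace it in place" idea can be sketched with a regex substitution that takes a callback, so each matched URL is rewritten where it sits and no position bookkeeping is needed. This is a Python illustration, not CF7. `rewrite_urls` and `strip_site` are hypothetical names, and `strip_site` is only a stand-in for whatever transformation is actually wanted.

```python
import re

# Group 1 keeps the attribute name and "=" exactly as written; groups 2-4 hold
# the value for the double-quoted, single-quoted, and unquoted cases.
ATTR_RE = re.compile(
    r"""(\b(?:href|src|action)\s*=\s*)(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

def rewrite_urls(html, transform):
    """Apply transform() to every matched URL, preserving the original quoting."""
    def replace(m):
        prefix = m.group(1)
        if m.group(2) is not None:
            return prefix + '"' + transform(m.group(2)) + '"'
        if m.group(3) is not None:
            return prefix + "'" + transform(m.group(3)) + "'"
        return prefix + transform(m.group(4))
    return ATTR_RE.sub(replace, html)

def strip_site(url):
    # Hypothetical transform: drop the placeholder domain, leave other URLs alone.
    site = "http://www.mydomain.com"
    if url.startswith(site):
        return url[len(site):] or "/"
    return url

html = '<a href="http://www.mydomain.com/a.cfm">x</a><img src=\'b.gif\'>'
print(rewrite_urls(html, strip_site))
```

Everything outside the matched attribute values passes through untouched, including the quoting style of each attribute.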