How to parse unencoded ampersands in URLs in PHP?
December 11, 2003 3:55 PM

How to parse unencoded ampersands in URLs in PHP with preg_replace() while excluding already-encoded ampersands? (more inside)

Some of my blog content comes from a remote collaborative blog, and I just want to change &'s to &amp; when they occur inside an <a href> tag, for validation purposes.

It's simple enough to do a preg_replace('/&/','&amp;',$string); -- but what about when a conscientious submitter has already encoded the &? Any regex experts know how to exclude that?
posted by brownpau to Computers & Internet (8 answers total)
 
Um, those second occurrences of the ampersand should be "&amp;" if you get what I mean.
posted by brownpau at 3:59 PM on December 11, 2003


Gah.
posted by brownpau at 3:59 PM on December 11, 2003


Replace the & with %26
posted by riffola at 4:13 PM on December 11, 2003


Sorry, I should have added that %26 is the hex (percent-encoded) code for &, and using that instead of & in a URI works a-ok, and you also don't have problems with XHTML/XML validation.
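
One thing worth checking before swapping every & for %26: a rough sketch of how PHP's own parse_str() reads the two forms. The query strings are made-up examples, and the variable names are only for illustration.

```php
<?php
// rawurlencode() produces the percent-encoded form mentioned above.
echo rawurlencode('&'), "\n";   // %26

// A bare & separates parameters.
parse_str('a=1&b=2', $separator);
print_r($separator);            // two entries: a => 1, b => 2

// %26 is decoded as a literal ampersand inside the value of a.
parse_str('a=1%26b=2', $literal);
print_r($literal);              // one entry: a => 1&b=2
```

So %26 seems best suited to ampersands that are part of a parameter's data, while ampersands that separate parameters would still need to be written as &amp; in the markup.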
posted by riffola at 4:15 PM on December 11, 2003


preg_replace('/&(?!amp;)/','&amp;',$string); perhaps?

*prays that looks right on post*
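
A minimal sketch of that lookahead applied only inside href values, since the question was about <a href> tags. The helper name encode_href_ampersands() is hypothetical, and it assumes PHP 7 or later and double-quoted href attributes.

```php
<?php
// Hypothetical helper: encode bare ampersands, but only inside href="..." values.
function encode_href_ampersands(string $html): string
{
    return preg_replace_callback(
        // Group 1: the tag up to the opening quote, group 2: the URL, group 3: the closing quote.
        '/(<a\b[^>]*?\bhref\s*=\s*")([^"]*)(")/i',
        function (array $m): string {
            // Negative lookahead: match & only when it is not already the start of &amp;
            $url = preg_replace('/&(?!amp;)/', '&amp;', $m[2]);
            return $m[1] . $url . $m[3];
        },
        $html
    );
}

$in = '<a href="page.php?a=1&b=2&amp;c=3">Q&A</a>';
echo encode_href_ampersands($in), "\n";
// <a href="page.php?a=1&amp;b=2&amp;c=3">Q&A</a>
```

The link text (Q&A here) is left alone on purpose; only the attribute value is rewritten.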
posted by boaz at 4:22 PM on December 11, 2003


Probably superfluous, but a word of warning: anything you do will be a heuristic. There is no way to distinguish between someone who writes &amp; and means & and has conscientiously encoded it for you, and someone who writes &amp; and means & followed by a followed by m followed by p followed by ;



Granted, the latter case is rather unlikely, but it becomes more likely as you start allowing more character entities.
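
To illustrate, here is a rough sketch of what "allowing more character entities" could look like. The function name encode_bare_ampersands() and the entity pattern (named, decimal and hex references) are assumptions for the example, not anything from the thread, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: encode & unless it already starts something shaped like
// a character reference (&name; or &#123; or &#x1F;).
function encode_bare_ampersands(string $s): string
{
    $pattern = '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/';
    return preg_replace($pattern, '&amp;', $s);
}

echo encode_bare_ampersands('a=1&b=2'), "\n";      // a=1&amp;b=2
echo encode_bare_ampersands('a=1&amp;b=2'), "\n";  // a=1&amp;b=2 (left alone)
echo encode_bare_ampersands('a=1&#38;b=2'), "\n";  // a=1&#38;b=2 (left alone)
echo encode_bare_ampersands('a=1&lt;=2'), "\n";    // a=1&lt;=2 (left alone)
```

The last line is the ambiguous case: if the submitter really meant a parameter named "lt;", the pattern guesses wrong, and the wider the pattern, the more such guesses it makes.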
posted by fvw at 4:30 PM on December 11, 2003


Why do it the confusing yet cute way?

Use html_entity_decode() to convert the encoded ampersands back to regular ampersands then do your replace. And I'd suggest using str_replace() since you won't need any fancy matching rules. Much less overhead.
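
Roughly, as a sketch (the function name normalize_ampersands() is made up for the example, and it assumes PHP 7 or later and that you feed it just the URL, not the whole page):

```php
<?php
// Hypothetical helper: decode first so every ampersand is bare, then encode each one once.
function normalize_ampersands(string $url): string
{
    // Step 1: turn &amp; (and any other entities) back into plain characters.
    $decoded = html_entity_decode($url, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Step 2: plain string replace, no regex needed.
    return str_replace('&', '&amp;', $decoded);
}

echo normalize_ampersands('page.php?a=1&b=2&amp;c=3'), "\n";
// page.php?a=1&amp;b=2&amp;c=3
```

Running it twice gives the same result as running it once. Note that html_entity_decode() also decodes entities other than &amp;, and those are not re-encoded by the second step, so it is probably safest on the href value alone.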
posted by y6y6y6 at 5:02 PM on December 11, 2003


Synchronicity moment; I just spent a good chunk of today trying to solve the same problem in multiply-parsed XSL. (Turns out you just blindly wrap all childless text() nodes which contain the character '&' in CDATA tags on the first pass, is the trick to that one. Easy.)
posted by ook at 5:30 PM on December 11, 2003

