Regular expression question
December 13, 2005 11:44 AM

RegexFilter: I want to strip out all HTML from a string except for the approved tags B and I.

I have this pattern: "<(p|img)*?>", which strips out any instance of P or IMG tags, but I want to reverse it... I want to say only allow B or I. I tried this: "<^(b|i)*?>", thinking that meant any character NOT in the group b or i, but no go. Any tips before I go insane?
posted by xmutex to Computers & Internet (25 answers total)
 
If you use the caret like this, it refers to the beginning of the string.
I think /<([^bi])+>/ is what you need.
I'm not sure what you are trying to do with the "*?", but it usually does not make sense to combine them, since "?" (0 or 1 instances) is contained in "*" (0 to infinitely many instances).
posted by snownoid at 11:58 AM on December 13, 2005


^ only means NOT when inside a range []; outside of a range it means start of string.
posted by furtive at 11:59 AM on December 13, 2005
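
To illustrate the two meanings of the caret, here is a tiny Perl sketch (illustrative and untested; the sample strings are made up):

use strict;
use warnings;

# Outside a character class, ^ anchors the match to the start of the string.
print "anchored\n" if "bold text" =~ /^b/;

# Inside a character class, ^ negates it: [^bi] means "any one character
# that is neither b nor i", so this matches <p> but not <b> or <i>.
print "negated\n"  if "<p>" =~ /<[^bi]/;
print "no match\n" unless "<b>" =~ /<[^bi]/;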


Are you doing this in Perl? I did something like this recently, and I think it looks like this:
$b = "b";
$i = "i";

s/\<[^$b$i]\>//g;

This is a great page on regular expressions.
posted by Alison at 12:00 PM on December 13, 2005


Any tips before I go insane?

Parsing HTML with only regexes leads to insanity. End of story.

If you're using PHP or have access to its command-line interpreter, there's a strip_tags() function that will do what you want: its second argument is a list of allowable tags.

If you're using Perl, there are a bunch of HTML-parsing CPAN modules (I'm not super familiar with them, so I can't recommend one).

If you're stuck in a text editor... good luck :)
posted by alana at 12:03 PM on December 13, 2005


Response by poster: snownoid: "<([^bi])+>" puts me closer, but it still seems to allow img tags?

I tried modifying it to "<([^bi])>", thinking that would only allow one character between < and >, but no go...
posted by xmutex at 12:03 PM on December 13, 2005


What alana said. Parse HTML with an HTML parser, which regular expressions aren't. If you're dealing with user input, you will forget edge cases (you've already forgotten about whitespace and <p<p>>!). It's a solved problem, so reuse some of the working code that's already out there.
posted by mendel at 12:29 PM on December 13, 2005
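
For what it's worth, one of the CPAN modules alana alludes to is HTML::Parser; a rough, untested sketch of keeping only b and i with it might look like this (the handler details are worth checking against the module's docs, and the sample string is invented):

use strict;
use warnings;
use HTML::Parser;

my $html = '<p><b>bold</b>, <i>italic</i> and <img src="x.gif"></p>';
my $out  = '';

my $p = HTML::Parser->new(
    api_version => 3,
    # keep only opening/closing b and i tags; attributes on them are dropped
    start_h => [ sub { my ($tag) = @_; $out .= "<$tag>"  if $tag =~ /^[bi]$/ }, 'tagname' ],
    end_h   => [ sub { my ($tag) = @_; $out .= "</$tag>" if $tag =~ /^[bi]$/ }, 'tagname' ],
    # 'text' passes the original text through untouched (entities stay as written)
    text_h  => [ sub { $out .= shift }, 'text' ],
);
$p->parse($html);
$p->eof;

print $out, "\n";   # <b>bold</b>, <i>italic</i> and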


Yes, right. [^bi] is equivalent to "not b and not i", so any tag whose name contains a "b" or an "i" (like img) is not matched, because not every one of its characters passes that test.
I thought you could try "/<[^b]|[^i]+>/", but that doesn't work either, because a "b" is not an "i" and is thus matched.
Actually, I'm not so sure anymore that it is possible to do what you want using only regular expressions.
If you are using PHP you could do something like
preg_match("/<(\w+)>/", $yourstring, $match) (not sure the syntax is perfectly correct) and then check whether $match[1] is "b" or "i".
posted by snownoid at 12:33 PM on December 13, 2005
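
The same capture-and-check idea, sketched in Perl rather than PHP (untested; the sample string is invented, and it still trips over things like a quoted ">" inside an attribute, comments, and other edge cases mentioned elsewhere in the thread):

my $html = '<p><b>bold</b> and <i>italic</i> with <img src="x.gif"></p>';

# Capture the optional slash and the tag name, then decide:
# keep b/i (and /b, /i), drop everything else.
$html =~ s{<\s*(/?)(\w+)[^>]*>}{ lc($2) =~ /^[bi]$/ ? "<$1$2>" : "" }ge;

print $html, "\n";   # <b>bold</b> and <i>italic</i> with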


Probably a better solution, but you can run two regexes if this isn't a very processor-intensive script:

$test = "<i>this</i> is a <b>very</b> good test. <p>don't you think?</p><br><br><img src=test.gif>";

$test =~ s/\<\w{2}.*?\>//g;   # any tags 2 characters or longer
$test =~ s/\<[^ib]*?\>//g;    # any tags not <b> or <i>
posted by ducksauce at 12:43 PM on December 13, 2005
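
Run as a complete snippet, those two passes would look roughly like this; the expected output is my guess, untested, and note that both patterns are case-sensitive, so an uppercase <B> would be stripped:

my $test = "<i>this</i> is a <b>very</b> good test. <p>don't you think?</p><br><br><img src=test.gif>";

$test =~ s/\<\w{2}.*?\>//g;   # pass 1: tags whose names are 2+ characters (<br>, <img ...>)
$test =~ s/\<[^ib]*?\>//g;    # pass 2: remaining tags containing no "i" or "b" (<p>, </p>)

print $test, "\n";
# <i>this</i> is a <b>very</b> good test. don't you think?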


And by "probably a better solution", I meant "there is probably a better solution than what I'm about to post", in case that wasn't not clear.
posted by ducksauce at 12:44 PM on December 13, 2005


Oh, that's a good idea.

The regular expressions would be more precise/correct like this, though:
$test =~ s/<\w{2,}>//g;   # any tags 2 characters or longer
$test =~ s/<[^ib]>//g;    # any tags not <b> or <i>
posted by snownoid at 12:52 PM on December 13, 2005


Response by poster: Cool. I will test these out. As a follow-up: is there some way to regex search for Microsoft Word 'smart' quotes or whatever they are called?
posted by xmutex at 1:05 PM on December 13, 2005


I find it is often easier to break something like this down into steps. In the comment script I use on my Web site, I do it this way:

1) Convert all & to &amp;
2) Convert all < to &lt;
3) Convert all &lt;B> to <B> (case-insensitive)
4) Same for &lt;I>, &lt;/B>, and &lt;/I>.

These are all simple text searches, no regex involved. Steps 3 & 4 could be combined using one regex, though.

This has the effect of leaving any non-permitted tags as text rather than stripping them out, which may not be exactly what you want, but you could follow this up with a regex that strips out &lt;.*?> (where *? has the Perl meaning of a non-greedy *).
posted by kindall at 1:32 PM on December 13, 2005
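
A rough Perl rendering of those four steps (untested; the sample string is invented, and steps 3 and 4 are folded into the single regex kindall mentions):

my $text = q{<b>bold</b>, <i>italic</i>, <script>alert(1)</script> & more};

# 1) and 2): neutralize all markup by escaping & and <
$text =~ s/&/&amp;/g;
$text =~ s/</&lt;/g;

# 3) and 4): turn only the escaped <b>, </b>, <i>, </i> back into real tags (case-insensitive)
$text =~ s/&lt;(\/?)([bi])>/<$1$2>/gi;

print $text, "\n";
# <b>bold</b>, <i>italic</i>, &lt;script>alert(1)&lt;/script> &amp; more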


This last question (about the smart quotes) makes a lot more sense if we know what you're doing this in. PHP? Perl? A text editor (and if so, which one)?
posted by miniape at 1:38 PM on December 13, 2005


Response by poster: miniape: C#. Could do it in anything (php/perl) though.
posted by xmutex at 1:42 PM on December 13, 2005


I could be wrong, but I believe you need to refer to them with hexadecimal character codes: \xhh matches the character with hex code hh.
Here are some PCRE docs. Check out the backslash section. I'm not sure if C#'s regexes are Perl-compatible, though.
http://adm.jinr.ru/doc/exim/pcre.html#SEC14

If you're anything like me, you'll have most of your hair gone by the end of the night trying to figure out what to escape, what's getting interpreted as a back reference and what's actually working.
posted by miniape at 1:53 PM on December 13, 2005
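
As a concrete example, in Perl the Windows-1252 "smart" quote characters can be matched by their hex codes like this (untested; it assumes the text really is Windows-1252 rather than Unicode, where the code points would be \x{2018}, \x{2019}, \x{201C} and \x{201D} instead):

my $text = "\x93smart\x94 quotes and \x91single\x92 ones";

$text =~ s/[\x91\x92]/'/g;   # left/right single quotes -> straight apostrophe
$text =~ s/[\x93\x94]/"/g;   # left/right double quotes -> straight double quote

print $text, "\n";   # "smart" quotes and 'single' ones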


Here's another vote for not going anywhere near this task with regular expressions. It is close to impossible to write a regular expression that reliably strips some but not all tags and doesn't fail under weird edge cases. And if done improperly, this can lead to a cross-site scripting vulnerability that would allow someone to embed JavaScript in the page and steal your login cookie, among other things.

If you think this is a triviality, go review some of the numerous security advisories against things like phpBB or IPB that tried to do this and got it wrong.

Just... Don't.
posted by Rhomboid at 3:21 PM on December 13, 2005


^^ what he said.
posted by holloway at 5:10 PM on December 13, 2005


Yeah, as soon as you allow users to enter any markup, even just a couple of tags, it's surprisingly difficult to avoid opening up security holes.

Regular expressions should be fine in this case, though, if you're really careful. I'd suggest converting the permitted tags to some other cryptic form to set them aside, stripping all remaining tags, then stripping any stray greater-than/less-than symbols, and finally converting the allowed tags back.
posted by malevolent at 10:59 PM on December 13, 2005


Response by poster: Thanks all for the thoughts. I have to do this, sadly. Trying to move an archaic HTML-page-based web zine/journal (content pasted in from MS Word; my God!) to MT and need to parse out entries from HTML.

Beastly burden, but it must be done.
posted by xmutex at 9:28 AM on December 14, 2005


content pasted in from MS Word

If you mean that they're full of Word-generated HTML, HTML Tidy is particularly good at stripping that out specifically.
posted by mendel at 9:46 AM on December 14, 2005


I don't know exactly what you're doing here, but if you're trying to convert MS Word docs with bold and italics kept in and to strip out all the smart quotes, I have had very good luck running a batch of .doc files against antiword with formatting turned on, then running them through a script to turn *bold* and /italics/ into HTML tags. This gives you ASCII text (no smart quotes). I've never really played with HTML Tidy, but mendel's idea might be much better.

But if you're working with just html and you want to strip all the tags except the bold and italic, consider turning those tags into something else first (like |-BOLD-|the words|-ENDBOLD-|), then using a regex to remove all tags or the equivalent of the strip_tags function in C# if one exists. Then replace all the |-BOLD-|s with actual html tags.

It's an extra step, but it's easy.
posted by miniape at 9:59 AM on December 14, 2005


Thanks all for the thoughts. I have to do this, sadly. Trying to move an archaic HTML-page-based web zine/journal (content pasted in from MS Word; my God!) to MT and need to parse out entries from HTML.
What in the world has that got to do with using a parser instead of a regular expression? Of course you have to strip tags, nobody is doubting that. Using REs to do it is what is so bad.

Here's a perfect example of what I'm talking about, just released today: Bypass XSS filter in PHPNUKE 7.9=>x. Yet another coder who thought they could just write a simple little RE and be on their way...
posted by Rhomboid at 9:59 AM on December 14, 2005


Don't use regexes to parse HTML.
A quick search on CPAN turns up HTML::Scrubber.
posted by Sharcho at 4:03 PM on December 14, 2005
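
A minimal sketch of what that might look like, based on the module's documented interface (untested; the sample string is invented):

use strict;
use warnings;
use HTML::Scrubber;

# Allow only b and i; everything else is stripped.
my $scrubber = HTML::Scrubber->new( allow => [ qw(b i) ] );

my $html = '<p><b>bold</b>, <i>italic</i> and <img src="x.gif"></p>';
print $scrubber->scrub($html), "\n";   # <b>bold</b>, <i>italic</i> and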


Definitely avoid doing this with regexes if you can.

But here's a different, three-pass regex-based approach:
  • replace all <b> and <i> tags with placeholders, for instance something like ##b## or %%i%%
  • remove all HTML
  • put the B and I tags back.
And just for fun, here's a regex which will distinguish between <b> and <i> tags and tags which simply begin with B and I:

< [bi](\s.*?)?>

where the b or i is followed optionally by a space and then some other stuff, so it can't be followed by 'mg' or 'ockquote'. This also takes care of a possible problem with things like <i class="foo">.
posted by AmbroseChapel at 11:39 PM on December 14, 2005


Hmm. Obviously that regex should be <[bi](\s.*?)?> with the [bi] bit straight after the bracket.
posted by AmbroseChapel at 11:43 PM on December 14, 2005
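
Putting the three passes and that tag test together, a rough Perl sketch (untested; the ## placeholders and the sample string are invented, and it assumes the placeholders never occur in the input):

my $html = q{<i>one</i> <b class="foo">two</b> <img src="x.gif"> <blockquote>three</blockquote>};

# 1) set the allowed tags aside as placeholders
#    (opening tags, with optional attributes, and closing tags)
$html =~ s{<([bi])(\s.*?)?>}{##$1##}gi;
$html =~ s{</([bi])>}{##/$1##}gi;

# 2) remove every remaining tag
$html =~ s{<.*?>}{}gs;

# 3) put the allowed tags back (attributes on them are dropped)
$html =~ s{##(/?[bi])##}{<$1>}gi;

print $html, "\n";   # <i>one</i> <b>two</b>  three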

