<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

      <title>Comments on: Regular expression question</title>
      <link>http://ask.metafilter.com/28972/Regular-expression-question/</link>
      <description>Comments on Ask MetaFilter post Regular expression question</description>
	  	  <pubDate>Tue, 13 Dec 2005 11:58:35 -0800</pubDate>
      <lastBuildDate>Tue, 13 Dec 2005 11:58:35 -0800</lastBuildDate>
      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>

<item>
  	<title>Question: Regular expression question</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question</link>	
  	<description>RegexFilter: I want to strip out all HTML from a string except for approved tags of B and I. I have this pattern: &quot;&lt; (p|img)*?&gt;&quot; which strips out any instance of P or IMG tags, but I want to reverse it... I want to say only allow B or I. I tried this: &quot;&lt; ^(b|i)*?&gt;&quot; thinking that would mean any character NOT in the group b or i, but no go. Any tips before I go insane?</description>
  	<guid isPermaLink="false">post:ask.metafilter.com,2008:site.28972</guid>
  	<pubDate>Tue, 13 Dec 2005 11:44:03 -0800</pubDate>
  	<dc:creator>xmutex</dc:creator>
	
	<category>regex</category>
	
	<category>patterns</category>
	
</item>
<item>
  	<title>By: snownoid</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456100</link>	
  	<description>If you use the caret like this it refers to the beginning of the string. &lt;br&gt;
I think /&lt; ([^bi])+&gt;/ is what you need. &lt;br&gt;
I&apos;m not sure what you are trying to do with the &amp;quot;*?&amp;quot; but it usually does not make sense to combine them since &amp;quot;?&amp;quot; (0-1 instances) is contained in &amp;quot;*&amp;quot; (0-infinite number of instances).</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456100</guid>
  	<pubDate>Tue, 13 Dec 2005 11:58:35 -0800</pubDate>
  	<dc:creator>snownoid</dc:creator>
</item>
<item>
  	<title>By: furtive</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456103</link>	
  	<description>^ only means NOT when it is the first character inside a range []; outside of a range it means start of string.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456103</guid>
  	<pubDate>Tue, 13 Dec 2005 11:59:15 -0800</pubDate>
  	<dc:creator>furtive</dc:creator>
</item>
<item>
  	<title>By: Alison</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456105</link>	
  	<description>Are you doing this in perl?  I did something like this recently and I think it looks like this:&lt;br&gt;
$b = &amp;quot;b&amp;quot;;&lt;br&gt;
$i = &amp;quot;i&amp;quot;;&lt;br&gt;
&lt;br&gt;
s/\&lt; [^$b$i]\&gt;//g; &lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://www.comp.leeds.ac.uk/Perl/matching.html&quot;&gt;This&lt;/a&gt; is a great page on regular expressions.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456105</guid>
  	<pubDate>Tue, 13 Dec 2005 12:00:07 -0800</pubDate>
  	<dc:creator>Alison</dc:creator>
</item>
<item>
  	<title>By: alan</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456113</link>	
  	<description>&lt;em&gt;Any tips before I go insane?&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
Parsing HTML with only Regex leads to insanity.  End of story.&lt;br&gt;
&lt;br&gt;
If you&apos;re using PHP or have access to its command line app, there&apos;s a strip_tags() function that will do what you want.&lt;br&gt;
&lt;br&gt;
If you&apos;re using perl there are a bunch of HTML parsing CPAN modules (not super familiar with them, so I can&apos;t recommend one)&lt;br&gt;
&lt;br&gt;
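<![CDATA[
If Python is an option, its standard library html.parser module can do the same job without any regex. A minimal sketch, not a vetted sanitizer: the names KeepTags and strip_tags_except are made up for illustration, attributes on the kept tags are dropped, and the text inside script or style elements still comes through as escaped text.<br>

```python
from html import escape
from html.parser import HTMLParser


class KeepTags(HTMLParser):
    """Collects text plus only the allowed tags (hypothetical helper)."""

    def __init__(self, allowed=("b", "i")):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attributes are deliberately dropped.
        if tag in self.allowed:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape text so stray angle brackets cannot turn into markup.
        self.parts.append(escape(data, quote=False))


def strip_tags_except(text, allowed=("b", "i")):
    parser = KeepTags(allowed)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_tags_except on a string keeps the B and I tags and the text, and drops every other tag.<br>
]]>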
If you&apos;re stuck in a text editor...good luck :)</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456113</guid>
  	<pubDate>Tue, 13 Dec 2005 12:03:31 -0800</pubDate>
  	<dc:creator>alan</dc:creator>
</item>
<item>
  	<title>By: xmutex</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456114</link>	
  	<description>snownoid: &amp;quot;&lt; ([^bi])+&gt;&amp;quot; puts me closer but still seems to allow img tags?&lt;br&gt;
&lt;br&gt;
I tried modifying it like this: &amp;quot;&lt; ([^bi])&gt;&amp;quot; thinking that would only allow one character between &amp;lt; and &amp;gt;, but no go.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456114</guid>
  	<pubDate>Tue, 13 Dec 2005 12:03:37 -0800</pubDate>
  	<dc:creator>xmutex</dc:creator>
</item>
<item>
  	<title>By: mendel</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456125</link>	
  	<description>What alan said. Parse HTML with an HTML parser, which regular expressions aren&apos;t. If you&apos;re dealing with user input, you &lt;i&gt;will&lt;/i&gt; forget edge cases (you&apos;ve already forgotten about whitespace and &amp;lt;p&amp;lt;p&amp;gt;&amp;gt;!) It&apos;s a solved problem, so reuse some of the working code that&apos;s already out there.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456125</guid>
  	<pubDate>Tue, 13 Dec 2005 12:29:16 -0800</pubDate>
  	<dc:creator>mendel</dc:creator>
</item>
<item>
  	<title>By: snownoid</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456135</link>	
  	<description>Yes, right. [^bi] is equivalent to &amp;quot;not b and not i&amp;quot;, so any tag that contains a &amp;quot;b&amp;quot; or an &amp;quot;i&amp;quot; is not matched, because at least one of its characters is a &amp;quot;b&amp;quot; or an &amp;quot;i&amp;quot;.&lt;br&gt;
I thought you could try &amp;quot;/&lt; [^b]|[^i]+&gt;/&amp;quot; but that doesn&apos;t work either because a &amp;quot;b&amp;quot; is not an &amp;quot;i&amp;quot; and is thus matched. &lt;br&gt;
Actually I&apos;m not so sure anymore it is possible to do what you want using only regular expressions.&lt;br&gt;
If you are using PHP you could do something like&lt;br&gt;
preg_match(&amp;quot;/&lt; (\w+)&gt;/&amp;quot;,$yourstring,$match) (not sure the syntax is perfectly correct) and then check whether $match[1] (the captured tag name) is &amp;quot;b&amp;quot; or &amp;quot;i&amp;quot;.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456135</guid>
  	<pubDate>Tue, 13 Dec 2005 12:33:32 -0800</pubDate>
  	<dc:creator>snownoid</dc:creator>
</item>
<item>
  	<title>By: ducksauce</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456149</link>	
  	<description>Probably a better solution, but you can run two regexes if this isn&apos;t a very processor-intensive script:&lt;br&gt;
&lt;br&gt;
$test = &amp;quot;&amp;lt;i&amp;gt;this&amp;lt;/i&amp;gt; is a &amp;lt;b&amp;gt;very&amp;lt;/b&amp;gt; good test.  &amp;lt;p&amp;gt;don&apos;t you think?&amp;lt;/p&amp;gt;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&amp;lt;img src=test.gif&amp;gt;&amp;quot;;&lt;br&gt;
&lt;br&gt;
$test =~ s/\&amp;lt;\w{2}.*?\&amp;gt;//g;  # any tags 2 characters or longer&lt;br&gt;
$test =~ s/\&amp;lt;[^ib]*?\&amp;gt;//g;    # any tags not &amp;lt;b&amp;gt; or &amp;lt;i&amp;gt;</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456149</guid>
  	<pubDate>Tue, 13 Dec 2005 12:43:41 -0800</pubDate>
  	<dc:creator>ducksauce</dc:creator>
</item>
<item>
  	<title>By: ducksauce</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456151</link>	
  	<description>And by &amp;quot;probably a better solution&amp;quot;, I meant &amp;quot;there is probably a better solution than what I&apos;m about to post&amp;quot;, in case that wasn&apos;t clear.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456151</guid>
  	<pubDate>Tue, 13 Dec 2005 12:44:13 -0800</pubDate>
  	<dc:creator>ducksauce</dc:creator>
</item>
<item>
  	<title>By: snownoid</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456164</link>	
  	<description>Oh, that&apos;s a good idea. &lt;br&gt;
&lt;br&gt;
The regular expressions would be more precise/correct like this, though:&lt;br&gt;
$test =~ s/&lt; \w{2,}&gt;//g; # any tags 2 characters or longer&lt;br&gt;
$test =~ s/&lt; [^ib]&gt;//g; # any tags not &amp;lt;b&amp;gt; or &amp;lt;i&amp;gt;</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456164</guid>
  	<pubDate>Tue, 13 Dec 2005 12:52:19 -0800</pubDate>
  	<dc:creator>snownoid</dc:creator>
</item>
<item>
  	<title>By: xmutex</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456185</link>	
  	<description>Cool. I will test these out. As a follow-up: is there some way to regex search for Microsoft Word &apos;smart&apos; quotes or whatever they are called?</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456185</guid>
  	<pubDate>Tue, 13 Dec 2005 13:05:14 -0800</pubDate>
  	<dc:creator>xmutex</dc:creator>
</item>
<item>
  	<title>By: kindall</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456224</link>	
  	<description>I find it is often easier to break something like this down into steps. On my comment script which I use for my Web site, I do it this way:&lt;br&gt;
&lt;br&gt;
1) Convert all &amp;amp; to &amp;amp;amp;&lt;br&gt;
2) Convert all &amp;lt; to &amp;amp;lt;&lt;br&gt;
3) Convert all &amp;amp;lt;B&amp;gt; to &amp;lt;B&amp;gt; (case-insensitive)&lt;br&gt;
4) Same for &amp;amp;lt;I&amp;gt;, &amp;amp;lt;/B&amp;gt;, and &amp;amp;lt;/I&amp;gt;.&lt;br&gt;
&lt;br&gt;
These are all simple text searches, no regex involved. Steps 3 &amp;amp; 4 could be combined using one regex, though.&lt;br&gt;
&lt;br&gt;
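<![CDATA[
The steps above can be sketched in Python like this. It is only an illustration: the function name escape_except and the allowed-tag tuple are assumptions of the sketch, and like the steps it only re-enables bare tags with no attributes.<br>

```python
import re


def escape_except(text, allowed=("b", "i")):
    # Step 1: escape ampersands first so later entities are not double-handled.
    out = text.replace("&", "&amp;")
    # Step 2: neutralize every tag by escaping its opening bracket.
    out = out.replace("<", "&lt;")
    # Steps 3 and 4 combined: turn the escaped opening bracket back into a real
    # one, but only for the bare allowed tags and their closing forms.
    names = "|".join(re.escape(name) for name in allowed)
    bare_tag = re.compile(r"&lt;(/?(?:%s))>" % names, re.IGNORECASE)
    return bare_tag.sub(r"<\1>", out)
```

Anything that is not exactly one of the allowed tags stays visible as escaped text rather than being stripped.<br>
]]>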
This has the effect of leaving any non-permitted tags as text rather than stripping them out, which may not be exactly what you want, but you could follow this up with a regex that strips out &amp;amp;lt;.*?&amp;gt; (where *? has the Perl meaning of a non-greedy *).</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456224</guid>
  	<pubDate>Tue, 13 Dec 2005 13:32:47 -0800</pubDate>
  	<dc:creator>kindall</dc:creator>
</item>
<item>
  	<title>By: miniape</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456235</link>	
  	<description>This last question (about the smart quotes) makes a lot more sense if we know what you&apos;re doing this in. PHP? Perl? A text editor (and if so, which one)?</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456235</guid>
  	<pubDate>Tue, 13 Dec 2005 13:38:09 -0800</pubDate>
  	<dc:creator>miniape</dc:creator>
</item>
<item>
  	<title>By: xmutex</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456240</link>	
  	<description>miniape: C#. Could do it in anything (php/perl) though.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456240</guid>
  	<pubDate>Tue, 13 Dec 2005 13:42:50 -0800</pubDate>
  	<dc:creator>xmutex</dc:creator>
</item>
<item>
  	<title>By: miniape</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456256</link>	
  	<description>I could be wrong, but I believe you need to refer to them with hexadecimal character codes: \xhh   character with hex code hh&lt;br&gt;
Here are some PCRE docs. Check out the backslash section. I&apos;m not sure if C# is Perl Compatible though.&lt;br&gt;
http://adm.jinr.ru/doc/exim/pcre.html#SEC14&lt;br&gt;
&lt;br&gt;
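For what it is worth, in a Unicode-aware regex engine the usual Word smart quotes are U+2018, U+2019, U+201C and U+201D (bytes 0x91 to 0x94 in Windows-1252). A small Python sketch, assuming the text has already been decoded to Unicode; straighten_quotes is a hypothetical helper name:&lt;br&gt;

```python
import re

# Curly single quotes (U+2018, U+2019) and curly double quotes (U+201C, U+201D).
SMART_SINGLE = re.compile("[\u2018\u2019]")
SMART_DOUBLE = re.compile("[\u201c\u201d]")


def straighten_quotes(text):
    """Replace curly quotes with their plain ASCII equivalents."""
    text = SMART_SINGLE.sub("'", text)
    return SMART_DOUBLE.sub('"', text)
```

The same character classes should carry over to other engines that accept \uXXXX or \x{XXXX} escapes, though the exact escape syntax varies.&lt;br&gt;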
If you&apos;re anything like me, you&apos;ll have most of your hair gone by the end of the night trying to figure out what to escape, what&apos;s getting interpreted as a back reference and what&apos;s actually working.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456256</guid>
  	<pubDate>Tue, 13 Dec 2005 13:53:47 -0800</pubDate>
  	<dc:creator>miniape</dc:creator>
</item>
<item>
  	<title>By: Rhomboid</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456388</link>	
  	<description>Here&apos;s another opinion telling you not to go anywhere near this task with regular expressions.  It is close to impossible to reliably strip some but not all tags using a regular expression that does not fail under weird edge cases.  And if done improperly, this can lead to a cross-site scripting vulnerability that would allow someone to embed JavaScript on the page and steal your login cookie, among other things.&lt;br&gt;
&lt;br&gt;
If you think this is a triviality, go review some of the numerous security advisories against things like phpBB or IPB that tried to do this and got it wrong.&lt;br&gt;
&lt;br&gt;
Just... Don&apos;t.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456388</guid>
  	<pubDate>Tue, 13 Dec 2005 15:21:04 -0800</pubDate>
  	<dc:creator>Rhomboid</dc:creator>
</item>
<item>
  	<title>By: holloway</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456523</link>	
  	<description>^^ what he said.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456523</guid>
  	<pubDate>Tue, 13 Dec 2005 17:10:01 -0800</pubDate>
  	<dc:creator>holloway</dc:creator>
</item>
<item>
  	<title>By: malevolent</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#456890</link>	
  	<description>Yeah, as soon as you allow users to enter any markup, even just a couple of tags, it&apos;s &lt;a href=&quot;http://ha.ckers.org/xss.html&quot;&gt;surprisingly difficult&lt;/a&gt; to avoid opening up security holes.&lt;br&gt;
&lt;br&gt;
Regular expressions should be fine in this case though if you&apos;re really careful. I&apos;d suggest converting the permitted tags to some other cryptic form to set them aside, strip all remaining tags, then strip any stray greater/less than symbols, then convert the allowed tags back.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-456890</guid>
  	<pubDate>Tue, 13 Dec 2005 22:59:26 -0800</pubDate>
  	<dc:creator>malevolent</dc:creator>
</item>
<item>
  	<title>By: xmutex</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#457279</link>	
  	<description>Thanks all for the thoughts. I have to do this, sadly. Trying to move an archaic HTML-page-based web zine/journal (content pasted in from MS Word; my God!) to MT and need to parse out entries from HTML.&lt;br&gt;
&lt;br&gt;
Beastly burden, but it must be done.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-457279</guid>
  	<pubDate>Wed, 14 Dec 2005 09:28:00 -0800</pubDate>
  	<dc:creator>xmutex</dc:creator>
</item>
<item>
  	<title>By: mendel</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#457312</link>	
  	<description>&lt;i&gt;content pasted in from MS Word&lt;/i&gt;&lt;br&gt;
&lt;br&gt;
If you mean that they&apos;re full of Word-generated HTML, &lt;a href=&quot;http://www.w3.org/People/Raggett/tidy/&quot;&gt;HTML Tidy&lt;/a&gt; is particularly good at stripping that out specifically.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-457312</guid>
  	<pubDate>Wed, 14 Dec 2005 09:46:13 -0800</pubDate>
  	<dc:creator>mendel</dc:creator>
</item>
<item>
  	<title>By: miniape</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#457333</link>	
  	<description>I don&apos;t know exactly what you&apos;re doing here, but if you&apos;re trying to convert MS Word docs with bold and italics kept in and to strip out all the smart quotes, I have had very good luck running a batch of .doc files against antiword with formatting turned on, then running them against a script to turn *bold* and /italics/ into HTML tags. This gives you ASCII text (no smart quotes). I&apos;ve never really played with HTML Tidy, but mendel&apos;s idea might be much better.&lt;br&gt;
&lt;br&gt;
But if you&apos;re working with just html and you want to strip all the tags except the bold and italic, consider turning those tags into something else first (like |-BOLD-|the words|-ENDBOLD-|), then using a regex to remove all tags or the equivalent of the strip_tags function in C# if one exists. Then replace all the |-BOLD-|s with actual html tags.&lt;br&gt;
&lt;br&gt;
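<![CDATA[
A rough Python sketch of that placeholder idea, assuming the text never already contains markers of the |-B-| kind. keep_only is a made-up name and the marker spelling is shortened from the example above. Attributes on the kept tags are dropped and stray angle brackets are left alone, so it is a migration aid and not a security filter.<br>

```python
import re


def keep_only(text, allowed=("b", "i")):
    names = "|".join(re.escape(name) for name in allowed)

    # Pass 1: swap allowed tags (with or without attributes) for placeholders.
    allowed_tag = re.compile(r"<(/?)(%s)(?:\s[^>]*)?>" % names, re.IGNORECASE)
    out = allowed_tag.sub(
        lambda m: "|-%s%s-|" % ("END" if m.group(1) else "", m.group(2).upper()),
        text,
    )

    # Pass 2: remove every remaining tag.
    out = re.sub(r"<[^>]*>", "", out)

    # Pass 3: turn the placeholders back into plain tags.
    marker = re.compile(r"\|-(END)?(%s)-\|" % names, re.IGNORECASE)
    return marker.sub(
        lambda m: "<%s%s>" % ("/" if m.group(1) else "", m.group(2).lower()),
        out,
    )
```

Because the allowed-tag pattern requires whitespace or the closing bracket right after the tag name, tags such as IMG or BLOCKQUOTE that merely begin with I or B are removed in the second pass.<br>
]]>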
It&apos;s an extra step, but it&apos;s easy.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-457333</guid>
  	<pubDate>Wed, 14 Dec 2005 09:59:26 -0800</pubDate>
  	<dc:creator>miniape</dc:creator>
</item>
<item>
  	<title>By: Rhomboid</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#457334</link>	
  	<description>&lt;blockquote&gt;&lt;i&gt;Thanks all for the thoughts. I have to do this, sadly. Trying to move an archaic HTML-page-based web zine/journal (content pasted in from MS Word; my God!) to MT and need to parse out entries from HTML.&lt;/i&gt;&lt;/blockquote&gt;&lt;br&gt;
What in the world has that got to do with using a parser instead of a regular expression?  Of course you have to strip tags, nobody is doubting that.  Using REs to do it is what is so bad.&lt;br&gt;
&lt;br&gt;
Here&apos;s a perfect example of what I&apos;m talking about - just released today: &lt;a href=&quot;http://marc.theaimsgroup.com/?l=full-disclosure&amp;m=113458008403953&amp;w=4&quot;&gt;Bypass XSS filter in PHPNUKE 7.9=&amp;gt;x&lt;/a&gt;  Yet another coder that thought they could just write a simple little RE and be on their way...</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-457334</guid>
  	<pubDate>Wed, 14 Dec 2005 09:59:56 -0800</pubDate>
  	<dc:creator>Rhomboid</dc:creator>
</item>
<item>
  	<title>By: Sharcho</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#457730</link>	
  	<description>Don&apos;t use Regexes to parse HTML.&lt;br&gt;
A quick search in CPAN reveals &lt;a href=&quot;http://search.cpan.org/~podmaster/HTML-Scrubber/Scrubber.pm&quot;&gt;HTML::Scrubber&lt;/a&gt;.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-457730</guid>
  	<pubDate>Wed, 14 Dec 2005 16:03:12 -0800</pubDate>
  	<dc:creator>Sharcho</dc:creator>
</item>
<item>
  	<title>By: AmbroseChapel</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#458125</link>	
  	<description>Definitely avoid doing the thing with regexes if you can.&lt;br&gt;
&lt;br&gt;
But here&apos;s a different, three-pass regex-based approach:&lt;br&gt;
&lt;br&gt;
&lt;ul&gt;&lt;li&gt;replace all &amp;lt;b&amp;gt; and &amp;lt;i&amp;gt; tags with placeholders, for instance something like&lt;br&gt;
##b## or  %%i%%&lt;/li&gt;
&lt;li&gt;remove all HTML&lt;/li&gt;
&lt;li&gt;put the B and I tags back.&lt;/li&gt;
&lt;/ul&gt;and just for fun, here&apos;s a regex which will distinguish between &amp;lt;b&amp;gt; and &amp;lt;i&amp;gt; tags and tags which simply &lt;em&gt;begin&lt;/em&gt; with B and I.&lt;br&gt;
&lt;br&gt;
&lt;code&gt;  &lt; [bi](\s.*?)?&gt;&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
where the b or i is followed optionally by a space and then some other stuff, so it can&apos;t be followed by &apos;mg&apos; or &apos;ockquote&apos;. This also takes care of a possible problem with things like &lt;code&gt;&amp;lt;i class=&amp;quot;foo&amp;quot;&amp;gt;&lt;/code&gt;.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-458125</guid>
  	<pubDate>Wed, 14 Dec 2005 23:39:14 -0800</pubDate>
  	<dc:creator>AmbroseChapel</dc:creator>
</item>
<item>
  	<title>By: AmbroseChapel</title>
  	<link>http://ask.metafilter.com/28972/Regular-expression-question#458128</link>	
  	<description>Hmm. Obviously that regex should be &lt;code&gt; &amp;lt;[bi](\s.*?)?&amp;gt;&lt;/code&gt; with the [bi] bit straight after the bracket.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.28972-458128</guid>
  	<pubDate>Wed, 14 Dec 2005 23:43:05 -0800</pubDate>
  	<dc:creator>AmbroseChapel</dc:creator>
</item>

    </channel>
</rss>
