<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Stripping some (but not all) formatting from rtf text (on a mac).</title>
	<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac/</link>
	<description>Comments on Ask MetaFilter post Stripping some (but not all) formatting from rtf text (on a mac).</description>
	<pubDate>Wed, 02 May 2007 01:56:05 -0800</pubDate>
	<lastBuildDate>Wed, 02 May 2007 01:56:05 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: Stripping some (but not all) formatting from rtf text (on a mac).</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac</link>	
		<description>Is it possible to programmatically strip some (but not all) of the formatting from text that&apos;s been copied from Safari? &lt;br /&gt;&lt;br /&gt; So you know how when you copy and paste from Safari to TextEdit (or another RTF-aware program), all the formatting comes through -- the links, the italics, images, etc.?&lt;br&gt;
&lt;br&gt;
I want to programmatically strip out all the formatting except the bolds and italics -- most importantly, to strip out the hyperlinks and images.  I&apos;m going to be copying a lot of text, so basically I just want this to be the push of a button.&lt;br&gt;
&lt;br&gt;
Writing a script for TextEdit itself is out: the RTF is opaque (the only option is to convert it to plain text, and then I&apos;ll lose the italics).  Nisus Writer Pro (beta) yields a similar roadblock (though if I could call up a contextual menu for each word using GUI scripting, the problem would be solved -- I don&apos;t think this is possible).  I&apos;d rather not use word because it&apos;s crashy under Rosetta.  Suggestions?</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2007:site.61791</guid>
		<pubDate>Wed, 02 May 2007 01:55:16 -0800</pubDate>
		<dc:creator>Tlogmer</dc:creator>
		
			<category>safari</category>
		
			<category>applescript</category>
		
			<category>scripting</category>
		
			<category>mac</category>
		
			<category>osx</category>
		
	</item> <item>
		<title>By: Tlogmer</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930125</link>	
		<description>&lt;i&gt;I&apos;d rather not use word&lt;/i&gt;&lt;br&gt;
&lt;br&gt;
Er.  That wasn&apos;t very clear.  I&apos;d rather not use &lt;i&gt;Microsoft&lt;/i&gt; Word.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930125</guid>
		<pubDate>Wed, 02 May 2007 01:56:05 -0800</pubDate>
		<dc:creator>Tlogmer</dc:creator>
	</item><item>
		<title>By: humblepigeon</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930132</link>	
		<description>If you use Firefox, the formatting will be lost when you copy and paste into another app.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930132</guid>
		<pubDate>Wed, 02 May 2007 02:19:27 -0800</pubDate>
		<dc:creator>humblepigeon</dc:creator>
	</item><item>
		<title>By: Tlogmer</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930139</link>	
		<description>Right -- but it&apos;s important that the bold and italic text remain bold and italic.  I just want to strip the other stuff.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930139</guid>
		<pubDate>Wed, 02 May 2007 02:28:06 -0800</pubDate>
		<dc:creator>Tlogmer</dc:creator>
	</item><item>
		<title>By: malevolent</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930140</link>	
		<description>Get it back into HTML (maybe paste it into an HTML email or WYSIWYG editor?), grab the source, and use find &amp;amp; replace (if you know regular expressions you&apos;ll be able to fully automate it) to strip out the tags you don&apos;t want. You can then copy from the edited web page.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930140</guid>
		<pubDate>Wed, 02 May 2007 02:30:35 -0800</pubDate>
		<dc:creator>malevolent</dc:creator>
	</item><item>
		<title>By: cmiller</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930200</link>	
		<description>Regex can get close, but it&apos;s not a very good state machine for most SGMLish text.  If the text contains comments (&quot;&amp;lt;!--  ....  --&amp;gt;&quot;), then all bets are off.  Let&apos;s assume that&apos;s not a problem, though.&lt;br&gt;
&lt;br&gt;
Dump it into a text file.  Write a few lines of Python (already installed on OS X) to strip out the unwanted tags.  Something like:&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
----&lt;pre&gt;#!/usr/bin/python&lt;br&gt;
&lt;br&gt;
import sys&lt;br&gt;
import re&lt;br&gt;
&lt;br&gt;
keep_tags = (&quot;b&quot;, &quot;i&quot;, &quot;em&quot;, &quot;strong&quot;)&lt;br&gt;
&lt;br&gt;
for line in open(sys.argv[1]):&lt;br&gt;
    # the capturing group makes re.split keep the tags, at the odd indexes&lt;br&gt;
    for i, item in enumerate(re.split(&quot;(&amp;lt;[^&amp;gt;]*&amp;gt;)&quot;, line)):&lt;br&gt;
&lt;br&gt;
        if i % 2 == 1:  # odd parts look like HTML tags&lt;br&gt;
            match = re.search(r&quot;\w+&quot;, item)  # get the first word inside the element&lt;br&gt;
            if match:&lt;br&gt;
                tag_name = match.group(0).lower()&lt;br&gt;
                if tag_name not in keep_tags:&lt;br&gt;
                    continue  # skip to next item&lt;br&gt;
&lt;br&gt;
        sys.stdout.write(item)&lt;br&gt;
&lt;/pre&gt;&lt;br&gt;
That will read from the file you name on the command line and write out your text.  I assume you know how to get to a Terminal shell.&lt;br&gt;
&lt;tt&gt;$  python that_program_name your_source_file&lt;/tt&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930200</guid>
		<pubDate>Wed, 02 May 2007 05:28:32 -0800</pubDate>
		<dc:creator>cmiller</dc:creator>
	</item><item>
		<title>By: cmiller</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930204</link>	
		<description>Oh, this works against web pages, btw.  Save your source page.  Run the program, and redirect the output to a new file.  View the new file in Safari.  Copy from it.&lt;br&gt;
&lt;br&gt;
&lt;tt&gt;$ python that_program_name your_source_file.html &amp;gt; new_stripped_file.html&lt;/tt&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930204</guid>
		<pubDate>Wed, 02 May 2007 05:32:05 -0800</pubDate>
		<dc:creator>cmiller</dc:creator>
	</item><item>
		<title>By: clord</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930462</link>	
		<description>You may also want to keep &quot;p&quot; and &quot;br&quot; tags; otherwise you will have one very large paragraph.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930462</guid>
		<pubDate>Wed, 02 May 2007 10:10:48 -0800</pubDate>
		<dc:creator>clord</dc:creator>
	</item><item>
		<title>By: Tlogmer</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930481</link>	
		<description>Damn.  Thanks, cmiller.  (I&apos;m only an amateur programmer, and I do my regexes in Ruby, not Perl -- that would have taken me a while.)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930481</guid>
		<pubDate>Wed, 02 May 2007 10:30:17 -0800</pubDate>
		<dc:creator>Tlogmer</dc:creator>
	</item><item>
		<title>By: Tlogmer</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930490</link>	
		<description>Er, make that &quot;ruby, not python&quot;. Just woke up.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930490</guid>
		<pubDate>Wed, 02 May 2007 10:40:22 -0800</pubDate>
		<dc:creator>Tlogmer</dc:creator>
	</item><item>
		<title>By: ijoshua</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930501</link>	
		<description>&lt;code&gt;&lt;pre&gt;% pbpaste -Prefer ascii|pbcopy&lt;/pre&gt;&lt;/code&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930501</guid>
		<pubDate>Wed, 02 May 2007 10:52:16 -0800</pubDate>
		<dc:creator>ijoshua</dc:creator>
	</item><item>
		<title>By: ijoshua</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930503</link>	
		<description>Oh, sorry, I didn&apos;t fully understand the question.  My answer above will strip all formatting.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930503</guid>
		<pubDate>Wed, 02 May 2007 10:53:58 -0800</pubDate>
		<dc:creator>ijoshua</dc:creator>
	</item><item>
		<title>By: Tlogmer</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#930820</link>	
		<description>I couldn&apos;t get cmiller&apos;s script to work, but a friend of mine is a JavaScript badass and he helped me do it.  It&apos;s a hacky solution, but here it is:&lt;br&gt;
&lt;br&gt;
1. An HTML file gives you a dialog box, grabs the article you specify using XMLHttpRequest, and runs a regular expression to strip the links (but leave the link text).&lt;br&gt;
&lt;br&gt;
2. I wrote a CSS file to strip out images and the like and loaded it as Safari&apos;s custom style sheet.&lt;br&gt;
&lt;br&gt;
Here&apos;s the HTML file:&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&amp;lt;html&amp;gt;&lt;br&gt;
&amp;lt;head&amp;gt;&lt;br&gt;
	&amp;lt;script language=&quot;javascript&quot;&amp;gt;&lt;br&gt;
	function getarticle(theArticle) {&lt;br&gt;
		var xmlhttp;  // declared locally to avoid leaking an implicit global&lt;br&gt;
		try {&lt;br&gt;
			xmlhttp = new ActiveXObject(&quot;Msxml2.XMLHTTP&quot;);&lt;br&gt;
		} catch (e) {&lt;br&gt;
		  	try {&lt;br&gt;
		    	xmlhttp = new ActiveXObject(&quot;Microsoft.XMLHTTP&quot;);&lt;br&gt;
		    } catch (E) {&lt;br&gt;
		        xmlhttp = false;&lt;br&gt;
		    }&lt;br&gt;
		}&lt;br&gt;
		&lt;br&gt;
		if (!xmlhttp &amp;amp;&amp;amp; typeof XMLHttpRequest!=&apos;undefined&apos;) {&lt;br&gt;
		    xmlhttp = new XMLHttpRequest();&lt;br&gt;
		}&lt;br&gt;
&lt;br&gt;
		xmlhttp.open(&quot;GET&quot;, &apos;http://en.wikipedia.org/wiki/&apos; + theArticle,true);&lt;br&gt;
		xmlhttp.onreadystatechange=function() {&lt;br&gt;
			if (xmlhttp.readyState==4) {&lt;br&gt;
		    	modarticle(xmlhttp.responseText);&lt;br&gt;
		    }&lt;br&gt;
		}&lt;br&gt;
		xmlhttp.send(null);&lt;br&gt;
	}&lt;br&gt;
	&lt;br&gt;
	function modarticle(playtext) {&lt;br&gt;
		&lt;br&gt;
		playtext = playtext.replace(/&amp;lt;a.*?href=&quot;.+?&quot;.*?&amp;gt;(.+?)&amp;lt;\/a&amp;gt;/gi, &apos;$1&apos;);&lt;br&gt;
		document.write(playtext);&lt;br&gt;
	}&lt;br&gt;
	&amp;lt;/script&amp;gt;&lt;br&gt;
&amp;lt;/head&amp;gt;&lt;br&gt;
&lt;br&gt;
&amp;lt;body onload=&quot;var theArticle = prompt(&apos;Article name:&apos;);getarticle(theArticle)&quot;&amp;gt;&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&amp;lt;/body&amp;gt;&lt;br&gt;
&lt;br&gt;
&amp;lt;/html&amp;gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-930820</guid>
		<pubDate>Wed, 02 May 2007 17:28:57 -0800</pubDate>
		<dc:creator>Tlogmer</dc:creator>
	</item><item>
		<title>By: cmiller</title>
		<link>http://ask.metafilter.com/61791/Stripping-some-but-not-all-formatting-from-rtf-text-on-a-mac#931122</link>	
		<description>&quot;Couldn&apos;t get cmiller&apos;s to work&quot;?  &lt;br&gt;
&lt;br&gt;
I know it&apos;s academic now, but why?  What happened?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.61791-931122</guid>
		<pubDate>Thu, 03 May 2007 06:16:30 -0800</pubDate>
		<dc:creator>cmiller</dc:creator>
	</item>
	</channel>
</rss>
