Stripping some (but not all) formatting from rtf text (on a mac).
May 2, 2007 1:55 AM

Is it possible to programmatically strip some (but not all) of the formatting from text that's been copied from safari?

So you know how when you copy and paste from Safari to Textedit (or another rtf-aware program), all the formatting comes through -- the links, the italics, images, etc?

I want to programmatically strip out all the formatting except the bolds and italics -- most importantly, to strip out the hyperlinks and images. I'm going to be copying a lot of text, so basically I just want this to be the push of a button.

Writing a script for textedit itself is out: the rtf is opaque (the only option is just to convert it to plain text, and I'll lose the italics). Nisus Writer Pro (beta) yields a similar roadblock (though if I could call up a contextual menu for each word using GUI scripting the problem's solved -- I don't think this is possible). I'd rather not use word because it's crashy under rosetta. Suggestions?
posted by Tlogmer to Computers & Internet (13 answers total)
 
Response by poster: I'd rather not use word

Er. That wasn't very clear. I'd rather not use Microsoft Word.
posted by Tlogmer at 1:56 AM on May 2, 2007


If you use Firefox, the formatting will be lost when you copy and paste into another app.
posted by humblepigeon at 2:19 AM on May 2, 2007


Response by poster: Right -- but it's important that the bold and italic text remain bold and italic. I just want to strip the other stuff.
posted by Tlogmer at 2:28 AM on May 2, 2007


Get it back into HTML (maybe paste it into an HTML email or WYSIWYG editor?), grab the source, and use find & replace (if you know regular expressions you'll be able to fully automate it) to strip out the tags you don't want. You can then copy from the edited web page.
posted by malevolent at 2:30 AM on May 2, 2007


Regex can get close, but it's not a very good state machine for most SGMLish text. If the text contains comments ("<!-- .... -->"), then all bets are off. Let's assume that's not a problem, though.

Dump it into a text file. Write a few lines of Python (already installed on OS X) to strip it out. Something like:


----
#!/usr/bin/python

import sys
import re

keep_tags = ("b", "i", "em", "strong")

for line in open(sys.argv[1]):
    # the capturing group makes re.split keep the tags: even items are text, odd items are tags
    for i, item in enumerate(re.split("(<[^>]*>)", line)):

        if i % 2 == 1:  # odd parts look like HTML tags
            match = re.search(r"\w+", item)  # get the first word inside the element
            if match:
                tag_name = match.group(0).lower()
                if tag_name not in keep_tags:
                    continue  # skip to next item

        sys.stdout.write(item)
----
That will read from the file you name on the command line and write out your text. I assume you know how to get to a Terminal shell.
$ python that_program_name your_source_file
posted by cmiller at 5:28 AM on May 2, 2007


Oh, this works against web pages, btw. Save your source page. Run the program, and redirect the output to a new file. View the new file in Safari. Copy from it.

$ python that_program_name your_source_file.html > new_stripped_file.html
posted by cmiller at 5:32 AM on May 2, 2007


you may also want to keep "p" and "br" tags, otherwise you will have one very large paragraph.
posted by clord at 10:10 AM on May 2, 2007
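
For anyone following along, here is a minimal sketch of the same whitelist idea using Python 3's standard html.parser module instead of a regex, so comments are simply dropped rather than tripping up the match. The names KEEP_TAGS, TagFilter, and strip_tags are made up for illustration, and the decisions to discard attributes and the contents of script/style elements are assumptions, not something from the thread.

```python
from html.parser import HTMLParser

# Tags to keep; "p" and "br" preserve paragraph breaks (an assumed default set).
KEEP_TAGS = {"b", "i", "em", "strong", "p", "br"}
# Elements whose text content is dropped entirely (an assumption).
SKIP_CONTENT = {"script", "style"}


class TagFilter(HTMLParser):
    """Collects text plus a whitelist of tags; everything else is dropped."""

    def __init__(self, keep_tags=KEEP_TAGS):
        # convert_charrefs=False so entities like &amp; pass through unchanged
        super().__init__(convert_charrefs=False)
        self.keep_tags = keep_tags
        self.parts = []
        self._skip = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
        elif tag in self.keep_tags:
            self.parts.append("<%s>" % tag)  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in self.keep_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    # handle_comment is not overridden, so comments are silently dropped.


def strip_tags(html, keep_tags=KEEP_TAGS):
    """Return html with every tag removed except those in keep_tags."""
    parser = TagFilter(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Something like strip_tags(open("your_source_file.html").read()), written out to a new file, would mirror the redirect step described earlier in the thread.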


Response by poster: Damn. Thanks, cmiller. (I'm only an amateur programmer, and I do my regexes in ruby, not perl -- that would have taken me a while.)
posted by Tlogmer at 10:30 AM on May 2, 2007


Response by poster: Er, make that "ruby, not python". Just woke up.
posted by Tlogmer at 10:40 AM on May 2, 2007


% pbpaste -Prefer ascii|pbcopy

posted by ijoshua at 10:52 AM on May 2, 2007


Oh, sorry, I didn’t fully understand the question. My answer above will strip all formatting.
posted by ijoshua at 10:53 AM on May 2, 2007


Best answer: I couldn't get cmiller's script to work, but a friend of mine is a javascript badass and he helped me do it. It's a hacky solution, but here it is:

1. An html file gives you a dialog box, grabs the article you specify using xmlhttp, and runs a regular expression to strip the links (but leave the link text).

2. I wrote a css file to strip out images and the like and loaded it as Safari's custom style sheet.

Here's the html file:




<html>
<head>
<script language="javascript">
function getarticle(theArticle) {

    try {
        xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
    } catch (e) {
        try {
            xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
        } catch (E) {
            xmlhttp = false;
        }
    }

    if (!xmlhttp && typeof XMLHttpRequest!='undefined') {
        xmlhttp = new XMLHttpRequest();
    }

    xmlhttp.open("GET", 'http://en.wikipedia.org/wiki/' + theArticle,true);
    xmlhttp.onreadystatechange=function() {
        if (xmlhttp.readyState==4) {
            modarticle(xmlhttp.responseText);
        }
    }
    xmlhttp.send(null);
}

function modarticle(playtext) {

    playtext = playtext.replace(/<a.*?href=".+?".*?>(.+?)<\/a>/gi, '$1');
    document.write(playtext);
}
</script>
</head>

<body onload="var theArticle = prompt('Article name:');getarticle(theArticle)">



</body>

</html>
posted by Tlogmer at 5:28 PM on May 2, 2007
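
The link-stripping step could also be done offline in Python. This is a rough sketch, assuming the page source is already in a string; strip_links_and_images and the two pattern names are hypothetical, and the image removal here only stands in for what the custom CSS file does in the post above. Like any regex approach, it will misfire on attribute values that contain ">".

```python
import re

# Matches an anchor element and captures its inner markup.
# \b keeps it from matching tags such as <abbr>; DOTALL lets a link span lines.
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)
# Matches a standalone <img ...> tag.
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_links_and_images(html):
    """Replace each link with its inner markup, then delete image tags."""
    html = LINK_RE.sub(r"\1", html)
    return IMG_RE.sub("", html)
```

Bold and italic markup inside a link survives, because only the surrounding anchor tags are removed.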


"Couldn't get cmiller's to work"?

I know it's academic now, but why? What happened?
posted by cmiller at 6:16 AM on May 3, 2007

