Stripping some (but not all) formatting from rtf text (on a mac).
May 2, 2007 1:55 AM
Is it possible to programmatically strip some (but not all) of the formatting from text that's been copied from safari?
So you know how when you copy and paste from Safari to Textedit (or another rtf-aware program), all the formatting comes through -- the links, the italics, images, etc?
I want to programmatically strip out all the formatting except the bolds and italics -- most importantly, to strip out the hyperlinks and images. I'm going to be copying a lot of text, so basically I just want this to be the push of a button.
Writing a script for textedit itself is out: the rtf is opaque (the only option is just to convert it to plain text, and I'll lose the italics). Nisus Writer Pro (beta) yields a similar roadblock (though if I could call up a contextual menu for each word using GUI scripting the problem's solved -- I don't think this is possible). I'd rather not use word because it's crashy under rosetta. Suggestions?
If you use Firefox, the formatting will be lost when you copy and paste into another app.
posted by humblepigeon at 2:19 AM on May 2, 2007
Response by poster: Right -- but it's important that the bold and italic text remain bold and italic. I just want to strip the other stuff.
posted by Tlogmer at 2:28 AM on May 2, 2007
Get it back into HTML (maybe paste it into an HTML email or WYSIWYG editor?), grab the source, and use find & replace (if you know regular expressions you'll be able to fully automate it) to strip out the tags you don't want. You can then copy from the edited web page.
posted by malevolent at 2:30 AM on May 2, 2007
Regex can get close, but it's not a very good state machine for most SGMLish text. If the text contains comments, ("<!-- .... -->") then all bets are off. Let's assume that's not a problem, though.
Dump it into a text file. Write a few lines of Python (already installed on OS X) to strip it out. Something like:
----
#!/usr/bin/python
import sys
import re

keep_tags = ("b", "i", "em", "strong")

for line in open(sys.argv[1]):
    for i, item in enumerate(re.split("(<[^>]*>)", line)):
        if i % 2 == 1:  # odd parts look like HTML tags
            match = re.search(r"\w+", item)  # get the first word inside the element
            if match:
                tag_name = match.group(0).lower()
                if tag_name not in keep_tags:
                    continue  # skip to next item
        sys.stdout.write(item)

That will read from the file you name on the command line and write out your text. I assume you know how to get to a Terminal shell.
$ python that_program_name your_source_file
posted by cmiller at 5:28 AM on May 2, 2007
Oh, this works against web pages, btw. Save your source page. Run the program, and redirect the output to a new file. View the new file in Safari. Copy from it.
$ python that_program_name your_source_file.html > new_stripped_file.html
posted by cmiller at 5:32 AM on May 2, 2007
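The caveat about comments can be made concrete. What follows is a minimal sketch, not anything posted in the thread: it shows the tag-splitting regex being cut short by a ">" inside a comment, and then one possible alternative built on the standard library's html.parser. The names regex_pieces, KeepOnly, and filter_html are made up for illustration, and the code assumes Python 3.

```python
import re
from html import escape
from html.parser import HTMLParser

KEEP = ("b", "i", "em", "strong")  # tags to keep, as in the script above


def regex_pieces(text):
    """Split text into alternating non-tag / tag-like pieces."""
    return re.split(r"(<[^>]*>)", text)


class KeepOnly(HTMLParser):
    """Hypothetical helper: re-emit text plus a whitelist of bare tags."""

    def __init__(self, keep=KEEP):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.out.append("<%s>" % tag)  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape.
        self.out.append(escape(data, quote=False))

    # Comments go to handle_comment, which does nothing by default,
    # so they are dropped whole.


def filter_html(text, keep=KEEP):
    parser = KeepOnly(keep)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


sample = 'x<!-- a > b --><b>bold</b> <a href="u">link</a>'
# The regex treats everything up to the first ">" as the tag,
# so the tail of the comment leaks into the text pieces.
print(regex_pieces(sample)[:3])
# The parser sees the comment as one unit and drops it.
print(filter_html(sample))
```

This sketch makes no special case for script or style elements, so their contents would come through as plain text.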
You may also want to keep "p" and "br" tags, otherwise you will have one very large paragraph.
posted by clord at 10:10 AM on May 2, 2007
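As a rough illustration of that suggestion, here is the same regex-splitting idea wrapped in a function with "p" and "br" added to the keep list. The function name strip_tags and the sample string are assumptions for the example, not something from the thread.

```python
import re


def strip_tags(html, keep=("b", "i", "em", "strong", "p", "br")):
    """Drop every tag whose name is not in `keep`; leave text untouched."""
    out = []
    for i, item in enumerate(re.split(r"(<[^>]*>)", html)):
        if i % 2 == 1:  # odd pieces are the tag-like matches
            match = re.search(r"\w+", item)  # first word is the tag name
            if match and match.group(0).lower() not in keep:
                continue  # skip tags that are not whitelisted
        out.append(item)
    return "".join(out)


sample = '<p>Some <a href="x">linked</a> <i>text</i><br/><img src="y.png"></p>'
print(strip_tags(sample))
```

Kept tags pass through with their attributes intact, so a paragraph carrying a class or style attribute would still carry it afterwards.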
Response by poster: Damn. Thanks, cmiller. (I'm only an amateur programmer, and I do my regexes in ruby, not pearl -- that would have taken me awhile.)
posted by Tlogmer at 10:30 AM on May 2, 2007
Response by poster: Er, make that "ruby, not python". Just woke up.
posted by Tlogmer at 10:40 AM on May 2, 2007
Oh, sorry, I didn’t fully understand the question. My answer above will strip all formatting.
posted by ijoshua at 10:53 AM on May 2, 2007
Best answer: I couldn't get cmiller's script to work, but a friend of mine is a javascript badass and he helped me do it. It's a hacky solution, but here it is:
1. An html file gives you a dialog box, grabs the article you specify using xmlhttp, runs a regular expression to strip the links (but leave the link text).
2. I wrote a css file to strip out images and the like and loaded it as Safari's custom file.
Here's the html file:
<html>
<head>
<script language="javascript">
function getarticle(theArticle) {
  try {
    xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
  } catch (e) {
    try {
      xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
    } catch (E) {
      xmlhttp = false;
    }
  }
  if (!xmlhttp && typeof XMLHttpRequest!='undefined') {
    xmlhttp = new XMLHttpRequest();
  }
  xmlhttp.open("GET", 'http://en.wikipedia.org/wiki/' + theArticle,true);
  xmlhttp.onreadystatechange=function() {
    if (xmlhttp.readyState==4) {
      modarticle(xmlhttp.responseText);
    }
  }
  xmlhttp.send(null);
}
function modarticle(playtext) {
  playtext = playtext.replace(/<a.*?href=".+?".*?>(.+?)<\/a>/gi, '$1');
  document.write(playtext);
}
</script>
</head>
<body onload="var theArticle = prompt('Article name:');getarticle(theArticle)">
</body>
</html>
posted by Tlogmer at 5:28 PM on May 2, 2007
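For comparison, the link-stripping step can be sketched in Python as well. This is not the poster's code: the patterns are slightly tightened (a word boundary after "a" so that tags such as "abbr" are left alone), and images are removed with a second regex here, where the poster used a custom CSS file. The names unwrap_links and drop_images are made up for the example.

```python
import re

# Match an anchor element and capture its contents; \b keeps <abbr> etc. safe.
LINK = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
# Match a whole <img ...> tag.
IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def unwrap_links(html):
    """Replace each link with its own text, keeping inner markup."""
    return LINK.sub(r"\1", html)


def drop_images(html):
    """Remove image tags entirely."""
    return IMG.sub("", html)


sample = 'See <a href="/wiki/Cat" title="Cat">the <i>cat</i></a> page.<img src="a.png" alt="">'
print(drop_images(unwrap_links(sample)))
```

Like any regex approach to HTML, this assumes reasonably tidy markup: no nested links and no ">" inside attribute values.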
"Couldn't get cmiller's to work"?
I know it's academic now, but why? What happened?
posted by cmiller at 6:16 AM on May 3, 2007
This thread is closed to new comments.
Er. That wasn't very clear. I'd rather not use Microsoft Word.
posted by Tlogmer at 1:56 AM on May 2, 2007