Strip my tags, please!
October 31, 2005 1:31 PM

GeekFilter: I want to strip all HTML tags from a page of text, leaving plain text. I have TextWrangler and OS X 10.4. I thought it would be easy…

I've tried Googling for a script that would do this, but haven't been able to figure it out. I tried using the regular expression < [^>]*> in TextWrangler with the "Use Grep" option, but that doesn't seem to work either. I don't want to have to pay $25 for something like TextSoap. I have an AppleScript that will convert the clipboard to plain text; the ideal would be something similar that I can invoke to remove tags from the contents of the clipboard. Thanks!
posted by al_fresco to Computers & Internet (17 answers total)
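
For context, the kind of clipboard filter being asked about can be sketched in one line of shell, using the pbpaste and pbcopy utilities that ship with OS X. This is only a rough illustration of the idea, not a tested recipe; the regex is the naive strip-everything-between-angle-brackets approach discussed in the answers below:

pbpaste | sed -E 's|</?[^>]*>||g' | pbcopy

Because sed works line by line, a tag that happens to be split across two lines won't be caught; the answers below get into the details.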
 
lynx -dump
posted by kcm at 1:35 PM on October 31, 2005


Best answer: I don't know if it's a MeFi typo, but your regex has a space in it. Also, it needs to account for the closing tag, so try: </?[^>]*>. That works for me in TW (also make sure to click "Start at Top").
posted by sbutler at 1:43 PM on October 31, 2005
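
As a quick illustration of what that pattern matches (a made-up line, run through any tool that supports the same regex), input like:

<p>Hello, <b>world</b>!</p>

comes out as:

Hello, world!

since each run from < to the next > is matched and deleted.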


Response by poster: lynx -dump

Can you explain how I would use that? I don't know Unix (or any other language, for that matter), but I can usually Google and figure out what I need on a task-by-task basis. Context?
posted by al_fresco at 1:47 PM on October 31, 2005 [1 favorite]


(Actually, I guess it already did account for the closing tag. Since things worked, I assume it was the space.)
posted by sbutler at 1:51 PM on October 31, 2005


Response by poster: Thanks sbutler. That did the trick! I don't know how the space got there in my regex.

Keep the answers coming, though. I'd just as soon find a few ways to do this.
posted by al_fresco at 1:53 PM on October 31, 2005


Lynx doesn't appear to be standard with OSX.

However, if you installed it (or were on a Linux box), you could do:

lynx -dump http://yoursite.com

and it would print the website on stdout, formatted as 'text only'. (I prefer not to have the 'list of links' for each page dumped, so I include -nolist as one of the options.)
posted by unixrat at 1:54 PM on October 31, 2005
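
Putting unixrat's pieces together, the whole command ends up looking something like this (the URL and the output file name are placeholders):

lynx -dump -nolist http://yoursite.com > newsletter.txt

The -nolist flag is the part that suppresses the numbered list of link references at the end of the dump.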


^^^ (While running 'terminal' on your OSX box. You need to be on a command line to run lynx.)

I did have that in there, I swear.
posted by unixrat at 1:55 PM on October 31, 2005
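
If your build of lynx supports the -stdin flag (recent ones do), you can even wire it straight into the clipboard workflow from Terminal, with no file saved in between. Roughly:

pbpaste | lynx -stdin -dump -nolist -force_html | pbcopy

The -force_html flag tells lynx to treat the piped-in text as HTML; exactly how well this behaves will depend on your lynx version, so treat it as a starting point rather than a guaranteed recipe.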


Call me crazy, but wouldn't the simplest method for this be to:

1. View the web page in the browser of your choice;
2. Select all;
3. Copy;
4. Go into the text editor of your choice;
5. Paste.

Not so great if you need to do this in batch mode, I guess.
posted by adamrice at 2:26 PM on October 31, 2005


In BBEdit, you can simply select all, then choose Remove Markup. Voila--no more tags. Not certain if TextWrangler has a similar feature.
posted by werty at 2:28 PM on October 31, 2005


Response by poster: Call me crazy, but wouldn't the simplest method for this be to:
posted by adamrice


This would work fine if I wanted to just grab the text of my whole page. I'm sending out an HTML-formatted email newsletter, though, in an OS X app called Newsletter. I format the HTML portion the way I want it (which is different from my website), and then I want to clean the tags out for the Plain-text Alternative. Make sense? Thanks, though!
posted by al_fresco at 2:44 PM on October 31, 2005


It is worth pointing out that the above regexp will fail when the HTML has < and/or > in the middle of a comment or inside an attribute value.

I've not seen many sites that do this though.
posted by ralawrence at 2:45 PM on October 31, 2005
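
For example (a made-up snippet), with markup like:

<img src="chart.gif" alt="sales > forecast">

the [^>]* part of the pattern stops at the first >, so the tag is only partially removed and the leftover forecast"> ends up in your plain text. Comments such as <!-- if x > 3 --> trip it up the same way.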


Response by poster: In BBEdit, you can simply select all, then choose Remove Markup. Voila--no more tags. Not certain if TextWrangler has a similar feature.
posted by werty


If it's there, I'm not finding it.
posted by al_fresco at 2:46 PM on October 31, 2005


Excuse my nearsightedness, but I don't see how the "Remove Markup" command works here. It seems to leave a lot behind. What I do wonder, though, is why one would not just copy and paste the text from a web browser or from a BBEdit (or TextWrangler, if available) preview? (Oops! I see A. Rice has the same question.)

Thanks from me as well for the Grep expression though. :-)
posted by Dick Paris at 2:49 PM on October 31, 2005


Another victim of live preview!
posted by Dick Paris at 2:51 PM on October 31, 2005


Response by poster: adamrice & Dick Paris:

Actually, I just realized that Newsletter's Preview window works just like a browser in this regard, so I could have just selected the contents of my preview and gotten the results I wanted. Not as sexy, but gets the job done.

Thanks, all.
posted by al_fresco at 2:59 PM on October 31, 2005


For the moment? Download a demo of BBEdit. Use its "Translate" command (under Markup/Utilities) with the "Translate HTML to Text" option.

That will do a great job.

For the longer term, you probably need to write a script to do this, because some tags just need to be removed and others need to be replaced so that paragraphs, etc., are preserved.

You might want to translate HTML headings to ***TEXT***, for instance, to get as much impact as you can from plain text.
posted by AmbroseChapel at 3:16 PM on October 31, 2005
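
A very rough sketch of that kind of script, done as a shell wrapper around sed (the file name, the *** heading convention, and the tag handling are all just placeholders to build on; this is nowhere near a real HTML parser):

#!/bin/sh
# Usage: html2text.sh newsletter.html > newsletter.txt
# Turn heading tags into *** markers, then strip whatever tags remain.
sed -E -e 's|</?h[1-6][^>]*>|***|g' \
       -e 's|</?[^>]*>||g' \
       "$1"

A fuller version would also turn </p> and <br> into line breaks, decode entities like &amp;, and so on, which is where a real scripting language starts to pay off.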


The advantage of Lynx or Links (both text-mode browsers) is that they'll reproduce structure (headings, tables, lists) in plain text. So lists get bullets as *s, tables are framed using +, |, and - characters, images get their ALT text, and so on.
posted by holloway at 11:41 PM on October 31, 2005

