Comments on: Strip my tags, please!

Question: Strip my tags, please!

al_fresco — Mon, 31 Oct 2005 13:31:47 -0800

GeekFilter: I want to strip all HTML tags from a page of text, leaving plain text. I have TextWrangler and OS X 10.4. I thought it would be easy...

I've tried Googling for a script that would do this, but haven't been able to figure it out. I tried using the regular expression < [^>]*> in TextWrangler with the "use Grep" option, but that doesn't seem to work either. I don't want to have to pay $25 for something like TextSoap. I have an AppleScript that will make the clipboard plain text. The ideal would be something similar that I can invoke to remove tags from the contents of the clipboard. Thanks!

By: kcm

kcm — Mon, 31 Oct 2005 13:35:20 -0800

lynx -dump

By: sbutler

sbutler — Mon, 31 Oct 2005 13:43:37 -0800

I don't know if it's a MeFi typo, but your regex has a space in it. Also, it needs to account for the closing tag, so try: </?[^>]*>. That works for me in TW (also make sure to click "Start at Top").

By: al_fresco

al_fresco — Mon, 31 Oct 2005 13:47:16 -0800

lynx -dump
posted by kcm

Can you explain how I would use that? I don't know Unix (or any other language, for that matter), but I can usually Google and figure out what I need on a task-by-task basis. Context?

By: sbutler

sbutler — Mon, 31 Oct 2005 13:51:10 -0800

(actually, I guess it already did account for the closing tag. since things worked, I assume it was the space)
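
For anyone who would rather script this than use the editor's Grep dialog, here is a minimal sketch of the same substitution in Python. The sample string and the variable names are made up for illustration. It shows that the pattern with and without the optional slash behave the same, and that the version with the stray space matches nothing.

```python
import re

# The pattern from the question with the stray space removed.
TAG = re.compile(r"<[^>]*>")
# The variant with an explicit optional slash for closing tags.
TAG_SLASH = re.compile(r"</?[^>]*>")
# The pattern as originally posted, with a space after "<".
TAG_SPACE = re.compile(r"< [^>]*>")

# Hypothetical sample input.
html = "<p>Hello, <b>world</b>!</p>"

# [^>]* already matches the "/" in a closing tag, so both patterns agree.
print(TAG.sub("", html))
print(TAG_SLASH.sub("", html))

# No tag in the sample starts with "< ", so nothing is removed here.
print(TAG_SPACE.sub("", html))
```

The first two calls print the bare text; the third prints the input unchanged, which matches what happened in TextWrangler.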

By: al_fresco

al_fresco — Mon, 31 Oct 2005 13:53:49 -0800

Thanks sbutler. That did the trick! I don't know how the space got there in my regex.

Keep the answers coming, though. I'd just as soon find a few ways to do this.

By: unixrat

unixrat — Mon, 31 Oct 2005 13:54:53 -0800

Lynx doesn't appear to be standard with OS X.

However, if you installed it (or were on a Linux box), you could do:

lynx -dump http://yoursite.com

and it would print the website on stdout, formatted as 'text only'. (I prefer not to have the 'list of links' for each page dumped, so I include -nolist as one of the options.)

By: unixrat

unixrat — Mon, 31 Oct 2005 13:55:44 -0800

^^^ (While running 'terminal' on your OS X box. You need to be on a command line to run lynx.)

I did have that in there, I swear.

By: adamrice

adamrice — Mon, 31 Oct 2005 14:26:28 -0800

Call me crazy, but wouldn't the simplest method for this be to:

1. View the web page in the browser of your choice;
2. Select all;
3. Copy;
4. Go into the text editor of your choice;
5. Paste.

Not so great if you need to do this in batch mode, I guess.

By: werty

werty — Mon, 31 Oct 2005 14:28:28 -0800

In BBEdit, you can simply select all, then choose Remove Markup. Voila--no more tags. Not certain if TextWrangler has a similar feature.

By: al_fresco

al_fresco — Mon, 31 Oct 2005 14:44:46 -0800

Call me crazy, but wouldn't the simplest method for this be to:
posted by adamrice

This would work fine if I wanted to just grab the text of my whole page. I'm sending out an HTML-formatted email newsletter, though, in an OS X app called Newsletter. I format the HTML portion the way I want it (which is different from my website), and then I want to clean the tags out for the Plain-text Alternative. Make sense? Thanks, though!

By: ralawrence

ralawrence — Mon, 31 Oct 2005 14:45:30 -0800

It is worth pointing out that the above regexp will fail when the HTML has < and/or > in the middle of a comment or inside the value of an attribute.

I've not seen many sites that do this though.
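
To make that failure concrete, here is a rough sketch comparing the regexp with a real HTML parser. It uses Python's standard html.parser module, which is a swapped-in technique and not something anyone in this thread used. The sample markup and the names TextOnly and strip_with_parser are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes only; tags and comments are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_with_parser(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical input with ">" inside a comment and inside an attribute value.
html = '<!-- a > b --><img alt="x > y" src="pic.png">Hi'

# The regexp stops at the first ">" it sees, so pieces of the comment
# and of the attribute are left behind in the output.
regex_result = re.sub(r"<[^>]*>", "", html)

# The parser knows where comments and quoted values end.
parser_result = strip_with_parser(html)

print(repr(regex_result))
print(repr(parser_result))
```

For markup you wrote yourself, the regexp is probably fine. The parser only starts to matter when the input is less predictable.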

By: al_fresco

al_fresco — Mon, 31 Oct 2005 14:46:12 -0800

In BBEdit, you can simply select all, then choose Remove Markup. Voila--no more tags. Not certain if TextWrangler has a similar feature.
posted by werty

If it's there, I'm not finding it.

By: Dick Paris

Dick Paris — Mon, 31 Oct 2005 14:49:56 -0800

Excuse my nearsightedness, but I don't see how the "remove markup" command works here. Seems to leave much behind. What I do wonder though is why one would not just copy and paste the text from a web browser or BBEdit (or TextWrangler, if available) preview? (Oops! I see A. Rice has the same question.)

Thanks from me as well for the Grep expression though. :-)

By: Dick Paris

Dick Paris — Mon, 31 Oct 2005 14:51:03 -0800

Another victim of live preview!

By: al_fresco

al_fresco — Mon, 31 Oct 2005 14:59:24 -0800

adamrice & Dick Paris:

Actually, I just realized that Newsletter's Preview window works just like a browser in this regard, so I could have just selected the contents of my preview and gotten the results I wanted. Not as sexy, but gets the job done.

Thanks, all.

By: AmbroseChapel

AmbroseChapel — Mon, 31 Oct 2005 15:16:01 -0800

For the moment? Download a demo of BBEdit. Use its "Translate" command (Under Markup/Utilities) with the "Translate HTML to Text" options.

That will do a great job.

For the longer term, you probably need to write a script to do this, because some tags just need to be removed and others need to be replaced so that paragraphs, etc., are preserved.

You might want to translate HTML headings to ***TEXT*** for instance, to get as much impact as you can from plain text.
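
A script along those lines could look something like the sketch below, written in Python with the standard html.parser module. The list of block-level tags, the choice to uppercase heading text, and the names NewsletterText and html_to_text are all assumptions for illustration, not a finished converter.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumed set of tags that should end with a blank line.
BLOCKS = HEADINGS | {"p", "div"}


class NewsletterText(HTMLParser):
    """Replaces some tags with plain-text markers and drops the rest."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
            self.out.append("***")
        elif tag == "br":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False
            self.out.append("***")
        if tag in BLOCKS:
            self.out.append("\n\n")

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser would.
        data = re.sub(r"\s+", " ", data)
        self.out.append(data.upper() if self.in_heading else data)


def html_to_text(html):
    parser = NewsletterText()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.out).split("\n")]
    # Never leave more than one blank line between paragraphs.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


# Hypothetical newsletter fragment.
html = "<h1>October News</h1>\n<p>Hello, <b>world</b>!</p>\n<p>See you soon.</p>"
print(html_to_text(html))
```

Inline tags such as b are simply dropped, while headings and paragraphs keep their separation, which is the distinction being made above.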

By: holloway

holloway — Mon, 31 Oct 2005 23:41:29 -0800

The advantage of Lynx or Links (text browsers themselves) is that they'll reproduce structure (headings / tables / lists) in plain-text. So lists get bullets as *s, and tables are framed using + and | and - characters, images get ALT text, and stuff like that.
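
If Lynx is not installed, a small part of that behaviour can be approximated in a script. The sketch below, again using Python's html.parser, turns list items into * bullets and replaces images with their ALT text. The square brackets around the ALT text and the names StructureText and render are my own assumptions; this is not a reproduction of Lynx's actual output format, and tables are not handled.

```python
from html.parser import HTMLParser


class StructureText(HTMLParser):
    """Keeps a little structure: bullets for list items, ALT text for images."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.out.append("\n* ")
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                # Bracket style is an assumption made for this sketch.
                self.out.append("[" + alt + "]")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)


def render(html):
    parser = StructureText()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


# Hypothetical input.
html = '<ul><li>Apples</li><li>Pears</li></ul><img src="logo.png" alt="Company logo">'
print(render(html))
```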