How do I strip *some* formatting from text?
November 7, 2010 1:43 PM

Can I strip SOME formatting from text? I'd like to be able to strip all formatting except HTML links. Even better, I'd like to strip all formatting other than HTML links, bold, and italics. Are there any good workarounds for this?

I've tried stripping formatting with the Word solution (Control + Spacebar), but that removes all the links too. I've tried TextEdit, but making it plain text also strips all of the links.

Apologies if this isn't the proper terminology. I can clarify as needed.
posted by barnone to Computers & Internet (9 answers total)
 
Welcome to the wonderful world of regular expressions!

What you want to do is to delete all HTML formatting except links. HTML formatting happens with tags. We need to be able to identify non-link HTML tags and remove them.

We need a regular expression that matches non-anchor tags (this one from Stack Overflow looks good), and a text editor that understands regular expressions (since you mentioned TextEdit, you're probably on a Mac, and so you'll want TextWrangler).

Use TextWrangler to search for anything that matches the regular expression, and replace it with nothing.
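
For anyone who would rather script the same search-and-replace, here is a minimal sketch in Python. The pattern is an assumption written for illustration, not the Stack Overflow expression mentioned above, and the function name strip_tags_except is made up. It matches any opening or closing tag whose name is not in the keep list and replaces it with nothing. Like any regex approach, it can trip on a literal ">" inside an attribute value.

```python
import re

def strip_tags_except(html, keep=("a",)):
    """Remove every HTML tag except those named in `keep` (hypothetical helper)."""
    allowed = "|".join(re.escape(tag) for tag in keep)
    # "<" or "</", then a tag name that is NOT one of the allowed names
    # (the lookahead requires the allowed name to end at whitespace, ">" or "/",
    # so keeping "a" does not also keep "abbr"), then everything up to ">".
    pattern = re.compile(r"</?(?!(?:%s)[\s>/])[a-zA-Z][^>]*>" % allowed,
                         re.IGNORECASE)
    return pattern.sub("", html)

sample = '<p style="x">See <a href="http://example.com">this</a> <b>now</b></p>'
print(strip_tags_except(sample))
# prints: See <a href="http://example.com">this</a> now
```

Passing keep=("a", "b", "i") would leave bold and italic tags in place as well as links.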
posted by zamboni at 2:17 PM on November 7, 2010


Generally speaking, regular expressions can't reliably parse non-regular languages (HTML/XML/SGML), but in this case you've got a very simple match that would work. I begrudgingly recommend zamboni's approach.

If, however, you have anything even slightly more complex, it's probably worth just using an HTML tidier to turn the document into XHTML, and then filtering that with XSLT.
posted by holloway at 2:28 PM on November 7, 2010


What exactly are you doing? Are you writing a piece of software that will do this (that will be used by other people)? If so, you need a proper HTML parser. Use Google to find one for your programming language.
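
As one illustration of the parser route, here is a rough sketch using the HTMLParser class from Python's standard library. The class name LinkKeepingCleaner, the helper clean, and the default keep list are all assumptions made up for this example. It keeps only the listed tags, drops every attribute except a link's href, and throws away the contents of style and script blocks, which is where Word tends to hide its invisible formatting.

```python
from html import escape
from html.parser import HTMLParser

class LinkKeepingCleaner(HTMLParser):
    """Rebuilds the text, keeping only the tags in `keep` (hypothetical class)."""

    def __init__(self, keep=("a", "b", "i")):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []
        self._skip = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1
        elif tag == "a" and tag in self.keep:
            # Keep the link target, drop style/class/etc.
            href = dict(attrs).get("href")
            if href:
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<a>")
        elif tag in self.keep:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            if self._skip:
                self._skip -= 1
        elif tag in self.keep:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip:
            # Re-escape &, < and > so the output is still valid HTML text.
            self.out.append(escape(data, quote=False))

def clean(html, keep=("a", "b", "i")):
    parser = LinkKeepingCleaner(keep)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)

messy = ('<style>p {color: red}</style><p class="MsoNormal">'
         '<span style="font-family:Arial">Read <a href="http://example.com" '
         'style="color:red">this</a> <b>now</b> &amp; <u>later</u></span></p>')
print(clean(messy))
```

Rebuilding the output from parser events, instead of deleting matches from the input, means stray attributes and malformed tags never reach the result.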

If it's just for your personal use, then try something like Notepad++ (open-source text/code editor) and use the regular expressions zamboni recommends. You can even record it as a macro to save yourself the hassle; that way you can simply run it from a menu item.
posted by spiderskull at 2:40 PM on November 7, 2010


Response by poster: I might not be techy enough for this... it's just for personal use.

I'm given text in various formats (from a blog, from a wiki, from MS Word) and it's got tons of invisible formatting. The text is then posted to our (Drupal) website. Our website text entry is really bizarre and the invisible formatting is really impossible to fix at that stage.

Thanks for the suggestions. I've downloaded TextWrangler and will give it a go.
posted by barnone at 2:58 PM on November 7, 2010


You want to try Bean!

Download Bean, paste in the text, then select some text in one of the styles you wish to change and go to Style > Select By > Font Style.

It will select all the text in that style, and you can then change it to whatever you want.

Repeat as needed.
posted by 47triple2 at 3:15 PM on November 7, 2010


The simplest way I can think of to strip HTML, except for some allowed tags, is PHP's strip_tags() function, which does exactly that.

strip_tags ( string $str [, string $allowable_tags ] )

For example, strip_tags($text, '<a><b><i>') removes every tag except links, bold and italics.
posted by AmbroseChapel at 8:45 PM on November 7, 2010


"I'm given text in various formats (from a blog, from a wiki, from MS Word) and it's got tons of invisible formatting. The text is then posted to our (Drupal) website."

Have you asked over in the Drupal forums? I bet you're not the first Drupal user to be vexed with this.
posted by exphysicist345 at 10:05 PM on November 7, 2010


I use Textism's Word HTML cleaner to format all kinds of text into nice, neat HTML for pasting into webpages.
posted by bristolcat at 8:41 AM on November 8, 2010


Response by poster: This tool called BlogAssist actually works really well. I can input crazy text with tons of fonts and weird formatting, and it'll reformat it for blogging -- keeping the HTML links intact but removing everything else. I've only used it for a day but so far it seems to be working well.
posted by barnone at 11:15 AM on November 12, 2010

