How do I convert text to bare-bones HTML?
November 29, 2005 11:10 AM

How do I convert basic text formatting (italics, bold, underline, superscript, etc.) into HTML formatting on a semi-automated basis?

Many of my clients' websites are CMS-based, much like blogging software, allowing them to easily add new articles, update pages, etc. They don't need to know about paragraph tags, break tags or any of the document-level HTML tags. But they do need to insert character-formatting tags, like em, strong, and so on. A clever UI, with "bold" and "italic" buttons means that they don't need to know HTML in order to mark these up.

When porting large amounts of information, such as a twenty-page Word document, pasting the text inside of a textarea loses the formatting, and so somebody must go through and laboriously mark up the text with HTML to match the formatting of the original document. This is impractical and error-prone.

I've tried programs like wvWare and I've tried saving the original content as HTML and then running it through HTML Tidy, but I've had no luck. They create webpages. I just want the inline markup converted, with no block-level or page-level tags.

I figure that this can either happen by parsing an RTF file or through some JavaScript or OS-level magic, based on the text in the clipboard. This must be a common need for anybody building a CMS, and yet I can't find any solutions to the problem. Is there any widget (Flash, Java, whatever) into which I can paste formatted text and it will retain that formatting and generate HTML? Some command-line application that will do the same? Or do I need to -- god help me -- write my own PHP-based RTF parser?
posted by waldo to Computers & Internet (19 answers total)
Check out FCK editor. Works in IE and Firefox, and can paste directly from MS Word.
posted by scottreynen at 11:24 AM on November 29, 2005


1. Fog Creek's CityDesk will output xhtml from text edited in a MS Word/Outlook-like rich-text environment. So they can paste from Word into CityDesk, click the HTML tab, and copy the xhtml into the textarea. However, it would include things like <br /> and <p> tags.

2. Not exactly what you want: Textile and Markdown let you write attractive plaintext that is translated into xhtml by text formatting plugins available for most CMSes, although that would involve training your clients.
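
To give a feel for the idea, here is a toy sketch in Python of turning asterisk-style plaintext into inline tags. It is not Textile or Markdown, just an illustration; the function name and the two rules it supports are assumptions made up for this example.

```python
import html
import re


def inline_markup_to_html(text):
    """Toy converter: **bold** -> <strong>, *italic* -> <em>.

    Hypothetical helper for illustration only; real Markdown and
    Textile processors handle far more cases than this.
    """
    # Escape &, < and > first so pasted text cannot inject its own tags.
    escaped = html.escape(text, quote=False)
    # Handle the double-asterisk form first so it is not read as two italics.
    escaped = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", escaped)
    escaped = re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped)
    return escaped


if __name__ == "__main__":
    print(inline_markup_to_html("a **b** and *c* <d>"))
```

Note that it only emits inline tags, never paragraph or page-level markup, which is the part that matters here.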
posted by evariste at 11:29 AM on November 29, 2005

Or what scottreynen said, which answers your question a lot better than my suggestions.
posted by evariste at 11:29 AM on November 29, 2005

How about using the midas editor control, like this? It works in IE and Firefox and Safari. You can paste in formatted text and get HTML out of it.
posted by smackfu at 11:30 AM on November 29, 2005

waldo, you don't want to write your own RTF parser. You don't even want to deal with an RTF token stream.

All these WYSIWYG editors produce dreadful HTML. Looks worse than the cruft that Word produces.
posted by scruss at 11:40 AM on November 29, 2005

Response by poster: Yeah, I didn't really want to say it, scruss, but...yeah. I mean, these form-based editors are definitely pretty neat -- they produce code that faithfully mirrors the line height, font specifications, and word spacing of the original. I'd have to write a whole other program just to strip all that stuff out.

I just want bold, italics, underline, subscript, and superscript conversion, or a similarly stripped-down level of conversion. I'd think this would be a pretty common need, with the newfound popularity of blogging software.
posted by waldo at 11:45 AM on November 29, 2005

waldo: There's a difference between WYSIWYG editors, and javascript tools that just enable WYSIWYG text input in an HTML textbox... Some of those might generate much cleaner HTML code...
posted by twiggy at 12:05 PM on November 29, 2005

It looks like those WYSIWYG input boxes are just taking down the HTML instance of what's in the clipboard. The source is ugly because Word, OpenOffice, or whatever is actually generating the HTML, not the browser, so you can't really improve on that client-side. What you can do is run things through HTML Tidy on the server once it's submitted to make things sane again.
posted by zsazsa at 12:18 PM on November 29, 2005

Best answer: I thought FCK editor put out decent code.

Tidy with the show-body-only option should help.
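
A rough sketch of calling Tidy that way from a server-side script, here in Python. It assumes the tidy binary is installed and on the PATH; the function names are made up for this example, and --wrap 0 is added only to stop Tidy from re-wrapping lines.

```python
import subprocess


def tidy_command():
    """Build the Tidy command line (hypothetical helper name)."""
    # --show-body-only yes: print just the contents of <body>, no page wrapper.
    # --wrap 0: disable line wrapping.
    return ["tidy", "-quiet", "--show-body-only", "yes", "--wrap", "0"]


def tidy_fragment(html_text):
    """Pipe submitted HTML through Tidy and return the cleaned body fragment."""
    result = subprocess.run(
        tidy_command(),
        input=html_text,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 1 when it only issued warnings and 2 when it hit errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout
```

The output will still contain block-level tags such as <p>, so it needs a tag-stripping pass afterwards.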
posted by If I Had An Anus at 12:27 PM on November 29, 2005

"All these WYSIWYG editors produce dreadful HTML. Looks worse than the cruft that Word produces."

Not true. We use FCK in a custom CMS and it works fine. It gives you fine-grained control over what you can do. It requires a bit of work to customize it, but it lets you customize it, and in a well-thought-out way. It lets you choose which buttons you expose to your users (so you could easily define a toolbar with just the controls you want).

The only downside is you need to strip out all the extraneous tags yourself, but that's a solved problem you can find on the net in most every language (except for the one we had to work in, ASP/VBScript). The basic idea is you pass the submitted content through a regex filter that strips out anything that looks like an HTML tag unless it matches a set of tags you allow (you should also consider allowing some tags by themselves and some tags with attributes).
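
A minimal sketch of that kind of filter, in Python rather than the ASP/VBScript mentioned above. The allowed-tag set, the b/i aliases, and the function name are all assumptions for illustration; this version drops every attribute, and being regex-based it will not cope with HTML comments or attribute values that contain ">".

```python
import re

# Assumption: the inline tags the CMS should keep.
ALLOWED_TAGS = {"em", "strong", "u", "sub", "sup"}

# Assumption: map presentational tags onto the allowed equivalents.
ALIASES = {"b": "strong", "i": "em"}

# Matches anything that looks like an opening or closing HTML tag.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9:]*)[^>]*>")


def strip_disallowed_tags(html_text):
    """Remove every tag not in ALLOWED_TAGS; keep the text between tags."""

    def replace(match):
        closing = match.group(1)
        name = match.group(2).lower()
        name = ALIASES.get(name, name)
        if name in ALLOWED_TAGS:
            # Re-emit the bare tag so styles and classes are discarded.
            return "<" + closing + name + ">"
        return ""

    return TAG_RE.sub(replace, html_text)


if __name__ == "__main__":
    pasted = (
        '<p class="MsoNormal"><span style="font-size:12pt">Some '
        '<strong style="x">bold</strong> and <i>italic</i> '
        "H<sub>2</sub>O</span></p>"
    )
    print(strip_disallowed_tags(pasted))
```

Allowing "some tags with attributes" would mean keeping a per-tag list of permitted attribute names instead of discarding them all.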
posted by yerfatma at 12:29 PM on November 29, 2005

TinyMCE (similar to FCKEditor) works pretty well, and you can configure it to produce clean code.
posted by kirkaracha at 12:32 PM on November 29, 2005

Response by poster: This WYSIWYG-meets-Tidy-meets-strip_tags option sounds pretty compelling. Hideous. But compelling. :)
posted by waldo at 12:43 PM on November 29, 2005

This has a little more than you want, but the code is all public domain so you can make changes if you'd like.

It even won some sort of contest.
posted by miniape at 1:17 PM on November 29, 2005

The Atlantis Word Processor is a very capable RTF editor which produces nice clean HTML when you choose "Save as web page".
posted by yclipse at 3:24 PM on November 29, 2005

Response by poster: CrayDrygu, I'm starting to think that you're right. It's not pretty, and it will still require post-processing with strip_tags (or regex, as you point out), but it may well work. I'm playing with it now. It may be a good 90% solution, which is better than the 0% where I'm at now. :)

This makes me want to learn to write Firefox plugins, just so I can solve this problem for good.
posted by waldo at 4:54 PM on November 29, 2005

Text 2 HTML freeware

I HATE HTML Tidy. It turns my nice, neat hand-coding into this unreadable garbage.
posted by IndigoRain at 11:06 PM on November 29, 2005

Heh, center tags are funny.
posted by If I Had An Anus at 6:19 AM on November 30, 2005

Response by poster: IndigoRain, you're using HTML Tidy wrong. It's a powerful tool, but not if you skip reading the manual. :) You want to use the --wrap 0 flag.
posted by waldo at 10:53 AM on November 30, 2005

Perhaps I have been unfair. I shall read the manual and re-evaluate.
posted by IndigoRain at 10:10 PM on November 30, 2005
