How do I strip .docx of its styling but retain semantic markup?
March 13, 2009 8:25 AM

How can I automatically strip .doc and .docx files of their styling but retain any semantic markup, with the aim of putting them online?

I'm creating a wiki-like app to allow the members of an organization to update some help files stored online. I've found that what they will typically try to do is copy text from a Word document straight into the text area, which strips it of all of its styling and semantic markup.

Therefore, I tried using TinyMCE, which changes the textarea into a rich text editor. Unfortunately, this results in the Word styling overwriting the site styling.

I think what I need can best be illustrated by an example: if a Word document contains a paragraph, I need that paragraph wrapped in <p> tags and stripped of any styling so that I can then style that paragraph as I like using CSS. I then need the same for headings, lists, images, tables, etc.
posted by Fluffy654 to Computers & Internet (6 answers total) 4 users marked this as a favorite
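
One way to picture what the question asks for, as a rough sketch rather than a finished converter: a .docx file is a zip archive whose main text lives in word/document.xml, so the paragraph style names can be read directly and mapped to plain tags. The STYLE_TO_TAG mapping and the docx_to_html name below are made up for illustration, the style IDs assume Word's default English styles, and lists, tables, images and the older binary .doc format are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical mapping from Word paragraph style IDs to HTML tags.
# These IDs are an assumption; custom templates may use other names.
STYLE_TO_TAG = {"Title": "h1", "Heading1": "h1", "Heading2": "h2", "Heading3": "h3"}


def docx_to_html(path_or_file):
    """Return bare HTML for each paragraph: a tag plus text, no styling."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    parts = []
    for para in root.iter(W + "p"):
        # Join the text runs; fonts, colours and sizes are simply ignored.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if not text.strip():
            continue  # skip empty paragraphs
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val") if style is not None else None
        tag = STYLE_TO_TAG.get(style_id, "p")  # anything unknown becomes <p>
        parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(parts)
```

Calling docx_to_html("help.docx") (the file name is just an example) would give one line of HTML per non-empty paragraph, which the site stylesheet can then style however it likes.
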
Doesn't TinyMCE have a Paste From Word button that does basically what you're asking?
posted by bricoleur at 8:37 AM on March 13, 2009

Rich text cut and pasting is actually pretty useful in Windows, as long as you don't then paste into another M$ app that retains all of Word's crappy, wasteful markup. I've had good luck with the following:
  • Select and copy from Word; paste into Dreamweaver. This keeps the most basic of semantic markup, but strips away a lot of the crap. YMMV.
  • Character Cleaner. A great little tool that I use almost daily, but I suppose it isn't great for batch work.
Is that what you're after?
posted by tapesonthefloor at 8:56 AM on March 13, 2009

Blimey, I'm surprised to see my ancient Character Cleaner get a mention; I now feel bad that I didn't finish the rich text version that would clean up messy markup pasted into it.

There are lots of server- and client-side tools out there aimed at tidying up the kind of mess Word makes (search for things like 'word cleaner' and 'html cleaner'). For a bulletproof solution you'll need something that parses and reconstructs on the server side, such as HTML Tidy (not sure what state that's in nowadays), but you might be able to configure TinyMCE to do a reasonably good job. Have you played around with removing some tags/attributes from valid_elements?
posted by malevolent at 9:26 AM on March 13, 2009 [1 favorite]
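
The "parse and reconstruct on the server side" idea above could look roughly like the sketch below: parse whatever was pasted, keep only a whitelist of semantic tags, and drop every style, class and Word-specific element. It uses only Python's standard html.parser; the ALLOWED and KEEP_ATTRS sets and the clean function are illustrative choices, not anything TinyMCE or HTML Tidy defines, and this is not a security-grade sanitizer (it does not vet link targets, for instance), so a vetted library is still the safer route for untrusted input.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelist: tags to keep and the few attributes they may carry.
ALLOWED = {"p", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td",
           "th", "strong", "em", "a", "img", "br"}
KEEP_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
VOID = {"br", "img"}               # tags with no closing tag
DROP_CONTENT = {"style", "script"}  # drop these tags and their contents


class WordCleaner(HTMLParser):
    """Rebuild pasted HTML from scratch, keeping only whitelisted markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED:
            return  # e.g. <span>, <font>, <o:p>: tag dropped, text kept
        allowed_attrs = KEEP_ATTRS.get(tag, ())
        kept = [(k, v) for k, v in attrs if k in allowed_attrs and v is not None]
        attr_text = "".join(f' {k}="{escape(v)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data))


def clean(html_text):
    parser = WordCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Run over a typical Word paste, this keeps the paragraph, heading, list and table structure and throws away the inline styles and MsoNormal-type classes, leaving the site's own CSS in charge of appearance.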

Hey, it's simple and it works. Thanks, Mal!

Of course, if I had re-read TFQ a couple more times I would've realized it isn't appropriate at all for Fluffy's needs... my bad.
posted by tapesonthefloor at 9:32 AM on March 13, 2009

Just realised that I forgot to mention HTML Purifier. I haven't used it myself, but it looks good and if your wiki is PHP-based then it might be fairly straightforward to integrate.
posted by malevolent at 11:24 AM on March 13, 2009

I don't really 100% understand what you're asking, but I use Wimba Create (aka courseGenie) a great deal to convert Word documents into clean HTML that is tagged and ready for CSS. The program is a plugin to Word. Once you start it, it gives you access to a number of its own styles that you use to tag the document, essentially. You'd use the cgPageTitle style to tell it the title of pages and when pages begin, you'd use cgHeading to indicate a heading, cgBodyText for the regular text, etc. It comes with a selection of CSS, but you can plug your own in easily.

Not sure if this solves your problem, but it saves my life on a daily basis by allowing quick conversion of Word documents to pretty clean HTML.
posted by Jupiter Jones at 8:33 AM on March 14, 2009
