Plain ole HTML plz.
November 30, 2005 10:33 AM

My company has thousands of horribly mangled Word documents that need to be flipped into a wiki. Is there anything that really, truly strips the MS proprietary tags out?

I've tried saving as Filtered but that's been no help, and there's no way I can do a find/replace on all these darn documents. Please don't suggest Dreamweaver because the company is too effing cheap to buy it, and even getting a free piece of software installed takes an act of God, so self-extracting solutions would be great if possible.
posted by Vaska to Computers & Internet (18 answers total)
 
Can you create a Word macro to open all the .doc documents in a directory and save them out as plain text or HTML?
posted by clarahamster at 10:39 AM on November 30, 2005


Response by poster: No, I'm locked out from even macros in our system (our IT are crazy) and I can't save them out as Plain Text because there are a lot of tables in every document. All of MS Office's 'save as html' features give me way, way too many proprietary tags that make the wiki variously cry and/or explode.
posted by Vaska at 10:45 AM on November 30, 2005


Best answer: I used to use HTML Tidy but haven't for a couple of years. Looks like it's been adapted into a Sourceforge project.
posted by matildaben at 10:47 AM on November 30, 2005


Demoroniser.
posted by jellicle at 10:51 AM on November 30, 2005


Dean Allen has an online tool that will turn MS Word HTML into HTML. Not sure if that fits the bill.
posted by yerfatma at 10:53 AM on November 30, 2005


Response by poster: Jellicle: I've checked out Demoroniser, and I'd love to use it, but it requires having Perl installed on one's system which I cannot do. Is there a self-contained version or way of doing it anywhere?
posted by Vaska at 10:53 AM on November 30, 2005


Response by poster: Yerfatma: Yeah I've tried that but the free version is horribly limited and my cheapass company won't pay for a subscription even though it'll save hundreds of man hours. Sigh. Thank you though.
posted by Vaska at 10:54 AM on November 30, 2005


I second the recommendation of Tidy as the first, cheapest, and best answer to your problem. If Tidy doesn't produce the results you want, then you'll have to try something else.
posted by Ethereal Bligh at 11:07 AM on November 30, 2005


Tell your IT folks you need to install cygwin. This gives you Perl and hundreds of other tools.
posted by orthogonality at 11:18 AM on November 30, 2005


Best answer: Specifically, in HTML Tidy, you need to use these flags:

--word-2000 yes --bare yes --clean yes

I've never used it for a task as gnarly as yours, but it's always done the trick.
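For thousands of files you'd want to loop rather than type that by hand. Here is a rough Python sketch of how the batch run might look. It is not something from this thread, and it assumes the documents have already been saved from Word as .htm/.html (Tidy does not read .doc), that a tidy executable is on the PATH, and that Python is available. The function names, folder layout, and the -q and -o options are my own choices; only the three flags above come from this post.

```python
import subprocess
from pathlib import Path

# The three options quoted in the thread; everything else here is an assumption.
TIDY_FLAGS = ["--word-2000", "yes", "--bare", "yes", "--clean", "yes"]


def build_tidy_command(source: Path, dest: Path, tidy_exe: str = "tidy") -> list[str]:
    """Return the argument list for one Tidy run (-q: quiet, -o: output file)."""
    return [tidy_exe, "-q", *TIDY_FLAGS, "-o", str(dest), str(source)]


def clean_folder(in_dir: str, out_dir: str, tidy_exe: str = "tidy") -> list[Path]:
    """Run Tidy on every .htm/.html file in in_dir; return the files Tidy reported errors for."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for source in sorted(Path(in_dir).iterdir()):
        if source.suffix.lower() not in {".htm", ".html"}:
            continue
        result = subprocess.run(build_tidy_command(source, out / source.name, tidy_exe))
        # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors); only 2 is treated as a failure.
        if result.returncode >= 2:
            failed.append(source)
    return failed
```

Something like clean_folder("word_html", "tidied") would leave the originals alone and write the cleaned copies into a second folder, which seems safer than overwriting in place until you trust the output.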
posted by waldo at 11:26 AM on November 30, 2005


Take the docs home on portable media (CD, whatever), install and run some sensible cleanup tool like HTML Tidy on them at home, then take them back to work all pretty and shiny. Or, if this is possible, connect to your work system from home and eliminate the silly "portable media" part of this.

But I don't understand: if there is a business reason for cleaning up these docs, and if HTML Tidy is the right tool for the right price, why can't you force IT to have it installed on your computer or force them to clean up the files themselves?

--word-2000 yes --bare yes --clean yes

and his heart was going like mad and yes I said yes I will Yes.
posted by pracowity at 11:36 AM on November 30, 2005


Response by poster: orthogonality pracowity : I can't install anything, even with a business reason, because the IT department in this company isn't run sanely. It's a long story but the short version is the company is run by a fear complex that strangles things. I got reprimanded for downloading NVU of all things.

matildaben Ethereal Bligh waldo : Managed to get Tidy and that seems to do most of the work I need. Still leaves all the class styles though darnit. Thanks for the tip though.
posted by Vaska at 12:50 PM on November 30, 2005


If you end up needing more flexibility (which you probably won't), I wrote a webservice called Docvert that converts Doc files to OpenDocument and then to any HTML or XML. Once in OpenDocument the whole process is open so you can use XSLT to maintain any particular tags or style convention you have.
posted by holloway at 12:57 PM on November 30, 2005


Still leaves all the class styles though darnit.

You should be able to do that kind of removal with a decent Programmer's Text Editor. What OS are you on? You can use JEdit on practically any platform I'd have thought.

But come to think of it, if you only need to do this once, you can use Dreamweaver, because there's a 30-day fully-functional trial of all Macromedia products.
posted by AmbroseChapel at 3:04 PM on November 30, 2005


... but it needs to be installed.
posted by blag at 3:52 PM on November 30, 2005


But he's installed HTML Tidy, hasn't he? -- I'm not sure what's going on right now. His IT department are insane, that's a given, so maybe he's working at home. That's what I'd do.
posted by AmbroseChapel at 5:22 PM on November 30, 2005


You don't need to install HTML Tidy - it's a command-line executable.

To strip out the class styles, try TextRep - it replaces text across multiple files and doesn't need installing. In fact, it's just a single executable. I can email it to you if you don't want to risk downloading it at work.
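If a scripting language is an option, the same multi-file replacement can be approximated with a regular expression. This is only a sketch under my own assumptions: it is not how TextRep works internally, the function names are made up, and it only removes class attributes inside opening tags (style attributes could be handled the same way). Files are decoded as latin-1 purely so that every byte round-trips unchanged, whatever the real encoding is.

```python
import re
from pathlib import Path

# A class attribute with a double-quoted, single-quoted, or unquoted value.
CLASS_ATTR = re.compile(r"""\s+class\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)
# An opening tag, so that text outside of tags is left alone.
OPEN_TAG = re.compile(r"<[A-Za-z][^<>]*>")


def strip_class_attributes(html: str) -> str:
    """Remove class attributes from every opening tag in the given HTML text."""
    return OPEN_TAG.sub(lambda m: CLASS_ATTR.sub("", m.group(0)), html)


def strip_directory(folder: str) -> int:
    """Rewrite each .html file in folder in place; return how many files changed."""
    changed = 0
    for path in Path(folder).glob("*.html"):
        # latin-1 maps every byte to one character, so decode/encode is lossless.
        original = path.read_bytes().decode("latin-1")
        cleaned = strip_class_attributes(original)
        if cleaned != original:
            path.write_bytes(cleaned.encode("latin-1"))
            changed += 1
    return changed
```

Run it on copies first. A regex is not a real HTML parser, so odd markup (a ">" inside an attribute value, for instance) could slip through or be mangled.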
posted by blag at 6:12 PM on November 30, 2005


I'd start documenting all the ways the IT department makes my life more difficult and the number of manhours I'd save if they were more cooperative (and, from what it sounds like, competent). After I had plenty of evidence, I'd bring this documentation to management and suggest that they fix the IT department.

I'd also have a better job (money and/or satisfaction) lined up when I did that, though. YMMV. :)

(I worked in IT and related fields for ~7 years; it annoys me when the people in IT forget that they're there to make everyone else's job easier, not harder.)

As far as running Perl on Windows, the two best options are installing Cygwin, as mentioned above, or installing ActivePerl. But of course, HTML Tidy is available as an executable, no Perl needed.
posted by cactus at 10:54 PM on November 30, 2005

