How do I batch convert Word .doc (not .docx) files to .txt?
January 12, 2019 10:55 AM

I'm feeling overly dependent on Microsoft. I have well over ten thousand files of my writing (almost all .doc files, not .docx) that, at the very least, I want to convert to .txt files. So yes, I am looking for a reasonably simple batch conversion strategy.

I say "reasonably simple" because I'm not a programmer. I can fumble around a bit with HTML but in general, think of me as an idiot, which is why this question and its answers aren't really working for me. It's also from nine years ago. I'm hoping there may be more options now.

If you think there's a better solution than just converting to .txt files, please weigh in, because I am aware I'm going to lose a lot of formatting. But I don't want to just switch one "dependence" for another. The key thing for me is that the words themselves be in a format that I will be able to easily access forever (or thereabouts).
posted by philip-random to Technology (17 answers total) 2 users marked this as a favorite
 
You can do this with libreoffice by running this command -

libreoffice --headless --convert-to txt "document.doc"
or, to convert all the files in the directory:
libreoffice --headless --convert-to txt *.doc

However, this will lose a lot of formatting. You may prefer to make copies as PDF as that may be a more faithful conversion -

libreoffice --headless --convert-to pdf "document.doc"

PDF is a pretty good archive format. It is widely implemented and will be readable for at least as long as the media you store the files on. For most doc files a conversion to PDF will be as faithful as printing the document in terms of formatting preservation, and the text will remain copy-able and searchable. It probably won't be easily editable though.

(Or perhaps exploring LibreOffice and its handling of MS-format .doc files will allay your fears enough to make conversion unnecessary.)
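
For ten thousand files spread over many folders, the commands above can be wrapped in a small script. What follows is only a rough sketch, not a tested recipe: it assumes Python is installed, that the LibreOffice program can be started as "soffice" (on Windows it usually lives in the LibreOffice "program" folder and may need its full path), and the function names are made up for illustration. It defaults to a dry run that only lists the commands it would execute.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, target="txt", binary="soffice"):
    """Return the LibreOffice command line for converting one file.

    "soffice" is an assumed program name; pass the full path if it is
    not on your PATH.
    """
    return [
        binary,
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def find_doc_files(root):
    """Yield every .doc file under root (case-insensitive, skips .docx)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".doc":
            yield path


def convert_tree(root, out_root, target="txt", dry_run=True):
    """Convert each .doc under root, mirroring the folder layout in out_root.

    Returns the list of commands. With dry_run=True nothing is executed
    and no folders are created.
    """
    root, out_root = Path(root), Path(out_root)
    commands = []
    for doc in find_doc_files(root):
        # Mirror the source folder so same-named files do not collide.
        out_dir = out_root / doc.parent.relative_to(root)
        cmd = build_convert_command(doc, out_dir, target)
        commands.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

Calling convert_tree("writing", "writing-txt") prints nothing and changes nothing; it just returns the planned commands so you can inspect them. Passing dry_run=False actually runs them. The originals are never modified, and starting LibreOffice once per file is slow, so a run over this many files could take hours.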
posted by samworm at 11:13 AM on January 12 [8 favorites]




You might also consider converting the file to the ODT format; these files are standard zip files containing XML, and should preserve the formatting better. Since ODT is a completely open format it is supported by a fairly large amount of software, including libreoffice and MS Office (2007+), and should be safe enough from an archival standpoint. In the (pretty unlikely) very worst case, you would still be able to crack open the file with an unzip tool and a text editor.
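
To make the "crack it open" worst case concrete, here is a minimal sketch of pulling the words out of an ODT file with nothing but Python's standard library. It assumes the usual ODT layout, where the document body lives in a file called content.xml inside the zip; the function name is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used for text elements in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(odt_file):
    """Return the text of each paragraph and heading in an ODT file."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")
    paragraphs = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text inside nested spans.
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

This is deliberately crude: tabs and manual line breaks are stored as separate elements and get dropped, and text inside frames nested in a paragraph may show up twice. The point is only that the words stay reachable without any office suite.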

If your formatting needs are simple enough, you might consider converting to HTML or RTF instead of plain text; these are both stored in a plain-text format that is extremely well supported, and can be viewed with a regular old text editor if you don't mind some formatting markup interspersed in it.

While PDF is fine for preserving the exact look of a document, if the text is your primary concern, I'd un-recommend it for that. Plain text can be embedded in PDF, but it doesn't have to be, and while it's a more-or-less open format, it's not particularly useful if you want to be able to view or edit the files without a PDF viewer or editor.

Seconding the above suggestions of using Calibre or libreoffice to do the work of converting the files. If the command-line isn't your thing, Calibre has a GUI front-end that should hopefully make the process easier for you.
posted by Aleyn at 11:35 AM on January 12 [2 favorites]


Are you on Windows or OSX?
posted by suedehead at 11:44 AM on January 12


Windows
posted by philip-random at 11:50 AM on January 12


and thanks so far -- I've got libreoffice downloaded and will be exploring that.

As for why not RTF? That had been my initial intention ... until I realized that it's a proprietary Microsoft product. Maybe I'm just being paranoid here. But my thinking is, if I'm going to go to the trouble of what amounts to a fundamental rethink of how I work with text, do I really want to commit to anything proprietary?
posted by philip-random at 11:56 AM on January 12


I'm echoing Aleyn to say HTML will be your best bet. Extremely portable and easy to use, likely to be around a long time, and will preserve most formatting. Calibre will let you convert in bulk.

PDFs are a pain in my opinion. RTF is OK but as you mention it's proprietary (and is no longer developed.)
posted by anadem at 12:00 PM on January 12 [2 favorites]


In fact you could even go to EPUB, which is just HTML zipped up, so it would be smaller and tidier while still being easy to read and easy to extract from. That's also simple with Calibre.
posted by anadem at 12:03 PM on January 12


ODT is certainly a good option. If they’re not layout heavy and you’re mostly looking for content you might also want to check out markup style options such as HTML or markdown.

Pandoc is a tool for converting between such formats. Here is a discussion regarding pandoc and .docx files, which should be leverageable for .doc files.
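
One caveat: pandoc reads .docx but not the older binary .doc format, so a .doc file needs an intermediate step. A hedged sketch of that two-step route, assuming LibreOffice can be started as "soffice" and pandoc as "pandoc" (both names, and the function name, are assumptions for illustration):

```python
from pathlib import Path


def doc_to_markdown_commands(doc_path, work_dir):
    """Return the two commands that turn one .doc file into Markdown.

    Step 1: LibreOffice converts .doc to .docx inside work_dir.
    Step 2: pandoc converts that .docx to a .md file next to it.
    """
    doc_path, work_dir = Path(doc_path), Path(work_dir)
    # LibreOffice names its output after the input file, with the new extension.
    docx_path = work_dir / (doc_path.stem + ".docx")
    md_path = work_dir / (doc_path.stem + ".md")
    to_docx = [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(work_dir),
        str(doc_path),
    ]
    to_markdown = [
        "pandoc", "--from", "docx", "--to", "markdown",
        str(docx_path), "--output", str(md_path),
    ]
    return [to_docx, to_markdown]
```

Each returned command can be run in order with subprocess.run(command, check=True). Swapping "markdown" for "html" in the second command would give HTML instead.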
posted by mce at 12:05 PM on January 12


You may decide against using anything proprietary (which is fine). You might also decide that officially-proprietary is okay as long as it's thoroughly documented and reverse-engineered (which would actually make .DOC usable, since there are various open-source programs that can read and write it).

I think it depends on how you use the files and plan to use them in the future. You might decide to keep the originals along with whatever formats you convert to, so that you've got a version that retains formatting along with the text.
posted by trig at 12:13 PM on January 12 [2 favorites]


If you've installed LibreOffice and your goal is to move away from Microsoft Word, then I don't see why you need to convert out of .doc format. Just use LibreOffice with your existing files.

Also, if you want a no-installation method and you don't mind your documents being on Google Drive temporarily, then I think you should be able to drag and drop your files into a folder on Google Drive and then use Google Takeout to export them.
posted by rdr at 12:56 PM on January 12 [1 favorite]


Cloudconvert.com. Free, convert anything to anything, and they do batch conversion. It's a file-select interface, so no knowledge required beyond the ability to use a file-browse dialog.

For the number of files you're talking about, they will ask you to sign up for an account, but the account is free...initially. You get a certain amount of conversion-minutes per day at the "free" level, so if you're okay with the process stretching over several months, you may never need to go to the paid version with this one.
posted by Tailkinker to-Ennien at 2:36 PM on January 12


You might decide to keep the originals along with whatever formats you convert to

I strongly recommend this option. Word processor documents are so minuscule compared to movies or even photos that having multiple copies will cost a negligible amount to store them, and as long as you have some workable way of keeping track of the authoritative version of anything you update, there's no sound reason to discard the originals.
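
If you do keep originals and converted copies side by side, a quick check that nothing was skipped is cheap insurance. A minimal sketch, assuming the converted files sit in a second folder that mirrors the layout of the originals (the folder arrangement and the function name are assumptions, not something the conversion tools enforce):

```python
from pathlib import Path


def missing_conversions(originals_root, converted_root, target_ext=".txt"):
    """List .doc originals that have no converted counterpart.

    For originals_root/a/b.doc the expected counterpart is
    converted_root/a/b.txt (or whatever target_ext is).
    """
    originals_root = Path(originals_root)
    converted_root = Path(converted_root)
    missing = []
    for doc in sorted(originals_root.rglob("*")):
        if not doc.is_file() or doc.suffix.lower() != ".doc":
            continue
        relative = doc.relative_to(originals_root)
        expected = (converted_root / relative).with_suffix(target_ext)
        if not expected.exists():
            missing.append(doc)
    return missing
```

An empty list means every original has a converted twin; anything else is a list of files to retry or convert by hand.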
posted by flabdablet at 1:35 AM on January 13 [1 favorite]


Also: on Windows, the best way to install any of the packages that Ninite supports is to use a Ninite installer. They do a sane default installation with no added foistware, no options and no Next Next Next, and you can re-run them at any convenient time to bring the package(s) they install up to date. Here's the Ninite installer for LibreOffice.
posted by flabdablet at 1:41 AM on January 13


If you've installed LibreOffice and your goal is to move away from Microsoft Word, then I don't see why you need to convert out of .doc format. Just use LibreOffice with your existing files.

This can work well, but you do need to accept that LibreOffice and MS Office will have different opinions about the correct way to render any document containing tables, and sometimes even different opinions about the correct way to render fonts.

If you're making files just for yourself then you can always use LO to tweak something created with MS Office until it looks acceptably close to the original when printed; but if you're collaborating with somebody else, you really both want to be using the same tool. And since most of the world will flat refuse to use free software when they can spend their boss's money instead, that means you'll be stuck with MS Office for collaboration purposes for the foreseeable future.

When you have tweaked a MS-created original with LO, saving the result in LO's native ODT format will save you a lot of swearing later.
posted by flabdablet at 6:38 AM on January 13 [1 favorite]


thank you, everybody for weighing in. I feel I have my answer, albeit with a few questions posed that I hadn't really considered. Now I just need to find the time to really dive in and figure what is going to work for me ... starting with LibreOffice ...
posted by philip-random at 9:23 AM on January 13



