How can I find which Word docs contain embedded images?
February 19, 2018 4:43 PM

I have 48,000 Word documents in a folder. Some of them contain embedded images and some of them don't. My life would be a million times easier if I could separate them.

Can anyone think of a way of identifying, without opening the documents, which ones have embedded images and which ones don't?

I'm using Windows 10. I've tried sorting by file size in Explorer but some documents that don't have embedded images are larger files than some others that do, so that's not reliable.
posted by infinitejones to Computers & Internet (7 answers total) 4 users marked this as a favorite
I can't vouch for the reliability of the answers, but this StackExchange post asks the same question.
posted by sacrifix at 4:48 PM on February 19, 2018

It's not trivial, but it mostly depends on how comfortable you are with getting your hands dirty and scripting something. A Word document in the .docx format (not the older binary .doc format) is a zipped archive that contains structured XML. You can get at that in a variety of ways. You need to figure out how to dump the XML from each file, figure out which image tags Word uses, and sort the files into another directory.

On Linux you'd do this by chaining a couple of utilities together with pipes inside a bash script. Something like

unzip -p document.docx word/document.xml ought to get you as far as the XML.
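
For the "figure out which image tags Word uses" step, here is a rough Python sketch of the same idea. It assumes pictures show up in word/document.xml as <a:blip> (DrawingML) or <v:imagedata> (legacy VML) elements with those conventional namespace prefixes; the function name and the marker list are illustrative, not anything taken from Word's documentation, and pictures that live only in headers or footers would be missed because those are stored in separate XML parts.

```python
import zipfile

# Assumed markers: Word conventionally writes pictures as <a:blip ...>
# (DrawingML) or <v:imagedata ...> (legacy VML). The prefixes are a
# convention, not a guarantee, so treat this as a heuristic.
IMAGE_MARKERS = (b"<a:blip", b"<v:imagedata")


def document_xml_mentions_image(path):
    """Return True if word/document.xml contains one of the assumed markers."""
    # Equivalent to: unzip -p document.docx word/document.xml
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml")
    return any(marker in xml for marker in IMAGE_MARKERS)
```

Calling document_xml_mentions_image("document.docx") reads the file straight from the archive, so nothing is opened in Word.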

This sort of thing comes up in digital forensics all the time, and there are a bunch of different programs that might help, from proprietary systems like Belkasoft to the excellent BitCurator.

(here are a couple more: one, another).
posted by aspersioncast at 5:54 PM on February 19, 2018

I did a bit of investigating and experimenting with no immediate luck. But perhaps you might want to explore a few replacement desktop search tools for Windows? They may magically have this (now that I've become aware of the problem, absolutely necessary) feature.
posted by turbid dahlia at 6:09 PM on February 19, 2018

With Visual Basic, you should be able to iterate all the documents in a folder, and have it open each document and find if any pictures are inside the doc.
posted by nickggully at 6:25 PM on February 19, 2018

Best answer: I wrote a little Python script for you that will detect and move Word documents with images to an "images_docx" subfolder. As written, it only works for .docx documents, not .doc (though I believe that could be added too!). If you have never run a Python script from the command line before, here is a tutorial that may help (though there are likely better and longer tutorials elsewhere).
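
The linked script is not reproduced in this thread. As a minimal sketch of the general approach, and not the original code, something like the following would work under one assumption: a .docx with embedded images stores them as files under word/media/ inside the zip archive. The function names are made up for illustration, and only the "images_docx" subfolder name comes from the description above. Linked (non-embedded) pictures would not be detected, and other embedded media would also count as a match.

```python
import shutil
import zipfile
from pathlib import Path


def has_embedded_media(docx_path):
    """Return True if the archive has any entry under word/media/."""
    try:
        with zipfile.ZipFile(docx_path) as archive:
            return any(name.startswith("word/media/")
                       for name in archive.namelist())
    except zipfile.BadZipFile:
        # Not a valid zip (for example a Word lock file); leave it alone.
        return False


def move_docs_with_images(folder, subfolder="images_docx"):
    """Move .docx files with embedded media into folder/subfolder.

    Returns the names of the files that were moved.
    """
    folder = Path(folder)
    target = folder / subfolder
    moved = []
    # sorted() builds the full list first, so moving files during the
    # loop does not disturb the directory scan.
    for docx_path in sorted(folder.glob("*.docx")):
        if has_embedded_media(docx_path):
            target.mkdir(exist_ok=True)
            shutil.move(str(docx_path), str(target / docx_path.name))
            moved.append(docx_path.name)
    return moved
```

To try it, call move_docs_with_images() with the path to a copy of the folder first, so the originals stay untouched if the heuristic turns out to be wrong for some files.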

Hope that helps! And if not it was a fun thing to write, thanks for the question :)
posted by elephantsvanish at 7:30 PM on February 19, 2018 [16 favorites]

That's some clean and well-commented code, elephantsvanish!
posted by aspersioncast at 6:30 AM on February 20, 2018 [1 favorite]

Response by poster: Thanks for all the answers, folks.

I've marked the answer by elephantsvanish as the best answer because it's the simplest one for me to implement - I know enough Python to see that it will work and to be able to run it on my machine.

That's not to say the other, more VB-focused answers wouldn't work; I'm just not familiar with VB or macros.
posted by infinitejones at 3:20 PM on February 20, 2018 [1 favorite]
