Need to analyze words within 800+ ppt decks - shortcuts?
May 5, 2015 7:00 AM   Subscribe

I have a set of 800+ powerpoint decks. I'm looking to identify which ones have certain search terms -- say, "aardvarks" or do not include certain terms -- say, "tigers". Ideally, I'd get an excel spreadsheet with the file names and how many times "aardvarks" appeared, and another list of file names where "tigers" did not appear. Is there a tool to do this so that I don't have to open all 800+ individually?
posted by emkelley to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
I'm not sure if this will work (maybe I'm missing something), but on my computer (Mac) I can use the Finder to search for files with a keyword, and then limit by file type. Then I end up with a list of all of the documents with that keyword. You could repeat for each keyword and then copy the list into a spreadsheet.
posted by three_red_balloons at 7:14 AM on May 5, 2015


Whoops, yes, read that too fast. I don't know how you'd find the number of instances of a keyword or which files don't have it.
posted by three_red_balloons at 7:16 AM on May 5, 2015


On Linux, OS X, or a Windows system with Grep for Windows, I'd use grep from the command line.

grep -c aardvarks *

...returns a count of how many times "aardvarks" appears in each file. If you want it in a spreadsheet, you'd have to do just a wee bit of munging to replace the ":" separators with commas:

grep -c aardvarks * | sed -e 's_:_,_g' > aardvarkcounts.csv

To get a list of filenames where "tigers" did not appear, use the -L option:

grep -L tigers * > notigers.csv
posted by clawsoon at 7:24 AM on May 5, 2015


Note, grep is not going to work on .pptx files. With .pptx files, you would need to unzip them and then grep through the xml for each slide or note.
posted by bfranklin at 7:27 AM on May 5, 2015


If you're on Windows, you'll also need Sed for Windows to do my second example. It might not be necessary, though, if you can tell Excel to import the .csv file using ":" as the separator instead of ",".
posted by clawsoon at 7:28 AM on May 5, 2015


Good point, bfranklin. zgrep should work in that case, shouldn't it?
posted by clawsoon at 7:30 AM on May 5, 2015


I don't think so. zgrep works on gzip'd files, while Office uses zip archives (you'd use unzip on linux).
posted by bfranklin at 7:36 AM on May 5, 2015


The way I'd tackle this for a small number of keywords is to:

- drop all the files into one directory.
- make sure Explorer is indexing powerpoints
- browse to the folder with explorer
- search for aardvarks
- copy and paste the files for aardvarks into a new directory
- from the command like dir c:\directory\aardvarks\* > aardvarkslist.txt
- rinse and repeat for other keywords

If you have a LOT of keywords to search, I'd roll with writing some powershell or python to do it.
posted by bfranklin at 7:46 AM on May 5, 2015 [1 favorite]


zgrep does work on zip archives in addition to gzip archives, but it has a couple of other problems: It only extracts the first file in the archive (so all the actual slides would be missed), and grep -c counts lines rather than word occurrences (so if "aardvarks" was twice on one line it'd only be counted once).
posted by clawsoon at 7:48 AM on May 5, 2015


How much time do you have and programming background ? Using Apache Poi and Lucene, in about 5-8 hrs you should have a Lucene database set up for doing any queries you want.

(time/space tradeoff -- is this a one-off thing, or something you'll keep adding documents or making different queries ? )
posted by k5.user at 7:49 AM on May 5, 2015


Here's a one-liner that'll do it in a bash terminal on Linux or OS X. For Windows, you'd probably want Cygwin, though that will have a bit of a learning curve.

for i in *.pptx *.PPTX; do echo -n $i,; unzip -c $i | grep -ohi aardvarks | wc -l; done > aardvarkscounts.csv

However, if you're looking for a word that happens to occur in the xml tags (e.g. a word like "content" or "class"), this will give lots of false positives. In that case, some actual programming will be required, as per bfranklin and k5.user.
posted by clawsoon at 8:02 AM on May 5, 2015


If your on Windows, I think you should look at PowerTools for OpenXML. It adds some functionality to PowerShell for dealing with the "new" Office file formats (.docx, .pptx, etc).

In particular, the Cmdlet Select-OpenXMLString is probably what you want. In this blog post, under the heading "Using Select-OpenXmlString to Find Documents" there's an example that seems like a start.
posted by water under the bridge at 8:16 AM on May 5, 2015


Very few people seem to know this trick, but you can just unpack any pptx (or other MS Office .---x file) using your favorite unzip/unrar tool, and it'll end up as a folder full of the constituents (text, images, etc).

From there, it's pretty easy with Windows search or one of the better third-party tools to search the plaintext that makes up the slide notes, and go from there.
posted by stavrosthewonderchicken at 9:50 PM on May 5, 2015


« Older I was diagnosed with herpes this weekend. Please...   |   Life of an Adventure Guide Newer »
This thread is closed to new comments.