Need to analyze words within 800+ ppt decks - shortcuts?
May 5, 2015 7:00 AM Subscribe
I have a set of 800+ powerpoint decks. I'm looking to identify which ones have certain search terms -- say, "aardvarks" or do not include certain terms -- say, "tigers". Ideally, I'd get an excel spreadsheet with the file names and how many times "aardvarks" appeared, and another list of file names where "tigers" did not appear. Is there a tool to do this so that I don't have to open all 800+ individually?
Whoops, yes, read that too fast. I don't know how you'd find the number of instances of a keyword or which files don't have it.
posted by three_red_balloons at 7:16 AM on May 5, 2015
posted by three_red_balloons at 7:16 AM on May 5, 2015
On Linux, OS X, or a Windows system with Grep for Windows, I'd use grep from the command line.
...returns a count of how many times "aardvarks" appears in each file. If you want it in a spreadsheet, you'd have to do just a wee bit of munging to replace the ":" separators with commas:
To get a list of filenames where "tigers" did not appear, use the -L option:
posted by clawsoon at 7:24 AM on May 5, 2015
grep -c aardvarks *
...returns a count of how many times "aardvarks" appears in each file. If you want it in a spreadsheet, you'd have to do just a wee bit of munging to replace the ":" separators with commas:
grep -c aardvarks * | sed -e 's_:_,_g' > aardvarkcounts.csv
To get a list of filenames where "tigers" did not appear, use the -L option:
grep -L tigers * > notigers.csv
posted by clawsoon at 7:24 AM on May 5, 2015
Note, grep is not going to work on .pptx files. With .pptx files, you would need to unzip them and then grep through the xml for each slide or note.
posted by bfranklin at 7:27 AM on May 5, 2015
posted by bfranklin at 7:27 AM on May 5, 2015
If you're on Windows, you'll also need Sed for Windows to do my second example. It might not be necessary, though, if you can tell Excel to import the .csv file using ":" as the separator instead of ",".
posted by clawsoon at 7:28 AM on May 5, 2015
posted by clawsoon at 7:28 AM on May 5, 2015
Good point, bfranklin. zgrep should work in that case, shouldn't it?
posted by clawsoon at 7:30 AM on May 5, 2015
posted by clawsoon at 7:30 AM on May 5, 2015
I don't think so. zgrep works on gzip'd files, while Office uses zip archives (you'd use unzip on linux).
posted by bfranklin at 7:36 AM on May 5, 2015
posted by bfranklin at 7:36 AM on May 5, 2015
The way I'd tackle this for a small number of keywords is to:
- drop all the files into one directory.
- make sure Explorer is indexing powerpoints
- browse to the folder with explorer
- search for aardvarks
- copy and paste the files for aardvarks into a new directory
- from the command like
- rinse and repeat for other keywords
If you have a LOT of keywords to search, I'd roll with writing some powershell or python to do it.
posted by bfranklin at 7:46 AM on May 5, 2015 [1 favorite]
- drop all the files into one directory.
- make sure Explorer is indexing powerpoints
- browse to the folder with explorer
- search for aardvarks
- copy and paste the files for aardvarks into a new directory
- from the command like
dir c:\directory\aardvarks\* > aardvarkslist.txt
- rinse and repeat for other keywords
If you have a LOT of keywords to search, I'd roll with writing some powershell or python to do it.
posted by bfranklin at 7:46 AM on May 5, 2015 [1 favorite]
zgrep does work on zip archives in addition to gzip archives, but it has a couple of other problems: It only extracts the first file in the archive (so all the actual slides would be missed), and grep -c counts lines rather than word occurrences (so if "aardvarks" was twice on one line it'd only be counted once).
posted by clawsoon at 7:48 AM on May 5, 2015
posted by clawsoon at 7:48 AM on May 5, 2015
How much time do you have and programming background ? Using Apache Poi and Lucene, in about 5-8 hrs you should have a Lucene database set up for doing any queries you want.
(time/space tradeoff -- is this a one-off thing, or something you'll keep adding documents or making different queries ? )
posted by k5.user at 7:49 AM on May 5, 2015
(time/space tradeoff -- is this a one-off thing, or something you'll keep adding documents or making different queries ? )
posted by k5.user at 7:49 AM on May 5, 2015
Here's a one-liner that'll do it in a bash terminal on Linux or OS X. For Windows, you'd probably want Cygwin, though that will have a bit of a learning curve.
However, if you're looking for a word that happens to occur in the xml tags (e.g. a word like "content" or "class"), this will give lots of false positives. In that case, some actual programming will be required, as per bfranklin and k5.user.
posted by clawsoon at 8:02 AM on May 5, 2015
for i in *.pptx *.PPTX; do echo -n $i,; unzip -c $i | grep -ohi aardvarks | wc -l; done > aardvarkscounts.csv
However, if you're looking for a word that happens to occur in the xml tags (e.g. a word like "content" or "class"), this will give lots of false positives. In that case, some actual programming will be required, as per bfranklin and k5.user.
posted by clawsoon at 8:02 AM on May 5, 2015
If your on Windows, I think you should look at PowerTools for OpenXML. It adds some functionality to PowerShell for dealing with the "new" Office file formats (.docx, .pptx, etc).
In particular, the Cmdlet Select-OpenXMLString is probably what you want. In this blog post, under the heading "Using Select-OpenXmlString to Find Documents" there's an example that seems like a start.
posted by water under the bridge at 8:16 AM on May 5, 2015
In particular, the Cmdlet Select-OpenXMLString is probably what you want. In this blog post, under the heading "Using Select-OpenXmlString to Find Documents" there's an example that seems like a start.
posted by water under the bridge at 8:16 AM on May 5, 2015
Very few people seem to know this trick, but you can just unpack any pptx (or other MS Office .---x file) using your favorite unzip/unrar tool, and it'll end up as a folder full of the constituents (text, images, etc).
From there, it's pretty easy with Windows search or one of the better third-party tools to search the plaintext that makes up the slide notes, and go from there.
posted by stavrosthewonderchicken at 9:50 PM on May 5, 2015
From there, it's pretty easy with Windows search or one of the better third-party tools to search the plaintext that makes up the slide notes, and go from there.
posted by stavrosthewonderchicken at 9:50 PM on May 5, 2015
This thread is closed to new comments.
posted by three_red_balloons at 7:14 AM on May 5, 2015