Is any large corpus of Office documents available for download?
May 24, 2015 2:43 PM

I am looking for a large (>1000 files) corpus of real-world Office documents (Word, Excel, PowerPoint, though I wouldn't turn down Open/LibreOffice/Gnumeric/etc. formats).

I am aware that there is a limited ability to retrieve office documents from Google through searching, but I am really after a collection that originates from a single source and has a common purpose.

I am also aware that things like the Sony incident probably generated a big dump of such documents, but I want the files to be from a legitimate source.

My definition of "real world document" is something manually created (not generated), produced for a real purpose (for example, not a situation where the author has been asked to type a document for an analysis project), and not curated/censored/filtered or reorganised (though stripped of personal/sensitive details would be fine).

My ideal source would be a snapshot of a shared network drive in a corporation - except legally acquired!
posted by hoverboards don't work on water to Computers & Internet (3 answers total)
 
Does it need to be actual Office documents? Would something like the Enron email dataset work?
posted by monju_bosatsu at 3:02 PM on May 24, 2015


Government websites are treasure troves of this kind of stuff. Searches like:

site:*.gov ext:docx

will return lots and lots of links. If you want more of a consistent source for the documents, you can monkey around with the site parameter so that it is more specific.
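
If you end up trying many site/extension combinations, a small script can generate the query strings for you. This is only a rough sketch: the function name `build_queries`, the extension list, and the `agency.example.gov` domain are made up for illustration, and it only builds the strings to paste into a search box (it does not run any searches).

```python
from itertools import product

# Office file extensions to look for (an illustrative list; extend as needed).
EXTENSIONS = ["doc", "docx", "xls", "xlsx", "ppt", "pptx"]


def build_queries(sites, extensions=EXTENSIONS):
    """Return one 'site:<site> ext:<ext>' query string per (site, extension) pair."""
    return [f"site:{site} ext:{ext}" for site, ext in product(sites, extensions)]


if __name__ == "__main__":
    # "agency.example.gov" is a hypothetical placeholder for a more specific source.
    for query in build_queries(["*.gov", "agency.example.gov"], ["docx", "xlsx"]):
        print(query)
```

Restricting the list of sites to a single domain is what gets you closer to a single-source collection.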
posted by mmascolino at 8:20 PM on May 24, 2015


EUSES has a corpus of 5000 spreadsheets. http://eusesconsortium.org/resources.php , under "Data Sets"
posted by at at 10:27 PM on May 25, 2015

