Is any large corpus of Office documents available for download?
May 24, 2015 2:43 PM
I am looking for a large (>1000 files) corpus of real-world Office documents (Word, Excel, PowerPoint, though I wouldn't turn down Open/LibreOffice/Gnumeric/etc. formats).
I am aware that there is a limited ability to retrieve office documents from Google through searching, but I am really after a collection that originates from a single source and has a common purpose.
I am also aware that things like the Sony incident probably generated a big dump of such documents, but I want the files to be from a legitimate source.
My definition of "real world document" is something manually created (not generated), produced for a real purpose (for example, not a situation where the author has been asked to type a document for an analysis project), and not curated/censored/filtered or reorganised (though stripped of personal/sensitive details would be fine).
My ideal source would be a snapshot of a shared network drive in a corporation - except legally acquired!
Government websites are treasure troves of this kind of stuff. Searches like:
site:*.gov ext:docx
will return lots and lots of links. If you want more of a consistent source for the documents, you can monkey around with the site parameter so that it is more specific.
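To make the "more specific site parameter" idea concrete, here is a minimal sketch that just builds query strings in the same form as the one above, for a few Office extensions. The function names, the extension list, and the narrower domain (epa.gov) are illustrative assumptions, not something from this thread; the script only prints the strings for you to paste into a search engine.

```python
from typing import List


def build_query(site: str, extension: str) -> str:
    """Return a search string in the form 'site:<site> ext:<extension>'."""
    # Tolerate a leading dot, so ".docx" and "docx" give the same result.
    return f"site:{site} ext:{extension.lstrip('.')}"


def build_queries(sites: List[str], extensions: List[str]) -> List[str]:
    """Return one query string per (site, extension) pair."""
    return [build_query(s, e) for s in sites for e in extensions]


# Assumed example values: the broad pattern from the comment above,
# plus one narrower domain picked purely for illustration.
SITES = ["*.gov", "epa.gov"]
EXTENSIONS = ["docx", "xlsx", "pptx"]

if __name__ == "__main__":
    for query in build_queries(SITES, EXTENSIONS):
        print(query)
```

Sticking to a single domain in SITES is the closest this gets to the "single source, common purpose" requirement in the question, though the results are still whatever the search engine happens to have indexed.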
posted by mmascolino at 8:20 PM on May 24, 2015
EUSES has a corpus of 5000 spreadsheets, listed under "Data Sets" at http://eusesconsortium.org/resources.php
posted by at at 10:27 PM on May 25, 2015
This thread is closed to new comments.
posted by monju_bosatsu at 3:02 PM on May 24, 2015