Publicly available databases involving terrorism and war
October 23, 2007 8:49 AM
After hearing this story on NPR this morning about the Harmony Database of al Qaeda correspondence, I began wondering what other public databases like this exist out there that are ripe for analysis and picking apart. I'm looking specifically for things like this related to the "War on Terrorism" and the Iraq War, but older stuff would be good too.
If you are interested in network corpora in general -- any large, rich collection of networked text that you can freely download, not necessarily related to terrorism or the Iraq War -- here are some interesting ones.
- Enron Corpus -- all Enron internal emails used in the Enron case.
- MediaDefender Email Corpus -- all of the emails exchanged by employees of MediaDefender, a company hired by media companies to spoof and disrupt P2P networks (you'll probably have to hit BitTorrent for this).
- NLTK -- the Natural Language Toolkit, which comes with a variety of research-quality text corpora in English and many other languages.
- CiteSeer -- a massive collection of scientific papers. (There are several others like it, such as the preprint archives and the NIH databases.)
- And, of course, Gutenberg and Wikipedia can both be downloaded in bulk, but each has a specific procedure for doing so, which you should follow. See especially DBpedia for a fantastic collection of semantic data mined from Wikipedia.
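For a rough idea of what "picking apart" an email corpus like the ones above could look like, here is a minimal Python sketch that counts who-emailed-whom from raw messages. It assumes each message is available as plain RFC 822 text with ordinary From/To/Cc headers; the function names and the sample messages are made up for illustration and are not taken from any of the corpora listed.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def edges_from_message(raw_message):
    """Return (sender, recipient) address pairs for one raw RFC 822 message."""
    msg = message_from_string(raw_message)
    senders = [
        addr.lower()
        for _, addr in getaddresses(msg.get_all("From", []))
        if addr
    ]
    recipients = [
        addr.lower()
        for header in ("To", "Cc")
        for _, addr in getaddresses(msg.get_all(header, []))
        if addr
    ]
    return [(s, r) for s in senders for r in recipients]


def build_edge_counts(raw_messages):
    """Count how many messages went from each sender to each recipient."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(edges_from_message(raw))
    return counts


# Hypothetical sample data, standing in for files read from a real corpus.
SAMPLE_MESSAGES = [
    "From: alice@example.com\nTo: bob@example.com\nSubject: hi\n\nBody one.\n",
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: re\n\nBody two.\n",
    "From: bob@example.com\nTo: alice@example.com\nCc: carol@example.com\n"
    "Subject: fw\n\nBody three.\n",
]

if __name__ == "__main__":
    for (sender, recipient), n in build_edge_counts(SAMPLE_MESSAGES).most_common():
        print(f"{sender} -> {recipient}: {n}")
```

The resulting counts are just a weighted edge list, so they can be fed into whatever graph tool you prefer. With a real corpus you would read each message file from disk instead of using the inline samples, and you would likely need to handle malformed headers.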
posted by mrflip at 10:38 PM on January 9, 2008
posted by Ironmouth at 12:10 PM on October 23, 2007