How to dig through a million pages of documents?
March 21, 2019 8:35 AM

Last summer, the US Senate had to wade through a million pages of documents on the Kavanaugh nomination. There are plenty of similar document dumps that journalists, lawmakers, investigators, etc. have to quickly assess and analyze. Are there any guides, articles or videos that explain how this process works, how to build a team and assess your own work for accuracy and comprehensiveness?
posted by caveatz to Writing & Language (10 answers total) 4 users marked this as a favorite
 
No one really wants to give their methods away in detail, but these days I imagine they used predictive analytics software. If you google it you will find plenty of marketing fairy tales, but the technology has definitely come along to the point that it's useful.
posted by praemunire at 9:09 AM on March 21, 2019


Last I heard about this from someone in person, quite a few journalists used something like wordcloud and then keyword searches to find and read relevant bits of text. There's also your own knowledge about which kind of documents are most likely to be interesting, or keywords that you're hoping will come up.
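A rough sketch of the word-count-then-keyword-search approach described above, using only the Python standard library. A word cloud is essentially a picture of term frequencies, so the sketch just counts terms. The sample documents, the stop-word list, and the function names are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
              "for", "is", "was", "we", "it"}

def top_terms(texts, n=5):
    """Count non-stop-word terms across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

def keyword_hits(texts, keyword):
    """Return indexes of texts containing the keyword as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [i for i, text in enumerate(texts) if pattern.search(text)]

# Hypothetical documents, not taken from any real production.
docs = [
    "The merger memo was sent to the board on Friday.",
    "Lunch order for Friday: sandwiches and coffee.",
    "Board minutes: the merger vote is scheduled.",
]

print(top_terms(docs, n=3))          # the most frequent terms, i.e. the word cloud's biggest words
print(keyword_hits(docs, "merger"))  # which documents to open and read first
```

The frequent terms suggest what to search for; the keyword hits tell you which documents to read.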
posted by plonkee at 9:18 AM on March 21, 2019


This is a huge issue for the legal profession, where truckloads of document dumps are produced in the discovery process. These days they apply AI to the task: basically, you find some documents in the dump that you consider relevant, tell the AI "more like this please" and it finds some more; you score what it found for relevance so that it builds a better picture of what you want, and finds still more. It's more complex than this, of course: you're also applying categories as you go, etc. But that's the basic idea.
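To make the "more like this please" loop concrete, here is a minimal sketch in plain Python. It is not how any particular vendor does it: it swaps in a simple, well-known stand-in (TF-IDF vectors plus Rocchio-style relevance feedback), and the documents, function names, and weights are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (dict of term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n = len(texts)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed inverse document frequency: rarer terms weigh more.
        vectors.append({t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profile(vectors, relevant, irrelevant, beta=1.0, gamma=0.5):
    """Rocchio-style profile: average of relevant docs minus part of the irrelevant average."""
    profile = defaultdict(float)
    for idx in relevant:
        for term, weight in vectors[idx].items():
            profile[term] += beta * weight / len(relevant)
    for idx in irrelevant:
        for term, weight in vectors[idx].items():
            profile[term] -= gamma * weight / len(irrelevant)
    return profile

def rank_unreviewed(vectors, relevant, irrelevant):
    """Return unreviewed document indexes, most similar to the profile first."""
    profile = build_profile(vectors, relevant, irrelevant)
    reviewed = set(relevant) | set(irrelevant)
    scored = [(cosine(profile, vec), i)
              for i, vec in enumerate(vectors) if i not in reviewed]
    return [i for _, i in sorted(scored, reverse=True)]

# Hypothetical documents for illustration only.
docs = [
    "Please wire the payment to the offshore account before the audit.",
    "Team lunch is on Friday, bring your own drinks.",
    "The auditors asked about the offshore account and the missing payment.",
    "Parking garage will be closed on Friday for repairs.",
    "Delete the payment records before the audit starts.",
]
vectors = tfidf_vectors(docs)

first_pass = rank_unreviewed(vectors, relevant=[0], irrelevant=[])
print(first_pass)   # one seed document: "more like this"

second_pass = rank_unreviewed(vectors, relevant=[0, 2], irrelevant=[1])
print(second_pass)  # after scoring two more documents, the ranking is refined
```

Each round of scoring moves the profile toward what you marked relevant and away from what you rejected, which is the feedback loop in miniature.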
posted by adamrice at 10:34 AM on March 21, 2019 [2 favorites]


Response by poster: Thanks for the answers so far. What software packages get used to do this? Or more broadly, what do you call apps that do this? Predictive text analysis? Legal Discovery Analytics? I'm interested in researching further.
posted by caveatz at 11:25 AM on March 21, 2019


The digital discovery process is a big, lucrative niche in the legal industry; scanning and/or OCR'ing vast quantities of documents can be spun up quickly (for a price). If speed is critical, a small army of attorneys skilled in looking for "the smoking gun" or hidden legal problem can be online, probably by the next day. A million pages actually does not sound exceptional in size; that's under three hundred "banker boxes". Corporations will deliver semis full of "documents" as a strategy to hide something. The raw text for that many pages probably fits on a thumb drive, and all the standard text-based search tools can be faster and possibly more effective than AI.
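A back-of-the-envelope check of those numbers, as a short Python sketch. The page and box counts come from the paragraph above; the bytes-per-page figure and the drive size are assumptions, so adjust them to taste.

```python
PAGES = 1_000_000               # from the original question
BOXES = 300                     # the upper bound quoted above
BYTES_PER_PAGE = 3_000          # assumption: roughly 500 words of plain text per page
THUMB_DRIVE_BYTES = 32 * 10**9  # assumption: a 32 GB thumb drive

pages_per_box = PAGES / BOXES            # capacity per box implied by the 300-box figure
raw_text_bytes = PAGES * BYTES_PER_PAGE  # size of the extracted text only

print(f"{pages_per_box:.0f} pages per box implied")
print(f"{raw_text_bytes / 10**9:.1f} GB of raw text")
print("fits on the drive:", raw_text_bytes < THUMB_DRIVE_BYTES)
```

This only covers extracted text. Scanned page images take far more space than the text pulled out of them.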

Google "legal discovery software": there are dozens if not hundreds of products, and there are trade shows exclusively for that niche.
posted by sammyo at 12:13 PM on March 21, 2019 [1 favorite]


In this thread the software category is described as "e-discovery" and vendors mentioned include Disco, Relativity, and Everlaw. This comment gives one example of a way it's used.
posted by XMLicious at 1:25 PM on March 21, 2019


In the Law Biz this is called Technology Assisted Review (TAR) or Predictive Coding. Google those terms and you'll turn up a ton of material on how it works. Here's a fact sheet on it from one random vendor but there are probably hundreds like it.
posted by The Bellman at 3:32 PM on March 21, 2019


> If speed is critical a small army of attorneys skilled in looking for "the smoking gun" or hidden legal problem can be online probably next day.

I estimate that 95% of the humans doing that work are not attorneys.

If the OP wants to know how individuals could do some of the same on a desktop computer, look at DevonThink (Mac) and dtSearch (Windows).
posted by yclipse at 4:42 PM on March 21, 2019


> I estimate that 95% of the humans doing that work are not attorneys.

Large-scale legal doc review is done by attorneys or non-human entities. Most JDs are barely capable of doing this work with actual competence; you're not going to hire people without degrees (who also aren't effectively bound by legal confidentiality) to do it.

(It's not that any individual task involved is inherently that hard, but apparently most of the population is simply incapable of being provided with a reasonably in-depth overview of a case and understanding it sufficiently such that they can recognize documents related to key issues pre-identified for them. Few things will bring down your opinion of the human race like babysitting a flotilla of contract attorneys.)
posted by praemunire at 5:29 PM on March 21, 2019 [3 favorites]


I am a paralegal in civil litigation and I do a lot of document review. The largest document set I have personally overseen was 170,000 documents (emails and their attachments), comprising something like 1.2 million pages. I have also done document review that was paper-only, although I don't have an estimate of how many pages or documents (I do recall it was around 20 bankers' boxes worth of stuff).

I agree with what everyone else has said; if you are looking for particular resources, the searching guides for some of the document hosting platforms can give you an idea of how one would use the technology; this is Relativity's guide, here is Disco (I have done productions from both).

Some other thoughts based on my own personal approach to the process:

- Document review, particularly of a large number of documents where it is impracticable or impossible to review each page individually, whether it is investigative journalism or a document production, is about asking the right questions - what words would one use to describe the thing I am looking for? During what date range would this conversation or document have been created? Between which people would this discussion have taken place? Or if you don't know the answers to any of those, you can narrow it down the other way - take a key date, and just look at everything emailed/created on that particular day or near it. Of course, you are likely looking for much more than one single thing, so you would repeat these questions for each topic you are interested in or which is responsive to the doc demand. Similarly, what can I exclude? What is definitely not what I am looking for? An easy example of this, where the database is email data, is spam emails - I don't need to review those individually to mark them non-responsive.

- I suppose if you're just looking for dirt or controversy (as with the Kavanaugh production), wherever it might be, this approach wouldn't necessarily work, in which case the only way out is through - throw a bunch of manpower at the database and just work through it. The technology-assisted review won't really help you unless you know what you are asking it to look for.

- Regarding review, most of the e-platforms have an option where you can ask it to generate a sample size, with varying degrees of confidence, so you can review a representative sample of your set and see whether anything is slipping through the cracks/getting marked incorrectly.
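The narrowing questions and the sampling check above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Relativity or Disco work internally: the `Email` record, the addresses, and the function names are hypothetical, and the sample size uses the textbook Cochran formula with a finite population correction (95% confidence, 5% margin of error by default).

```python
import math
import random
from dataclasses import dataclass
from datetime import date

@dataclass
class Email:
    doc_id: int
    sent: date
    sender: str
    recipients: tuple
    body: str

def matches(email, keywords, start, end, people, exclude_senders=frozenset()):
    """True if the email is in the date range, involves one of the people,
    mentions a keyword, and does not come from an excluded sender (e.g. spam)."""
    if email.sender in exclude_senders:
        return False
    in_range = start <= email.sent <= end
    involved = bool(people & ({email.sender} | set(email.recipients)))
    text = email.body.lower()
    mentions = any(k.lower() in text for k in keywords)
    return in_range and involved and mentions

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's formula with finite population correction (z=1.96 is about 95% confidence)."""
    if population <= 0:
        return 0
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return min(population, math.ceil(n0 / (1 + (n0 - 1) / population)))

def qc_sample(doc_ids, margin=0.05, seed=0):
    """Draw a random sample of document IDs for a human to spot-check."""
    ids = sorted(doc_ids)
    return random.Random(seed).sample(ids, sample_size(len(ids), margin))

# Hypothetical data for illustration only.
emails = [
    Email(1, date(2018, 7, 9), "alice@example.com", ("bob@example.com",), "Draft of the merger memo attached."),
    Email(2, date(2018, 7, 9), "promo@example.net", ("alice@example.com",), "Huge summer sale, a merger of savings!"),
    Email(3, date(2018, 9, 1), "alice@example.com", ("bob@example.com",), "Following up on the merger."),
    Email(4, date(2018, 7, 10), "bob@example.com", ("carol@example.com",), "Lunch tomorrow?"),
]
people = {"alice@example.com", "bob@example.com"}
responsive = [e.doc_id for e in emails
              if matches(e, ["merger"], date(2018, 7, 1), date(2018, 7, 31),
                         people, exclude_senders={"promo@example.net"})]
not_flagged = [e.doc_id for e in emails if e.doc_id not in responsive]

print(responsive)              # documents to review first
print(qc_sample(not_flagged))  # spot-check what the filter left out
print(sample_size(170_000))    # sample needed for a 170,000-document set
```

Reviewing the sample drawn from the unflagged pile is one way to estimate how much is slipping through the cracks.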
posted by Aubergine at 6:47 PM on March 21, 2019 [3 favorites]

