Analyzing word usage - algorithms, techniques, and software in the field
March 25, 2004 5:57 AM

By using a computer to analyze the word usage of the anonymously authored book Primary Colors, Don Foster was able to determine that columnist Joe Klein wrote it. Foster wrote a book about his experience, but it gives very little detail about his methods. From what I gather, the general theory is that different writers have statistically significant idiosyncrasies that show up in things like vocabulary choices. The FBI is supposed to be able to do this with ransom notes, and I've heard of one guy who is trying to do it with the U.S. Supreme Court's theoretically anonymous "per curiam" opinions. I'd like to do some similar work. What is this general field called--is "forensic linguistics" the right term? Can anyone recommend a good bibliography of works to read up on this, or maybe even some software that helps?
posted by profwhat to Writing & Language (11 answers total)
 
gZip. More info here.
posted by seanyboy at 6:07 AM on March 25, 2004


[slight o/t: This will likely prove difficult in the case of Supreme Court opinions. Opinions are rarely written solely by the Justice, but are collaborations between the Justice and the Justice's four law clerks. Clerks generally only serve for a single year. The "style" of a given Justice's opinions will likely change from year to year as the Justice's own style changes and, more importantly, as the Justice hires different clerks each year. However, if this method is sophisticated enough, it might be able to accomplish the much more interesting task of figuring out which clerks wrote which opinions.]
posted by monju_bosatsu at 6:13 AM on March 25, 2004


And here's a link about using the aforementioned method to determine if Marlowe was the Bard.
posted by seanyboy at 6:13 AM on March 25, 2004


I think one of the main points of Foster's book was that there isn't software that will really do the job; a lot of it has to do with keeping your eyes open for odd quirky word usage, turns of phrase and writing habits [length of sentences and paragraphs, amount of quoting, use of dialogue, etc.]. With that in mind, it seems like what you really need is something that counts word frequencies and sentence and paragraph lengths, plus a sharp eye to pick out the trends that link one piece of work to another set of works.

In the case of Primary Colors, Foster was lucky in that there was a large body of Klein's published work available for comparison [and Klein was a bastard about the whole thing and tried very seriously to damage Foster's reputation, calling him a liar and a slanderer, even though Foster turned out to be correct in the end], and a lot of his analysis has to do with Klein's specific turns of phrase and choice of adjectives. In the end, it's not fingerprinting and there's no useful way to really prove anything, just strongly suggest. In the case of the courts you have two problems: clerks write a lot of the court material [as monju said], and there isn't a large body of comparison material except for other opinions. The added difficulty is that court documents require a lot of standard phraseology, which tends to obscure authorship... there are fewer ways for people to personalize or customize their work than in fiction or essay writing. Foster said that a lot of what led him to be successful was a really dogged attention to detail more than any software he used.
posted by jessamyn at 7:08 AM on March 25, 2004
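
The habit-counting jessamyn describes (sentence and paragraph length, amount of quoting) can be roughed out in a few lines. This is only a sketch under stated assumptions: the function name `style_features`, the choice of features, and the naive sentence splitter are all hypothetical, and nothing here is Foster's actual method.

```python
import re

def style_features(text):
    """Crude stylistic counts for one text (illustrative features only)."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive splitter: breaks on whitespace that follows ., ! or ?.
    # It will mis-split abbreviations and quoted sentence endings.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    return {
        "words": len(words),
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "mean_paragraph_len": len(words) / max(len(paragraphs), 1),
        "quote_marks_per_1000_words": 1000 * text.count('"') / n_words,
    }
```

Running this over a disputed text and over each candidate's known work gives numbers to line up side by side; deciding which differences mean anything is still the "sharp eye" part.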


This doesn't mean that software wouldn't work for you. As well as the above, Bayesian filtering would probably work. It'd be VERY easy to set up POPFile to do this sort of categorisation for you.
Also - See here for more links to Bayes-type categorisers.
posted by seanyboy at 7:26 AM on March 25, 2004


reading seanyboy's second link it looks like any adaptive compression algorithm will do the job (i'm not saying it will be sufficiently sensitive, or it will be able to separate judges from clerks; i'm just looking at the technical side of things). it appears that they simply grouped files into pairs and then compressed the pairs. when pairs were by the same author they compressed more efficiently because the adaptive algorithm did not need to readjust for a new style in the second file.

it's a very elegant approach.
posted by andrew cooke at 8:03 AM on March 25, 2004
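
A minimal sketch of the pair-compression idea andrew cooke describes, assuming Python's `zlib` as a stand-in for whatever compressor the linked work used. The function names (`compressed_size`, `extra_cost`, `attribute`) are hypothetical, and this is an illustration of the principle rather than the published procedure.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at maximum compression."""
    return len(zlib.compress(data, 9))

def extra_cost(known: bytes, unknown: bytes) -> int:
    """Extra compressed bytes needed to append `unknown` to `known`.

    If both texts share vocabulary and phrasing, the compressor can reuse
    what it has already seen in `known`, so the extra cost is lower.
    """
    return compressed_size(known + unknown) - compressed_size(known)

def attribute(unknown: bytes, candidates: dict) -> str:
    """Return the candidate name whose known text makes `unknown` cheapest."""
    return min(candidates, key=lambda name: extra_cost(candidates[name], unknown))
```

Usage would look like `attribute(disputed_bytes, {"klein": klein_bytes, "other": other_bytes})`. One practical caveat: zlib only looks back 32 KB, so only the tail of a long known text influences the result, and candidate samples should be of comparable length.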


Here's a recent Guardian article which suggests Foster's methods are not all they might be.
posted by biffa at 8:37 AM on March 25, 2004


I've used Algorithm::NaiveBayes (a Perl module that does what it sounds like it should) to do some authorship analysis of disputed Shakespeare plays. You have to be extremely careful about what texts you use when doing Bayesian analysis, especially if your sample size is small. This particular algorithm isn't as exact as I'd like -- your probability results tend to cluster around 0 and 1 as a result of normalization, which means that it's often hard to make fine distinctions.

I'm also looking at Orange for similar work. It looks quite promising, but I don't yet have an implementation I'm happy with. The other python tool I'm playing with is NLTK.

If my perl code would be useful to you, dash me an email. It's still an infant, so I haven't posted it online yet. Oh, and for what it's worth I call this type of thing "statistical natural language processing." I'm quite fond of this book on the subject.
posted by amery at 9:19 AM on March 25, 2004
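
To make the "cluster around 0 and 1" remark concrete, here is a small multinomial naive Bayes sketch with add-one smoothing. It is written from scratch for illustration and is an assumption about the general technique, not the internals of Algorithm::NaiveBayes; the function names and data layout are hypothetical.

```python
import math
from collections import Counter

def train(docs_by_author):
    """docs_by_author maps an author name to a list of token lists."""
    counts, vocab = {}, set()
    for author, docs in docs_by_author.items():
        c = Counter()
        for tokens in docs:
            c.update(tokens)
        counts[author] = c
        vocab.update(c)
    total_docs = sum(len(docs) for docs in docs_by_author.values())
    priors = {a: len(docs) / total_docs for a, docs in docs_by_author.items()}
    return {"counts": counts, "vocab": vocab, "priors": priors}

def log_scores(model, tokens):
    """Unnormalized log posterior per author, with add-one smoothing."""
    v = len(model["vocab"])
    scores = {}
    for author, c in model["counts"].items():
        total = sum(c.values())
        s = math.log(model["priors"][author])
        for t in tokens:
            if t in model["vocab"]:  # words never seen in training are skipped
                s += math.log((c[t] + 1) / (total + v))
        scores[author] = s
    return scores

def posteriors(model, tokens):
    """Normalize the log scores into probabilities that sum to 1."""
    scores = log_scores(model, tokens)
    top = max(scores.values())
    exp = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}
```

Each word multiplies the ratio between authors, so a document of any real length pushes the normalized posterior very close to 0 or 1 even when the per-word evidence is weak. Comparing the raw values from `log_scores` is one way to keep finer distinctions visible.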


The most revealing words may be the small ones that authors and readers overlook. There's an algorithm that pretty accurately guesses if the author is a man or a woman. This has nothing to do with our notions of "male" and "female" language; instead it detects patterns of small inconspicuous words like "of" and "the". Try it here or visit Moshe Koppel, one of the originators (several pdfs on automatically categorizing written texts by author gender).
posted by Termite at 1:39 PM on March 25, 2004
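
The "small inconspicuous words" idea can be sketched as a function-word frequency profile. This is a hedged illustration only: the ten-word list, the tokenizer, and the distance measure are arbitrary assumptions, and this is not the classifier Koppel and colleagues published.

```python
import re
from collections import Counter

# Illustrative list only; real studies use much longer function-word lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "with", "for")

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / n for w in FUNCTION_WORDS]

def distance(p, q):
    """Sum of absolute differences between two profiles (smaller = closer)."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

Comparing `distance(profile(disputed), profile(known))` across candidate authors gives a rough ranking; with short texts the frequencies are noisy, so the result should be treated as a hint at most.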


Here's an interesting article on determining if Shakespeare was the Earl of Oxford. Another vote for Foundations of Statistical Natural Language Processing; it's a rocking book. I've found Bayesian methods less useful than SVD for text categorization, but I'm not trying to establish document provenance.
posted by brool at 3:13 PM on March 25, 2004


There's an algorithm that pretty accurately guesses if the author is a man or a woman.

Accurately according to its inventors, inaccurately according to more or less everyone who's tried it on their web sites.

As for forensic linguistics, you might find the comments in this thread of interest.
posted by languagehat at 12:59 PM on March 26, 2004

