Comparing writing styles
November 11, 2008 11:12 AM
I'm aware that it is possible to compare different samples of writing to assess whether they have the same or different authors. How would I go about doing this? Is there publicly available software or would I have to go to an expert?
I think that Word has something like that built in. (I might be confused about that, though, and I don't have a copy to check it.)
posted by Class Goat at 11:32 AM on November 11, 2008
I read a bunch of papers about forensic linguistics a couple of years ago, and my impression was that the field is more art than science, and that even consulting an expert would not give you certainty.
posted by dhoe at 11:43 AM on November 11, 2008 [1 favorite]
I saw an episode of History Detectives that aimed to do this. The episode was the 1856 Mormon Tale, Season 6, Episode 1. They were trying to identify the true author of a book and hypothesized that it was a certain author from the same period. They took the manuscripts to David Hoover of New York University who used authorship attribution software to prove whether or not it was the same author. The software analyzed the documents for frequencies of words and patterns. There's more information on pages 5 and 6 in the transcript [PDF] of the show. Hopefully this gives you some leads!
posted by bristolcat at 12:07 PM on November 11, 2008
One thing you should probably be specifying is the purpose that you need this software for; there's a variety of stuff out there, some of it more at the forensic end and some at the bit-of-fun end, with others in between. What you need to decide whether to fail college students for plagiarism will be different to what you need to assess whether you've found a previously undiscovered manuscript by the young Henry James, and they'll both be different to what you need to email a friend saying "according to this software there's a 60% chance your blog and my blog were written by the same person lol".
posted by Acheman at 12:20 PM on November 11, 2008
Best answer: I don't know any out of the box solutions, but I know a little bit about the theory. Back in undergrad, a group of us did a project on author fingerprinting for a CS class. The way we did it was to look at several features of the text. Off the top of my head, some of them were:
- punctuation use and frequency
- sentence length
- standard deviation of sentence length
- word length
- standard deviation of word length
- common word frequency (the 100 or so most common words in the English language)
- trigram frequency
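The simpler features in this list can be sketched in a few lines of Python. This is only an illustration, not the code from the class project: the function name `basic_style_features`, the crude sentence splitter, and the choice to measure punctuation per word are all assumptions.

```python
import re
import statistics
import string


def basic_style_features(text):
    """Return simple style measurements for one text sample (hypothetical helper)."""
    # Crude sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters and apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    # Sentence length is measured in words, word length in characters.
    sentence_lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    word_lengths = [len(w) for w in words]
    punctuation_count = sum(1 for ch in text if ch in string.punctuation)
    return {
        "mean_sentence_length": statistics.mean(sentence_lengths),
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        "mean_word_length": statistics.mean(word_lengths),
        "word_length_stdev": statistics.pstdev(word_lengths),
        "punctuation_per_word": punctuation_count / len(words),
    }
```

Running it on each writing sample gives one small feature dictionary per sample, which can then be compared feature by feature.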
The only one that really needs explanation is trigram frequency. A trigram is a sequence of three consecutive words. You parse your way through some text and store all trigrams in a big-ass data structure (or database), then calculate the frequency of each. The idea is that most authors have key phrases or methods of sentence construction that they'll reuse.
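A minimal sketch of the trigram counting described here, assuming a plain in-memory `Counter` is a big enough data structure and that words are lowercased runs of letters and apostrophes. The function name `trigram_frequencies` is made up for illustration.

```python
import re
from collections import Counter


def trigram_frequencies(text):
    """Map each word trigram to its relative frequency in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a three-word window over the text, one word at a time.
    trigrams = list(zip(words, words[1:], words[2:]))
    if not trigrams:
        return {}
    counts = Counter(trigrams)
    total = len(trigrams)
    return {trigram: n / total for trigram, n in counts.items()}
```

Two samples can then be compared by looking at which trigrams they share and how close the shared frequencies are.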
I remember that we ran these all through a training corpus of text from Project Gutenberg, and then used some hill-climbing algorithms to find the proper weighting of each parameter. Unfortunately, I don't have that data anymore, so I can't tell you which metric was most informative (though I seem to recall that trigrams were the way to go). If you're computationally inclined, coding something up would be a fun little weekend project.
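The weighting step might look something like the sketch below. The post does not say what objective or update rule the project used, so everything here is an assumption: each training example is a list of per-feature differences between two texts, the score is the fraction of pairings in which a same-author pair ends up closer than a different-author pair, and the climber nudges one weight at a time and keeps only improvements.

```python
import random


def weighted_distance(weights, diff):
    """Weighted sum of per-feature differences between two texts."""
    return sum(w * d for w, d in zip(weights, diff))


def separation(weights, same_diffs, other_diffs):
    """Fraction of (same, other) pairings where the same-author pair is closer."""
    wins = sum(
        weighted_distance(weights, s) < weighted_distance(weights, o)
        for s in same_diffs
        for o in other_diffs
    )
    return wins / (len(same_diffs) * len(other_diffs))


def hill_climb(same_diffs, other_diffs, steps=500, step_size=0.1, seed=0):
    """Simple hill climbing: perturb one weight, keep the change if the score rises."""
    rng = random.Random(seed)
    n_features = len(same_diffs[0])
    weights = [1.0] * n_features  # start with every feature weighted equally
    best = separation(weights, same_diffs, other_diffs)
    for _ in range(steps):
        candidate = list(weights)
        i = rng.randrange(n_features)
        # Weights are clamped at zero so a feature can be ignored but not inverted.
        candidate[i] = max(0.0, candidate[i] + rng.uniform(-step_size, step_size))
        score = separation(candidate, same_diffs, other_diffs)
        if score > best:
            weights, best = candidate, score
    return weights, best
```

Plain hill climbing like this can get stuck on a local optimum, so in practice it is worth restarting from several random starting weights and keeping the best result.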
posted by chrisamiller at 1:11 PM on November 11, 2008
Oh yeah, one more metric: I forgot about type-token ratio (TTR), which is the ratio of different words to the total number of words used.
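Type-token ratio is the easiest of these to compute. A minimal sketch, assuming the same lowercase word tokenization as a reasonable default (the function name `type_token_ratio` is just for illustration):

```python
import re


def type_token_ratio(text):
    """Number of distinct words divided by the total number of words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

TTR tends to drop as a text gets longer, because common words keep repeating, so it is only meaningful when comparing samples of roughly the same length.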
posted by chrisamiller at 1:14 PM on November 11, 2008
Here's some software you can download, though it's hard to tell exactly what it does or how well it does it. Sounds kinda fun though, I might try it out later.
posted by contraption at 11:26 AM on November 11, 2008 [1 favorite]