Online Author Comparison
August 4, 2009 4:41 AM

Is there such a thing as an online tool that compares two pieces of text and determines whether they were written by the same author?

If so, please link to them here and tell me about your experience using them. If not, please let me know about any offline tools or services that might be available. Thanks!
posted by cell divide to Computers & Internet (9 answers total) 1 user marked this as a favorite
 
Are you talking about a sort of plagiarism-detector, or something that would compare texts in a stylistic sense?
posted by SebastianKnight at 4:48 AM on August 4, 2009


Response by poster: Stylistic, attempting to see if something is being written by one person, or if multiple people are writing under the same name.
posted by cell divide at 4:54 AM on August 4, 2009


Detecting plagiarism is as easy as using Google, but a style detector? An expert would be hard-pressed to give a 100% accurate match, and I doubt the AI exists to make a match with any real accuracy.
posted by JJ86 at 5:42 AM on August 4, 2009


I think the algorithm you're looking for would be either word or phrase frequency analysis. A letter frequency analysis can give a convincing probabilistic pointer to the language a Caesar-ciphered text is written in. In the same way, an analysis of word or phrase frequency is supposed to act as a hidden signature of a particular author. This type of algorithm is built into a lot of plagiarism detection software.
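
As a rough illustration of the word frequency idea, here is a minimal sketch that builds a relative-frequency profile over a few common words and compares two profiles with cosine similarity. The word list, the function names, and the choice of cosine similarity are all assumptions made for the example; this is not how any particular plagiarism detector works.

```python
import math
import re
from collections import Counter

# Hand-picked common words for illustration only (an assumption, not a standard list).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def profile(text):
    """Relative frequency of each word in FUNCTION_WORDS within the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Calling cosine_similarity(profile(text_a), profile(text_b)) gives a number between 0 and 1 for these non-negative profiles. A higher value only means the two texts use these words in similar proportions, which is a hint and not proof of common authorship.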

However, the problem of determining whether the text is all by one author or not is trickier. If each author had written only a section of the overall text, then you might be able to split the text into sections and run each section past an analysis tool to compare it with a sample from each individual author. However, if multiple authors have edited each other's sections, then the distinctions would be more blurred.
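
A minimal sketch of the split-and-compare idea, assuming you already have a known writing sample for each candidate author. The fixed window size, the Manhattan distance, and every function name here are assumptions chosen for illustration; a real tool would need far longer windows and better features.

```python
import re
from collections import Counter

def word_list(text):
    """Lowercased words of the text."""
    return re.findall(r"[a-z']+", text.lower())

def chunks(words, size):
    """Split a word list into consecutive windows of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def profile(words, vocabulary):
    """Relative frequency of each vocabulary word in the word list."""
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocabulary]

def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def closest_author(section_words, references, vocabulary):
    """Name of the reference sample whose profile is nearest to the section."""
    section_profile = profile(section_words, vocabulary)
    return min(
        references,
        key=lambda name: distance(
            section_profile, profile(word_list(references[name]), vocabulary)
        ),
    )

def label_sections(text, references, vocabulary, size=500):
    """Assign each window of the text to its nearest reference author."""
    return [
        closest_author(c, references, vocabulary)
        for c in chunks(word_list(text), size)
    ]
```

Here references is a dict mapping an author name to a sample of that author's writing. If the labels flip back and forth across windows, that could suggest more than one hand, or it could just be noise from windows that are too short.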
posted by rongorongo at 6:22 AM on August 4, 2009 [1 favorite]


I'm not sure what your intent here is, but as far as offline stuff goes, you could try a private forensic document analysis lab. Since this kind of thing is generally done for court cases, I assume it's pretty expensive. Full document analysis encompasses much more than just looking at the style and content of the writing, and even then it's often hard to generate a definitive answer. Like JJ86 said, I too doubt an accurate AI exists.
posted by Midnight Rambler at 6:28 AM on August 4, 2009 [1 favorite]


Not really. The guy who wrote the book Author Unknown, Don Foster, talked a lot about the techniques he uses to try to identify authors via textual analysis. At the time of that writing, there wasn't anything available, and I don't think there is now. He relies on a combination of looking for:

- specific sentence construction
- "statistically improbable phrases" [amazon.com will show you this sort of thing]
- reliance on certain grammatical tics like using a lot of adverbs or "verbing" words etc.

His comment: "The notion has been perpetuated that there's a computer program that can identify authorship, and there isn't."
posted by jessamyn at 6:48 AM on August 4, 2009 [2 favorites]


I am not sure how to search for it, but I recall an 'Amateur Scientist' column in Scientific American many years ago that talked about computer programs that attempted to do this.
posted by Killick at 7:03 AM on August 4, 2009


Best answer: This has been asked (but not really answered) previously:
I'm aware that it is possible to compare different samples of writing to assess whether they have the same or different authors. How would I go about doing this?

From the answers there:

- It's generally called "Stylometry" or sometimes "forensic linguistics"
- That thread will give you some more useful search terms
- There weren't any links to software that does it, but I wrote a quick explanation of how you might DIY:

---

I don't know any out of the box solutions, but I know a little bit about the theory. Back in undergrad, a group of us did a project on author fingerprinting for a CS class. The way we did it was to look at several features of the text. Off the top of my head, some of them were:

- punctuation use and frequency
- sentence length
- standard deviation of sentence length
- word length
- standard deviation of word length
- common word frequency (the 100 or so most common words in the English language)
- trigram frequency
- type-token ratio (TTR), which is the ratio of different words to the total number of words used.
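
For a sense of how little code the simpler features take, here is a minimal sketch. It is not the code from that class project; the sentence splitting on ., ! and ?, the punctuation set, and the function and key names are all assumptions made for the example.

```python
import re
import statistics

def basic_features(text):
    """Compute a few simple stylometric features for one text."""
    # Naive sentence split; abbreviations like "Dr." will be split too.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    word_lengths = [len(w) for w in words]
    punctuation = re.findall(r'[,;:!?.()"-]', text)
    return {
        "mean_sentence_length": statistics.mean(sentence_lengths),
        # Population std deviation, so a single sentence gives 0 and not an error.
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        "mean_word_length": statistics.mean(word_lengths),
        "word_length_stdev": statistics.pstdev(word_lengths),
        "punctuation_per_word": len(punctuation) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Note that TTR drops as a text gets longer, so it probably only makes sense to compare samples of similar length.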

The only one that really needs explanation is trigram frequency. A trigram is defined as a set of three consecutive words. You parse your way through some text and store all trigrams in a big-ass data structure (or database), then calculate the frequency of each. The idea is that most authors have key phrases or methods of sentence construction that they'll reuse.
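
A minimal sketch of that trigram counting, using a plain in-memory Counter in place of a database. The tokenization and the function names are assumptions for illustration.

```python
import re
from collections import Counter

def trigram_frequencies(text):
    """Map each trigram (three consecutive words) to its relative frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    trigrams = list(zip(words, words[1:], words[2:]))
    total = len(trigrams)
    if total == 0:
        return {}
    counts = Counter(trigrams)
    return {trigram: count / total for trigram, count in counts.items()}

def shared_trigrams(text_a, text_b):
    """Trigrams that occur in both texts."""
    return set(trigram_frequencies(text_a)) & set(trigram_frequencies(text_b))
```

Trigrams that show up in both texts but are rare in general writing are the interesting ones; very common ones like "one of the" say little about the author.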

I remember that we ran these all through a training corpus of text from Project Gutenberg, and then used some hill-climbing algorithms to find the proper weighting of each parameter. Unfortunately, I don't have that data anymore, so I can't tell you which metric was most informative (though I seem to recall that trigrams were the way to go). If you're computationally inclined, coding something up would be a fun little weekend project.
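
To give a flavor of the weighting step, here is a minimal hill-climbing sketch. It is a guess at the general shape, not a reconstruction of what we actually ran: the scoring rule (weighted feature differences below a threshold of 1.0 means "same author"), the training-pair format, and all names and defaults are assumptions.

```python
import random

def score(weights, pairs):
    """Fraction of labelled pairs classified correctly.

    Each pair is (feature_differences, same_author). A pair is predicted
    "same author" when its weighted difference is below 1.0 (assumed threshold).
    """
    correct = 0
    for diffs, same_author in pairs:
        weighted = sum(w * d for w, d in zip(weights, diffs))
        if (weighted < 1.0) == same_author:
            correct += 1
    return correct / len(pairs)

def hill_climb(pairs, n_features, steps=1000, step_size=0.1, seed=0):
    """Nudge one weight at a time, keeping any change that does not hurt the score."""
    rng = random.Random(seed)
    weights = [1.0] * n_features
    best = score(weights, pairs)
    for _ in range(steps):
        candidate = list(weights)
        i = rng.randrange(n_features)
        candidate[i] = max(0.0, candidate[i] + rng.uniform(-step_size, step_size))
        candidate_score = score(candidate, pairs)
        if candidate_score >= best:
            weights, best = candidate, candidate_score
    return weights, best
```

Plain hill climbing like this can get stuck in a local optimum, so restarting from several random starting weights and keeping the best result is a common workaround.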
posted by chrisamiller at 8:16 AM on August 4, 2009 [1 favorite]


Response by poster: Thank you all for your answers, they are much appreciated.
posted by cell divide at 9:40 AM on August 5, 2009

