Determining "sameness" of documents
December 21, 2006 5:11 AM

I'm writing a load of content for a client and a condition I need to meet is that each of roughly 50 pages must be at least 30% unique. How can I compare them?

Each document is about 10 paragraphs long and is very informational (not particularly conversational). It's tough to rephrase things without repeating myself to some degree from one document to the next.

The geekiness factor can be fairly high if necessary, and I can deal with Windows, Mac, and Unix-y options. The copy is currently in txt format, but I'm not opposed to ideas that would require conversion to something else. I've been googling my ass off and trying to think of my problem in other terms (seeing if some kind of plagiarism detection method could be used, thinking of the documents as code and using something like diffs, etc.), but the light bulb hasn't come on yet.
posted by braintoast to Computers & Internet (10 answers total) 1 user marked this as a favorite
 
I use examdiff to compare different versions of texts from Project Gutenberg. Is that the kind of thing you're thinking of?

It can compare documents you have, but it doesn't check a database for plagiarism or anything.
posted by Science! at 5:22 AM on December 21, 2006


Don't be put off by the purchase/price link; the current version 1.7 is free, but it looks like they'll be charging for it when it reaches 3.0.
posted by Science! at 5:24 AM on December 21, 2006


Figure out a way to paginate the text document into the pages that will be printed. Spit each page out into a separate text file with the page number as the filename.

Write a script that runs diff between consecutive pages (or all the pages, but that'd be a whole lot more diffs), pipes it to a grep command that keeps only the lines that describe changes (not the context information), pipes that to wc -l, and writes the result out to a file. That way, you'll have a count of the lines of text that differ between each pair of pages.

Once you have that, stick the counts into a spreadsheet, divide each by the page's total line count, and search for cells under 30%; those are your pages that are less than 30% unique.
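
Here's a rough Python sketch of that idea; it uses difflib rather than literally shelling out to diff, grep, and wc, and it assumes the pages have been saved as page01.txt, page02.txt, and so on:

    # compare_pages.py -- a sketch, not the literal diff | grep | wc pipeline
    import difflib
    import glob

    def pct_different(lines_a, lines_b):
        """Fraction of lines in lines_b that don't match anything in lines_a."""
        sm = difflib.SequenceMatcher(None, lines_a, lines_b)
        matched = sum(block.size for block in sm.get_matching_blocks())
        return 1.0 - matched / max(len(lines_b), 1)

    pages = sorted(glob.glob("page*.txt"))  # assumed filenames: page01.txt, page02.txt, ...
    for prev, curr in zip(pages, pages[1:]):
        a = open(prev).read().splitlines()
        b = open(curr).read().splitlines()
        diff = pct_different(a, b)
        flag = "  <-- less than 30% unique vs. the previous page" if diff < 0.30 else ""
        print(f"{prev} vs {curr}: {diff:.0%} of lines differ{flag}")

Comparing every pair instead of just consecutive pages is the same loop run over itertools.combinations(pages, 2).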
posted by yellowbkpk at 5:32 AM on December 21, 2006


Response by poster: That's along the lines of what I've been thinking so far. I'm just not sure how I might apply diffs in a more "global" sense. I'm not certain, but this "sameness" stipulation is most likely an SEO thing. In other words, I'm trying to think in terms of how similar the search engines might find the copy.
posted by braintoast at 5:32 AM on December 21, 2006


Response by poster: nice, yellowbkpk. I'm glad to see you weren't afraid to put my geekiness to the test. I'll give it a go.

I'm still open to other suggestions, as I'll be busy with this for a while. :)
posted by braintoast at 5:37 AM on December 21, 2006


I don't like yellowbkpk's method at all. It assumes that repeated text would be included in exactly the same form in all parts of the document, which I can't imagine is guaranteed. For example, suppose a huge block of text is repeated in two sections, but in one of those sections it's prefixed by the string "As you recall from before". Since that would cause the text to wrap differently, the diff-by-lines method would declare the text block to be unique.

I think what's needed is an algorithm similar to that used by anti-plagiarism sites like turnitin.com; they check sequences of words against the corpus for matches. In this case, the corpus isn't the internet as a whole; it's a "hold-one-out" version of your own text. Each block of text to be tested is matched against a corpus that consists of the document itself with the block under test removed.
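
For what it's worth, the word-sequence matching part isn't that crazy to approximate yourself. A small Python sketch of that hold-one-out idea, applied at the whole-document level rather than per block; the five-word shingle length and the doc*.txt filenames are just assumptions:

    # shingle_check.py -- hold-one-out word n-gram overlap, loosely turnitin-style
    import glob
    import re

    N = 5  # length of the word sequences ("shingles") to compare; tune as needed

    def shingles(text, n=N):
        words = re.findall(r"[a-z0-9']+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    docs = {path: open(path).read() for path in sorted(glob.glob("doc*.txt"))}

    for path, text in docs.items():
        mine = shingles(text)
        # the corpus is every document except the one under test
        rest = set().union(*(shingles(t) for p, t in docs.items() if p != path))
        unique = 1.0 - len(mine & rest) / max(len(mine), 1)
        flag = "  <-- under 30% unique" if unique < 0.30 else ""
        print(f"{path}: {unique:.0%} of word {N}-grams appear nowhere else{flag}")

Longer shingles are stricter about what counts as a repeat; shorter ones will flag common phrases, so it's worth trying a few values.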
posted by dmd at 6:54 AM on December 21, 2006


Response by poster: dmd - right. I've seen that problem first hand now.

Any ideas as to how I could apply something remotely like the craziness that turnitin.com is using? Tall order, I know. :)
posted by braintoast at 7:12 AM on December 21, 2006


Well, one scientific paper I read a coupla years ago used compression as a measure of sameness, to the point that the system could be used to identify the author of an unattributed piece of text by its style alone. On top of everything, this is fairly easy to code:

Basically take a few pieces of your "corpus" of documents that you want to use as your good basis of unique content and zip them (or gzip or whatever). Just to avoid some technical issues, it makes sense to put all the text into one big file first. Let's call that A and its compressed version Az.

Now, take A and add to it (again, in the same text file) a piece of text that you also know is fairly different from what's in A (this is your control). Let's call that B and its compressed version Bz.

Finally, for every piece of text you want to compare, add it to A (again, same text file) and compress it (this is your test). Let's call that C, and its compressed version Cz.

If Cz is smaller than Bz, then C is fairly close to A. If Cz is only fractionally bigger than Az (by some percentage you'll have to determine), then C is very likely made up of fragments of the corpus in A.

You'll have to run some tests to determine the comparison thresholds that fit your problem, but this is a robust, mathematically sound way to determine sameness...
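
A quick Python sketch of that recipe; the corpus.txt, control.txt, and page*.txt filenames are made up for illustration:

    # zipsim.py -- crude compression-based sameness check, following the recipe above
    import glob
    import gzip

    def gz_size(data):
        return len(gzip.compress(data))

    corpus = open("corpus.txt", "rb").read()    # A: the known-unique text, all in one file
    control = open("control.txt", "rb").read()  # text known to be unlike A
    a_z = gz_size(corpus)                       # size of Az
    b_z = gz_size(corpus + b"\n" + control)     # size of Bz (A plus the control)

    for path in sorted(glob.glob("page*.txt")):
        candidate = open(path, "rb").read()
        c_z = gz_size(corpus + b"\n" + candidate)  # size of Cz (A plus the candidate)
        # how much new compressed information the candidate adds, relative to the control
        added = (c_z - a_z) / max(b_z - a_z, 1)
        print(f"{path}: adds {added:.0%} as much compressed data as the control does")

A ratio near zero means the candidate is mostly recycled from A; a ratio near one means it's about as novel as the control. In practice the control and each candidate should be roughly the same length (or you should normalize by raw size) before reading much into the numbers.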
posted by costas at 7:42 AM on December 21, 2006 [1 favorite]


Found a reference here.
posted by costas at 7:45 AM on December 21, 2006


The perl module String::Compare might be a start, although I'm not sure how happy it would be taking many pages of text as input.
posted by dmd at 7:48 AM on December 21, 2006


This thread is closed to new comments.