Finding common text between bills
March 1, 2011 10:03 AM

Is there common text between the various bills that outlaw collective bargaining in some contexts, or limit the domains on which it applies? Do you know of a good way to find this automatically?

I'm trying to find common text between (say) the Wisconsin, Indiana, and Ohio bills limiting collective bargaining rights. Do you know of any text elements they share? Alternatively, can you suggest a good program for doing this kind of text analysis?
posted by yomimono to Computers & Internet (2 answers total)
 
There are a number of online plagiarism detectors; that's where I'd start looking. Maybe one of them will let you compare two sources for similarities.

There was also a recent post to Projects, Churnalism, which does this very thing, except that it searches for similar text between news articles and press releases. You might contact its author, Donch.

As for the techniques to do this, yes, it's been well studied. I recall from an undergrad information retrieval class that "shingling" is something that might work well: you break each document into overlapping runs of consecutive words and compare the resulting sets. See page 461 of this PDF, starting with the second paragraph. Maybe someone with more direct knowledge in this subject will chime in. It would be pretty easy for any programmer to whip up a script that applies shingling to two text files to find similarities.
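For what it's worth, here is a minimal sketch of what such a script might look like in Python. The function names, the 4-word window, and the tokenizing regex are my own illustrative choices, not anything taken from the PDF, so treat it as a starting point rather than a reference implementation.

```python
import re

def shingles(text, w=4):
    """Return the set of w-word shingles (runs of w consecutive words) in text.

    Tokenization is an assumption: lowercase, keep letters, digits, apostrophes.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def common_shingles(text_a, text_b, w=4):
    """Return the shingles that appear in both texts."""
    return shingles(text_a, w) & shingles(text_b, w)

if __name__ == "__main__":
    # Made-up sentences for illustration, not text from any actual bill.
    a = "public employees may not bargain collectively over wages"
    b = "state employees may not bargain collectively over benefits"
    for s in sorted(common_shingles(a, b)):
        print(" ".join(s))
    print(jaccard(shingles(a), shingles(b)))
```

To compare real bills you would read each file into a string and pass the strings in. A larger w gives fewer but more convincing matches; a smaller w catches shorter shared phrases along with more noise.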

You might also try various diff utilities (e.g., vim can display a side-by-side diff of two files), but those usually assume that the two things you're comparing are mostly identical, so the output might be difficult to wade through for two things that are mostly different.
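If the raw diff output is too noisy, one possible workaround is to keep only the long matching runs and throw away everything else. Below is a rough sketch using Python's standard difflib module; the helper name common_runs and the 5-word minimum are assumptions of mine. Note that SequenceMatcher aligns the two texts in order, so passages that appear in a different order in the two documents may be missed.

```python
import difflib

def common_runs(text_a, text_b, min_words=5):
    """Return runs of at least min_words consecutive words found in both texts.

    Words are split on whitespace; punctuation and case are left untouched.
    """
    a, b = text_a.split(), text_b.split()
    # autojunk=False turns off the heuristic that ignores very frequent items.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

if __name__ == "__main__":
    # Made-up sentences for illustration only.
    a = "the quick brown fox jumps over the lazy dog today"
    b = "yesterday a quick brown fox jumps over the fence"
    for run in common_runs(a, b):
        print(run)
```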
posted by qxntpqbbbqxl at 10:24 AM on March 1, 2011


Response by poster: Thanks qxntpqbbbqxl. I have some programming experience but zero knowledge of this field, so the point toward shingling is much appreciated.
posted by yomimono at 1:10 PM on March 1, 2011

