Finding common text between bills
March 1, 2011 10:03 AM
Is there common text between the various bills that outlaw collective bargaining in some contexts, or limit the domains on which it applies? Do you know of a good way to find this automatically?
I'm trying to find common text between (say) the Wisconsin, Indiana, and Ohio bills limiting collective bargaining rights. Do you know of common text elements? Alternately, can you suggest a good program for doing this kind of text analysis?
Response by poster: Thanks qxntpqbbbqxl. I have some programming experience but zero knowledge of this field, so the point toward shingling is much appreciated.
posted by yomimono at 1:10 PM on March 1, 2011
There was also a recent post to projects, Churnalism, which does this very thing except searching for similar text between news articles and press releases. You might contact its author Donch.
As for the techniques to do this, yes, it's been well studied. I recall from an undergrad information retrieval class that "shingling" is something that might work well. See page 461 of this PDF, starting with the second paragraph. Maybe someone with more direct knowledge in this subject will chime in. It would be pretty easy for any programmer to whip up a script that applies shingling to two text files to find similarities.
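For what it's worth, here is a rough sketch of the shingling idea in Python, not a tested tool. It assumes a "shingle" is a run of k consecutive words and scores overlap with Jaccard similarity (shared shingles divided by total distinct shingles). The function names, the default k=4, and the two sample sentences are made up for illustration and are not taken from any actual bill.

```python
import re

def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    # Lowercase and keep only word-like tokens so punctuation doesn't block matches.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def common_shingles(text_a, text_b, k=4):
    """Return (shared shingles, Jaccard score) for two texts."""
    sa, sb = shingles(text_a, k), shingles(text_b, k)
    return sa & sb, jaccard(sa, sb)

if __name__ == "__main__":
    # Invented sample sentences, only to show the shape of the output.
    bill_a = "Public employees shall not bargain collectively over any subject other than base wages."
    bill_b = "Under this act, public employees shall not bargain collectively on matters except base wages."
    shared, score = common_shingles(bill_a, bill_b)
    for s in sorted(shared):
        print(s)
    print("Jaccard similarity:", round(score, 3))
```

To use it on real bills you would read each file into a string and pass those in. A larger k gives fewer but more convincing matches; a smaller k picks up more boilerplate legal phrasing.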
You might also try various diff utilities (e.g., vim can display a side-by-side diff of two files), but those usually assume that the two things you're comparing are mostly identical, so the output might be difficult to wade through for two things that are mostly different.
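If you'd rather stay close to the diff approach, Python's standard difflib module can report only the matching runs instead of the differences. This is just a sketch: the shared_passages name and the min_words cutoff are my own choices, and since SequenceMatcher aligns the two texts in order, it will miss passages that appear in a different order in each bill (shingling doesn't have that limitation).

```python
from difflib import SequenceMatcher

def shared_passages(text_a, text_b, min_words=4):
    """Return word runs of at least min_words that appear, in order, in both texts."""
    a, b = text_a.split(), text_b.split()
    # autojunk=False turns off the heuristic that ignores very frequent tokens.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

if __name__ == "__main__":
    # Invented sample sentences, not text from any actual bill.
    one = "the employer shall not bargain collectively with any union"
    two = "in this state the employer shall not bargain collectively except on wages"
    for passage in shared_passages(one, two):
        print(passage)
```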
posted by qxntpqbbbqxl at 10:24 AM on March 1, 2011