How to do automated text searches?
February 3, 2004 9:11 AM Subscribe
Metafilter, as a source to journalists for story ideas and also as a reservoir of good writing for plagiarists, has been discussed on a recent Metatalk thread. How could I acquire a script (I don't have much of a clue about this, technically), for my Mac (OS 9.2.2 and OS 10.2.8) to run automated searches looking for matchups on chunks of text from my 500 (or whatever it is) pages of Metafilter commentary? (more inside)
You'd also need a database of your existing comments to check against. And then dedicate the machine to the search for your words in print somewhere else. It seems time- and resource-intensive.
posted by yerfatma at 9:33 AM on February 3, 2004
Response by poster: yerfatma - the database wouldn't be much of a problem.
Here's the "more" which I couldn't post at the time, due to some technical glitch:
I was reading this Metatalk thread discussion, concerning "Metafilter. Now you know where I get at least half my story ideas." -- Chicago Tribune features writer Maureen Ryan, listing her 10 favorite websites in a guest appearance on co-worker Eric Zorn's weblog.
In the discussion, adamgreenfield related his discovery that the Philly Tribune had published a large chunk of his writing, from his weblog, without consulting him first (with proper attribution though). y2Karl found out that a syndicated columnist had stolen some of his material from the "Annotated Blonde on Blonde" thread. Both of these discoveries were accidental, and so the actual incidence of this sort of thing must be far, far higher.
This made me wonder: I haven't done a word count, but I must have generated something on the order of 500 pages of text during my 500-odd days on Metafilter. How hard would it be to write a script which would take the whole body of my commentary on Metafilter and - paragraph by paragraph - run Net searches to look for close or identical matchups?
Technically, this is way beyond me. Not for many here on this site though.....
posted by troutfishing at 10:13 AM on February 3, 2004
the google web api would be a great place to start.
posted by lescour at 10:47 AM on February 3, 2004
Would be much more interesting if someone took this on as a MeFi-wide project. By which I mean do the same search as trout is suggesting but for all comments by all users here.
posted by billsaysthis at 11:00 AM on February 3, 2004
heehee, billsaysthis -- things are ALWAYS more interesting when they're about me, as opposed to other people.
Although you poo-poo it, troutfishing, this seems like a very resource intensive task, more than just diskspace. You're talking about a query that incorporates every word you've ever written here, cut up into google-able chunks the size of 3-4 words, I'm guessing. You're going to run that against the google cache every day, or every few days, and then you're going to -- probably manually -- go through the hits weeding out false positives. You're also going to scrape MeFi for your future writings, I guess, and add that to your code in some automated fashion, again breaking up into chunks.
If you're so bent on seeing your work in print it would be a lot easier to distill your wisdom to a 500 word op-ed piece and submit it around. :)
(of course, you're thinking about copyright)
posted by luser at 11:31 AM on February 3, 2004
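A minimal sketch of the chunking step described in the comment above, assuming Python (no language is named in the thread). The function name `phrase_chunks` and the `size` and `step` parameters are hypothetical; the output is only a list of quoted phrases, and how they are submitted to a search engine is left open.

```python
import re

def phrase_chunks(text, size=4, step=4):
    """Cut text into quoted phrases of `size` words, moving `step` words at a time.

    Hypothetical helper: a smaller `step` gives overlapping phrases
    (more queries, fewer missed matches).
    """
    # Keep letters, digits and apostrophes; punctuation is dropped.
    words = re.findall(r"[A-Za-z0-9']+", text)
    chunks = []
    for start in range(0, len(words) - size + 1, step):
        phrase = " ".join(words[start:start + size])
        # Quotes ask the search engine for an exact phrase match.
        chunks.append('"' + phrase + '"')
    return chunks

if __name__ == "__main__":
    sample = "How hard would it be to write a script which would take the whole body"
    for query in phrase_chunks(sample):
        print(query)
```

Four-word phrases are short enough that many hits will be coincidences, which is the false-positive problem raised above; longer phrases cut the noise at the cost of missing lightly edited copies.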
Although you poo-poo it, troutfishing, this seems like a very resource intensive task, more than just diskspace. You're talking about a query that incorporates every word you've ever written here, cut up into google-able chunks the size of 3-4 words, I'm guessing. You're going to run that against the google cache every day, or every few days, and then you're going to -- probably manually -- go through the hits weeding out false positives. You're also going to scrape MeFi for your future writings, I guess, and add that to your code in some automated fashion, again breaking up into chunks.
If you're so bent on seeing your work in print it would be a lot easier to distill your wisdom to a 500 word op-ed piece and submit it around. :)
(of course, you're thinking about copyright)
posted by luser at 11:31 AM on February 3, 2004
Maybe you could go at the problem backwards: Get something like EVE or sign up for a plagiarism-detection service, and feed your writings into it as if they were papers you suspect of being plagiarized. If the software reports that something you wrote may have been plagiarized from a particular online source, then you can investigate whether the reverse is true.
posted by staggernation at 12:06 PM on February 3, 2004
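Whichever tool turns up a candidate page, the follow-up check could look something like this. It is a rough sketch in Python, not how EVE or any plagiarism service works internally: it measures what fraction of your five-word runs also appear in the other text. `shingles` and `overlap_fraction` are hypothetical names, and fetching the candidate page is assumed to happen elsewhere.

```python
import re

def shingles(text, size=5):
    """Return the set of all runs of `size` consecutive words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap_fraction(mine, candidate, size=5):
    """Fraction of my word runs that also occur in the candidate text."""
    own = shingles(mine, size)
    if not own:
        # Text shorter than `size` words has nothing to compare.
        return 0.0
    return len(own & shingles(candidate, size)) / len(own)

if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    copied = "He wrote that the quick brown fox jumps over the lazy dog again."
    print(overlap_fraction(original, copied))
```

A score near 1.0 means most of the passage appears word for word in the candidate; where to set the cutoff for a closer look is a judgment call.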
Another thought: copy-n-paste all of your "significant" posts on MeFi and publish them on your own site, with appropriate Creative Commons (or similar) license. Won't stop someone from publishing your work, but it may provide some options should you choose to go after them.
posted by davidmsc at 12:35 PM on February 3, 2004
"You're talking about a query that incorporates every word you've ever written here"
Not quite. You only need to try the things you've written in the last 30 days. If someone doesn't get it by then, they probably never will.
And you could take your task out of iteration hell by removing, for instance, the 1000 most common words from each post and then searching for the words that remain. I'd guess that if a website matched that list there would be a good chance that it drew from your post. You could then recheck that site against the exact text.
posted by y6y6y6 at 12:51 PM on February 3, 2004
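The common-word filter suggested above might be sketched like this, assuming Python. `COMMON_WORDS` is a tiny stand-in for a real list of the 1000 most common English words, which would have to be supplied separately; `distinctive_terms` and `limit` are hypothetical names.

```python
import re

# Stand-in only: a real run would load a list of about 1000 common words.
COMMON_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
    "is", "it", "that", "this", "with", "as", "be", "by", "my", "i",
}

def distinctive_terms(text, common=COMMON_WORDS, limit=8):
    """Return up to `limit` uncommon words from text, in order of first appearance."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = []
    for word in words:
        if word not in common and word not in kept:
            kept.append(word)
    return kept[:limit]

if __name__ == "__main__":
    post = "I'll hire small armies of hungry waifs to hawk my wares for thin gruel"
    # One unquoted query per post instead of many exact-phrase queries.
    print(" ".join(distinctive_terms(post)))
```

This gives one query per post rather than one per phrase; pages that match can then be compared against the exact text.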
It's too late. I just stopped by Barnes & Noble and saw a 500-page tome called Fishing for Trout in America. Curious, I opened it and the first thing I saw was:
Metafilter, as a source to journalists for story ideas and also as a reservoir of good writing for plagiarists, has been discussed on a recent Metatalk thread. How could I acquire a script (I don't have much of a clue about this, technically), for my Mac (OS 9.2.2 and OS 10.2.8) to run automated searches looking for matchups on chunks of text from my 500 (or whatever it is) pages of Metafilter commentary? (more inside)
I flipped through the pages and quickly discovered that it contained every word you've written on MetaFilter. To add insult to injury, it was published by Regnery and I suspect the profits will go to right-wing causes. Time to call a good lawyer.
posted by languagehat at 12:55 PM on February 3, 2004
I found out about The Annotated Blonde On Blonde when I idly Googled the titles of the last few threads I posted, with said titles between quotes myself. The first seven or eight words of the text will do in a pinch if you write cleverly enough--just put them between quotes.
posted by y2karl at 10:21 PM on February 3, 2004
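The by-hand trick above is easy to script. A minimal sketch, assuming Python: take the first eight words of a post, wrap them in quotes, and URL-encode the result as a `q` parameter. `opening_phrase_query` is a hypothetical name, and which search URL the encoded string gets appended to is left as an assumption.

```python
from urllib.parse import urlencode

def opening_phrase_query(text, n=8):
    """Return the first `n` words of text as one quoted exact-phrase query."""
    words = text.split()[:n]
    return '"' + " ".join(words) + '"'

if __name__ == "__main__":
    post = "I found out about it when I idly Googled the titles of my threads"
    query = opening_phrase_query(post)
    print(query)
    # Encoded form, ready to append to a search URL of your choosing.
    print(urlencode({"q": query}))
```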
For example, from here, I Googled "Want a nut. Nnn, uh, tuh." Some credit MetaFilter and some thought of the same links on the very same day I posted the thread--isn't that a coincidence!
posted by y2karl at 10:35 PM on February 3, 2004
Response by poster: y2karl - you need to phrase that as : Want a nut. Nnn, uh, tuh.™
That'll scare 'em off.
languagehat - damn that instant, on demand printing. But, as I say, don't get mad, get even!
I'm merely practicing for the Great Crash of 2005 - I'll hire small armies of hungry waifs, children of ex-American IT workers. I'll make those kids wear those big grey caps, you'll see. They'll hawk my wares, for thin copper and thin gruel - "Get yrrr daily troutfishing.....Get yrrr daily...."
I know they're plotting against me already. Some things never change - these humans, so relentlessly upwardly mobile.....
posted by troutfishing at 9:26 PM on February 4, 2004
This thread is closed to new comments.
posted by yerfatma at 9:32 AM on February 3, 2004