Here a Turk. There a Turk. Everywhere a Turk Turk
August 12, 2012 10:04 PM

Have you ever used Mechanical Turk to validate, classify, and clean data sets that would otherwise be mind-numbingly tedious, if not impossible, for one person to do? If so, were the results useful? What are some tips to make sure one's Mechanical Turk experience goes off without a hitch? How do I best check the workers' output?

Specifically, I need to take a list of misspelled words, slang, brand names, colloquialisms, and similarly malformed words and provide, where they exist, both the corrected word and the intended part of speech. Since it may be hard to verify the accuracy of the final product, is it a good idea to have multiple people score the list, effectively pitting the results against each other?
posted by Nanukthedog to Technology (7 answers total) 10 users marked this as a favorite
 
Will this be only the list of words, or will you show the words in context?
posted by exphysicist345 at 10:38 PM on August 12, 2012


Best answer: I've used MT to run psychological experiments.

If you're having individual people provide answers to multiple questions, one way to verify accuracy is to include a number of "check" questions that are super obvious and you already know the answer for (e.g., "what letter does the word 'apple' start with?" or whatever). Then exclude all data from anybody who gets those questions wrong. This will at least eliminate people who are just responding mindlessly, especially if your check questions look similar in format to the actual questions (i.e., aren't obviously check questions).
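
A minimal sketch of that screening step in Python, assuming each worker's answers come back as a dict keyed by question id (all the names and data here are made up):

# Known answers for the embedded check questions.
CHECK_ANSWERS = {
    "check_1": "a",     # "What letter does the word 'apple' start with?"
    "check_2": "noun",  # another deliberately obvious item
}

def passes_checks(worker_answers):
    """True only if the worker got every check question right."""
    return all(
        worker_answers.get(qid, "").strip().lower() == expected
        for qid, expected in CHECK_ANSWERS.items()
    )

# Toy data: worker_2 failed a check, so their answers get dropped.
all_answers = {
    "worker_1": {"check_1": "a", "check_2": "noun", "q1": "apples"},
    "worker_2": {"check_1": "b", "check_2": "noun", "q1": "appels"},
}
clean = {w: a for w, a in all_answers.items() if passes_checks(a)}
print(sorted(clean))  # ['worker_1']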

Getting multiple (3 or more) answers from different people per question is also good; this is in fact necessary if it's the sort of question for which multiple answers are correct (that way you get a good sense of the distribution of possibilities). This is very effective and, given how cheap MT is, not too expensive. If even at MT prices it's too much to get that many answers, one thing you could do is get just two people to answer each question, and then for the ones where those two answers differ, ask a third person.
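
And a sketch of that two-answers-then-tiebreaker idea (get_another_answer is a stand-in for however you actually collect a response):

from collections import Counter

def majority(answers):
    """Return the answer at least two people agree on, else None."""
    top, count = Counter(answers).most_common(1)[0]
    return top if count >= 2 else None

def get_another_answer(word):
    # Stand-in: in reality this comes from a new MT assignment.
    return "noun"

word, answers = "hooker", ["noun", "verb"]  # first two workers disagree
result = majority(answers)
if result is None:
    answers.append(get_another_answer(word))  # ask a third person
    result = majority(answers)
print(result)  # 'noun'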

Finally, another good technique I've heard of: before people begin, present them with the instructions and then ask a multiple-choice question or two to make sure they understood them. Either don't let people who got it wrong continue, or require them to answer correctly before they can proceed. This will help weed out the people not paying attention, and also make sure they know what you are asking them to do.
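
The gate itself can be as simple as checking the quiz before showing any real items; a toy sketch (the quiz content here is invented):

COMPREHENSION_QUIZ = {
    "If you can't identify a word at all, what should you enter?": "unknown",
    "Should you guess a part of speech for a brand name?": "no",
}

def may_continue(quiz_answers):
    """Let a worker proceed only if every comprehension question is right."""
    return all(
        quiz_answers.get(q, "").strip().lower() == a
        for q, a in COMPREHENSION_QUIZ.items()
    )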
posted by forza at 10:41 PM on August 12, 2012 [4 favorites]


Response by poster: Initially, individual words only, no context. Getting the part(s) of speech for single words is important. If a worker can't identify a word, or the slang is unclear or untranslatable, that would itself constitute a valid classification. I may also present words which resist classification in later datasets in an n-gram format.
posted by Nanukthedog at 10:56 PM on August 12, 2012


This doesn't directly address your question, but are you familiar with Google Refine?
posted by adamrice at 7:27 AM on August 13, 2012 [1 favorite]


Maybe you could collect the codification in delimited cells, and then require a sentence using the word or term in context. That way you could get your statistics quickly, and use the contextual sentence for verification.

Appels:Apples, plural, noun, tree fruit | I buy my apples at an orchard stand on the highway.

Tide: singular, brand name, detergent | I wash my clothes with Tide.

Hooker: slang, noun, prostitute | She dresses like a cheap hooker.
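
If you collect answers in that one-line format, they're easy to split back apart afterwards. A sketch in Python, assuming the field layout shown above:

def parse_response(line):
    """Split 'word: tag, tag, ... | example sentence' into parts."""
    head, _, sentence = line.partition("|")
    word, _, tags = head.partition(":")
    return {
        "word": word.strip(),
        "tags": [t.strip() for t in tags.split(",")],
        "sentence": sentence.strip(),
    }

print(parse_response("Hooker: slang, noun, prostitute | She dresses like a cheap hooker."))
# {'word': 'Hooker', 'tags': ['slang', 'noun', 'prostitute'],
#  'sentence': 'She dresses like a cheap hooker.'}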
posted by halfbuckaroo at 7:40 AM on August 13, 2012


Best answer: I also run psychology experiments on Mechanical Turk. Halfbuckaroo is describing an alternative to forza's check-question strategy for making sure that you are getting considered answers rather than random clicks. I've gone back and forth. Requiring people to do a 'human' task like writing a sentence (alone or in addition to multiple choice) can sometimes give you more useful information, and stronger confidence that people are doing the task you think they are doing, which could be good if you've never done something like this before. But Forza's method scales up much better, assuming that you automate the process of screening for the obvious answers and for checking inter-coder agreement, and seems more appropriate for this kind of task.

You may also have more luck with your coding project if you can split some of your questions into simpler, more specific HITs (tasks). Depending on how much noise you can tolerate in your data, you might have better luck asking one group of people to look for misspellings and then a second group to guess the part of speech and report brand names and slang. (As a side note, I'd be prepared for people to be slightly lousy at giving parts of speech for words in isolation - Forza's suggestion to aim for distributions is good.)
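
For instance, a sketch of fanning one word list out into two input files, one per HIT type (the CSV layout is just an example):

import csv

words = ["appels", "gonna", "Tide"]

def write_hit_input(filename, question):
    """One CSV per HIT type; each word becomes one row/HIT."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "question"])
        for word in words:
            writer.writerow([word, question])

write_hit_input("hits_spelling.csv",
                "Is this word misspelled? If so, give the correct spelling.")
write_hit_input("hits_pos.csv",
                "Give the part of speech, or mark the word as a brand name or slang.")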


If you're just starting out on MT, I also recommend stalking some of the user forums (r/Hitsworthturkingfor, Turkopticon ratings) to get an idea of what users expect. Like anything on the internet, there are cultural norms and ways of acting that piss people off, and learning some of those before you start out can save you a lot of grief. For instance, if your survey smells like spam, people will avoid it (and it's definitely possible to look accidentally spammy). You may want to be rejecting/not paying for very low-quality data, but you will attract THE RAGE if you do this in a way that users find tricky or dishonest.
posted by heyforfour at 8:08 AM on August 13, 2012 [3 favorites]


Best answer: > You may want to be rejecting/not paying for very low-quality data, but you will attract THE RAGE if you do this in a way that users find tricky or dishonest.

To expand on this, even if the way you reject work is entirely above board, you should set up your questions so that a reasonable person (e.g., yourself, or someone whose brain you trust) can get well over 90% of tasks correct. If you're rejecting more than one out of every ten (or even twenty or fifty) answers, you're basically ensuring that no one will work for you, and, if a worker is new, that they won't work for anyone else either -- since most requesters require a 90% success ratio (if not 95% or 98%) to even look at their tasks and you'll have burned that new worker good.
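
One way to keep yourself honest is to compute your running rejection rate on pilot data before you reject anyone for real (the 10% threshold here just mirrors the rule of thumb above):

def rejection_rate(n_rejected, n_total):
    return n_rejected / n_total if n_total else 0.0

rate = rejection_rate(n_rejected=12, n_total=100)
if rate > 0.10:  # more than one in ten
    print(f"Rejection rate is {rate:.0%} -- fix your instructions, not your workers.")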

The most important thing is to be crystal clear in your instructions, to evaluate work according to those crystal-clear instructions, and to recognize that absolutes are not possible in some areas (e.g., evaluations of sentiment). Plenty of requesters will say very explicitly, "If there is any doubt at all, choose A over B," and then go on to reject people's work with the gloriously inconsistent response, "Although it was somewhat unclear, the correct answer should have been B." If you do this, you'll get tagged (justifiably) as a scammer. There are browser extensions (Turkopticon, for one) for exactly that purpose.
posted by matlock expressway at 10:06 AM on August 13, 2012 [2 favorites]

