Large test data has its answer
October 14, 2009 11:23 PM

What large sets of collected or observed data have been offered such that a hard-to-find solution of some sort is buried in them, with the known solution provided alongside, for the purpose of testing algorithms that find solutions in similar large data sets?

Besides the Netflix prize, of course. I've recently heard about a similar challenge created many decades ago, and provided to military logistics experts working the Knapsack Problem. It describes a chaotic battlefield on the night after the army captures a chunk of territory. When night falls, each fighting unit, with a variety of weapons, sends one guy with a knapsack running back to the field supply depot, and there everyone is loaded up and sent back out carrying just the right amounts -- of whatever ammunition is stockpiled at the depot -- such that the battlefield situation is helped most.

Someone has somehow found exactly the right answer for one fixed set of battlefield conditions, and this data set is used to grade strategies.
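To make the problem concrete, here's a minimal sketch of the classic 0/1 knapsack solved exactly by dynamic programming — the weights and values below are made up for illustration, not the military data set:

```python
def knapsack(capacity, items):
    """Exact 0/1 knapsack via dynamic programming.

    items is a list of (weight, value) pairs; returns the best
    total value achievable without exceeding capacity.
    """
    # best[w] = best value achievable with total weight <= w
    best = [0] * (capacity + 1)
    for weight, value in items:
        # iterate downward so each item is used at most once
        for w in range(capacity, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]

# Hypothetical ammunition loads: (weight, battlefield value)
items = [(3, 4), (4, 5), (2, 3), (5, 8)]
print(knapsack(10, items))  # -> 15
```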

Know of any other such things, in any field?
posted by StickyCarpet to Science & Nature (17 answers total) 6 users marked this as a favorite
 
How about the RSA Factoring Challenge?
posted by Chocolate Pickle at 11:33 PM on October 14, 2009


Best answer: A standard set of benchmark instances of the Traveling Salesman Problem is TSPLIB; it also includes test sets for other related problems.
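To illustrate the kind of pairing such a benchmark provides — an instance plus its provably optimal answer — here's a toy sketch that brute-forces the shortest tour on a tiny made-up instance (real TSPLIB instances are far larger than brute force can handle):

```python
import itertools
import math

def tour_length(order, cities):
    """Total length of a closed tour visiting cities in the given order."""
    return sum(
        math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def optimal_tour(cities):
    """Brute-force the shortest tour; only feasible for tiny instances."""
    n = len(cities)
    # fix city 0 as the start to avoid counting rotations separately
    best = min(
        ((0,) + perm for perm in itertools.permutations(range(1, n))),
        key=lambda order: tour_length(order, cities),
    )
    return best, tour_length(best, cities)

# Made-up coordinates standing in for a benchmark instance
cities = [(0, 0), (0, 2), (2, 2), (2, 0)]
order, length = optimal_tour(cities)
print(order, round(length, 3))  # the square's perimeter, 8.0, is optimal
```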
posted by mhum at 11:35 PM on October 14, 2009 [1 favorite]


Also, here are Eric Taillard's instances of flow shop, job shop, and open shop scheduling. I'm not sure how standard they are but I've seen them referenced more than once and he provides the best known bounds for each instance.
posted by mhum at 11:45 PM on October 14, 2009 [1 favorite]


The CASP protein structure prediction contest might fit the bill. The known solution in this case is an experimentally determined protein structure.
posted by benzenedream at 11:45 PM on October 14, 2009 [2 favorites]


A different one occurs to me, from history:

After Pearl Harbor, the US and UK revealed to each other their deepest, most valuable secrets: what they had accomplished in the way of cracking enemy codes and ciphers. America revealed to the UK that it had the ability to perfectly read the Japanese "Purple" diplomatic cipher, and the UK revealed to the US that it often was able to break into the German "Enigma" system.

The Enigma was an example of a rotor machine. The Americans went into the war with their own version of a rotor machine, and once they learned about the Enigma crack, they began to wonder whether their own system would be sufficiently secure.

So they submitted a very large amount of text enciphered using the American system to the Brits to evaluate. The Brits concluded that the American system was strong enough -- and history shows they were right. The Germans never did break into it.

Obviously the Americans knew what it was they enciphered for the test, so it sounds like the kind of thing you're talking about.
posted by Chocolate Pickle at 11:51 PM on October 14, 2009


How about evolution? Anything living right now that we have data for is a "solution". Fossil records and genomic sequencing provide data sets with prior observations. There are lots of algorithms developed and being refined to, among other things, maximize parsimony and measure species relatedness.
posted by Blazecock Pileon at 11:53 PM on October 14, 2009


Not entirely sure if it's what you're after, but an armada of particle physicists spend their sorry lives doing little else while waiting for the next postponement of the LHC startup.

Physics processes are simulated, then run through a detector simulator to get simulated "raw data", from which one attempts to reconstruct the original physics again. Of course, the most obvious measurement of reconstruction quality is comparing the simulated processes to the reconstructed processes.

Since the simulation is computationally intensive, it is usually done centrally, and working physicists just download ready-made datasets.
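A caricature of that loop, with entirely made-up numbers: a "true" value is smeared by Gaussian detector noise to produce raw data, then reconstructed by averaging, and the reconstruction is graded against the known truth (real detector simulators are, of course, vastly more elaborate):

```python
import random
import statistics

random.seed(42)

TRUE_ENERGY = 100.0          # the simulated physics "truth"
DETECTOR_RESOLUTION = 5.0    # made-up Gaussian smearing width

# Detector simulation: each raw hit is the truth plus noise
raw_hits = [random.gauss(TRUE_ENERGY, DETECTOR_RESOLUTION)
            for _ in range(1000)]

# Reconstruction: estimate the original quantity from the raw data
reconstructed = statistics.mean(raw_hits)

# Quality of reconstruction: compare against the known truth
error = abs(reconstructed - TRUE_ENERGY)
print(f"reconstructed {reconstructed:.2f}, error {error:.2f}")
```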
posted by themel at 12:12 AM on October 15, 2009


There are corpora of spam available for developing spam filters.
posted by Rhomboid at 12:40 AM on October 15, 2009


Best answer: There's about a billion spam datasets you can use. Here's one for web spam from Yahoo. There's also ham datasets. One thing to keep in mind is that in linguistics such a dataset is usually called a "corpus". This also applies to the spoken word: Speech recognition corpus.

In other machine learning contexts, these datasets are called "training sets", and are used to train the program; to prevent "overfitting" the algorithm is also applied to a "validation set". This is the general structure you describe in the Netflix challenge.
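A minimal sketch of that split, on a made-up dataset — train on one part, check for overfitting on the held-out part:

```python
import random

def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Shuffle a dataset and split it into training and validation sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (feature, label) pairs
examples = [(i, i % 2) for i in range(100)]
train, validation = train_validation_split(examples)
print(len(train), len(validation))  # -> 80 20
```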

If you're really smart, you can design computer systems that generate these sorts of things for you. Google famously tracks users fixing their own spelling mistakes for use in "did you mean?".

Finally, there are rainbow tables. Generally speaking, passwords are stored in computers as hashed values. Hashes are supposed to be one-way; you're not supposed to be able to calculate the password from the hash. Rainbow tables are a gigantic dataset for looking up a string that hashes to a given hash value, and therefore fit your criterion of problems with a known answer.
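A true rainbow table compresses chains of hashes to save space, but the underlying idea is a precomputed inverse lookup; a naive sketch over a toy password list:

```python
import hashlib

def sha256_hex(password):
    """Hash a password the way a (careless, unsalted) system might store it."""
    return hashlib.sha256(password.encode()).hexdigest()

# Precompute hash -> password over a candidate list (the expensive step)
candidates = ["password", "letmein", "hunter2", "qwerty"]
lookup = {sha256_hex(p): p for p in candidates}

# Later, "cracking" a stolen hash is a constant-time dictionary lookup
stolen_hash = sha256_hex("hunter2")
print(lookup.get(stolen_hash))  # -> hunter2
```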
posted by pwnguin at 12:57 AM on October 15, 2009 [1 favorite]


Best answer: There's also reCaptcha, which pairs known and unknown OCR challenges to build a large OCR corpus.
posted by pwnguin at 1:11 AM on October 15, 2009 [1 favorite]


Best answer: I hate to spam an AskMe, but here's another handy page of data sets from Luis von Ahn, author of an awesome set of games designed to encourage people to generate them. In the ESP game, for example, you're paired with an anonymous person and given an image every round. You "win" the round when both you and your partner guess a common word for the image. Google picked this up for their image search, as you can imagine.

GWAP has a number of these sorts of games for other domains like music and video. The ESP game's design is called "output agreement": you win when your outputs match. The original game had a flaw, though; people established a standard guessing order of words unlikely to appear in images, and this meme meant players would guess the same ten words over and over for high scores.

Anyways, there are published papers for all this stuff. I hope you've enjoyed my excessive writing, and I'll leave you with a final thought: the "story" you shared about the Knapsack problem is likely a fabrication. The Knapsack problem is known to be NP-complete and well defined. NP-complete problems don't usually admit a perfect and fast solution. So what you can do is implement a perfect and slow solution to compute some problems, and compare the output with those of imperfect and fast solutions. If you can get within 90 percent of perfect, that's sometimes an acceptable tradeoff.
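That grading scheme is only a few lines of code — everything below is made up for illustration: an exact exponential-time solver is run on a small instance, and a fast greedy heuristic is scored against it on the same data:

```python
from itertools import combinations

def exact_knapsack(capacity, items):
    """Perfect but slow: try every subset of items (exponential time)."""
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for w, v in subset)
            value = sum(v for w, v in subset)
            if weight <= capacity:
                best = max(best, value)
    return best

def greedy_knapsack(capacity, items):
    """Imperfect but fast: take items by value density until full."""
    total = 0
    for weight, value in sorted(items, key=lambda wv: wv[1] / wv[0],
                                reverse=True):
        if weight <= capacity:
            capacity -= weight
            total += value
    return total

# An instance chosen so the greedy heuristic falls short of optimal
items = [(6, 10), (5, 8), (5, 8)]
exact = exact_knapsack(10, items)
approx = greedy_knapsack(10, items)
print(f"greedy achieves {approx}/{exact} of the optimal value")
```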

Newer designs use "input agreement". Instead of winning when you both produce the same unseen output, you win the round when you and your partner correctly agree, based on each other's outputs, on whether your inputs are identical. Much more robust, but perhaps a bit slower-paced a game.
posted by pwnguin at 1:31 AM on October 15, 2009 [1 favorite]


Response by poster: Thanks everyone! My selections might not be the actual best answers, but they are the ones that most helped me advance my inquiry.
posted by StickyCarpet at 4:05 PM on October 15, 2009


Response by poster: Blazecock Pileon: How about evolution? Anything living right now that we have data for are "solutions".

This is probably the perfect example, except when you're trying to get funding based on your prediction that an elephant will become an elephant.


benzenedream: CASP protein structure prediction
Chocolate Pickle: How about the RSA Factoring Challenge?

These two answers helped me better understand what I was trying to ask. These seem to be a matter of finding something that is already embedded, whereas the Knapsack Problem entails designing a creative response to the data.
posted by StickyCarpet at 6:53 PM on October 15, 2009


I'd argue the Knapsack Problem is straightforward to solve mathematically, while protein folding is currently only verifiable by experimental observation, and therefore protein folding requires the greater dose of creativity.
posted by pwnguin at 10:26 PM on October 15, 2009 [1 favorite]


Response by poster: OK, pwnguin, let me further try to formulate the question I'm trying to ask.

It's true that I have no idea what the goal of the protein folding is, but the distinction in my mind is not so much how hard it is to find a solution, but the idea that the soldiers have a big set of data and are trying to understand how that can help them solve a separate problem -- winning the battle.

It's this idea that someone has a personal goal external to the data, and is trying to parse and fit the data for their own purposes, rather than discover something hiding within the data.

Traveling Salesman fits the bill, except it's such a canonical problem that it lacks pizazz.
posted by StickyCarpet at 3:18 PM on October 16, 2009


Best answer: Traveling Salesman and Knapsack are both optimization problems. It turns out that a number of these sorts of problems are both NP-complete and have real-world applications; anywhere you see substantial debate about which algorithm performs best, it's usually an NP-complete problem being solved. Operating systems have a number of algorithms that try to cope with incomplete information and the exponential cost of truly "solving" the problems they face in scheduling and allocation.

I still think this soldier example is a hypothetical that's never been implemented in practice. If you can cite it, that might help me get past my disbelief in the practical value you seem to cherish.

But rereading your original question, the word logistics reminds me of an AI challenge involving supply chains in market-oriented systems. Apparently it's become very organized since I last looked.
posted by pwnguin at 5:16 PM on October 16, 2009 [1 favorite]


I present ExpDB, a website I just found on the subject of Machine Learning experiments, including datasets. Apparently it's been around long enough that my ignorance of its existence should disqualify me as an expert.
posted by pwnguin at 6:12 PM on November 29, 2009

