Details of the German Tank Counting Method.
September 19, 2007 9:38 PM

Details of the German tank counting method (as seen on reddit)?

I'm interested in knowing more about the statistics (or at least the history) behind the story of the German tank counting method. More information (and a citation that is useless to me, as I don't have access to the journal) here.
posted by tehgeekmeister to Education (13 answers total)
 
I believe it had to do with the serial numbers that the manufacturers stamped on the tanks. By seeing only a few tanks, one could extrapolate all the numbers in between.
posted by chrisamiller at 10:04 PM on September 19, 2007


Here's a post talking about some of the details. Collecting a few stories about this would make a good post in the blue, I imagine. It's fascinating stuff.
posted by chrisamiller at 10:06 PM on September 19, 2007


I rewrote this a couple times, but here is the best way to explain it without getting too mathematical.

We are making the following assumptions:
- The serial numbers are meaningful and accurate. That is, the Germans (or Google) don't just hand out serial numbers willy-nilly. Also, the serial numbers are sampled without replacement.
- The serial numbers are unbiased. That is, certain serial numbers aren't more likely to be included than others.
- The found serial numbers follow a Gaussian distribution with all its implications, including finite variance.
- The highest serial number is less than or equal to the total number produced. That is, we won't see serial number 1500 when there are only 1498 tanks.

Then, knowing all this, you take the lowest number, the greatest number, and the computed variance, and apply statistical analysis and calculus (which we won't get into here; google "minimum variance unbiased estimator" for the calculus, as I am sure there is a lot on it). This gives what the population estimate should be, based on the variance of the serial numbers, given that the observed sample is random (that is, in the Gaussian sense).

I don't know if this would work with Google's servers. It certainly might, and it is interesting, but there is nothing to say that the observed sample is both unbiased and of finite variance without replacement. Simply put, a lot of assumptions are made that aren't verified.
posted by geoff. at 10:26 PM on September 19, 2007


I should add that there are many ways to estimate N from the sample population, including t-distributions. Because we have no idea how or why Google numbers its servers, there is no reason to assume that the numbering is normally distributed. Given that Google uses Markov-state switching for its PageRank, I would not be surprised if they used some weird, non-normal distribution for this, for reasons we might not realize but that make sense when designing the system.

I could just be overthinking this, though.
posted by geoff. at 10:30 PM on September 19, 2007


Suppose the tanks (or servers) are numbered 1, 2 ... n, where n is the total number of tanks (or servers). When you observe the serial numbers of a few (k) tanks, you look to see what the highest one is. Let's suppose that number is m. That obviously implies that the number of tanks is at least m, but in fact, in all likelihood, there are more than m tanks.

Think of what happens when you randomly pick one number from 1 to n. The average number you'll pick will be (n+1)/2. So, if the one number which you saw was m, your best estimate of n is the solution to m=(n+1)/2, which means that n = 2m-1 and the number of tanks is 2m-1 (the -1 is there because the numbering starts at 1, not 0).

Now, imagine that you have captured several (k) tanks, each one chosen uniformly (no tank has a higher probability of being captured than others, and capturing one tank doesn't make capturing others any less likely). If the true number of tanks is n, the highest serial number observed (m) will, on average, be (n+1)*k/(k+1), and so the most reasonable estimate for n will be m*(k+1)/k - 1. I may be messing up some of the constants here (i.e., it may not be -1 but +1 somewhere), but this is the idea.
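
A minimal Python sketch of the estimate m*(k+1)/k - 1 described above. The function names and the example serial numbers are made up for illustration. The check enumerates every possible sample for a small n and averages the estimate, which assumes serials run 1..n with no gaps and that each set of k distinct tanks is equally likely to be captured.

```python
from fractions import Fraction
from itertools import combinations


def estimate_total(serials):
    """Estimate n from observed serial numbers as m*(k+1)/k - 1.

    Hypothetical helper: assumes serials run 1..n with no gaps and
    that the sample is uniform, with no tank captured twice.
    """
    k = len(serials)
    m = max(serials)
    return Fraction(m * (k + 1), k) - 1


def average_estimate(n, k):
    """Average the estimate over every possible k-subset of 1..n."""
    samples = list(combinations(range(1, n + 1), k))
    return sum(estimate_total(s) for s in samples) / len(samples)


# Made-up example: four captured tanks, highest serial 60.
print(float(estimate_total([19, 40, 42, 60])))  # 60 * 5/4 - 1 = 74.0

# Averaged over all possible samples, the estimate equals the true n.
print(average_estimate(n=20, k=3))  # 20
```

With k = 1 the same function reduces to the 2m-1 estimate from the single-tank case.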

We make the following assumptions: a) the tanks are numbered 1 ... n and no serial number is skipped. b) each tank has the same probability of being captured. c) Tanks are captured independently of each other. Some of these assumptions are somewhat easy to relax; if we think the tanks are numbered starting with some number that's not 1, we can estimate the starting point as well. However, the uniform sampling and independence assumptions are practically impossible to get rid of, because that's what gives you the power to say anything about the total number of tanks.
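
One way to handle an unknown starting number (an assumption here, not something spelled out above) is to work from the spread between the lowest and highest observed serials. With k samples from N consecutively numbered tanks, max - min averages (N+1)*(k-1)/(k+1), so N can be estimated as (max - min)*(k+1)/(k-1) - 1. A rough Python sketch with hypothetical names and made-up numbers:

```python
from fractions import Fraction
from itertools import combinations


def estimate_total_unknown_start(serials):
    """Estimate the count N when serials run a..a+N-1 and a is unknown.

    Uses only the observed spread, so it needs at least two serials.
    """
    k = len(serials)
    if k < 2:
        raise ValueError("need at least two serial numbers")
    spread = max(serials) - min(serials)
    return Fraction(spread * (k + 1), k - 1) - 1


def average_estimate(start, count, k):
    """Average the estimate over every k-subset of start..start+count-1."""
    population = range(start, start + count)
    samples = list(combinations(population, k))
    return sum(estimate_total_unknown_start(s) for s in samples) / len(samples)


# Made-up numbering that starts at 1001 with 15 units in total.
print(average_estimate(start=1001, count=15, k=4))  # 15
```

The uniform sampling and independence assumptions are still doing all the work; only the "starts at 1" assumption has been dropped.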
posted by bsdfish at 1:37 AM on September 20, 2007


Incidentally, this is why your e-commerce store should randomize order numbers. Otherwise competitors can do the same analysis on your # of sales.
posted by smackfu at 5:29 AM on September 20, 2007


We did this with continuous distributions (to death) in math stat.

The likelihood function for drawing k x's from n without replacement is fact(n-k)/fact(n) if max(x)==k, so the maximum likelihood estimate n_hat is just max(x). I'd have to think a bit about a least MSE or UMVUE, but I recall the other options not being so different.
posted by a robot made out of meat at 6:00 AM on September 20, 2007


HTML ate my math with less thans in it. The key bit: the likelihood function is fact(n-k)/fact(n) if max(x) <= n, and zero otherwise. So L is ONLY a function of max(x) and is strictly decreasing for all n_hat (the guess at n) >= max(x), so the maximum likelihood estimate n_hat is just max(x).
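
A small Python sketch of that likelihood argument, assuming serials run 1..n and the k observations are drawn without replacement. The likelihood is fact(n-k)/fact(n) when no observed serial exceeds n and zero otherwise. The function name, the serial numbers, and the search range of candidate n values are all made up for illustration.

```python
from fractions import Fraction
from math import factorial


def likelihood(n, serials):
    """Probability of this ordered sample drawn without replacement from 1..n."""
    k = len(serials)
    if max(serials) > n:
        return Fraction(0)  # impossible: an observed serial exceeds n
    return Fraction(factorial(n - k), factorial(n))


observed = [19, 40, 42, 60]  # made-up serial numbers
candidates = range(1, 201)   # arbitrary search range for n

# The likelihood is zero below max(observed) and shrinks above it.
mle = max(candidates, key=lambda n: likelihood(n, observed))
print(mle)  # 60, the largest observed serial
```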
posted by a robot made out of meat at 6:06 AM on September 20, 2007


tehgeekmeister, if you would like to read the article of the citation in full, my e-mail is in my profile.
posted by lioness at 8:39 AM on September 20, 2007


I have nothing to add, but I'll 2nd the call to have this stuff FPPed. Fascinating!
posted by cowbellemoo at 8:41 AM on September 20, 2007


Very short summary of the article:

“Part I of this article describes the historical development and problems of a technique of economic intelligence which sought to overcome the basic inadequacies of other types of intelligence. This technique involved analyzing the markings found on enemy equipment in order to obtain useful information about German armaments production.

In Part II, the reliability of the estimates achieved by this analysis have been assessed on the basis of official German production records which have since become available.

The first product to be analyzed by this technique was tires, then tanks, trucks, guns, flying bombs and rockets. Aircraft markings were not studied by the Economic Warfare Division, since, by previous agreement, the British Air Ministry bore the responsibility for all estimates on aircraft production.”

The article is very detailed about the technique used and gives clear examples, but is unfortunately too long to summarize.
posted by lioness at 3:33 AM on September 21, 2007


For people interested in the statistics:
Goodman (1952). “Serial Number Analysis,” Journal of the American Statistical Association, 47:622-634, which is a follow-up study on Ruggles & Brodie (1947).
posted by lioness at 4:23 PM on September 21, 2007


This topic was discussed in my statistics class just recently. Here is the handout: http://pages.pomona.edu/~jsh04747/courses/math152/tanks_all.pdf
Unfortunately, it does require a good amount of knowledge of beginning statistics. Particularly interesting are the graphs in the back that show the accuracy of a variety of different methods that could be used to predict the total number of tanks produced, given some random subset of the serial numbers that you find.
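
For a rough feel of how such a comparison can be done, here is a Python sketch that computes the exact bias and mean squared error of three estimators by enumerating every possible sample. The three estimators are common textbook choices picked as assumptions for illustration; they are not necessarily the methods in the handout, and n=30, k=4 are arbitrary.

```python
from fractions import Fraction
from itertools import combinations


def exact_bias_and_mse(estimator, n, k):
    """Exact bias and mean squared error over all k-subsets of 1..n."""
    samples = list(combinations(range(1, n + 1), k))
    errors = [estimator(s) - n for s in samples]
    bias = sum(errors) / len(samples)
    mse = sum(e * e for e in errors) / len(samples)
    return bias, mse


# Illustrative estimators (assumed, not taken from the handout).
estimators = {
    "max": lambda s: Fraction(max(s)),
    "2*mean - 1": lambda s: Fraction(2 * sum(s), len(s)) - 1,
    "max*(k+1)/k - 1": lambda s: Fraction(max(s) * (len(s) + 1), len(s)) - 1,
}

results = {
    name: exact_bias_and_mse(est, n=30, k=4)
    for name, est in estimators.items()
}

for name, (bias, mse) in results.items():
    print(f"{name:>16}: bias={float(bias):+.2f}  mse={float(mse):.2f}")
```

The plain maximum always undershoots or hits n, so its bias is negative; the other two are unbiased but differ in how widely they scatter.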
posted by vegetableagony at 11:46 PM on October 9, 2007

