# Normal, poisson with a twist of outliers.

December 9, 2007 10:48 PM

Do foot sizes follow a normal distribution?

You could argue that, according to the Central Limit Theorem, and given that there's six billion people in the world, it will tend to a normal distribution. The issue is that foot size is not an independent random variable due to genetic differences. However, I'd guess that the variance between groups of people is not terribly high. Note also that foot size is close enough to being identically distributed, so qualitatively, it's close enough.

posted by spiderskull at 11:42 PM on December 9, 2007

That Google result was for Japan, which is ethnically homogenous. I suspect that an equivalent chart for the US would be lumpy, with a small/narrow peak for Chinese/Japanese/Central-American and a larger/wider peak for whites and blacks.

That'll be the case for any kind of clothing.

posted by Steven C. Den Beste at 11:44 PM on December 9, 2007

My guess is, like many natural distributions, it will be unimodal with a long rightward tail -- if the average is a size six, then there are many size twelves versus size zeros (!) and even more size thirteen+ than size <0. Shaq may wear a size 23 but even Tattoo wears a size positive.

posted by Rumple at 11:52 PM on December 9, 2007

**Steven C. Den Beste**, I would be very surprised to see such a result, if you plotted it all out. It's likely more as **spiderskull** said, simply because foot size is a continuous biological measurement. The shape of the curve in this case is probably going to depend more on how big and how randomly selected your sample is. And so long as Japan isn't populated with robot clones, there's still going to be a Normal distribution.

posted by zennie at 12:07 AM on December 10, 2007

**spiderskull**, people's foot sizes aren't random variables, they are fixed quantities, so the central limit theorem doesn't apply.

**Rumple** makes a good point about the differences in the tails. From what Rumple says, it is also easy to conclude that even if we do not know exactly what distribution foot sizes follow, it is certainly not normal, because the tails are not symmetric. The largest feet entry on this page, together with the fact that foot sizes are positive, kind of proves it's not symmetric unless the average foot size is larger than a U.S. size 14.

posted by louigi at 12:08 AM on December 10, 2007

louigi -- it depends entirely on how you set up your model, but in this case it's obvious to me. Foot size *is* a random variable for all intents and purposes. There's no deterministic equation involved. If I define my event space as the world's population, and the random variable is a mapping of each person to a foot size (on the real line), then it's a perfectly valid random variable. From there, you can construct a distribution, which I say is probably normal.

posted by spiderskull at 2:02 AM on December 10, 2007

SCDB -- I didn't really think of it that way, but I think you may be right about having small "lumps". I guess the amount of genetic diversity within your sample space is essential here. In the case of defined ethnic groups, it's going to be a superposition of normal curves (i.e. as SCDB said, one for Chinese, one for Anglo, and so on).

posted by spiderskull at 2:12 AM on December 10, 2007
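
Whether a superposition of normal curves actually looks lumpy depends on how far apart the group means sit relative to the spread. A rough sketch of that idea follows. The group means, standard deviations and weights are made-up numbers for illustration only, and `count_modes` is a hypothetical helper that counts local peaks on a grid. For two equal-weight components with the same spread, the usual result is that two humps appear only when the means are more than about two standard deviations apart.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, components):
    # components: list of (weight, mean, sd) with weights summing to 1
    return sum(w * normal_pdf(x, mu, sd) for w, mu, sd in components)

def count_modes(components, lo, hi, steps=2000):
    # Evaluate the mixture on an even grid and count strict local maxima.
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, components) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Hypothetical foot lengths in cm (assumed values, not survey data).
close_groups = [(0.5, 24.0, 1.2), (0.5, 25.5, 1.2)]  # means 1.25 sd apart
far_groups = [(0.5, 22.0, 1.2), (0.5, 27.0, 1.2)]    # means about 4.2 sd apart

print("peaks, close groups:", count_modes(close_groups, 15.0, 35.0))
print("peaks, far groups:  ", count_modes(far_groups, 15.0, 35.0))
```

With these assumed numbers the close pair blends into a single hump and only the widely separated pair shows two, so a mixed population does not have to produce visible bumps.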

Rumple, no one said the distribution can't have an offset. All you have to do is correct for the lack of size zero shoes by simply offsetting the value. We're not concerned with absolute value here, just the general shape of the curve over *some* range.

posted by spiderskull at 2:15 AM on December 10, 2007

There's some distribution statistics for US women here, with a potential source for more in the text. Worldwide, it's different, and you also have the problem of conversion.

That being said, a size 6 is by no freaking means the average women's size in the US. It's hard to find adult shoes that run less than 6.5, as one of my (Chinese) co-workers is always complaining. However, children's sizes do overlap into 'adult sizes' with different numbers.

posted by cobaltnine at 3:35 AM on December 10, 2007

The plot in this paper seems to indicate so.

Also, you people are abusing the CLT. MEANS of {other restrictions} random variables are normal. You can sample 6 billion times from an exponential distribution and get an exponential distribution. Yes, the tails frequently don't have symmetric or infinite support. It's an approximation.

posted by a robot made out of meat at 4:46 AM on December 10, 2007
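
The point about means versus individual draws is easy to check by simulation. This is a minimal sketch using only the Python standard library. The sample sizes, the seed and the `skewness` helper are arbitrary choices for illustration, not anything from the thread.

```python
import random
import statistics

def skewness(xs):
    # Population skewness: mean of cubed standardized deviations.
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable

# 20,000 individual draws from an exponential distribution (rate 1).
individuals = [rng.expovariate(1.0) for _ in range(20_000)]

# 2,000 sample means, each averaging 100 fresh exponential draws.
means = [statistics.fmean(rng.expovariate(1.0) for _ in range(100))
         for _ in range(2_000)]

print("skewness of individual draws:", round(skewness(individuals), 2))
print("skewness of sample means:    ", round(skewness(means), 2))
```

The individual draws stay strongly right-skewed however many you take, while the sample means are close to symmetric. That is all the CLT promises here.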

*The largest feet entry on this page, together with the fact that foot sizes are positive, kind of proves it's not symmetric unless the average foot size is larger than a U.S. size 14.*

It's not a good idea to use raw American adult shoe size to model this, as it is an interval scale. There are definitely people with what would be a negative size in this system. Better to use a ratio scale measurement like centimeters.

Human height is widely used as an example of a normally distributed quantity in biostatistics textbooks. I have just tested NHANES data on upper leg length, and it appears normally distributed within age groups. While foot size is not measured in that study, I would be shocked if this were not true for that as well.

Steven C. Den Beste is wrong when he says there should be noticeable bumps due to race. There aren't.

posted by grouse at 4:59 AM on December 10, 2007

**spiderskull** -- the central limit theorem is about sums of random variables. Unless I'm mistaken, blue_beetle is talking about data fitting a normal curve; there are no sums involved, so the central limit theorem doesn't apply.

**grouse** -- good point about the scale. Those biggest feet I was talking about earlier were about 18 inches long, incidentally.

But the fact is, it never makes sense to talk about data fitting a normal curve unless the data can, in principle, assume arbitrarily large positive and negative values. If foot size was normally distributed, and there were enough humans, we would be almost certain to see some people with feet of negative length.

posted by louigi at 5:49 AM on December 10, 2007
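
For a sense of scale on "enough humans", here is a small sketch of the tail arithmetic. The mean of 25 cm and standard deviation of 1.2 cm are assumed values picked for illustration, not measurements, and `prob_below_zero` is a hypothetical helper that evaluates the normal tail through the complementary error function.

```python
import math

def prob_below_zero(mean, sd):
    """P(X < 0) for X ~ Normal(mean, sd), via the complementary error function."""
    return 0.5 * math.erfc(mean / (sd * math.sqrt(2)))

# Assumed adult foot length: mean 25 cm, sd 1.2 cm (illustrative only).
p = prob_below_zero(25.0, 1.2)
population = 6e9  # world population figure used in the thread

print("P(length < 0):", p)
print("expected people with negative length:", p * population)
```

Under these assumptions the expected count among six billion people is indistinguishable from zero, so an exactly normal model and a strictly positive one make the same predictions for any population that could exist.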

*it never makes sense to talk about data fitting a normal curve unless the data can, in principle, assume arbitrarily large positive and negative values.*

That's an interesting viewpoint, louigi, but it seems to be at odds with a century of use of the normal distribution in biostatistics to describe quantities that are necessarily positive.

posted by grouse at 6:13 AM on December 10, 2007

*But the fact is, it never makes sense to talk about data fitting a normal curve unless the data can, in principle, assume arbitrarily large positive and negative values.*

It makes perfect sense if you use the blessed weasel word "approximately."

posted by ROU_Xenophobe at 6:39 AM on December 10, 2007

Hmm. I guess biostatisticians and probabilists must disagree about what it means for data to fit a normal curve. Perhaps normal curves are a good enough in many real world situations, even if there are some technical issues that arise if you try to be perfectly precise about what you mean. Or maybe biostatisticians are really talking about truncated normal distributions.

I don't actually think this is too much of a derail; if there's disagreement about what it *means* for data to fit a normal curve, then the answer to the question may well depend on what perspective you take.

posted by louigi at 7:02 AM on December 10, 2007

The Central Limit Theorem applies to a distribution of MEANS from repeated sampling from a population. It says nothing about the distribution of INDIVIDUALS (such as individual feet).

For example, there are six billion people in the world, but that doesn't mean the distribution of something like income is normally distributed. You have to look at a histogram of the data to see what shape the distribution takes.

posted by tiburon at 7:20 AM on December 10, 2007

*Perhaps normal curves are a good enough in many real world situations, even if there are some technical issues that arise if you try to be perfectly precise about what you mean.*

If this were not the case, you could not use the normal distribution to talk about anything in the physical universe, since nothing and no quantity or measurement of the physical universe can be arbitrarily large or small (though the limits are decidedly far out). This is doubly true since physical quantities or measurements often cannot be negative, ruling out half of the real line immediately.

*if there's disagreement about what it means for data to fit a normal curve, then the answer to the question may well depend on what perspective you take.*

I expect you will find that there is not any meaningful disagreement about what it means to analysts as opposed to mathematicians: It is accurate enough to make useful inferences, and not too many mistakes.

posted by ROU_Xenophobe at 7:56 AM on December 10, 2007

That's a lot more smartass than I meant to say.

All I really meant is that the concerns of people examining the normal distribution as an interesting object of study in its own right will not have much in common with the concerns of people using the normal distribution as a tool for making guesses about the real world.

posted by ROU_Xenophobe at 10:38 AM on December 10, 2007

If you had a set of data to test, the thing to do would be to calculate the skewness and kurtosis. These can then be compared with the standard error of skewness and standard error of kurtosis (which can be approximated as sqrt(6/N) and sqrt(24/N), respectively); if the absolute value of either parameter exceeds twice its standard error, the data can be considered to be significantly non-normal. If not, that's strong evidence (but not proof) that the data is approximately normal.
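
A minimal sketch of that rule of thumb, using the sqrt(6/N) and sqrt(24/N) approximations quoted above. The `shape_check` helper is hypothetical, and the two data sets are deterministic stand-ins built from evenly spaced quantiles, not real measurements.

```python
import math
from statistics import NormalDist, fmean

def shape_check(xs):
    """Return (skewness, excess kurtosis, flagged) using the 2-standard-error rule."""
    n = len(xs)
    m = fmean(xs)
    m2 = fmean((x - m) ** 2 for x in xs)
    m3 = fmean((x - m) ** 3 for x in xs)
    m4 = fmean((x - m) ** 4 for x in xs)
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0       # excess kurtosis: 0 for a normal curve
    se_skew = math.sqrt(6 / n)      # approximate standard errors
    se_kurt = math.sqrt(24 / n)
    flagged = abs(skew) > 2 * se_skew or abs(kurt) > 2 * se_kurt
    return skew, kurt, flagged

# Stand-in data: evenly spaced quantiles of each distribution (assumed, idealized).
n = 2000
probs = [(i + 0.5) / n for i in range(n)]
normal_like = [NormalDist(25.0, 1.2).inv_cdf(p) for p in probs]
exponential_like = [-math.log(1.0 - p) for p in probs]

print("normal-like:     ", shape_check(normal_like))
print("exponential-like:", shape_check(exponential_like))
```

The idealized normal-like set passes by construction and the exponential-like set is flagged on skewness alone. Real samples are noisier, so a borderline result deserves a look at a plot as well.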

As for the people pointing out that a normal distribution entails the possibility of negative values, this is true, but I don't believe it significantly affects the usefulness of a normal distribution as an approximation. For example, if Japanese adult males' foot sizes were normally distributed with the mean and standard deviation from the JLIA data given in zixyer's link, less than one out of every 10^123 Japanese adult males would have a negative foot length. Which seems to me to fit the observed data pretty well.

posted by DevilsAdvocate at 10:45 AM on December 10, 2007

But the question is, "Do foot sizes follow a normal distribution?". The answer is, clearly, "no". Do they approximate a normal distribution if you truncate the tails, or are they normal "over some range", a range that does not include all foot sizes? Probably, yes. "Offsetting the value" does not change the shape of the distribution unless coupled to some range truncation. And all of this might not matter, except that it may indeed be the outliers that are of interest and imposition of normalcy can trim the most interesting cases.

Oh, and,

Metafilter: the blessed weasel word "approximately".

On Preview: NOT 10^123IST

posted by Rumple at 10:57 AM on December 10, 2007

Frankly, I think that if you had data to test you'd be best off just throwing up the kernel density of your data and the normal distribution with the same mean and SD and eyeballing them.

*But the question is, "Do foot sizes follow a normal distribution?". The answer is, clearly, "no".*

I don't think it's reasonable to read that question as asking "Are foot sizes distributed exactly normally, without even the slightest deviation from the distribution as specified by Gauss, and allowing for the full possible range of foot sizes including feet that are larger than galaxies and inverted anti-feet with negative length?"

A more reasonable reading of that question will be more along the lines of "Do foot sizes follow a normal distribution, within the bounds that "normal distribution" is normally used to describe actual objects?"

Googling around, I would guess that the answer is: not quite. The things I can find show some noticeable skew with a long tail to the right, but the normal isn't too bad. Whether this skew is a meaningful departure from the normal distribution would depend on what you want to do with the data and what the consequences for fucking up are.

If you're just curious, the answer that pops out of the few things I could find in a moment's googling is "Almost." or "More or less."

posted by ROU_Xenophobe at 11:15 AM on December 10, 2007
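
The kernel-density comparison suggested above can be sketched without any plotting library by printing the two curves side by side. Everything here is an assumption for illustration: the data are simulated rather than real foot lengths, `gaussian_kde` is a hand-rolled helper, and the bandwidth uses the common normal-reference rule of thumb (1.06 times the SD times n to the power -1/5).

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of `data` with Gaussian kernels."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm
    return density

rng = random.Random(1)
# Simulated stand-in for foot lengths in cm; swap in real measurements here.
data = [rng.gauss(25.0, 1.2) for _ in range(500)]

m, s = fmean(data), stdev(data)
h = 1.06 * s * len(data) ** -0.2   # normal-reference rule-of-thumb bandwidth
kde = gaussian_kde(data, h)
fitted = NormalDist(m, s)          # normal curve with the same mean and SD

# Print both curves on a coarse grid for eyeballing.
for k in range(-6, 7):
    x = m + 0.5 * k * s
    print(f"{x:6.2f}  kde={kde(x):.4f}  normal={fitted.pdf(x):.4f}")
```

With skewed data the two columns would drift apart on one side, which is the kind of departure the eyeball test is meant to catch.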

*Frankly, I think that if you had data to test you'd be best off just throwing up the kernel density of your data and the normal distribution with the same mean and SD and eyeballing them.*

When I looked at the upper leg length data, I eyeballed a quantile-quantile plot. I read somewhere today that hypothesis tests for normality are not that powerful, and one is frequently better off using a graphical method.

For my purposes, it was pretty close to normal, but I can see that others may not think this is so at the tails.

posted by grouse at 11:45 AM on December 10, 2007
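
A quantile-quantile comparison can also be reduced to numbers when a plot is not handy. In this sketch the `(i + 0.5) / n` plotting positions are one common convention, `qq_points` and `correlation` are hypothetical helpers, and both samples are simulated stand-ins rather than NHANES data.

```python
import random
from statistics import NormalDist, fmean

def qq_points(data):
    """Pair each sorted observation with the matching standard-normal quantile."""
    xs = sorted(data)
    n = len(xs)
    std = NormalDist()
    return [(std.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def correlation(pairs):
    # Pearson correlation of the Q-Q points; near 1 means a near-straight line.
    mx = fmean(p[0] for p in pairs)
    my = fmean(p[1] for p in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
normal_sample = [rng.gauss(40.0, 3.0) for _ in range(1000)]   # simulated
skewed_sample = [rng.expovariate(1.0) for _ in range(1000)]   # simulated

r_normal = correlation(qq_points(normal_sample))
r_skewed = correlation(qq_points(skewed_sample))
print("normal sample r:", round(r_normal, 4))
print("skewed sample r:", round(r_skewed, 4))
```

A single correlation hides exactly where the points bend away from the line, so it is a supplement to looking at the tails of the plot, not a replacement.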

Ah, my mistake. CLT is indeed for distribution of expectations. I still think it's going to be normal, though. I agree with what zennie said: *foot size is a continuous biological measurement*.

*But the fact is, it never makes sense to talk about data fitting a normal curve unless the data can, in principle, assume arbitrarily large positive and negative values.*

Normal curves are described by a mean and variance (of different degrees sometimes). What you're talking about here are zero-mean curves, but a dataset can still be normal if the mean is, say, 100 with a variance of 10.

posted by spiderskull at 1:17 PM on December 10, 2007

To clarify, I understand that there needs to be very large positive and negative values since the curve doesn't intersect with zero, but since there are not an infinite number of people in the world, at some point we'd need to quantize the graph. The moment you do that, I'd venture that the negative values fall off.

posted by spiderskull at 1:21 PM on December 10, 2007

Wow, I didn't realize that there would be so much debate on this topic. I was wondering about this the other day, because it seemed reasonable that there could be multiple clusters of sizes, and that the distribution might be quite odd. It looks like there might not be enough empirical evidence to answer the question definitively, other than in small homogeneous populations. If anyone has more data (similar to what zixyer linked to) I'd love to see it.

ROU_Xenophobe: I would love to hear more about your "inverted anti-feet with negative length" Can you point me towards a scholarly article on the subject?

posted by blue_beetle at 2:35 PM on December 10, 2007

Shoe manufacturers will have the data. Whether they share it with us, though, is another question.

My guess: foot size heel-to-toe, length measurement, will correlate very well with height. Breadth across the pad of the foot will correlate well with height, and well with weight. Cultural tendencies to go barefoot will broaden feet. There will be some racial variation in height to foot length and height to foot breadth ratios. There will also be gender variation, but this will be mostly swamped by height effects. Right and left foot data will be slightly different; these numbers will again be slightly different for right-hander and left-hander populations. Regarding outliers, the biggest outlier question is whether to treat the absence of a foot, natural or, more commonly, accidental, as a size 0 or the absence of a data point entirely. The second-biggest question is how to treat various rare deformities of the feet - for example, negative foot sizes are, somewhat facetiously, possible (may be disturbing). I think discounting both is the most sensible approach, for any analysis based on averaging.

posted by aeschenkarnos at 3:01 PM on December 10, 2007

Duh. Forgot the biggest one of all: age. Children's feet and hands are larger as compared to the rest of their bodies, than those of adults. Also elderly people have narrower feet due to soft-tissue wasting.

Actually you should be able to find all of this data, including discussions of outliers, in textbooks for podiatry at a university or medical school.

posted by aeschenkarnos at 3:04 PM on December 10, 2007

A useful google term for this is "anthropometry". For example:

Accession Number : ADA126189

Title : Comparative Anthropometry of the Foot.

Descriptive Note : Technical rept.,

Corporate Author : ARMY NATICK RESEARCH AND DEVELOPMENT LABS MA INDIVIDUAL PROTECTION LAB

Personal Author(s) : White,Robert M.

Report Date : DEC 1982

Pagination or Media Count : 323

Abstract : Comparative anthropometric data on the human foot are presented and discussed in detail in this technical report. Since reliable and definitive data on the feet of the U.S. civilian population are lacking, anthropometric data on the feet of the U.S. military population of men and women may be utilized in analyses of footwear sizing. Data are presented for fourteen foot measurements: Foot Length, Instep Length, Foot Breadth, Heel Breadth, Bimalleolar Breadth, Ball of Foot Circumference, Instep Circumference, Heel-Ankle Circumference, Lateral Malleolus Height, Medial Malleolus Height, Ankle Height, Ankle Circumference, Calf Height, and Calf Circumference. These foot measurements are defined and illustrated. Detailed anthropometric data on the feet of U.S. Army men and women are presented in the form of bivariate tables which depict the distribution of various categories of foot sizes and show the interrelationships among foot dimensions. Selected anthropometric data on feet also are presented for a variety of foreign military populations in order to illustrate the range of variation in foot size to be found in different part of the world. In the final section, feet and footwear are examined in terms of the sizing of footwear, and the development of tariffs for footwear is explained with illustrative examples.

Descriptors : *ANTHROPOMETRY, *FEET, MEASUREMENT, SIZES(DIMENSIONS), ARMY PERSONNEL, TABLES(DATA), INDEXES, MALES, WOMEN, FEMALES.

Subject Categories : PERSONNEL MANAGEMENT AND LABOR RELATIONS

Distribution Statement : APPROVED FOR PUBLIC RELEASE

posted by Rumple at 8:50 PM on December 10, 2007

This thread is closed to new comments.

posted by zixyer at 11:02 PM on December 9, 2007