Graphing complicated data
June 1, 2007 3:00 PM
I need help graphically representing some data.
So I’ve got this data. Each item is three digits long. The first digit refers to, say, hair colour (1 brown, 2 blonde, 3 black, 4 white, 5 red), the second digit to hair length (say, to the nearest inch) and the third to whether the hair is curly (1) or not (0). So a short sample of my data set might look like this: 121, 220, 310, 320, 120, 111, 430, 501.
I can quite easily separate the digits from each other (text to columns in Excel) if necessary. Excel & Illustrator are my tools. I do not have a budget to change that. I have high school maths skills. My audience will be middle school teachers. (Obviously the data set isn’t really hair styles.)
How do I graph this? How do I show visually that most of my sample don’t have curly hair, but the shorter the hair the more likely it is to be curly? What are good terms to search for graphing techniques that will narrow down my results?
I do know about this website, http://infosthetics.com/, but what am I looking for, exactly? I did think of colour coding the digits and sorting by most common, but the problem is that the digits in each column have a different meaning. Brown <> 1 inch <> curly.
I'd probably plot hair color on the x-axis and length on the y-axis, with the curly and non-curly data split into two data sets. Curly hair would be colored red and non-curly would be blue. This is easy to do in Excel: sort by column 3 (curly vs. non), then plot only the data that isn't curly. Add a new series and plot the data that is curly.
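For anyone who wants to see the split spelled out, here is a minimal sketch in Python (not one of the asker's tools, so treat it purely as an illustration). It breaks each three-digit code into colour, length and curliness and builds the two series to plot. The name `split_code` is made up for this example; the codes are the sample from the question.

```python
# Sample codes from the question: colour digit, length digit, curly digit.
codes = ["121", "220", "310", "320", "120", "111", "430", "501"]


def split_code(code):
    """Split a three-digit code into (colour, length_in_inches, is_curly)."""
    colour, length, curly = (int(digit) for digit in code)
    return colour, length, curly == 1


# One (colour, length) point per person, kept in two separate series.
curly_points = []
straight_points = []
for code in codes:
    colour, length, is_curly = split_code(code)
    target = curly_points if is_curly else straight_points
    target.append((colour, length))

print("curly:   ", curly_points)
print("straight:", straight_points)
```

Each list is one series: colour number on the x-axis, length on the y-axis.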
posted by muddgirl at 3:15 PM on June 1, 2007
Okay, I'm sorry, I'm not understanding something. Let's say I do it muddgirl's way and put hair colour along the bottom. I end up with 5 bars: brown 3, blonde 1, black 2, white and red 1 each. A normal bar graph. How do I add in the extra information? For the length of hair, 2 people with brown hair have hair that is 1 inch long, and 1 person with brown hair has hair that is 2 inches long. So is the brown hair bar now divided into 2/3 and 1/3, the 2/3 representing the 1 inch and the 1/3 representing the 2 inch? And there are no curls at all there.
(Sorry I'm dense about this.)
posted by b33j at 3:25 PM on June 1, 2007
Does it need to be one graph to rule them all? It may be easier to bring certain correlations into view if you can show multiple graphs. If it turns out that, say, curliness among redheads is strongly correlated to hair length, and you want to demonstrate that, you want one graph showing hair length for redheads on the X axis, population on the Y, and two lines for curly/non-curly (which should make an X shape). Then similar graphs for brunettes, blonds, etc. And remember to keep your scales the same.
Still, if I had to condense it to one graph, I'd probably do a multiple-line graph, where the X axis is hair length, the Y axis is population, and there's a separate line for each hair color/curliness combination.
If you wanted to get fancy, you could do this in quasi-3D, with hair length on the X axis, color on the Z axis, population on the Y axis, and pairs of curly/non-curly for each.
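As a rough illustration of the numbers behind the multiple-line version, this Python sketch counts the population at each hair length for every colour/curliness combination that occurs in the question's sample. The variable names are invented for the example.

```python
from collections import Counter

codes = ["121", "220", "310", "320", "120", "111", "430", "501"]
COLOURS = {1: "brown", 2: "blonde", 3: "black", 4: "white", 5: "red"}

# Count people for every (colour, curly, length) combination.
counts = Counter((int(c[0]), int(c[2]), int(c[1])) for c in codes)
lengths = sorted({int(c[1]) for c in codes})

# One line per colour/curliness pair: population at each hair length.
series = {}
for colour, curly in sorted({(col, cur) for col, cur, _ in counts}):
    label = f"{COLOURS[colour]} {'curly' if curly else 'straight'}"
    series[label] = [counts[(colour, curly, length)] for length in lengths]

print("lengths:", lengths)
for label, values in series.items():
    print(f"{label:16} {values}")
```

Each entry of `series` would become one line on the graph, with `lengths` as the shared X values.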
posted by adamrice at 3:32 PM on June 1, 2007
Well, first of all, you have two categorical variables and one continuous variable. It all depends on what you're trying to show. For your example, you're ignoring hair color; most of your examples will probably ignore one variable, because it's very hard to think about correlations that go three ways. Your unspoken other is occurrence: you want to graph all of these against occurrence in some group. For your example, I'd use a stacked bar graph: I'd put length on the X axis, and total occurrence of that hair length on the Y axis. So say those bars are blue. Then put the number of those people that have curly hair in red over the blue bar. Since your numerical variable is actually discrete, it fits using bars well.
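A small sketch of the counting behind that bar graph, in Python and using only the sample from the question: for each hair length it tallies the total and the curly subset, then prints a crude text bar. The curly share per bar is my own addition, on the assumption that it is the number the asker is really after.

```python
from collections import Counter

codes = ["121", "220", "310", "320", "120", "111", "430", "501"]

total = Counter()   # people at each hair length
curly = Counter()   # of those, how many are curly
for code in codes:
    length, is_curly = int(code[1]), code[2] == "1"
    total[length] += 1
    if is_curly:
        curly[length] += 1

# Text version of the bar chart: '#' = curly, '.' = not curly.
rows = []
for length in sorted(total):
    n_curly = curly[length]
    n_straight = total[length] - n_curly
    share = n_curly / total[length]
    rows.append((length, n_curly, n_straight, share))
    print(f"{length} in | {'#' * n_curly}{'.' * n_straight}  ({share:.0%} curly)")
```

With only eight people the shares mean very little, but the same tally works unchanged on the full data set.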
posted by devilsbrigade at 3:37 PM on June 1, 2007
I have to go out so I won't be able to answer questions for a while.
adamrice, one of the reasons I'm trying to put it on one graph is that there are 11 different groups of data. If I separate it out, that's 22-33 graphs. Secondly, what the data actually represents are children's scores in math, whether they got the answer right or wrong, what kind of wrong answer, and what kind of reasoning they used to get the wrong answer. So the same kind of graph can be applied to all 11 questions, if there is, indeed, a way to show it.
posted by b33j at 3:38 PM on June 1, 2007
Um, I don't think muddgirl was talking about a bar graph. Think points.
Here's a terrible representation:
10 *
8
6 *
4 *
2
0
Brown, Blonde, Black, White, Red
This means that there are three people with brown hair, with lengths of 10, 6 and 4 inches. It also uses only the non-curly set, as muddgirl suggested.
I don't know how to make mefi not swallow a bunch of spaces. Otherwise, there would be more dots, and in other hair colors.
posted by philomathoholic at 3:42 PM on June 1, 2007
Excel & Illustrator are my tools. I do not have a budget to change that.
R is free. R has vastly better graphing capabilities than Excel. OTOH, R is a pain to deal with for newbies.
How do I show visually that most of my sample don’t have curly hair, but the shorter the hair the more likely it is to be curly?
I would do this by displaying the relevant sets of kernel densities, divided by whatever was of interest. That is, one line for the density of length for blondes, one line for the density of length for redheads, etc. Then I'd use a separate figure to show the density for curly and the density for straight. Here's a quick example in R showing that redheads tend to have shorter hair than brunettes do; I couldn't be bothered to set a title or anything. Once I had the data in, generating the figure was a matter of two lines:
# redheads and brunettes are assumed to be numeric vectors of hair lengths
plot(density(redheads), col="red")
lines(density(brunettes), col="black")
You can fake this in Excel by using the histogram functions, but there's substantial beating-into-submission involved. Frankly, I think it would take less time to install R and do this there than it would to beat Excel into doing it.
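If you're curious what a kernel density actually computes, here is a bare-bones sketch in Python using only the standard library. The hair lengths are invented for illustration, `kernel_density` is a made-up helper, and the bandwidth is fixed at 1.0 as an assumption, whereas R's density() picks one for you, so the curves will not match R's exactly.

```python
from statistics import NormalDist

# Hypothetical hair lengths in inches, invented purely for illustration.
redheads = [1, 2, 2, 3, 3, 4]
brunettes = [3, 5, 6, 6, 7, 9]


def kernel_density(sample, grid, bandwidth):
    """Gaussian kernel density estimate of `sample` evaluated at each grid point."""
    kernels = [NormalDist(mu=value, sigma=bandwidth) for value in sample]
    return [sum(k.pdf(x) for k in kernels) / len(kernels) for x in grid]


step = 0.5
grid = [i * step for i in range(-10, 41)]  # -5.0 to 20.0 inches
red_density = kernel_density(redheads, grid, bandwidth=1.0)
brown_density = kernel_density(brunettes, grid, bandwidth=1.0)

# Location of each curve's peak; the redhead curve should peak further left.
red_peak = grid[red_density.index(max(red_density))]
brown_peak = grid[brown_density.index(max(brown_density))]
print("redhead peak:", red_peak, "brunette peak:", brown_peak)
```

Plotting `red_density` and `brown_density` against `grid` gives the two overlaid curves: one small bell curve per data point, averaged together.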
The really important thing is that you have to know, in advance, what information you're trying to get across. You need to slice and dice the information first, so that you know that you're trying to show that blondes have longer hair than redheads, or whatever. If you try to convey a mishmash of information, it's going to be a mishmash of information no matter how whiz-bang your graphs are.
Also, a lot of the questions I think you're interested in are really matters for various multivariate / multiple statistical techniques that I suspect you might not want to mess with.
posted by ROU_Xenophobe at 4:22 PM on June 1, 2007
Assuming you want to show the relationships between length/curls, color/curls, and color/length then you could use three separate graphics of the averages. If you want to show all relationships in one graph it gets complex and may not be easily understood.
The root of your exercise is communication. What do you want to communicate? Presenting data is usually not enough. The graphic has to communicate. A twisted 3D line or surface may make someone wonder WTF? You have to break down "what you have to say" into easily digestible "sentences" and not just have one long "run-on sentence/paragraph".
Since you don't sound like much of a visual person, try to think through what conclusions you want to give your audience. Is there a trend that you can see in the data, or is there absolutely no relationship?
posted by JJ86 at 4:47 PM on June 1, 2007
OK, it is way easier to work with the actual meaningful data than this hair abstraction. When you talk about math scores I think 'histogram', because usually you are interested in seeing how many people did well and how many did poorly and what the distribution is. Moreover you are probably also interested in seeing if there was a common theme among students who did poorly.
(a histogram is just a bar graph where you group the data into 'bins', like 0-20, 20-40, 40-60 etc., and count the number in each bin. you can google 'excel histogram' to figure out how to do this.)
that's step one.
once you have the data in 'bins' you can look for other relationships (slice the histogram bars into percentages or something), but still, as others have said, you need a better idea of what you want to show ahead of time. perhaps you could give a specific example.
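to make the binning step concrete, here is a tiny sketch in Python. the scores are invented for illustration and the bin width of 20 just follows the example bins above; Excel's histogram tool does the same counting.

```python
from collections import Counter

# Hypothetical percentage scores, invented purely for illustration.
scores = [12, 35, 38, 41, 55, 58, 59, 63, 77, 100]
BIN_WIDTH = 20
TOP = 100

bins = Counter()
for score in scores:
    # A score of exactly 100 goes into the last bin (80-100) rather than a new one.
    start = min(score // BIN_WIDTH * BIN_WIDTH, TOP - BIN_WIDTH)
    bins[start] += 1

histogram = [(start, start + BIN_WIDTH, bins[start]) for start in range(0, TOP, BIN_WIDTH)]
for low, high, count in histogram:
    print(f"{low:3}-{high:<3} | {'#' * count}")
```

each tuple in `histogram` is one bar: lower edge, upper edge, and how many scores fell in between.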
posted by PercussivePaul at 12:28 AM on June 2, 2007
I think this would be best as a scatter chart, not as a bar or line chart. Each person would have one dot on the graph. It should look something like this:
Length of hair
____________________________________

Brown  @ @ / /@ / / // /

Blonde  / @ / / / / /// /

Black  @ @ @@ @ / /

White  @ @

Red  @ @ /

@ curly
/ straight
If you've numbered the hair colors you're already partway there: that's one axis, length is another axis, and (as muddgirl said) plot the curly-haired and straight-haired people as separate series.
If your sample set were very large, you would need to do number-crunching of the sort ROU_Xenophobe describes (with error bars showing the average length, and upper/lower quartiles, for each color). But for fewer than 500 people or so, one dot per person should work fine; the human brain alone will be powerful enough to see the patterns in the clumps of dots.
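For what it's worth, a text version of this one-dot-per-person chart can be generated mechanically. The Python sketch below is only an illustration built on the question's eight sample codes, with variable names invented for the example; it puts one row per color and one column per length, using the same @ and / symbols.

```python
codes = ["121", "220", "310", "320", "120", "111", "430", "501"]
COLOURS = {1: "Brown", 2: "Blonde", 3: "Black", 4: "White", 5: "Red"}
MAX_LENGTH = 9  # the length digit can only be 0-9

# grid[colour][length] collects one symbol per person: '@' curly, '/' straight.
grid = {colour: [""] * (MAX_LENGTH + 1) for colour in COLOURS}
for code in codes:
    colour, length, curly = (int(digit) for digit in code)
    grid[colour][length] += "@" if curly else "/"

# Pad every cell to the same width so the columns line up in a fixed-width font.
width = max(len(cell) for row in grid.values() for cell in row) + 1
chart = [
    f"{name:<7}|" + "".join(cell.ljust(width) for cell in grid[colour])
    for colour, name in COLOURS.items()
]
print("\n".join(chart))
print(" " * 8 + "".join(str(n).ljust(width) for n in range(MAX_LENGTH + 1)))
```

In Excel the equivalent is an XY scatter with two series; this just shows how each person maps to a position.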
posted by mpt at 6:59 AM on June 2, 2007
One thing just to keep in mind if you do a scatter chart is that while the human brain can see patterns, it can also be easily misled. If you have a dataset that isn't very clearly correlated, you can seriously fuck up your results if you leave it to intuition (this is why a lot of statistics exists...).
posted by devilsbrigade at 9:24 PM on June 4, 2007
This thread is closed to new comments.
You can easily do this in Excel.
posted by Loto at 3:09 PM on June 1, 2007