Graphing complicated data
June 1, 2007 3:00 PM

I need help graphically representing some data.

So I’ve got this data. Each item is three digits long. The first digit refers to, say, hair colour (1 brown, 2 blonde, 3 black, 4 white, 5 red), the second digit to hair length (say, to the nearest inch) and the third to whether the hair is curly (1) or not (0). So a short sample of my data set might look like this: 121, 220, 310, 320, 120, 111, 430, 501.

I can quite easily separate the digits from each other (Text to Columns in Excel) if necessary. Excel & Illustrator are my tools. I do not have a budget to change that. I have high school maths skills. My audience will be middle school teachers. (Obviously the data set isn’t really hair styles).
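For anyone who would rather script the split than use Text to Columns, here is a minimal Python sketch. Python is not one of the tools named above, and the `decode` helper and `COLOURS` table are names made up for illustration; only the coding scheme is taken from the description of the data.

```python
# Colour coding as described in the question.
COLOURS = {1: "brown", 2: "blonde", 3: "black", 4: "white", 5: "red"}


def decode(code):
    """Split a three-digit code into (colour, length_in_inches, is_curly)."""
    text = f"{int(code):03d}"
    if len(text) != 3:
        raise ValueError(f"expected a three-digit code, got {code!r}")
    colour = COLOURS[int(text[0])]   # first digit: hair colour
    length = int(text[1])            # second digit: length to the nearest inch
    is_curly = text[2] == "1"        # third digit: 1 = curly, 0 = not
    return colour, length, is_curly


sample = [121, 220, 310, 320, 120, 111, 430, 501]
records = [decode(code) for code in sample]

for record in records:
    print(record)
```

Each record is then one row with three separate columns, which is the same shape Text to Columns would give.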

How do I graph this? How do I show visually that most of my sample don’t have curly hair, but the shorter the hair the more likely it is to be curly? What are good terms to search for graphing techniques that will narrow down my results?

I do know about this website but what am I looking for, exactly? I did think of colour coding the digits and sorting by most common, but the problem is that the digits in each of the columns have a different meaning. Brown <> 1 inch <> curly.
posted by b33j to Media & Arts (12 answers total) 3 users marked this as a favorite

Graph the length of the hair on one axis, and on the other the percentage of people with that hair length who have curly hair. Split the data by hair color and connect each color's points with a line, so you'd have 5 different colored lines, each varying with hair length and percent of subjects with curly hair.

You can easily do this in Excel.
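To make the numbers behind that chart concrete, here is a rough Python sketch that works out, for each hair color, the percentage of people at each length who have curly hair. The function name is hypothetical and the eight codes are just the sample from the question; it is an illustration of the calculation, not a replacement for doing it in Excel.

```python
from collections import defaultdict

sample = [121, 220, 310, 320, 120, 111, 430, 501]


def percent_curly_by_length(codes):
    """Return {colour_digit: {length: percent of people who are curly}}."""
    # counts[colour][length] holds [number curly, number in total].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for code in codes:
        colour, length, curly = (int(d) for d in f"{code:03d}")
        cell = counts[colour][length]
        cell[0] += curly
        cell[1] += 1
    return {
        colour: {
            length: 100.0 * n_curly / n_total
            for length, (n_curly, n_total) in sorted(by_length.items())
        }
        for colour, by_length in sorted(counts.items())
    }


table = percent_curly_by_length(sample)
for colour, line in table.items():
    print(colour, line)
```

Each inner dictionary is one line on the chart: lengths along one axis, percentages along the other. With only eight people most lines have one or two points, so the real data set would need to be a good deal larger for the lines to mean much.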
posted by Loto at 3:09 PM on June 1, 2007

I'd probably plot hair color on the x-axis and length on the y-axis, with the curly and non-curly data split into two data sets. Curly hair would be colored red and non-curly would be blue. This is easy to do in Excel: sort by Column 3 (curly vs. non), then plot only the data that isn't curly. Add a new series and plot the data that is curly.
posted by muddgirl at 3:15 PM on June 1, 2007

Response by poster: Okay, I'm sorry, I'm not understanding something. Let's say I do it muddgirl's way and put hair colour along the bottom. I end up with 5 bars: brown 3, blonde 1, black 2, white and red 1 each. Normal bar graph. How do I add in the extra information? For the length of hair, 2 people with brown hair have hair that is 2 inches long, and 1 person with brown hair has hair that is 1 inch long. So is the brown hair bar now divided into 1/3 and 2/3, the 2/3 representing the 2 inch and the 1/3 representing the 1 inch? And there's no curls at all there.

(Sorry I'm dense about this.)
posted by b33j at 3:25 PM on June 1, 2007

Does it need to be one graph to rule them all? It may be easier to bring certain correlations into view if you can show multiple graphs. If it turns out that, say, curliness among redheads is strongly correlated to hair length, and you want to demonstrate that, you want one graph showing hair length for redheads on the X axis, population on the Y, and two lines for curly/non-curly (which should make an X shape). Then similar graphs for brunettes, blonds, etc. And remember to keep your scales the same.

Still, if I had to condense it to one graph, I'd probably do a multiple-line graph, where the X axis is hair length, the Y axis is population, and there's a separate line for each hair color/curliness combination.

If you wanted to get fancy, you could do this in quasi-3D, with hair length on the X axis, color on the Z axis, population on the Y axis, and pairs of curly/non-curly for each.
posted by adamrice at 3:32 PM on June 1, 2007

Well, first of all, you have two categorical variables and one continuous variable. It all depends what you're trying to show. For your example, you're ignoring hair color - most of your examples will probably ignore one variable, because it's very hard to think about correlations that go three ways. Your unspoken other variable is occurrence - you want to graph all of these against occurrence in some group. For your example, I'd use a stacked bar graph: I'd use length on the X axis, and total occurrence of that hair length on the Y axis. So say those bars are blue. Then put the number of those people that have curly hair in red over the blue bar. Since your numerical variable is actually discrete, it fits using bars well.
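As a rough illustration of the counts such a bar graph needs, here is a small Python sketch using the sample codes from the question. The `bar` helper and the text rendering are made up for the example; in Excel the two counts would simply be two series in a stacked column chart.

```python
from collections import Counter

sample = [121, 220, 310, 320, 120, 111, 430, 501]

total = Counter()   # people at each hair length
curly = Counter()   # curly-haired people at each hair length
for code in sample:
    _, length, is_curly = (int(d) for d in f"{code:03d}")
    total[length] += 1
    curly[length] += is_curly


def bar(length):
    """One text bar: '#' per curly person, '.' per non-curly person."""
    return "#" * curly[length] + "." * (total[length] - curly[length])


for length in sorted(total):
    print(f"{length} in | {bar(length)}")
```

The whole bar is the total for that length and the `#` part is the curly share, so both "most people aren't curly" and "short hair is curlier" can be read off the same picture, provided the real data actually shows that.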
posted by devilsbrigade at 3:37 PM on June 1, 2007

Response by poster: I have to go out so I won't be able to answer questions for a while.

adamrice, one of the reasons I'm trying to put it on one graph is that there are 11 different groups of data. If I separate it out, that's 22-33 graphs. Secondly, what the data actually represents are children's scores in math, whether they got the answer right or wrong, what kind of wrong answer, and what kind of reasoning they used to get the wrong answer. So the same kind of graph can be applied to all 11 questions, if there is, indeed, a way to show it.
posted by b33j at 3:38 PM on June 1, 2007

Um, I don't think muddgirl was talking about a bar graph. Think points.

Here's a terrible representation:

10 *
6 *
4 *
Brown, Blonde, Black, White, Red

This means that there are three people with brown hair, with lengths of 10, 6 and 4 inches. It also uses only the non-curly set, as muddgirl suggested.

I don't know how to make mefi not swallow a bunch of spaces. Otherwise, there would be more dots, and in other hair colors.
posted by philomathoholic at 3:42 PM on June 1, 2007

Excel & Illustrator are my tools. I do not have a budget to change that.

R is free. R has vastly better graphing capabilities than Excel. OTOH, R is a pain to deal with for newbies.

How do I show visually that most of my sample don’t have curly hair, but the shorter the hair the more likely it is to be curly?

I would do this by displaying the relevant sets of kernel densities, divided by whatever was of interest. That is, one line for the density of length for blondes, one line for the density of length for redheads, etc. Then I'd use a separate figure to show the density for curly and the density for straight. Here's a quick example in R showing that redheads tend to have shorter hair than brunettes do; I couldn't be bothered to set a title or anything. Once I had the data in, generating the figure was a matter of two lines:


You can fake this in Excel by using the histogram functions, but there's substantial beating-into-submission involved. Frankly, I think it would take less time to install and do this in R than it would to beat Excel into doing it.
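For readers with neither R nor the patience for Excel, the idea of a kernel density can be sketched in plain Python. This is not the R code referred to above, just a minimal Gaussian-kernel illustration; the function name, the bandwidth of 1.0 and both lists of hair lengths are made up for the example.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Made-up hair lengths in inches, purely illustrative.
redheads = [1, 2, 2, 3, 3, 4]
brunettes = [3, 5, 6, 6, 7, 9]

red_density = gaussian_kde(redheads, bandwidth=1.0)
brown_density = gaussian_kde(brunettes, bandwidth=1.0)

for x in range(0, 11, 2):
    print(x, round(red_density(x), 3), round(brown_density(x), 3))
```

Plotting the two columns of densities against x gives one smooth line per group, and the group with shorter hair has its hump further to the left.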

The really important thing is that you have to know, in advance, what information you're trying to get across. You need to slice and dice the information first, so that you know that you're trying to show that blondes have longer hair than redheads, or whatever. If you try to convey a mishmash of information, it's going to be a mishmash of information no matter how whizbang your graphs are.

Also, a lot of the questions I think you're interested in are really matters for various multivariate / multiple statistical techniques that I suspect you might not want to mess with.
posted by ROU_Xenophobe at 4:22 PM on June 1, 2007

Assuming you want to show the relationships between length/curls, color/curls, and color/length then you could use three separate graphics of the averages. If you want to show all relationships in one graph it gets complex and may not be easily understood.

The root of your exercise is communication. What do you want to communicate? Presenting data is usually not enough. The graphic has to communicate. A twisted 3D line or surface may make someone wonder WTF? You have to break down "what you have to say" into easily digestible "sentences" and not just have one long "run-on sentence/paragraph".

Seeing that you don't sound like much of a visual person, try to think out what conclusions you want to give to your audience. Is there a trend that you can see in the data or is there absolutely no relationship?
posted by JJ86 at 4:47 PM on June 1, 2007

OK, it is way easier to work with the actual meaningful data than this hair abstraction. When you talk about math scores I think 'histogram', because usually you are interested in seeing how many people did well and how many did poorly and what the distribution is. Moreover you are probably also interested in seeing if there was a common theme among students who did poorly.
(a histogram is just a bar graph where you group the data into 'bins', like 0-20, 20-40, 40-60 etc, and count the number in each bin. you can google 'excel histogram' to figure out how to do this.)
that's step one.

once you have the data in 'bins' you can look for other relationships - slice the histogram bars into percentages or something - but still, as others have said, you need a better idea of what you want to show ahead of time. perhaps you could give a specific example.
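the binning step can be sketched in a few lines of Python, for anyone who wants to see what 'excel histogram' is doing under the hood. the scores here are made up, the function name is hypothetical, and a score sitting exactly on an edge (like 40) is counted in the upper bin, which is an assumption since '20-40, 40-60' leaves that open.

```python
def histogram(scores, bin_width, low, high):
    """Count scores in bins [low, low + w), [low + w, low + 2w), ...

    A score equal to `high` is counted in the last bin.
    """
    n_bins = (high - low) // bin_width
    counts = [0] * n_bins
    for score in scores:
        index = min(int((score - low) // bin_width), n_bins - 1)
        counts[index] += 1
    return counts


scores = [12, 35, 40, 58, 61, 61, 77, 80, 95, 100]  # made-up test scores
counts = histogram(scores, bin_width=20, low=0, high=100)

for i, count in enumerate(counts):
    start = i * 20
    print(f"{start:3d}-{start + 20:3d} | {'*' * count}")
```

each row of stars is one bar of the histogram; slicing a bar further (say by type of wrong answer) means keeping a separate count per category inside each bin.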
posted by PercussivePaul at 12:28 AM on June 2, 2007

Best answer: I think this would be best as a scatter chart, not as a bar or line chart. Each person would have one dot on the graph. It should look something like this:

                   Length of hair
 Brown |       @        @ /   /@  / / //   /
Blonde |  /  @ /      /    /  / ///    /
 Black |    @   @  @@   @      /       /
 White |       @  @ 
   Red |         @ @             /

@ curly
/ straight

If you've numbered the hair colors you're already part-way there: that's one axis, length is another axis, and (as muddgirl said) plot the curly-haired and straight-haired people as separate series.
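A minimal Python sketch of that split, using the sample codes from the question, might look like this. The function name is made up; each series is a list of (length, color number) points, which is the same pair of columns you would hand to Excel as two scatter series.

```python
sample = [121, 220, 310, 320, 120, 111, 430, 501]


def split_series(codes):
    """Return (curly_points, straight_points) as lists of (length, colour)."""
    curly, straight = [], []
    for code in codes:
        colour, length, is_curly = (int(d) for d in f"{code:03d}")
        # x is hair length, y is the hair colour number.
        (curly if is_curly else straight).append((length, colour))
    return curly, straight


curly, straight = split_series(sample)
print("curly:   ", curly)
print("straight:", straight)
```

Note that the point (2, 1) turns up in both series (a curly and a straight brown-haired person with 2-inch hair), so those two dots would sit on top of each other. With whole-inch lengths that will happen a lot, and nudging the dots slightly apart is one way to keep them all visible.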

If your sample set was very large, you would need to do number-crunching of the sort ROU_Xenophobe describes (with error bars showing the average length, and upper/lower quartiles, for each color). But for fewer than 500 people or so, one dot per person should work fine — the human brain alone will be powerful enough to see the patterns in the clumps of dots.
posted by mpt at 6:59 AM on June 2, 2007

One thing just to keep in mind if you do a scatter chart is that while the human brain can see patterns, it can also be easily misled. If you have a dataset that isn't very clearly correlated, you can seriously fuck up your results if you leave it to intuition (this is why a lot of statistics exists...).
posted by devilsbrigade at 9:24 PM on June 4, 2007
