Step 1: Data; Step 2: ???; Step 3: Insight!
January 17, 2012 8:01 AM

What's your methodology for turning a bunch of data into insights?

Or maybe "What does your brainstorming process look like?" It can't just be: stare at the data/problem for a long time, take a shower, and then the answer just appears.

Methodologies, dogma, processes, and lore all accepted.
posted by rev- to Education (17 answers total) 22 users marked this as a favorite
 
My methodology does not start with data. Step 1 is think about what you want to know, and why you want to know it. Ideally, Step 2 is then collect data that will help you answer that, but in some cases you aren't the person who chooses or collects that data, so you have to work with what you have. Step 3 is then use the data to answer your questions. Pulling random "insights" out of your data is not helpful if you don't already have a reason to know that information -- it's dangerous to make up justifications only after you know the answers to your questions. Start with the questions: what would be useful for me to know, and why?
posted by brainmouse at 8:44 AM on January 17, 2012 [2 favorites]


I had the same thought, brainmouse, but I wonder if rev- is talking about using data from previous experiments to brainstorm future experiments? I think this is an acceptable way to use data that has already been collected.
posted by muddgirl at 8:55 AM on January 17, 2012


There's a line in the movie "Real Genius" where Prof. Hathaway tells the government flak that "You can't dictate innovation, Don. We're not making cheese sandwiches, here," which is exactly the wrong way to think of science. Science is getting up in the morning, going to work, and making cheese sandwiches. (In my case, cheese sandwiches = graphs.)

I generate graphs. Lots of them. I slice and dice the data any way I can think of. Any two variables that I can put on two axes, I try that. Any constants I can subtract off? Subtract 'em. Any linear trends I can subtract out? Subtract 'em. What trends remain? Where are the data clustered? What is there about the graphs that I can explain quantitatively (and am I sure?) and what can't I explain? I take my graphs and show them to other people and try to explain them. (This is great for discovering that, oops, yeah, that thing you were 100% sure of? It's not actually there, in the cold light of day. But hey, what's that other thing over here?)
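A minimal sketch of the subtract-the-trend step, with made-up data (numpy assumed; the variable names and numbers are purely illustrative):

```python
import numpy as np

# Hypothetical data: a linear trend hiding a sinusoidal signal.
x = np.linspace(0, 10, 200)
y = 2.5 * x + 1.0 + np.sin(2 * np.pi * x)

# Fit and subtract the linear trend, then look at what remains.
slope, intercept = np.polyfit(x, y, 1)
residual = y - (slope * x + intercept)

print(round(slope, 1))           # the trend you subtract out
print(round(residual.max(), 1))  # the leftover oscillation you then graph
```

Plotting `residual` against `x` is the "what trends remain?" question made concrete.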

I also draw a lot of diagrams, from lots of different points of view, because the geometry of the system is very important to my work.

Taking a shower and waiting for insight to magically appear—that only works after you've really internalized the problem, which you do by looking at it from lots of different directions.
posted by BrashTech at 9:08 AM on January 17, 2012 [3 favorites]


I just go the graphical route. I do tons of exploratory graphing just to get a feel for the data.
posted by dhruva at 9:08 AM on January 17, 2012


heh. or what BrashTech said.
posted by dhruva at 9:09 AM on January 17, 2012


Statistics is the science of collecting and analyzing data for the purpose of drawing inferences and conclusions, and making decisions.

take an introductory course in it.
posted by BadgerDoctor at 9:17 AM on January 17, 2012 [1 favorite]


Always, always plot. Oh, and think. But plot, really.
posted by cromagnon at 9:56 AM on January 17, 2012


If the data is numeric, graphing helps. Try a whole bunch of different curve-fits and see if any lines seem to describe the data (be wary of things like log-log plots, which fit nearly anything). Curve-fitting isn't actual analysis, but the type of curve-fit may give you some insight into the mechanism of your system, and you can run with those hunches. Example: the Lineweaver-Burk plot for enzyme kinetics.
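The Lineweaver-Burk trick Quietgal mentions is a nice concrete case: take reciprocals of Michaelis-Menten data and a straight line falls out, with the mechanism's constants in the slope and intercept. A hedged sketch with noiseless made-up numbers (numpy assumed):

```python
import numpy as np

# Hypothetical enzyme data obeying v = Vmax*S / (Km + S), with Vmax=10, Km=2.
S = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
v = 10.0 * S / (2.0 + S)

# Lineweaver-Burk: 1/v against 1/S is linear if the mechanism holds,
# with slope = Km/Vmax and intercept = 1/Vmax.
slope, intercept = np.polyfit(1.0 / S, 1.0 / v, 1)
Vmax = 1.0 / intercept
Km = slope * Vmax
print(round(Vmax, 1), round(Km, 1))  # recovers 10.0 and 2.0 from the line
```

The point isn't the numbers; it's that a *particular* linearization working tells you something about the mechanism.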

For pictorial data (photographs, maps, etc) it's harder. Lots of times, people try to "numberize" this kind of data to make analysis more amenable to standard approaches like statistical analysis, measuring the area under the curve, etc. You would digitize photos and, say, analyze how many pixels were a certain color. But I admit that many times I just stare at pictures (gels, blots, mass spectra, etc) side by side and look for what's different. Then I try to come up with a hypothesis about the difference, and usually I'm wrong but at least it's a start. Pictorial data is hard; numbers are much easier to crunch with all the software packages out there.
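The "numberize" step for pictures can be as simple as counting pixels that match a color. A toy sketch with a fabricated image array (numpy assumed; real photos would come from an image library):

```python
import numpy as np

# Hypothetical 8x8 RGB "image": black background, three pure-red pixels.
img = np.zeros((8, 8, 3), dtype=np.uint8)
img[2, 3] = img[5, 5] = img[6, 1] = (255, 0, 0)

# Turn the picture into a number: count pixels matching the target color.
target = np.array([255, 0, 0], dtype=np.uint8)
n_red = int(np.all(img == target, axis=-1).sum())
print(n_red)  # 3
```

Once the picture is a number, all the standard statistical machinery applies.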

P.S. Scientists almost always have a preconceived notion of what their system is doing, be it a cell or a galaxy, and this mental model influences what we see in the data. It's virtually inescapable and it often works out OK - there aren't too many discoveries that overturn what we thought we knew before. So channel that urge to make your data fit your mental model of the universe and see if takes you somewhere productive. Just make sure you consider alternate hypotheses, and don't be too quick to discard outliers - sometimes they're trying to tell you something important.
posted by Quietgal at 10:11 AM on January 17, 2012


I'm not sure what software you use, but my favorite way of exploring data is to load it into R and type "plot(data)". This plots every variable against every other variable, and you can spot patterns visually quite easily. I've done this for up to 100 variables, and printed it out on an A1 page.

Then, I usually mark the variables I am interested in, sometimes remove outliers, and then explore those relationships in more detail using ggplot2 (an R package). I find facet plotting a very useful approach in general (or use of 'small multiples').
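For anyone not in R, a rough Python stand-in for that pairs-plot workflow, using pandas' `scatter_matrix` on made-up data (the Agg backend just avoids needing a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless; we only need the grid of panels
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data frame; in R this whole block is just plot(data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=100)
df["z"] = rng.normal(size=100)

axes = scatter_matrix(df, figsize=(6, 6))  # every variable vs. every other
print(axes.shape)  # one panel per variable pair
```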
posted by a womble is an active kind of sloth at 10:12 AM on January 17, 2012 [3 favorites]


Number 2 is input (to yourself). Break it, feel it, see it. You must ask and demand an answer until you drop from fatigue. Then rest until you have enough gumption to try a new angle. Insight is hard.
posted by JohnR at 10:20 AM on January 17, 2012


I think this must be one of those things that differs between physical and social sciences, presumably because your installed base of theory is powerful enough to weed out a lot of dumb ideas.

In the social science world, I would caution you against this sort of data mining. Plotting or regressing everything against everything is a great way to get 100 pieces of spurious nonsense and 1 true thing, and be unable to tell the difference with the data you have.
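You can watch that happen with simulated data: correlate a pile of pure-noise variables against a pure-noise outcome and some of them will clear p < .05 by chance alone. A sketch (numpy and scipy assumed; every number here is fabricated noise):

```python
import numpy as np
from scipy import stats

# 100 noise predictors vs. a noise outcome: at p < .05 we expect
# roughly 5 "significant" correlations, all of them spurious.
rng = np.random.default_rng(42)
y = rng.normal(size=200)
false_hits = 0
for _ in range(100):
    r, p = stats.pearsonr(rng.normal(size=200), y)
    if p < 0.05:
        false_hits += 1
print(false_hits)  # a handful of "findings" with zero real relationships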
posted by ROU_Xenophobe at 10:30 AM on January 17, 2012 [3 favorites]


To expand on ROU_Xenophobe, physical sciences also have a taboo against the kind of data dredging that was mentioned above (plot everything! take stuff out! add stuff in!). Don't do that.

I usually read up on a topic a lot, think about what questions come up, and then I should have some idea of what tests I want to run on the data.

Like say you have a bunch of data about students' grades including teacher, subject, sex, parents' income, etc. There are a lot of ways you could look at that data - and a lot of significant-looking correlations you could find. And of course, when you add in more variables your fit always gets better. But what are you interested in? Do you want to know how well your teachers are doing? Do you want to see how well males vs. females do? Do you want to pick the easiest course to take?

Figure out your questions.
posted by hydrobatidae at 11:46 AM on January 17, 2012 [1 favorite]


To expand on us both, that sort of data dredging only becomes halfway-okay when you have an extensive base of theoretical expectations already in your head. At that point, what you're doing is more or less quickly and dirtily evaluating the performance of many different theories for the data you have -- I see more X goes with more Y, which is consistent with Foo's theory but not Bar's.

The catch is that doing this renders your data completely useless for actually testing whether Foo's theory is a good explanation for whatever you're looking at, because the fact that more X goes with more Y is why you think it might be applicable. Instead, you need new data to recheck that more X goes with more Y or --even better-- you need to sit down and think about what *else*, that doesn't involve X or Y, would be true in a universe where Foo's theory was right, finding that it also predicts more A goes with less B. Then grab new data and test in that dataset.
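The "grab new data and test in that dataset" step can be sketched as a simple split: dredge in one half, then honestly test the one relationship you found in the other half. A toy version with fabricated noise (numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

# 50 candidate predictors, all pure noise, against a noise outcome.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 50))
y = rng.normal(size=400)

# "Discover" the strongest-looking predictor in the first half...
best = max(range(50),
           key=lambda j: abs(stats.pearsonr(X[:200, j], y[:200])[0]))

# ...then test that single relationship in data it has never seen.
r, p = stats.pearsonr(X[200:, best], y[200:])
print(round(p, 3))  # an honest p-value, free of the dredging bias
```

The dredged "finding" usually fails to show up again in the held-out half, which is exactly the point.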
posted by ROU_Xenophobe at 12:21 PM on January 17, 2012


Honestly, unless I'm misunderstanding your question, I think the answer for me is "Design a better experiment to begin with."

Maybe your field (you might get better answers if you specify this) is different from mine, but if I'm doing my job properly, I don't start with data and end with insight. I start with a possible insight, i.e. a hypothesis, and then collect data that tests it. When I'm dealing with all the data I've collected and it doesn't immediately make sense, I break my hypothesis and my experiment into the smallest possible questions and see how the data answered those questions. So, possible insight leads to data leads to new possible insight (repeat repeat repeat.) If I look at my data and it isn't working this way, I know that I haven't done the right controls to answer my question, or I tried to ask too many questions at once, or (and this happens a lot) I actually don't have as much background in the question as I thought I did, and I'm going to have to do a lot more reading and talking with people to get a testable possible insight.
posted by juliapangolin at 1:00 PM on January 17, 2012 [2 favorites]


What kind of insight are you looking for? Are you looking for relationships, correlation, probabilities across two sub-groups, etc.? What available data do you have--is it quantitative, qualitative, or mixed? Is this for personal purposes or professional? What analytical resources do you have available that fit the type of data that you have?

If this is supposed to be for something remotely scientific, brainmouse is spot on in saying that you need to develop your research question and hypotheses before digging into the data. I'm assuming you're using a pre-existing data set (yay, gov't data sets!). Once you know your question and your research & null hypotheses, you'll be ready to ask someone to help you analyze it, after which you can draw your conclusions.

I agree that you need some sort of research methods course, qual or quant (or preferably, both) if this is something you are serious about. Oh, and also remember that just because you have a bunch of data doesn't mean that you can slice it however you like; if this is a raw data set, then there are a whole bunch of things you need to know about before you can start running tests or models, such as sample size, sampling method, general population from which the sample came, missing or incomplete observations, etc.
posted by smirkette at 2:17 PM on January 17, 2012


If you are not in the hard sciences, you might find it interesting to look at the methods anthropologists use for working from their data to their insights. The main one is tagging with keywords. (They call this "coding", if you want to know what to google for.) There is a variety of software around to support this.
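Mechanically, coding is just attaching keywords to chunks of material and then looking at which tags pile up and co-occur. A toy sketch with invented field notes (dedicated software does this with much more support, but the core idea fits in a few lines):

```python
from collections import Counter

# Hypothetical field notes, each "coded" with keyword tags.
notes = {
    "Interview 1": ["kinship", "ritual"],
    "Interview 2": ["ritual"],
    "Interview 3": ["kinship", "trade"],
}

# Tally tags across all notes: recurring themes surface as high counts.
tag_counts = Counter(tag for tags in notes.values() for tag in tags)
print(tag_counts["ritual"])  # 2
```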
posted by lollusc at 3:06 PM on January 17, 2012 [1 favorite]


As a group, the simplest thing you can do is to decide on a goal and figure out exactly what you mean by that goal in very concrete terms. Know what you want to know.

1. Our goal G is some fact we need to know.

2. Define exactly what we mean by G. G = A + B + C.

3. Do we know A, B, and C? If yes, we're done. Just do the math. If no, for each unknown we need to go to 1 and define that subgoal until we eventually hit bottom (the raw data).

etc.

Just narrow it down rule by rule, formula by formula, until you can convert all the variables to tedious little facts you already have in your database or that you can add to your database.

If you can reduce your goal to such rules, you can write a simple program (in Prolog, for example) to apply those rules to your data and show you the answer.

Brainstorming could be something like this:

1. Decide exactly what you want to learn or do (the goal) and exactly what you already know or have (the data). We need to know or do X. We know or have Z.

2. Write the goal at the top of a white board and the data (or some characteristic piece of it) at the bottom of the white board.

X = the goal

Z = 3 [the hard data]

3. Develop rules in between that will eventually bridge the gap between what you know and what you want to know.

X = Y + Z
Y = Z/2
Z = 3

So
X = Y + Z
X = (Z/2) + 3
X = 1.5 + 3
X = 4.5
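If you can write the rules down, a few lines of code can do the reduction for you. pracowity mentions Prolog; here's the same idea as a toy backward-chainer in Python, using exactly the rules from the example above:

```python
# Rules from the whiteboard: X = Y + Z, Y = Z / 2, Z = 3 (the raw data).
rules = {
    "X": lambda env: env("Y") + env("Z"),
    "Y": lambda env: env("Z") / 2,
    "Z": lambda env: 3,
}

def solve(goal):
    """Reduce a goal to raw data by recursively solving its subgoals."""
    return rules[goal](solve)

print(solve("X"))  # 4.5, matching the hand derivation
```

A goal with no rule (missing data) raises a KeyError, which is the program's way of telling you to get better data or change your goal.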

You might discover that you just don't have the data to support your goal, in which case you need to get better data or change your goal. OK, we don't want to "cure all cancers everywhere", we want to "reduce the occurrence of cervical cancer caused by HPV transmission during sex between Czech teenagers", and what we know is that condoms and vaccines are great for this. Now convert the vague stuff ("reduce" by how much? "great"? used alone or in combination?) into hard goals, rules, and data, and keep going at it iteratively.
posted by pracowity at 2:50 AM on January 18, 2012

