February 8, 2012 2:25 PM

How can I analyze some data about linkages between data elements in a way that will give me meaningful information about the similarity of the relationships? This has got to be a solved problem!

Let's say I've got a list of everyone who has walked through some doors, say, in a mall. In fact, I have excellent records about which doors each person has walked through. I'm looking for some sort of analysis that will help me collect people into groups based on how similar they are in terms of which doors they mostly walk through (this group often goes to Sears and JC Penney, that group only goes to Cinnabon, etc). I'm also looking to be able to calculate some sort of degree of "aberrance" in behavior based on grouping - say, when someone who belongs to a group that would otherwise only ever shop at shoe stores also went to Tower Records that one time.

The behavioral aspect is purely illustrative - I'm strictly looking to be able to group things together with some sort of numerical "confidence", hopefully one that I could analyze for anomalies or change over time.

What is this area of math called? Are there any good tool sets (say, as part of R, or MATLAB/Octave) that I can use? Can anyone recommend a good textbook? I haven't taken linear algebra, so fewer pre-reqs would be nice.
posted by TheNewWazoo to Education (9 answers total) 2 users marked this as a favorite

You're describing categorical data. The area of statistics you're looking for is Analysis of Categorical Data.

posted by demiurge at 2:45 PM on February 8, 2012 [1 favorite]

At first glance, Eureqa seems to do something different than what I seek. Instead of having data that's

(x_{1},y_{1})

(x_{2},y_{2})

I've got data that's more like

(x, {a,a,b,c})

(y, {c,d,e})

Can I translate one into the other? Forgive me if it's a silly question.

posted by TheNewWazoo at 2:46 PM on February 8, 2012

I believe Eureqa is designed for continuous variables (real numbers) and not for categorical data.
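On the "can I translate one into the other" question: the set-style records can at least be flattened into one row of counts per person, which is the shape most categorical-data tools expect. This is only a rough sketch in Python; the names visits and to_count_table are made up for illustration, and the table stays categorical counts rather than becoming continuous (x, y) pairs.

```python
from collections import Counter

# Hypothetical records in the shape from the question: (person, doors walked through).
visits = [
    ("x", ["a", "a", "b", "c"]),
    ("y", ["c", "d", "e"]),
]

def to_count_table(records):
    """Return (sorted door names, {person: [visit count per door]})."""
    doors = sorted({door for _, walked in records for door in walked})
    table = {}
    for person, walked in records:
        counts = Counter(walked)  # missing doors count as 0
        table[person] = [counts[door] for door in doors]
    return doors, table

doors, table = to_count_table(visits)
print(doors)
for person, row in table.items():
    print(person, row)
```

Each row lines up with the door list, so repeated visits (the two a's for x) survive as a count instead of being lost.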

posted by demiurge at 2:55 PM on February 8, 2012

Social network analysis is sort of what you're talking about. Discriminant analysis is worth looking at too.
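One way to get door records into network form, as a rough sketch: link two people whenever they share a door, and weight the link by how many doors they share. The people, the doors, and the function name shared_door_edges are all invented for illustration, and this is only one of several ways to build such a network.

```python
from itertools import combinations

# Hypothetical door records: person -> set of doors they walked through.
doors_by_person = {
    "ann": {"Sears", "Tower Records"},
    "bob": {"Sears", "Tower Records", "Cinnabon"},
    "cat": {"Cinnabon"},
}

def shared_door_edges(doors_by_person):
    """Weight each pair of people by the number of doors they have in common."""
    edges = {}
    for p, q in combinations(sorted(doors_by_person), 2):
        shared = len(doors_by_person[p] & doors_by_person[q])
        if shared:  # pairs with nothing in common get no edge
            edges[(p, q)] = shared
    return edges

edges = shared_door_edges(doors_by_person)
print(edges)
```

The resulting weighted edge list is the kind of input network tools generally work from.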

posted by k8t at 2:55 PM on February 8, 2012

How about association rule learning? It's been around forever, and I've used it before with some pretty interesting results.
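To give a feel for the two numbers association rules are usually scored by, here is a small hand-rolled sketch in Python. The baskets are invented, and a real rule-mining package would search over many candidate rules rather than score one by hand.

```python
# Hypothetical baskets: the set of doors each person walked through.
baskets = [
    {"Sears", "Tower Records"},
    {"Sears", "Tower Records", "Cinnabon"},
    {"Cinnabon"},
    {"Sears", "Cinnabon"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent.

    Assumes the antecedent appears in at least one basket (otherwise this divides by zero).
    """
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

sup = support({"Sears", "Tower Records"}, baskets)
conf = confidence({"Sears"}, {"Tower Records"}, baskets)
print(sup, conf)
```

A rule like "Sears implies Tower Records" with high confidence describes a group's habit; a person who breaks a high-confidence rule is a candidate for the "aberrance" the question asks about.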

posted by un petit cadeau at 4:00 PM on February 8, 2012

Excellent! Thank you, everyone, for the keywords I can search against. Time to brew some tea and get my study on. If anyone knows of any good open courseware or study material on these subjects, I'd appreciate a pointer.

posted by TheNewWazoo at 7:48 PM on February 8, 2012

For actually plotting the association rules, I like Gephi.

posted by gregglind at 8:33 AM on February 9, 2012

I think a chi-squared test of independence might work for this: tabulate sample subgroups (age cohorts, for example) against the doors of interest. My understanding is that this test would tell you whether the visitation patterns you're seeing across subgroups differ significantly from what chance alone would produce (the null hypothesis being no relationship between subgroup and door). The usual rule of thumb is that the test needs a large sample, with an expected count of at least five in each table cell.

R or any other statistical software would be able to do this; Excel has a lot of built-in stats, so that would be worth a look as well.
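For anyone curious what the software is computing, the statistic itself is short enough to write out. This is a sketch in plain Python with made-up counts; the function name chi_squared is just for illustration, and it stops at the statistic rather than producing a p-value.

```python
# Hypothetical counts: rows are subgroups (e.g. age cohorts), columns are doors.
observed = [
    [30, 10],
    [20, 40],
]

def chi_squared(observed):
    """Return (statistic, degrees of freedom, smallest expected cell count)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    min_expected = float("inf")
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if subgroup and door were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof, min_expected

stat, dof, min_expected = chi_squared(observed)
print(stat, dof, min_expected)
```

The statistic is then compared against a chi-squared distribution with dof degrees of freedom; for one degree of freedom the 5% critical value is about 3.84. The smallest expected count is returned so the five-per-cell rule of thumb can be checked.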

Disclaimer: I'm still learning stats & data analysis, so I could be wrong.

posted by smirkette at 9:23 AM on February 9, 2012

This thread is closed to new comments.

posted by j03 at 2:34 PM on February 8, 2012