February 8, 2014 6:09 PM

Using a dataset I would like to perform the type of analysis shown here:
How would I go about doing this?

Effectively what I want to do is group members of a large dataset together by observed characteristics as opposed to existing formal categories. In the article linked above, the researcher does just this (with a very effective visualization, no less) by looking at basketball players' individual statistics rather than their formal positions. In an article that was linked to Metafilter, another individual does this with online dating profiles, scraping a dataset and grouping based on observed characteristics.

Any software that could help with the visualization would be awesome too. If my question seems a little unfocused, it's because I'm a bit of a data newbie: I'm only just now learning STATA, and my programming skills are lackluster. The answer to the above question can be "hire a programmer"; I'd just need to know what type of programming skills to look for.
posted by Ndwright to Computers & Internet (5 answers total) 7 users marked this as a favorite

Grouping objects together based on their observed characteristics is called clustering. If, on the other hand, you have a set of pre-defined categories and want to assign objects to them, it is called classification.

The NBA analysis used something called a 'topological similarity network'. I'm not familiar with this but it is one of many algorithms you can use to solve a clustering problem. Here's an overview which includes a brief discussion of topological clustering, among others. There are many other clustering algorithms you could try out; different algorithms are more appropriate for different types of problems. k-means is a classic and is the first place to start, probably, since it is easier to understand and there will be lots of resources on the web.

There are free implementations of many of these in most statistical and scientific software platforms. I personally like R, but I'm certain STATA would have this too. Lots of people use Python these days with the numpy/pandas/scipy libraries. I tend to find R's plotting libraries, especially ggplot, to be ahead of the others.
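
To make the k-means suggestion concrete, here is a minimal sketch in plain Python. It is not the method from the NBA article, and the player names and per-game numbers are made up for illustration. It alternates the two classic steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. Starting from the first k points is an assumption made to keep the result reproducible; library implementations seed more carefully.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    # Assumption: use the first k points as starting centroids so the
    # run is reproducible; real libraries use smarter random seeding.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical players: (points, rebounds, assists) per game.
names = ["A", "B", "C", "D", "E", "F"]
points = [
    (25.0, 4.0, 8.0), (22.0, 3.5, 9.0), (24.0, 5.0, 7.5),     # high scoring, many assists
    (10.0, 12.0, 1.5), (12.0, 11.0, 2.0), (9.0, 13.0, 1.0),   # low scoring, many rebounds
]

labels, centroids = kmeans(points, k=2)
for name, label in zip(names, labels):
    print(name, "-> group", label)
```

Note that nothing in the input says "guard" or "center"; the groups come only from the numbers. On real data you would probably want to put the columns on a comparable scale first, since k-means relies on raw distances.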

posted by PercussivePaul at 6:30 PM on February 8 [1 favorite]

Yep, Ayasdi is based on the method of "persistent homology" developed by Gunnar Carlsson and his collaborators at Stanford. I know it reasonably well. This stuff can be thought of as a generalization of clustering -- I don't want to give a whole intro grad course in algebraic topology in a MeFi comment but, if you know this stuff, you can say that clustering is the "persistent H^0" part of Carlsson's thing and then the novelty of his method is that it can look at H^i for positive i as well.

But that's the thing to say to you if you know the math but not the programming. If you don't know this kind of math, don't worry, you can do without it, because if what you want to do is "group members of a large dataset together by observed characteristics as opposed to existing formal categories," then as everybody else said, what you want to do is clustering, and indeed learning about k-means is a great place to start (and will be already implemented in whatever conceivable platform you might be using.)

So what's the difference between persistent homology and clustering? In a nutshell: not all data falls into neat, round clumps. Suppose, for instance, your data naturally forms the shape of a ring. It's all one connected piece, so clustering is not going to get you anywhere. But somehow there's a difference between a ring and a blob, and this is a feature of your data you might want to notice! And it's this kind of feature (as well as lots of higher-dimensional ones) that Carlsson's theory is designed to detect.
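
The ring-versus-blob point can be checked on toy data. The sketch below is not Ayasdi's software and not full persistent homology, which tracks features across many scales. It assumes a single fixed scale, connects points closer than that scale, fills in triangles whose three sides are all present, and counts connected pieces (b0, the part clustering sees) and independent loops (b1). The function names and the sample points are made up for illustration.

```python
import math
from itertools import combinations

def gf2_rank(vectors):
    """Rank over GF(2) of vectors encoded as Python int bitmasks."""
    rows = [v for v in vectors if v]
    rank = 0
    while rows:
        pivot = rows.pop()
        rank += 1
        low_bit = pivot & -pivot
        # Clear the pivot's lowest set bit from every remaining row.
        rows = [r ^ pivot if r & low_bit else r for r in rows]
        rows = [r for r in rows if r]
    return rank

def betti_numbers(points, scale):
    """Return (b0, b1) for the complex built at one fixed scale."""
    n = len(points)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= scale]
    edge_index = {e: idx for idx, e in enumerate(edges)}
    # A triangle is filled in when all three of its sides are edges.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_index and (i, k) in edge_index
                 and (j, k) in edge_index]
    # Boundary of an edge: its two endpoints. Of a triangle: its three sides.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [(1 << edge_index[(i, j)]) | (1 << edge_index[(i, k)])
          | (1 << edge_index[(j, k)]) for i, j, k in triangles]
    rank_d1, rank_d2 = gf2_rank(d1), gf2_rank(d2)
    b0 = n - rank_d1                      # connected pieces
    b1 = len(edges) - rank_d1 - rank_d2   # independent loops
    return b0, b1

# Twelve points evenly spaced on a circle, and five points in a small clump.
ring = [(math.cos(2 * math.pi * t / 12), math.sin(2 * math.pi * t / 12))
        for t in range(12)]
blob = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3), (0.15, 0.15)]

ring_betti = betti_numbers(ring, scale=0.6)
blob_betti = betti_numbers(blob, scale=0.6)
print("ring:", ring_betti)
print("blob:", blob_betti)
```

Both shapes come out as one connected piece, so counting clusters cannot tell them apart. The ring reports one loop and the blob reports none, which is exactly the extra feature described above.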

posted by escabeche at 7:20 PM on February 8 [3 favorites]

I work on problems like this, though I don't necessarily do a lot of visualizations like this one.

I started out doing segmentation, clustering, variable reduction, decision trees and linear modeling. Then I developed an interest in Bayesian inference models, and now I'm working with machine learning techniques.

Machine learning is effectively using data to characterize something based on observable datapoints, then rigorously testing those characterizations to see which ones are actually useful, and ultimately using the machine to predict performance. Basically I got to a point where I said: what I assume is important as a predictor of success isn't necessarily important, so let's let the data show us what characteristics make up those that succeed and those that fail, and use that information to understand how the whole ecosystem functions, as opposed to just whether something will win or lose.
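
A toy version of "test the characterizations to see which ones are actually useful" might look like the sketch below. Everything in it is an assumption for illustration: the rows, the field names `skill_score`, `noise_score` and `won`, and the very simple rule of splitting at the midpoint between the two class averages. The idea it shows is general, though: learn a rule on one part of the data and score it on rows it has never seen.

```python
def class_mean(rows, feature, label):
    """Average of one feature over the rows with the given outcome."""
    values = [r[feature] for r in rows if r["won"] == label]
    return sum(values) / len(values)

def fit_rule(train, feature):
    """Learn a one-feature rule: split halfway between the two class means."""
    low = class_mean(train, feature, 0)
    high = class_mean(train, feature, 1)
    threshold = (low + high) / 2
    sign = 1 if high >= low else -1
    return lambda row: int(sign * (row[feature] - threshold) > 0)

def accuracy(rule, rows):
    """Share of rows where the rule's prediction matches the outcome."""
    return sum(rule(r) == r["won"] for r in rows) / len(rows)

# Made-up data: skill_score tracks the outcome, noise_score does not.
train = [
    {"skill_score": 1.0, "noise_score": 5.0, "won": 0},
    {"skill_score": 1.2, "noise_score": 3.0, "won": 0},
    {"skill_score": 0.8, "noise_score": 4.0, "won": 0},
    {"skill_score": 3.0, "noise_score": 4.0, "won": 1},
    {"skill_score": 3.2, "noise_score": 5.0, "won": 1},
    {"skill_score": 2.8, "noise_score": 3.0, "won": 1},
]
held_out = [
    {"skill_score": 1.1, "noise_score": 3.5, "won": 0},
    {"skill_score": 0.9, "noise_score": 4.5, "won": 0},
    {"skill_score": 3.1, "noise_score": 4.5, "won": 1},
    {"skill_score": 2.9, "noise_score": 3.5, "won": 1},
]

scores = {
    feature: accuracy(fit_rule(train, feature), held_out)
    for feature in ("skill_score", "noise_score")
}
for feature, score in scores.items():
    print(feature, "held-out accuracy:", score)
```

A feature that only looked important would score near coin-flip accuracy on the held-out rows, which is how the data gets to overrule your assumptions.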

In your specific NBA mapping, I'd look at it as follows. What are the characteristics of all these players, what is the best performing permutation of players teamwise, what would be a better permutation of players for the team? Can I get them on my Fantasy NBA draft? What is the optimal team given my starting options?

I alluded to the actual indentured slavery result that this effectively creates in the long run when applied to real life in the recent NBA statistics thread.

Technique:

Machine Learning

Free software I use to model:

RapidMiner

This is all statistical programming. If I was looking to learn something, I'd look to learn SQL, because you are going to need to know data storage inside and out and that's a pretty common database structure. Regarding statistical programming, I'd want to learn R while in school, and SAS if possible. In addition, I'd look to learn regex inside and out, as I've found it is a major building block for converting non-standard data into quantifiable data. Python would be handy to know, as well as XML: Python is great for some heavy lifting, and XML is how a boatload of web data is stored that you'll want to scrape from web formats, because you'll be interested in as much data as possible.
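
As a small illustration of regex turning non-standard data into numbers, here is a sketch using Python's built-in `re` module. The line layout, the player names and the stats are all hypothetical; a real scrape would need a pattern written against whatever the source actually looks like.

```python
import re

# Hypothetical scraped lines; the layout is made up for illustration.
raw_lines = [
    "Player A | 6'7\" | 23.4 ppg, 5.1 rpg",
    "Player B | 7'0\" | 11.0 ppg, 12.3 rpg",
]

PATTERN = re.compile(
    r"(?P<name>[^|]+?)\s*\|\s*"                    # name, up to the first bar
    r"(?P<feet>\d+)'(?P<inches>\d+)\"\s*\|\s*"     # height such as 6'7"
    r"(?P<ppg>\d+(?:\.\d+)?)\s*ppg,\s*"            # points per game
    r"(?P<rpg>\d+(?:\.\d+)?)\s*rpg"                # rebounds per game
)

def parse_line(line):
    """Turn one text line into a dict of numbers, or None if it doesn't fit."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return {
        "name": match["name"],
        "height_in": int(match["feet"]) * 12 + int(match["inches"]),
        "ppg": float(match["ppg"]),
        "rpg": float(match["rpg"]),
    }

records = [parse_line(line) for line in raw_lines]
for record in records:
    print(record)
```

Returning None for lines that don't match, rather than guessing, makes it easy to count how much of a scrape the pattern is missing.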

posted by Nanukthedog at 7:12 AM on February 9 [1 favorite]

The first three answers came just over an hour after I posted this. God I love Metafilter.

Thanks everyone!

posted by Ndwright at 8:27 PM on February 11

posted by zscore at 6:29 PM on February 8