How to cluster when dealing with more than one factor?
September 30, 2008 2:07 AM

Let's say I have 100 observations of X, each of length n, with two factors:

1 aX1 bX1
2 aX2 bX2
n aXn bXn

1 aX1 bX1
2 aX2 bX2
n aXn bXn

(...100 of these guys)

Let's say I have 200 observations of Y, each of length n, with two factors:

1 aY1 bY1
2 aY2 bY2
n aYn bYn

1 aY1 bY1
2 aY2 bY2
n aYn bYn

(...200 of these guys)

I'd like to calculate the correlation (distance) between X and Y, so that I can cluster them. The two factors may have differing levels of dependence from 1...n.

Is there a general approach for reducing X and Y in such a way that I can cluster them?
posted by Blazecock Pileon to Science & Nature (8 answers total)

What do you want as output? Clusters, or a discriminant function?

With only 2 dimensions, the latter could be a simpler method. What does a plot look like? Are they linearly discriminable? If so, you could look into LDA.
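As a rough sketch of the discriminant-function route, here is Fisher's linear discriminant for two groups measured on two dimensions, written with NumPy only. The data are randomly generated for illustration, and `fisher_lda` is a made-up helper name, not something from a particular package. The threshold is simply the midpoint of the projected group means, which ignores the unequal group sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data for illustration: two groups measured on two dimensions (a, b).
X = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
Y = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))

def fisher_lda(g0, g1):
    """Return (w, threshold); a point p is assigned to g1 when p @ w > threshold."""
    m0, m1 = g0.mean(axis=0), g1.mean(axis=0)
    # Pooled within-group scatter matrix.
    sw = (g0 - m0).T @ (g0 - m0) + (g1 - m1).T @ (g1 - m1)
    # Direction that best separates the group means relative to the scatter.
    w = np.linalg.solve(sw, m1 - m0)
    # Midpoint of the two projected means (assumes equal priors).
    threshold = w @ (m0 + m1) / 2.0
    return w, threshold

w, t = fisher_lda(X, Y)
pred_x = X @ w > t  # True means "looks like Y"
pred_y = Y @ w > t
accuracy = (np.sum(~pred_x) + np.sum(pred_y)) / (len(X) + len(Y))
print(f"training accuracy: {accuracy:.3f}")
```

If the two groups overlap heavily in the plot, the accuracy printed here will drop, which is a hint that a straight line is not enough.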

If you're after clusters, and then you want to say X is so-and-so distance from Y, K-means is probably the simplest (with K=2).
posted by handee at 3:09 AM on September 30, 2008

Techniques for clustering? Prepare to be confused.

Have you tried a simple scatter plot of a against b, with X as red points and Y as blue points?
posted by Mike1024 at 3:16 AM on September 30, 2008

(after a little thought)

You have two groups measured on two dimensions. You have 200 sets of these measurements.

Mike1024's suggestion is the first step - plot them and see what they look like. Once you have the plots things become much clearer.

Given the plots - if you want to determine what it is about the measurements that makes one thing an X and another a Y, see if they are visually separable. Then you have two options:

1: they're visually separable with a line: use LDA
2: they're visually separable with some other sort of function: look into support vector machines.

If you want something else - if, for example, you're looking to find patterns or clusters within the two groups (do the different Xs clump into separate things?) - then you want some sort of clustering algorithm. Again the plot will help here. K-means is the easiest to understand; this tutorial might help. You need to know K, but that's what the plot's for (or just try lots of values and see what happens).
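To make "try lots of values and see what happens" concrete, here is a minimal NumPy sketch of plain k-means (Lloyd's algorithm) run for several values of K. The three blobs are invented test data, and `kmeans` and `sq_dists` are illustrative names of my own. The within-cluster sum of squares ("inertia") usually drops sharply until K reaches the number of real clumps and then flattens out.

```python
import numpy as np

def sq_dists(data, centroids):
    """Squared Euclidean distance from every point to every centroid."""
    return ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

def kmeans(data, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        labels = sq_dists(data, centroids).argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        updated = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    d2 = sq_dists(data, centroids)
    return d2.argmin(axis=1), centroids, d2.min(axis=1).sum()

# Invented data: three blobs of 100 points each.
rng = np.random.default_rng(1)
centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
points = np.concatenate([rng.normal(c, 1.0, size=(100, 2)) for c in centers])

inertias = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(float(value), 1))

labels3, centroids3, _ = kmeans(points, 3)
```

A single random start can get stuck in a poor solution, so in practice it is worth rerunning with a few different `seed` values and keeping the lowest inertia.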
posted by handee at 7:09 AM on September 30, 2008

I don't think I can do a scatterplot, because each random variable has 2n dimensions, which makes it impossible to visualize after a single point.

The n points within an observation are dependent on one another, and I'd need to be able to visualize the overall clustering of the observations on a flat plane.
posted by Blazecock Pileon at 9:58 AM on September 30, 2008

I was just coming in to suggest k-means as an easy starting point.

Agglomerative clustering may also work, especially if you have no idea how many groups to expect. Be sure to pay attention to your distance function, as how you calculate it will make a difference.
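A small sketch of why the distance function matters, using NumPy only. The four profiles are invented, and `agglomerate` is a bare-bones average-linkage routine written for illustration rather than a library function. With Euclidean distance the profiles group by magnitude; with correlation distance (1 minus Pearson's r) they group by shape.

```python
import numpy as np

def euclidean_distances(data):
    """Pairwise Euclidean distances between rows."""
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def correlation_distances(data):
    """1 - Pearson correlation between rows; same-shaped rows are ~0 apart."""
    return 1.0 - np.corrcoef(data)

def agglomerate(dist, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Invented profiles: rising/falling shape at small/large magnitude.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0],      # rising, small
    [10.0, 20.0, 30.0, 40.0],  # rising, large
    [4.0, 3.0, 2.0, 1.0],      # falling, small
    [40.0, 30.0, 20.0, 10.0],  # falling, large
])

by_euclid = agglomerate(euclidean_distances(profiles), 2)
by_corr = agglomerate(correlation_distances(profiles), 2)
print("euclidean:  ", by_euclid)
print("correlation:", by_corr)
```

The pairwise loop is fine for a few hundred observations but gets slow beyond that; a statistics package's own hierarchical clustering routine would be the sensible choice for real data.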
posted by chrisamiller at 10:11 AM on September 30, 2008

Let me see if I can take the iris example from R and extend it to my case.

In the original iris, each flower has a vector of four measurements: sepal length, sepal width, petal length, and petal width.

In my example case, imagine that each new flower instead has a vector of two vectors, not four points. I need to preserve the structure of the two vectors because there is dependency there.

I'd like to cluster flowers based on these two vectors. What techniques could I use?
posted by Blazecock Pileon at 12:51 PM on September 30, 2008

So your dataset for each observation is of dimension 2n... Well, you could try Principal Component Analysis (PCA) first then. PCA reduces dimensionality by finding linear combinations of the original variables (each linear combination is called a component) that maximise the variance and therefore retain as much structure as possible (linearly) from the original dataset. It's a common technique for dimensionality reduction, and you can measure how much of the original dataset's variance is retained by comparing the variance explained by the first k principal components to the total variance of the original dataset.

First you should standardise your data to z scores by subtracting the mean and dividing by the standard deviation for each dimension - this is a vital step (for most datasets) that some statistical packages do automagically for you and others don't.
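For reference, a z score is (value - mean) / standard deviation, where the standard deviation is the square root of the variance. A minimal NumPy sketch, with made-up numbers and an illustrative `zscore` helper:

```python
import numpy as np

def zscore(data):
    """Standardise each column to mean 0 and standard deviation 1."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)  # sample standard deviation
    return (data - mean) / std

# Made-up data: two columns on very different scales.
data = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 200.0],
    [4.0, 400.0],
])
z = zscore(data)
print(z)
```

A column that never varies has a standard deviation of zero and would need to be dropped (or handled separately) before this step.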

Most statistical packages have PCA, and there is an overview on Wikipedia.

Try the first two or three principal components first and plot those - if you're lucky, the clusters will then be obvious from the 2d or 3d plot. The first principal component tries to account for as much variance as it can in the original dataset, and the 2nd tries to account for what is left, so each successive principal component "explains" less of the original data. You can plot the variance explained for each component and work out how many to retain from that plot (called a scree plot).
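Here is a rough sketch of that recipe in NumPy, doing PCA through a singular value decomposition. Everything about the data is an assumption for illustration: n = 5, 100 X observations and 200 Y observations, each stored as an n-by-2 array and flattened to a vector of length 2n. The `pca` function is a hand-rolled helper, not a specific package's API.

```python
import numpy as np

def pca(data, n_components):
    """Return (scores, ratio): projections onto the leading components and
    the fraction of total variance explained by every component."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = s ** 2 / (len(data) - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = centered @ vt[:n_components].T
    return scores, ratio

# Invented data: each observation is n rows of (a, b), flattened to 2n values.
rng = np.random.default_rng(0)
n = 5
X = rng.normal(0.0, 1.0, size=(100, n, 2))
Y = rng.normal(2.0, 1.0, size=(200, n, 2))
flat = np.concatenate([X, Y]).reshape(300, 2 * n)

# Standardise to z scores, then project onto the first two components.
z = (flat - flat.mean(axis=0)) / flat.std(axis=0, ddof=1)
scores, ratio = pca(z, 2)

# The values behind a scree plot: variance explained per component.
print(np.round(ratio, 3))
print("first two components explain", round(float(ratio[:2].sum()), 3))
```

Plotting `scores[:, 0]` against `scores[:, 1]`, coloured by whether the row came from X or Y, gives the flat-plane view. Note that flattening treats the 2n values as unordered variables, so any dependence along 1...n is only captured through their correlations.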

(I've got an exam on this stuff next Wednesday so it's fresh in my mind!)
posted by handee at 12:43 AM on October 1, 2008

If, as you describe above, your situation is basically performing a cluster analysis but with only two variables, then the answer is: cluster analysis doesn't care how many variables you're clustering on. 2? 4? 7? It doesn't matter. That's because clustering algorithms operate (depending on which kind you use) on the distances or similarities between your data points. For agglomerative or divisive clustering methods you calculate the distance matrix first (and, depending on your data, you may have many options and decisions for how you calculate distance and preprocess the data - for the latter, seconding handee's comment on standardization). K-means handles distance indirectly, by iteratively recomputing the centroid of each cluster and reassigning points as needed to whichever new centroid is closest. I don't know enough about your question to recommend a particular clustering approach, but you should read at least a good overview guide so that you know the pros and pitfalls of the various algorithms.

If I'm interpreting your overall question correctly, it sounds like you have the equivalent of two sets of "irises" and want to not only cluster but determine whether the cluster structure is the same. I agree that plotting is a great step: perform the clustering, then plot the data color coding by cluster assignment. For example, you could overlay scatterplots of both data sets on one plot, using different symbols to represent the source data and different colors to represent the cluster assignments. Also, do boxplots of your variables (assuming underlying data is on continuous scale) sliced by cluster. This would be a very helpful step before getting into more formal tests of cluster overlap.
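Before any plotting, the same comparison can be sketched numerically: cross-tabulate which data set each observation came from against its cluster assignment, and compute the per-cluster quartiles that a boxplot would draw. The source labels, cluster labels, and values below are all invented for illustration, and `crosstab` is a hypothetical helper.

```python
import numpy as np

# Invented example: which data set each observation came from,
# the cluster it was assigned to, and one measured variable.
source = np.array(["X"] * 4 + ["Y"] * 6)
cluster = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])
values = np.array([1.1, 0.9, 1.3, 3.8, 4.1, 3.9, 4.4, 4.0, 1.0, 4.2])

def crosstab(source, cluster):
    """Count observations for every (source, cluster) combination."""
    sources = np.unique(source)
    clusters = np.unique(cluster)
    table = np.array([
        [int(np.sum((source == s) & (cluster == c))) for c in clusters]
        for s in sources
    ])
    return sources, clusters, table

sources, clusters, table = crosstab(source, cluster)
print("rows:", sources, "columns:", clusters)
print(table)

# The numbers behind a boxplot of `values` sliced by cluster.
quartiles = {
    int(c): np.percentile(values[cluster == c], [25, 50, 75])
    for c in clusters
}
for c, q in quartiles.items():
    print("cluster", c, "quartiles:", np.round(q, 2))
```

If the rows of the table look very different from each other, the two data sets are not falling into the clusters in the same proportions, which is worth knowing before any formal test.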
posted by shelbaroo at 5:33 AM on October 1, 2008
