Clustering techniques
June 11, 2008 2:04 PM

What clustering technique to use?

I have a set of roughly 500 curves. Each curve is a numerical profile of the behavior of a transcription factor (represented by its binding motif) along a set of genomic coordinates.

In addition, I have six pre-ordained structural classifications. Each of the 500 transcription factor types is a member of one classification.

So far, I have performed hierarchical clustering on the distances between the curves. I then color the leaves by their corresponding classification to see how the factors organize.
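For concreteness, here is a minimal R sketch of that workflow. It assumes the curves are stored as rows of a numeric matrix, uses simulated data in place of the real profiles, and picks Euclidean distance and average linkage purely for illustration; the names `curves` and `classes` are hypothetical.

```r
# Simulated stand-in for the real data: 30 curves (rows) x 40 positions (columns).
set.seed(1)
n_curves <- 30
curves <- matrix(rnorm(n_curves * 40), nrow = n_curves)

# Hypothetical structural class labels, one of six per curve.
classes <- factor(rep(paste0("class", 1:6), length.out = n_curves))

# Pairwise Euclidean distances between curves, then average-linkage clustering.
d <- dist(curves)
hc <- hclust(d, method = "average")

# Cut the tree into six groups and cross-tabulate them against the known classes.
groups <- cutree(hc, k = 6)
agreement <- table(cluster = groups, class = classes)
print(agreement)
```

Calling `plot(hc, labels = as.character(classes))` would draw the dendrogram with each leaf labeled by its class; the cross-tabulation gives the same comparison as a table.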

Is there a way to incorporate information from the structural classifications to assist the clustering? What techniques would be better suited to that?

I looked into k-means clustering, but I'm uncertain how to merge the curve information with, say, a six-dimensional unit vector (each axis being a structural classification) that represents membership in a class.

Thanks for your advice.
posted by Blazecock Pileon to Science & Nature (6 answers total)
 
Are you assuming that the curves will sort into the 6 classifications? Then k-means (with k=6) would seem to be the way to go.
posted by demiurge at 3:11 PM on June 11, 2008


Response by poster: "Then k-means (with k=6) would seem to be the way to go."

The usage examples I see in Eisen's cluster and the kmeans function in R suggest they work with single points (e.g. a gene expression level) as opposed to a vector of points (my score curve, or "profile", which is effectively multiple "expression levels"). Do I just treat the "time course" or "trial" columns as analogous to positions along the genome, with each row holding the score values for one TF?
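A small sketch of that layout in R, assuming every profile has the same length: each row is one TF, each column is one position, and `kmeans` treats every row as a single point in a 40-dimensional space. The data is simulated and the names `profiles`, `n_tf`, and `n_pos` are placeholders.

```r
# Simulated stand-in: 60 TFs (rows) scored at 40 genomic positions (columns).
set.seed(42)
n_tf <- 60
n_pos <- 40
profiles <- matrix(rnorm(n_tf * n_pos), nrow = n_tf, ncol = n_pos,
                   dimnames = list(paste0("TF", seq_len(n_tf)), NULL))

# Each row is one observation; k = 6 matches the number of structural classes.
# nstart = 10 reruns from several random starts and keeps the best solution.
fit <- kmeans(profiles, centers = 6, nstart = 10)

# One cluster label per TF; each cluster center is itself a curve of length n_pos.
print(head(fit$cluster))
print(dim(fit$centers))
```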
posted by Blazecock Pileon at 4:39 PM on June 11, 2008


kmeans works with vector data; you just make the first argument a matrix with one observation per row. Your N-tuples are really big, so it will probably work like crap. Here's example code.


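set.seed(1)  # make the random noise reproducible
# Four prototype vectors, one per cluster
x1 <- c(4, 0, 0, 0, 1)
x2 <- c(0, 4, 1, 0, 0)
x3 <- c(0, 0, 4, 2, 0)
x4 <- c(1, 0, 0, 0, 3)
centers <- rbind(x1, x2, x3, x4)  # used as the initial cluster centers
y <- centers
# Append 15 noisy copies of each prototype, one observation per row
for (i in 1:15) {
  y <- rbind(y, x1 + rnorm(5), x2 + rnorm(5), x3 + rnorm(5), x4 + rnorm(5))
}
# Cluster the rows of y, starting from the four prototypes
z <- kmeans(y, centers)
print(z)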

posted by a robot made out of meat at 5:45 PM on June 11, 2008


I'm not a genomics guy, so I'm not sure what your data is like exactly. Each curve you want to cluster is a vector of points? How long is the vector and how many dimensions are each point?

The first thing I would do is flatten each curve into a single vector of floats and run the k-means clustering on that, in a space with (vector length)*(point dimension) dimensions.
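A rough sketch of that flattening in R, assuming (hypothetically) that the curves sit in a 3-d array indexed as curve, point, coordinate. If each point is a single score, the data is already a plain matrix and this step is unnecessary.

```r
# Hypothetical layout: curves[i, j, k] is coordinate k of point j on curve i.
set.seed(7)
n_curves <- 20
n_points <- 40
point_dim <- 2
curves <- array(rnorm(n_curves * n_points * point_dim),
                dim = c(n_curves, n_points, point_dim))

# Collapse the point and coordinate axes into one: one row of floats per curve.
# R stores arrays column-major, so flat[i, j + (k - 1) * n_points] == curves[i, j, k].
flat <- matrix(curves, nrow = n_curves)
print(dim(flat))

# k-means now sees each curve as one point in an 80-dimensional space.
fit <- kmeans(flat, centers = 4, nstart = 5)
```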
posted by demiurge at 5:47 PM on June 11, 2008


Oh, and you could use your a priori knowledge in a couple of ways that I can think of. I have no idea how these behave statistically.
1) Use the categories to generate initial clusters, since iterative minimization algorithms generally work better with a good initial guess. This is probably the best/most justifiable.
2) Tack on an extra dimension (or six, equivalently) to your N-tuples, with values weighted to push like-categorized data together to varying degrees.
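Both ideas can be sketched in a few lines of R. This is only an illustration on simulated data: the weight `w`, the variable names, and the choice of six one-hot columns for option 2 are assumptions, not a recommendation.

```r
# Simulated data: 6 classes x 10 curves each, 40 positions per curve.
set.seed(3)
n_class <- 6
n_per_class <- 10
n_pos <- 40
classes <- factor(rep(seq_len(n_class), each = n_per_class))
class_means <- matrix(rnorm(n_class * n_pos, sd = 2), nrow = n_class)
noise <- matrix(rnorm(n_class * n_per_class * n_pos), ncol = n_pos)
curves <- class_means[as.integer(classes), ] + noise

# Option 1: seed k-means with the mean curve of each structural class.
init <- rowsum(curves, classes) / as.vector(table(classes))
fit_seeded <- kmeans(curves, centers = init)

# Option 2: append one indicator column per class, scaled by a weight w.
# w = 0 ignores the classes; larger w pulls same-class curves together.
w <- 2  # hypothetical value, would need tuning
onehot <- model.matrix(~ classes - 1)
fit_aug <- kmeans(cbind(curves, w * onehot), centers = n_class, nstart = 10)

# Compare each solution against the known classes.
print(table(cluster = fit_seeded$cluster, class = classes))
print(table(cluster = fit_aug$cluster, class = classes))
```

With the one-hot columns, two curves from different classes gain an extra 2 * w^2 in squared Euclidean distance, while curves from the same class gain nothing, so `w` directly controls how strongly the classification shapes the clusters.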
posted by a robot made out of meat at 5:51 PM on June 11, 2008


Response by poster: "Each curve you want to cluster is a vector of points? How long is the vector and how many dimensions are each point?"

It would be the minimum length of the vectors I'm comparing. The vector could be as short as 40 points.
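One way to get equal-length rows for `dist` or `kmeans`, sketched in R on made-up data. It assumes the curves are already aligned at their first position, so keeping the first `min_len` values of each is meaningful; that alignment is an assumption, and `curve_list` is a hypothetical name.

```r
# Hypothetical input: a named list of score curves with differing lengths.
set.seed(5)
curve_list <- list(tfA = rnorm(40), tfB = rnorm(55), tfC = rnorm(48))

# The shortest curve sets the common length.
min_len <- min(lengths(curve_list))

# Keep the first min_len positions of every curve and stack them as rows.
profiles <- t(sapply(curve_list, function(v) v[seq_len(min_len)]))
print(dim(profiles))

# The rows are now comparable, e.g. for hierarchical clustering or k-means.
d <- dist(profiles)
```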
posted by Blazecock Pileon at 7:32 PM on June 11, 2008

