# Clustering techniques

June 11, 2008 2:04 PM

What clustering technique to use?

I have a set of roughly 500 curves. Each curve is a numerical profile of the behavior of a transcription factor (represented by its binding motif) along a set of genomic coordinates.

In addition, I have six predefined structural classifications. Each of the 500 transcription factor types belongs to exactly one classification.

So far, I have performed hierarchical clustering on the distances between the curves. I then color the leaves by their corresponding classification to see how the factors organize.
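For reference, that workflow can be sketched in R roughly as follows. This is only an illustration on simulated data: the `scores` matrix, the `classes` factor, Euclidean distance, and average linkage are all assumptions, not details of the actual analysis. Cross-tabulating the cut tree against the known classes gives a numeric counterpart to coloring the leaves.

```r
set.seed(1)

# Hypothetical stand-in data: 60 curves (rows), 40 positions each (columns),
# with 10 curves in each of 6 structural classes.
n_curves <- 60
n_pos <- 40
classes <- factor(rep(paste0("class", 1:6), each = 10))
scores <- matrix(rnorm(n_curves * n_pos), nrow = n_curves)
scores <- scores + as.integer(classes)  # shift each class so there is some structure

# Hierarchical clustering on pairwise distances between curves.
# Euclidean distance and average linkage are assumptions here.
d <- dist(scores)
hc <- hclust(d, method = "average")

# Cut the tree into six groups and compare them with the known classes.
groups <- cutree(hc, k = 6)
tab <- table(cluster = groups, class = classes)
print(tab)
```

A table with one dominant cell per row would mean the tree largely recovers the structural classes; a diffuse table would mean it does not.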

Is there a way to incorporate the structural classifications into the clustering itself? What techniques would be better suited for that?

I looked into *k*-means clustering, but I'm not sure how to merge the curve information with, say, a six-dimensional unit vector (one axis per structural classification) that represents membership in a class.

Thanks for your advice.

Response by poster:

*Then k-means (with k=6) would seem to be the way to go.*

The usage examples I see in Eisen's `cluster` and the `kmeans` function in R suggest they work with single points (e.g. gene expression level) as opposed to a vector of points (my score curve, or "profile", which is multiple "expression levels"). Do I just treat the "time course" or "trial" columns as analogous to positions along the genome, with each row holding the score values for one TF?

posted by Blazecock Pileon at 4:39 PM on June 11, 2008

`kmeans` works with vector data; you just make the first argument a matrix (data in rows, I think). You have really big N-tuples, so it will probably work like crap. Here's example code.

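    # Four prototype vectors, one per cluster
    x1 <- c(4, 0, 0, 0, 1)
    x2 <- c(0, 4, 1, 0, 0)
    x3 <- c(0, 0, 4, 2, 0)
    x4 <- c(1, 0, 0, 0, 3)

    # One observation per row; start with the prototypes themselves
    y <- rbind(x1, x2, x3, x4)
    x <- y  # keep the prototypes to use as initial cluster centers

    # Add 15 noisy copies of each prototype
    for (i in 1:15) {
      y <- rbind(y, x1 + rnorm(n = 5), x2 + rnorm(n = 5),
                 x3 + rnorm(n = 5), x4 + rnorm(n = 5))
    }

    # Cluster the rows of y, starting from the centers in x
    z <- kmeans(y, x)
    print(z)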

posted by a robot made out of meat at 5:45 PM on June 11, 2008

I'm not a genomics guy, so I'm not sure what your data is like exactly. Each curve you want to cluster is a vector of points? How long is the vector and how many dimensions are each point?

The first thing I would do is flatten each curve into a single vector of floats and run k-means clustering on that, in a (vector length)*(point dimension)-dimensional space.
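A minimal sketch of that flattening step in R, assuming (hypothetically) that each curve is stored as a matrix with one row per position and one column per point dimension. The sizes, the simulated data, and the names `curves` and `flat` are made up for illustration.

```r
set.seed(2)

# Hypothetical sizes: 30 curves, 40 positions per curve, 2 values per position.
n_curves <- 30
curve_len <- 40
point_dim <- 2

# Each curve is a curve_len x point_dim matrix of simulated scores.
curves <- lapply(seq_len(n_curves), function(i) {
  matrix(rnorm(curve_len * point_dim), nrow = curve_len)
})

# as.vector() unrolls each matrix column by column, so every curve becomes
# one vector of length curve_len * point_dim; stack those vectors as rows.
flat <- t(sapply(curves, as.vector))
print(dim(flat))

# kmeans treats each row as one observation.
fit <- kmeans(flat, centers = 6, nstart = 10)
print(table(fit$cluster))
```

If each point is already a single score, the curve is its own flat vector and this step reduces to stacking the curves as rows of a matrix.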

posted by demiurge at 5:47 PM on June 11, 2008

Oh, and you could use your a priori knowledge in a couple of ways that I can think of. I have no idea how these behave statistically.

1) Use the categories to generate initial clusters, since iterative minimization algorithms generally work better with a good starting guess. This is probably the best/most justifiable.

2) Tack on an extra dimension (or six, equivalently) to your N-tuples, with values that place varying weight on pushing together like-categorized data.
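Both ideas can be sketched in R along these lines. The data are simulated, and the names `curves`, `classes`, `init`, and the weight `w` are hypothetical choices for illustration only; in particular, there is no principled value for `w` implied here.

```r
set.seed(3)

# Hypothetical data: 6 classes, 20 curves per class, 40 positions per curve.
k <- 6
n_per_class <- 20
n_pos <- 40
classes <- factor(rep(seq_len(k), each = n_per_class))
curves <- matrix(rnorm(k * n_per_class * n_pos), ncol = n_pos) +
  as.integer(classes)  # per-class shift so the toy data has structure

# Option 1: use the mean curve of each class as the initial cluster centers.
# rowsum() adds up the rows within each class; dividing by the class sizes
# turns those sums into means (one row per class).
init <- rowsum(curves, classes) / as.vector(table(classes))
fit_seeded <- kmeans(curves, centers = init)

# Option 2: append six weighted indicator (one-hot) columns, one per class.
# A larger w pulls same-class curves together more strongly.
w <- 2  # hypothetical weight, to be tuned
indicators <- model.matrix(~ classes - 1)
augmented <- cbind(curves, w * indicators)
fit_aug <- kmeans(augmented, centers = k, nstart = 10)

print(table(seeded = fit_seeded$cluster, class = classes))
print(table(augmented = fit_aug$cluster, class = classes))
```

With one-hot columns scaled by `w`, two curves from different classes gain an extra `2 * w^2` in squared Euclidean distance, while curves from the same class gain nothing. At `w = 0` this is plain k-means on the curves, and as `w` grows the clusters are increasingly dictated by the class labels.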

posted by a robot made out of meat at 5:51 PM on June 11, 2008

Response by poster:

*Each curve you want to cluster is a vector of points? How long is the vector and how many dimensions are each point?*

It would be the minimum length of the vectors I'm comparing. The vector could be as short as 40 points.
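Since `kmeans` needs a matrix with the same number of columns in every row, curves of different lengths have to be brought to a common length first. One simple option, sketched here on simulated data, is to truncate every curve to the shortest length. The names `curve_list` and `mat` are hypothetical, and keeping the first `min_len` points is an assumption; which stretch of each curve to keep depends on how the genomic coordinates line up.

```r
set.seed(4)

# Hypothetical curves of unequal length (between 40 and 60 points each).
lens <- sample(40:60, 12, replace = TRUE)
curve_list <- lapply(lens, function(n) rnorm(n))

# Truncate every curve to the length of the shortest one.
min_len <- min(lengths(curve_list))
mat <- t(sapply(curve_list, function(v) v[seq_len(min_len)]))

# One row per curve, min_len columns: a valid first argument for kmeans().
print(dim(mat))
```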

posted by Blazecock Pileon at 7:32 PM on June 11, 2008

This thread is closed to new comments.

posted by demiurge at 3:11 PM on June 11, 2008