What are the odds that I can solve this problem?
June 19, 2007 7:46 AM

Probability/stat question: I'm looking for patterns in a protein sequence, and I've found a few that occur quite frequently. How do I know these are actual patterns and not just an artifact of random amino acid distribution?

I have a roughly 1000 amino acid sequence, and I've used a sliding window to chop it up into overlapping 6-mers. Some of these 6-mers occur much more frequently than others and I suspect they have some sort of biological significance. Unfortunately, I don't know how to test whether these are true patterns in the biological sense, or if they could just as easily have been the result of random distribution.
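For readers following along, the sliding-window count described above might look roughly like this. It is only a sketch: the function name `count_kmers` and the toy sequence are made up for illustration and are not the poster's actual code or data.

```python
from collections import Counter

def count_kmers(seq: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in seq, sliding the window one residue at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical toy sequence (not real data): three GVGVP repeats plus a tail.
toy = "GVGVPGVGVPGVGVPAAKLM"
counts = count_kmers(toy)

# Show the most frequent 6-mers and how often each occurs.
for kmer, n in counts.most_common(3):
    print(kmer, n)
```

A sequence of length n yields n - 5 overlapping 6-mers, so a 1000-residue sequence gives 995 windows.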

I've tried comparing the expected frequency of these 6-mers, based on the amino acid distribution, with the observed frequency, but the chance of getting any given 6-mer randomly is so low that almost anything I observe (even the ones that only show up once) seems really significant. I'll be happy to clarify things if this post is a bit messy.
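One way to make that comparison concrete is to compute an expected count per 6-mer rather than a per-window probability. The sketch below assumes residues are independent and drawn with the sequence's own frequencies; the function name `expected_kmer_count` is hypothetical, and this independence model is an assumption for illustration, not necessarily the calculation the poster used.

```python
from collections import Counter
from math import prod

def expected_kmer_count(seq: str, kmer: str) -> float:
    """Expected number of windows equal to kmer, assuming residues are drawn
    independently with the frequencies observed in seq (an assumed null model)."""
    n = len(seq)
    freq = Counter(seq)
    # Probability that a single window spells out kmer under independence.
    p_window = prod(freq[a] / n for a in kmer)
    windows = n - len(kmer) + 1
    return windows * p_window

# Toy check: in "ABABABABAB", each residue has frequency 0.5,
# so "AB" is expected 9 * 0.25 = 2.25 times under this model.
print(expected_kmer_count("ABABABABAB", "AB"))
```

Note that a tiny expected count for one specific 6-mer does not by itself make a single observation surprising, because there are a great many possible 6-mers and some of them have to occupy the windows.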
posted by reformedjerk to science & nature (11 comments total)
You need to do a little reading from Knuth's Art of Computer Programming, Vol. 2. It has a good overview of testing randomness hypotheses and quite a lot of specific tests. When the chance for each outcome is very small and you have a limited number of observations, the typical chi-square test is not at its best, but Knuth's Collision Test might suit your scenario well.
posted by Wolfdog at 7:56 AM on June 19, 2007


Well, proteins are all about structure/function. If you are seeing a pattern of residues then it probably makes up some kind of secondary structure like a helix or sheet. You can test it with 3D structure prediction websites to see how probable that is.
posted by dendrite at 7:57 AM on June 19, 2007


Secondly, abandon your preconception that residue sequences are somehow random. They are not. Proteins are very specific little machines. Any sequence of amino acids that are somehow randomly incorporated would not be biologically relevant and would not be evolutionarily conserved in the organism.
posted by dendrite at 8:01 AM on June 19, 2007


Here's my mostly mathematical explanation:

the chance of getting any given 6-mer randomly is so low that almost anything I observe (even the ones that only show up once) seem really significant

This is an easy fallacy to fall into. The chance of some 6-mer occurring in a given window is 100%, so there has to be something there. They're all equally unlikely, but it's not a surprise that they would occur. For example, it was unlikely, all other things being equal, that I would wake up this morning to a genetics probability question on AskMe, but here it is and it's not surprising.

What makes an occurrence statistically significant (roughly speaking) is if it forms a surprising coincidence. By analogy, if I had just read an article on solving this problem, then paged over to AskMe to find the same problem, then it would be a surprising occurrence, as both of those events are very unlikely but unsurprising by themselves.

Since there are 20^6 (64 million) possibilities for a given 6-mer, and you have only 995 present (one starting at each of the first 995 amino acids), I'd say that any single repetition is statistically significant. If you see a given 6-mer once, no surprise. If you see it again, then you've got a pattern worth investigating.
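As a rough back-of-the-envelope check of that argument, here is the arithmetic in Python. It assumes all 20 amino acids are equally likely and treats windows as independent, which overlapping windows are not, so take it as an order-of-magnitude sketch only. The variable names are made up for illustration.

```python
from math import comb

ALPHABET = 20             # standard amino acids
K = 6                     # 6-mers
WINDOWS = 1000 - K + 1    # 995 overlapping windows in a 1000-residue sequence

possible = ALPHABET ** K      # number of distinct 6-mers: 20^6
pairs = comb(WINDOWS, 2)      # number of pairs of windows that could match
# Expected number of matching pairs if every 6-mer were equally likely
# (simplifying assumption: uniform residues, independent windows).
expected_matching_pairs = pairs / possible

print(possible, pairs, expected_matching_pairs)
```

Under those assumptions the expected number of repeated pairs is well below one. A sequence with a strongly skewed composition makes chance repeats more likely than the uniform figure suggests.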
posted by lostburner at 8:06 AM on June 19, 2007


How do I know these are actual patterns and not just an artifact of random amino acid distribution?

The problem is in the question you're asking. No protein sequence is random, because of the structure/function relationship of proteins. You need to phrase your question differently. For example, you might say:

"Is this sequence correlated with disease X?"

or

"Is the presence of this sequence correlated with this kind of protein function?"

or

"Is this mutation in this sequence under selection pressure?"

If you can give us a better idea of what you're trying to do, it may be easier to give you an answer that helps.
posted by chrisamiller at 8:19 AM on June 19, 2007


Ah yes, proteins. Just realized that this isn't a genetics problem at all (as I said it was in my previous answer). For DNA you might be interested in random vs. nonrandom sequences, but for proteins I think these guys are right. They're always going to be patterned.
posted by lostburner at 8:39 AM on June 19, 2007


Thanks for the answers so far. I understand that proteins aren't simply sequences of random amino acid arrangements. But the problem I have is that for a given 6-mer I don't know how to show that it is statistically significant beyond just being a result of the amino acid distribution bias of the sequence. The sequence in question has an unusual abundance of glycines and valines and prolines, so when a lot of the top patterns are rich in these amino acids, I can't say with certainty that they're not merely results of the aa distribution.
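One common way to separate "pattern" from "composition bias" is a shuffle test: keep exactly the same residues, scramble their order many times, and see how often a shuffled sequence contains the 6-mer at least as often as the real one. The sketch below is an illustration under assumptions, not an established recipe for this dataset; the function names, the number of shuffles, and the seed are all arbitrary choices.

```python
import random

def kmer_count(seq: str, kmer: str) -> int:
    """Number of (possibly overlapping) occurrences of kmer in seq."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)

def shuffle_p_value(seq: str, kmer: str, n_shuffles: int = 1000, seed: int = 0) -> float:
    """Empirical p-value: fraction of composition-preserving shuffles in which
    kmer occurs at least as often as in the original sequence."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = kmer_count(seq, kmer)
    residues = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)          # same residues, random order
        if kmer_count("".join(residues), kmer) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)
```

Two caveats. This tests one 6-mer chosen in advance; if you pick the top 6-mers after looking at the counts, you would need to correct for that, for example by comparing against the largest count seen in each shuffle. Also, shuffling preserves only single-residue composition, not any dependence between neighboring residues.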
posted by reformedjerk at 11:02 AM on June 19, 2007


Were you expecting to see those proteins repeated for some reason? Instead of trying to test stat. sig. of your current results, what I recommend is trying to find an underlying reason that those proteins might be repeated, frame that reason as a hypothesis, and then design an experiment to test that hypothesis. There are a lot of stat tests you could do on your current data, but I would not find the conclusions very strong w/o an underlying hypothesis.

But to answer the question, I have a few ideas. I should first say that I have no idea what a mer is. But I will give this a shot anyway. I assume a 6-mer is made up of individual mers? And you want to identify if 6 of them occurring together has any significance?

First, you could choose one mer, one of the ones in the 6-mer pattern you are interested in, as your dependent variable; let's call it Q-mer. Code it as 1 if it is in a given 6-mer sequence and 0 if it is not. Then you can code all the other mers as independent dummy variables that are =0 if the mer is not present and =1 if the mer is present in the same 6-mer sequence. Do this for all of your 6-mer sequences. You can use these independent variables to predict the dependent variable using multinomial logit analysis. You can compare the strength of the estimated terms on the independent variables, and each would come with its own significance test. You would be able to say things like X mer has ## value for predicting the presence of Q-mer in a 6-mer sequence, while Y mer only has # predictive value. You could statistically show that X mer has more predictive value.

You could also count up all the different mer patterns you see. Maybe the counts would fit some sort of distribution (normal, Poisson, etc.). You could test statistically whether they did, and then you could say something about that distribution.
posted by Eringatang at 1:07 PM on June 19, 2007


I also have this feeling that your "sliding window" approach is not accounting for "repeated measurements" when using statistics. I suggest choosing 6-mer sequences at random from your sample, making sure they don't overlap. Many fewer than 1000/6 should be enough to gain predictive power using a multinomial logit model, if there is truly a "pattern."
posted by Eringatang at 1:18 PM on June 19, 2007


This

"it is statistically significant beyond just being a result of the amino acid distribution bias of the sequence"

is probably not a biologically meaningful hypothesis. You seem to be asking

"given the frequency of various residues in this sequence, if they were lined up randomly, what is the chance that these six residues would appear in this particular order X number of times over a total length of Y residues."

If that is the question you want to ask, maybe a statistician here can answer it for you.

However, it seems to me that there are almost no biological systems for which this would be an interesting question to ask, because we already know that the sequence of functional proteins is NOT random.
posted by amphioxus at 4:41 PM on June 19, 2007


You're hardly the first person to look for patterns in protein sequences, so rather than writing your own program to count the 6-mers, start by looking in the bioinformatics literature about "Motif Discovery". I haven't kept up to date, but 10 years ago I thought the program "Pratt" by Inge Jonassen was very nice. There are even web sites where you can use it by just pasting your collection of sequences into a form. There is also an algorithm from IBM called SPLASH, but I don't recall its pros and cons, other than that it is very fast.

A real motif finding program brings many advantages - it will find patterns both shorter and longer than 6, with wildcards too. Also someone else does the probability theory for you.

If you really want to pursue the 6-mers, you need to learn about Markov chains so you can let the probability distribution of the amino acids depend on the amino acids at the previous N sites, and have a hard think about what your null hypothesis is: a particular 6-mer may be frequent because it is a biologically significant motif, but it's just as likely that it happens to contain a 5-mer motif or is part of a 7-mer motif.
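To give a flavor of the Markov chain idea, here is a minimal sketch for the simplest case, N = 1, where each residue depends only on the one before it. The transition probabilities are estimated from the sequence itself. The function name `markov1_expected_count` is hypothetical, and a real analysis would likely use a higher order and some smoothing for rarely seen pairs.

```python
from collections import Counter

def markov1_expected_count(seq: str, kmer: str) -> float:
    """Expected count of kmer under a first-order Markov model fitted to seq
    (illustrative assumption: order 1, no smoothing)."""
    n = len(seq)
    singles = Counter(seq)                                # residue counts
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))   # adjacent-pair counts
    starts = Counter(seq[:-1])                            # residues that have a successor

    prob = singles[kmer[0]] / n                           # P(first residue)
    for a, b in zip(kmer, kmer[1:]):
        if starts[a] == 0:
            return 0.0                                    # residue a never has a successor in seq
        prob *= pairs[a + b] / starts[a]                  # P(b | previous residue is a)
    return (n - len(kmer) + 1) * prob

# Toy check: in a strictly alternating sequence, "ABA" is fully predictable.
print(markov1_expected_count("ABABABABAB", "ABA"))
```

Comparing the observed count with this expectation asks whether the 6-mer is more frequent than its overlapping pairs already imply, which is a stricter null than composition alone.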
posted by Canard de Vasco at 5:37 PM on June 19, 2007

