The Pattern in the Rough
November 23, 2011 11:43 AM

What tools are available for determining patterns in short, semi-random sequences of variables?

I have a list of approximately three hundred short peptides containing a large number of non-proteinogenic amino acids. I want to see whether any patterns repeat among the peptides (XYZ, XY*Z, X(polar)YZ, etc.), but the non-standard residues prevent me from using the common bioinformatics tools.

Are there any simple tools for conducting similarity searches on small, variable strings? This seems like a problem that would crop up in many fields (e.g. cryptography), but I haven't been able to find anything.


Example:

TRSWELPM
SDSNLPM
ERJSNAIMA

Sequences one and two both contain an "S" in their third position and end with "LPM". Sequence 3 has no overlap.
posted by Orange Pamplemousse to Science & Nature (8 answers total)
 
Look up 'longest common subsequence' and other string search algorithms. You will find easy algorithms for them on Wikipedia or the like, and can go from there.
posted by gregglind at 12:05 PM on November 23, 2011
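
A minimal sketch of the longest-common-subsequence idea mentioned above, in plain Python with no bioinformatics libraries. It assumes each peptide is given as a string or a list of residue tokens, so a non-standard residue can be any token rather than one of the 20 standard letters. The function name is made up for illustration.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two sequences as a list.

    Works on strings or on lists of residue tokens, since it only
    compares items for equality.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the items.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]
```

On the example from the question, TRSWELPM and SDSNLPM give S, L, P, M: the shared "S" plus the "LPM" ending. Because a subsequence allows gaps, this is closer to the XY*Z style of pattern than to an exact XYZ match.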


You could do a clustering based on pairwise edit distances.

I don't know a whole lot about BLAST and its fellows, but I do know that simply having different letters in the sequence won't change the algorithm. If there's an existing technique for all-to-all comparisons of a set of amino acid sequences, adding some amino acids to the language won't make a difference. It'll only affect searching on an existing database of sequences.
posted by demiurge at 12:09 PM on November 23, 2011
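
A rough sketch of clustering on pairwise edit distances, again in plain Python. It assumes unit costs for insert, delete and substitute, and uses a simple single-linkage rule with a cutoff. Both function names and the `max_distance` parameter are invented here for illustration, and a real analysis would probably want a proper hierarchical clustering library.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cluster_by_distance(seqs, max_distance):
    """Single-linkage clustering: link any pair within max_distance."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if edit_distance(seqs[i], seqs[j]) <= max_distance:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())
```

With the three example peptides and `max_distance=4`, the first two land in one cluster and ERJSNAIMA is left on its own. The all-pairs loop is quadratic in the number of peptides, which should be fine for a few hundred short sequences.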


I can't find my old askme about a similar problem but I remembered the promising algorithm: Levenshtein distance. It's expensive in runtime (O(n*m) for two strings of lengths n and m) but pretty simple.
posted by chairface at 12:55 PM on November 23, 2011


Response by poster: demiurge, in theory the BLAST and Clustal tools would work perfectly for me. The problem comes in the input string. Every tool I can find is designed for proteins, and therefore requires an input like:

>
KLISTQDE

Nice, simple and proteinogenic. But my sequence has almost as many different types of amino acids as peptides, so the 20-letter amino acid code isn't going to cut it. If I had the skills to edit the source code things would of course be different, but I'm hoping someone knows of something a little easier.
posted by Orange Pamplemousse at 1:28 PM on November 23, 2011


So your problem is that the alphabet you are using is about 40 characters and doesn't fit the regular amino acid format that is fairly standard? I'm not sure if there's an easy way to make the current tools work for that. Maybe making the difference between upper and lower case meaningful would help, but that would require altering the source code.
posted by demiurge at 3:04 PM on November 23, 2011
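
One way around the alphabet size, sketched here as an assumption rather than something any existing tool is known to accept: give every distinct residue its own single character, so that generic string code can treat a peptide as an ordinary string. The sketch assumes peptides are available as lists of residue names; the names in the usage note and all three function names are hypothetical.

```python
def build_alphabet(peptides):
    """Map each distinct residue token to one private-use Unicode character."""
    tokens = sorted({token for peptide in peptides for token in peptide})
    # U+E000 onward is the private-use area: 6400 characters with no
    # assigned meaning, so they cannot collide with real residue letters.
    return {token: chr(0xE000 + i) for i, token in enumerate(tokens)}


def encode(peptide, alphabet):
    """Turn a list of residue tokens into a plain string."""
    return "".join(alphabet[token] for token in peptide)


def decode(text, alphabet):
    """Turn an encoded string back into a list of residue tokens."""
    reverse = {char: token for token, char in alphabet.items()}
    return [reverse[char] for char in text]
```

For example, `build_alphabet([["Thr", "Orn", "Ser"], ["Aib", "Ser"]])` yields four characters, and `decode(encode(p, alphabet), alphabet)` returns the original list `p`. This only helps with code that compares characters for equality; tools that validate input against the 20-letter code would still reject it.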


On the other hand, writing a basic tool that did this sort of thing would not be that difficult, if you had someone with programming experience.
posted by demiurge at 3:16 PM on November 23, 2011


Is a Smith-Waterman implementation in Python helpful? There are several implementations that look reasonably simple to extend from a programming point of view. Apparently, BLAST uses a related heuristic, so maybe this already occurred to you.
posted by Monsieur Caution at 6:41 PM on November 23, 2011
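
For reference, a bare-bones Smith-Waterman local alignment can be sketched in a few dozen lines of Python. This is not any particular published implementation. The scores (match 2, mismatch -1, linear gap -1) are placeholder assumptions, and there is no substitution matrix, so "polar vs. polar" style similarity would need a custom scoring function in place of the equality test.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b).

    a and b may be strings or lists of residue tokens. Gaps are
    shown as "-" in the returned lists.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, top[::-1], bottom[::-1]
```

With these assumed scores, aligning TRSWELPM against SDSNLPM returns a score of 6 and the shared "LPM" segment on both sides.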


Doing a quick search in Ubuntu yields a wide variety of options. A trimmed list of packages that might be relevant:
clustalw - global multiple nucleotide or peptide sequence alignment
gentle - suite to plan genetic cloning
gmap - spliced and SNP-tolerant alignment for mRNA and short reads
python-cogent - framework for genomic biology
seaview - Multiplatform interface for sequence alignment and phylogeny
abacas - Algorithm Based Automatic Contiguation of Assembled Sequences
acedb-other-belvu - multiple sequence alignment editor
amap-align - Protein multiple alignment by sequence annealing
blast2 - Basic Local Alignment Search Tool
bwa - Burrows-Wheeler Aligner
dialign - Segment-based multiple sequence alignment
embassy-domalign - Extra EMBOSS commands for protein domain alignment
exonerate - generic tool for pairwise sequence comparison
glam2 - gapped protein motifs from unaligned sequences
hmmer - profile hidden Markov models for protein sequence analysis
infernal - inference of RNA secondary structural alignments
kalign - Global and progressive multiple sequence alignment
libjebl2-java - Java Evolutionary Biology Library
libsam-java - Java library to manipulate SAM and BAM files
mafft - Multiple alignment program for amino acid or nucleotide sequences
mummer - Efficient sequence alignment of full genomes
muscle - Multiple alignment program of protein sequences
mustang - multiple structural alignment of proteins
ncbi-blast+ - next generation suite of BLAST sequence search tools
ncbi-epcr - Tool to test a DNA sequence for the presence of sequence tagged sites
phyml - Phylogenetic estimation using Maximum Likelihood
picard-tools - Command line tools to manipulate SAM and BAM files
poa - Partial Order Alignment for multiple sequence alignment
probalign - multiple sequence alignment using partition function posterior probabilities
probcons - PROBabilistic CONSistency-based multiple sequence alignment
proda - multiple alignment of protein sequences
samtools - processing sequence alignments in SAM and BAM formats
sigma-align - Simple greedy multiple alignment of non-coding DNA sequences
sim4 - tool for aligning cDNA and genomic DNA
squizz - Sequence/alignment converter
t-coffee - Multiple Sequence Alignment
theseus - superimpose macromolecules using maximum likelihood
tree-puzzle - Reconstruction of phylogenetic trees by maximum likelihood
wise - comparison of biopolymers, commonly DNA and protein sequences
The bolded ones I think might be worth looking at first. In particular exonerate's claim to be generic sounds promising.
posted by pwnguin at 9:15 AM on November 24, 2011

