# You have to take off your shoes before you can take off your stockings.

January 9, 2013 7:47 PM Subscribe

I'm slowly working through the problems on Project Rosalind, a bioinformatics primer. Would you suggest working out each solution from first principles, or do you think it makes more sense to look up existing algorithms and implement those? WWaBSD?

One great feature of Rosalind problems is that your program has to find the solution in under 5 minutes. I am writing Matlab code, because Matlab is what I know. I realize that Matlab has shortcomings, speed among them. I can imagine going back and rewriting most of this code in C or using it as a way to learn Python/NumPy.

If I approach a problem "cold" and without foreknowledge, I usually write the "naïve" algorithm. In the case of problems where the naïve approach takes O(n²) or even O(n³) operations, this usually forces me to look for places to cut the fat and write more efficient code. So that's good.

On the other hand, the majority of these problems are very basic: find the longest common substring of a collection of strings, count the number of subtrees given a list of edges, that sort of thing. Very efficient solutions to these have existed for decades. Sometimes I'll look at the Rosalind forums and see that user ABC has used approach XYZ and his solution runs in 5 milliseconds. Sometimes this is prefaced with "I looked on Wikipedia and found XYZ algorithm…"
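To make the contrast concrete, here is roughly what the naïve approach to that longest-common-substring problem looks like. This is a sketch in Python with made-up example strings (not from any particular Rosalind dataset); the worst case is roughly cubic in the input size, which is exactly the kind of thing the 5-minute limit punishes on large inputs:

```python
def longest_common_substring(strings):
    """Naive approach: enumerate substrings of the shortest string,
    longest first, and check membership in all the others.
    Roughly O(n^2) candidates times an O(n*m) membership check each --
    fine for small inputs, far from optimal."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    n = len(shortest)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

print(longest_common_substring(["GATTACA", "TAGACCA", "ATACA"]))
# prints "TA" (ties among equally long answers are broken by scan order)
```

The virtue of starting here is that a correct naïve version doubles as a test oracle when you later implement something faster.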

I don't know what makes the most sense for me to do, as a non-computer science, non-bioinformatics person. Doing the problems from scratch teaches me about complexity, but I feel like I'm missing important algorithm theory this way. I know this is impossible to answer as stated, but do most CS students already know about these basic algorithms and then just go and apply them as needed? How much knowledge of specific algorithms does a typical CS undergrad have? How would you suggest I approach these problems to reap the biggest benefit?

I suspect that if you were a bioinformatics student you would take a course that taught you the algorithms. Many problems in bioinformatics and computer science more generally are intractable without knowing the proper algorithm, and no one expects you to invent those algorithms from scratch. If you actually want to solve real problems, knowing the existing algorithms is critical.

posted by pombe at 8:20 PM on January 9, 2013 [1 favorite]

*Imprimis*, good luck with Rosalind.

You haven't mentioned why you're doing this, but you might want to give another thought to using Matlab, especially if you're thinking of entering bioinformatics and obtaining a position in a lab. Most fundamentally, using Matlab means that your code will always depend on Mathworks, Inc., and that you won't be able to share your code with anyone who doesn't have a Matlab license. In bioinformatics, that "anyone" means "practically everyone". The converse holds as well: you would benefit from a large bioinformatics community, which AFAICT Matlab lacks. Your suggestion about Python sounds good: why rewrite code in Python when you can write it in Python in the first place? :) I don't mean to begin a language war, but I would advise you, as I would advise any student in my lab, to use Python and especially not to use Matlab.

Anyway. In the short run, big O is relative. Sometimes you will consider yourself very lucky to obtain O(n^3) time complexity: many central problems in bioinformatics are NP-complete. At other times, the naive algorithm may be linear but still too slow for practical use. So you use a very clever, very implementation-resistant algorithm that is asymptotically optimal, because you're probably already at the asymptote if you're using these algorithms in anger.

My generic advice would be to implement things naively at first so that you can prove to yourself that you understand the problem and have solved it correctly. Afterwards, if you're still not satisfied, go back and make *qualitative* optimizations: reimplementing the same algorithm in C will just shave a constant factor off your runtime. Lastly, you will almost always understand better, and appreciate more, a given algorithm or theorem once you have attempted to derive it yourself (let alone make do without it).

To actually answer your question, though: most CS undergraduates will be unfamiliar with the *specific* algorithms used in bioinformatics, but will probably have been exposed to some of the motivating principles. A CS grad may not have seen Needleman-Wunsch or Smith-Waterman, but certainly will have seen dynamic programming. Probably not PSSM search with extended suffix arrays, but perhaps suffix arrays or suffix trees, and definitely branch and bound algorithms. Happily, though, once you've grokked branch and bound in the context of PSSM search, you can generalize it immediately and add it to your mental toolbox.
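For instance, the dynamic-programming core of Needleman-Wunsch fits in a few lines once you've seen the recurrence. Here is a minimal score-only sketch in Python (the scoring values and test sequences are textbook defaults for illustration, not bioinformatics-grade parameters):

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the classic dynamic-programming
    recurrence: score[i][j] is the best score for aligning the
    prefixes a[:i] and b[:j]. O(len(a) * len(b)) time and space."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):   # a[:i] aligned entirely against gaps
        score[i][0] = i * gap
    for j in range(1, cols):   # b[:j] aligned entirely against gaps
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    return score[-1][-1]

print(needleman_wunsch_score("GATTACA", "GCATGCU"))  # prints 0
```

Recovering the alignment itself takes a traceback over the same table, but the recurrence above is the part that generalizes: Smith-Waterman is the same table with a floor at zero and the maximum taken over all cells.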

Good luck!

PS: If your background is not in molecular biology, you may want to acquire one, especially if you want to begin posing and answering your own problems. David Mount's *Bioinformatics* gives most of the relevant basics. Watson's *Molecular Biology of the Gene* is a very readable classic (but you don't have to read the whole thing).

posted by lambdaphage at 9:14 PM on January 9, 2013 [11 favorites]

I'm coming at it from the computer programming direction and these are my suggestions:

- Ditch Matlab and switch to Python. Python seems to be the language of choice for many areas of scientific programming. Heck, I prefer Perl to Python, but think that Python's the way to go.

- Most of the algorithms I've seen mentioned in the BioInfo material I've read are fairly specific to BioInfo. I doubt the average CS undergrad graduates knowing all of them.

- Also, if something's a core algorithm of BioInfo I'd expect there's already a debugged and highly efficient library coded. You just need to know how to use that implementation.

- But one of the best skills to have is the confidence that if someone gives you a new algorithm (or you come up with one yourself), you can code it in a given language. Pick some of the complicated and highly efficient algorithms that people mention and implement them yourself. You don't need to do all of them, but the experience of thrashing out every last detail of an actual working implementation is irreplaceable experience.

I think Project Rosalind is a great set of problems to learn a new language against. The problems aren't absolutely trivial (which is my impression of Project Euler), and while you can solve them with elementary methods you can also choose to try advanced methods. For instance, I'm learning Haskell against them, and for the "longest common substring" problem I had an idea that turned out to be slower than the obvious straight-ahead strategy, but I worked it all out just to get the experience of implementing my data structure in Haskell. It's also not a bad idea to learn when you can get away with simple, less-than-maximally-efficient algorithms.

(Take all that with a caveat that I don't work in BioInfo, but that I've just learned a bunch of programming languages.)

posted by benito.strauss at 8:39 PM on January 12, 2013

I'm also working on Project Rosalind right now, in a language with no existing library support (OCaml), after studying CS and biochem in undergrad. A few comments from that perspective:

lambdaphage is absolutely correct that a standard CS curriculum will not cover the specific algorithms but rather the general approaches from which these algorithms can be derived. Nonetheless, even as a CS major who has taken electives in computational biology, I just go to Wikipedia. I'm working on that longest common substring problem you mentioned right now, and I'm doing it by looking up Ukkonen's algorithm for generalized suffix trees, because no way do I want to spend the next year(s) of my life re-deriving and re-proving it.
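For comparison, the two-string version of that problem has a standard dynamic-programming solution that is far easier to lift from a reference than Ukkonen's algorithm, and is often fast enough in practice. A sketch in Python, with illustrative inputs:

```python
def lcs_two(a, b):
    """Longest common substring of two strings via the standard
    dynamic-programming table: dp[i][j] is the length of the longest
    common suffix of a[:i] and b[:j]. O(len(a) * len(b)) time."""
    best_len, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i-1] == b[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], i
            # on a mismatch dp[i][j] stays 0: the common suffix resets
    return a[best_end - best_len:best_end]

print(lcs_two("GATTACA", "TAGACCA"))  # prints "GA"
```

The quadratic table loses badly to a generalized suffix tree asymptotically, but it is the kind of "found it on Wikipedia, implemented it in an afternoon" algorithm this thread is about.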

A lot of the benefit of a CS major is the learned ability to read an algorithm from Wikipedia or a paper or something and then implement it. Programming interviews tend to test the ability to improvise a novel algorithm on the fly, but that's not because they care how well you can rediscover solutions to solved problems. It's because they're trying to infer how well you can improvise solutions to unsolved problems.

Plus, how do you think CS majors learned these algorithms in the first place? They read them out of their textbooks and/or Wikipedia and/or their lecture notes.

posted by d. z. wang at 7:49 PM on January 14, 2013

This thread is closed to new comments.

posted by michaelh at 7:57 PM on January 9, 2013