How can I find out if someone's already solved this (fairly specific) programming problem?
October 8, 2006 3:29 PM   Subscribe

I have many short strings of varying lengths — between a dozen and a few hundred, most likely — and one long string. I need to write code that will find a way to assemble a copy of the long string using as few of the short strings as possible. If it's not possible to assemble a complete copy of the long string, it should assemble a copy with as few and as small gaps as possible.

Has someone already solved this problem? I'd hate to reinvent the wheel, and this seems like the sort of thing someone is likely to have thought about already.

Bonus question: How could I go about researching this sort of question on my own? Where do you look if you have a task and need to know if an algorithm exists to perform it?
posted by nebulawindphone to Computers & Internet (26 answers total) 1 user marked this as a favorite
 
About 25 years ago I had to solve a moderately related problem: fitting object modules into discontinuous ROMs so as to minimize wasted space. And I solved it just fine with a BASIC program that ran overnight on a PDP-11/70. It wasn't very fancy; it was just an exhaustive tree search.

I don't believe that there's any elegant solution to your problem. I think you have to rely on brute force, like I did. These kinds of optimization problems almost always require brute force; it's very rare for there to be an elegant solution.

But if a straight exhaustive tree search is too long, you might want to investigate genetic algorithms.
posted by Steven C. Den Beste at 3:52 PM on October 8, 2006


This is extremely similar to many of the problems faced in bioinformatics, specifically those related to sequence assembly. When assembling a genome, we get many short strings (700 basepair long strings) and try to match it to regions in a genome (a 3 billion bp long string). There are many fast and efficient algorithms for doing so, most based on the Smith-Waterman algorithm and Dynamic programming.

Looking at the way that packages such as BLAST work. may help. You may be able to utilize some of the modules in BioPerl too. You might also look for Bioinformatics course websites, which may have slides that walk step by step through the algorithmic details.

Hopefully, this is enough information to get you started. Good Luck!
posted by chrisamiller at 4:03 PM on October 8, 2006


For a sample size this small (in programming terms, a few hundred equals a dozen, really), you're talking about nanoseconds of difference between brute force and a fancy schmancy algorithm. So all that work on the fancy algorithm? Wasted effort. Just brute force it.
posted by antifuse at 4:05 PM on October 8, 2006


Sequence assembly algorithms are not a good solution for this if the matches have to be exact, as they usually expect some error.
posted by grouse at 4:09 PM on October 8, 2006


"How could I go about researching this sort of question on my own?"

Sounds similar to how DNA fragment are re-assembled, so you might look there.

This is homework, isn't it? This is your homework and I've just spent half an hour brainstorming and coming up with algorithms. Bad nebulawindphone.

Read up on Boyer-Moore and recursive divide and conquer.
posted by orthogonality at 4:10 PM on October 8, 2006


One good reference book for "What algorithms are out there? Which ones might solve my problem?" is The Algorithm Design Manual.
posted by xil at 4:34 PM on October 8, 2006


This is so homeworkfilter.
posted by Civil_Disobedient at 4:36 PM on October 8, 2006


You might look into Greedy String Tiling. I believe that if you rework it so that each of the smaller strings is treated as one token, you will get good results...

I used this algorithm for my undergraduate thesis which was how to show similarity between compouter programs, and it worked out well.
posted by mge at 5:16 PM on October 8, 2006


A good place to start.
posted by flabdablet at 5:28 PM on October 8, 2006


Awesome. A few helpful leads and a good reason not to follow them.

I think antifuse is probably right, and anyway speed isn't a huge issue here, but if I do need something clever I'll know where to start.
posted by nebulawindphone at 5:34 PM on October 8, 2006


Are the short strings that build the copy of the long string allowed to overlap?
posted by flabdablet at 5:35 PM on October 8, 2006


Quick question :

for longstring = "aaabbccaaa"
and short strings = "aa", "bb", "cc"

Does your required algorithm copy all this string? Or is it left with "aa". Another way to ask the question, when rebuilding the copy can you insert into the long string or are you appending?
posted by ill3 at 5:35 PM on October 8, 2006


flabdablet — no, they aren't.
posted by nebulawindphone at 5:35 PM on October 8, 2006


ill3 — I had in mind something that would give "aa?bbccaa?" (i.e. a reconstruction with gaps in it, because the long string can't be fully assembled out of copies of the short strings).
posted by nebulawindphone at 5:38 PM on October 8, 2006


(Oh, I think I may have misread your question. No, it's appending, not inserting.)
posted by nebulawindphone at 5:41 PM on October 8, 2006


Well this isn't elegant, or fast probably, but it was certainly easy to code up in python. I can email you the code, cause it doesn't paste to well in here.

This :

small = ["a", "bb", "ccc","dddd"]
orig_large = "bbabababcccaaabbbbcccddddbbff"
large = orig_large[:] # copy the string
nsmall = small[:] # copy the list
total = 0
print small
print orig_large
print "Word\tTotal Char Count"
for i in range(len(small)):
 biggest_c = biggest_j = 0
 for j in range(len(nsmall)):
   c = large.count(nsmall[j])* len(nsmall[j])
   if c > biggest_c:
    biggest_c = c
    biggest_j = j
 print nsmall[biggest_j],"\t",biggest_c
 total += biggest_c
 large =large.replace(nsmall[biggest_j],'?' * len(nsmall[biggest_j]))
 del(nsmall[biggest_j])
print total, "of", len(orig_large), "chars of the long string were constructed"
print "This is the remainder of the long string :", large


Produces this :

$ python test.py
['a', 'bb', 'ccc', 'dddd']
bbabababcccaaabbbbcccddddbbff
Word Total Char Count
bb 8
a 6
ccc 6
dddd 4
24 of 29 chars of the long string were constructed
This is the remainder of the long string : ???b?b?b???????????????????ff
posted by ill3 at 5:46 PM on October 8, 2006


In that case, I'd probably go for something involving finding the position of one of the short strings in the long string, then breaking the long string into before-match, match and after-match regions, and applying the same algorithm recursively to the before-match and after-match regions. Each time the recursion bottomed out, I'd assign the resulting solution a score based on the total length of unmatched regions. I'd fool around with this design until I was convinced it covered all possibilities, then I'd implement it, profile it and optimize as required.
posted by flabdablet at 5:58 PM on October 8, 2006


ill3: given ['a', 'bb', 'ccc', 'dddd'] and
bbabababcccaaabbbbcccddddbbff, a correct algorithm should be able to find at least bba?a?a?cccaaabbbbcccddddbb??, shouldn't it?

Cases where the short strings include bits of each other need consideration as well. For example, if the long string is "A hag has had half a flask" and the short strings are ["hag has had hal", "lf a flask", "flask", " ha"], then "? ha? ha? ha? half a flask" is a more complete solution than "??hag has had hal????flask".
posted by flabdablet at 6:16 PM on October 8, 2006


I realize my example gives you the opposite of what you wanted. Append this to the end of my other program :

out = ""
for i in range(len(large)):
 if large[i] == "?" : out += orig_large[i]
 else: out+= "?"
print out

and it produces this :

bba?a?a?cccaaabbbbcccddddbb??

Which is what I think you want.
posted by ill3 at 6:19 PM on October 8, 2006


flabdablet,


You're other example is a good one, as it that trips my algorithm up exactly as you say...

Great...now the OCD is kicking in...
posted by ill3 at 6:26 PM on October 8, 2006


Whoops! I misread ill3's test output, and didn't realize that the thing with ? in it at the end was actually showing leftover characters and not the solution.

However, running my suggested test strings through test.py gives

hag has had hal 15
flask 5
lf a flask 0
ha 0
20 of 26 chars of the long string were constructed
This is the remainder of the long string : A ???????????????f a ?????

showing that it's missed the opportunity to use " ha" and "flask" for the more complete solution.

On preview: I'm already late for work, so I'll leave you holding the OCD baby :-)
posted by flabdablet at 6:28 PM on October 8, 2006


flabdablet writes "In that case, I'd probably go for something involving finding the position of one of the short strings in the long string, then breaking the long string into before-match, match and after-match regions, and applying the same algorithm recursively to the before-match and after-match regions."


flabdablet, this was my initial solution too, which I didn't post because I figured this was homework. However, it isn't guaranteed to find a solution.

I quote what I didn't post
Requirement is to use the fewest fragments as possible. To do this, you want to use the largest fragments you can, first.

So order the fragments by length.

Ignore any fragments longer that the string to be matched to (from her eon we'll call that "the target").
Then do a Boyer-Moore search for the largest remaining fragment in the string to be replicated.
If you don't get a match, go to the next largest fragment.
If you do make a match, that divides the target into three parts, the part prior to the match ("the prefix") the matched part, and the part after the match ("the suffix").
Now recursively do the same thing for any non-zero-length prefix and suffix.

Note that this is NOT gauntleted to use the fewest possible matches, or even to produce a match when it's possible. Consider the following targets and fragments:

Target: ABCDEF
Fragments, in order of length
BCDEF, length 5
ABC, length 3,
DEF, length 3
Result: match fails, even though second and third fragment would succeed


Target: ABCDEF
Fragments, in order of length
BCDE, length 4
ABC, length 3
DEF, length 3
A, length 1
F, length 1
Result: match succeeds, but not with fewest possible fragments, finding fragments 4, 1, and 5 instead of 2 and 3.
I think what's necessary is to bottom out first, then travel back up looking for longer matches that cover would cover all the nodes of the sub-tree rooted at the current position.

To cover everything, we at least need to allow partial matches, matches that don't match a node's entire sub-tree, eg:
Target: ABCDEF
Fragments, in order of length
AB, length 2
CD, length 2
EF, length 2

Tree: ABCDEF -> ABC, DEF, -> A, BC, DE, F -> A, B, C, D, E, F
After bottoming out, we return up to level two (ABC, DEF), and can make two partial matches (AB, EF).
Then we return to the top level node, joining the two sub-trees to get the original target, and have to "fill-in" "CD" in the middle.
posted by orthogonality at 6:42 PM on October 8, 2006


It's the weekend, flabdablet, you have custody of the OCD baby.
posted by ill3 at 6:47 PM on October 8, 2006


It's Monday where I am, I still haven't left, and I demand you take this baby off my hands!

Ortho, using the largest fragments possible was my initial idea as well, but if any of the "few as possible" short strings can in fact be used multiple times each, it doesn't help.

That's why I thought just assigning a score at the bottom of the recursion, and selecting by high score at the end, would be the Right Thing.

Also, Boyer-Moore might be overkill. The matching could all be done once up front; then, during the recursion, we could just keep track of the adjusted match positions as the long string chops and changes.

If we are actually going to hang around in here reinventing wheels, I feel bound to point out that triangular wheels are better than square ones, because they eliminate one bump :-)
posted by flabdablet at 7:06 PM on October 8, 2006


Recursion might not be the best way, either. My gut says that sorting the list of short strings by match position would lead to a workable O(n2) or better iterative search solution.
posted by flabdablet at 7:10 PM on October 8, 2006


flabdablet writes "That's why I thought just assigning a score at the bottom of the recursion, and selecting by high score at the end, would be the Right Thing."


I don't think a score is necessary: you'd just have the recursive process return a tree (or even an ordered list) or fragments used. At the end of the recursion, you'd in-order traverse the tree, or iterate the list, to get the fragments used in the order they are used.
posted by orthogonality at 7:12 PM on October 8, 2006


« Older Looking for advice on selling goods at a popular...   |   How is the MTBing in Breckenridge, CO right now... Newer »
This thread is closed to new comments.