Algorithm for file comparison
February 2, 2006 8:32 AM

I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.

What I have are two text files each containing a list of files. They are snapshots of all the files on my system, one old and one new. I want to figure out which files have been added or deleted between the two snapshots.

I could use diff to compare these files, but I don't want to because:

(a) I want to have the code inline and don't want to invoke an external tool. My project is not GPL and so I can't just take the diff source code.
(b) diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.
(c) Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and space.

Here are the constraints on the problem:

(a) The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same order.
(b) Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
(c) I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.

I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists.

I'm sure this must be a solved problem, but I'm not sure what to search for as far as algorithm names or the name of the problem. It would be most helpful if you could point me to an algorithm for this, or even just the academic name for the problem.

Thanks!
posted by Khalad to Computers & Internet (25 answers total)
 
Keep two position counters, one for each list. Call them a and b, in lists A and B.

Compare A[a] to B[b]. If A[a] is greater, B[b] is a new file; increment b. If B[b] is greater, A[a] is a deleted file; increment a. If they match, they match; increment both. Repeat until you run off the end of either list; whatever remains in the other list is all additions or all deletions.

(This might be reversed and/or not work at all.)

This should work at O(n), I think.
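A minimal sketch of that walk in Python, assuming both listings really are sorted in ascending order (the function and variable names are just placeholders):

    def diff_sorted(old, new):
        # Lockstep walk over two ascending-sorted listings; roughly O(m + n).
        added, deleted = [], []
        a = b = 0
        while a < len(old) and b < len(new):
            if old[a] == new[b]:
                a += 1
                b += 1
            elif old[a] < new[b]:
                deleted.append(old[a])   # present in old only
                a += 1
            else:
                added.append(new[b])     # present in new only
                b += 1
        deleted.extend(old[a:])          # leftover old entries have no counterpart
        added.extend(new[b:])            # leftover new entries are all additions
        return added, deleted

The whole thing hinges on being able to say one filename is "greater" than another, which is where the disclaimer above comes in.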
posted by smackfu at 8:43 AM on February 2, 2006


Response by poster: The hitch is the files aren't necessarily sorted alphabetically, so I don't know if one file is "greater" than another. I just know that the files that are present in both lists will be in the same order.

I guess it's not quite the same as two sorted arrays. Hm.
posted by Khalad at 8:47 AM on February 2, 2006


Yeah, then it wouldn't work.

This problem is called Longest Common Subsequence, BTW.
posted by smackfu at 8:53 AM on February 2, 2006


Ok, so we have list A, and list B. And for two files, x and y, if both x and y are in A and B, and x is after y in A, then x must be after y in B? Correct?

If so I can give you O(2(m+n)):

Using a variation on smackfu's method, stepping down the lists, you can generate a list of all the files in A that are not in B; these are the files that have been deleted. Then do the same thing to get the list of all the files in B that are not in A; this is the list of files that have been added. No?
posted by Capn at 9:05 AM on February 2, 2006


Response by poster: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints (the first being what Capn said, the second being that I don't need the minimal list of changes, which is what LCS algorithms are designed to find).

Question about your method, Capn: if I'm running down list A, and I hit a file that isn't in the corresponding position in list B, how do I determine if it's in list B or not? I could scan down list B to see if it's there, but that takes O(n) time and leads to O(mn) overall running time.

There must be a way to adapt your idea, smackfu. The lists aren't sorted, but they are sorted with respect to each other, know what I mean?
posted by Khalad at 9:26 AM on February 2, 2006


One idea: When you have a mismatch between A(a) and B(b), compare A(a) to B(b+1), A(a+1) to B(b), A(a+1) to B(b+1) ... A(n) to B(m) until you find a match. By going through both lists at once, you avoid the issue of a deletion from A requiring a full iteration over B.

This will add something like r^2 comparisons, where r is the number of consecutive adds or deletes.
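A rough Python sketch of that expanding search, as I read the comparison order above (it assumes filenames are unique within a listing and that common entries appear in the same order in both, per the constraints in the question):

    def diff_same_order(old, new):
        # Lockstep walk; on a mismatch, widen a search "ring" one step at a
        # time until some later old entry equals some later new entry.
        added, deleted = [], []
        a = b = 0
        while a < len(old) and b < len(new):
            if old[a] == new[b]:
                a += 1
                b += 1
                continue
            match = None
            r = 1
            while match is None and r < max(len(old) - a, len(new) - b):
                for i in range(r + 1):
                    for j in range(r + 1):
                        if max(i, j) != r:
                            continue   # only look at cells on the new ring
                        if (a + i < len(old) and b + j < len(new)
                                and old[a + i] == new[b + j]):
                            match = (i, j)
                            break
                    if match:
                        break
                r += 1
            if match is None:
                break                     # no common entry remains; flush the tails
            i, j = match
            deleted.extend(old[a:a + i])  # old entries skipped over were removed
            added.extend(new[b:b + j])    # new entries skipped over were inserted
            a, b = a + i, b + j           # resynchronize on the matching pair
        deleted.extend(old[a:])
        added.extend(new[b:])
        return added, deleted

Since changes are rare (constraint b), those r^2 extra comparisons keep the total close to linear in practice.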
posted by smackfu at 9:59 AM on February 2, 2006


Why not sort them alphabetically and then compare?
posted by null terminated at 10:02 AM on February 2, 2006


One problem with the "iterate through the lists twice" approach is that you're not fully using all the information available, which makes it less than fully efficient. By that I mean: if list A has an entry which isn't in list B, that entry has to be compared with all entries in B past the last A/B synchronization point, and it has to do that for every non-synching entry. Then you turn around and do the same thing all over again for list B against A, duplicating many of the comparisons.

Here's my idea, which seems to use all available information (and I'm sure it will get shot down in flames here if I'm wrong, as well it should). The approach flips between the two lists trying to find a common synchronization point, using one anchor index per list (n and p) and one "scouting" index per list (ns and ps).

Step 1: Compare entries at the anchors. The last synchronization is at the A[n] and B[p] entries. Increment n and p until A[n] != B[p] or the end of either A or B. If you hit an end, process the A or B leftovers appropriately. If there is no match at A[n+1] and B[p+1], set ns and ps equal to 1. Go to step 2.

Step 2: Compare A[n+1] against B[p+ps]. If a match, then you're good. The addition/deletion entries are B[p+1] through B[p+ps-1]. Resync at step 1 with n=n+1 and p=p+ps. Otherwise step 3.

Step 3: No match, so increment the ps scouting index. If p+ps is past the end of B, you know that the A[n+1] entry is not in B. Process the entry, increment n, and go to step 1. Otherwise, step 4.

Step 4: Compare B[p+1] to A[n+ns]. If a match, then A[n+1] through A[n+ns-1] are the deletion/addition entries (the inverse of the step 2 case). Resync at step 1, with n=n+ns and p=p+1. Otherwise step 5.

Step 5: No match, so increment the ns scouting index. If n+ns is past the end of A, you know that the B[p+1] entry is not in A. Process the entry, increment p, and go to step 1. Otherwise go to step 2.

With this approach you simultaneously spider down both lists a little at a time, trying to find a common match point.

There are optimizations you can make based on additional knowledge. If you know that there are more files in 'A', for example, instead of one-to-one alternating comparisons on each list, you can run the B comparisons (B[p+1] against A[n+ns]) more often in hopes of getting past a bunch of nonmatching A entries in one group. To do this, you simply increment ns more often than ps and perform more compares of B[p+1] to A[n+ns] at one time than the corresponding A[n+1] to B[p+ps] checks. In other words, you flip faster away from A compares and stick longer with B compares.

Well, it works in my head anyway. I should try it in code.
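Something like this, maybe; a quick, untested sketch of how I'd code those five steps in Python, leaning on the same assumptions as the question (common entries appear in the same relative order, full paths are unique within a listing):

    def diff_scouting(old, new):
        # After a mismatch, alternately scout deeper into each list looking
        # for the other list's stuck entry, then resync at the first hit.
        added, deleted = [], []
        n = p = 0
        while n < len(old) and p < len(new):
            if old[n] == new[p]:
                n += 1
                p += 1
                continue
            ns = ps = 1
            while True:
                if p + ps < len(new) and old[n] == new[p + ps]:
                    added.extend(new[p:p + ps])    # skipped new entries were added
                    p += ps
                    break
                if p + ps >= len(new):
                    deleted.append(old[n])         # old[n] never reappears in new
                    n += 1
                    break
                if n + ns < len(old) and new[p] == old[n + ns]:
                    deleted.extend(old[n:n + ns])  # skipped old entries were deleted
                    n += ns
                    break
                if n + ns >= len(old):
                    added.append(new[p])           # new[p] never reappears in old
                    p += 1
                    break
                ps += 1
                ns += 1
        deleted.extend(old[n:])
        added.extend(new[p:])
        return added, deleted

One wrinkle: a paired delete-and-add at the same position sends the scouts all the way to the end of one list before either gives up, because the probes only run along the two stuck entries rather than diagonally.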
posted by mdevore at 10:09 AM on February 2, 2006


> I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists.

Yes, it is. So why not just sort the lists and look for differences?
posted by Mars Saxman at 10:11 AM on February 2, 2006


Rereading: I didn't mean to sound quite so unhelpful. You don't specify which language you are using, or what your level of programming experience is, but you obviously know enough to understand big-O notation, so this shouldn't be out of reach.

Analysis problems like this are usually about getting the data into the right form. Once you have transformed the data appropriately, the solution pops out. Here, the right form is a pair of sorted lists; then the difference algorithm is trivial. Furthermore, sorting is a well-understood problem, and even if your language doesn't include a generic sorting system, there are a great many examples available.

So, break the problem in half. Sort your lists first. Now look for differences like so:

Iterate through both lists. While the values are equal, continue; nothing has changed. When the values are not equal, compare them. If the element in the new-list is greater than the element in the old-list, then the element in the old-list was deleted. Increment only the old-list counter, record a deletion, and continue. If the element in the new-list is less than the element in the old-list, then the element in the new-list was inserted. Increment only the new-list counter, record an insertion, and continue.
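In Python, the whole recipe is only a few lines (a sketch; names are placeholders):

    def diff_by_sorting(old, new):
        # Sort both listings, then walk them in lockstep as described above.
        old_s, new_s = sorted(old), sorted(new)
        added, deleted = [], []
        a = b = 0
        while a < len(old_s) and b < len(new_s):
            if old_s[a] == new_s[b]:
                a += 1
                b += 1
            elif new_s[b] > old_s[a]:
                deleted.append(old_s[a])   # old entry has no counterpart: deleted
                a += 1
            else:
                added.append(new_s[b])     # new entry has no counterpart: inserted
                b += 1
        deleted.extend(old_s[a:])
        added.extend(new_s[b:])
        return added, deleted

The sorts cost O(m log m + n log n) and dominate the running time, but the code is hard to get wrong.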
posted by Mars Saxman at 10:18 AM on February 2, 2006


Find regions that stay constant between the two files and look at their offsets. If a region has shifted more than the last one, go backward.
posted by devilsbrigade at 10:26 AM on February 2, 2006


I don't mean to be disputatious, but I believe that sorting is likely not the optimal approach. With sorting you are forcing a state on both lists that isn't required by the circumstances, thereby spending cycles to overprocess the information. Because the lists are known to be in the same order for matching entries, you can use alternate in-place means of locating differences that are faster (speaking in general terms, possibly not true for a specific example) than a sort+compare technique. If the pre-existing ordering condition were not present, then sorting is the obvious winner.
posted by mdevore at 10:34 AM on February 2, 2006


You are undoubtedly right, mdevore, that my approach is sub-optimal in terms of run time: but I'll bet it's optimal in terms of programmer time. Less code means less debugging means on to the next project that much sooner...
posted by Mars Saxman at 10:43 AM on February 2, 2006


I don't know the answer, but a picture may help our thinking. Make a grid, with the A-sequence along the upper edge and the B-sequence down the left edge. Start at A[0] and B[0]: if they match, fill in the square, and step right/down one. If A[1] matches B[1], fill in that square. As long as A and B match, you are just filling in the diagonal.

If there is a mismatch, you first try to fill in the 2x2 square: go right and check for a match, go down and check, go left and check. If there are no matches in those four squares, you step down one and fill out the 3x3 square, then the 4x4 square, and so on. If you find a match, you attempt to start on a new diagonal.

A missing file in sequence A means the filled diagonal shifts down. A missing file in sequence B means the diagonal shifts right. If a file in A and a file in B are each missing from the other sequence, the diagonal skips one box and continues on the same trajectory.

I can draw a .png picture, any recommendations of where to post it? (never used flickr, etc)

(On preview, I think this is the same thing as mdevore's).
posted by mediaddict at 10:48 AM on February 2, 2006


>Less code means less debugging means on to the next project that much sooner

I certainly can agree with that. People tend to obsess over program optimizations that could never possibly make a difference in the real world at the expense of maintainability. The difference between a heavily optimized program that runs in a tenth of a second and an unoptimized one which takes 10 seconds is two orders of magnitude, yet if the program is run once a year, what's the benefit?

Not to excuse the bloated carcasses of many programs out there, of course...
posted by mdevore at 10:50 AM on February 2, 2006


"I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints (the first being what Capn said, the second being that I don't need the minimal list of changes, which is what LCS algorithms are designed to find)."

Your intuition is right: there is a more efficient algorithm than O(mn) to solve this, but you're wrong in thinking that LCS doesn't apply. This problem is exactly what LCS is about; it is very similar to comparing two strands of DNA. LCS does assume that the lists are in the same order, like Capn mentioned.

But unless you're doing this for some sort of learning experience, I think you're crazy to try and develop a customized LCS algo to solve it. I mean, how long are the lists? It can't take more than a few seconds of computing time to sort each one and then find the differences. Who cares if that takes O(m log m + n log n) or whatever; you're going to waste more time optimizing than the computers would waste running it unoptimized thousands of times.
posted by gus at 11:09 AM on February 2, 2006


From an effort standpoint, the easiest thing to do is to sort each file, then run the unix utility comm. I realize it doesn't meet your first constraint, but it only takes a minute for you to do it.
posted by jefftang at 11:26 AM on February 2, 2006


Really simple version:
Make an iterable hashtable of strings.

Go through list A: for each file in A, insert its filename into the hashtable.

Go through list B: for each file in B, if it is in the hashtable, remove it; if it is not in the hashtable, it wasn't in list A.

When done with list B, iterate through the hashtable. Any names still in it are not in list B.
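A sketch of the same idea using a Python set in place of the hashtable:

    def diff_with_set(old, new):
        remaining = set(old)              # every filename seen in list A
        added = []
        for name in new:                  # walk list B
            if name in remaining:
                remaining.remove(name)    # present in both lists: not a change
            else:
                added.append(name)        # in B but never in A: added
        deleted = list(remaining)         # never seen in B: deleted
        return added, deleted

Expected O(m + n) time with hash lookups, and it doesn't even need the same-order guarantee; the trade-offs are the memory for the set and the fact that the deleted names come back in no particular order.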
posted by aspo at 12:03 PM on February 2, 2006


My first observation in response to the OP is that, probably, the most efficient way to find such differences is not by list analysis, but by the way most file systems do the job, which is usually by walking b-trees... This is also the reason why many programs that produce lists of file system contents by walking b-trees produce output that is grouped by directory ;-). Presumably, if you are picking up the problem from the point of having already produced your lists, you are not considering or incurring any overhead for producing this sort, so we'll take that as a starting assumption for this exercise, and say that lists A and B are in alphanumeric sort order by b-tree node from top to bottom, as a condition of being produced.

So you've got your lists, thus sorted.

In most filesystems, a "filename" is actually the concatenation of the "path" (directory structure) and the filename. Thus, a DOS name of C:\Windows\win.ini is a "different" file than either C:\Windows\copy-win.ini or C:\Win32\win.ini, even though the actual contents of the files may be the same and, in one case, the bare file name is identical. So, I think you've got to define what, for your purposes, an add or delete really means.

Let's presume you subscribe to the common view that path + filename is the "real" filename, for all the reasons filesystem and application designers tend to enforce this convention. For you, a file with a different path is a different file, which conveniently (or inconveniently, as the case may be) will produce moved files as adds AND deletions in your change list. Then, we can observe that what you are really faced with in comparing your lists is a bunch of string manipulations and comparisons of the filenames, over which you can reasonably operate with some special rules derived from the naming conventions of the filesystem you work with:

1) Directories have some unique character representation, usually / or \

2) Files "live" in directories

3) If a directory does not exist, no files can exist within it

4) A directory may exist, but (depending on the filesystem) may contain no files, or only default files.

So, an opportunity exists for you to better your original efficiency estimate for diff anytime you can apply your knowledge of the above 4 rules to the string manipulation and comparison you are doing. If a directory is missing from either list, you can just print all files in that directory as either adds or drops (depending on whether they are in list A or B), without continuing to compare list entries on the two lists until the missing directory is fully enumerated. You only need to start comparing the lists again when a directory in either list is exhausted. This is a little better than what diff does for directories with only a few files, but much better for directories with lots of entries.

You can also use your special knowledge of the directory delimiter character to stem your search from left to right in comparing entries from each list initially. Thus, if you find in list A an entry C:\ you only need to process the first 3 characters in each entry of list B to see if B has any possible matches with A. You can toss out any item in B that has a second \ left to right with a simple mask. Once you've enumerated all differences between the root directories, you continue the process, matching lists only until you find the next match with 2 directory characters, i.e. C:\xx\ for the characters between the directory characters. You beat diff again each time you find subdirectories that are not matched within your new directory and don't have to do the comparisons.

A final optimization you can apply that may be of particular value to filesystems that are particularly deep is that within matching directories, you can do your string comparisons for files within the directory from right to left, terminating on the directory character. That way, you don't have to process the whole path string on each comparison, and this can be way more efficient than diff, since filenames are bound to be shorter than full path names.
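A sketch of just the first of those rules in Python: group entries by directory, so a directory that is missing outright gets reported in one shot instead of entry by entry. The grouping and the hand-rolled dirname are my own guesses at how you'd cash this out for listings that may use either separator:

    def dirname(path):
        # Split on the last '/' or '\' by hand, so DOS-style listings still
        # group correctly even when this runs on a non-Windows server.
        cut = max(path.rfind('/'), path.rfind('\\'))
        return path[:cut] if cut >= 0 else ''

    def group_by_dir(listing):
        # Bucket full paths by their directory, keeping encounter order.
        groups = {}
        for path in listing:
            groups.setdefault(dirname(path), []).append(path)
        return groups

    def diff_by_directory(old, new):
        old_dirs, new_dirs = group_by_dir(old), group_by_dir(new)
        added, deleted = [], []
        for d, files in old_dirs.items():
            if d not in new_dirs:
                deleted.extend(files)      # whole directory gone: no per-file compares
            else:
                kept = set(new_dirs[d])
                deleted.extend(f for f in files if f not in kept)
        for d, files in new_dirs.items():
            if d not in old_dirs:
                added.extend(files)        # whole directory is new
            else:
                kept = set(old_dirs[d])
                added.extend(f for f in files if f not in kept)
        return added, deleted

Within directories that exist on both sides this just falls back to a membership test; the left-to-right and right-to-left string tricks above would slot in there.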
posted by paulsc at 12:39 PM on February 2, 2006


Paulsc has the answer. You shouldn't be comparing the text output by a directory listing; you should be comparing the trees. If you have to, then build trees from the directory listing text, and then the comparison is simple to define recursively. I still don't get the question.

A) There are more diffs under heaven than the GNU diff. You could easily snag the code in BSD diff.

B) I would be surprised if you had a large enough filesystem that efficiency is going to matter all that much. You're going to spend far more time walking the filesystem to get the listing than you're going to spend doing the comparison.
posted by rdr at 1:11 PM on February 2, 2006


Paulsc and rdr, you are making the solution a lot more complicated than it needs to be. All you care about is "what strings are in this file that aren't in that file, and vice versa." The more complicated you make the solution, the more bits of code exist that can have bugs in them. You'd need to parse the filenames, build a tree, recursively iterate over each tree while noticing differences, and so on. That's a lot of pieces. None of them are that complicated, but why have them in the first place? Even if everything is written perfectly, overengineered code is bloated, harder to maintain, harder for someone else to pick up and understand, etc. Elegance is key to good engineering.
posted by aspo at 1:35 PM on February 2, 2006


Response by poster: Thank you all so far, I'm still reading through the answers. A few comments:

While I agree about the peril of premature optimization, I believe the choice of a good algorithm is not premature. Yes, programming time is important, but using a clearly inefficient algorithm is bad all around when a bit of thought will reveal a much better one. Mathematically speaking, lowering the constant factor on an algorithm's runtime might or might not be worthwhile. However, changing a quadratic algorithm to a linear one is absolutely worthwhile.

It is particularly necessary in this case because I am trying to replace an existing inefficient implementation with a better one. I could do all kinds of optimizations to the existing implementation to lower the constant by using better data structures and reducing the number of passes on the data, but I figure the first step is to just try a different algorithm with a mathematically superior runtime and see how that works. If that is still slow, then I can proceed to more aggressive optimizations.

In fact the most obvious approach is to replace what we have with a call to diff, but for the reasons I stated in my original post I am sure I can do better than that. I would prefer a simple algorithm as well versus the complicated methods employed by diff.

The time spent doing this is very important; we process several gigabytes of these file listings at once so spending time to get this right is not a bad idea.

Enough jabbering from me though. Time to read...
posted by Khalad at 1:49 PM on February 2, 2006


Seconding aspo's idea of using hashes.
posted by Sharcho at 1:50 PM on February 2, 2006


Response by poster: Oh, and the file listings are generated on client machines then sent to a server for processing, so the server can't directly access the file systems that are being checked.
posted by Khalad at 1:51 PM on February 2, 2006


Actually, none of the suggested solutions is terribly complicated to implement. Several ideas might be combined for greater efficiency (e.g. list-walking with a pre-hashed code compare rather than a string compare, or b-tree directories setting min-max boundaries on an internal iterative search of the current directory).

Theory and discussion are all very nice, but given the circumstances where this could make a real difference in actual person time, I'd suggest benchmarking several approaches with live data. I don't see anything here which would take even an hour to implement as a simple trial. List-walkers, hashes, b-trees, iterations, sorts, diffs, all of 'em cover well-explored territory. Set up a contest day and cheer on your favorites.
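If anyone wants to run that contest, a throwaway harness is about all it takes; the diff_* names below are placeholders for whichever candidates actually get written:

    import time

    def time_candidate(fn, old, new, repeats=5):
        # Run one candidate a few times on real listings; keep the best wall time.
        best = float('inf')
        for _ in range(repeats):
            start = time.perf_counter()
            fn(old, new)
            best = min(best, time.perf_counter() - start)
        return best

    # for fn in (diff_with_set, diff_by_sorting, diff_same_order):
    #     print(fn.__name__, time_candidate(fn, old_listing, new_listing))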
posted by mdevore at 2:22 PM on February 2, 2006

