Algorithm for file comparison
February 2, 2006 8:32 AM Subscribe
I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.
What I have are two text files each containing a list of files. They are snapshots of all the files on my system, one old and one new. I want to figure out which files have been added or deleted between the two snapshots.
I could use diff to compare these files, but I don't want to because:
(a) I want to have the code inline and don't want to invoke an external tool. My project is not GPL, so I can't just take the diff source code.
(b) diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, which should be a much simpler problem than finding the longest common subsequence or some such thing.
(c) Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and space.
Here are the constraints on the problem:
(a) The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same order.
(b) Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
(c) I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.
I'm thinking this is equivalent to the problem of having two sorted lists and figuring out the differences between them.
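For what it's worth, here is a minimal sketch of one O(m+n) approach in Python. It sidesteps ordering entirely by using set membership, which works under the assumption that each file path appears at most once per snapshot (true for file listings). The function name `snapshot_diff` is just illustrative, not from any library:

```python
def snapshot_diff(old_lines, new_lines):
    """Return (added, deleted) lines between two snapshots.

    Assumes each line (file path) is unique within a snapshot.
    Runs in O(m+n) time and space; preserves the snapshots' order
    in the output rather than sorting.
    """
    old_set = set(old_lines)
    new_set = set(new_lines)
    # A line present in the new snapshot but not the old one was added.
    added = [line for line in new_lines if line not in old_set]
    # A line present in the old snapshot but not the new one was deleted.
    deleted = [line for line in old_lines if line not in new_set]
    return added, deleted

added, deleted = snapshot_diff(
    ["/bin/ls", "/etc/passwd", "/tmp/scratch"],
    ["/bin/ls", "/etc/passwd", "/home/me/notes.txt"],
)
# added == ["/home/me/notes.txt"], deleted == ["/tmp/scratch"]
```

Note this only reports which lines were added or deleted, not where; since the question explicitly doesn't need grouping or positions, the same-order constraint ends up not even being necessary for this variant.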
I'm sure this must be a solved problem, but I'm not sure what to search for as far as algorithm names or the name of the problem. It would be most helpful if you could point me to an algorithm for this, or even just the academic name for the problem.