How to compare and merge a large number of text files?
February 8, 2013 12:22 PM

How can I automatically compare and merge a large number of text files?

Due to a series of technical snafus with Dropbox and Simplenote syncing, my main writing folder, which contains text files mostly in Markdown format, is all messed up.

I have about 500 unique files, but now each of them has multiple versions. For any given file, the directory contains something like this:

textfile.txt
textfile.md
textfile.org
textfile.0001.txt
textfile.0002.txt

They mostly have identical content - some contain extra line breaks at the beginning, or a line containing the file name.

I didn't realize immediately that this had happened and that I had multiple versions, though, so for some of them I've modified one of the versions and not the others. (The good news is that when I modify files, I don't edit or delete, I just add new text.)

I want to reconcile my folder so that I have one canonical version for each file.

Since there are now thousands of files, and more than two versions of each one, I'd rather not use a manual diff app to reconcile them.

Is there a tool that will find multiple text files containing the same content and automatically merge them? Again, the files contain duplicated content, with some new content, so simply merging the duplicated content and then adding the new content at the end of the file would be satisfactory.

(I'm using OSX 10.8.2 and I write primarily in Aquamacs Emacs. Oh, and I'm going to stop using Simplenote.)
posted by incandescentman to Computers & Internet (41 answers total) 6 users marked this as a favorite
 
If you're on a Windows-based system: I use WinMerge when I need to find the differences in text files and/or code.

There's no easy way to do this, but that's the most efficient tool I've found so far.
posted by Nanukthedog at 12:35 PM on February 8, 2013


Does it matter what order the additional lines are in?

If not... put all the files in a single directory, choose one to be the "master", and the following one-liner will output the lines added in any of the other versions into a new file, which you can then append to the 'master':

diff --from-file=MASTER ./* | grep '^>' | sort -u | sed -e 's/^> //' > ./differences.out

If you don't want to have to manually add the 'new' lines to the master, try:

diff --from-file=MASTER ./* | grep '^>' | sort -u | sed -e 's/^> //' >> ./MASTER
posted by hanov3r at 12:39 PM on February 8, 2013


There are also flags you can add to diff that will tell it to ignore changes in whitespace and capitalization.
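For instance (assuming GNU diff), -i ignores case and -b/-w ignore whitespace changes, so combined with the one-liner above:

diff -i -w --from-file=MASTER ./* | grep '^>' | sort -u | sed -e 's/^> //' > ./differences.out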
posted by hanov3r at 12:40 PM on February 8, 2013


You could create a git repository, rig it up so that each copy of the folder is a local repository pointing to a central "remote" bare repository (which can actually be on your local filesystem, it doesn't have to be over the network), and then you'd be able to experiment with git's various automated merge strategies.
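Very roughly, and with made-up paths, that setup would look something like this:

git init --bare ~/central.git                  # the "remote", living on the local filesystem

cd ~/folder-copy-1
git init && git add . && git commit -m "copy 1"
git remote add origin ~/central.git
git push origin master

cd ~/folder-copy-2
git init && git add . && git commit -m "copy 2"
git remote add origin ~/central.git
git pull origin master                         # this is where the merge strategies come into play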
posted by XMLicious at 12:44 PM on February 8, 2013


AutoHotkey is free software that I've always found useful for automation tasks; it can record macros of your actions:
http://www.autohotkey.com/
posted by spacefire at 12:50 PM on February 8, 2013


Response by poster: The git repository idea seems promising. I also just installed Mercurial and I'm planning to learn how to use it. Does anyone know any specifics on these automated merge strategies?
posted by incandescentman at 12:59 PM on February 8, 2013


Response by poster: OK, if there's no easy way to do this all automatically, then what about a diff/merge engine that allows me to input, say, five files instead of just two, and merge them?
posted by incandescentman at 1:05 PM on February 8, 2013


Meld (http://meldmerge.org/) does a three-way compare.
posted by XMLicious at 1:48 PM on February 8, 2013


Response by poster: How about a five-way?
posted by incandescentman at 1:50 PM on February 8, 2013


Some information that would be helpful: you say you add lines only. Do you happen to add lines only to the end of the file? Do you happen to only add single lines at a time? (e.g. I have some files that I only add to with something like `echo Some new line of something >> filename.txt`. I add single lines only and only to the end of the file.) Do the files have any reliable creation/modification time available? (Can you sort them by when they first showed up or by the last time they were edited?)
posted by zengargoyle at 1:51 PM on February 8, 2013


Response by poster: No, there are no reliable creation/modification times available, because the syncing messed everything up.

I have been adding prose - paragraphs and sentences - and not only at the end of the files.
posted by incandescentman at 1:54 PM on February 8, 2013


How about a five-way?

There are only three file name fields, so probably not. But if you did it twice, that would let you merge a total of five files, acourse.
posted by XMLicious at 2:16 PM on February 8, 2013


If I weren't afraid of a "reformat the document" step at the end, I think I would start by turning each duplicate into a straight list of sentences, stripping out blank lines, so that each file looked something like:

This is the first sentence of the first paragraph.
This is the second sentence of the first paragraph.
This is the third sentence of the first paragraph which is a short paragraph.
This is the first sentence of the second paragraph.
...

And then I'd start with the file that had the most lines and compare it against the others, first picking out any files with sentences that can't be found in the longest file and trying to diff/merge those additional sentences in. I'd also probably start with `sort | uniq -c | sort -n` of all of the files combined to get an idea of how many sentences have the same count as the number of files (i.e. are in each file) and how many sentences are only in one file. It may be a bit more difficult if sentences can also differ by words being added within them.
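As a rough sketch of that counting pass, assuming the versions are all named brainstorming* and have already been reformatted to one sentence per line:

cat brainstorming* | grep -v '^[[:space:]]*$' | sort | uniq -c | sort -n

A count equal to the number of versions means the sentence is in every file; a count of 1 means it's unique to one version.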

I think that, depending on how your editor handles word wrapping and paragraph formatting, the default programming-style diff/merge tools may just not be up to the job. They tend to work on comparing lines and trying to find surrounding context by lines. If your sentence addition in the middle of a paragraph causes the paragraph to reflow and changes all of the following lines, then diff will have a hard time. There are options to diff to do comparisons on a word-by-word basis, but I'm not terribly familiar with them.
posted by zengargoyle at 3:04 PM on February 8, 2013


Go to www.perlmonks.org and beg one of the Über geeks to help you.
Perl is great for text stuff like this.
posted by Ignorance at 3:23 PM on February 8, 2013


Just wanted to chime in to say that whatever you do, make a backup copy of the entire folder *first*, so if something goes horribly wrong you can undo it.
posted by zug at 5:49 PM on February 8, 2013 [1 favorite]


Oh, man, what an interesting question! This would make a great competition problem.

If there's a subset of files (say, a few dozen) that you wouldn't mind showing to a stranger on the internet, I'd be happy to write up a script to merge them together for you.
posted by d. z. wang at 7:08 PM on February 8, 2013


This sounds almost like a genetics problem, trying to coalesce strings of data with mutation! I bet you could ask on StackOverflow and get some interesting answers.
posted by vasi at 8:08 PM on February 8, 2013


The thing is - if you find git too complicated (which is quite understandable, as it's very complicated) I would think that a custom-written script or program is probably not going to be much less complicated, as this is inherently a complicated problem. Maybe try to get someone to script git to try to make it simpler to use for your particular purpose?

btw, I think you would probably want git's "octopus" merge strategy, though I don't do much automated merging so I'm not sure.
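If it helps, an octopus merge is just a merge of more than two branches at once; a sketch, with hypothetical branch names (say, one branch per recovered copy of the folder):

git checkout master
git merge copy-a copy-b copy-c copy-d    # merging more than one branch uses the octopus strategy by default

Note that octopus bails out on anything needing manual conflict resolution, so it only works when the individual merges are clean.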
posted by XMLicious at 4:04 AM on February 9, 2013


Also, I don't think there would be any satisfying solution involving genetic algorithms, as they involve randomly changing data and throwing away parts of it; the OP appears to want all of the original bits of text coalesced together, without losing anything or having randomly-generated stuff mixed in.
posted by XMLicious at 4:08 AM on February 9, 2013


zug: Just wanted to chime in to say that whatever you do, make a backup zipped copy of the entire folder *first*, so if something goes horribly wrong you can undo it.
Smaller size AND less prone to accidental changes.
posted by IAmBroom at 10:22 AM on February 9, 2013


XMLicious: "Also, I don't think there would be any satisfying solution involving genetic algorithms, as they involve randomly changing data and throwing away parts of it..."

XMLicious, I didn't mean "genetic" as in "genetic algorithms". I was referring to the sort of problems that people working with actual DNA have. Eg: you might have several similar DNA sequences, and want to figure out how they may be related via mutation. Or you might have several segments of DNA which you know fit together, so you have to look for overlaps at the edges.

Getting back on-topic, I don't think a three-way merge will help the OP much. "Three-way merge" specifically refers to the case where you're comparing an ancestor A to two descendants B and C. If the files you pick don't have that relationship, it may be less useful.
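In command-line terms (illustrative filenames; diff3 comes with diffutils), a three-way merge looks something like:

diff3 -m yours.txt ancestor.txt theirs.txt > merged.txt

where ancestor.txt is the common original and the other two files are independently edited descendants of it.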
posted by vasi at 1:17 PM on February 9, 2013 [1 favorite]


Indeed, it would be a great sort of side project to think about and work on in spare time that might turn out something useful. It does seem amenable to gene-sequence matching, and I'm thinking some graph manipulation could possibly work.

Take a fingerprint of each paragraph. Depending on circumstances and how things actually work out... maybe a hash of the paragraph, or just the first word of each sentence or something similar. Then you could go over each file and build a directed graph of nodes.

p1 - p2 - p3 - p4 - p5
p1 - p2 - p3 - p5
p1 - p2 - p3 - p4 - p6 - p5

Then you could pick the nodes that occur the most and use them as fixed points to help find a path from p1 to p5, and do merging where there are multiple nodes between pn and pm.
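As a rough sketch of the fingerprinting step (Digest::MD5 ships with Perl; the filename is just an example), this prints one MD5 per paragraph:

perl -00 -MDigest::MD5=md5_hex -ne 's/\s+\z//; print md5_hex($_), "\n"' brainstorming.txt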

Or maybe treat sentences like a Markov chain: build the table of following sentences, then start from the first sentence and make a decision whenever there is more than one following match available.
posted by zengargoyle at 1:22 PM on February 9, 2013


As a first pass, I would gather checksums (probably MD5) of everything to spot identical files I could get rid of.

$ md5 -r * | sort > /tmp/md5s

Looking in /tmp/md5s will then show clusters of identical files, since -r puts the checksum first and sorting groups identical checksums together. Delete all but one file in each such cluster.

Then, yes, I'd start a git or mercurial repository and try to use that merging mechanism to resolve the issues. The good news is ASCII text is exactly what source code control software is designed to manipulate. The bad news is that unless you're a programmer, you probably won't have any idea what is going on.

Personally, I prefer mercurial over git as I find it easier to use and less argumentative but it's mostly a religious issue.
posted by chairface at 9:39 AM on February 10, 2013 [1 favorite]


Just stumbled across Ferret while I was looking for something else:
Ferret is a copy-detection tool, created at the University of Hertfordshire by members of the Plagiarism Detection Group. Ferret locates duplicate text or code in multiple text documents or source files. The program is designed to detect copying (collusion) within a given set of files. Ferret works equally well with documents in natural language (such as English, German, etc) and with source-code files in a wide range of programming languages.
It appears to be targeted at Linux, though.
posted by XMLicious at 10:37 PM on February 10, 2013


Response by poster: Thank you all for your thoughts so far.

I'm going to circumscribe the problem by doing more of this manually and seeking to automate a smaller portion of it.

I've looked more closely at the files, and I've detected some patterns. Here's an example of 5 versions I would want to reconcile into one canonical version.

I have the following 5 files:

brainstorming.0001.txt
brainstorming.0002.txt
brainstorming.txt
brainstorming.md
# brainstorming.txt

These files are almost identical. For each one, I'll list the filename in link face, and then the first 5 lines of text in the file. As you will see, each of the files repeats its filename as its first line. Some of the files repeat it twice.

# brainstorming.md
# brainstorming

who's the audience?

what are the needs


brainstorming.txt
brainstorming

# brainstorming

who's the audience?


# brainstorming.txt
brainstorming

# brainstorming

who's the audience?


brainstorming.0001.txt
brainstorming

# brainstorming

who's the audience?

brainstorming.0002.txt
brainstorming

# brainstorming

who's the audience?

...

So. If I could find a way to simply find all such variations, realize they're all the same file, and then consolidate them, I would be satisfied, and I could do the rest manually.
posted by incandescentman at 2:05 AM on February 12, 2013


Okay, here's a unix one-liner that will find duplicate files and delete all but the first one (as I said before, you'd better have a backup before trying this):

fdupes -rf1 /path/to/directory | xargs rm

You'll need to install fdupes first (port install fdupes).
posted by zug at 11:54 AM on February 12, 2013


Response by poster: Thanks zug. I ran that and got "xargs: unterminated quote," I guess because the filenames have spaces in them? Do you know a solution?
posted by incandescentman at 10:02 PM on February 12, 2013


It's because a filename somewhere has an apostrophe. Using the -0 switch will cause xargs to work around that stuff.

fdupes -rf1 /path/to/directory | xargs -0 rm
posted by zug at 7:10 AM on February 13, 2013


Response by poster: Thanks. This time I get ": File name too long." (Some of those filenames got really messed up during sync.) Is there a solution?
posted by incandescentman at 12:36 PM on February 13, 2013


fdupes will handle that for you.

$ fdupes -h
...
 -d --delete            prompt user for files to preserve and delete all
                        others; important: under particular circumstances,
                        data may be lost when using this option together
                        with -s or --symlinks, or when specifying a
                        particular directory more than once; refer to the
                        fdupes documentation for additional information
 -N --noprompt          together with --delete, preserve the first file in
                        each set of duplicates and delete the rest without
                        prompting the user
`fdupes -rNd [path]` -- keep one of each duplicate, delete the rest without asking.
posted by zengargoyle at 11:37 AM on February 14, 2013


Response by poster: Thanks. I tried `fdupes -rNd [path]` but I'm on OS X bash and it appears I don't have the -N option.

localhost:notationaldata incandescentman$ fdupes -rNd /Users/incandescentman/Dropbox/Git/notationaldata
fdupes: invalid option -- N
Try `fdupes --help' for more information
localhost:notationaldata incandescentman$ fdupes --help
Usage: fdupes [options] DIRECTORY...

 -r --recurse      include files residing in subdirectories
 -s --symlinks     follow symlinks
 -H --hardlinks    normally, when two or more files point to the same
                   disk area they are treated as non-duplicates; this
                   option will change this behavior
 -n --noempty      exclude zero-length files from consideration
 -f --omitfirst    omit the first file in each set of matches
 -1 --sameline     list each set of matches on a single line
 -S --size         show size of duplicate files
 -q --quiet        hide progress indicator
 -d --delete       prompt user for files to preserve and delete all
                   others; important: under particular circumstances,
                   data may be lost when using this option together
                   with -s or --symlinks, or when specifying a
                   particular directory more than once; refer to the
                   fdupes documentation for additional information
 -v --version      display fdupes version
 -h --help         display this help message

--

Oh, and by the way, separately, how do I address users who have contributed answers, in order to signal them that I'm asking them follow-up questions? Like this. I can't find anything about this on the FAQ or by googling.
posted by incandescentman at 1:04 PM on February 14, 2013



$ fdupes .
./a                                     
./b
./c
./d
./with a space

# -f to omit the first duplicate file

$ fdupes -f .
./b                                     
./c
./d
./with a space

# perl fu because spaces+shell == suckage

$ fdupes -f . | perl -lne 'unlink'

$ ls
a

`unlink` will silently fail on the blank line that fdupes places between groups of duplicate files. Probably should do `-f $_ && unlink` but meh.
posted by zengargoyle at 6:22 PM on February 14, 2013 [1 favorite]


Response by poster: Zengargoyle, thanks. I didn't understand that last bit about -f $_ && unlink.

Should I simply navigate to the afflicted directory and enter the following?

fdupes -f . | perl -lne 'unlink'
posted by incandescentman at 8:53 PM on February 14, 2013


The thing about '-f $_ && unlink' was a side note, ignore that. You can indeed just cd to the directory, and execute that command. Make sure you're backed up first!
posted by vasi at 9:14 PM on February 14, 2013


Response by poster: Great, that worked! I got rid of about 1000 duplicates.

Among the files that remain, there are still many that are identical EXCEPT THAT in one of them, the filename itself is repeated at the top of the text file, followed by an empty line.

If anyone is still interested in this problem: Can you suggest a way to compare 2 text files, IGNORING the first few lines? Such that if two files are identical except that one of them has 2 extra lines at the top, the two files will be considered duplicates and one of them zapped?

Thanks!
posted by incandescentman at 11:16 PM on February 14, 2013


Hmm, maybe you'd just like a script to remove "brainstorming" followed by two newlines from the start of files? That would be:

perl -i -00 -pe 's/\Abrainstorming\n\n//' *
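Or, if you just want to check whether two files match apart from a couple of extra lines at the top (filenames are just examples; <( ... ) is bash process substitution), something like:

diff <(tail -n +3 textfile.0001.txt) textfile.txt

No output means they're identical once the first two lines of the first file are skipped.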
posted by vasi at 11:31 PM on February 14, 2013


Do the files have the same names before the extension?

Like:
brainstorming.001
brainstorming.002

Or is it:
brainstorming.001
brainstorming ideas.002

Actually, that shouldn't matter. This ought to be doable as a one-liner. I'll get back to you.
posted by zug at 5:33 AM on February 15, 2013


(Do a fresh backup first just in case; this will modify the files in place to remove the filename line and the blank line following it.)

cd to the directory, then:

perl -0777 -pi -e 's#\Q$ARGV\E$/{2}##' *
posted by zug at 8:20 AM on February 15, 2013


Sadly, zug, that won't work. Using '-0777' is pretty much a 'BEGIN{ $/ = undef; }'. Deep voodoo in in-place editing and record separators.

If the Perl is 5.10 or better you could use '\R' instead of '$/'. `perldoc perlre` and `perldoc perlrebackslash` describe it as: "\R" matches a generic newline; that is, anything considered a linebreak sequence by Unicode. And using $ARGV will only work if that first line matches the filename including the extension, which doesn't seem to be the case in the brainstorming examples.

perl -0777 -i.bak -pe 'BEGIN{$MATCH=shift @ARGV} s#\A\Q$MATCH\E\R\R##' brainstorming brainstorming.*
Disclaimer: I'd totally be writing a script and using modules for this by now. (honest)
posted by zengargoyle at 4:27 PM on February 15, 2013


Response by poster: Wow, what an amazing community you guys are. ask.metafilter.com, I love you. Thank you all so much for lending your minds to this task, and zengargoyle, zug, vasi, thank you for donating your brains and time to helping me sort this out. I really appreciate it.
posted by incandescentman at 2:48 PM on February 20, 2013


Response by poster: In the end I wound up using a hybrid method. After eliminating what duplicates I could using the methods provided by genius mefites, I'm now using the following method to get the rest: searching for files with similar filenames, cat'ing them together, then eliminating duplicate lines to remove the overlap. Tedious, but it seems to be working. Thanks again.
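In shell terms, "cat them together and strip the duplicate lines" works out to something like this (filenames hypothetical; the awk idiom keeps the first occurrence of each line, so the original order survives):

cat brainstorming* > brainstorming.all.txt
awk '!seen[$0]++' brainstorming.all.txt > brainstorming.merged.txt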
posted by incandescentman at 10:20 PM on February 20, 2013

