Using piping with *nix find without getting mired in string escaping...
December 12, 2017 1:46 PM

I want to make a single, long list of lines from individual files according to some search criteria. I'm not wedded to using find and grep.

I find files doing this:

find . -name '*.rtf' -type f

I can pipe single rtf files into grep by doing this:

cat ./some/file | textutil -convert txt -stdin -stdout | grep 'foo bar'
(grep won’t give me the lines back unless the file is plain text.)

But, I can’t get piped commands working with find -exec, to get a single, long list of matching lines from many rtf files, according to my search criteria. I think I’m getting mired in string escaping.

This is wrong:
find . -name '*.rtf' -type f -exec sh -c "textutil -convert txt {} -stdout | grep -i 'some string' " \;

How can I get a single list of matching lines of text across multiple files? If there is another way to extract matching lines from files, say from the mac GUI, that would be even more excellent.
posted by zeek321 to Computers & Internet (28 answers total) 4 users marked this as a favorite
 
Try xargs and use the $@ shell variable to get the filename instead of that awful -exec syntax. The variable will substitute within double quotes.
find . -name '*.rtf' -type f | xargs sh -c "textutil -convert txt $@ -stdout | grep -i 'some string' "
posted by rlk at 1:55 PM on December 12, 2017 [3 favorites]


you want to find a bunch of RTF files, convert them to plain text (and output the plain text to stdout) and then search the output from the conversion for a string?

(without the conversion, it'd be something like: find . -name '*.rtf' | xargs grep "foo" ... So perhaps you need to xargs the find output to the converter?)
posted by k5.user at 1:56 PM on December 12, 2017


ahh, rlk, so close?

$ find . -name '*.rtf' -type f | xargs sh -c "textutil -convert txt $@ -stdout | grep -i 'foo bar' "
xargs: unterminated quote
posted by zeek321 at 2:09 PM on December 12, 2017


I tried escaping my single quotes: \' but no dice.
posted by zeek321 at 2:10 PM on December 12, 2017


Added -print0 and -0... closer...

find . -name '*.rtf' -type f -print0 | xargs -0 sh -c "textutil -convert txt $@ -stdout | grep -i 'foo bar' "
No input files specified.
No input files specified.
posted by zeek321 at 2:15 PM on December 12, 2017


shouldn't that last double quote go before the pipe and grep command?
posted by ArgentCorvid at 2:19 PM on December 12, 2017


Oh, uh, actually we DON'T want the $@ to substitute in the outer shell, we want it to substitute after the inner shell executes. So try swapping the single and double quotes.

find . -name '*.rtf' -type f -print0 | xargs -0 sh -c 'textutil -convert txt $@ -stdout | grep -i "foo bar" '
posted by rlk at 2:21 PM on December 12, 2017


also maybe add a --verbose to the xargs part to see what line it is trying to execute.
posted by ArgentCorvid at 2:22 PM on December 12, 2017


I'd take a slightly different approach, with a loop in a subshell:

(for file in `find . -name '*.rtf' -type f`; do cat $file | textutil -convert txt -stdin -stdout; done) | grep -i 'foo bar'

any better?
posted by illongruci at 2:23 PM on December 12, 2017


@rlk, I have a lot of files with spaces in them, so now I'm getting dozens of errors like this:

Error reading the/first-word. The file doesn’t exist.
Error reading some-middle-word/foo. The file doesn’t exist.

I somehow need to pass the file paths properly escaped from find into the rest of the command. I tried putting double quotes around the "$@" but I got a bizarre request to install a font with no further output.

I suspect it's going to work if I can figure this last part out.
posted by zeek321 at 2:29 PM on December 12, 2017


why not
find . -name '*.rtf' -type f -print0 | xargs -0 'textutil -convert txt -stdout' | grep -i 'foo bar'

I think the $@ isn't necessary. If you think it is, the -I option of xargs may help.
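For instance, a sketch of the -I form (with -I, xargs treats each whole input line as a single argument, so spaces in filenames survive, though names containing newlines still wouldn't):

find . -name '*.rtf' -type f | xargs -I{} textutil -convert txt -stdout {} | grep -i 'foo bar'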
posted by ArgentCorvid at 2:36 PM on December 12, 2017


If I understand the problem, would this do it?

find . -name '*.rtf' -type f -exec textutil -convert txt -stdout '{} '+' | grep -i 'some string'

(the + rather than ; makes it do a list of files rather than one)
posted by jaymzjulian at 2:59 PM on December 12, 2017


How about
find . -name '*.rtf' -type f -print0 | xargs -0 textutil -stdout -cat txt | grep -i 'foo bar'

posted by nicwolff at 3:37 PM on December 12, 2017 [1 favorite]


I think jay might be on the right path but made a minor error.

Try:

find . -name '*.rtf' -type f -exec 'textutil -convert txt -stdout '{} '+' | grep -i 'some text'
posted by xyzzy at 4:32 PM on December 12, 2017


I did make a typo - what I meant to type was:

find . -name '*.rtf' -type f -exec textutil -convert txt -stdout '{}' '+' | grep -i 'some string'

(note the extra ' after the {})
posted by jaymzjulian at 4:46 PM on December 12, 2017


jaymzjulian's method is what I'd use.

Any time you find yourself wanting find to execute some complicated pipeline instead of a simple command, it's worth stepping back and asking whether the first command in that pipeline, executed multiple times, will produce a stream suitable for the rest of the pipeline to consume.

Also, in most shells you could actually leave out the quotes around {} and + entirely and it would still work. If you enter echo {} + and your shell displays {} + then it won't need those tokens quoted for find either. Putting the quotes in is good practice if you're writing scripts, but for command-prompt one-liners it's handy to know whether you need them or not.
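For example, at a bash prompt:

$ echo {} +
{} +

so bash leaves both tokens alone, and find would see them just fine without the quotes.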

All that said: xargs is a very useful tool and it's certainly worth exploring its options, but for more complicated find+process loops I like to use a while read loop to consume output from find -printf, like this:
retain-latest-pcaps() {
        # emit "name mtime" lines for pcaps in this directory only (no recursion)
        # modified within the last 61 seconds; %P strips the leading ./ from each path
        find . -maxdepth 1 -mindepth 1 \
                -name 'minute-*.pcap' -newermt '61 seconds ago' \
                -printf '%P %TY-%Tm-%Td-%TH-%TM-%TS\n' |
        # split each line on whitespace into its two fields
        while read name modtime
        do mv $name outage-$modtime.pcap
        done
}
This kind of construction lets you do arbitrarily complicated processing inside the while read loop, each step of which can use any of the useful output from find -printf as well as transforming that output via shell string substitutions. Main disadvantage compared to xargs is that it will generally be slower.
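Applied to the question at hand, a minimal sketch of the same pattern (the -r and the quotes around "$file" keep names with spaces intact; names containing newlines would still break it):

find . -name '*.rtf' -type f |
while IFS= read -r file
do textutil -convert txt "$file" -stdout
done |
grep -i 'foo bar'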
posted by flabdablet at 7:29 PM on December 12, 2017 [2 favorites]


Shell is hard. That's why there's Perl :P.
but.

zen@gaz:~$ find . -type f -name '*.rtf' | while read file; do echo "$file:"; unrtf "$file" | fgrep "the"; done
./repos/p6/perl6-all-modules/viklund/november/talks/ru/article.rtf:
token twext { [ <.alnum> || <.otherchar> || <.whitespace> ]+ };
token otherchar { <[ !..% (../ : ; ? @ \\ ^..` {..~ ]> };
Oh, `unrtf` does HTML by default, which is actually handy for the < and > that happened to be in there, but it puts <br>'s at the end of lines :(
posted by zengargoyle at 7:41 PM on December 12, 2017


Oh, mods, preview shows those <br>'s and makes it look double spaced while the post doesn't.
posted by zengargoyle at 7:44 PM on December 12, 2017


grep on a Mac can search binary files; try the -U flag. That simplifies the command line.

grep -U 'foo bar' filename.rtf

works for me.
posted by epo at 2:54 AM on December 13, 2017


So I've just been looking at a manual page for textutil, and I'm not convinced we're actually doing the right thing here.

The description of the form you're using, textutil -convert fmt [options] file ..., says "Convert the files to format fmt and write each one back to the file system"; the description of the -stdout option says "Send the first output file to stdout."

That says to me that what textutil -convert txt will do with a long list of RTF files on its command line, as it would see when invoked via find . -type f -name '*.rtf' -exec textutil -convert txt -stdout {} + or any of the equivalent xargs variants, is convert the first of these to text and write it to stdout, then make a whole pile of new .txt files from the remainder of the input .rtf files. Which would not only create clutter in your filesystem, it would fail to pass anything but the contents of the first file along to grep.

I think you probably want textutil -cat txt instead: "Read the specified files, concatenate them, and write the result out as a single file in the indicated format." Using that with -stdout should push all the converted RTFs down the pipe as text and avoid creating any new files. So the command line becomes

find . -name '*.rtf' -type f -exec textutil -cat txt -stdout {} + | grep -i 'some text'

If you were ever to find yourself needing to use something in textutil's place that can only accept one input filename on the command line instead of an arbitrary number of them at the end, then instead of using find ... -exec ... {} + you'd use find ... -exec ... {} ';' (note that semicolons are special to every shell you're likely to use, so you will need to wrap that semicolon in quotes or prefix it with \). When the -exec option's arguments get terminated with ; instead of +, find will invoke the specified command repeatedly, once for each file it finds, instead of invoking it once* with all the found pathnames supplied on the command line as in the + case.
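For the record, the ';' form applied to this job would look something like this (slower, since textutil gets launched once per file, but each invocation's single converted output still goes to stdout):

find . -name '*.rtf' -type f -exec textutil -convert txt -stdout {} ';' | grep -i 'some text'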

Finally, if you don't actually need the recursive subfolder search that find performs by default, or any find options cleverer than just expanding a glob, you can just let the shell itself expand your *.rtf glob instead:

textutil -cat txt -stdout *.rtf | grep -i 'some text'

This would make the shell invoke textutil with a whole pile of .rtf files at the end of its command line, in much the same way as find would invoke it given find ... -exec textutil ... {} +.

*technically the command that -exec ... + invokes can get called more than once, but this only happens if find needs to do that in order to stay inside whatever command line length limits the local system imposes.
posted by flabdablet at 3:33 AM on December 13, 2017 [1 favorite]


grep almost anywhere can search binary files, but .rtf isn't even binary, and depending on what you search for you will get a bunch of nonsense.

Try and grep a .rtf for 'default' or 'lang' or 'char' or 'font'... you get a bunch of .rtf markup garbage. It might catch a longer unique phrase, unless it was something like "the thing" with formatting in the middle, and then you'd have to be grepping for whatever italics look like in .rtf files. Or the thing you're searching for may be broken over two or more lines (but is one line in the converted .txt version). I guess you really have to account for the thing you're searching for being broken over lines anyway...
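A quick way to see the markup problem with a made-up file (the name and content here are just for illustration); the match comes from the \deff0 control word, not from the document text "hello":

$ printf '{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times;}}\\f0 hello\\par}' > demo.rtf
$ grep -c 'deff' demo.rtf
1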

Which is more reason why if I was serious, I'd use Perl to slurp in files and convert them from .rtf to text and maybe remove problematic characters like line breaks and craft a regex to find the thing I was looking for. It all depends on how complex this question actually is. :)
posted by zengargoyle at 3:34 AM on December 13, 2017


I'd use Perl to slurp in files and convert them from .rtf to text

Handling that conversion cleanly is exactly what textutil is for.
posted by flabdablet at 3:38 AM on December 13, 2017


I didn't say that I might not do something like
my $txt = `textutil .... "$file"`;
to do the actual work, or a better "capture stdout/stderr of system/exec" sort of thing to avoid shell quoting hell. But the UnRTF module uses `unrtf` under the hood, so it becomes something like
use UnRTF; my $text = UnRTF->new(file => $file)->convert(format => 'text');
The real PITA is figuring out whether your search text might cross lines or might contain a long hyphenated word, depending on which converter you use on the back-end.

Probably better to convert it into HTML, parse the DOM and search it like a web page.
posted by zengargoyle at 4:01 AM on December 13, 2017


This is wrong:
find . -name '*.rtf' -type f -exec sh -c "textutil -convert txt {} -stdout | grep -i 'some string' " \;


Might also help you to know why this is wrong, in case you find a genuine need to use a similar construction later on.

When a POSIX-compatible system invokes an executable, it provides its main entry point with two things: an integer argument count (conventionally named argc) and an array of pointers to null-terminated argument strings (conventionally named argv). This is as close as a POSIX executable ever gets to seeing the shell command line that invoked it.

The same kind of invocation interface is also provided for ANSI C executables on Windows, but it has to be faked up by the standard library. Windows, unlike POSIX systems, does not provide argc and argv natively; when a Windows executable is invoked, the system hands it the invoking command line as a single string, and it's up to the executable (or more usually, one of the libraries it's linked with) to parse that and break it down. Windows inherited this design from DOS, which inherited it from CP/M. It's a mess, and it's the reason why it's still possible to break innumerable things by putting an executable file on a Windows box at C:\PROGRAM.EXE.

But I digress.

When you type

find . -name '*.rtf' -type f -exec sh -c "textutil -convert txt {} -stdout | grep -i 'some string' " \;

into the shell (or the shell encounters it as a line of script), it will invoke the find executable and pass it the following things:

argc: 11
argv[0]: find
argv[1]: .
argv[2]: -name
argv[3]: *.rtf
argv[4]: -type
argv[5]: f
argv[6]: -exec
argv[7]: sh
argv[8]: -c
argv[9]: textutil -convert txt {} -stdout | grep -i 'some string'
argv[10]: ;

Note that as well as breaking the command line you typed into individual strings, the shell has stripped the quotes from any of those that its parsing rules required you to include in order to stop various amounts of processing inside them. See the "parameter expansion", "word splitting" and "quote removal" sections in the bash manual for the gory details.
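You can watch word splitting and quote removal happen at a prompt with printf, which simply reprints each argument it actually received:

$ printf '[%s]\n' 'some string' some string
[some string]
[some]
[string]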

Also note that the command line you double-quoted in order to turn it into a single argument for sh has indeed been handed to find as a single string (argv[9]).

You've used the per-file-invocation form of -exec here (the one terminated with ; rather than +) so as soon as find finds its first .rtf file (which I'll assume is named found.rtf) it will invoke the sh executable and pass it the following things:

argc: 3
argv[0]: sh
argv[1]: -c
argv[2]: textutil -convert txt ./found.rtf -stdout | grep -i 'some string'

There are some versions of find that would not substitute found.rtf for the embedded {} inside their argv[9], requiring instead that {} be an entire argument on its own in order to trigger substitution. Neither BSD nor GNU find has that restriction, though. Which is, as we are about to discover, a bug and not a feature.

Finally, sh sees the -c option passed to it as argv[1], and reacts by parsing its argv[2] as a shell command line. So it basically seems to do what's required.

Note in particular that find does not do any processing on its argv[9] beyond straight substitution of found.rtf for {}. In particular, it does not break argv[9] into separate strings the way a shell might do, and it knows nothing about the complicated rules that shells use for dealing with quotes. And this is where the alarm bells start ringing. If find were to find hell's bells.rtf instead of found.rtf, then sh would see

argc: 3
argv[0]: sh
argv[1]: -c
argv[2]: textutil -convert txt ./hell's bells.rtf -stdout | grep -i 'some string'

and its command line parsing rules would want to make it invoke textutil with

argc: 5
argv[0]: textutil
argv[1]: -convert
argv[2]: txt
argv[3]: ./hells bells.rtf -stdout | grep -i some
argv[4]: string

which textutil would clearly not react well to. Fortunately it never gets to react at all; having parsed 's bells.rtf -stdout | grep -i ' as a single-quoted string (that happened to have a bit more stuff glued on at the beginning and end), the shell sees the single quote at the end of string' as unmatched, and just complains about that instead of launching textutil.

Splicing the pathname straight in where {} was has, in effect, resulted in the same multiple-parsing-passes hell that Windows suffers on the regular but POSIX was designed to avoid. And no, just sticking a pair of single quotes around the {} embedded in the original string doesn't help; it just moves the hell around a bit.

What needs to happen is for hell's bells.rtf (or any other filename find happens to find) to be passed around always as its own dedicated argv[] string, never being embedded inside another one to be parsed out later. Because there, always, be dragons. And what that boils down to is that for the preservation of basic sanity, the original arguments to find need to have {} as a single stand-alone argument rather than embedded inside something else.

But this constraint results in its own problem: that argument really does need to be substituted into the pipeline that we're invoking sh to create.

Fortunately, Bourne shells are essentially sane and have facilities that deal with exactly this case. If you check the manual for the -c option on pretty much any Bourne-descended shell, you will see something like
-c      Read and execute commands from the first non-option argument command_string, then exit. If there are arguments after the command_string, the first argument is assigned to $0 and any remaining arguments are assigned to the positional parameters. The assignment to $0 sets the name of the shell, which is used in warning and error messages.

Which means that instead of invoking our pipeline as

sh -c "textutil -convert txt found.rtf -stdout | grep -i 'some string' "

we could use

sh -c "textutil -convert txt \"\$1\" -stdout | grep -i 'some string' " pipeline found.rtf

or more tidily

sh -c 'textutil -convert txt "$1" -stdout | grep -i "some string" ' pipeline found.rtf

Then sh would see

argc: 5
argv[0]: sh
argv[1]: -c
argv[2]: textutil -convert txt "$1" -stdout | grep -i "some string"
argv[3]: pipeline
argv[4]: found.rtf

and now things start working properly. The double-quoted parameter expansion "$1" can itself be cleanly parsed by the shell regardless of what horrors lie inside the parameter to be expanded. In particular, passing

argc: 5
argv[0]: sh
argv[1]: -c
argv[2]: textutil -convert txt "$1" -stdout | grep -i "some string"
argv[3]: pipeline
argv[4]: hell's bells.rtf

works without issues. And if anything does go wrong, you get error messages containing the string "pipeline" so you'll know who is complaining.

If you need to embed an arbitrary number of arguments inside a shell command line, rather than the single one we're using here, you can use the "$@" special expansion form instead of "$1". "$@" is equivalent to "$1" "$2" "$3" "$4" ... for as many positional parameters as exist.
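A quick way to convince yourself at a prompt (here sh just prints each positional parameter it received, one per line, awkward filenames and all):

$ sh -c 'printf "%s\n" "$@"' demo "hell's bells.rtf" "foo bar.rtf"
hell's bells.rtf
foo bar.rtf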

Putting all that together, the final find command that works the way you were trying to do it initially would look like

find . -name '*.rtf' -type f -exec sh -c 'textutil -convert txt "$1" -stdout | grep -i "some string"' pipeline {} \;

or if you wanted to take advantage of the -exec ... + thing to avoid having sh fired up for every single file,

find . -name '*.rtf' -type f -exec sh -c 'textutil -cat txt "$@" -stdout | grep -i "some string"' pipeline {} +

As a general rule, if you find yourself wanting to wrap a command line passed to sh -c in double quotes rather than singles, that's a code smell for possible multiple-parsing bugs. The command line passed to sh -c should almost always be a simple fixed string that gets nothing clever done to its innards.

But as I hinted earlier, this kind of thing is best kept for scenarios where you actually need it. In almost all cases you can get away with having find invoke a single command and doing the rest of the pipeline outside it.
posted by flabdablet at 7:00 AM on December 13, 2017


The problem with using -cat is that the last line of each .rtf file will be run in with the first of the next, so if either matches grep you'll get them both.
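That run-in is easy to demonstrate with plain cat, assuming the first file's converted text lacks a trailing newline:

$ printf 'alpha' > a.txt
$ printf 'bravo\n' > b.txt
$ cat a.txt b.txt | grep alpha
alphabravo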
posted by nicwolff at 7:16 PM on December 17, 2017


Well that sucks.

Does -convert also not force a trailing newline at the end of an input RTF file when converting it to plain text, or is this solely a -cat misfeature?

At this point I'm annoyed enough to force the poorly-designed textutil to bend to my will. Can't immediately think of a tidier way to do that than using xargs:
n=$(mktemp XXXXXXX.txt)
echo >$n
find . -name '*.rtf' -type f -printf "%p\0$n\0" | xargs -0 textutil -cat txt -stdout | grep -i 'some string'
rm $n

posted by flabdablet at 7:49 PM on December 17, 2017


Apropos of nothing:
Hey, all of you who put different semantics and options into GNU/BSD xargs: screw you.
Obligatory XKCD.
I'm convinced that `xargs` comes from a time of plain Bourne shell and small computing where there was no other way and combining args to avoid multiple processes was a good idea. Nowadays machines are fast, disk space is cheap, and you're probably not doing big data; and if you are, it's still the wrong thing and it's probably not portable.

Create a project directory, then a source directory, copy some files, create a destination directory, do the conversion, grep until you get it right. Use find to do the copying, redo, get report, mail it to whomever, do `history > history.txt`, call it a day. Edit `history.txt`, clean up your mistakes, delete your trash, tar it up and put it in a 'YYYY/MM/project.tar.bz2' and forget it. Done.
If you have to do it again, copy the old into a new project, edit the history and convert it to a shell script, press button and done. The third time, script it in Perl/Python/whatever, press button and done. The next time, your Procmail handler has already done the thing and sent you a success report and DONE.

This is a rant: `xargs` isn't worth it unless you're extremely constrained. And if you have really big problems, a tool like `parallel` that uses multiple cores and multiple machines is much more worthy of study than getting `xargs` to do what you want.
posted by zengargoyle at 3:22 AM on December 18, 2017


it's still the wrong thing and it's probably not portable

xargs -0 with nothing clever done with replacement strings is robust, portable across GNU and BSD userlands, and frankly, screw Solaris.
posted by flabdablet at 7:37 AM on December 18, 2017 [1 favorite]

