Need a shell script to filter out directories
August 10, 2011 7:52 AM

Shell gurus needed! I need a line/script that will create a list of all sub-folders within a folder that do not contain files with a certain filename pattern. There is an additional requirement...(more inside)

Given a folder /somefolder, I need to be able to create a list of all subfolders within /somefolder that:

1) have files of the pattern *.tif
AND
2) do not have files of the pattern *_Aug11.pdf

In other words a folder containing 8107.tif AND 8107_Aug11.pdf would not match the search expression.

However, a folder containing 8107.tif AND 8107.pdf would match, as would a folder containing just 8107.tif alone, or 8107.tif, 8108.tif, etc. provided there are NO associated _Aug11.pdf for each of those.

The script can be a Windows PowerShell or Unix shell script.
posted by dukes909 to Computers & Internet (12 answers total)
 
Here's a Bash script that will do this:
#!/usr/bin/env bash

DIR="${1:-.}"

(for TIFF in $(find "$DIR" -name '*.tif' -print); do
    if [ ! -e "${TIFF%.tif}_Aug11.pdf" ]; then
        echo "$(dirname "$TIFF")"
    fi
done) | uniq
You could make this a one-liner, but ugh. It's also a bit inefficient, because it will keep processing the same directory over and over again even after it finds one example. There are ways to keep it from doing that, but at the expense of making the script a bit more complex.
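As a rough sketch of one such way (not tested against the poster's data): walk the directories instead of the files, and stop looking at a directory as soon as one .tif without its _Aug11.pdf turns up. The function name list_dirs_missing_pdf is made up for illustration, and the sketch assumes Bash plus a find that supports -print0.

```shell
#!/usr/bin/env bash
# Sketch: report each directory at most once, and stop scanning a directory
# as soon as one .tif without a matching _Aug11.pdf is found.
# list_dirs_missing_pdf is a hypothetical helper name, not part of the script above.

list_dirs_missing_pdf() {
    local root="${1:-.}" dir tif
    # -print0 with read -d '' keeps paths containing spaces intact
    find "$root" -type d -print0 |
    while IFS= read -r -d '' dir; do
        for tif in "$dir"/*.tif; do
            # An unmatched glob stays as the literal pattern; skip it
            [ -f "$tif" ] || continue
            if [ ! -e "${tif%.tif}_Aug11.pdf" ]; then
                printf '%s\n' "$dir"
                break   # one unconverted .tif is enough for this directory
            fi
        done
    done
}
```

Called as list_dirs_missing_pdf /somefolder, it should print each qualifying directory once, so no uniq is needed.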
posted by grouse at 8:12 AM on August 10, 2011


What about a directory containing 8107.tif and 666_Aug11.pdf? Items (1) and (2) seem to say that such a directory is not wanted, but the mention of "associated files" later on suggests you might have something else in mind.

If I read grouse's script right, it prints all directories which contain *.tif files with no associated *_Aug11.pdf files, even if other tifs in the same directory do have associated files, and even if there are unaffiliated *_Aug11.pdf files in the same directory. I don't think that's what you asked for, but maybe it is what you wanted.

Here's my (untested) take on a literal reading of (1) and (2):
    find somefolder -type d |(while IFS= read -r d; do
      if ls "$d"/*.tif >/dev/null 2>&1 &&
          ! ls "$d"/*_Aug11.pdf >/dev/null 2>&1; then
        echo "$d"
      fi
    done)

posted by stebulus at 8:43 AM on August 10, 2011


Best answer: It will also fail for directories with spaces in them. Here's a better version that will work in that case:
#!/usr/bin/env bash

DIR="${1:-.}"

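# Pass each path to bash as $1 instead of splicing {} into the script text,
# so names containing quotes or $ are not reinterpreted by the shell.
(find "$DIR" -name '*.tif' -exec bash -c \
    'FILENAME="$1"

     if [ ! -e "${FILENAME%.tif}_Aug11.pdf" ]; then
         echo "$(dirname "$FILENAME")"
     fi' _ {} ';') | uniq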

posted by grouse at 8:46 AM on August 10, 2011


stebulus, the key is in the example case:

However, a folder containing 8107.tif AND 8107.pdf would match, as would a folder containing just 8107.tif alone, or 8107.tif, 8108.tif, etc. provided there are NO associated _Aug11.pdf for each of those.
posted by grouse at 8:48 AM on August 10, 2011


Response by poster: Yes, I want the folders that contain any .tif file without an associated _Aug11.pdf. So, even if there is a folder /somefolder/farm containing:

piggies.tif
8017.tif
8017_Aug11.pdf

then I want that folder "farm" to be listed.

(What I'm trying to do is find all of the .tif files that were not converted to a PDF because of an error within the .tif file).
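If the end goal is the list of unconverted .tif files rather than their folders, something like the following might be closer to it. This is a minimal sketch under assumptions: Bash, a find that supports -print0, and the naming rule from the question (NAME.tif pairs with NAME_Aug11.pdf). The name list_unconverted_tifs is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print every .tif under a root folder that has no NAME_Aug11.pdf
# sitting next to it. list_unconverted_tifs is a hypothetical name.

list_unconverted_tifs() {
    local root="${1:-.}" tif
    # NUL-delimited paths survive spaces in file and folder names
    find "$root" -type f -name '*.tif' -print0 |
    while IFS= read -r -d '' tif; do
        # Strip the .tif suffix and look for the expected PDF name
        [ -e "${tif%.tif}_Aug11.pdf" ] || printf '%s\n' "$tif"
    done
}
```

For the farm folder above, list_unconverted_tifs /somefolder would be expected to list piggies.tif but not 8017.tif.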
posted by dukes909 at 8:50 AM on August 10, 2011


One liner, and messy

find . -name "*tif" | while read tif; do pdf=`echo $tif | sed -e 's/.tif$/_Aug11.pdf/'`; if [ ! -f $pdf ]; then echo File $tif does not have a matching pdf; fi; done
posted by devbrain at 9:14 AM on August 10, 2011


Response by poster: grouse's script worked great, although it printed them all on one line. That's OK; I redirected it to a file and edited the list. Thank you!
posted by dukes909 at 9:14 AM on August 10, 2011


Still messy, updated to print just directory names, not the individual files.

find . -name "*tif" | while read tif; do pdf=`echo $tif | sed -e 's/.tif$/_Aug11.pdf/'`; if [ ! -f $pdf ]; then echo Directory `dirname $tif` is missing pdfs; fi; done | sort | uniq
posted by devbrain at 9:15 AM on August 10, 2011


Response by poster: devbrain - I get a "bash: too many arguments" when I try yours.
posted by dukes909 at 9:19 AM on August 10, 2011


grouse's script worked great, although it printed them all on one line.

How odd. I even tested it (on Cygwin) and it worked fine for me. Glad it worked for you in some fashion, though.
posted by grouse at 9:19 AM on August 10, 2011


What I'm trying to do is find all of the .tif files that were not converted to a PDF because of an error within the .tif file

Ok. Then disregard my script; grouse's will serve you nicely.
posted by stebulus at 9:20 AM on August 10, 2011


That's odd. I cut and pasted it back out and confirmed that nothing got broken by markup/reformatting.

If you can break it apart onto separate lines, it'll narrow down where the problem is. That said, if you've already got a solution that works, there's no need to diagnose this alternative. (This script also won't work with spaces in the filename, but it should generate a different error in that case.)


#!/bin/bash

find . -name "*tif" | \
while read tif; do
    pdf=`echo $tif | sed -e 's/.tif$/_Aug11.pdf/'`
    if [ ! -f $pdf ]; then
        echo Directory `dirname $tif` is missing pdfs
    fi
done | sort | uniq
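
For comparison, here is a sketch of the same loop with every expansion quoted, so that a path containing spaces reaches [ and dirname as a single argument. Unquoted, such a path is split into extra words, which is one plausible (unconfirmed) source of a "too many arguments" complaint. The wrapper name dirs_missing_pdfs and its root parameter are additions for illustration, and paths containing newlines would still confuse the line-based read.

```shell
#!/bin/bash
# Sketch of the loop above with all expansions quoted.
# dirs_missing_pdfs is a hypothetical wrapper name added for illustration.

dirs_missing_pdfs() {
    local root="${1:-.}" tif pdf
    find "$root" -name '*.tif' |
    while IFS= read -r tif; do
        # Parameter expansion stands in for the echo | sed pipeline
        pdf="${tif%.tif}_Aug11.pdf"
        if [ ! -f "$pdf" ]; then
            echo "Directory $(dirname "$tif") is missing pdfs"
        fi
    done | sort -u
}
```

Running dirs_missing_pdfs . from the top-level folder should give one line per affected directory.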


posted by devbrain at 10:24 AM on August 10, 2011

