Linux OCR
August 15, 2010 3:16 PM

Linux script to parse all files in file tree and submit to a program?

I use a file index program (xfriend) under Linux and would like it to index my JPEG and other files.

What would a script (preferably Python) look like that scans all my files and, depending on the file type, submits each file to the Linux OCR program cuneiform and writes a *.txt file for each relevant file?
posted by yoyo_nyc to Computers & Internet (12 answers total) 3 users marked this as a favorite
 
Use an agile combination of find, file and sh's "case" statement.
posted by knz at 3:20 PM on August 15, 2010


Response by poster: As you might have already guessed, I am not a programmer. But my limited understanding tells me that this should not be more than 10 lines of python code.
posted by yoyo_nyc at 3:25 PM on August 15, 2010


Best answer: Something along the lines of:

find /top/level/directory -name "*jpg" -exec cuneiform {} \;

The above finds all jpeg files in /top/level/directory and its subdirectories and executes, for each file foo, the command cuneiform foo.
posted by axiom at 3:30 PM on August 15, 2010 [1 favorite]


30 years ago when I was a top-notch Bourne shell user, I could have done that for you in a single command line. Here's how I think it would go, but I'm sure I'll make mistakes because my memory is growing dim:

OCRprog `find . -name *.jpg -print` OCRprog-command-flags

The `` makes the delimited command run and takes all of the stdout created by it and substitutes it in that place in the outer shell script. So what this should do is find all the JPG files in the current directory and everything beneath it and pass full filepaths for all of them on the command line to OCRprog.
posted by Chocolate Pickle at 3:33 PM on August 15, 2010


Ah. Axiom is correct; the parameter to -name has to be enclosed in quotes.
posted by Chocolate Pickle at 3:34 PM on August 15, 2010


Best answer: command line:
find /path/to/files -print0 | xargs -0 doindex.sh
in file doindex.sh:
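#! /bin/bash
for f in "$@"; do
    # -b prints only the type description, without the "filename:" prefix,
    # so the patterns below match the type rather than the name.
    case $(file -b "$f") in
        *image*)
            cuneiform ...
            ;;
        ....)
    esac
done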
Replace pattern *image* by JPEG* if Cuneiform is not flexible on its input format.
posted by knz at 3:35 PM on August 15, 2010


Best answer: I suspect Chocolate Pickle's script will be broken by filenames with spaces in them.

I'd use something like:


#! /bin/bash
# IFS= and -r keep leading spaces and backslashes in filenames intact.
find . -iname "*jpg" | while IFS= read -r i
do somecommand "$i"
done


(nb untested)
posted by pompomtom at 5:29 PM on August 15, 2010
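
The concern about filenames with spaces can be illustrated in Python. This is only a sketch: split_nul is a made-up helper name and the sample listing is invented. It compares naive whitespace splitting (roughly what happens to unquoted backtick output) with splitting on NUL bytes, which is the format that find -print0 emits and xargs -0 consumes.

```python
def split_nul(data):
    """Split NUL-delimited bytes (the format of `find -print0`) into names."""
    # Every name ends with a NUL byte, so the final piece is empty; skip empties.
    return [part.decode("utf-8") for part in data.split(b"\0") if part]

# Hypothetical listing: two files, one with a space in its name.
listing = b"./my scan.jpg\0./page2.jpg\0"

# Naive approach: treat any whitespace as a separator, which cuts the first name in two.
print(listing.replace(b"\0", b" ").decode("utf-8").split())

# NUL splitting keeps each name in one piece.
print(split_nul(listing))
```

NUL is a safe separator because it is one of the two bytes (the other being "/") that cannot appear inside a Unix file name.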


Axiom's command is the one that I use, although you may need to add a -print at the end to see which file each result came from.
find /top/level/directory -name "*.txt" -exec grep 'secretMessage' {} \; -print
posted by xorry at 6:39 PM on August 15, 2010


Best answer: I also usually use Axion's method but add a "-type f" to make sure not to match any directories (you might be surprised at how many directories are named manual.html or such), and "-iname" for case-insensitive matching. If you want grep to print filenames with the matched lines, you can give it two files, "-exec grep 'foo' {} /dev/null \;"

If cuneiform has options for specifying the output file you can really do neat stuff.
find . -type f -iname \*.jpg | while IFS= read -r i; do 
b="${i%.*}"     # path/file minus .ext
d="${i%/*}"     # path
mkdir -p "text/$d"    # make same path under ./text
cuneiform -i "$i" -o "text/${b}.txt"  # convert and place output under ./text
done
This would convert ./foo/bar/bat.jpg to ./text/foo/bar/bat.txt which would be neat.
posted by zengargoyle at 7:31 PM on August 15, 2010
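
For anyone who prefers Python, the path mapping in the shell loop above might look roughly like this. It is a sketch under assumptions: mirrored_txt_path and out_root are names invented here, the input is assumed to be a relative path as produced by find ., and nothing is assumed about how cuneiform itself is invoked.

```python
import os

def mirrored_txt_path(image_path, out_root="text"):
    """Map ./foo/bar/bat.jpg to text/foo/bar/bat.txt (same tree under out_root)."""
    # normpath drops the leading "./" so the result has no "text/./foo" segment.
    rel = os.path.normpath(image_path)
    # splitext removes the final extension, roughly like ${i%.*} in the shell loop.
    base, _ext = os.path.splitext(rel)
    return os.path.join(out_root, base + ".txt")

target = mirrored_txt_path("./foo/bar/bat.jpg")
print(target)
# Before running the OCR tool, the output directory would need to exist, e.g.:
# os.makedirs(os.path.dirname(target), exist_ok=True)
```

Keeping the mapping in a small function makes it easy to check on a few sample paths before any files are written.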


Best answer: Something like this, in python (at least in 2.6 and 2.7, probably other versions too):

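from os import walk
from os.path import join
from subprocess import call

# os.walk (not os.path.walk) yields one (dirpath, dirnames, filenames) tuple per directory.
for (dirpath, dirnames, filenames) in walk('mypath'):
    for fn in filenames:
        if fn.endswith(".jpeg"):
            # Pass the path as its own argument so spaces in filenames are safe.
            call(["mycmd", join(dirpath, fn)])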
posted by rainy at 7:33 PM on August 15, 2010


I guess I should add that if you want to use the current directory instead of specifying a path, use os.getcwd() (after "import os") instead of 'mypath', and I hope you can change "mycmd" to the proper command for cuneiform.
posted by rainy at 7:37 PM on August 15, 2010
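
A slightly more forgiving variant of the walk-based approach, offered as a sketch rather than a finished tool: find_images is a hypothetical helper name, and the default extension list is an assumption. It matches extensions case-insensitively (similar to find -iname) and only collects paths, leaving the actual OCR call to the caller.

```python
import os

def find_images(root, extensions=(".jpg", ".jpeg")):
    """Return sorted paths of files under root whose names end in one of extensions."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Lower-case the name so .JPG and .Jpeg match as well.
            # str.endswith accepts a tuple of suffixes.
            if name.lower().endswith(extensions):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling find_images(os.getcwd()) would list the matching files under the current directory; each path could then be handed to the OCR command as a separate argument.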


Note that filetype recognition based on extension is a DOSism, and would break on Unix in a variety of circumstances:

- when the extension is "JPEG" instead of "jpeg", "jpg" instead of "jpeg", or whatever;

- when working on temporary files which do not have extensions (e.g. browser downloads).

Hence the preferred use of file to determine the type based on content.
posted by knz at 7:49 AM on August 16, 2010
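
To make the content-based idea concrete in Python, here is a minimal sketch that looks at the first bytes of a file instead of its name. It is far cruder than the file utility: sniff_image_type is a name invented here, and it only knows the standard JPEG and PNG signatures.

```python
def sniff_image_type(path):
    """Guess an image type from the first bytes of the file, ignoring its name."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    # JPEG files begin with the bytes FF D8 FF.
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # PNG files begin with a fixed 8-byte signature.
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # Unknown or non-image content.
    return None
```

A file saved without any extension would still be recognised, and a text file that happens to be named photo.jpg would not.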

