Using grep to extract x characters of text after a predictable pattern
March 15, 2013 3:39 AM

I'm trying to work out how to use something like grep or sed or awk (or maybe even some Perl) to extract a string of characters which appears in a predictable place in a series of text files. Help/advice/tutorials very welcome!

I have a number of text files in a Google Drive folder, which gets replicated on my Mac, and a new one gets added every day. The files are plain text and they aren't structured formally (as in, they're not XML or anything) but they each contain a certain string ("Total Portfolio Value") which is always followed by a space, followed by "$xx,xxx.xx", followed by five more spaces. The value of $xx,xxx.xx changes every day, and that's what I'm trying to extract, to put into a separate file.

I can use Automator to check whenever a new file appears and run a shell script on the file, so I'm trying to work out what goes in the shell script.

As much as anything else, I'm using this as a practical exercise to teach myself a bit about text processing using grep/sed/awk, Perl and regular expressions (any/all of the above!), so just a few pointers about the best approach to the contents of the shell script would be great!
posted by infinitejones to Computers & Internet (8 answers total) 5 users marked this as a favorite
grep -o '\$[0-9]*,[0-9]*\.[0-9]*'
posted by empath at 3:51 AM on March 15, 2013

That's kind of a quick and dirty way of doing it; it'll basically match $*,*.* where each * is any number of digits (including none).
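
To see why this is quick and dirty, here is a small sketch against two made-up lines (the "Cash Balance" line and both amounts are assumptions, not taken from the real files). The pattern picks up every dollar amount it finds, not just the portfolio total:

```shell
# Two hypothetical lines standing in for a real report.
all_amounts=$(printf '%s\n' \
  'Cash Balance $1,234.56     ' \
  'Total Portfolio Value $12,345.67     ' |
  grep -o '\$[0-9]*,[0-9]*\.[0-9]*')   # -o prints only the matched text

# Prints $1,234.56 and then $12,345.67, one per line.
printf '%s\n' "$all_amounts"
```

If the files only ever contain one dollar amount, that is fine; otherwise the pattern needs to be tied to the "Total Portfolio Value" label.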
posted by empath at 3:52 AM on March 15, 2013

Or similarly egrep -o 'Total Portfolio Value \$[^ ]*' /your/file, which will match "Total Portfolio Value " and then everything up to the next space.

(Because [^ ] means "anything that isn't a space", so [^ ]* means "as many characters as possible as long as they're not a space", and egrep -o means "search for the following and only return the result of the match".)
posted by katrielalex at 3:58 AM on March 15, 2013

Here's a solution with egrep and sed that searches for 'Total Portfolio Value', works if there are other numbers in the file (empath's solution returns all the $x,x.x in a file, which may not be what you want), and throws away the commas and the dollar sign.

$ cat test.txt
Total Portfolio Value $11,1234.56 more text
Extra line
$ egrep -o 'Total Portfolio Value \$[^ ]*' test.txt | sed -e 's/^Total Portfolio Value \$//' -e 's/,//g'
111234.56
posted by caek at 4:00 AM on March 15, 2013 [1 favorite]

And perhaps you then want to extract just the price, which you can do as egrep -o 'Total Portfolio Value \$[^ ]*' /tmp/foo | cut -d$ -f2 | tr -d ','.

(Because | means "send the output of the previous command to be the input of the subsequent one", cut -d$ -f2 means "split the input at every $, and extract the second field", and tr -d ',' means "delete all the commas".)
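
A rough sketch of what each stage of that pipeline produces, using a made-up line (the amount is an assumption). The stage1/stage2/stage3 variables are just names for illustration, and the $ is quoted for cut as a defensive habit rather than because it is required here:

```shell
line='Total Portfolio Value $12,345.67     '

# Stage 1: keep the label plus everything up to the next space.
stage1=$(printf '%s\n' "$line" | grep -oE 'Total Portfolio Value \$[^ ]*')

# Stage 2: split on "$" and keep the second field, i.e. the amount.
stage2=$(printf '%s\n' "$stage1" | cut -d'$' -f2)

# Stage 3: delete the thousands separator.
stage3=$(printf '%s\n' "$stage2" | tr -d ',')

# Prints:
#   Total Portfolio Value $12,345.67
#   12,345.67
#   12345.67
printf '%s\n' "$stage1" "$stage2" "$stage3"
```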
posted by katrielalex at 4:01 AM on March 15, 2013 [2 favorites]

In some of these examples, the [^ ]* should really be [^ ]+, i.e. require there to be at least one non-space character. * matches any number of characters, including zero, so some of the examples above would match a line ending in "Total Portfolio Value $".

Changing it to [^[:blank:]]+ would be more robust (catch tabs too) but wouldn't guarantee that it was a number which followed the '$' sign. [[:digit:],]+ is better still but would match an arbitrary number of commas with no digits.
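
A small sketch of the difference, using made-up input. The last pattern is only one possible stricter shape and is an assumption about the format (1 to 3 digits, optional groups of ",ddd", then exactly two decimals), not something confirmed by the real files:

```shell
empty='Total Portfolio Value $'
good='Total Portfolio Value $12,345.67     '

# With *, the bare label still matches, because * allows zero characters.
star=$(printf '%s\n' "$empty" | grep -oE 'Total Portfolio Value \$[^ ]*')

# With +, at least one non-blank character must follow the dollar sign,
# so nothing matches here (|| true hides grep's "no match" exit status).
plus=$(printf '%s\n' "$empty" | grep -oE 'Total Portfolio Value \$[^[:blank:]]+' || true)

# Assumed stricter shape: digits, optional ,ddd groups, two decimals.
strict=$(printf '%s\n' "$good" |
  grep -oE 'Total Portfolio Value \$[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}')

printf 'star=[%s] plus=[%s] strict=[%s]\n' "$star" "$plus" "$strict"
```

The stricter pattern would reject a string of bare commas, at the cost of breaking if the amounts ever appear without thousands separators in the expected places.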
posted by epo at 5:52 AM on March 15, 2013

Response by poster: These all look brilliant. Lots of detail for me to get my head around, which is what I was hoping for! Thanks very much everyone.
posted by infinitejones at 6:20 AM on March 15, 2013

None of the answers offered actually matches your specification. This will print the number found after "Total Portfolio Value", a space and a dollar sign, but only when five spaces follow it:
perl -lne 'print $1 if /Total Portfolio Value \$([\d.,]+) {5}/' input_file.txt
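
Tying this back to the Automator setup in the question, here is a minimal sketch of the same idea as a plain shell function, using grep and sed instead of Perl. The function name extract_total, the sample report and the log file are all hypothetical; in a real Automator "Run Shell Script" action the new file's path would presumably arrive as an argument, and the log would be a fixed file of your choosing:

```shell
# extract_total FILE: print the amount that follows "Total Portfolio Value $",
# requiring the five trailing spaces described in the question.
extract_total() {
  grep -oE 'Total Portfolio Value \$[0-9.,]+ {5}' "$1" |
    sed -e 's/^Total Portfolio Value \$//' -e 's/ *$//' |
    head -n 1   # keep the first match if the label appears more than once
}

# Demonstration with throwaway files (stand-ins for the real report and log).
report=$(mktemp "${TMPDIR:-/tmp}/report.XXXXXX")
log=$(mktemp "${TMPDIR:-/tmp}/values.XXXXXX")
printf '%s\n' 'Some header' \
  'Total Portfolio Value $12,345.67     more text' > "$report"

# Append today's value to the running log, as the question describes.
extract_total "$report" >> "$log"
logged=$(cat "$log")
printf '%s\n' "$logged"   # prints 12,345.67

rm -f "$report" "$log"
```

Appending with >> rather than overwriting with > is what lets the separate file build up one value per day.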

posted by nicwolff at 7:33 AM on March 15, 2013
