Using grep to extract x characters of text after a predictable pattern
March 15, 2013 3:39 AM
I'm trying to work out how to use something like grep or sed or awk (or maybe even some Perl) to extract a string of characters which appears in a predictable place in a series of text files. Help/advice/tutorials very welcome!
I have a number of text files in a Google Drive folder, which gets replicated on my Mac, and a new one gets added every day. The files are plain text and they aren't structured formally (as in, they're not XML or anything) but they each contain a certain string ("Total Portfolio Value") which is always followed by a space, followed by "$xx,xxx.xx", followed by five more spaces. The value of $xx,xxx.xx changes every day, and that's what I'm trying to extract, to put into a separate file.
I can use Automator to check whenever a new file appears and run a shell script on the file, so I'm trying to work out what goes in the shell script.
As much as anything else, I'm using this as a practical exercise to teach myself a bit about text processing using grep/sed/awk, Perl and regular expressions (any/all of the above!), so just a few pointers about the best approach to the contents of the shell script would be great!
that's kind of a quick and dirty way of doing it, it'll basically match $*,*.* where * is any number of digits.
posted by empath at 3:52 AM on March 15, 2013
Or similarly
egrep -o 'Total Portfolio Value \$[^ ]*' /your/file
which will match "Total Portfolio Value " and then everything up to the next space.
(Because [^ ] means "anything that isn't a space", so [^ ]* means "as many characters as possible as long as they're not a space", and egrep -o means "search for the following and only return the result of the match".)
posted by katrielalex at 3:58 AM on March 15, 2013
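To see what -o changes, here is a small sketch run against a made-up sample line. The sample text and the variable names are assumptions for illustration, not taken from the real files, and grep -E is used as the equivalent spelling of egrep.

```shell
# Hypothetical sample line; the real files are assumed to look roughly like this.
sample='Summary: Total Portfolio Value $12,345.67     (as of close)'

# Without -o, grep prints the whole line that contains a match.
whole_line=$(printf '%s\n' "$sample" | grep -E 'Total Portfolio Value \$[^ ]*')

# With -o, grep prints only the text the pattern matched;
# [^ ]* stops at the first space after the amount.
match_only=$(printf '%s\n' "$sample" | grep -E -o 'Total Portfolio Value \$[^ ]*')

printf 'whole line: %s\n' "$whole_line"
printf 'match only: %s\n' "$match_only"
```

The second variable holds just the label and the amount, which is why the later answers can strip the label off with sed or cut.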
Here's a solution with egrep and sed that searches for 'Total Portfolio Value', works if there are other numbers in the file (empath's solution returns all the $x,x.x in a file, which may not be what you want), and throws away the commas and the dollar sign.
$ cat test.txt
Total Portfolio Value $11,1234.56 more text
$99,999.99
Extra line
$ cat test.txt | egrep -o 'Total Portfolio Value \$[^ ]*' | sed -e 's/^Total\ Portfolio\ Value\ \$//' | sed -e 's/,//g'
111234.56
posted by caek at 4:00 AM on March 15, 2013
And perhaps you then want to extract just the price, which you can do as
egrep -o 'Total Portfolio Value \$[^ ]*' /tmp/foo | cut -d$ -f2 | tr -d ,
(Because | means "send the output of the previous command to be the input of the subsequent one", cut -d$ -f2 means "split the input at every $, and extract the second field", and tr -d , means "delete all the commas".)
posted by katrielalex at 4:01 AM on March 15, 2013
In some of these examples, the [^ ]* should really be [^ ]+, i.e. require there to be at least one non-space character. * matches any number including 0, so some of the examples above would match a line ending in "Total Portfolio Value $".
Changing it to [^[:blank:]]+ would be more robust (catch tabs too) but wouldn't guarantee that it was a number which followed the '$' sign. [[:digit:],]+ is better still but would match an arbitrary number of commas with no digits.
posted by epo at 5:52 AM on March 15, 2013
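The differences between these patterns can be checked with a rough sketch like the one below. The three sample lines, the helper names sample_lines and count_matches, and the final "strict" pattern (one to three digits, comma-separated groups of three, then a two-digit decimal part) are assumptions added for illustration; the strict pattern is one possible tightening, not something proposed in the thread.

```shell
# Three hypothetical input lines: a well-formed amount, an empty amount,
# and an "amount" made only of commas.
sample_lines() {
    printf '%s\n' \
        'Total Portfolio Value $12,345.67     next column' \
        'Total Portfolio Value $' \
        'Total Portfolio Value $,,,     next column'
}

# Count how many of the sample lines a given extended regex matches.
count_matches() {
    sample_lines | grep -E -c "$1"
}

prefix='Total Portfolio Value \$'
star=$(count_matches "${prefix}[^ ]*")            # also matches the empty amount
plus=$(count_matches "${prefix}[^ ]+")            # needs at least one character
digits=$(count_matches "${prefix}[[:digit:],]+")  # still accepts ",,,"
strict=$(count_matches "${prefix}[[:digit:]]{1,3}(,[[:digit:]]{3})*\.[[:digit:]]{2}")

printf '%s matched %s of 3 lines\n' '[^ ]*' "$star" '[^ ]+' "$plus" \
    '[[:digit:],]+' "$digits" 'strict' "$strict"
```

Only the strict pattern rejects both malformed lines, at the cost of assuming the amount always has exactly two decimal places.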
These all look brilliant. Lots of detail for me to get my head around, which is what I was hoping for! Thanks very much everyone.
posted by infinitejones at 6:20 AM on March 15, 2013
None of the answers offered actually matches your specification. This will print the number that follows "Total Portfolio Value", a space and a dollar sign, and is itself followed by five spaces:
perl -lne 'print $1 if /Total Portfolio Value \$([\d.,]+) {5}/' input_file.txt
posted by nicwolff at 7:33 AM on March 15, 2013
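Pulling the answers together, the script that Automator runs might look something like the sketch below. It assumes Automator passes the new file's path as the first argument; the helper name extract_value, the PORTFOLIO_LOG variable and the default log path are all hypothetical choices, not anything specified in the question.

```shell
#!/bin/sh
# Sketch: append the portfolio value from one report file to a running log.
# extract_value, PORTFOLIO_LOG and the default log path are hypothetical.

extract_value() {
    # Print the first amount that follows "Total Portfolio Value $",
    # with the dollar sign and commas removed.
    grep -E -o 'Total Portfolio Value \$[[:digit:],]+(\.[[:digit:]]+)?' "$1" |
        head -n 1 |
        cut -d '$' -f 2 |
        tr -d ','
}

# Only do the logging when a file path was actually passed in.
if [ "$#" -ge 1 ]; then
    log_file="${PORTFOLIO_LOG:-$HOME/portfolio_values.txt}"
    value=$(extract_value "$1")
    if [ -n "$value" ]; then
        # One line per report: file name, a tab, then the value.
        printf '%s\t%s\n' "$(basename "$1")" "$value" >> "$log_file"
    else
        echo "no Total Portfolio Value found in $1" >&2
    fi
fi
```

Writing the file name next to each value keeps the log readable if the same report is ever processed twice.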
This thread is closed to new comments.
posted by empath at 3:51 AM on March 15, 2013