working with tar archives...
August 13, 2008 10:10 PM

How can I get the sum of the file sizes of all PHP and CSS files within specific folders of a tar archive?

Ultimately, I want to create a graph showing how the code grew over the duration of a project. To do that, I would like to build a script that sums the sizes of files of specific types within specific directories of each archive file.

I have a folder containing a couple hundred backup files of a web development server. The backup files were archived using the "tar czf" command line.

In each archive file, I want to search within the following directories for php and css files:
Subdirectories of these directories should not be searched.

Can you guys get me on my way?

So far I'm stuck trying to list files I need in the tar file. I have:
tar -ztvf archive_name.tgz var/www/dev/
which is listing all subdirectories and all file types.

The OS is Ubuntu 8.04 Server Edition

Thanks in advance!
posted by timebomb to Computers & Internet (18 answers total) 2 users marked this as a favorite
You could write a regexp that selects only the files of interest — something like var/www/dev(|/includes|/css)/[^/]+$ — and then use that in an awk or perl script to select the lines, sum the file sizes into a variable, and print out the value of that variable at the end (e.g. in an END block). Pipe the output of tar tzvf into said script, and you have an uncle Robert.
posted by hattifattener at 10:33 PM on August 13, 2008
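
For reference, the regexp-plus-awk idea above could be sketched roughly like this. It assumes GNU tar's verbose listing (size in field 3, path in field 6), the three directories named in the regexp, and paths without spaces; the function name sum_listing is made up for illustration.

```shell
#!/bin/sh
# sum_listing: read a `tar -tzvf` listing on stdin and print the total size
# in bytes of the .php and .css files sitting directly inside var/www/dev,
# var/www/dev/includes or var/www/dev/css (subdirectories are skipped).
# Assumption: GNU tar layout, i.e. size in field 3 and path in field 6.
sum_listing() {
    awk '
        $6 ~ /^var\/www\/dev(\/includes|\/css)?\/[^\/]+\.(php|css)$/ { sum += $3 }
        END { print sum + 0 }   # "+ 0" prints 0 instead of an empty line
    '
}

# Usage: tar -tzvf archive_name.tgz | sum_listing
```

The `[^\/]+` part is what keeps deeper subdirectories out: it refuses to match across another slash.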

Someone will swoop in within a few minutes and do this more elegantly, but this should do the trick:

echo "$(tar -tvf asdf.tar | awk '/var\/www\/dev\/.+css/ {print $3}' | perl -pe 's/\n/+/g')0" | bc
posted by chrisamiller at 10:50 PM on August 13, 2008

I was going to let you figure out the regex yourself, but what the hell. This should work:

echo "$(tar -tvf asdf.tar | awk '/var\/www\/dev\/(includes\/|css\/)?([^\/])+(\.css|\.php)/ {print $3}' | perl -pe 's/\n/+/g')0" | bc
posted by chrisamiller at 10:56 PM on August 13, 2008

The output from tar ztvf has six whitespace-delimited fields per line, of which the third is filesize and the sixth is pathname. You need to filter the tar listing by pathname, then extract and sum the filesize fields. This is exactly the kind of job that awk was made for.

tar -ztvf archive_name.tgz |
awk '
$6 ~ /^var\/www\/dev\/[^/]*$/,
$6 ~ /^var\/www\/dev\/includes\/[^/]*$/,
$6 ~ /^var\/www\/dev\/css\/[^/]*$/ {
    if ($6 ~ /\.php$/) totalphp += $3
    if ($6 ~ /\.css$/) totalcss += $3
}
END {print totalphp, totalcss}'

should be pretty close. Post back if man awk doesn't help you enough with the syntax.
posted by flabdablet at 11:09 PM on August 13, 2008

Continuing hattifattener's idea, do something like:

tar -ztvf archive_name.tgz var/www/dev | awk '{sum += $3} END {print sum}'

...for each of your directories.
posted by jquinby at 11:09 PM on August 13, 2008

Bollocks. I really ought to test these things before posting them. Try

tar -ztvf archive_name.tgz |
awk '
$6 ~ /^var\/www\/dev\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/includes\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/css\/[^\/]*$/ {
    if ($6 ~ /\.php$/) totalphp += $3
    if ($6 ~ /\.css$/) totalcss += $3
}
END {print totalphp, totalcss}'

Just tested this (with names changed to suit a randomly selected .tar.gz of my own) on Ubuntu 7.10 and it works for me.
posted by flabdablet at 11:42 PM on August 13, 2008

I've been working with the second suggestion from chrisamiller, which seems to be doing exactly what I need! I'm trying to write a script that will print that total for each .tgz file in the current directory. It's not working yet, but here's what I have so far:


FILES=`ls *.tgz`

for FILE in $FILES
echo "$(tar -tvf $FILE | awk '/var\/www\/dev\/(includes\/|css\/)?([^\/])+(\.css|\.php)/ {print $3}' | perl -pe 's/\n/+/g')$

I'm getting errors like this when running it:
rsh: capstone04-2008-05-12-Monday-23: Name or service not known
tar: capstone04-2008-05-12-Monday-23\:52\:24.tgz: Cannot open: Input/output error
tar: Error is not recoverable: exiting now

Is it because of the escape characters? I feel like I'm soooo close to having this! Thanks guys.
posted by timebomb at 11:46 PM on August 13, 2008

Well, to begin with, you don't need the 'ls' command. It will simplify to:

for file in *.tgz
. . .

You might also try quoting your filename:
echo '$(tar -tvf "$FILE" | awk . . . )'
posted by chrisamiller at 11:55 PM on August 13, 2008

oops - did that backwards:

You might also try quoting your filename:
echo "$(tar -tvf "$FILE" | awk . . . )"
posted by chrisamiller at 11:57 PM on August 13, 2008
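
A small sketch of how the quoting behaves, in case it helps: quoting starts afresh inside $( ... ), so double quotes can nest there, while single quotes stop the variable from expanding at all. The file name below is hypothetical, picked only because it contains a space.

```shell
#!/bin/sh
# Hypothetical file name, chosen only because it contains a space.
FILE='my backup.tgz'

# Double quotes nest inside $(...): the variable expands and the space survives.
nested_double="$(printf '%s' "$FILE")"

# Single quotes inside $(...) suppress expansion: this is the literal text $FILE.
nested_single="$(printf '%s' '$FILE')"

echo "double: $nested_double"
echo "single: $nested_single"
```

Note that quoting protects against spaces and globbing only; it does not change how tar itself interprets a colon in the name.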

you could try this:

put this in
echo -n "$@: "
tar tzvf "$@" | awk '
$6 ~ /^var\/www\/dev\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/includes\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/css\/[^\/]*$/ {
    if ($6 ~ /\.php$/) totalphp += $3
    if ($6 ~ /\.css$/) totalcss += $3
}
END {print totalphp, totalcss}'
chmod 700

then run:

find . -maxdepth 1 -type f -name \*.tgz -print0 | xargs --null -l ./
posted by ffej at 11:58 PM on August 13, 2008

Thanks for the suggestion, I tried the double quotes but am getting the same thing. ..
posted by timebomb at 12:01 AM on August 14, 2008

timebomb, I think the reason is that some of your filenames have characters in them that tar is interpreting as meaning to retrieve a file from a remote host (hence the messages about rsh).

I'd just do it this way:

for file in *.tgz
do
    echo "Summing $file ..."
    cat "$file" | tar tzvf - | awk ' /crazyregexp/ { SUM += $3 }; END { print "Total size is", SUM }'
done

where crazyregexp is the regexp from my, chrisamiller's, or your post.
posted by hattifattener at 12:11 AM on August 14, 2008

I tried out the "find . -maxdepth 1 -type f -name \*.tgz" command to see what the difference would be. It just prepended ./ to the file name, so for now I just did that manually in the script and it is working!

Thanks guys!
posted by timebomb at 12:12 AM on August 14, 2008

(from the GNU tar man page:
-f [hostname:]file
Read or write the specified file […] If a hostname is specified, tar will use rmt(8) to read or write the specified file on a remote machine. “-” may be used as a filename, for reading or writing to/from stdin/stdout.)
posted by hattifattener at 12:16 AM on August 14, 2008

ahh, so it was interpreting the filename as a hostname. makes sense
posted by timebomb at 12:28 AM on August 14, 2008

cat "$file" | tar -tzvf - | ...

is the canonical useless use of cat. In general, instead of catting something down a pipe you should just use input redirection:

tar -tzvf - <"$file" | ...

though in this specific instance, the Right Thing is probably just to turn off tar's support for rsh:

tar -tzvf "$file" --force-local | ...
posted by flabdablet at 3:46 AM on August 14, 2008
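
Putting the pieces of this thread together, a per-archive report might look something like the sketch below. It uses the input-redirection form so tar never sees the colons in the file name, and it assumes GNU tar's listing layout (size in field 3, path in field 6). The function name report_archives and the tab-separated output are illustrative choices, not anything from the posts above.

```shell
#!/bin/sh
# report_archives DIR: print "archive<TAB>bytes" for every .tgz file in DIR,
# counting only .php and .css files directly inside the three dev folders.
# Assumption: GNU tar verbose listing (size in field 3, path in field 6).
report_archives() {
    for file in "$1"/*.tgz; do
        [ -e "$file" ] || continue   # no matches: the glob stays literal
        # Reading the archive from stdin means tar is never handed the name,
        # so "host:file"-style names cannot be mistaken for remote archives.
        total=$(tar -tzvf - < "$file" | awk '
            $6 ~ /^var\/www\/dev(\/includes|\/css)?\/[^\/]+\.(php|css)$/ { sum += $3 }
            END { print sum + 0 }')
        printf '%s\t%s\n' "${file##*/}" "$total"
    done
}

# Usage: report_archives . > sizes.tsv
```

Since the backup names appear to carry the date, the two output columns should be enough to feed a plotting tool.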

A more robust solution that is not vulnerable to weird filenames might be Perl's Archive::Tar (its entries are Archive::Tar::File objects).

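use strict;
use warnings;
use Archive::Tar;

foreach my $filename (@ARGV) {
    # Read the archive named on the command line (it was never loaded before).
    my $tar = Archive::Tar->new($filename)
        or die "Cannot read $filename: " . Archive::Tar->error . "\n";
    foreach my $item ($tar->get_files) {
        next unless $item->name =~ /\.(?:css|php)$/;
        print join("\t", $filename, $item->name, $item->size), "\n";
    }
}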

posted by benzenedream at 4:06 AM on August 14, 2008

Huh. Another demonstration that there's always more than one way to do it:

tar xzfO archive.tar.gz --wildcards '*.css' '*.php' | wc -c
posted by sfenders at 6:24 AM on August 14, 2008 [1 favorite]
