working with tar archives...
August 13, 2008 10:10 PM
How can I get the sum of the file sizes of all PHP and CSS files within specific folders of a tar archive?
Ultimately, I want to create a graph showing how the code grew over the duration of a project. To do that, I would like to build a script that sums the sizes of files of specific types within specific directories of each archive file.
I have a folder containing a couple hundred backup files of a web development server. The backup files were archived using the "tar czf" command line.
In each archive file, I want to search within the following directories for php and css files:
var/www/dev/
var/www/dev/includes
var/www/dev/css
Subdirectories of these directories should not be searched.
Can you guys get me on my way?
So far I'm stuck trying to list files I need in the tar file. I have:
tar -ztvf archive_name.tgz var/www/dev/
which is listing all subdirectories and all file types.
The OS is Ubuntu 8.04 Server Edition
Thanks in advance!
Someone will swoop in within a few minutes and do this more elegantly, but this should do the trick:
echo "$(tar -tvf asdf.tar | awk '/var\/www\/dev\/.+css/ {print $3}' | perl -pe 's/\n/+/g')0" | bc
posted by chrisamiller at 10:50 PM on August 13, 2008
Best answer: I was going to let you figure out the regex yourself, but what the hell. This should work:
echo "$(tar -tvf asdf.tar | awk '/var\/www\/dev\/(includes\/|css\/)?([^\/])+(\.css|\.php)/ {print $3}' | perl -pe 's/\n/+/g')0" | bc
posted by chrisamiller at 10:56 PM on August 13, 2008
The output from tar ztvf has six whitespace-delimited fields per line, of which the third is filesize and the sixth is pathname. You need to filter the tar listing by pathname, then extract and sum the filesize fields. This is exactly the kind of job that awk was made for.
tar -ztvf archive_name.tgz |
awk '
$6 ~ /^var\/www\/dev\/[^/]*$/,
$6 ~ /^var\/www\/dev\/includes\/[^/]*$/,
$6 ~ /^var\/www\/dev\/css\/[^/]*$/ {
if ($6 ~ /\.php$/) totalphp += $3
if ($6 ~ /\.css$/) totalcss += $3
}
END {print totalphp, totalcss}
'
should be pretty close. Post back if man awk doesn't help you enough with the syntax.
posted by flabdablet at 11:09 PM on August 13, 2008
Continuing hattifattener's idea, do something like:
tar -ztvf archive_name.tgz var/www/dev | awk '{sum += $3} END {print sum}'
...for each of your directories.
posted by jquinby at 11:09 PM on August 13, 2008
Best answer: Bollocks. I really ought to test these things before posting them. Try
tar -ztvf archive_name.tgz |
awk '
$6 ~ /^var\/www\/dev\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/includes\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/css\/[^\/]*$/ {
if ($6 ~ /\.php$/) totalphp += $3
if ($6 ~ /\.css$/) totalcss += $3
}
END {print totalphp, totalcss}
'
Just tested this (with names changed to suit a randomly selected .tar.gz of my own) on Ubuntu 7.10 and it works for me.
posted by flabdablet at 11:42 PM on August 13, 2008
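To see what that filter keeps and what it skips, the same idea can be run against a hand-written listing instead of a real archive. This is only a sketch: the function names, sample files, and sizes are made up for illustration, the three directory tests are folded into one pattern, and it assumes the GNU-style listing layout described earlier (size in field 3, path in field 6, no spaces in paths).

```shell
#!/bin/sh
# sum_listing: read "tar -tvf" style lines on stdin and print
# "php_bytes css_bytes". Assumes GNU tar's six-field layout.
sum_listing() {
    awk '
        # Keep only files directly inside the three wanted directories.
        $6 ~ "^var/www/dev/(includes/|css/)?[^/]+$" {
            if ($6 ~ "\\.php$") php += $3
            if ($6 ~ "\\.css$") css += $3
        }
        # "+ 0" forces a numeric 0 when nothing matched.
        END { print php + 0, css + 0 }
    '
}

# sample_listing: hypothetical listing, not taken from a real backup.
sample_listing() {
    cat <<'EOF'
drwxr-xr-x root/root         0 2008-05-12 23:52 var/www/dev/
-rw-r--r-- root/root       100 2008-05-12 23:52 var/www/dev/index.php
-rw-r--r-- root/root       200 2008-05-12 23:52 var/www/dev/includes/db.php
-rw-r--r-- root/root        50 2008-05-12 23:52 var/www/dev/css/site.css
-rw-r--r-- root/root       999 2008-05-12 23:52 var/www/dev/includes/lib/deep.php
-rw-r--r-- root/root       777 2008-05-12 23:52 var/www/dev/logo.png
EOF
}

sample_listing | sum_listing
```

The file under includes/lib/ and the .png are left out of both totals, which is the "no subdirectories, only php and css" behaviour the question asks for.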
Response by poster: I've been working with the second suggestion from chrisamiller, which seems to be doing exactly what I need! I'm trying to write a script that will print that total for each .tgz file in the current directory. It's not working yet, but here's what I have so far:
#!/bin/sh
FILES=`ls *.tgz`
for FILE in $FILES
do
echo "$(tar -tvf $FILE | awk '/var\/www\/dev\/(includes\/|css\/)?([^\/])+(\.css|\.php)/ {print $3}' | perl -pe 's/\n/+/g')$
done
I'm getting errors like this when running it:
rsh: capstone04-2008-05-12-Monday-23: Name or service not known
tar: capstone04-2008-05-12-Monday-23\:52\:24.tgz: Cannot open: Input/output error
tar: Error is not recoverable: exiting now
Is it because of the escape characters? I feel like I'm soooo close to having this! Thanks guys.
posted by timebomb at 11:46 PM on August 13, 2008
Well, to begin with, you don't need the 'ls' command. It will simplify to:
#!/bin/sh
for file in *.tgz
do
. . .
You might also try quoting your filename:
echo '$(tar -tvf "$FILE" | awk . . . )'
posted by chrisamiller at 11:55 PM on August 13, 2008
oops - did that backwards:
You might also try quoting your filename:
echo "$(tar -tvf '$FILE' | awk . . . )"
posted by chrisamiller at 11:57 PM on August 13, 2008
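Quoting inside $( ) starts fresh, so the file name can take its own double quotes even when the whole substitution is double-quoted, while single quotes there pass the literal text $FILE to the command. A small check of the three forms, using a made-up file name with a space in it (the name and the show_arg helper are assumptions for illustration only):

```shell
#!/bin/sh
# Hypothetical archive name containing a space, purely for illustration.
FILE='my backup.tgz'

# show_arg prints its first argument in brackets so its boundaries are visible.
show_arg() {
    printf '[%s]' "$1"
}

# Double quotes inside $( ): the variable expands and stays one argument.
double_inside="$(show_arg "$FILE")"

# Single quotes inside $( ): no expansion, the command sees the text $FILE.
single_inside="$(show_arg '$FILE')"

# No quotes: the value is split at the space, so only "my" arrives as $1.
unquoted="$(show_arg $FILE)"

echo "$double_inside $single_inside $unquoted"
```

For the loop in question that would suggest tar -tvf "$FILE" inside the substitution.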
you could try this:
put this in sum.sh:
echo -n "$@: "
tar tzvf $@ | awk '
$6 ~ /^var\/www\/dev\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/includes\/[^\/]*$/ ||
$6 ~ /^var\/www\/dev\/css\/[^\/]*$/ {
if ($6 ~ /\.php$/) totalphp += $3
if ($6 ~ /\.css$/) totalcss += $3
}
END {print totalphp, totalcss}'
chmod 700 sum.sh
then run:
find . -maxdepth 1 -type f -name \*.tgz -print0 | xargs --null -l ./sum.sh
posted by ffej at 11:58 PM on August 13, 2008
Response by poster: Thanks for the suggestion, I tried the double quotes but am getting the same thing...
posted by timebomb at 12:01 AM on August 14, 2008
timebomb, I think the reason is that some of your filenames have characters in them that tar is interpreting as meaning to retrieve a file from a remote host (hence the messages about rsh).
I'd just do it this way:
for file in *.tgz
do
echo "Summing $file ..."
cat "$file" | tar tzvf - | awk ' /crazyregexp/ { SUM += $3 }; END { print "Total size is",SUM }'
done
where crazyregexp is the regexp from my, chrisamiller's, or your post.
posted by hattifattener at 12:11 AM on August 14, 2008
Response by poster: I tried out the "find . -maxdepth 1 -type f -name \*.tgz" command to see what the difference would be. It just prepended ./ to the file name, so for now I just did that manually in the script and it is working!
Thanks guys!
posted by timebomb at 12:12 AM on August 14, 2008
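The ./ fix can be folded into the script rather than typed by hand. A minimal sketch, on the assumption that tar only reads a name as host:file when the colon comes before any slash; local_name and list_archive are hypothetical helper names, not anything from the posts above.

```shell
#!/bin/sh
# local_name: hypothetical helper that makes a relative name start with "./"
# so a colon later in the name is not taken as a host separator.
local_name() {
    case $1 in
        /*|./*|../*) printf '%s\n' "$1" ;;    # already starts with a path
        *)           printf './%s\n' "$1" ;;  # bare name: add the prefix
    esac
}

# list_archive: list one gzipped archive through the prefixed name.
list_archive() {
    tar -tzvf "$(local_name "$1")"
}
```

Writing the glob as ./*.tgz in the for loop should have the same effect without a helper.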
(from the GNU tar man page:
-f [hostname:]file
Read or write the specified file […] If a hostname is specified, tar will use rmt(8) to read or write the specified file on a remote machine. “-” may be used as a filename, for reading or writing to/from stdin/stdout.)
posted by hattifattener at 12:16 AM on August 14, 2008
Response by poster: ahh, so it was interpreting the filename as a hostname. makes sense
posted by timebomb at 12:28 AM on August 14, 2008
cat "$file" | tar -tzvf - | ...
is the canonical useless use of cat. In general, instead of catting something down a pipe you should just use input redirection:
tar -tzvf - <"$file" | ...
though in this specific instance, the Right Thing is probably just to turn off tar's support for rsh:
tar -tzvf "$file" --force-local | ...
posted by flabdablet at 3:46 AM on August 14, 2008
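Putting the pieces of this thread together, one possible shape for the whole job is a loop that prints one line per archive, ready to feed to a plotting tool. This is a sketch under stated assumptions rather than a tested solution: it assumes GNU tar (for --force-local and the six-field listing), archive names carrying a YYYY-MM-DD date like the one in the error messages above, and paths without spaces. The function names are made up.

```shell
#!/bin/sh
# archive_date: print the YYYY-MM-DD date embedded in an archive name.
# Assumes names like capstone04-2008-05-12-Monday-23:52:24.tgz.
archive_date() {
    printf '%s\n' "$1" |
        sed -n 's/.*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p'
}

# archive_totals: print "php_bytes css_bytes" for one archive.
# --force-local (GNU tar) keeps a colon in the name from meaning host:file.
archive_totals() {
    tar --force-local -tzvf "$1" |
        awk '
            # Files directly inside dev/, dev/includes/ or dev/css/ only.
            $6 ~ "^var/www/dev/(includes/|css/)?[^/]+$" {
                if ($6 ~ "\\.php$") php += $3
                if ($6 ~ "\\.css$") css += $3
            }
            END { print php + 0, css + 0 }
        '
}

# report_archives: one "date php_bytes css_bytes" line per .tgz in this directory.
report_archives() {
    for archive in *.tgz; do
        [ -e "$archive" ] || continue   # the glob matched nothing
        printf '%s %s\n' "$(archive_date "$archive")" "$(archive_totals "$archive")"
    done
}
```

Calling report_archives and redirecting its output to a file would give a plain-text table with one row per backup.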
A more robust solution that is not vulnerable to weird filenames might be Perl's Archive::Tar::File.
use Archive::Tar;

foreach my $filename (@ARGV) {
    my $tar = Archive::Tar->new;
    $tar->read($filename, 1);
    my @items = $tar->get_files;
    foreach my $item (@items) {
        next unless $item->name =~ /\.css$|\.php$/;
        print join "\t", ($filename, $item->name, $item->size, "\n");
    }
}
posted by benzenedream at 4:06 AM on August 14, 2008
Huh. Another demonstration that there's always more than one way to do it:
tar xzfO archive.tar.gz '*.css' '*.php' | wc -c
posted by sfenders at 6:24 AM on August 14, 2008 [1 favorite]
posted by hattifattener at 10:33 PM on August 13, 2008