I need a script to compare files in a directory to files referenced in web pages.
September 13, 2004 3:37 PM

Can anyone recommend a program (in, say, Perl or Java or something) that will run on my Linux web server and a) slurp the file references out of the pages (HTML, PHP) and b) compare them to the actual files in my web root tree, giving me a list of all the unreferenced files (those not referenced on any public web page)? I inherited this junky file system and want to clean it up, but I don't want to break any links. Thanks.
posted by pissfactory to Computers & Internet (5 answers total)
 
for i in `find files` ; do if ( ! grep -q $i *.html ) then echo $i ; fi ; done

...or you could use wget, then find and diff, which would be a little more reliable.
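
One way to read the wget-then-find-and-diff suggestion, as a rough sketch only: mirror the site, list the files in both trees, and print whatever exists in the web root but never showed up in the mirror. This uses comm rather than diff because comm prints just the names unique to one list. The function names, the example.com URL and the paths are made up for illustration, and mktemp is assumed to be available.

```shell
# Sketch only: list files present under a web root but absent from a wget mirror.

# list_files DIR: print every regular file under DIR, relative to DIR, sorted.
list_files() {
    ( cd "$1" && find . -type f | LC_ALL=C sort )
}

# orphans WEBROOT MIRROR: print files that exist in WEBROOT but not in MIRROR.
orphans() {
    webroot_list=$(mktemp)
    mirror_list=$(mktemp)
    list_files "$1" > "$webroot_list"
    list_files "$2" > "$mirror_list"
    # comm -23 keeps only the lines unique to the first (sorted) input.
    LC_ALL=C comm -23 "$webroot_list" "$mirror_list"
    rm -f "$webroot_list" "$mirror_list"
}

# Example (not run here; URL and paths are placeholders):
#   wget -q -r -l 0 -nH -P /tmp/mirror http://www.example.com/
#   orphans /var/www/html /tmp/mirror
```

Pages fetched with a query string end up in the mirror under names that include the query string, so PHP files reached only that way will probably be reported as orphans and need checking by hand.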
posted by sfenders at 3:59 PM on September 13, 2004


pissfactory: I wrote just such a thing years ago. Code here, and brief (almost non) manual here. No guarantees that it runs, works, does anything productive, or doesn't attract space aliens to your house.
posted by weston at 4:50 PM on September 13, 2004


And is sfenders some kind of shell ninja or what?
posted by weston at 5:29 PM on September 13, 2004


linklint -orphan
posted by nicwolff at 6:17 PM on September 13, 2004


You could also just wget -r -l 0 your website. After that, every file whose access time is earlier than the moment you started wgetting is orphaned and can be moved or deleted. (To list them, assuming the crawl began less than ten minutes ago, use find -amin +10, or ls **/*(.am+10) in the ever wonderful zsh.)
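
A sketch of the access-time idea that does not depend on counting minutes: touch a marker file before the crawl, then ask find for files that have not been read since. It assumes the filesystem actually records access times (a noatime mount will not, and relatime may skip some updates) and a find that supports -anewer, as GNU and BSD find do. The function name, URL and paths are hypothetical.

```shell
# Sketch only: find files whose access time predates a marker file.

# not_accessed_since MARKER DIR: print regular files under DIR that have not
# been read since MARKER was last modified.
not_accessed_since() {
    find "$2" -type f ! -anewer "$1"
}

# Example (not run here; URL and paths are placeholders):
#   touch /tmp/crawl-start
#   wget -q -r -l 0 --delete-after http://www.example.com/
#   not_accessed_since /tmp/crawl-start /var/www/html
```

Treat the output as a list of candidates to move aside first, not to delete outright.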

sfenders' snippet will break if you have filenames with spaces in them. You can reasonably easily convert it to something* that only breaks on filenames containing newlines, but even those can happen. All in all, the only safe filename separator is \0, which is sadly harder to work with in shells.

* find . | while IFS= read -r i; do if ( ! grep -rq "$i" * ) then echo "$i" ; fi ; done (this also works if your html files aren't all in the current directory)
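
For the \0 separator mentioned above, a minimal sketch: find -print0 hands NUL-terminated names to xargs -0, so spaces and even newlines in filenames survive intact. It matches on each file's base name as a fixed string, which errs on the side of calling a file referenced. The function name is hypothetical, and the --include option assumes GNU or BSD grep.

```shell
# Sketch only: NUL-separated variant of the grep loop.

# unreferenced WEBROOT: print files under WEBROOT whose base name appears in
# no .html or .php file under WEBROOT.
unreferenced() {
    find "$1" -type f -print0 | xargs -0 sh -c '
        root=$1; shift
        for f in "$@"; do
            name=${f##*/}    # strip the directory part
            if ! grep -rqF --include="*.html" --include="*.php" -e "$name" "$root"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$1"
}

# Example (path is a placeholder):
#   unreferenced /var/www/html
```

Links that are URL-encoded in the pages (%20 for a space, say) will not match the on-disk name, so those files show up as unreferenced and need a manual look. Entry pages such as index.html are listed too unless some other page links to them by name.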
posted by fvw at 8:34 AM on September 14, 2004

