Get rid of old files on legacy website
May 5, 2017 3:04 AM

Is there a tool that can tell me which website files have not been accessed over a given period of time?

I'm working on a large legacy website. Rebuilding is not an option, but I'd like to at least upgrade the scripts (they rely on register globals and magic quotes). There are thousands of scripts, though, and I suspect a large portion are not in use. I don't want to waste time upgrading stuff that isn't in use, and I'd like to get rid of it to clean things up.

Is there a way to create a list of all files accessed on a website (including PHP files pulled in via include) over a specific period of time? (Or, even better, a tool that will tell me which files are definitely not in use?)
posted by missmagenta to Computers & Internet (7 answers total) 1 user marked this as a favorite
 
Yes.

The web server maintains a log of every request. You need the server access logs for the period you are interested in, and someone who can process the logs to tell you which resources have been requested. This is very, very easy to do.

If your website is internet-facing, you'll probably find that every resource has been requested. This is where your friendly web geek can help categorize the requests. On rarely-used resources, most accesses will be from crawlers.

Working with your geek, you can identify the resources (web pages, scripts, databases, whatever) that can be removed with little or no impact on actual users.
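
A rough sketch of that log processing in Python, in case it helps. It assumes the server writes the common "combined" log format, and the crawler check is just a guess based on a few user-agent substrings (BOT_HINTS is my own incomplete list, not anything standard), so treat the output as a starting point rather than a verdict.

```python
import re
from collections import Counter

# Matches the request and user-agent fields of a "combined" format log line.
LINE_RE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumption: an incomplete list of substrings that suggest a crawler.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")


def human_hits(lines):
    """Count requests per path, skipping user agents that look like crawlers."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a line in the assumed format
        agent = match.group("agent").lower()
        if any(hint in agent for hint in BOT_HINTS):
            continue
        # Drop the query string so /page.php?id=1 and /page.php?id=2 merge.
        hits[match.group("path").split("?", 1)[0]] += 1
    return hits
```

You would feed it an open log file, e.g. human_hits(open("access.log")), where the file name is a placeholder for wherever your server keeps its logs. Any script that never shows up in the result over a long enough period is a candidate for a closer look.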

I* have no idea how you identify php includes without actually parsing the php and building a dependency tree, but your friendly web geek will be able to help. There will be tools.

* - I have been paid actual money to deliver code in over 20 languages. I would never, ever code in PHP. It is nasty.
posted by Combat Wombat at 4:19 AM on May 5, 2017


Best answer: If the web server runs Linux and you have command-line access, you can use the find command with the -atime option to find files that haven't been accessed in a given number of days.

(Disclaimer: I'm on a mobile device and can't check if this is 100 percent right)

For example, from the directory you want to check, 'find . -atime +30 -print' should list all files that have not been accessed in more than 30 days.
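
If you'd rather have the list in a script so you can filter or sort it, here is a minimal Python sketch of the same idea. The function name and the cutoff arithmetic are mine; it compares against an exact number of seconds, so it will not match find's whole-day rounding in every edge case.

```python
import os
import time


def not_accessed_since(root, days, now=None):
    """Return paths under root whose last access time is older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat reads metadata only, so it does not touch the access time.
            if os.lstat(path).st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling not_accessed_since(".", 30) from the webroot gives roughly the same list as the find command.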

You can pipe the find output through xargs to remove those files too.

posted by Gev at 4:30 AM on May 5, 2017 [4 favorites]


+1 for Gev's access time idea.

Web server logs can tell you which static resources (e.g. images) have been accessed, but they won't help with PHP includes.
posted by matthewr at 4:54 AM on May 5, 2017


Response by poster: Perfect!
posted by missmagenta at 5:16 AM on May 5, 2017


If you use Gev's method, make sure the webroot filesystem is not mounted with the "noatime" option
(just running mount should tell you the options used).
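
A small sketch for checking this from a script, assuming a Linux box where /proc/mounts lists one mount per line as "device mountpoint fstype options ...". It also reports "relatime", which as far as I know only updates the access time about once a day (or when the file has changed since the last access), so it should still be fine for a 30-day check.

```python
def atime_options(mounts_text):
    """Map each mount point to the atime-related options it is mounted with."""
    interesting = {"noatime", "relatime"}
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        result[mount_point] = sorted(interesting.intersection(options))
    return result
```

On the server you would call atime_options(open("/proc/mounts").read()) and look up the mount point that holds the webroot. If it shows "noatime", the access times are not being updated and the find results mean nothing.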
posted by dvr at 6:51 AM on May 5, 2017 [1 favorite]


I'd also shove any deleted files off to the side.

I did something like this once, and was all, "This script is no longer used. Delete!" and a couple years later I get a call, "How come our precinct caucus results reporting app no longer works?" Just because it's only used every 4 years doesn't mean it's not used.
posted by cjorgensen at 9:05 AM on May 5, 2017 [1 favorite]


Archive rather than delete, in case there are surprises like the one cjorgensen mentioned.
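
Something along these lines would do the archiving, as a sketch. The function and the directory arguments are hypothetical; the point is just to keep each file's path relative to the webroot so it can be put back exactly where it was.

```python
import os
import shutil


def archive_files(paths, webroot, archive_root):
    """Move each file into archive_root, keeping its path relative to webroot."""
    moved = []
    for path in paths:
        relative = os.path.relpath(path, webroot)
        # Refuse anything that lives outside the webroot.
        if relative == os.pardir or relative.startswith(os.pardir + os.sep):
            raise ValueError(f"{path} is not under {webroot}")
        target = os.path.join(archive_root, relative)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(path, target)
        moved.append(target)
    return moved
```

Keep the archive outside the webroot so the old scripts can no longer be requested, and hang on to it for longer than the longest cycle anyone can think of.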

The atime check is good, but sometimes you'll get bad information from it if something on the system periodically reads all the files, like some backup systems or a virus scanner run over the whole webroot. In those cases, you might need to do something like use a PHP dependency mapper in conjunction with your web logs to see what's in active use.
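
To give an idea of what a dependency mapper does, here is a very crude Python sketch, not a real tool. It only sees include/require statements with a literal quoted path, matches files by base name, and starts from whatever set of scripts the web logs say were requested. Includes built from variables (likely common in register-globals-era code) are invisible to it, so anything it flags is only a candidate.

```python
import os
import re

# include/require (and the _once variants) followed by a quoted literal path.
INCLUDE_RE = re.compile(
    r"""\b(?:include|require)(?:_once)?\s*\(?\s*['"]([^'"]+)['"]""",
    re.IGNORECASE,
)


def literal_includes(php_source):
    """Return the literal paths named in include/require statements."""
    return INCLUDE_RE.findall(php_source)


def reachable(sources, requested):
    """Follow literal includes outward from the requested scripts.

    `sources` maps each file path to its PHP source text; `requested` is the
    set of paths seen in the web logs. Included files are matched by base
    name only, which is an assumption and can over-match.
    """
    by_name = {}
    for path in sources:
        by_name.setdefault(os.path.basename(path), []).append(path)
    seen = set()
    queue = [p for p in requested if p in sources]
    while queue:
        path = queue.pop()
        if path in seen:
            continue
        seen.add(path)
        for included in literal_includes(sources[path]):
            queue.extend(by_name.get(os.path.basename(included), []))
    return seen
```

Files in sources that are not in the reachable set are the ones to investigate by hand before archiving.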
posted by Candleman at 9:19 AM on May 5, 2017

