Is there a programme to allow me to find redundant, non-linked files on my server?
March 3, 2006 1:57 AM

Is there a programme, or some other way, to find 'redundant' files on my server, i.e. files that are not linked to from any other page?

Several of our larger websites have been running for 10+ years now. I'm sure there are hundreds of images, pdfs, video files, pages and entire directories hanging around on the server that are no longer used. Is there a programme that can spider a website, then walk the site's directory tree on the server, and provide a list of files that are in the latter but not the former, so that I can do a bit of housekeeping?
posted by Hartster to Computers & Internet (10 answers total)
 
You can do it with Dreamweaver, it appears. But you need to have your website defined as a Site in Dreamweaver. You'd then use "Check Links" under the Sites menu.

Otherwise there are a ton of site management programs out there which should have this as a feature, and I think the missing piece of the puzzle is that you're not googling for the phrase "orphaned files".
posted by AmbroseChapel at 2:29 AM on March 3, 2006


Can I suggest being very wary of anything that offers the ability to do this?

You need something that can not only trawl through the HTML, but which can also parse any javascript, cgi and php/asp/etc that may be on the servers as you can rarely be sure that all links are solely HTML.

Dreamweaver's abilities are severely limited in this respect.

You also need to be sure that no-one is crosslinking information from another site - internally or externally. Your tech guys may be able to give you a list of files that haven't been 'touched' (pretty sure that's the Unix term). This will give you a list of files that haven't been accessed. This doesn't tell you that it's not linked to, but if one of the files you want to delete isn't on this list, you need to find out why before you delete it.
posted by twine42 at 3:50 AM on March 3, 2006


Damn...

[...] files that haven't been 'touched' in the last x months (pretty sure [...]
posted by twine42 at 3:52 AM on March 3, 2006


Best answer: Actually, touched isn't what you're looking for. Touching sets the modification time; what you want is the access time. If your filesystem stores access times (chances are it can), and it hasn't been disabled as a mount option (this is often done to enhance performance), and your web directories aren't mounted read-only, try find /path/to/web/directory -atime +61, which should list all files that haven't been read from in the last 61 days. You might want to spider your site first to make sure you don't hit files that nobody happens to have requested in the last two months, and of course you should move them out of the directory tree into a backup instead of just plain deleting… This all assumes you're using a unix, but if this is about the site mentioned in your profile, you are.
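
Something like this, for example (a rough sketch only; example.com and all the paths are placeholders, and it assumes wget, find and cpio are on hand):

# spider the site first, so anything that's actually linked gets a fresh atime
# (--delete-after fetches each page and file, then throws the local copy away)
wget -r --delete-after http://www.example.com/

# then list regular files that nobody has read in the last ~two months
cd /path/to/web/directory
find . -type f -atime +61 > /tmp/orphan-candidates.txt

# after reviewing the list, copy the candidates into a backup tree
# (cpio -pdm preserves the directory layout) before removing them
cpio -pdm /path/to/backup < /tmp/orphan-candidates.txt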
posted by fvw at 4:05 AM on March 3, 2006


# mirror the live site over HTTP (wget creates a directory named after the host)
wget -m http://mirror.com/
# list everything actually sitting in the server's document root
cd html_root
find . | sort > ../root.files
# list everything the spider could reach
cd ../mirror.com
find . | sort > ../mirror.files
# lines marked "<" are on the server but weren't reachable by the spider
cd ..
diff root.files mirror.files

But using atime is better, yeah.
posted by sfenders at 6:15 AM on March 3, 2006


nice fvw!
posted by yeahyeahyeahwhoo at 6:17 AM on March 3, 2006


61 days is way too short; lots of activity is annual in nature. You don't want to blow away tips on surviving Burning Man or Halloween costume suggestions, etc.
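
(With fvw's command, something like find /path/to/web/directory -atime +366 would only flag files that haven't been read in over a year.)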
posted by Mitheral at 7:02 AM on March 3, 2006


Response by poster: Perfect, thanks fvw. Works a treat. The good thing about this method (re: Mitheral's worries) is that the searchbots' ever more aggressive spidering will have accessed even the most unpopular pages, so I can be pretty sure that what fvw's method produces is genuinely orphaned.

Thanks again AskMe!
posted by Hartster at 8:15 AM on March 3, 2006


Linkbot claimed to do this, but in practice it never seemed to be able to tie together spidering a site and FTPing into the root folder. That was years ago, so they may have improved it since then.
posted by yerfatma at 1:43 PM on March 3, 2006


I feel like saying this exercise is not worth the trouble. If the pages which are "orphaned" aren't being accessed, they're costing you nothing in bandwidth and effectively nothing in storage.

And if somebody's got a bookmark or an email link to them, or if some search engine has a record of them that might still turn up, then despite being orphaned on your server they're not really orphaned on the internet as a whole, and you risk serving a 404 for no good reason.
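
If you do clean things out anyway, one way to soften the 404 risk (assuming an Apache server with mod_alias; the paths below are just placeholders) is to leave a redirect behind in .htaccess for anything that moved:

Redirect 301 /old/report.pdf http://www.example.com/archive/report.pdf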
posted by AmbroseChapel at 5:00 PM on March 5, 2006


This thread is closed to new comments.