Need a very simple link checker
March 16, 2006 2:17 PM

Need a URL link checker that can take a plain text list of URLs and just tell me whether each one is active or inactive. All of the link checkers I downloaded are geared towards taking in a URL, traversing all the links and anchors within that URL, and generating reports on everything. What I have is a simple list of websites on the internet (not my own pages within my domain or site); all I need is for it to tell me whether they still exist or not. That's all. Can anyone recommend something?
posted by postergeist to Technology (8 answers total)
 
wget -i [filename]

will get each page if it can, and report errors on those that fail.
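
If you don't actually need to keep the pages, adding --spider should make wget just check each URL instead of downloading it. Something like

wget --spider -nv -i [filename] -o results.log

ought to leave a tidy pass/fail log, though I'm going from memory here, so check your wget version's docs.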
posted by nicwolff at 2:37 PM on March 16, 2006


http://www.gnu.org/software/wget/ has links to Windows binaries of wget if you need them.
posted by fishfucker at 2:58 PM on March 16, 2006


I use a program called WinHTTrack. It seems to work well for me.
posted by inigo2 at 3:04 PM on March 16, 2006


Here's a Python script that will do the job (it reads the URL list from the file named on the command line, or from stdin):

import urllib2
import fileinput

for line in fileinput.input():
    url = line.strip()
    if not url:
        continue
    try:
        # urlopen raises URLError (or its subclass HTTPError) if the
        # URL can't be fetched, and ValueError if it isn't a URL at all.
        urllib2.urlopen(url)
    except urllib2.URLError:
        print url, 'NOT ACTIVE'
        continue
    except ValueError:
        print url, 'CANT PARSE'
        continue
    print url, 'OK'

posted by Capn at 3:19 PM on March 16, 2006


Um, those extra new-lines didn't show up in the live preview, sorry.
posted by Capn at 3:19 PM on March 16, 2006


The problem with this approach is that it will only test whether the domain is active. If the list is more than a few months old, the domain registrations may have expired and been snapped up by a spammer. This happens a lot, even to computer-savvy MeFi members :) If the spammer has relaunched the site as a spam trap, the wget tests above will still come back positive, even though all the original content is gone.

This will only be a problem if your URLs are in the form http://www.domain.com; if they point to a specific page (http://www.domain.com/page.html) then the methods above should work OK.
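
If you want to guard against that, here's a rough heuristic in the same Python 2 / urllib2 style as Capn's script (untested sketch): request a made-up page on the same host, since a parked or linkfarm domain will usually answer gibberish URLs with a 200 instead of a 404.

import random
import string
import urllib2
import urlparse

def looks_parked(url):
    # Build a bogus URL on the same host as the one we're checking.
    parts = urlparse.urlsplit(url)
    junk = ''.join([random.choice(string.ascii_lowercase) for i in range(20)])
    bogus = parts[0] + '://' + parts[1] + '/' + junk
    try:
        urllib2.urlopen(bogus)
    except urllib2.HTTPError:
        return False  # the bogus page 404s, so the site probably isn't a catch-all
    except urllib2.URLError:
        return False  # host unreachable; that's a different problem
    return True       # the bogus page "exists", which is suspicious

It's only a heuristic (some legitimate sites also answer every URL), but it catches the obvious squatters.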
posted by blag at 4:00 PM on March 16, 2006


"This will only be a problem if your URLs are in the form http://www.domain.com"

It can still be a problem regardless. That CSS drop-shadow link that you pointed out is http://www.renegadetourist.com/shadow/index.html and it still returns the linkfarm page, which is set up to respond to any URL, as you can demonstrate for yourself: http://www.renegadetourist.com/aw039lhjdkhgliseyuhkjh
posted by Rhomboid at 5:53 PM on March 16, 2006


Here's another method. Copy your plain text list of URLs and go here. Paste them in the form and press the "HTMLify" button. Save the resulting web page to your computer. Upload the page to your site and sic your downloaded link checkers on it. Or go here, enter the URL of your new page and press the "Check" button. If you want, you can then browse to your page and click on any links you want to double-check manually.
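
If you'd rather skip the web form, a few lines of Python in the same style as Capn's script (untested sketch) will do the HTMLifying locally; redirect the output to an .html file and point any of the link checkers above at it.

import fileinput

# Wrap each URL from the plain text list in an anchor tag so an
# ordinary crawling link checker can follow them all from one page.
print '<html><body>'
for line in fileinput.input():
    url = line.strip()
    if url:
        print '<a href="%s">%s</a><br>' % (url, url)
print '</body></html>'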
posted by shoesfullofdust at 9:19 PM on March 16, 2006

