Checking for links to bad people?
August 19, 2008 2:03 PM

Is there any way to search a website for links to pages like this?

A client's website has hundreds of pages of old content, content that they have no intention of updating but still gets relatively high traffic. Because of the kind of content we're dealing with, it's more important that the pages are there than that they're manually updated on a regular basis.

Is there any way to comb through and find links that have gone to domain squatters? There's some concern about the unsavory things that those pages sometimes advertise, or (more to the point) that an old link might one day go to one of those unsavory things.

Of course I've seen plenty of spider programs that check for broken links, but the ones I've seen don't seem to have a way of checking for legitimate content.

Avoiding porn is essential; avoiding "what you need, when you need it" would be nice.
posted by roll truck roll to Computers & Internet (7 answers total)
 
I don't think you're going to be able to write a program that can tell the difference between "legitimate" and... illegitimate content. Google probably already has 10 PhDs working on that and not succeeding. You should just write a script to pull out all the links, normalize them down to a list of all the unique domains, and dump them into a page for some poor human to check one by one. When you find a domain that is just a click farm, search and destroy all instances of links to it across your content.
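
For the record, a minimal sketch of what that script might look like in modern Python, assuming the third-party BeautifulSoup package and that the site's pages sit on disk as HTML files (the glob path is a placeholder):

import glob
from urllib.parse import urlparse

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

domains = set()
for path in glob.glob("site/**/*.html", recursive=True):
    with open(path, encoding="utf-8", errors="replace") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Pull every outbound link and reduce it to its domain.
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc.lower()
        if host:  # relative/internal links have no netloc, so skip them
            domains.add(host)

# One unique domain per line, for a human to eyeball.
for host in sorted(domains):
    print(host)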
posted by steveminutillo at 2:26 PM on August 19, 2008


You could conceivably create a script with AutoIt that would periodically do a whois on all of your links and put up a red flag/report/alert when the owner is different from the original owner you have on file, or perhaps when the 'owner since' date is relatively new. (I doubt you have a record of the original site owners, but anything with an 'owned since' date that's newer than when you originally created the links would be a good starting point - those would be the domains that changed ownership or were 'non-renewed' since your starting date.)

But, yeah.. a program that already does this, or a website that will do it automatically? I doubt such a thing exists...
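
If you did want to script the whois idea yourself rather than use AutoIt, here's a rough sketch in Python, assuming the third-party python-whois package; the cutoff date and domain list are placeholders:

from datetime import datetime

import whois  # third-party: pip install python-whois

CUTOFF = datetime(2005, 1, 1)  # roughly when the links were created

for domain in ["example.com", "example.org"]:  # your extracted domain list
    try:
        record = whois.whois(domain)
    except Exception:
        print(domain, "- whois lookup failed, check by hand")
        continue
    created = record.creation_date
    if isinstance(created, list):  # some registrars return several dates
        created = created[0]
    # A registration newer than your links suggests the domain changed hands.
    if created is None or created > CUTOFF:
        print(domain, "registered", created, "- possible squatter")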
posted by MarkLark at 2:46 PM on August 19, 2008


You could also write a fairly simple script that grabs the source of the first page found behind every link and checks it against a list of red-flag terms, like viagra, porn, "what you need", etc. That would give you a list of the first links you'll want to hand-check.
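
A sketch of that, standard library only; the term list and URL list are obviously just placeholders to grow by hand:

import urllib.request

RED_FLAGS = ["viagra", "porn", "what you need, when you need it",
             "this domain is for sale"]

def looks_suspect(url):
    """Return the first red-flag term found on the page, or None."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-check"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read(100_000).decode("utf-8", errors="replace").lower()
    except Exception:
        return "unreachable"  # broken links get flagged for review too
    for term in RED_FLAGS:
        if term in html:
            return term
    return None

for url in ["http://example.com/"]:  # your extracted link list
    hit = looks_suspect(url)
    if hit:
        print(url, "->", hit)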

And I'm sorry to hear that cat scan is no more. That was an early web favorite of mine.
posted by afflatus at 2:48 PM on August 19, 2008


This document (Acrobat PDF) explores some of the complexities of identifying Link Farm spam pages.
posted by wannalol at 2:57 PM on August 19, 2008


Maybe you could just link to the Wayback Machine's archived versions of those sites, and just place a caveat on the website?

I've personally never run into porn by clicking on a spam page link.
posted by sponge at 3:57 PM on August 19, 2008


Python + Beautiful Soup + re. Not 100%, but easily 90% or so.
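
In case it helps, the re half of that might look something like this, paired with the BeautifulSoup link extraction sketched above; the pattern is only a guess at boilerplate that squatter pages tend to share:

import re

SQUATTER_RE = re.compile(
    r"domain (?:is |may be )?for sale|sponsored listings|related searches",
    re.IGNORECASE,
)

def is_probably_squatted(html: str) -> bool:
    """Rough classifier for an already-fetched page; tune against real examples."""
    return SQUATTER_RE.search(html) is not None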
posted by signal at 10:12 PM on August 19, 2008


Response by poster: I just wanted to come back belatedly and say thanks to everyone. We ended up basically going with steveminutillo's suggestion, but also added a little disclaimer to archived content.

I'm not going to mark a best answer, because everyone was very helpful.
posted by roll truck roll at 5:09 PM on January 21, 2009

