Checking for links to bad people?
August 19, 2008 2:03 PM
Is there any way to search a website for links to pages like this?
A client's website has hundreds of pages of old content, content that they have no intention of updating but still gets relatively high traffic. Because of the kind of content we're dealing with, it's more important that the pages are there than that they're manually updated on a regular basis.
Is there any way to comb through and find links that have gone to domain squatters? There's some concern about the unsavory things that those pages sometimes advertise, or (more to the point) that an old link might one day go to one of those unsavory things.
Of course I've seen plenty of spider programs that check for broken links, but the ones I've seen don't seem to have a way of checking for legitimate content.
Avoiding porn is essential; avoiding "what you need, when you need it" would be nice.
You could conceivably create a script with AutoIt that would periodically do a whois on all of your links and raise a red flag/report/alert when the owner differs from the original owner you have on file, or when the 'owner since' date is relatively new. (I doubt you have a record of the original site owners, but any domain with an 'owned since' date newer than the date you created the link would be a good starting point: those are the domains that changed ownership or lapsed and were re-registered since your starting date.)
But, yeah.. a program that already does this, or a website that will do it automatically? I doubt such a thing exists...
posted by MarkLark at 2:46 PM on August 19, 2008
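A minimal Python sketch of the date comparison described above, for anyone who would rather not use AutoIt. It assumes you already have the raw WHOIS text for each domain and that it contains a "Creation Date:" line in ISO format; real WHOIS output varies by registry, so that field label is an assumption, and the function names are made up for illustration.

```python
import re
from datetime import date, datetime

# Assumed field label and date format; WHOIS output is not standardized,
# so this pattern will need adjusting for some registries.
CREATION_RE = re.compile(
    r"^\s*Creation Date:\s*(\d{4}-\d{2}-\d{2})",
    re.IGNORECASE | re.MULTILINE,
)


def creation_date(whois_text):
    """Return the registration date found in raw WHOIS text, or None."""
    match = CREATION_RE.search(whois_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d").date()


def needs_review(whois_text, linked_on):
    """Flag a domain registered after the link was made, or with no parsable date."""
    created = creation_date(whois_text)
    return created is None or created > linked_on


# Hypothetical WHOIS excerpt for a domain we linked to in June 2005.
sample = """Domain Name: EXAMPLE.COM
Creation Date: 2008-03-14T04:00:00Z
"""
print(needs_review(sample, date(2005, 6, 1)))
```

This only catches domains that lapsed and were registered again. A domain that was transferred to a new owner without lapsing will probably keep its old creation date, so it would slip past this check.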
You could also write a fairly simple script that grabs the source of the first page found at every link and checks it against a list of red-flag terms, like viagra, porn, "what you need", etc. That would give you a list of the first links you'll want to hand check.
And I'm sorry to hear that cat scan is no more. That was an early web favorite of mine.
posted by afflatus at 2:48 PM on August 19, 2008
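A rough Python sketch of the red-flag check described above. The term list is taken from the examples in this thread and is only a starting point; the function names and the crude tag stripping are assumptions for illustration, not a tested tool.

```python
import re
import urllib.request

# Starting list drawn from the terms mentioned in this thread; extend as needed.
RED_FLAGS = ["viagra", "porn", "what you need"]


def fetch(url, timeout=10):
    """Download a page and return its source as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def red_flags_in(html, terms=RED_FLAGS):
    """Return the terms that appear in the page text, ignoring case and markup."""
    text = re.sub(r"<[^>]+>", " ", html)       # crude tag removal
    text = re.sub(r"\s+", " ", text).lower()   # collapse whitespace
    return [term for term in terms if term in text]


# Offline example on a made-up parked-domain page.
page = "<h1>What you need,\n when you need it</h1>"
print(red_flags_in(page))
```

In real use you would call `red_flags_in(fetch(url))` for each outbound link. Plain substring matching will throw false positives, which is fine here since the output is only a list of links to hand check.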
This document (Acrobat PDF) explores some of the complexities of identifying Link Farm spam pages.
posted by wannalol at 2:57 PM on August 19, 2008
Maybe you could just link to the Wayback Machine's archived version of those sites, and place a caveat on the website?
I've personally never run into porn by clicking on a spam page link.
posted by sponge at 3:57 PM on August 19, 2008
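If you go the Wayback Machine route, the archived links can be generated instead of looked up by hand. A small Python sketch, assuming the usual `web.archive.org/web/<timestamp>/<url>` address form and assuming the archive redirects a date-only timestamp to the nearest snapshot it holds; the function name is made up.

```python
from datetime import date


def wayback_url(original_url, snapshot_date):
    """Build a Wayback Machine address for a URL near the given date."""
    stamp = snapshot_date.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{stamp}/{original_url}"


# Example: point at a snapshot from around the time the link was written.
print(wayback_url("http://example.com/", date(2008, 8, 19)))
```

Not every page has been archived, so it is worth spot checking the generated links before swapping them in.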
Python + Beautiful Soup + re. Not 100%, but easily 90% or so.
posted by signal at 10:12 PM on August 19, 2008
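A sketch of the link-gathering half of that approach. The comment above names Beautiful Soup; this version swaps in the standard library's `html.parser` so it runs without installing anything. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url):
    """Return unique http(s) links that point away from the page's own host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    found = []
    for link in collector.links:
        parsed = urlparse(link)
        is_web = parsed.scheme in ("http", "https")
        if is_web and parsed.netloc != own_host and link not in found:
            found.append(link)
    return found


# Offline example with hypothetical hosts.
html = (
    '<a href="/about">About</a>'
    '<a href="http://other.example/x">Old link</a>'
    '<a href="mailto:someone@client.example">Mail</a>'
)
print(external_links(html, "http://client.example/page"))
```

The resulting list is what you would then feed to a content check or a whois check, one URL at a time.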
Response by poster: I just wanted to come back belatedly and say thanks to everyone. We ended up basically going with steveminutillo's suggestion, but also added a little disclaimer to archived content.
I'm not going to mark a best answer, because everyone was very helpful.
posted by roll truck roll at 5:09 PM on January 21, 2009
This thread is closed to new comments.
posted by steveminutillo at 2:26 PM on August 19, 2008