Webstat Pollution is Driving Me Insane
April 19, 2007 9:02 AM
Blogger is polluting my webstats. Hivemind, please hope me.
I run a site outside Blogger, and my stats are completely polluted by one person's blog, which IS a Blogger site. I know it's not hotlinking. 1) we have that disabled and 2) I know the person and he isn't the type to hotlink. I know that he has a text link to us on his site, and he's linked to us in past posts, but nothing to justify 99% of our traffic being recorded as being from his site. I don't want him to delink us (rude, and he's part of the group affiliated with the other site).
I have no idea why he's hitting us so much, and it's really out of control. I can see in the stats hits from individual posts that have nothing to do with us (i.e., there is no link in the post text to my URL).
Any ideas of how to stop this? I need to see the actual traffic being driven to my site, and what I'm getting right now is trash.
Response by poster: We're hosted by DreamHost and I think the stats program is called Analog. On my stats page, it lists the referring URLs and gives the number of pages - this blogger site has zero beside each hit (rather than say, http://images.google.com/imgres which has 111 or so). From the first of April til now, this blogger site has hit our site 5529 times.
The number of fails we have for robots.txt suggests to me that maybe it's partially a spidering/webcrawler/robot issue. Does that jibe? We have ~700 misses looking for robots.txt.
posted by Medieval Maven at 10:09 AM on April 19, 2007
I have a similar situation in that a disproportionate number of the referrers to my (not especially popular) site are from the site of a friend who linked to me a few months ago. I don't think his site has a huge number of actual human visitors, nor would they be clicking through, so I'm pretty sure it's a spidering/bot thing.
He happens to be involved with the whole post-meta-ironic burlesque-show scene, so my offhand guess is that maybe his site gets hit by a lot of sex-related spiders that are much more aggressive than the standard non-porn ones? I haven't bothered to investigate further, though.
Also, keep in mind that those hits are not necessarily legitimately coming from his site, whether by actual clicks or bots; I can write a script to fetch a page from your site and send whatever I want in the referrer header. (Why someone would falsify that in this particular way, though, I dunno. That's how referrer spam works, but it doesn't sound like that's the case here.)
posted by staggernation at 10:24 AM on April 19, 2007
Do you have someone who can pull your log files and extract all the lines with his referrer? Both steps are pretty easy to do on either Windows or a Mac once you have all the files downloaded and uncompressed.
Looking at the details will help figure out what's going on. If the clicks have a variety of IP addresses and user agents, then they may well be legitimate referrals. In which case, you should be thrilled that the site is sending you so much traffic.
On the other hand, if they seem to come from a relatively small number of IP addresses and/or a single user agent, then the site is probably involved in targeting you with referrer spam.
posted by Good Brain at 11:18 AM on April 19, 2007
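A minimal sketch of the check described above, assuming the logs are in Apache's "combined" format (an assumption; look at a few lines of your own log first). The function name `summarize_referrer`, the referrer string, and the sample lines are made up for illustration. It counts how many distinct IP addresses and user agents sent hits carrying a given referrer.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format. Adjust if your host differs.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_referrer(lines, referrer_substring):
    """Count client IPs and user agents for hits whose referrer contains the substring."""
    ips = Counter()
    agents = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if referrer_substring in match.group("referrer"):
            ips[match.group("ip")] += 1
            agents[match.group("agent")] += 1
    return ips, agents


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass an open log file instead.
    sample = [
        '192.0.2.1 - - [19/Apr/2007:09:00:00 -0700] "GET / HTTP/1.1" 200 512 '
        '"http://example.blogspot.com/2007/04/post.html" "SomeBot/1.0"',
        '192.0.2.1 - - [19/Apr/2007:09:00:01 -0700] "GET /about HTTP/1.1" 200 512 '
        '"http://example.blogspot.com/2007/04/post.html" "SomeBot/1.0"',
    ]
    ips, agents = summarize_referrer(sample, "example.blogspot.com")
    print(f"{sum(ips.values())} hits, {len(ips)} IPs, {len(agents)} user agents")
```

Thousands of hits spread over many IPs and browsers would point toward real visitors; thousands of hits from a handful of IPs and one user agent would point toward a bot or referrer spam.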
BTW There are all sorts of bots out there crawling, which is probably why you have so many misses for robots.txt. I don't recall seeing legit crawlers giving a referrer. For example, Google, MSN and Yahoo's bots don't.
posted by Good Brain at 11:22 AM on April 19, 2007
In the "Referrer Report" section of your Analog stats, if the first column (#reqs) next to the referring URL is a high number, but the second column (#pages) for the same URL is 0, then that means hotlinking. For the band website I run, we get a lot of Myspace URLs that show up like this, due to people hotlinking our photos on their Myspace profiles.
If it's not hotlinking, looking at the raw logs will help you sort out what's going on. You may already know this but you have access to your raw server logs via FTP (I'm a DH user as well). There should be a folder called "logs" at the top level of your login (same level as your whatever.com folders). In that folder should be another folder called whatever.com, and in that I think are the files. There may be yet another directory called http or something. I'm writing this from memory so I'm not sure what they're called, but one of those files is a symbolic link to your current log, and then there are folders with the past few days' logs as well.
posted by statolith at 12:01 PM on April 19, 2007
Response by poster: . . . . I'm pretty sure we've disabled hotlinking, but it's possible that the gallery portion of the site is somehow removed from that blanket exclusion. We host two other sites on DH and I had a problem with an image being hotlinked to myspace. . . . hm.
Thanks for the ideas so far . . .if anyone else has a plan, post away!
posted by Medieval Maven at 1:02 PM on April 19, 2007
Even if you disable hotlinking, the request is still going to show up in your log files. Sure, the return code might be "403 forbidden" instead of "200 ok", but unless your logging software is configured to split those into a separate category, they'll be considered just as any other request.
Also, there are spammy web crawlers out there which will crawl your entire site and use the same referring URL for every single page request, regardless of what the actual referrer is. The URL they use is often the first one which linked to your page. I get these all the time - it will appear that I had 100 hits in a row from a google search for something, but in fact it was only 1 referrer, and the bot was re-using it while crawling my site.
Have you considered using Google Analytics? It's javascript-based, which means it doesn't count a lot of the spambots. It's also smart enough to separate search engine traffic from other types.
posted by helios at 3:40 PM on April 19, 2007
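To see whether blocked hotlink requests are inflating a referrer's count, one option is to break each referrer's hits down by HTTP status code. This is only a sketch: it again assumes Apache's "combined" log format, and the name `status_by_referrer` and the sample lines are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: Apache "combined" log format; only status and referrer are captured.
STATUS_AND_REFERRER = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)


def status_by_referrer(lines):
    """Map each referrer to a Counter of the HTTP status codes returned for its hits."""
    table = defaultdict(Counter)
    for line in lines:
        match = STATUS_AND_REFERRER.match(line)
        if match:
            table[match.group("referrer")][match.group("status")] += 1
    return table


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass an open log file instead.
    sample = [
        '192.0.2.1 - - [19/Apr/2007:09:00:00 -0700] "GET /gallery/a.jpg HTTP/1.1" 403 210 '
        '"http://example.blogspot.com/" "Mozilla/5.0"',
        '192.0.2.2 - - [19/Apr/2007:09:00:05 -0700] "GET / HTTP/1.1" 200 512 '
        '"http://example.blogspot.com/" "Mozilla/5.0"',
    ]
    for referrer, statuses in status_by_referrer(sample).items():
        print(referrer, dict(statuses))
```

A referrer whose hits are mostly 403s is probably being refused by the hotlink protection but still counted by the stats package.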
Response by poster: We've signed up for Google Analytics (yay, free) and we're looking at the files on DH. I will post when we nail down what it is (or isn't, or if we don't, or if it just stops). My money is at the moment on a bot because the pages that people are coming from just don't have anything on them that could be a hotlink to our site.
posted by Medieval Maven at 6:15 AM on April 20, 2007
Or you can just drop him from your Analog referer report by adding:
REFREPEXCLUDE http://his-site.com/*
REFREPEXCLUDE http://www.his-site.com/*
...to your analog.cfg file.
posted by genghis at 8:54 AM on April 20, 2007
This thread is closed to new comments.
Also, when looking at the logs, look at an entry where the referrer was the blogger site, and see what client it was. It could be that search engines scan the crap out of that blogger site and then link to yours repeatedly because it's linked somewhere. Lots of webstats packages allow you to filter out results that were from googlebot, yahoo's search bot, etc etc... Alternatively you could have a robots.txt file.
Lastly, though not too likely, there's the possibility that the blogger site has such a nice comprehensive list of links that people use it as a home/default page and really do click to visit you a lot... but based on what you're saying, it doesn't seem that way.
Oh yeah - also potentially important: what web stats package do you use?
Basically, I think we need more info to help you well, but the above are a few things to look at first...
posted by twiggy at 9:28 AM on April 19, 2007