"He's making it up as he goes along!"
September 2, 2005 8:05 PM
Why is google spidering specific but non-existent pages on my blog?
Over the last couple of days I've seen Google's bot scanning my website, trying to access specific URLs:
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:03:32 +0100] "GET /summit/contact.html HTTP/1.1" 200 38112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:03:46 +0100] "GET /pages/devonshire.html HTTP/1.1" 200 38114 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:03:57 +0100] "GET /pages/hampton.html HTTP/1.1" 200 38111 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:04:09 +0100] "GET /chatsford/contact.html HTTP/1.1" 200 38115 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Unfortunately, those URLs don't exist and they've never existed. So why is Google requesting them? Is it just guessing, or what? Also, my site is set up so that requests which don't go to an existing page will go to the index page. So if Google requests these non-existent pages and gets the same content each time (i.e. my index page), will it think (incorrectly) that I've set up some SEO link farm and lower my PageRank as a punishment?
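For anyone wanting to check their own logs for the same pattern, here is a rough sketch in Python of pulling the fields out of one of these log lines. It assumes the format shown above (virtual host first, then the usual combined-log fields); the names `LOG_LINE` and `parse_line` are made up for illustration.

```python
import re

# Assumed layout: vhost, client, two unused fields, [time], "request",
# status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# First request from the log above, joined back onto one line.
SAMPLE = (
    'www.benzo8.org crawl-66-249-66-3.googlebot.com - - '
    '[03/Sep/2005:03:03:32 +0100] "GET /summit/contact.html HTTP/1.1" '
    '200 38112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Return the log fields as a dict of strings, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    record = parse_line(SAMPLE)
    print(record["path"], record["status"], record["size"])
```

In the four requests above the response sizes run from 38111 to 38115 bytes, which fits the same front page being served for every one of them.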
they might be using it as a test to see if you are in fact a link farm?
sorta like: "If I throw these random links at you, and you provide stuff to me, I suspect you of trying to pollute your legitimate links too"
posted by clord at 12:27 AM on September 3, 2005
I notice you're returning HTTP 200 codes. You should return 404 errors or 301 redirect codes if you want Google to realize those pages don't exist.
posted by cillit bang at 3:32 AM on September 3, 2005
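A minimal sketch of the status-code advice above, written in Python rather than the site's own PHP. The two lookup tables and the `respond` function are hypothetical stand-ins, not anything from Mambo; the only point is that an unknown path gets a 404 and a moved page gets a 301, instead of a 200 carrying the front page.

```python
from http import HTTPStatus

# Hypothetical stand-in for the pages stored in the site's database.
KNOWN_PAGES = {"/": "Front page", "/about.html": "About this site"}

# Hypothetical map of retired URLs to their replacements.
MOVED_PAGES = {"/old-about.html": "/about.html"}


def respond(path):
    """Return (status, headers, body) for a requested path."""
    if path in KNOWN_PAGES:
        return HTTPStatus.OK, {}, KNOWN_PAGES[path]
    if path in MOVED_PAGES:
        # The page exists, just somewhere else: say so permanently.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED_PAGES[path]}, ""
    # Nothing matches: report that honestly instead of serving the front page.
    return HTTPStatus.NOT_FOUND, {}, "Sorry, that page does not exist."


if __name__ == "__main__":
    for requested in ("/", "/old-about.html", "/summit/contact.html"):
        status, headers, _ = respond(requested)
        print(requested, int(status), headers)
```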
Response by poster: Well, basically every URL that arrives at the site goes through a rewriting process, and those that don't land on a certain point in the database (i.e. an extant page) will return the front page content, so a 200 is correct in terms of outcome. But in this particular situation, and given what I feared and clord suggested too, I could be damaging my Google PageRank by doing that, if they're doing what we think they might be doing.
Don't we imagine that google would be intelligent enough to check if content they were returned when they made a "random" check was actually relevant - ie: if they did trip over a link farm, they normally build the search terms/urls into the content of the dynamic page to increase their googleness. My site will just return the same index/content page time and time again, but it will (unless entirely coincidentally) bear no relevance to the URL google's decided to test me with...
So, in short - is the consensus that Google is checking whether I'm a link farm, and I'm currently doing nothing to dissuade them of that notion?
posted by benzo8 at 4:43 AM on September 3, 2005
Every page should have one single canonical URI and everything else should be a redirect to it. Not just to help out Google, it's also good design.
posted by cillit bang at 4:46 AM on September 3, 2005
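One way to picture the "single canonical URI" idea, as a rough Python sketch. The normalisation rules here (collapsing repeated slashes, dropping a trailing slash, treating `index.html` as its directory) are assumptions picked for illustration, not rules taken from Mambo or from Google, and both function names are hypothetical.

```python
import posixpath


def canonical_path(path):
    """Reduce equivalent spellings of a path to one form (assumed rules)."""
    # Collapse repeated slashes and "." / ".." segments; this also drops
    # any trailing slash except on the root.
    cleaned = posixpath.normpath("/" + path.lstrip("/"))
    # Treat an explicit index file as the directory that contains it.
    if cleaned.endswith("/index.html"):
        cleaned = cleaned[: -len("/index.html")] or "/"
    return cleaned


def redirect_if_needed(path):
    """Return (301, canonical) when the request is an alias, else (200, path)."""
    target = canonical_path(path)
    if target != path:
        return 301, target
    return 200, path


if __name__ == "__main__":
    for requested in ("/index.html", "//pages//hampton.html", "/pages/hampton.html"):
        print(requested, redirect_if_needed(requested))
```

A real site would still need to check that the canonical path names an existing page and return a 404 when it does not.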
"those that don't land on a certain point in the database (ie: an extant page) will return the frontpage content"
That's a really bad idea, as you're finding out. Get a helpful error page in there and use the right status code.
Is your site set up to respond only to the correct host header, or will it treat any request to the correct IP as being for that site? If it's the latter then the requests could be either for someone who previously had that IP, or for a domain name that's been incorrectly pointed your way.
posted by malevolent at 7:09 AM on September 3, 2005
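The host header question above comes down to a check like the following, sketched in Python. `EXPECTED_HOSTS` and `host_is_ours` are hypothetical names; in practice the web server's virtual host configuration usually does this job, and the sketch ignores IPv6 literals for brevity.

```python
# The only name this site should answer to (taken from the log lines above).
EXPECTED_HOSTS = {"www.benzo8.org"}


def host_is_ours(host_header):
    """Return True when the Host header names this site; any port is ignored."""
    if not host_header:
        return False
    # Drop an optional ":port" suffix and compare case-insensitively.
    hostname = host_header.split(":", 1)[0].strip().lower()
    return hostname in EXPECTED_HOSTS


if __name__ == "__main__":
    for header in ("www.benzo8.org", "WWW.BENZO8.ORG:80", "example.com", ""):
        print(repr(header), host_is_ours(header))
```

A request that fails this check is for some other site, so it should not be answered with this site's content.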
Response by poster: The site is set up for virtual hosts, so will only respond to www.benzo8.org requests. In terms of the URL handling - I'm running Mambo, and that pretty much does its own thing. I've got the SEO-friendly URL modules switched on and I guess that's what handles the rewriting, but I think even without it, it defaults to the index page, so I'm gonna have to get my hands dirty with the PHP, I guess, and find out how to get it to fail gracefully rather than lazily... Thanks to all.
posted by benzo8 at 7:13 AM on September 3, 2005
This thread is closed to new comments.
posted by RustyBrooks at 8:33 PM on September 2, 2005