Where are these queries without a referer header coming from?
April 11, 2008 7:53 AM

I modified my website's search engine to log queries--including the search string, timestamp, referer, and visitor IP--in order to discover what visitors are searching for from which pages. Several of the queries have a blank referer and come from the IP 66.249.70.235. Those query strings are not spam-like. Typically they have some relation to my site's topic. The IP appears to belong to the Googlebot, but why would it behave that way? The queries seem like things a human would enter. Is there some method of accessing a site's search engine that strips the referer header and leaves a bogus IP? As I said, nothing about this appears malicious. I am just wondering what is happening.
posted by Fred Mars to Computers & Internet (10 answers total)
 
I can't speak for what the esteemed Mr. Googlebot is actually doing, but Referer headers are optional and are usually set only by browsers (or bots trying to look like browsers). The bot is likely using cURL or something similar to send only the bare minimum request that will get a reply out of your site.

You may want to either log the User Agent of those requests separately or cross-reference your webserver access logs to get it. That'll give you some more to work with.
posted by Skorgu at 8:02 AM on April 11, 2008
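A minimal sketch of the logging suggested above, in Python. The function name `build_log_entry`, the tab-separated layout, and the "-" placeholder are all assumptions for illustration; the poster's actual script and language are not shown in the thread.

```python
from datetime import datetime, timezone


def build_log_entry(query, headers, remote_ip, now=None):
    """Return one tab-separated log line for a search request.

    `headers` is a plain dict of request headers (hypothetical input shape).
    A missing Referer or User-Agent is written as "-" so that blank values
    remain visible when the log is read later.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    referer = lowered.get("referer", "").strip() or "-"
    user_agent = lowered.get("user-agent", "").strip() or "-"
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M:%S")
    fields = [stamp, remote_ip, query, referer, user_agent]
    # Collapse tabs and newlines inside fields so one request is one line.
    return "\t".join(" ".join(str(field).split()) for field in fields)
```

Writing the User Agent next to the referer and IP keeps everything about one query on one line, which may be easier than matching timestamps against the webserver's access log afterwards.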


By the way, the IP is not bogus.
posted by Class Goat at 8:31 AM on April 11, 2008


Could the Googlebot be following a link to a search query from elsewhere?
posted by Plug Dub In at 9:16 AM on April 11, 2008


Response by poster: Thanks for the info and suggestions. I changed my script to log the User Agent too, so we will see if it identifies itself as the Googlebot.

Sorry, what I meant by "bogus" was not clear in my question. I think I meant something more like a spoofed IP address, implying that someone is, for some reason, posing as the Googlebot.

Here are some example entries that are puzzling to me:
TERM             DATE AND TIME
-----------------------------------------
map              Thu Apr 10 23:11:02 2008
files            Thu Apr 10 23:16:47 2008
map              Thu Apr 10 23:25:32 2008
files            Fri Apr 11 00:30:52 2008
These could conceivably be a bot's attempt to find a site map for my website. It would be odd for a human to enter those terms there in the first place, especially repeatedly and so close together in time. But there are also domain-relevant queries in there, coming from the same IP. For anyone who has studied Googlebot, does it commonly submit queries to local search engines in this way?

I understand that I will probably never know exactly what is happening. I am mainly curious because of the seeming mix of human-like and robot-like behavior.
posted by Fred Mars at 9:43 AM on April 11, 2008
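One way to look at the "so close together in time" observation is to measure the gap between repeats of the same term. The sketch below parses the four entries quoted above; the function name `repeat_gaps` is hypothetical, and it assumes single-word terms, the timestamp layout shown in the table, and an English locale for the day and month names.

```python
from collections import defaultdict
from datetime import datetime

# The four sample entries from the post, without the header rows.
LOG_LINES = [
    "map              Thu Apr 10 23:11:02 2008",
    "files            Thu Apr 10 23:16:47 2008",
    "map              Thu Apr 10 23:25:32 2008",
    "files            Fri Apr 11 00:30:52 2008",
]


def repeat_gaps(lines):
    """Map each term to the seconds elapsed between its successive queries."""
    seen = defaultdict(list)
    for line in lines:
        # Assumes the term contains no spaces; the rest is the timestamp.
        term, stamp = line.split(None, 1)
        seen[term].append(
            datetime.strptime(stamp.strip(), "%a %b %d %H:%M:%S %Y")
        )
    gaps = {}
    for term, times in seen.items():
        times.sort()
        gaps[term] = [
            int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])
        ]
    return gaps


print(repeat_gaps(LOG_LINES))
```

For these entries, "map" repeats after 870 seconds and "files" after 4445 seconds. Run over a larger log, regular or clustered gaps per IP might help separate scripted requests from human ones.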


Are there referring URLs that come up often in relation to the searches for those same terms? That might suggest pages that link directly to search results, which the bot may be following.
posted by Good Brain at 9:56 AM on April 11, 2008


Wild ass idea: if the GET variable for your search term is a short, common one ("s"), it might be getting lumped in with a completely different application that Google thinks you might have, since you don't error out when it's provided.
posted by Skorgu at 10:34 AM on April 11, 2008


If someone enters the URL directly into the address bar, instead of navigating from another page, Referer would not be set then either.
posted by nakedcodemonkey at 10:07 PM on April 11, 2008


The IP is not spoofed.

The IP is not just a signature by the sending party. It's also the return address for packets your server sends back. If the IP is spoofed, the packets go back to the wrong place.

People trying to do a DDoS often spoof their IPs, but that's not what you're looking at here. If someone is trying to get information from your server, no matter what that information is, then the IP you see in the log is the IP that they're using to receive the packets your server is sending back which contain the information they want. It has to be correct or else they wouldn't see those packets.
posted by Class Goat at 10:43 PM on April 11, 2008
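Knowing the address is genuine still leaves the question of whose it is. A common check is forward-confirmed reverse DNS: look up the hostname for the IP, confirm it falls under a crawler domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below is a hedged illustration, not an official procedure. The function name `is_verified_googlebot` is hypothetical, and the accepted suffixes (`.googlebot.com`, `.google.com`) are an assumption based on Google's published guidance, which should be checked against its current documentation.

```python
import socket


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot.

    The two lookup functions are injectable so the logic can be exercised
    without network access; by default they are the standard socket calls.
    """
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # Assumed crawler domains; verify against Google's current guidance.
    allowed = (".googlebot.com", ".google.com")
    if not hostname.lower().rstrip(".").endswith(allowed):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, addresses).
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the address we started with.
    return ip in addresses
```

Calling `is_verified_googlebot("66.249.70.235")` performs live DNS lookups, so the result depends on the resolver and on current DNS records. The reverse lookup alone is not enough, because whoever controls an address range also controls its reverse DNS; the forward step is what ties the name back to the claimed domain.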


Best answer: Google's bot does now fill out forms, using words that appear on your website: http://searchengineland.com/080411-140000.php
posted by lunchbox at 7:50 AM on April 12, 2008


Response by poster: Thanks for all the suggestions and explanations! I believe Google's new crawling tactic explains what I've seen in my logs. It will be interesting to watch how the bot's choice of search terms changes over time.
posted by Fred Mars at 8:26 AM on April 15, 2008

