'Fsockopen', Plus A Couple of Googlebot Questions
May 30, 2008 12:38 PM

Website help: I'm asking for some input on (1) straightening out a problem with Google and its "description" for each page of my blog; (2) constructing a robots.txt exclusion for certain types of archive pages; and (3) the reason why something is looking for page URLs that have "function.fsockopen" at the end of them.

(1)
Google seems to be using the first bit of text from my blog — its subtitle and a phrase "Skip to content" which I don't see anywhere — as opposed to what is in the META description tag. (Results in which I see this.) I prefer it to index the latter, since the META description is an excerpt from the page and thus is better for search engines.

For some pages, it's appropriately indexing the META description. But for many more (probably the majority), it still has the blog's subtitle.

Is this merely a case of the pages with subtitles not having been visited by the Googlebot spider recently? If so, is there anything I can do to get Google to respider the whole site? I'm registered with Google Webmaster Tools. I don't have access to setting a faster crawl rate.

Or is it something wrong with the page's tagging or code? If so, what's wrong with it?

(2)
I have been trying to exclude archive pages from search engines; the post's entire content is reproduced there and Google doesn't like duplication. I do have <meta name="googlebot" content="noindex,noarchive,follow,noodp" /> in the archive headers, but I had also tried to exclude it via robots.txt.

Unfortunately, my attempt at doing so ended up excluding a good handful of pages it shouldn't've. My attempt was:
Allow: /200*/*/*/*/
Disallow: /200

The idea was to allow URLs in this format — http://www.[sitename].com/2008/09/11/blog-post — but to disallow all other posts that began with 200 — which would cover all the archive pages.

Can I just invert the two (put the disallow before the allow) to fix that? Or is there another way to do it? This is my site's robots.txt file.

(3)
I'm told that Googlebot could not find about 18 pages that were mentioned "either in your Sitemap or by following links from other pages during a discovery crawl." 7 of them are quirks or links I had to fix, but 11 of them were in this format:

http://www.[sitename].com/2004/10/25/blogger-1025-0648-pm/function.fsockopen

I have absolutely no idea what's causing this. Is it something on my end? These "fsockopen" listings are not in my sitemap (I double-checked). I'm really not even sure where to begin researching this one. I do have this in my htaccess file, if it's a possible cause:

AddType application/x-httpd-php5 .php
AddHandler application/x-httpd-php5 .php
posted by WCityMike to computers & internet (10 comments total)
For (3) - PHP will generate a link like this when a function in a PHP script raises an error: the error message it prints links to the manual page for that function (here, function.fsockopen), and that link is relative to the page being viewed. It looks like this happened while Googlebot was indexing your site; for instance, these search results. You can google for "warning: fsockopen" to see some examples of PHP doing this, like this busted page (assuming it hasn't been fixed). You can eliminate these by setting up proper error handling, so the PHP errors don't get displayed to the client.
posted by pocams at 12:47 PM on May 30, 2008
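
To see how a link in an error message turns into the URLs Googlebot reported, here is a minimal Python sketch. It assumes the PHP warning contained a relative link whose target was just "function.fsockopen", and it uses example.com as a stand-in for the real site; both are assumptions for illustration, not details confirmed in this thread.

```python
from urllib.parse import urljoin

# Hypothetical post URL, modeled on the one quoted in the question.
page_url = "http://www.example.com/2004/10/25/blogger-1025-0648-pm/"

# Relative link target assumed to appear in PHP's HTML-formatted warning.
doc_link = "function.fsockopen"

# A crawler resolves a relative link against the page it found it on.
resolved = urljoin(page_url, doc_link)
print(resolved)
```

The resolved address has the same shape as the 404s in the crawl report: the post URL with "function.fsockopen" tacked onto the end. Once the warning is no longer printed on the page, there is no link left for a crawler to follow.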


BTW, fsockopen is a PHP function that opens a socket connection, which is why it's often something that fails - if the far-side server goes down, you'll get these errors.
posted by pocams at 12:48 PM on May 30, 2008


> You can eliminate these by setting up proper error handling, so the PHP errors don't get displayed to the client.

Thanks ... I don't think it's relevant anymore since the WordPress plugin that caused that isn't on the page anymore. Probably just bad timing with those particular spiderings.
posted by WCityMike at 12:49 PM on May 30, 2008


For (1), look here ... particularly read the bit about DMOZ data.
posted by gyusan at 1:02 PM on May 30, 2008 [1 favorite]


For (1) if you check the html source, you'll see "skip to content." This is probably for text-based web browsers, and probably used to skip over the navigation, which can take up a few screens in these browsers.
posted by beerbajay at 1:24 PM on May 30, 2008


Beerbajay, thanks, but I knew where it came from. What I didn't understand is why it's taking from that and not the META description. Ditto with gyusan's response -- the linked-to page also indicates they draw from META descriptions, which are in the page.
posted by WCityMike at 2:11 PM on May 30, 2008


Oy, duh. But I found this, which seems to indicate that it is at least hard to force Google to use your meta description.
posted by beerbajay at 3:37 PM on May 30, 2008


How much time has passed between making the changes and now? My experience with non-popular blogs is that Googlebot only does a deep crawl about once a month. So it may take 30+ days for Google to reflect any changes you make on single blog post pages.

For (2), in this context I believe there's an implied * wildcard at the end of an allow/disallow statement. So your lines are interpreted as:

Allow: /200*/*/*/*/
Disallow: /200*

...which matches your blog posts for the disallow. robots.txt really wasn't designed well enough to handle what you are trying to do (include some subdirectories and exclude others without explicitly naming either one). The meta tags in the archive headers are the way to go with that.
posted by ghostmanonsecond at 5:30 PM on May 30, 2008
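
The implied trailing wildcard can be checked with a small Python sketch. The helper names (pattern_to_regex, matches, is_allowed) and the sample paths are made up for illustration. The precedence rule used here (the longest matching pattern wins, and Allow wins a tie) is the one described in Google's current robots.txt documentation and RFC 9309; treat it as an assumption, since crawlers at the time of this thread may have resolved conflicts differently.

```python
import re

def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex.

    '*' matches any run of characters, a trailing '$' anchors the end,
    and otherwise the pattern is a prefix match (the implied wildcard).
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern, path):
    return pattern_to_regex(pattern).match(path) is not None

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs.

    Assumed precedence: longest matching pattern wins, Allow wins ties,
    and a path that matches no rule is allowed.
    """
    best = None
    for directive, pattern in rules:
        if matches(pattern, path):
            key = (len(pattern), directive == "allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

# The two rules from the question.
RULES = [("allow", "/200*/*/*/*/"), ("disallow", "/200")]

for path in ["/2008/09/11/blog-post/",  # post, trailing slash
             "/2008/09/11/blog-post",   # post, no trailing slash
             "/2008/09/",               # month archive
             "/about/"]:                # unrelated page
    print(path, is_allowed(RULES, path))
```

Under these assumptions, both rules match a post URL that ends in a slash, and the longer Allow pattern wins. The same post without the trailing slash never matches the Allow line (the pattern needs four slashes after "/200"), so only the Disallow applies and the post is blocked, which is one possible reason posts ended up excluded. Because precedence here depends on pattern length, not on order, swapping the two lines would not change the result for a crawler that follows this rule.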


There's no such thing as Allow in robots.txt anyway, only Disallow and User-Agent. Here's a simple guide.
posted by nev at 2:29 PM on May 31, 2008


Nev, understood, but it's accepted/encouraged by Googlebot, and that section falls under a Googlebot-only section of my robots.txt file, I believe.
posted by WCityMike at 11:17 AM on June 6, 2008

