How can I get Google to re-index my site and how did it find it in the first place?
August 21, 2006 8:28 PM

Way, way before it was ready, a site I was building somehow got crawled by Google.

The version that got crawled has "lorem ipsum" text everywhere and stuff like "catchphrase goes here" in place of actual content.

[I've learned my lesson now -- I will password-protect sites in future, or use robots.txt, or whatever. There's no point lecturing me on this aspect.]

So, question one -- for some reason it got into Google. How? It wasn't linked from anywhere, and we certainly didn't go to the form on Google which says "please list my website".

The only thing we did was change hosting companies. Did Google sense a disturbance in the force when the DNS records clicked over? It seems unlikely.

Question two -- how can I get Google to come back and re-crawl the site? I've joined their SiteMaps program in the hope that would help, but it hasn't. I've used the "please crawl me" form three times now, and more than a month has gone by, and still nothing.

I assumed it would happen after a couple of weeks, but it's getting a little embarrassing. When you search for the site name you get "sitename.com -- lorem ipsum blah blah tagline goes here" in the Google results.

Is there any SEO black magic which I can employ to help, or should I just wait?

Two technical details which someone said might be affecting it, although I'm not sure I believe them:
  1. The front page which got indexed is no longer the front page; it's blank with a redirect to the actual front page (for futureproofing reasons, long story).
  2. The page which got indexed is "index.html" but has server-side includes (I tweaked the server, I like it that way); but Google doesn't know that, the page is never referred to by name, only with a trailing slash.
posted by AmbroseChapel to Computers & Internet (14 answers total) 3 users marked this as a favorite
 
If you google "google sandbox + 6 months," you'll find some explanations. Basically it can take 6-9 months for new/updated sites to re-index on google.

Frustrating, right?
posted by empyrean at 9:26 PM on August 21, 2006


Oops, now that I re-read your question and my answer, I realize I misunderstood and crossed my answer terms up.
posted by empyrean at 9:29 PM on August 21, 2006


Things like the Google Toolbar can phone home if that feedback thing-a-ma-job is turned on. If you browse your upcoming site with a browser that runs the toolbar, Google will see it and dispatch the all-seeing spiders.

It's just another way they try and stay ahead of the competition.
posted by unixrat at 9:51 PM on August 21, 2006


Oops. The above was meant to be in reference to:

How? It wasn't linked from anywhere, and we certainly didn't go to the form on Google which says "please list my website".
posted by unixrat at 9:52 PM on August 21, 2006


So, you're saying I'm in the infamous "sandbox", and the sandbox effect was triggered by a change to the WHOIS registry? Google really does sense a disturbance in the force?
posted by AmbroseChapel at 9:53 PM on August 21, 2006


Things like the Google Toolbar can phone home if that feedback thing-a-ma-job is turned on.

Good point. I hadn't considered that.
posted by AmbroseChapel at 9:56 PM on August 21, 2006


Google's spider is relatively active. You can see it yourself in your server access logs, under the user-agent name Googlebot. (It's a good idea to analyze the spider's path through your site, just to know what it's attracted to.)
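A minimal sketch of that kind of log check, assuming the server writes the common "combined" log format (an assumption; nothing in this thread says which server or log format is in use). The function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Combined log format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def googlebot_paths(lines):
    """Count requested paths for lines whose user-agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "googlebot" in match.group("agent").lower():
            counts[match.group("path")] += 1
    return counts

# Hypothetical sample lines, not real log data.
sample = [
    '192.0.2.10 - - [21/Aug/2006:20:28:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [21/Aug/2006:20:28:05 +0000] "GET /about/ HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [21/Aug/2006:20:29:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows; U) Firefox/1.5"',
]

# Print the paths the (self-described) spider asked for, most requested first.
for path, hits in googlebot_paths(sample).most_common():
    print(hits, path)
```

To run it on a real log you would pass an open file instead of the sample list. Bear in mind that the user-agent is just a string the client sends, so anything can claim to be Googlebot.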

The spider observes the speed of changes to your site, and alters how often it checks in accordingly.

New pages can get indexed faster than altered older ones, especially if the alterations are minor. More popular pages, and pages on fast-moving topics, get more attention from the spider. A small site (under 1,000 pages) that hasn't been changing may get a visit every month or two, but a popular medium-sized site (between 1,000 and 100,000 pages) that changes often gets a visit on the order of weeks or even days.

Keep in mind: the spider might visit, but the data might not make it into the index. They're two separate things.

You might think of forcing the spider through a new path by changing your URLs. But that could result in the dreaded duplicate page penalty. You'd look like a spammer. You don't want that.

You really should get rid of that redirect. That could be the entire problem right there. Often spammers use redirects to fool the spider. (Always think about the battle between the spammers and Google when thinking about SEO. Google is constantly changing its spidering and indexing policy in its arms race with the spammers. Keep up.) In general, a 301 "Moved Permanently" is better than a 302 "Found". 301s tell the spider not to come to that URL anymore, so it's less usable by spammers.

However, in your case, neither is good. You don't want to distract attention from your /index.html.
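If you're not sure what kind of redirect the front page is actually sending, here is a rough way to check, using Python's standard http.client (which does not follow redirects on its own). The function names are made up for this sketch, and the host you pass in is up to you. A page that answers 200 and then redirects from inside the HTML would show up here as "no HTTP redirect".

```python
import http.client

def fetch_status(host, path="/", timeout=10):
    """Request a URL without following redirects; return (status, Location header)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def describe_redirect(status, location):
    """Put a (status, location) pair into plain words."""
    if status in (301, 308):
        return f"permanent redirect to {location}"
    if status in (302, 303, 307):
        return f"temporary redirect to {location}"
    if 200 <= status < 300:
        return "no HTTP redirect (if the page still redirects, it does so inside the page)"
    return f"other response: {status}"

# Classification only; no network access happens here.
print(describe_redirect(301, "/home/"))
print(describe_redirect(200, None))

# Against a live site (hypothetical host name) it would be used like this:
#   status, location = fetch_status("www.example.com", "/")
#   print(describe_redirect(status, location))
```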


So, I'd keep submitting what I could through the official webmaster tools like you're doing, I'd republish the site on a regular basis, I'd get rid of the redirect, and I'd be patient.


(I'm not sure about your first question on how Google found out. Google is very private about its data mining techniques. It's possible they're monitoring DNS changes, but unlikely. Far more likely candidates are their user inputs: other sites, the Toolbar, search queries, Google Mail.

And if, as I imagine, the server-side include isn't visible via HTTP, it's immaterial.)
posted by maschnitz at 1:04 AM on August 22, 2006


Google's Info Page for Webmasters. Google crawls my site once a week, roughly, and finds new content daily. What you're wanting are Sitemaps. If you're using Wordpress or Movable Type or another such blog/content management system, look for plugins to create Google sitemaps.
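If there's no plugin for your setup, a sitemap is simple enough to generate by hand. Here is a minimal sketch using Python's standard library. It assumes the sitemaps.org 0.9 namespace, which may not be the schema Google's Sitemaps tool expected when this thread was written, and the URLs and the function name are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified) pairs; last_modified may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # Dates are written as given, e.g. "2006-08-22".
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs for illustration only.
pages = [
    ("http://www.example.com/", "2006-08-22"),
    ("http://www.example.com/about/", None),
]
print(build_sitemap(pages))
```

The output would be saved as a file on the site and submitted through the Sitemaps account.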
posted by jeversol at 9:39 AM on August 22, 2006


Have you tried Google Sitemaps?
posted by monju_bosatsu at 9:41 AM on August 22, 2006


Yes, I've tried SiteMaps, and I said so in the original post!
posted by AmbroseChapel at 5:05 PM on August 22, 2006


>if, as I imagine, the server-side include isn't visible via HTTP, it's immaterial

The soi-disant "SEO expert" I spoke to said that Google tests for new content by getting HTTP headers only, and that if a page had SSIs, which change, but the framework of the page itself didn't, then Google wouldn't notice the content-length changing. Which is ... an interesting theory.

He also advised me to change the extension of the page back to ".shtml" because then Google would know it was dynamically generated. But that doesn't make sense because, as I said above, Google only sees "domain.name.com/dir/", not "domain.name.com/dir/index.html".
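To make the headers-only theory concrete, here is a rough sketch of what "checking for changes from the headers alone" could look like. It says nothing about what Googlebot really does; the function name, the choice of headers and the sample values are all assumptions for illustration.

```python
# Headers a crawler could plausibly compare between two visits (an assumption).
SIGNALS = ("ETag", "Last-Modified", "Content-Length")

def changed_by_headers(old, new):
    """Guess whether a page changed using only response headers.

    Returns True or False when at least one signal header is present in
    both responses, and None when the headers give nothing to compare.
    Header names are matched exactly as written in SIGNALS.
    """
    compared = [old[name] != new[name] for name in SIGNALS
                if name in old and name in new]
    if not compared:
        return None
    return any(compared)

# Hypothetical header sets from two visits to a static page.
static_before = {"Content-Length": "5120", "Last-Modified": "Mon, 17 Jul 2006 10:00:00 GMT"}
static_after = {"Content-Length": "5344", "Last-Modified": "Tue, 22 Aug 2006 09:00:00 GMT"}

# Hypothetical header sets from a page assembled on the server, assuming
# the server sends neither a length nor a modification date for it.
dynamic_before = {"Content-Type": "text/html"}
dynamic_after = {"Content-Type": "text/html"}

print(changed_by_headers(static_before, static_after))
print(changed_by_headers(dynamic_before, dynamic_after))
```

If the server sends no length or date for the assembled page, a headers-only check has nothing to go on either way, so the interesting question is which headers the server really sends for that page.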
posted by AmbroseChapel at 5:10 PM on August 22, 2006


Guess what? Finally! The "lorem ipsum" version of the site is gone from Google's index.

I filled out the "add my site" form last night one last time and put "PLEASE! You indexed the site before it was ready!" in the comments field.

I also changed the front page so that it's not blank, as in "empty page, instant redirect", but "page containing heading tags, brief text and important keywords, ten-second redirect".

I also poured chicken blood in a pentacle, lit thirteen black candles and danced around counterclockwise chanting for a while*.

Who knows which of these actually worked?

Seriously, I don't know which worked, but I suspect it was joining up with SiteMaps. The site got reindexed within a week or so of that.

But if this thread has taught me anything, it's that nobody really knows, and the combination of desperate need and a complete absence of hard knowledge is the perfect breeding ground for superstition, folk legend and snakeoil sales. By which I don't mean you guys here of course.

* not actually. My wife wouldn't let me.
posted by AmbroseChapel at 5:21 PM on August 22, 2006


Cool, glad my late-night gibberish could help. Sorta.

There are people who know this stuff. I've met them. They spend their days reverse engineering Google's spider. They set out test sites for the spider to crawl. Then they watch the index to see what makes it through.

But they are diamonds in the SEO rough. They get paid accordingly.

Google does not like being gamed. Yet there is huge financial motive to game them. So there's always going to be this escalating war between Google, and the spammers and shady SEO folks. Webmasters are the collateral damage.
posted by maschnitz at 5:52 PM on August 22, 2006


>they are diamonds in the SEO rough

I'll say.

OK I just logged into SiteMaps and all the information in it is out of date. SiteMaps itself still only seems to know about the "lorem ipsum" version of the site, even now that the correct version is in Google's index. Curiouser and curiouser.
posted by AmbroseChapel at 6:46 PM on August 22, 2006

