Google News has dropped my site as a news source! Help!
July 24, 2006 1:30 PM

My organization's web site was recently redesigned (late May), and since then, Google News has dropped us as a news source! The Google Gods are vain and capricious! What can I do?

potsmokinghippieoverlords.org (not its real name, of course) recently redesigned its web site. Not only that, but we switched to a new web host, so the IP address changed as well.

Before the grand relaunch of the site, our press releases would be posted to our site and picked up by Google News within a couple of hours. All the press hippies were happy! (We had a good two-year run of hits like this.)

Yet ever since the fateful relaunch in mid-May, we've had one (1) press release picked up by Google News. I was expecting some lag time between getting crawled by bots and search engine spiders, but it's nearly August! The press hippies are worried!

I went to this page to send them a message in late June, telling them "hey, we've changed our web site, this is our new IP, please count us as a news source again," but got no reply.

We're still getting hits from wire service stories and such -- but why no more love from Google News? Any suggestions? Similar experiences?
posted by potsmokinghippieoverlord to Technology (14 answers total)
 
Can you tell us more about your organization? Is it an actual media outlet, or is it some other kind of private organization (for-profit company, NGO, etc.) that occasionally issues press releases?
posted by alms at 1:40 PM on July 24, 2006


The medium-sized American metropolitan daily newspaper I write for seems to slide in and out of Google News. I've asked our online folks why that happens, but have never received a satisfactory answer.

...and we're out. I just searched on my own name and didn't get any relevant hits.

You could try asking Google.
posted by Bitstop at 1:54 PM on July 24, 2006


but why no more love from Google News

I have no solutions for you, but rest assured you are not alone. A site I write for on a regular basis has been fighting with GN for going on half a year now with little success. GN is inconsistent in its policies (other similar sites with less unique news content are listed and we are not), and it seems the key to success is persistence.
posted by phearlez at 2:21 PM on July 24, 2006


The change in IP address should be utterly inconsequential; that is the entire point of DNS. What would raise red flags for me is the fact that the site was redesigned. I don't know how Google News gathers its material, but if it's using screen scraping, then a major change in the structure of the HTML could render the existing scraping recipe worthless. If that were the case, it would take an engineer (whoever added the site and scraping recipe in the first place) to look into it, which I assume takes a while.

I know that Google News says that it's all machine-generated but unless it works entirely by RSS/Atom then there has to be some initial hand tuning by a live person to get a site parsed by their scraper.

This also means that if your site has a feed you need to check to be sure it's still working and produces valid XML. You might also want to verify that the auto-discovery links still exist in the page headers and/or you employ a redirect if the feed's URL changed.
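
For anyone who wants to run those two checks locally, here is a minimal sketch in Python using only the standard library. It assumes you already have the feed text and the page HTML as strings; the names `is_well_formed_xml`, `FeedLinkFinder`, and `find_feed_links` are made up for illustration. It only tests XML well-formedness, not whether the feed is valid RSS or Atom.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# MIME types commonly used for feed auto-discovery links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_well_formed_xml(text):
    """Return True if text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate" type="application/rss+xml" ...> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])


def find_feed_links(html):
    """Return the feed URLs advertised by auto-discovery links in the page."""
    finder = FeedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.feeds
```

If `find_feed_links` returns an empty list for the redesigned home page, the auto-discovery links probably did not survive the redesign.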
posted by Rhomboid at 2:25 PM on July 24, 2006


Yep, I think that might be it, if you don't publish a feed. Google would have been screen-scraping your site, and if you've just redesigned it then their screen-scraping won't work any more.

Quick solution—start publishing a feed if you don't already. Google likes those.

(I've actually found Google receptive when you e-mail them, so you should try it.)
posted by randomination at 3:22 PM on July 24, 2006


Response by poster: Hmm. We do not have a feed in place yet. Sounds like we should expedite this.

I emailed Google in late June and heard nothing, as I mentioned. No harm in trying again, I suppose.

Thanks for all the replies!
posted by potsmokinghippieoverlord at 4:07 PM on July 24, 2006


Response by poster: UPDATE:

I wrote to the Google folks again, and I just received this response:

Thank you for bringing this to our attention. After some investigation, we've found that our system can't crawl your articles due to the format of their URLs. In order for our system to crawl your content, the article URLs can't contain only an isolated four-digit number that resembles a year. Please keep in mind that each of your article URLs must contain a number consisting of at least three digits to be crawlable by our system.

For example, if the only digits in your article URL are "2006," our system may not be able to crawl your content. Once you update your URL structure, our system should automatically begin crawling your articles.

Regards,
The Google Team


So apparently Google is choking on our new URLs, which are now:

http://www.potsmokinghippieoverlord.org/news/press/2006/google-hates-me.html

where it used to be:

http://www.potsmokinghippieoverlord.org/news/display.html?ID=1189

I'm still a bit confused.
posted by potsmokinghippieoverlord at 10:51 AM on July 25, 2006


That is an amazing technical shortcoming for Google News to have. Remarkable.

So, are you going to redo your site?
posted by alms at 7:01 PM on July 25, 2006


Response by poster: alms: I'm trying to get clarification from Google as to precisely what would work.

I got pointed to this link, which features this:
Technical requirements:

* In order for the Google crawler to correctly gather articles, each page that displays an article's full text needs to have a unique URL that does not change. Google cannot include sites in Google News that display multiple articles at the same URL.

* The URL for each article must contain a unique number consisting of at least three digits.

* Keep in mind that Google cannot include sites for which the URL of the main page includes a date. URLs with dates in them often change on a daily or weekly basis. This prevents Google from crawling the site for new content, as Google is unable to detect the most current URL to be crawled.

* Google's automated crawler is currently best able to crawl regular HTML links. Google is unable to crawl image links or links embedded in JavaScript.
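
The last requirement is something a site owner can audit before waiting on the crawler. Below is a rough sketch in Python (standard library only) that sorts the links on a page into ordinary text links and links a simple crawler might skip. It is only a guess at what "image links" and "links embedded in JavaScript" mean: here they are treated as `<a>` tags containing nothing but an `<img>`, and `href` values starting with `javascript:`. The names `LinkAudit` and `audit_links` are hypothetical, and none of this reflects how Google's crawler is actually implemented.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Sort <a> links into plain text links and ones a basic crawler may skip."""

    def __init__(self):
        super().__init__()
        self.plain = []       # hrefs of ordinary text links
        self.skipped = []     # (href, reason) pairs
        self._href = None     # href of the <a> currently open, if any
        self._text = ""       # text seen inside the open <a>
        self._has_img = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self._href, self._text, self._has_img = a["href"], "", False
        elif tag == "img" and self._href is not None:
            self._has_img = True

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        href = self._href
        if href.strip().lower().startswith("javascript:"):
            self.skipped.append((href, "javascript link"))
        elif self._has_img and not self._text.strip():
            self.skipped.append((href, "image-only link"))
        else:
            self.plain.append(href)
        self._href = None


def audit_links(html):
    """Return (plain_links, skipped_links) for one page of HTML."""
    audit = LinkAudit()
    audit.feed(html)
    audit.close()
    return audit.plain, audit.skipped
```

Running `audit_links` on the press release index page should show whether each article is reachable through at least one plain link.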
posted by potsmokinghippieoverlord at 4:09 PM on July 26, 2006


Response by poster: And yet Google News does not seem to have a problem with NYT links:

http://www.nytimes.com/2006/07/26/us/26rangers.html?ref=us
posted by potsmokinghippieoverlord at 4:57 PM on July 26, 2006


Response by poster: Google replies to my request for clarification, and the saga continues:
Thank you for your reply. As we mentioned in our previous email, if the only digits in your article URL resemble a year (e.g. "1999" or "2006") our system may not be able to crawl your content.

For example, our news crawler wouldn't crawl articles with the following URLs:
http://www.potsmokinghippieoverlords.org/news/display.html?ID=2006
http://www.potsmokinghippieoverlords.org/news/display.html?ID=yr2006

It would crawl these pages:
http://www.potsmokinghippieoverlords.org/news/display.html?ID=2006/15/04
http://www.potsmokinghippieoverlords.org/news/display.html?ID=yr2006/15/04

Additionally, in order to have your articles crawled by Google News, your article URLs must contain a number consisting of at least three digits. This only applies for the inclusion of your content in Google News.

If you're able to restructure each of your URLs, Google News should begin crawling your content automatically.

Regards,
The Google Team
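
One way to read the two replies together, written out as a rough Python sketch: a URL passes if its path or query contains a number of at least three digits, and the digits in it are not all year-like numbers. This is an interpretation of the emails, not Google's actual logic. The name `looks_crawlable` and the `YEAR_LIKE` pattern (four digits starting with 19 or 20) are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: "resembles a year" means a four-digit number starting with 19 or 20.
YEAR_LIKE = re.compile(r"(19|20)\d\d")


def looks_crawlable(url):
    """Guess whether a URL satisfies the digit rules described in the emails."""
    parts = urlsplit(url)
    # Only look at the path and query string, not the host name.
    runs = re.findall(r"\d+", parts.path + "?" + parts.query)
    has_long_number = any(len(run) >= 3 for run in runs)
    only_years = all(YEAR_LIKE.fullmatch(run) for run in runs)
    return has_long_number and not only_years
```

Under this reading, the four example URLs in the email come out the way Google describes, the old `display.html?ID=1189` style passes, and the new `/news/press/2006/google-hates-me.html` style fails. It would also explain why the NYT URL is fine: it carries a month and day alongside the year.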
posted by potsmokinghippieoverlord at 12:24 PM on July 27, 2006


Response by poster: OK -- heard from the Google Newsheads again. I asked whether changing our folder from "2006" to "006" would work, and whether mm/dd was needed:
It appears that our system should be able to crawl URLs with the same formatting as the example you provided. Also, while we do not require a mm/dd stamp for article's URLs, please be aware that we can't guarantee that we will crawl all of the content on a news site.
So! We'll make a slight adjustment, wait for the Google spiders, and see what gets crawled!
posted by potsmokinghippieoverlord at 3:14 PM on July 28, 2006


Response by poster: **UPDATE**

(not that anyone cares, but in case someone stumbles across this thread...)

Made the adjustment in the URL (the file structure is the same), so now press releases read:

http://www.potsmokinghippieoverlord.org/news/006/newsworthiness.html

We published a new story online today and Google News picked it up within two hours! Hooray!!!
posted by potsmokinghippieoverlord at 6:27 PM on August 3, 2006


It’s tough to swallow how much power Google has over your site architecture.
posted by sonunet at 4:30 PM on January 16, 2007

