Google News has dropped my site as a news source! Help!
July 24, 2006 1:30 PM
My organization's web site was recently redesigned (late May), and since then, Google News has dropped us as a news source! The Google Gods are vain and capricious! What can I do?
potsmokinghippieoverlords.org (not its real name, of course) recently redesigned its web site. Not only that, but we switched to a new web host, so the IP address changed as well.
Before the grand relaunch of the site, our press releases would be posted to our site and picked up by Google News within a couple of hours. All the press hippies were happy! (We had a good two-year run of hits like this.)
Yet ever since the fateful relaunch in mid-May, we've had one (1) press release picked up by Google News. I was expecting some lag time between getting crawled by bots and search engine spiders, but it's nearly August! The press hippies are worried!
I went to this page to send them a message in late June, telling them "hey, we've changed our web site, this is our new IP, please count us as a news source again" but no reply.
We're still getting hits from wire service stories and such -- but why no more love from Google News? Any suggestions? Similar experiences?
The medium-sized American metropolitan daily newspaper I write for seems to slide in and out of Google News. I've asked our online folks why that happens, but have never received a satisfactory answer.
...and we're out. I just searched on my own name and didn't get any relevant hits.
You could try asking Google.
posted by Bitstop at 1:54 PM on July 24, 2006
but why no more love from Google News
I have no solutions for you, but rest assured you are not alone. A site I write for on a regular basis has been having the GN fight for going on half a year now with little success. GN is inconsistent in its policies (other similar sites with less unique news content are listed and we are not), and it seems the key to success is persistence.
posted by phearlez at 2:21 PM on July 24, 2006
The change in IP address should be utterly inconsequential; that is the entire point of DNS. What raises red flags for me is the fact that the site was redesigned. I don't know how Google News gathers its material, but if it's using screen scraping then a major change in the structure of the HTML could render the existing scraping recipe worthless. If that were the case, it would take an engineer (whoever added the site and scraping recipe in the first place) to look into it, which I assume takes a while.
I know that Google News says that it's all machine-generated but unless it works entirely by RSS/Atom then there has to be some initial hand tuning by a live person to get a site parsed by their scraper.
This also means that if your site has a feed you need to check to be sure it's still working and produces valid XML. You might also want to verify that the auto-discovery links still exist in the page headers and/or you employ a redirect if the feed's URL changed.
posted by Rhomboid at 2:25 PM on July 24, 2006
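A minimal sketch of the two checks suggested above, using only the Python standard library: does the page header still carry a feed auto-discovery link, and does the feed still parse as XML? The function names and the sample markup are hypothetical, and the parse only proves the XML is well-formed, not that it is a valid RSS or Atom document.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# MIME types conventionally used for feed auto-discovery links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> feed tags."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            self.feed_urls.append(href)


def find_feed_links(html_text):
    """Return the feed URLs advertised in an HTML page (hypothetical helper)."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feed_urls


def is_well_formed_xml(feed_text):
    """Return True if the feed text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(feed_text)
    except ET.ParseError:
        return False
    return True
```

Run `find_feed_links` against the redesigned home page and `is_well_formed_xml` against the feed body; an empty list or a `False` would point at something the redesign broke.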
Yep, I think that might be it, if you don't publish a feed. Google would have been screen-scraping your site, and if you've just redesigned it then their screen-scraping won't work any more.
Quick solution: start publishing a feed if you don't already. Google likes those.
(I've actually found Google receptive when you e-mail them, so you should try it.)
posted by randomination at 3:22 PM on July 24, 2006
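For anyone wanting to try the feed suggestion, here is a rough sketch of building a minimal RSS 2.0 document with the Python standard library. The `build_rss` helper, the item dictionary keys, and the description text are all assumptions made for illustration; nothing in this thread confirms which feed format, if any, Google News reads.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(site_title, site_url, items):
    """Return a minimal RSS 2.0 feed as a string (hypothetical helper).

    Each item is a dict with "title", "link", and "published" keys,
    where "published" is a timezone-aware datetime.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = f"Press releases from {site_title}"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # Reuse the permanent article URL as the item's unique identifier.
        ET.SubElement(node, "guid").text = item["link"]
        # RSS 2.0 expects RFC 822 style dates.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")
```

The page header would also need a `<link rel="alternate" type="application/rss+xml" ...>` tag pointing at wherever the generated file is served.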
Response by poster: Hmm. We do not have a feed in place yet. Sounds like we should expedite this.
I emailed Google in late June and heard nothing, as I mentioned. No harm in trying again, I suppose.
Thanks for all the replies!
posted by potsmokinghippieoverlord at 4:07 PM on July 24, 2006
Response by poster: UPDATE:
I wrote to the Google folks again, and I just received this response:
Thank you for bringing this to our attention. After some investigation, we've found that our system can't crawl your articles due to the format of their URLs. In order for our system to crawl your content, the article URLs can't contain only an isolated four-digit number that resembles a year. Please keep in mind that each of your article URLs must contain a number consisting of at least three digits to be crawlable by our system.
For example, if the only digits in your article URL are "2006," our system may not be able to crawl your content. Once you update your URL structure, our system should automatically begin crawling your articles.
Regards,
The Google Team
So apparently Google is choking on our new URLs, which are now:
http://www.potsmokinghippieoverlord.org/news/press/2006/google-hates-me.html
where it used to be:
http://www.potsmokinghippieoverlord.org/news/display.html?ID=1189
I'm still a bit confused.
posted by potsmokinghippieoverlord at 10:51 AM on July 25, 2006
That is an amazing technical shortcoming for Google News to have. Remarkable.
So, are you going to redo your site?
posted by alms at 7:01 PM on July 25, 2006
Response by poster: alms: I'm trying to get clarification from Google as to precisely what would work.
I got pointed to this link, which features this:
Technical requirements:
* In order for the Google crawler to correctly gather articles, each page that displays an article's full text needs to have a unique URL that does not change. Google cannot include sites in Google News that display multiple articles at the same URL.
* The URL for each article must contain a unique number consisting of at least three digits.
* Keep in mind that Google cannot include sites for which the URL of the main page includes a date. URLs with dates in them often change on a daily or weekly basis. This prevents Google from crawling the site for new content, as Google is unable to detect the most current URL to be crawled.
* Google's automated crawler is currently best able to crawl regular HTML links. Google is unable to crawl image links or links embedded in JavaScript.
posted by potsmokinghippieoverlord at 4:09 PM on July 26, 2006
Response by poster: And yet Google News does not seem to have a problem with NYT links:
http://www.nytimes.com/2006/07/26/us/26rangers.html?ref=us
posted by potsmokinghippieoverlord at 4:57 PM on July 26, 2006
Response by poster: Google replies to my request for clarification, and the saga continues:
Thank you for your reply. As we mentioned in our previous email, if the only digits in your article URL resemble a year (e.g. "1999" or "2006") our system may not be able to crawl your content.
For example, our news crawler wouldn't crawl articles with the following URLs:
http://www.potsmokinghippieoverlords.org/news/display.html?ID=2006
http://www.potsmokinghippieoverlords.org/news/display.html?ID=yr2006
It would crawl these pages:
http://www.potsmokinghippieoverlords.org/news/display.html?ID=2006/15/04
http://www.potsmokinghippieoverlords.org/news/display.html?ID=yr2006/15/04
Additionally, in order to have your articles crawled by Google News, your article URLs must contain a number consisting of at least three digits. This only applies for the inclusion of your content in Google News.
If you're able to restructure each of your URLs, Google News should begin crawling your content automatically.
Regards,
The Google Team
posted by potsmokinghippieoverlord at 12:24 PM on July 27, 2006
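The rules quoted in this thread can be read as a simple URL check, sketched below. This is only a guess at the behavior described in Google's emails, not their actual crawler logic: the function name `looks_crawlable`, the choice to ignore the hostname, and the definition of "resembles a year" as 1900 to 2099 are all assumptions.

```python
import re
from urllib.parse import urlparse

# Assumption: "resembles a year" means a four-digit number from 1900 to 2099.
YEAR_LIKE = re.compile(r"(19|20)\d{2}")


def looks_crawlable(url):
    """Guess whether a URL satisfies the two digit rules quoted in the thread."""
    parts = urlparse(url)
    # Assumption: only the path and query string count, not the hostname.
    digit_runs = re.findall(r"\d+", parts.path + "?" + parts.query)
    # Rule 1: the URL must contain a number of at least three digits.
    if not any(len(run) >= 3 for run in digit_runs):
        return False
    # Rule 2: the only digits in the URL must not be a lone year-like number.
    if len(digit_runs) == 1 and YEAR_LIKE.fullmatch(digit_runs[0]):
        return False
    return True


if __name__ == "__main__":
    for candidate in (
        "http://www.potsmokinghippieoverlord.org/news/press/2006/google-hates-me.html",
        "http://www.potsmokinghippieoverlord.org/news/display.html?ID=1189",
        "http://www.nytimes.com/2006/07/26/us/26rangers.html?ref=us",
    ):
        print(candidate, looks_crawlable(candidate))
```

Under these assumptions the sketch agrees with every example given so far: the redesigned `/news/press/2006/` URL fails, while the old `ID=1189` URL, the `ID=2006/15/04` examples, and the NYT link pass because they carry digits beyond a lone year.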
Response by poster: OK -- heard from the Google Newsheads again. I asked if we changed our folder from "2006" to "006" if that would work, and I asked if mm/dd was needed:
It appears that our system should be able to crawl URLs with the same formatting as the example you provided. Also, while we do not require a mm/dd stamp for article's URLs, please be aware that we can't guarantee that we will crawl all of the content on a news site.
So! We'll make a slight adjustment, wait for the Google spiders, and see what gets crawled!
posted by potsmokinghippieoverlord at 3:14 PM on July 28, 2006
Response by poster: **UPDATE**
(not that anyone cares, but in case someone stumbles across this thread...)
Made the adjustment in the URL (the file structure is the same), so now press releases read:
http://www.potsmokinghippieoverlord.org/news/006/newsworthiness.html
We published a new story online today and Google News picked it up within two hours! Hooray!!!
posted by potsmokinghippieoverlord at 6:27 PM on August 3, 2006
It’s tough to swallow how much influence Google has over your site architecture.
posted by sonunet at 4:30 PM on January 16, 2007
This thread is closed to new comments.
posted by alms at 1:40 PM on July 24, 2006