Google's Wayback Machine
February 16, 2006 8:19 PM

Can anyone explain the strange results received when doing a search on the keyword 'http' on Google?

The results seem like a "top 10 of the web 1998" rather than a list of pages referencing the http protocol (beyond the obvious transport). Results when searching on other search engines look like what one would expect.

Microsoft as number 1? Altavista as number 3? My Excite as number 8??
posted by eschatfische to Computers & Internet (21 answers total)
 
It is a top 10 of sorts: think of all the links that include "http" (viz, all of them) -- you're seeing the number one hit for links including http, which is ... ta-da! Microsoft.
posted by bonaldi at 8:25 PM on February 16, 2006


Response by poster: That would mean that *Altavista* has more links to it than anyone but Microsoft and the W3 Consortium. I'd be shocked and amazed if that were the case here in 2006.
posted by eschatfische at 8:27 PM on February 16, 2006


Well, it sure seems to be the case. Try a search for www -- it's even higher there. I think they're taking google links out of the loop though -- the search for http definitely used to have it as #2.
posted by bonaldi at 8:29 PM on February 16, 2006


Altavista is very popular as a translation tool, thanks to babelfish.

Just think of all the times people link to pre-translated pages.
posted by clord at 8:31 PM on February 16, 2006


Response by poster: I still don't think so. Results of a few searches:

link:altavista.com - 61,600 results
link:www.altavista.com - 61,600 results
link:babelfish.altavista.com - 23,400 results

link:yahoo.com - 978,000 results
link:www.yahoo.com - 978,000 results

That's over an order of magnitude. Yet AltaVista beat Yahoo in the rankings for http, and is running neck and neck at number 2 for www. Note that it's easy to find sites containing www which come up with far more than Altavista's 61,600 results.
posted by eschatfische at 8:43 PM on February 16, 2006


It's not sites with http in the url, or at least it's not much that. It's sites that Google thinks are about this "http" thing, just like a Google search for "car" gets you sites about cars, not just sites with "car" in the URL.

Most importantly, it's sites that other pages link to with "http" in the text of the link, the way so many pages link to Microsoft. If people tend to link to you this way, you won't get raised in the results for "http", but if people tend to link to you like this -- http://www.metafilter.com/ -- then you'll get raised.

So that means there are a lot of links to Microsoft which use literal URLs. That's not surprising; they're the 600-lb gorilla, and every security advisory, Windows tips and tricks page, and so on in the world probably includes some literal Microsoft URLs, especially when you count in mailing lists and forums and so on. The second hit, the W3C, is no surprise. AltaVista was the dominant search engine for longer than Google has existed, and Yahoo before that:

Results 1 - 10 of about 1,790,000 for http "www.yahoo.com".
Results 1 - 10 of about 1,950,000 for http "www.altavista.com"

(Using a "link:" search doesn't give useful results there -- remember, we're after links that have the word "http" in the link text.) Between 1995 and 2001, every "How do I use the Internet?" document in the world probably contained a link to Altavista and Yahoo, and lots of them would have URLs as the link text.

And so on: lots of people link to literal CNN URLs, and to Amazon items, using the URL as the link text (again, I bet web forums and blogs help a lot here), then a couple more search engines, and then Adobe because people give out the URL of Acrobat Reader, and so on.

A search for "www" has similar results for the same reason: slightly different ordering, but again lots and lots of websites that people have linked to using the URL as the text of the link.
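
A minimal sketch of that idea, assuming a toy index that only counts inbound links whose link text contains the search term. The link data and the names `tokenize` and `rank_by_anchor_text` are made up for illustration; Google's real ranking is not public and certainly uses more than this.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_by_anchor_text(links, term):
    """Count, per target site, the inbound links whose link text contains `term`."""
    counts = Counter()
    for target, anchor in links:
        if term in tokenize(anchor):
            counts[target] += 1
    # most_common() sorts by count, highest first
    return counts.most_common()

# Hypothetical link data: (target site, text of the link)
links = [
    ("microsoft.com", "http://www.microsoft.com/"),
    ("microsoft.com", "http://www.microsoft.com/security"),
    ("microsoft.com", "Microsoft"),
    ("altavista.com", "http://www.altavista.com"),
    ("yahoo.com", "Yahoo"),
]

print(rank_by_anchor_text(links, "http"))
```

Links that use a name as the link text ("Microsoft", "Yahoo") add nothing to the count for "http"; only the literal-URL links do.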
posted by mendel at 8:58 PM on February 16, 2006


Similar: Where do the most people link when they say "click here?"
posted by mbrubeck at 9:03 PM on February 16, 2006


Response by poster: Great answer, Mendel. Still, I don't buy it. Look at this:

Results 1 - 10 of about 11,000,000 for http "www.ebay.com".
Results 1 - 10 of about 9,530,000 for http "www.whitehouse.gov".
Results 1 - 10 of about 1,970,000 for http "www.imdb.com".

It's sites that Google thinks are about this "http" thing, just like a Google search for "car" gets you sites about cars, not just sites with "car" in the URL.

The reason I think this is so weird is that these are obviously not sites about http. There are loads and loads of relevant pages about the HTTP protocol -- and if you search for http on Yahoo or Altavista or any other search engine, you'll find them. The results on Google genuinely don't seem relevant either in content (they're mostly just plain not about http except the w3c links) or rank (they don't seem to be the top pages containing the phrase http, nor the top sites being linked to with the phrase http).
posted by eschatfische at 9:13 PM on February 16, 2006


No, they're not "sites about http" in that fluffy sense; I shouldn't have used that phrase. They're sites that Google has given a high rank in terms of how "http" is connected to their site. The main thing Google uses is pagerank: if a lot of high-ranking sites link to a site using a word or a phrase, that site gets ranked high for that word or phrase.

That's how googlebombs work, for example. A Google search for "miserable failure" gets you George Bush's biography on the White House site because a lot of anti-Bush people have intentionally linked to that page with those words as the text of the link to accomplish that result.

The results of the "http" search are an accidental googlebomb, because a lot of people put the word "http" in links, and as far as google's concerned, there's no difference between the text "http://www.microsoft.com" and the text "http www microsoft com".
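
A quick way to see that equivalence, assuming a tokenizer that simply splits on anything that isn't a letter or digit. Google's actual tokenizer isn't public, so `tokenize` here is a hypothetical stand-in.

```python
import re

def tokenize(text):
    """Split on every run of non-alphanumeric characters, dropping empty pieces."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

url_text = tokenize("http://www.microsoft.com")
plain_text = tokenize("http www microsoft com")

# Under this assumption both link texts become the same list of words.
print(url_text)
print(url_text == plain_text)
```

With a tokenizer like this, every literal URL used as link text contributes the word "http" to whatever page it points at.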

These sites aren't "about" http like the White House biography of Bush isn't "about" a miserable failure, but Google has been fooled into thinking otherwise in both cases.

As for your ebay results: result numbers in Google over 100k or so can be off by a couple orders of magnitude. I only included those results to refute the previous comment about searching with "link:". Ebay not being listed means that Google doesn't think it's as important on a search for "http" as Microsoft using its algorithms, which prioritize link text in pagerank but aren't limited to only that.
posted by mendel at 9:26 PM on February 16, 2006


Here are some more examples of accidental googlebombs. Searching for txt, the first result is a text file, the RFC that defines URIs; obviously an extremely popular document, and one that people would tend to link to with the URL as the text of the link (in a "You can find that here" sort of link). Or shtml, where the first result is an important SEC document which happens to have an shtml extension. phtml does the same thing. www2 turns up a bunch of sites that happen to be on machines named "www2". com gives you results similar to "http" except they're all .com domains, and index.html gives you results like "http". net is a neat one: the first result, Microsoft's .NET site, is both something that would get linked to with the word "net" a lot and is about "net", but most of the rest of the first page of results are clearly popular sites that happen to be at .net domains.

In all of those, the only single principle that explains why those documents would come up is that people link to them using their URL as the text of the link. Since that's the way we know Google's pagerank algorithm works in the first place, it's pretty clear to me that that's what's happening.

If you still don't buy it: what do you think Google's using instead of pagerank?
posted by mendel at 9:42 PM on February 16, 2006


Best answer: I suspect older pages tend to have higher page rank, as they've been around longer to collect more inbound links. A lot of those older pages then link to old favourites like Altavista. This in turn boosts its page rank to levels unreachable by the modern MySpaces and Diggs. Kind of like a pyramid scheme.

Also contributing to the high ranking of old pages for http and www was that in the olden days people were more likely to use a full URL for the link text, because webmasters and browsers were less comfortable with the hyperlink concept. (I'm theorizing here.)
posted by teg at 8:39 AM on February 17, 2006


To illustrate that point, look at the results of Google for "search engine".

I'd never thought to do that before this thread. Neat!
posted by moss at 9:03 AM on February 17, 2006


Best answer: Older pages have a lot going for them in terms of Google. One main reason for that is that it's hard for spammers to get ahold of old pages -- they're either locked, don't exist, or they're still in use.
posted by chaz at 10:28 AM on February 17, 2006


Response by poster: Aha! I didn't know that PageRank favored sites that had older domain registrations, so that (combined with the likely older age of the sites that have linked to these top-ranked sites) would explain the bias towards older sites in the results.

Funny how PageRank actually seems to be reducing relevance here.
posted by eschatfische at 5:35 PM on February 17, 2006


I think teg's right. People have gotten more savvy about how to use links in a paragraph of text. Back in the 90's, there were a lot more links that looked like "http://www.metafilter.com" and a lot fewer that looked like "Metafilter."

If websites that were created during Altavista's heyday are more likely to use "http" in link text than newer sites, then that could help explain Altavista's position in the results.
posted by nebulawindphone at 5:47 PM on February 17, 2006


(Lemme rephrase that. mendel et al are definitely right. I just think teg may be onto something too.)
posted by nebulawindphone at 5:49 PM on February 17, 2006


The second result of http://a
posted by erebora at 10:14 PM on February 17, 2006


mendel: In all of those, the only single principle that explains why those documents would come up is that people link to them using their URL as the text of the link. Since that's the way we know Google's pagerank algorithm works in the first place, it's pretty clear to me that that's what's happening.

actually, although i agree with the spirit of your comments, pagerank doesn't have anything to do with anchor text (which you've called link text). Pagerank infers the global importance of a page on the web by the number and quality (recursively measured by pagerank) of pages that link to that page. In theory, only the presence or absence of a link matters, not the text attached to it. Also, pagerank is not the same as counting the number of links to a page, since the pagerank of the pages that link to a page makes quite a bit of difference to the eventual score.
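
A rough sketch of that recursive definition, using the textbook power-iteration form of pagerank with a damping factor of 0.85. The toy graph, the page names, and the parameter values are assumptions for illustration, not anything Google has published about its production system.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to.

    Every link target must also appear as a key in graph.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # A page splits its own rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over everyone.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy web: "a" has one inbound link, but it comes from the
# heavily linked "hub"; "b" has two inbound links from obscure pages.
graph = {
    "hub": ["a"],
    "a": ["hub"],
    "x1": ["hub", "b"],
    "x2": ["hub", "b"],
    "x3": ["hub"],
    "b": ["hub"],
}

rank = pagerank(graph)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this toy graph "a" ends up well above "b" despite having fewer inbound links, which is the difference between pagerank and plain link counting. Note that the text of the links never enters the calculation.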

However, pagerank isn't the only thing Google uses to rank pages. Ranking pages by how similar (roughly speaking, how many times the same words co-occur) they are to the anchor text pointing to that page is a pretty well-known technique, one that Google certainly use.

As suggested earlier in the thread, the age of pages on the web is likely to be something that Google explicitly consider as well. A patent that Google applied for has this to say:

29. The method of claim 26, wherein the scoring the document includes: determining an age of each link pointing to the document, determining an age distribution associated with the links based on the ages of the links, and scoring the document based, at least in part, on the age distribution associated with the links.
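
The claim doesn't say how the age distribution is turned into a score, so the following is only a guess at the shape of the computation. The one-year buckets and the `age_score` formula (mean bucket index, so older links count for more) are purely hypothetical, as are the dates.

```python
from datetime import date

def link_ages_in_days(link_dates, today):
    """Age of each inbound link, in days, given the date it was first seen."""
    return [(today - d).days for d in link_dates]

def age_distribution(ages, bucket_days=365):
    """Histogram: bucket index (roughly whole years) -> number of links."""
    dist = {}
    for age in ages:
        bucket = age // bucket_days
        dist[bucket] = dist.get(bucket, 0) + 1
    return dist

def age_score(dist):
    """HYPOTHETICAL scoring rule: the mean bucket index of the links."""
    total = sum(dist.values())
    if total == 0:
        return 0.0
    return sum(bucket * count for bucket, count in dist.items()) / total

today = date(2006, 2, 19)
# Made-up link dates for an old site and a new one.
old_site_links = [date(1997, 1, 1), date(1998, 6, 1), date(1999, 3, 1)]
new_site_links = [date(2005, 6, 1), date(2005, 12, 1), date(2006, 1, 15)]

old_score = age_score(age_distribution(link_ages_in_days(old_site_links, today)))
new_score = age_score(age_distribution(link_ages_in_days(new_site_links, today)))
print(old_score, new_score)
```

Under a rule like this, a site whose inbound links mostly date from the 90s would score higher than one whose links are all recent, which would fit the bias towards older sites discussed earlier in the thread.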

My point in all of this is that the big search engines use a lot of different techniques to rank documents, and the results that you see are a combination of them all. In fact, MSN have so many different factors in their ranking scheme that they use a neural net to combine them. People place far too much importance on pagerank, as it's not the only thing that makes a page rank highly.
posted by nml at 6:51 PM on February 19, 2006


From what I know about Google, mendel is right (or at least the closest to being right.)

nml, I think you're hung up too much on the word 'pagerank'. You're defining it as the specific number that results from the specific algorithm that (if you're familiar with Google voodoo) we know and love.

I think mendel is referring to 'pagerank' on a higher level, which would include what you're defining as pagerank.

Of course, my comment's a little late...
posted by blacklite at 2:16 AM on February 22, 2006


blacklite: nml, I think you're hung up too much on the word 'pagerank'

entirely possible ;o). I study search engines, so i'm coming at this topic from a fairly technical perspective.

blacklite: You're defining it as the specific number that results from the specific algorithm that (if you're familiar with Google voodoo) we know and love.

I understand what you're getting at, and i largely agree with mendel's comments as well. However, i'd like to point out that it wasn't me who defined pagerank that way.

I'm happy enough for 'pagerank' to be used in lieu of 'whatever order google spits results out in', i just wanted to mention that's not really what it means.
posted by nml at 4:47 PM on February 22, 2006


Related MeTa follow-up.
posted by cribcage at 8:38 PM on March 10, 2006

