<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

      <title>Comments on: How Does a Google Query Work?</title>
      <link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work/</link>
      <description>Comments on Ask MetaFilter post How Does a Google Query Work?</description>
	  	  <pubDate>Wed, 20 Jun 2007 16:01:40 -0800</pubDate>
      <lastBuildDate>Wed, 20 Jun 2007 16:01:40 -0800</lastBuildDate>
      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>

<item>
  	<title>Question: How Does a Google Query Work?</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work</link>	
  	<description>How Does a Google Query Work? How many machines does it hit? How many clusters are there? How many Google DNS servers? How many data centers? I know Google is famously secretive about this information, but I&apos;d love to understand just how the results page gets back to me, with as much detail as possible. Many people take guesses about this, but I&apos;m looking for some real concrete data.</description>
  	<guid isPermaLink="false">post:ask.metafilter.com,2008:site.65244</guid>
  	<pubDate>Wed, 20 Jun 2007 15:19:10 -0800</pubDate>
  	<dc:creator>raconteur</dc:creator>
	
	<category>google</category>
	
</item>
<item>
  	<title>By: The Deej</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#980718</link>	
  	<description>I doubt you will get a concrete answer, but the book &lt;a href=&quot;http://www.amazon.com/exec/obidos/ASIN/0553383663/metafilter-20/ref=nosim/&quot;&gt;The Google Story&lt;/a&gt; has some fascinating information about how they started some of the technical stuff, and how it has been scaled.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-980718</guid>
  	<pubDate>Wed, 20 Jun 2007 16:01:40 -0800</pubDate>
  	<dc:creator>The Deej</dc:creator>
</item>
<item>
  	<title>By: Diz</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#980752</link>	
  	<description>They&apos;re actually quite open about a lot of their stuff.  I read &lt;a href=&quot;http://labs.google.com/papers/gfs.html&quot;&gt;this&lt;/a&gt; last year in a file systems class.  It&apos;ll cover the lower-level questions of how many machines you&apos;re hitting to get data.  I wouldn&apos;t be surprised if there are papers covering the database system and load balancing stuff in &lt;a href=&quot;http://labs.google.com/papers.html&quot;&gt;here&lt;/a&gt; too.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-980752</guid>
  	<pubDate>Wed, 20 Jun 2007 16:31:22 -0800</pubDate>
  	<dc:creator>Diz</dc:creator>
</item>
<item>
  	<title>By: husky</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#980753</link>	
  	<description>If you&apos;re also interested in how PageRank works, maybe &lt;a href=&quot;http://www.smashingmagazine.com/2007/06/05/google-pagerank-what-do-we-really-know-about-it/&quot;&gt;this&lt;/a&gt; link will help.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-980753</guid>
  	<pubDate>Wed, 20 Jun 2007 16:31:23 -0800</pubDate>
  	<dc:creator>husky</dc:creator>
</item>
<item>
  	<title>By: hincandenza</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#980783</link>	
  	<description>I answered something about the scale of (generic) search engines &lt;b&gt;&lt;a href=&quot;http://ask.metafilter.com/34063/seeking-search-engine#531604&quot;&gt;here&lt;/a&gt;&lt;/b&gt;.  That thread has some very interesting information in general, as well.&lt;br&gt;
&lt;br&gt;
The short answer to give you an idea: Google is estimated to have hundreds of thousands of machines, in dozens of datacenters (&lt;i&gt;all the major search engine players typically do&lt;/i&gt;).  &lt;br&gt;
&lt;br&gt;
The number of clusters is large- several clusters in each datacenter I&apos;d presume, each cluster representing a full &amp;quot;copy&amp;quot; of the web that Google has crawled as well as the supporting data to search this information.  DNS work is either offloaded to a provider like Akamai, or done in-house, but in either case the DNS will be widely distributed so that the DNS responder is close to you (and thus your round trip time is low in resolving *.google.com).&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
As said above, some of the basic details on their size, and their core technologies like map/reduce and the GFS, are fairly well published.  Fact is, at this point there&apos;s only really one company that could compete with Google just from a financial/resource perspective in terms of trying to build what they did: Google has had years to build up their size and expertise, so simply &lt;i&gt;knowing&lt;/i&gt; how they do what they do isn&apos;t remotely enough anymore.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
The basic way a search query works for all search engines is pretty well standard, but here&apos;s the overview version:&lt;br&gt;
&lt;br&gt;
&lt;b&gt;&lt;u&gt;Crawling the web&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
Typically you have crawling servers, which walk the web robotically, following links on page after page and downloading the content.  They will gather big blocks of this content and pass it on to index servers, which will parse the downloaded documents (each document has a unique documentID, 64 bits or larger) to pull out the key words and pass those on as indexed chunks.  In this way, billions of web pages are turned into content blocks (containing the cached web page) and the indexes, which tell us that a given keyword like &amp;quot;metafilter&amp;quot; or &amp;quot;raconteur&amp;quot; can be found in particular documentIDs, and/or that the word &amp;quot;metafilter&amp;quot; was found around a link to a certain document on the referring page.  It also will do page ranking of some sort, particular to each search engine, which says that pages on microsoft.com or cnn.com or whatever are more valuable than xxxpornhost.linkfarm.ru.&lt;br&gt;
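&lt;br&gt;
To make the indexing step concrete, here is a minimal Python sketch of an inverted index.  The document IDs, the page text, and the build_index helper are all made up for illustration, and a real indexer would be far more involved than this.&lt;br&gt;

&lt;pre&gt;
```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical document IDs and page text, for illustration only.
documents = {
    101: "MetaFilter is a community weblog",
    102: "raconteur asked a question on MetaFilter",
    103: "An unrelated page about search engines",
}

index = build_index(documents)
print(index["metafilter"])
print(index["raconteur"])
```
&lt;/pre&gt;
Looking up a keyword is then a single dictionary access, which is why the expensive work can be done ahead of time, away from the query path.&lt;br&gt;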
&lt;br&gt;
All that work is done separately, and the cache/indexes regularly updated automatically throughout the day/week/month as it re-crawls the web.  This is transparent to you, the web searcher.&lt;br&gt;
&lt;br&gt;
&lt;b&gt;&lt;u&gt;Front end webservers&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
You hit the web page and perform a search query.  Your request is usually taken by one of likely hundreds of front-end web servers around the world.&lt;br&gt;
The query term is scrubbed (&lt;i&gt;unusual length, invalid characters, known blocked IPs or bad terms, etc&lt;/i&gt;) before being sent to intermediary servers, which we&apos;ll call aggregators.&lt;br&gt;
&lt;br&gt;
&lt;b&gt;&lt;u&gt;Aggregators&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
These intermediary servers cache common query results, and perform the aggregation from the lower layer.  This gives the web servers a smaller number of servers to talk to, which can offload the work of talking to the thousands of actual data storage servers.  There are likely dozens of these aggregators in each cluster (&lt;i&gt;and again, likely several clusters per datacenter, and say 20-40 datacenters around the world, easily&lt;/i&gt;).  They are responsible for keeping open TCP connections to each of the lower-layer machines, and sending queries to all of them simultaneously.&lt;br&gt;
&lt;br&gt;
&lt;b&gt;&lt;u&gt;Data storage layer&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
At the lower layer, there are for each cluster likely hundreds if not &lt;i&gt;thousands&lt;/i&gt; of machines.  Google uses very inexpensive PCs, usually buying cheap discarded or sub-par hardware directly from vendors like Intel and throwing together hordes of machines. If a machine doesn&apos;t work- throw it out, it&apos;s cheaper to replace than to spend people-hours on it.  These machines are likely not even as powerful as your own desktop or laptop, but that doesn&apos;t matter- thousands of them working together makes up for their individual weakness and unreliability.&lt;br&gt;
&lt;br&gt;
The query is resolved via a &amp;quot;two-pass&amp;quot; method, typically.  First, the &amp;quot;index&amp;quot; is checked.  The &amp;quot;index&amp;quot; is what you think: essentially a massive database of keywords, each followed by all the unique document IDs that contain it.  So for a word like &amp;quot;metafilter&amp;quot;, every documentID that contains Metafilter should be in that index.  This index is huge (many terabytes of space) and would take ages to search if it were on one machine.  So, the index is spread among those hundreds or thousands of machines, and the aggregator simultaneously asks all of them &amp;quot;do you have this info?&amp;quot;.  Only a few might respond, but each machine will be able to search its small chunk of the index much faster (hence the sub-second times for query results).  In addition, the data is stored redundantly, and usually the aggregator will time out any requests that aren&apos;t answered within N milliseconds (so that slow or dead servers don&apos;t kill the query response time).&lt;br&gt;
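&lt;br&gt;
The fan-out with a deadline can be sketched in a few lines of Python.  This is only an illustration of the idea, not how any real search engine does it: the shard contents, the search_shard and scatter_gather names, the simulated slow machine, and the 0.2 second deadline are all assumptions.&lt;br&gt;

&lt;pre&gt;
```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical shards: each holds a small slice of the keyword index.
SHARDS = [
    {"metafilter": [101, 102]},
    {"google": [201]},
    {"metafilter": [305]},
]
SLOW_SHARD = 2  # pretend this machine is overloaded

def search_shard(shard_id, term):
    """Look up one term in one shard's slice of the index."""
    if shard_id == SLOW_SHARD:
        time.sleep(0.6)  # simulated slow or dying server
    return SHARDS[shard_id].get(term, [])

def scatter_gather(term, timeout_s=0.2):
    """Query every shard at once and ignore shards that miss the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(SHARDS))
    futures = [pool.submit(search_shard, i, term) for i in range(len(SHARDS))]
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on the stragglers
    doc_ids = []
    for future in done:
        doc_ids.extend(future.result())
    return sorted(doc_ids), len(not_done)

hits, timed_out = scatter_gather("metafilter")
print(hits, timed_out)
```
&lt;/pre&gt;
In this toy version the slow shard&apos;s match is simply dropped; with redundant storage, another replica of that shard could answer in its place.&lt;br&gt;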
&lt;br&gt;
&lt;b&gt;&lt;u&gt;Sorting the results&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
The aggregator then receives the responses from all machines that had a match.  It will then rank the results (&lt;i&gt;using various methods- each search engine spends a lot of effort on refining these ranking methods, so obviously they&apos;re highly proprietary&lt;/i&gt;) to pick the top X hits.  It&apos;ll usually cache these results so that if you hit the Next option you&apos;ll get quick responses.&lt;br&gt;
&lt;br&gt;
Now, for the second pass, it&apos;ll take the top 10 or 20 of those results, and do documentID lookups to the content servers (&lt;i&gt;which may or may not also be the index servers, but will also number in the hundreds of servers&lt;/i&gt;) to get the blurb of the document that contains the key words.  This allows your results to have those little excerpts you&apos;ll see showing you the text in it related to your search, which isn&apos;t stored in the index.&lt;br&gt;
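&lt;br&gt;
A rough Python sketch of that second pass follows.  The content store, the snippet and second_pass helpers, and the 20-character radius are hypothetical, and picking the first match is just the simplest possible excerpt strategy.&lt;br&gt;

&lt;pre&gt;
```python
# Hypothetical content store, keyed by document ID.
CONTENT = {
    101: "MetaFilter is a community weblog that anyone can contribute a link to.",
    102: "Yesterday raconteur asked a question on Ask MetaFilter about Google.",
}

def snippet(doc_id, term, radius=20):
    """Return the text around the first occurrence of term, or None."""
    text = CONTENT[doc_id]
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    start = max(0, pos - radius)
    end = min(len(text), pos + len(term) + radius)
    return text[start:end]

def second_pass(ranked_doc_ids, term, page_size=10):
    """Fetch excerpts only for the documents shown on the current page."""
    return {doc_id: snippet(doc_id, term) for doc_id in ranked_doc_ids[:page_size]}

results = second_pass([102, 101], "metafilter")
for doc_id, excerpt in results.items():
    print(doc_id, excerpt)
```
&lt;/pre&gt;
Only the handful of documents on the current page are fetched, so the content servers see far less traffic than the index servers.&lt;br&gt;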
&lt;br&gt;
&lt;b&gt;&lt;u&gt;Passing the results back to you&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
With the top N results, as well as the text excerpts from the top 10 results (or whatever set of 10 depending on what page you&apos;re on), it&apos;ll then pass that back to the web server.  The web server then formats that raw dataset into pretty HTML, and sends it back to you.&lt;br&gt;
&lt;br&gt;
This whole process, through the magic of optimization and highly distributed computing, will take about one tenth of a second.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-980783</guid>
  	<pubDate>Wed, 20 Jun 2007 16:52:42 -0800</pubDate>
  	<dc:creator>hincandenza</dc:creator>
</item>
<item>
  	<title>By: junesix</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#980811</link>	
  	<description>&lt;em&gt;I&apos;m looking for some real concrete data&lt;/em&gt;&lt;br&gt;
There is no concrete data because Google isn&apos;t in the business of revealing such numbers. &lt;br&gt;
&lt;br&gt;
This 2006 article titled &lt;a href=&quot;http://www.baselinemag.com/print_article2/0,1217,a=182560,00.asp&quot;&gt;How Google Works&lt;/a&gt; is a good read. It estimates over 450,000 servers spread across 5 datacenters. A 6th datacenter is supposed to have gone up in Belgium. The article goes into the Google search mechanics a bit but &lt;strong&gt;hincandenza&lt;/strong&gt;&apos;s post is much better for the nitty-gritty of what&apos;s actually involved in a search request and returning results.&lt;br&gt;
&lt;br&gt;
The Wikipedia entry for &lt;a href=&quot;http://en.wikipedia.org/wiki/Google_platform#_note-investwallonia&quot;&gt;Google platform&lt;/a&gt; is also informative for numbers.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-980811</guid>
  	<pubDate>Wed, 20 Jun 2007 17:26:25 -0800</pubDate>
  	<dc:creator>junesix</dc:creator>
</item>
<item>
  	<title>By: reeddavid</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#981183</link>	
  	<description>Though it doesn&apos;t tell you how things work, you might find interesting this coverage of Google&apos;s new data center (Secret codename: Project 2) in Oregon. &lt;br&gt;
&lt;br&gt;
" &lt;a href=&quot;http://www.nytimes.com/2006/06/14/technology/14search.html?ex=1182571200&amp;en=ba4b9b4cc5a8472a&amp;ei=5070&quot;&gt;Hiding in Plain Sight, Google Seeks More Power (NYT)&lt;/a&gt;&lt;br&gt;
" &lt;a href=&quot;http://www.networkworld.com/news/2006/061906-top-secret-google-data-center-almost.html&quot;&gt;Top-secret Google data center almost completed (Computerworld)&lt;/a&gt;&lt;br&gt;
" &lt;a href=&quot;http://news.com.com/2300-1030_3-6089390-1.html?tag=ne.gall.pg&quot;&gt;Photos of construction (CNET)&lt;/a&gt;</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-981183</guid>
  	<pubDate>Thu, 21 Jun 2007 02:00:44 -0800</pubDate>
  	<dc:creator>reeddavid</dc:creator>
</item>
<item>
  	<title>By: MetaMonkey</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#981377</link>	
  	<description>There&apos;s a little more info in &lt;a href=&quot;http://ask.metafilter.com/61024/I-want-to-start-a-search-enginenow-what&quot;&gt;this Askme&lt;/a&gt;.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-981377</guid>
  	<pubDate>Thu, 21 Jun 2007 07:32:09 -0800</pubDate>
  	<dc:creator>MetaMonkey</dc:creator>
</item>
<item>
  	<title>By: raconteur</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#981448</link>	
  	<description>This is really great stuff. Thanks, everyone. I&apos;m also interested in understanding how much data Google parses every day, e.g. &amp;quot;Google parses the equivalent of x Libraries of Congress every day&amp;quot;, that sort of thing. I&apos;m having a really hard time finding that, probably owing to the lack of hard numbers...</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-981448</guid>
  	<pubDate>Thu, 21 Jun 2007 08:55:12 -0800</pubDate>
  	<dc:creator>raconteur</dc:creator>
</item>
<item>
  	<title>By: chrisamiller</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#982081</link>	
  	<description>Have you tried emailing them?  A stat like that sounds like something that they might divulge willingly, if you can get in touch with the right Google rep.  Probably someone from their PR department.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-982081</guid>
  	<pubDate>Thu, 21 Jun 2007 18:51:06 -0800</pubDate>
  	<dc:creator>chrisamiller</dc:creator>
</item>
<item>
  	<title>By: lodev</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#982331</link>	
  	<description>They surely have more than 5 DCs. One that I know of is &lt;a href=&quot;http://flickr.com/photos/erwinboogert/sets/29760/&quot;&gt;this one in Groningen&lt;/a&gt;, the Netherlands.&lt;br&gt;
&lt;br&gt;
They are also going to start construction of a datacenter in Saint Ghislain, Belgium in about a month or two. It should become operational in Q1 2008.&lt;br&gt;
&lt;br&gt;
One way to get to know DC locations is by checking their job offerings. Sometimes the locations are listed.&lt;br&gt;
&lt;br&gt;
&lt;small&gt;&lt;small&gt;&lt;small&gt;&lt;small&gt;&lt;small&gt;I could tell more, but then I&apos;d be breaking NDAs...&lt;/small&gt;&lt;/small&gt;&lt;/small&gt;&lt;/small&gt;&lt;/small&gt;</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-982331</guid>
  	<pubDate>Fri, 22 Jun 2007 01:52:56 -0800</pubDate>
  	<dc:creator>lodev</dc:creator>
</item>
<item>
  	<title>By: sparkletone</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#991316</link>	
  	<description>The &amp;quot;How Google Works&amp;quot; article is interesting, but I recall a friend of mine who works for Teh GOOG just laughing at the server and data center numbers in it.&lt;br&gt;
&lt;br&gt;
Of course, he couldn&apos;t be very specific as to &lt;em&gt;why&lt;/em&gt; he was laughing, but there was definite LOLing.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-991316</guid>
  	<pubDate>Sat, 30 Jun 2007 19:47:56 -0800</pubDate>
  	<dc:creator>sparkletone</dc:creator>
</item>
<item>
  	<title>By: spiderwire</title>
  	<link>http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work#992034</link>	
  	<description>Um... a bit late to the party, perhaps, but &lt;a href=&quot;http://209.85.163.132/papers/googlecluster-ieee.pdf&quot;&gt;this PDF&lt;/a&gt; explains the architecture pretty straightforwardly. It&apos;s a little old, but very specific about the hardware they use, the cost of various operations (cache misses, etc.) in cycles, etc.&lt;br&gt;
&lt;br&gt;
Their terminology isn&apos;t quite the same as what Hal describes, but the process is close, &lt;b&gt;except&lt;/b&gt; that the index servers (&amp;quot;aggregators&amp;quot;) actually only talk to a small piece of the overall database (it&apos;s a shared-nothing architecture -- so Google&apos;s &amp;quot;snapshot&amp;quot; of the Web is broken up into separate pieces, each called a &amp;quot;shard,&amp;quot; and each shard has a &amp;quot;pool&amp;quot; of servers responsible only for it), and then pass their results back up to the document servers to do the last-step ranking, extraction, and formatting.&lt;br&gt;
&lt;br&gt;
This may seem like a minor quibble, but it&apos;s important to grasp that parallelization -- Google&apos;s architecture wouldn&apos;t work without it.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.65244-992034</guid>
  	<pubDate>Sun, 01 Jul 2007 22:37:15 -0800</pubDate>
  	<dc:creator>spiderwire</dc:creator>
</item>

    </channel>
</rss>
