How Does a Google Query Work?
June 20, 2007 3:19 PM   Subscribe

How Does a Google Query Work? How many machines does it hit? How many clusters are there? How many Google DNS servers? How many data centers? I know Google is famously secretive about this information, but I'd love to understand just how the results page gets back to me, with as much detail as possible. Many people take guesses about this, but I'm looking for some real concrete data.
posted by raconteur to Technology (12 answers total) 20 users marked this as a favorite
 
I doubt you will get a concrete answer, but the book The Google Story has some fascinating information about how they got started, how some of the technical pieces came together, and how it has been scaled.
posted by The Deej at 4:01 PM on June 20, 2007


They're actually quite open about a lot of their stuff. I read this last year in a file systems class. It'll cover the lower-level questions of how many machines you're hitting to get data. I wouldn't be surprised if there are papers covering the database system and load-balancing stuff in here too.
posted by Diz at 4:31 PM on June 20, 2007


If you're also interested in how PageRank works, maybe this link will help.
posted by husky at 4:31 PM on June 20, 2007


Best answer: I answered something about the scale of (generic search engines) here. That thread has some very interesting information in general, as well.

The short answer, to give you an idea: Google is estimated to have hundreds of thousands of machines, in dozens of datacenters (as all the major search engine players typically do).

The number of clusters is large: several clusters in each datacenter, I'd presume, each cluster representing a full "copy" of the web that Google has crawled as well as the supporting data to search this information. DNS work is either offloaded to a provider like Akamai or done in-house, but in either case the DNS will be widely distributed so that the DNS responder is close to you (and thus your round-trip time is low in resolving *.google.com).


As said above, some of the basic details on their size, and their core technologies like MapReduce and GFS, are fairly well published. The fact is, at this point there's really only one company that could compete with Google from a financial/resource perspective in trying to build what they did: Google has had years to build up their size and expertise, so simply knowing how they do what they do isn't remotely enough anymore.


The basic way a search query works is fairly standard across all the search engines, but here's the overview version:

Crawling the web
Typically you have crawling servers, which walk the web robotically, following links from page to page and downloading the content. They gather big blocks of this content and pass it on to index servers, which parse the downloaded documents (each document gets a unique 64-bit or larger documentID so it can be referenced unambiguously) to pull out the keywords and pass those on as indexed chunks. In this way, billions of web pages are turned into content blocks (containing the cached web pages) and the indexes, which tell us that a given keyword like "metafilter" or "raconteur" can be found in particular documentIDs, and/or that the word "metafilter" was found near a link to a certain document on the referring page. The indexer also does page ranking of some sort, particular to each search engine, which says that pages on microsoft.com or cnn.com or whatever are more valuable than xxxpornhost.linkfarm.ru.

All that work is done separately, and the cache/indexes are updated automatically throughout the day/week/month as the engine re-crawls the web. This is transparent to you, the web searcher.
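To make the indexing idea concrete, here's a toy inverted index in Python. Everything here is invented for illustration; the real index is sharded across thousands of machines and is vastly more sophisticated:

```python
from collections import defaultdict

def build_index(documents):
    """documents: dict mapping a 64-bit documentID to page text.

    Returns the inverted index: each keyword maps to the set of
    documentIDs containing it. The real thing is many terabytes
    and split across thousands of machines; this is the core idea.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    0x1A2B: "metafilter community weblog",
    0x3C4D: "ask metafilter questions and answers",
}
index = build_index(docs)
print(index["metafilter"])  # -> both documentIDs
```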

Front end webservers
You hit the web page and perform a search query. Your request is taken by one of what are likely hundreds of front-end web servers around the world. The query term is scrubbed (unusual length, invalid characters, known blocked IPs or bad terms, etc.) before being sent to intermediary servers, which we'll call aggregators.
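A rough sketch of what that scrubbing step might look like (purely illustrative; the actual checks and limits aren't public):

```python
import re

MAX_QUERY_LEN = 2048  # invented limit, not a real Google number

def scrub_query(query, client_ip, blocked_ips):
    """Reject obviously bad requests before they reach the aggregators."""
    if client_ip in blocked_ips:
        raise ValueError("blocked client")
    if len(query) > MAX_QUERY_LEN:
        raise ValueError("query too long")
    # Strip anything outside a conservative whitelist of characters.
    return re.sub(r"[^\w\s\"'+-]", "", query).strip()

print(scrub_query("how does <a> google query work?", "1.2.3.4", set()))
```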

Aggregators
These intermediary servers cache common query results and perform the aggregation from the lower layer. This gives the web servers a smaller number of servers to talk to, offloading the work of talking to the thousands of actual data storage servers. There are likely dozens of these aggregators in each cluster (and again, likely several clusters per datacenter, and say 20-40 datacenters around the world, easily). They are responsible for keeping open TCP connections to each of the lower-layer machines, and for sending simultaneous queries to all of them at once.
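A minimal sketch of that caching layer, with lru_cache standing in for what would really be a TTL'd, distributed cache, and the shard fan-out stubbed out (it's sketched properly under "Data storage layer" below):

```python
import time
from functools import lru_cache

def query_index_shards(query):
    # Stand-in for the expensive fan-out to thousands of index
    # servers, sketched under "Data storage layer" below.
    time.sleep(0.05)
    return (("doc1", 0.9), ("doc2", 0.7))

# Common queries repeat constantly, so caching their result sets at
# the aggregator avoids hitting the storage layer at all.
@lru_cache(maxsize=100_000)
def cached_search(query):
    return query_index_shards(query)

cached_search("metafilter")  # slow: goes to the shards
cached_search("metafilter")  # fast: served from the cache
```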

Data storage layer
At the lower layer, each cluster likely has hundreds if not thousands of machines. Google uses very inexpensive PCs, usually buying cheap discarded or sub-par hardware directly from vendors like Intel and throwing together hordes of machines. If a machine doesn't work, throw it out; it's cheaper to replace than to spend people-hours on it. These machines are likely not even as powerful as your own desktop or laptop, but that doesn't matter: thousands of them working together make up for their individual weakness and unreliability.

The query is resolved via a "two-pass" method, typically. First, the "index" is checked. The "index" is what you think: essentially a massive database of keywords, each followed by every unique documentID containing that keyword. So for a word like "metafilter", every documentID that contains "metafilter" should be in that index. This index is massively huge (many terabytes of space) and would take ages to search if it were on one machine. So the index is spread among those hundreds or thousands of machines, and the aggregator simultaneously asks all of them "do you have this info?". Only a few might respond, but each machine can search its small chunk of the index much faster (hence the sub-second times for query results). In addition, the data is stored redundantly, and usually the aggregator will time out any requests that aren't answered within N milliseconds (so that slow or dead servers don't kill the query response time).
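In code, that ask-everyone-with-a-deadline pattern might look something like this; the Shard class and the 50 ms deadline are my inventions, not anything Google has published:

```python
import concurrent.futures as cf

SHARD_TIMEOUT_S = 0.05  # invented deadline, not a real Google number

class Shard:
    """Stand-in for one index server holding a slice of the index."""
    def __init__(self, index_slice):
        self.index_slice = index_slice  # keyword -> documentIDs

    def lookup(self, term):
        return self.index_slice.get(term, [])

def fan_out(shards, term):
    """Ask every shard for `term` at once and drop stragglers.

    Each shard searches only its small slice of the index, so latency
    is bounded by the deadline, not by the total index size.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(shard.lookup, term) for shard in shards]
    done, _ = cf.wait(futures, timeout=SHARD_TIMEOUT_S)
    pool.shutdown(wait=False)  # don't block on slow or dead servers
    hits = []
    for f in done:
        try:
            hits.extend(f.result())
        except Exception:
            # A failed shard is treated like a timed-out one; redundant
            # copies of its index slice live on other machines.
            pass
    return hits

shards = [Shard({"metafilter": [0x1A2B]}), Shard({"metafilter": [0x3C4D]})]
print(fan_out(shards, "metafilter"))  # hits from both shards
```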

Sorting the results
The aggregator then receives the responses from all machines that had a match. It ranks them (using various methods; each search engine spends a lot of effort refining its ranking methods, so obviously they're highly proprietary) to find the top X hits. It'll usually cache these results so that if you hit the Next option you'll get quick responses.
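Minus the proprietary scoring, the merge-and-rank step boils down to something like:

```python
import heapq

def top_k(hits, k=10):
    """hits: (score, documentID) pairs collected from all shards.

    heapq.nlargest keeps only k entries in memory at a time, which
    matters when thousands of shards each return partial matches.
    The scores themselves come from the proprietary ranking function.
    """
    return heapq.nlargest(k, hits)

print(top_k([(0.9, 0x1A2B), (0.7, 0x3C4D), (0.8, 0x5E6F)], k=2))
```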

Now, for the second pass, it takes the top 10 or 20 of those results and does documentID lookups to the content servers (which may or may not also be the index servers, but will also number in the hundreds) to get the blurb of each document that contains the keywords. This allows your results to have those little excerpts showing you the text related to your search, which isn't stored in the index.
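A toy version of that excerpt extraction (the real one handles multiple terms, word boundaries, highlighting, and so on):

```python
def make_snippet(text, term, width=80):
    """Pull the text surrounding the first occurrence of `term`.

    This is why the second pass exists: the index alone can say which
    documents match, but the cached document text is needed to show
    the words in context.
    """
    pos = text.lower().find(term.lower())
    if pos == -1:
        return text[:width]
    start = max(0, pos - width // 2)
    return "..." + text[start:start + width] + "..."

print(make_snippet("community weblog ask metafilter and more", "metafilter"))
```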

Passing the results back to you
With the top N results, as well as the text excerpts from the top 10 results (or whatever set of 10, depending on what page you're on), it then passes that back to the web server. The web server formats that raw dataset into pretty HTML and sends it back to you.

This whole process, through the magic of optimization and highly distributed computing, will take about one tenth of a second.
posted by hincandenza at 4:52 PM on June 20, 2007 [122 favorites]


I'm looking for some real concrete data
There is no concrete data because Google isn't in the business of revealing such numbers.

This 2006 article titled How Google Works is a good read. It estimates over 450,000 servers spread across 5 datacenters. A 6th datacenter is supposed to have gone up in Belgium. The article goes into the Google search mechanics a bit, but hincandenza's post is much better for the nitty-gritty of what's actually involved in a search request and returning results.

The Wikipedia entry for Google platform is also informative for numbers.
posted by junesix at 5:26 PM on June 20, 2007


Though it doesn't tell you how things work, you might find this coverage of Google's new data center (secret codename: Project 2) in Oregon interesting.

Hiding in Plain Sight, Google Seeks More Power (NYT)
Top-secret Google data center almost completed (Computerworld)
Photos of construction (CNET)
posted by reeddavid at 2:00 AM on June 21, 2007


There's a little more info in this AskMe.
posted by MetaMonkey at 7:32 AM on June 21, 2007


Response by poster: This is really great stuff. Thanks, everyone. I'm also interested in understanding how much data Google parses every day, i.e. "Google parses the equivalent of x Libraries of Congress every day", that sort of thing. Having a really hard time finding that, probably owing to the lack of hard numbers...
posted by raconteur at 8:55 AM on June 21, 2007


Have you tried emailing them? A stat like that sounds like something they might divulge willingly, if you can get in touch with the right Google rep. Probably someone from their PR department.
posted by chrisamiller at 6:51 PM on June 21, 2007


They surely have more than 5 DCs. One that I know of is this one in Groningen, the Netherlands.

They are also going to start construction of a datacenter in Saint Ghislain, Belgium in about a month or two. It should become operational in Q1 2008.

One way to get to know DC locations is by checking their job offerings. Sometimes the locations are listed.

I could tell more, but then I'd be breaking NDAs...
posted by lodev at 1:52 AM on June 22, 2007


The "How Google Works" article is interesting, but I recall a friend of mine who works for Teh GOOG just laughing at the server and data center numbers in it.

Of course, he couldn't be very specific as to why he was laughing, but there was definite LOLing.
posted by sparkletone at 7:47 PM on June 30, 2007


Um... a bit late to the party, perhaps, but this PDF explains the architecture pretty straightforwardly. It's a little old, but very specific about the hardware they use, the cost of various operations (cache misses, etc.) in cycles, and so on.

Their terminology isn't quite the same as what Hal describes, but its process is close, except that the index servers ("aggregators") actually only talk to a small piece of the overall database (it's a shared-nothing architecture: Google's "snapshot" of the Web is broken up into pieces, "shards," each of which has a "pool" of servers responsible only for that shard), and then pass their results back up to the document servers to do the last-step ranking, extraction, and formatting.

This may seem like a minor quibble, but it's important to grasp the parallelization: Google's architecture wouldn't work without it.
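For instance, routing a documentID to its shard, and then to one replica in that shard's pool, can be as simple as hashing; the shard and pool counts below are invented, since the real ones aren't public:

```python
import hashlib

NUM_SHARDS = 64   # invented; the real shard count isn't public
POOL_SIZE = 12    # invented; replicas serving each shard

def shard_for(doc_id):
    """Shared-nothing: each document lives in exactly one shard."""
    digest = hashlib.md5(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replica_for(doc_id, attempt=0):
    """Pick one replica from the shard's pool; retries rotate replicas."""
    return (shard_for(doc_id), (doc_id + attempt) % POOL_SIZE)

print(shard_for(0x1A2B), replica_for(0x1A2B, attempt=1))
```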
posted by spiderwire at 10:37 PM on July 1, 2007


This thread is closed to new comments.