How does Google do its thing?
January 27, 2009 3:11 PM   Subscribe

How does Google run their data centers? I remember watching a Q&A with someone at Google trying to recruit kids at a college. One of them asked what kind of hardware they run on, he responded that they buy commodity rack servers off (I forget the exact generic domain he gave). Does this mean they use cheap rack servers with direct-attached storage? How does this work?

I cannot find the video, but I remember it being surprisingly frank. They were recruiting kids for what seemed to be the sysadmin jobs of the company. They divulged a fair amount about Google's operations, and one thing I remember was the emphasis on buying commodity servers on the cheap.

How does this work? Do they engineer their software to work in this fashion? Or, what I'm really asking, is how do I do this on a scale of two to three racks? Is it feasible?

hincandenza touched on this in his wonderful answer:
At the lower layer, there are for each cluster likely hundreds if not thousands of machines. Google uses very inexpensive PCs, usually buying cheap discarded or sub-par hardware directly from vendors like Intel and throwing together hordes of machines. If a machine doesn't work- throw it out, it's cheaper to replace than to spend people-hours on it. These machines are likely not even as powerful as your own desktop or laptop, but that doesn't matter- thousands of them working together makes up for their individual weakness and unreliability.
It seems, from what I gather, that they are teaming up with a virtualization partner to deliver this kind of flexibility. Are there commercial, off-the-shelf programs that do this? Do ESX and Virtual Center do this? I have worked with ESX but didn't think it was this flexible. Surely someone else has tried to do this? It would be so much cheaper to buy cheap, cheap machines and add capacity as needed than to buy expensive SANs. I just haven't found anything that would allow me to take, say, an Exchange installation and throw it into a cloud. So what's the deal? Is this all custom Google stuff?
posted by geoff. to Computers & Internet (21 answers total) 8 users marked this as a favorite
Pretty much everything is custom. And things change pretty regularly.

And ESX does something completely different from what large cloud datacenters do. ESX does pretty much the opposite - most enterprise servers require very little processing power so ESX divides a powerful box into a lot of cheaper ones. This typically is more efficient in terms of floor space and energy consumption versus buying multiple small servers. ESX is also used when upgrading hardware - companies will consolidate several old machines into one new, more powerful machine. But what drives a cloud DC (e.g. Amazon, Google) is very, very different from what drives a typical enterprise DC (typically constrained by floor space, cooling or power delivery).
posted by GuyZero at 3:18 PM on January 27, 2009

Also, FWIW, Amazon uses Xen as the basis for its EC2 cloud computing service. But they have pretty small individual VMs, and 2006 reports indicated that they were only running 2:1 virtual-to-physical. I expect they're running 16:1 for small instances and still 2:1 for the extra-large instances in the current EC2 config. But who knows how much of a mix of hardware they have for the physical boxes.
posted by GuyZero at 3:32 PM on January 27, 2009

I don't have any inside knowledge of Google's datacenters, but it is almost certainly something similar to a Beowulf cluster. And GuyZero is correct, virtualization is essentially the opposite of grid computing.
posted by synaesthetichaze at 3:34 PM on January 27, 2009

AFAIK, Google wrote a lot of cluster software to get useful work out of their crappy hardware. For example, they wrote a filesystem that spreads redundant copies everywhere and automatically recovers from disk failure (by increasing replication). They also wrote tools like MapReduce to exploit the system from typical languages. You write a Map function that operates on a chunk of data and emits a result, and a Reduce function that merges the results from Map. The end result is Map running locally on the servers where the data is, on hundreds of computers at once, with low communication overhead - and individual Google searches that supposedly cost enough CPU power to boil a kettle of water.

Google makes no effort to make this system appear as a single virtual computer; they don't run Exchange servers on this cloud. The software has to be written to take advantage of it. Which is why their tech-talk series is full of parallel-computing lectures; it takes a long time to write these things correctly, and there aren't enough undergraduates sufficiently trained in the subject.
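To make the Map/Reduce shape concrete, here is a toy, single-machine word count in Python. This is only an illustration of the data flow described above - in a real cluster each `map_fn` call runs on the machine that holds its chunk, and the sort stands in for the distributed shuffle:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(chunk):
    # Emit (word, 1) for every word in this chunk of input.
    for word in chunk.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all the counts emitted for one word.
    return word, sum(counts)

def run_job(chunks):
    # Map phase: in a real cluster this runs where the data lives;
    # here everything runs locally to show the flow.
    intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]
    intermediate.sort(key=itemgetter(0))  # stands in for the shuffle phase
    # Reduce phase: one reduce call per distinct key.
    return dict(
        reduce_fn(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )

print(run_job(["the cat", "the dog"]))  # {'cat': 1, 'dog': 1, 'the': 2}
```

The point of the pattern is that map calls are independent (so they parallelize trivially) and reduce only needs the values for one key at a time.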
posted by pwnguin at 3:34 PM on January 27, 2009

Read up on Google File System.
-White Paper
-Wikipedia ("It is designed to provide efficient, reliable access to data using large clusters of commodity hardware.")
-How Stuff Works
posted by niles at 3:41 PM on January 27, 2009 [3 favorites]

Also, Exchange has a significant amount of clustering stuff built right into it and takes a lot from Windows clustering services. Any major Exchange installation is already a grid of sorts, depending on how much you feel that clustering and grid computing overlap. But only cheap and/or really small installations run Exchange on a single box. I'm sure someone at MSFT has the numbers to prove that the economic activity generated by Exchange admins alone puts Exchange into the realm of a small country in terms of GDP. At any rate, what you ask for - the ability to invisibly grid something like Windows across multiple machines and make it look like one large machine - is next to impossible. Software that runs on a grid is always written specifically for that environment. If there's one thing software developers know how to do, it's making more software development work.
posted by GuyZero at 3:59 PM on January 27, 2009

Hadoop is an open source version of MapReduce that you might be interested in checking out.
posted by kdar at 4:09 PM on January 27, 2009

Their distributed database system, BigTable, is built on top of GFS and has some cool features.
posted by PantsOfSCIENCE at 4:20 PM on January 27, 2009

nthing the above that everything is custom. As an aside, 10 years ago I took a special-topics course on parallel computing from a professor who was fairly well known in the field. It is a fascinating and crazy-difficult topic to really wrap your mind around. There are lots of different physical architectures which radically change the algorithms in use, but even then it is suboptimal to write everyday programs to those architectures, because they are so specific and, frankly, it is difficult. It is the same reason that lots more people program in Perl/Python/Java/Ruby/C# than in C or even assembler. The holy grail would be to write a normal program and have a smart compiler automatically farm it out to the cloud. That isn't possible yet... right now it takes smart people making decisions on how to split up workloads so that they are done in parallel and the results can be combined properly.
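The "split the work, run in parallel, combine the results" decisions mentioned above can be shown with a toy example using Python's standard library. The contiguous-chunk split is just one arbitrary choice; picking a good split is exactly the hard part on real problems:

```python
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(bounds):
    # One worker's share: sum a contiguous sub-range.
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers=4):
    # The split: carve [0, n) into one contiguous chunk per worker.
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    # The combine: partial sums are associative, so a plain sum works.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(1_000_000) == sum(range(1_000_000)))  # True
```

Summation is the easy case because combining is trivially associative; the lectures exist because most workloads aren't like that.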
posted by mmascolino at 4:22 PM on January 27, 2009

They have a whole website devoted to this (well, it's focused on 'green', but it gives a lot of details).

See here.
posted by jourman2 at 4:30 PM on January 27, 2009

Also, in these sorts of high-volume, high-performance situations, you often end up sharding your data across multiple servers. Here's a pretty good article on sharding. Another good article is linked from that one.

An article on Google from the same site.
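A minimal sketch of the hash-based sharding those articles describe (the server names are made up, and this naive modulo scheme remaps most keys when you add a shard - which is why consistent hashing exists):

```python
import hashlib

# Hypothetical shard list; in practice this would come from config.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]

def shard_for(key: str) -> str:
    # Hash the key so rows spread evenly across shards regardless of
    # how skewed the key space is, then pick a shard by modulo.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every read and write for a given key goes to the same server.
print(shard_for("user:42") == shard_for("user:42"))  # True
```

The routing is the easy part; the operational pain is rebalancing and cross-shard queries, which is what most of the sharding literature is actually about.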
posted by GuyZero at 4:32 PM on January 27, 2009

On a purely practical level, I know a datacenter engineer who has seen some Google cluster hardware. The physical boxes are open-air for more cooling and are, as stated above, replaced upon any failure. They don't use traditional server racks, and they have specialized cooling techniques, as described in jourman2's link.
posted by rabbitsnake at 4:33 PM on January 27, 2009

I think you'll enjoy these case studies, which include Google.
posted by PueExMachina at 5:12 PM on January 27, 2009

What kind of software are you looking to develop?

Somebody suggested Hadoop before, and I think that'd be a great place for you to start. It's essentially an open source implementation of some key components of Google's infrastructure (MapReduce and GFS).

You might also want to look into things like Erlang if you're interested in scalability on a medium scale.
posted by atomly at 5:43 PM on January 27, 2009

Google is custom beyond your wildest dreams, but more than that they have successfully built their company around a (spectacularly engineered) unified platform and architecture. That alone is both the most difficult task and the key to success: most large companies have been assembled of smaller, acquired pieces that all have their own architectures and systems and cruft.

Hadoop is an example of something you can use off the shelf, today, but to get to where Google is you have to insist that every service, every product, and every application be built on top of it (the MapReduce paradigm and the underlying bits that power it): that requires smart engineers and discipline. I've oversimplified quite a bit, but you get the point.
posted by kcm at 5:46 PM on January 27, 2009

This is purely anecdotal, from a conversation I once had with a Google sysadmin, but it's pretty impressive.

Everything's cheap-ass hardware that boots to the network, installs their version of Linux, and then takes its place in the hive, asking the boxes further upstream what it should be doing. It gets its instructions and starts receiving data from wherever it needs to.

Every piece of data is stored, at the VERY least, in triplicate, on three different boxes, usually distributed across different physical datacenters. This means that if their cheap-ass hard drives die (and they've published white papers on their massive failure rates, and on how often SMART doesn't say a damn thing), nothing is lost.

This is part of the awesomeness of the GFS/BigTable stuff.
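The triplicate placement described above can be sketched roughly like this. The datacenter and machine names are invented, and GFS's real placement policy is far more sophisticated; this only shows the core idea of never putting two copies in the same failure domain:

```python
import hashlib

# Hypothetical topology: machines grouped by datacenter.
MACHINES = {
    "dc-a": ["a1", "a2"],
    "dc-b": ["b1", "b2"],
    "dc-c": ["c1", "c2"],
}

def place_replicas(block_id: str, copies: int = 3):
    # Hash the block id to a deterministic starting point, then walk
    # round-robin across datacenters so no two copies share one.
    datacenters = sorted(MACHINES)
    start = int(hashlib.sha1(block_id.encode()).hexdigest(), 16)
    targets = []
    for i in range(copies):
        dc = datacenters[(start + i) % len(datacenters)]
        machines = MACHINES[dc]
        targets.append((dc, machines[start % len(machines)]))
    return targets
```

With a layout like this, losing a whole datacenter leaves at least two live copies of every block, and the master only has to schedule re-replication to get back to three - which is exactly the behavior in the outage story below.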

They monitor box performance like crazy and in aggregate, so they can see if there's something suspicious happening and do what's necessary to fix it.

The sysadmin told me an interesting story about how a datacenter went entirely offline - no backup generator was able to activate, so they lost a HUGE datacenter of theirs. They watched as the front controlling clusters recognized the downed datacenter and began routing around it, but they also watched as it detected that many servers had lost their "triplicate" copy and started to re-replicate the lost data elsewhere, slowing things down slightly for a bit while the replication was performed. But it survived and stayed active and serving until the DC was brought back up.

Crazy. Shit.

It's like fucking skynet, man. They even have to have the power companies increase their wattage to the datacenters because of how densely they cram the boxes, and because of how little they care if the boxes are at all serviceable.
posted by disillusioned at 5:49 PM on January 27, 2009

In Google's case, I think most or all of the sexy bits are custom stuff (Google Filesystem, Map Reduce, BigTable), and, of course, their other applications build on top of all that.

There have been people diligently working on open source versions. Hadoop provides both Map Reduce and a distributed file-system in the style of GFS. HBase is a BigTable inspired non-relational DB that runs on top of Hadoop. Hypertable is another open-source BigTable-like DB. CloudStore (formerly KFS) is another GFS-style clone. None of these things will be very useful though unless you are writing your application to take advantage of their capabilities.

There are other pieces that might help adapt existing applications to run on a scalable fault-tolerant infrastructure, but that won't take you very far with Microsoft Exchange unless Microsoft decides to engineer it to run on commodity hardware rather than gold-plated servers.

Things that might help existing applications: DRBD provides a way to do redundant block devices (i.e. disk volumes) across a network on Linux using commodity hardware. Throw in something like Red Hat GFS and that storage can be read and written from multiple machines simultaneously. GlusterFS is another Linux filesystem that can be run across multiple commodity machines for scalability and availability.

Google's approach to hardware is interesting. At their scale, price-performance needs to be considered at the datacenter level. Using a large number of commodity machines vs a smaller number of high-end servers puts them in a sweet spot since desktop components have much higher production volumes than server components. But, there are other less obvious advantages to using desktop-spec machines as well.

Power consumption is a major issue for large datacenters, both because energy costs have been rising and because power consumption produces heat that has to be dealt with. Power distribution and cooling equipment end up accounting for a significant part of the budget of a new datacenter. As a result, Google has been very interested in maximizing computation per watt-hour, because this ends up being a significant contributor to overall price/performance calculations. CPU performance has been outstripping memory and disk performance, with the result that systems tend to be imbalanced: fast multicore CPUs end up burning electricity waiting because there isn't enough disk or memory bandwidth. Using more machines with fewer, lower-end CPUs gives them more overall memory and disk bandwidth, allows maximal utilization of resources, and improves overall efficiency.
posted by Good Brain at 7:24 PM on January 27, 2009

Good Brain: don't forget Google's work on efficient power supplies, a major problem before they rallied behind it.
posted by kcm at 8:52 PM on January 27, 2009

pwnguin: that "enough energy to boil a kettle of water" stuff was not true. Google are very careful about energy. See this post.
posted by w0mbat at 12:09 AM on January 28, 2009 [1 favorite]

Thanks! Hadoop is exactly what I was looking for. I thought that since ESX's fault-tolerance/high-availability function works transparently to the service, it could be expanded so that multiple services could run on a grid. It makes sense that grid software must be aware of its hardware layer to some degree, to propagate and deal with failure.
posted by geoff. at 4:11 PM on January 28, 2009

ESX HA is a semi-hack.

First, it relies on all VM images being on SAN storage accessible from all servers in the ESX cluster. Then it's basically using vMotion and a heartbeat so that if a server dies the image is vMotion'ed to another server. So it requires a fairly specific setup and some expensive hardware (a SAN). But ultimately any given VM can only access as many resources as the largest box in the cluster. The idea with a grid is that a single program can access all the CPU and disk resources in a unified manner.
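That heartbeat-plus-restart behavior can be sketched as a toy. The 5-second timeout, host names, and least-loaded placement below are illustrative assumptions, not how ESX actually decides; the shared SAN is what makes a VM's image reachable from any surviving host:

```python
import time

TIMEOUT = 5.0  # illustrative heartbeat window

class Cluster:
    def __init__(self, hosts):
        # Assume every host has just checked in at startup.
        self.last_beat = {h: time.monotonic() for h in hosts}
        self.vms = {h: [] for h in hosts}

    def heartbeat(self, host, when=None):
        self.last_beat[host] = time.monotonic() if when is None else when

    def check(self, now=None):
        now = time.monotonic() if now is None else now
        # A host that missed its window is declared dead.
        dead = [h for h, t in self.last_beat.items() if now - t > TIMEOUT]
        survivors = [h for h in self.vms if h not in dead]
        for host in dead:
            for vm in self.vms.pop(host):
                # Restart each VM on the least-loaded survivor
                # (fails if no survivors remain - a real HA cluster
                # needs quorum handling here).
                target = min(survivors, key=lambda h: len(self.vms[h]))
                self.vms[target].append(vm)
            del self.last_beat[host]
        return dead
```

Note what the sketch makes obvious: each VM just moves whole to one other box, so no VM ever gets more resources than the biggest host - unlike a grid, where one program spans all of them.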
posted by GuyZero at 5:04 PM on January 28, 2009
