What does 4.3 million GB look like?
November 12, 2020 9:35 PM

So on the news that Google Photos will no longer offer unlimited storage, I started wondering about how the whole process works. I realize this is probably a trade secret, but how does Google store all this data in a way that is immediately retrievable?

According to this TechCrunch article, "People now upload more than 4.3 million GB to Gmail, Drive and Photos every day." And I saw a Google blog that says "Today, more than 4 trillion photos are stored in Google Photos, and every week 28 billion new photos and videos are uploaded."

What I can't understand is how this works from a practical standpoint. I am not a tech person, but I assume that those 4,300,000 GB (which doesn't include YouTube) have to be stored redundantly, because people would be angry if their information was lost. So is Google just adding something like ten million GB of storage a day? How?

I assume that they are worried about costs and so must be using consumer drives or something similar, no? Do they have people just like in a data center physically installing like thousands of disks a day?
posted by Literaryhero to Technology (6 answers total) 6 users marked this as a favorite
 
Best answer: Yes on consumer drives, as far as I know, and also yes on people physically racking drives in several data centers around the world. Not sure if it’s daily, but realistically it probably is.
posted by protocoach at 9:50 PM on November 12, 2020


Best answer: To a first approximation, I suspect that 4.3 million GB of data is uploaded, but Google immediately compresses a lot of it, so I'd expect the stored amount to be more like 143 TB a day (assuming a roughly 30x compression ratio).
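
As a rough check on that arithmetic, here is a minimal Python sketch. It uses decimal units (1 TB = 1000 GB), the 30x ratio is the guess stated above rather than a published figure, and the function name is made up for illustration.

```python
def compressed_tb_per_day(uploaded_gb: float, compression_ratio: float) -> float:
    """Return the volume left after compression, in TB (1 TB = 1000 GB)."""
    return uploaded_gb / compression_ratio / 1000


# 4.3 million GB uploaded per day, assumed 30x compression (a guess, not a known figure)
daily_tb = compressed_tb_per_day(4_300_000, 30)
print(f"{daily_tb:.0f} TB per day")
```

Changing the ratio shows how sensitive the estimate is: at 10x the same uploads would take about 430 TB a day.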

I don't know what Google does, but it's relatively common to have highly tiered storage systems that don't have ready access to all the data you're looking for. Google will likely have their front-end servers hold as much recent data as possible: for instance, enough to load all the thumbnails for a page of images. There will be fast file servers that keep somewhat less recent data on flash SSDs: for instance, the last couple of days of images you've taken, in low detail. There will be slower file servers that keep bulk data on higher-end hard drives: for instance, the last few weeks of images you've taken, in full detail. And, yeah, there will be a whole bunch of "cold storage" servers, using either tape or bulk hard drives, that store older data that usually hasn't been accessed in a few weeks. Here's an article about Facebook's system with 2 PB of data in a single rack - at 143 TB a day, that's two weeks' worth of data!

The storage is usually tiered like that so data can be loaded while you navigate to it. While you're loading the thumbnail for an image, the next slower tier will load the higher resolution version and perhaps the few images around it. While you load a slideshow of images from last week, the cold storage nodes will start to retrieve the images from the last few months.
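
To make the tiering idea concrete, here is a toy Python sketch. Everything in it is hypothetical (the tier names, the class, the sample key); it is not how Google or anyone else actually does it. It only shows the basic pattern of checking the fastest tier first and copying a hit into the fastest tier so the next request is quick. It leaves out the prefetching of neighbouring images described above.

```python
# Hypothetical tiers, fastest first. Names are illustrative only.
TIER_ORDER = ["front_end_cache", "flash_ssd", "hard_drive", "cold_storage"]


class TieredStore:
    """Toy model of tiered storage: one in-memory dict per tier."""

    def __init__(self):
        self.tiers = {name: {} for name in TIER_ORDER}

    def put(self, tier, key, data):
        """Store data in the named tier."""
        self.tiers[tier][key] = data

    def get(self, key):
        """Search tiers fastest-first; return (tier_name, data) or None."""
        for name in TIER_ORDER:
            if key in self.tiers[name]:
                data = self.tiers[name][key]
                # Copy the hit into the fastest tier for next time.
                self.tiers[TIER_ORDER[0]][key] = data
                return name, data
        return None


store = TieredStore()
store.put("cold_storage", "photo_2019.jpg", b"old photo bytes")
first = store.get("photo_2019.jpg")   # found in cold storage, then promoted
second = store.get("photo_2019.jpg")  # now served from the front-end cache
print(first[0], second[0])
```

A real system would also evict old entries from the fast tiers, since those are the small, expensive ones.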
posted by saeculorum at 9:56 PM on November 12, 2020 [6 favorites]


Best answer: Building on what others have said, here's a blog post from Backblaze (backup software) on their physical hardware — how much storage they can fit in their "pod", how much this costs per GB. Here's a different blog post from Backblaze on the software they leverage to store large amounts of data.

Granted, this is from a data backup company that's probably operating at a scale two or three orders of magnitude smaller than Google's, but my understanding of the industry is that this is a pretty typical approach. Commodity hardware is used because it's cheap and available. The software running on top of the commodity hardware is used to mitigate the downside: higher failure rates than more specialized hardware.
posted by Axle at 10:29 PM on November 12, 2020 [3 favorites]


Best answer: Google are probably big enough to buy custom-specified hardware all the way down, so consumer-style drives and servers but tailored to their own specific requirements.

For density, a typical data centre rack is 42U high (1U is one 'rack unit', 1.75 inches/44.45mm). Buying normal current hardware easily gets > 200TB in a 4U server, so call it about 2PB/rack. So if Google were buying that kind of hardware, they would need to buy about two racks a day for the data. They will have some redundancy on top of that (to cover hardware failures), which will increase the requirement, but not by as much as you might think.
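
The rack estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures above; it assumes the whole rack is filled with storage servers (no switches or power gear) and ignores both redundancy and compression.

```python
RACK_UNITS = 42          # typical rack height, in U
SERVER_UNITS = 4         # height of one storage server, in U
SERVER_TB = 200          # capacity per 4U server, from the estimate above
DAILY_UPLOAD_TB = 4300   # 4.3 million GB per day, in decimal TB

servers_per_rack = RACK_UNITS // SERVER_UNITS   # whole servers that fit in a rack
rack_tb = servers_per_rack * SERVER_TB          # capacity of a full rack, in TB
racks_per_day = DAILY_UPLOAD_TB / rack_tb       # racks filled per day

print(f"{servers_per_rack} servers/rack, {rack_tb / 1000:.0f} PB/rack, "
      f"{racks_per_day:.2f} racks/day")
```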

On tape, a modern tape stores about 20TB, so roughly 200 tapes/day would be needed to store Google's 4PB. Here's a fairly typical vendor's page with a bunch of facts about their robot system. They claim 27PB in a unit sized at around 3 racks. However, anything user-facing will not be using tape. Even in best-case situations, the read latency from tape is measured in minutes, by which time the human user will have become bored and moved on. They probably do use tape for cold backup systems, however.
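
The tape numbers work out the same way. This sketch uses the unrounded 4.3PB daily figure instead of 4PB, so it lands slightly above 200 tapes; the 20TB and 27PB values are the ones quoted above, and it ignores compression and redundancy.

```python
TAPE_TB = 20             # capacity of one modern tape, in TB
DAILY_UPLOAD_TB = 4300   # 4.3 million GB per day, in decimal TB
LIBRARY_PB = 27          # vendor's claimed capacity for one tape library unit

tapes_per_day = DAILY_UPLOAD_TB / TAPE_TB                    # tapes filled per day
days_to_fill_library = LIBRARY_PB * 1000 / DAILY_UPLOAD_TB   # days until one unit is full

print(f"{tapes_per_day:.0f} tapes/day, library full in {days_to_fill_library:.1f} days")
```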
posted by Urtylug at 1:14 AM on November 13, 2020


Best answer: So the company I work for deals in similar amounts of data (different line of work than Google), and the way we do it is that we have 8 data centers around the US (and more around the world). They operate in pairs so that data can be geo-redundant, for backup and failover (the assumption being that a data center could be destroyed).

You personally access the one closest to you. The ones in the center of the US cover more area and fewer customers. The backup one is most likely in the middle of the US if you are on the coast, and farther north/south if you are in the middle of the US.

We used to buy specialized hardware, but have since turned to commodity hardware and storage. Storage is done in storage area networks. Servers that store petabytes and more used to be expensive, but now they don't really cost that much.

We only use tape drives for long-term backup, meaning data that is more than 180 days old and is unlikely to be accessed again. Everything 180 days old or newer is served live; after that it moves to backup drives and archived storage. We personally keep data for 10 years max. Nothing older than that.
posted by The_Vegetables at 8:01 AM on November 13, 2020 [2 favorites]


Response by poster: Thanks everyone. All of the answers (and the links) really helped me understand this. Particularly saeculorum pointing out that the stated figure is what is uploaded prior to compression and then everyone explaining what the storage actually looks like. Thanks!
posted by Literaryhero at 3:10 PM on November 13, 2020

