How to talk about "Big Data"?
September 6, 2021 2:57 PM

How many Netflixes to the moon and back? How much Encyclopedia Britannica can fit on the head of a pin? How long would you need to listen to Spotify to read all emails ever?

At work, we are increasingly being asked to talk about the size and scope of our data in ways that everyday punters (such as government ministers) can understand.

Back in the day, it was simple. "A whole set of encyclopedias can fit on a single CD-ROM." That was impressive, because we all knew how big a set of encyclopedias was, and how big a CD-ROM was.

But now we are in the realm of petabytes. Many of Earth's people are magically streaming gigabytes at a time to their pocket phones, out of thin air, while on the bus, and without headphones.

One example from our department (not my example):

"If I was to express the volume (Tb) of downloads in the currency of music and music streaming since the Open Data Portal went live in August last year, it would equate to more than 26 years of nonstop listening or roughly four million songs."

This doesn't butter my parsnips any more than it butters yours. It is completely meaningless because it is dealing with such scale. Nobody knows what four million songs look like, or can conceive of listening to them for 26 years. And on top of that, data size simply isn't impressive any more. Sure we all want to have a terabyte of RAM, but only people who know about RAM and terabytes really think that way. Infinite data immediately at our fingertips makes communicating about what that means really tricky.

But I'm at a loss how else to talk about apples without comparing them to other, slightly different apples.

Any suggestions?
posted by turbid dahlia to Technology (19 answers total) 2 users marked this as a favorite
 
This is only a half serious answer, but--how many times the size of Wales would it be when laid out if it were a) text and b) printed.
posted by hoyland at 2:58 PM on September 6, 2021


Response by poster: It's not a bad approach, comparing digital to reality. Our problem is compounded because we are also starting conversations about "digital twins", which in theory will ultimately be 1:1 with the real physical world. So talking about A4 printouts of, like, XML code, covering the landmass of Australia x times, really gums up the narrative.

I'd honestly prefer to keep it simple. "We serve 5,000 terabytes of data a week" (or whatever the figure actually ends up being). That's not wrong, and for people who know what it means they'll know what it means. But the first thing the bigwigs in corporate comms then do (because they don't do anything else) is ask "What even is terabytes?" So then we have to talk about Amazon Prime or whatever, but that carries zero information because nobody knows anything about Amazon Prime, they just know that the bulk cat litter is on the way.

Anyway, sorry, I'll sit quietly now and listen!
posted by turbid dahlia at 3:11 PM on September 6, 2021


Chemistry teachers try to explain Avogadro's number by telling you how much space a mole (6.022 x 10^23) of donuts or marshmallows and such would take up.
posted by aniola at 3:59 PM on September 6, 2021


Wolfram Alpha says that 5,000 terabytes* is roughly 0.98 × the digitized material content of the Library of Congress (as of 2015), which is about 5.1 PB.

*try typing in different amounts, the comparisons change.
posted by oceano at 4:01 PM on September 6, 2021


This is less about different comparisons, but sometimes you can use different ways of describing the same comparison to different effect. For example, you might phrase

If I was to express the volume (Tb) of downloads in the currency of music and music streaming since the Open Data Portal went live in August last year, it would equate to more than 26 years of nonstop listening or roughly four million songs

as

If you set your radio* to play X Tb's worth of songs, one after the other, nonstop, it would take more than 26 years from start to finish.

*or your iPod, your phone, your Alexa, your whatever. The changes here are personalizing the scenario a bit ("your") and introducing a more concrete scenario with an identified actor (the radio plays the songs) in place of a more passive, theoretical construction ("years of nonstop listening"). Sometimes that helps people's eyes glaze over less and the information to be better internalized.

If you go farther along the personalization line, you could also try something like "4 million songs. That's enough for every single person in Wales to listen to a different song than anyone else, with around a million songs left over."
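For anyone who wants to check that these figures hang together, here is a rough sketch. It only uses the "26 years" and "four million songs" numbers from the quote; the Wales population is my own round assumption, not something from the thread.

```python
# Sanity-check the "26 years / four million songs" figures quoted above.
MINUTES_PER_YEAR = 365.25 * 24 * 60

# Assumption: a rough population figure for Wales, not taken from the thread.
WALES_POPULATION = 3_100_000


def implied_song_minutes(years: float, songs: int) -> float:
    """Average song length implied by `years` of nonstop play over `songs` songs."""
    return years * MINUTES_PER_YEAR / songs


def songs_left_over(songs: int, population: int) -> int:
    """Songs remaining after every person gets one song nobody else has."""
    return songs - population


if __name__ == "__main__":
    print(f"{implied_song_minutes(26, 4_000_000):.2f} minutes per song")
    print(f"{songs_left_over(4_000_000, WALES_POPULATION):,} songs left over")
```

An implied average of a bit under three and a half minutes per song is plausible, so the two figures in the quote are at least consistent with each other.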

Alternatively, you might compare things to the past: "If you count all the songs composed in Europe since the Greeks that still survive today - all the classical music, folk tunes, religious chants, yodels - you'll get around ####. Today that many different songs are downloaded every X minutes. Y times that many songs are downloaded every year." For terabytes: "In 2000, your computer probably had 20 gigabytes of storage. In 2010, your average computer had 250 gigabytes, and that was a lot. In 2021, your new laptop might come with 500 gigabytes, or even 1 terabyte - plenty of room for all your photos, spreadsheets, documents, music, and videos. We serve 5,000 terabytes every week."


I don't know what the goal of this kind of communication is for you, but sometimes it might also be worth asking yourself whether you're highlighting the most relevant part of the information for the specific purpose at hand. The number of terabytes might be worth knowing if, say, you need to talk about the infrastructure that will support that number. On the other hand, maybe the number of users served, or how many movies/songs/articles per person they're getting, is more relevant to the question the ministers are deciding, or whatever. Or the amount of time it takes to transfer the data, the amount of equipment needed, and so on. I say this because people understand information best when they can clearly see why they need to understand it - precisely how it ties to the decisions they need to make.
posted by trig at 4:09 PM on September 6, 2021 [7 favorites]


Executive types love having things compared to stadium sizes. Pick the biggest place in your area that does sports and pull the math backwards from there.

If your data is converted to copies of The Great Gatsby, every person at a sold out game at Soldier Field would have 81,000 copies of The Great Gatsby in their pockets. Or, to make things easier on them, each attendee could simply carry 70 fully loaded Kindle Paperwhites.

Check my math on that but you get the idea. Convert everything to hardcover books and units of people, add sports, and you'll make it digestible.
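Pulling the math backwards might look something like the sketch below. The 5,000 TB figure is from earlier in the thread; the stadium capacity and the size of one e-book copy are assumptions picked for illustration, so swap in your own venue and numbers.

```python
# "Stadium math": split a data volume across a sold-out crowd, then express
# each person's share as copies of a book.
TOTAL_BYTES = 5_000 * 10**12   # "5,000 terabytes" from the thread (decimal TB)
STADIUM_SEATS = 61_500         # assumption: approximate Soldier Field capacity
BOOK_BYTES = 10**6             # assumption: roughly 1 MB per e-book copy


def bytes_per_person(total_bytes: int, people: int) -> float:
    """Each attendee's share of the data, in bytes."""
    return total_bytes / people


def copies_per_person(total_bytes: int, people: int, book_bytes: int) -> float:
    """How many copies of the book each attendee would be carrying."""
    return bytes_per_person(total_bytes, people) / book_bytes


if __name__ == "__main__":
    share_gb = bytes_per_person(TOTAL_BYTES, STADIUM_SEATS) / 10**9
    copies = copies_per_person(TOTAL_BYTES, STADIUM_SEATS, BOOK_BYTES)
    print(f"{share_gb:.1f} GB per person, about {copies:,.0f} copies each")
```

Under those assumptions the per-person book count lands close to the 81,000 figure. The number of e-readers per person depends entirely on the device capacity you assume.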
posted by phunniemee at 4:10 PM on September 6, 2021 [4 favorites]


It's important to think about what information you do want to convey. Examples that don't work fail because they're not conveying anything other than "a lot." The CD-ROM/encyclopedias example works because we have some sense of the value of the information in a set of encyclopedias, and we see the value of having all of that in a smaller, accessible form.

So what do you want to convey to these people? What more do you want them to know beyond "it's a lot"?

Another way of looking at it: would your audience care or do anything differently if the amount were 100x larger or 100x smaller than it is? Should these people make some decision differently if some amount you're telling them about was 100TB vs 10PB? The answer to those sorts of questions might help you think about what the audience needs to know.

And then I'm guessing your best bet might be to use a relative measure. "It's 50 times the size we budgeted for last year." "We're still at just one tenth of the transfer rate of [some other service]."

And if in the end, this audience doesn't really need to know anything about the amount other than it's "a lot"... well then go with whatever sounds cool.
posted by whatnotever at 4:11 PM on September 6, 2021 [4 favorites]


The problem of finding analogues is that they kinda have to be analogous to make sense--you need an apple to compare with an apple. So you end up with comparisons to physical media, or other possibly more familiar digital media. You end up with Libraries of Congress or football fields of paper stacked 10 meters high or whatever other unsatisfactory measurement you have that hopefully relates to something within their realm of experience. Yes, it's not impressive anymore, but in general we've been jaded to technology's capabilities for a while now, and there isn't much you can do about that. When the Googles and Facebooks of the world are more likely to describe things in petabytes and exabytes, 5000 terabytes will never seem particularly impressive.

The only other thing I can think of to add concreteness is an actual description of what data you have stored or that you serve, e.g. X number of records, each with Y amount of detail on average, or whatever it is you're storing or serving. That way, at least the scope is directly represented in terms of what it's being used for.
posted by Aleyn at 4:14 PM on September 6, 2021 [1 favorite]


This is a hard question to answer because of two issues:

1. Data size is often changed. What I mean is that compression is everywhere. Lossless compression is used in databases. Very sophisticated lossy compression is used for video and audio files. Also, data is often stored in unoptimized ways with duplication or just numbers stored as text or simple tables stored as XML.

2. Our ability to process data makes the appropriateness of such comparisons change every year. DVDs hold almost 10x the data that CDs did. Currently, $100 hard drives hold 1000 times the data that a DVD does.

You can change the conversion to 'hours of music' or 'hours of video' or 'books stacked to the moon' or whatever to make data sizes seem more or less impressive. That's just marketing. Let corporate communications come up with their silly analogy.

If you want to communicate some meaningful comparison, I agree with the above: compare with a competitor, or another year, or relative to your capacity. Talk about the number of data records you served out. Give the absolute data size numbers. Those in-the-know will pay attention to that and ignore the analogy.
posted by demiurge at 4:19 PM on September 6, 2021 [1 favorite]


Maybe it's time to haul out analogies like:

* Number of DVDs worth of data
* Number of full sets of Encyclopedia Britannica worth of data
* Number of entire Libraries of Congress worth of data
* Number of hours watching Netflix at 4K (rather than DVD quality) worth of data
posted by kschang at 4:44 PM on September 6, 2021


This might be more appropriate for an audience with an IT background, but Amazon literally offer a service where they tow a shipping container full of storage to your data centre, to help you slurp in all the data and migrate it into AWS *:

> Each Snowmobile comes with up to 100PB of storage capacity housed in a 45-foot long High Cube shipping container that measures 8 foot wide, 9.6 foot tall and has a curb weight of approximately 68,000 pounds.

So that suggests data volume could be measured in units of Snowmobiles, aka 45 foot long shipping containers full of hard drives. "Big data" would entail multiple snowmobiles.


* god help you if you ever want to get your data back out again, they're not going to be equally motivated to make that easy.
posted by are-coral-made at 5:32 PM on September 6, 2021


aside:
> A4 printouts of, like, XML code, covering the landmass of Australia x times

this is not the future we were promised

> XML is like violence: if it doesn’t solve your problem, you aren’t using enough of it.
> — Heard from someone working at Microsoft
posted by are-coral-made at 5:36 PM on September 6, 2021 [3 favorites]


Maybe convert bytes to money? Amazon charges about $20/TB/month for online-accessible storage.
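As a rough sketch of that conversion, using the approximate $20/TB/month rate above (actual pricing varies by provider, tier, and region, so treat the rate as a placeholder):

```python
# Convert a stored data volume into an approximate monthly storage bill.
RATE_USD_PER_TB_MONTH = 20.0  # the approximate rate quoted above; a placeholder


def monthly_storage_cost_usd(terabytes: float,
                             rate: float = RATE_USD_PER_TB_MONTH) -> float:
    """Approximate cost in USD of keeping `terabytes` online for one month."""
    return terabytes * rate


if __name__ == "__main__":
    # Hypothetical example: holding 5,000 TB (the figure used earlier in the
    # thread for data served per week) in online storage for a month.
    print(f"${monthly_storage_cost_usd(5_000):,.0f} per month")
```

Note this prices storage only. Serving data out usually carries separate transfer charges, which are not modelled here.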
posted by RobotVoodooPower at 7:58 PM on September 6, 2021


In January, Lifewire had comparisons in An understandable guide to everything from Bytes to Yottabytes; one petabyte = "over 4,000 digital photos per day, over your entire life."
posted by Iris Gambol at 7:59 PM on September 6, 2021


Years ago, to express the potential of their shiny new DB2 database system (kinda like an old-style MySQL ;) IBM described the amount of data it could hold as so large that you could not hit all the disc drives with a fire hose.

When logged into AWS's filesystem a unix command 'df' shows the size of the partition, usually something like 192G or 2T for gigabytes or terabytes respectively but the potential empty space (and again this was several years ago) was 8E, that's 8 Exabytes or 8 million terabytes or 8 billion Gigabytes. I don't think that makes sense to anyone, literally anyone as anything but super crazy huge big.
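The unit hops in that paragraph can be checked with a small sketch. It assumes decimal (power-of-1000) units. Tools like 'df -h' usually report in powers of 1024, so the round numbers above are approximations either way.

```python
# Convert between decimal storage units (powers of 1000).
UNIT_BYTES = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Express `value` given in `from_unit` as a quantity of `to_unit`."""
    return value * UNIT_BYTES[from_unit] / UNIT_BYTES[to_unit]


if __name__ == "__main__":
    print(f"8 EB = {convert(8, 'EB', 'TB'):,.0f} TB")
    print(f"8 EB = {convert(8, 'EB', 'GB'):,.0f} GB")
```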

Perhaps just giving a dollar figure, cost of n storage will be $m.
posted by sammyo at 9:01 PM on September 6, 2021


According to Google, the amount of data currently in the world is approximately 44 zettabytes.

The volume of Earth's oceans is about 660,000,000 km^3.

44 zettabytes = 4.4e+22 bytes
660,000,000 km^3 = 6.6e+23 milliliters

If 1 byte = 1 milliliter (mL), current amount of data is a bit less than 1/10th the volume of the ocean.

In contrast, the volume of the Great Lakes combined is about 23,000 km^3 (2.3e+19 mL), so 44 zettabytes is about 2,000 Great Lakes.
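The arithmetic above can be reproduced with a short sketch. It takes the data and volume figures exactly as given in this post rather than verifying them, and uses the one-byte-per-milliliter mapping from above.

```python
# Bytes-as-milliliters comparison, using the figures quoted in this post.
ML_PER_KM3 = 10**15                # 1 km^3 = 1e9 m^3 = 1e12 L = 1e15 mL

WORLD_DATA_BYTES = 44 * 10**21     # 44 zettabytes, as quoted
OCEAN_KM3 = 660_000_000            # ocean volume as quoted in the post
GREAT_LAKES_KM3 = 23_000           # Great Lakes volume as quoted in the post


def km3_to_ml(km3: float) -> float:
    """Convert cubic kilometers to milliliters."""
    return km3 * ML_PER_KM3


def fill_ratio(data_bytes: float, volume_km3: float) -> float:
    """How many of the given volume the data fills at 1 byte = 1 mL."""
    return data_bytes / km3_to_ml(volume_km3)


if __name__ == "__main__":
    print(f"{fill_ratio(WORLD_DATA_BYTES, OCEAN_KM3):.3f} of the quoted ocean volume")
    print(f"{fill_ratio(WORLD_DATA_BYTES, GREAT_LAKES_KM3):,.0f} Great Lakes")
```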

--

Then you have to talk to what it means by "storage" - is data unique or redundant?

Especially in an environment of cloud and the relatively plummeting price of storage, the concept of moving data around is important - both the total volumes and the rate that it can be moved and be accessed.
posted by porpoise at 11:18 PM on September 6, 2021 [1 favorite]


Best answer: I want to suggest you may also be just generally frustrated with the pointlessness of the question from a technical / problem-solving perspective. I get the vibe that what they are going for in asking for analogies is, "that sounds awesome. Is it awesome? How awesome ARE we?" And, well, marketing always has to ask this question, and it IS useful in communicating the magnitude of what you are doing to laypeople (even some maybe on your team).

I say this because you're kind of sweepingly declaring data sizes "not impressive", but when you bring it down to a human scale it still is impressive even if it's easy. The example that does not "butter your parsnips" actually looks like a fine example to me if you just get less wordy about it - "in the last year we uploaded data that, if it were music, you could listen nonstop for 26 years". If you want to actually know if it's technically advanced it doesn't help, but sometimes folks are just looking for an impressive analogy and a chance to feel like part of something big and cool.
posted by Lady Li at 11:54 PM on September 6, 2021 [1 favorite]


Oh, and "marketing" in that sentence includes "the people who have to sell the execs on giving you more budget".
posted by Lady Li at 11:55 PM on September 6, 2021


1) Like porpoise and phunniemee, I find it helps to scale these big numbers by dividing them by another big number
e.g. The world's most abundant species is a marine bacterium called Prochlorococcus marinus. There are 10^27 of them, but they are tiny. The biomass of humanity is about the same order of magnitude.
2) We had a visitor from foreign who blew our monthly bandwidth allowance by downloading South American soaps to "watch later". They had several years of eye-glazing tosh on their hard-drive and we had an overdraft bill for €70.
Q "No e illimitado?" (Isn't it unlimited?)
A "No!".
3) In the late 90s I used to compare the size of the DNA database [doubling every 15 months] to km of Encyclopedia Britannicas . . . until I twigged that my 20something Encarta-raised audience had never used a trees-and-glue Encyclopedia.
posted by BobTheScientist at 2:44 AM on September 7, 2021 [1 favorite]

