What has big data done for us lately?
December 4, 2014 1:17 AM

I'm curious to learn what major scientific (or other) advances have been made using data mining of extremely large datasets (a.k.a. "big data")?

I think the explosion of available data over the last decade or so has probably given us great new insights into processes and phenomena, but I have no idea what these might be. Has there been any "like, whoa" stuff to come out of research analyzing big data?
posted by stinker to Science & Nature (23 answers total) 10 users marked this as a favorite
posted by devnull at 1:54 AM on December 4, 2014 [1 favorite]

One thing that jumps to mind is the (theory? discovery?) idea that environmental contamination from leaded gasoline was/is tied to increasing crime rates in the later 20th century.
posted by BigLankyBastard at 3:37 AM on December 4, 2014 [2 favorites]

The thing is that the usual approaches to big data are not geared towards producing "insights". We have hugely improved machine translation, voice recognition, ad targeting, power management, but the basic principle is always that you are just considering a lot more factors than a human could make sense of. It turns out you can do all these things well, but the "insight" that the model contains is a vector of a million tiny weights assigned to a million signals, each individually not that interesting.
posted by themel at 3:52 AM on December 4, 2014 [6 favorites]

The human genome.
posted by a lungful of dragon at 4:10 AM on December 4, 2014 [2 favorites]

posted by a lungful of dragon at 5:00 AM on December 4, 2014

One can argue that the Human Genome Project was a failure of big data, not a success.
posted by akk2014 at 5:03 AM on December 4, 2014 [1 favorite]

Maybe this TED playlist will give you some answers: Making Sense of Too Much Data.
posted by HopStopDon'tShop at 6:05 AM on December 4, 2014

I recently read that some big city fire departments are using Big Data to help them prioritize inspections and staffing. The (terrifying) problem being that they don't have the resources they need to inspect every building every year. So they built models to identify which buildings are most likely to be involved in a fire and prioritize those for inspections.
posted by lunasol at 6:58 AM on December 4, 2014 [1 favorite]

Hurricane modeling.
posted by oceanjesse at 7:00 AM on December 4, 2014

Historically, what you're describing sounds a lot like Kepler. He inherited a monstrous body of astronomical observational data from Tycho Brahe and spent years analyzing it, to discover that the planets move in ellipses instead of circles -- orbiting the sun instead of the Earth, as Copernicus had proposed.
posted by Chocolate Pickle at 7:37 AM on December 4, 2014

The discovery that ABC transporters contain some of the most highly conserved DNA sequences across all domains of life. We didn't need big data to guess that ribosomal RNA and tRNA would be highly conserved, but, if I've got my history right, the universality of ABC transporters wouldn't have been discovered without big data analysis.
posted by clawsoon at 8:13 AM on December 4, 2014

The CanMap project discovered that pretty much all of the massive variation between dog breeds can be explained by a handful of genes. Big PDF here with lots of pretty pictures and graphs.
posted by clawsoon at 10:39 AM on December 4, 2014

The family trees of humans, reptiles, and birds have been rewritten, with a number of unexpected twists (e.g. our breeding with Neanderthals).
posted by clawsoon at 10:54 AM on December 4, 2014

> One can argue that the Human Genome Project was a failure of big data, not a success.

People receiving custom-tailored cancer treatments disagree, as do people who have kids with previously undiagnosable rare diseases, or those who will benefit from cancer vaccine treatments, etc, etc.

It's impossible to imagine modern biological research without genomics, and it's impossible to do genomics without big data.
posted by chrisamiller at 11:47 AM on December 4, 2014 [2 favorites]

I'm not sure if this qualifies as "Big Data" in your mind, but we host a quarter of a Petabyte of data from the Arecibo observatory, and you (yes, you!) can look for radio pulsars in there, as well as looking for gravitational waves and gamma ray pulsars in other large data sets. (It's basically looking for rare but very interesting needles in huge haystacks.)
posted by RedOrGreen at 3:01 PM on December 4, 2014 [1 favorite]

Best answer: I'm much more used to answering this question in the other direction (justifying physics research by talking about how we developed big data, etc.), so I hope I don't muddle it too much. But:

The entirety of particle physics (and, if you're OK speaking in broad foundational strokes, therefore the fundamentals of all of physics, and arguably, therefore, all physics and/or engineering-based applications, which is arguably the entire universe and our understanding thereof...) as we know it today is synonymous with "data mining of extremely large datasets."

The World Wide Web itself was developed by Tim Berners-Lee at CERN around 1989-90 as a way to solve the problem of how to get large amounts of data from one location to another. It was developed so that particle physicists could share information--the extremely large datasets which make up their craft. Our tools to communicate and process that data came about directly from and because of physics research. Particle accelerators--the microscopes/telescopes of high-energy physics research, large ones of which really started being built in the 50's and 60's--produce so much data that special analysis techniques had to be, and still are being, developed just to figure out what to store and what to throw away. The triggers which today help catch, say, credit card fraud through monitoring spending patterns, are related to the ones used to determine whether a particular collision event produced exotic particles of interest.

You could say that big data itself arose because of fundamental physics research, and also that pretty much all fundamental particle physics discoveries and developments of the past three or four or more decades have been linked to/driven by these data mining/analysis techniques. So, the discovery of the Higgs boson. The discovery and characterization of quarks, force carriers, most of the Standard Model of physics. PET scans and MRIs. Superconductor technology. Here's a nice article in Symmetry magazine about the connection between particle physics and big data, and here's (PDF) one of Fermilab's public fact sheets laying out particle physics' many benefits to society.
posted by spelunkingplato at 3:03 PM on December 4, 2014 [3 favorites]

> One can argue that the Human Genome Project was a failure of big data, not a success.

That argument would need a very strong defense, circa 2014. The timing of the HGP aligned well with the explosion of inexpensive, distributed computing in the 1990s. It is pretty much one of the first major successes of big data analysis that has provided a direct benefit to the lives of people, particularly those afflicted with cancers or rare Mendelian or gene disorders.

It has given us the ability to get to a genome-level understanding, as one example, of why some children with acute myelogenous leukemia do not respond well to certain chemotherapies. After testing, we can try to give those children the right medicines as quickly as possible. Or we can do targeted gene therapies that help people who were unlucky and inherited some broken or missing genes, which otherwise cause death or intense suffering.

Epigenetics is another beneficiary, where we've discovered that environment can apply selection pressure and influence evolution, as well as genetic inheritance. We can study the epigenetic effects of various pollutants and develop informed public policy to help give our children and grandchildren a better chance at leading healthier lives.

This modern field would not have been possible without the bedrock of statistical and bioinformatics tools that were developed to deal with genomic data on this scale.
posted by a lungful of dragon at 4:51 PM on December 4, 2014 [3 favorites]

Much computer analysis of human neurological disorders currently relies upon very large databases of MRI brain scans, which are mined in myriad ways for patterns, e.g. how early atrophy can be detected in Alzheimer's disease, and distinguished from normal aging. Scans are gathered over many years, at many sites, covering many hundreds of subjects. Individuals' scans could be compared with large databases to place them along a timeline of disease progression, for example. A Google Scholar search for ADNI will throw up some specifics.

Galaxy Zoo, as well.
posted by Quagkapi at 5:23 PM on December 4, 2014 [1 favorite]

"One can argue that the Human Genome Project was a failure of big data, not a success."

I am all for treating science and technology with a critical and skeptical eye, but the linked piece is just handwaving rubbish.
For one thing, it relies on a misrepresentation of fact to set the stage for its conclusion. The refusal of the Q'ero to have their genome analyzed didn't happen in the "thrilling early years of the Human Genome Project." It happened in 2011, and it wasn't part of the Human Genome Project, it was part of the National Geographic Society's Genographic Project.
posted by Good Brain at 5:53 PM on December 4, 2014

Cameron Beccario’s Earth Wind Map is a visualization of global weather conditions using massive amounts of data from the Global Forecast System's supercomputers. Updated every three hours.

Click, drag, zoom in order to see, for a current instance, Typhoon Hagupit as it approaches the Philippines.

Click on the word ‘earth’ in the lower left-hand corner for additional overlays and display options (e.g. ocean temps, air temps, ocean currents, etc.)
posted by Short Attention Sp at 4:13 AM on December 5, 2014

Just as a start, I'll point you to DataKind's project page.
posted by Freen at 9:30 AM on December 5, 2014

I work in computational biology at a well-known academic institution. It would take me more time than I have available to respond properly to "One can argue that the Human Genome Project was a failure of big data, not a success" and the linked-to article. I wish I could, because the conclusion and the arguments therein are frankly ridiculous.
posted by StrawberryPie at 10:35 AM on December 5, 2014

BIG DATA: How biological data science can improve our health, foods & energy

The scope of these [life sciences] projects was almost unthinkable ten years ago. But when it comes to genome sequencing, the future is here as costs continue to plummet and capacity climbs.
posted by a lungful of dragon at 5:42 PM on December 7, 2014

This thread is closed to new comments.