Where to find big, open datasets that are related to the life sciences?
March 13, 2015 11:26 AM

Science Hivemind: Datasets Needed! I just learned about some INCREDIBLE tools for doing data analysis with students. Now I need raw data for my students to work with.

I just found out about plot.ly and import.io and Web Plot Digitizer. I feel kind of overwhelmed by the educational potential of these tools, and I need help figuring out some things to do with them.

I want my students to start doing investigations starting with a giant dataset - they ask an original question about the dataset ("Is there a correlation between x and y?" "What time of day does X tend to happen?" "How many cells make X protein?") and then use the tools to answer it.

Now I need data! Raw data is preferred, but plots are acceptable. It just needs to be quantitative and big. Lots of variables are nice so students can look for their own correlations.

Topics: cell biology in general, immune systems, genetic engineering, stem cells, genes in general, protein interactions, protein synthesis, body systems, genetics (23andme?), chromosomes, cell communication, evolution - fossil record, evolution - ongoing
posted by thelastpolarbear to Science & Nature (13 answers total) 32 users marked this as a favorite
 
Ooh! The canonical source is the UC Irvine Machine Learning Repository. They have hundreds of datasets, though not all of them are related to life science. There are pretty good sources for microarray data out there as well, but I can't remember where they are off the top of my head. I'll update if I remember.
posted by un petit cadeau at 11:31 AM on March 13, 2015


Found it! It's ArrayExpress.
posted by un petit cadeau at 11:34 AM on March 13, 2015 [1 favorite]




Response by poster: Just FYI, my students are high school juniors, not honors or AP, so things that involve more student-accessible variables are better.
posted by thelastpolarbear at 11:51 AM on March 13, 2015


Data.gov has thousands of open datasets.
posted by PercussivePaul at 11:56 AM on March 13, 2015 [1 favorite]


NCBI's Gene Expression Omnibus (GEO) has many different genomic data sets from microarray expression, sequencing, and other high-throughput genomic experiments.
posted by deludingmyself at 12:01 PM on March 13, 2015 [1 favorite]


Oh, perhaps better for your student level: the Wild Life of Our Homes project has microbial data from a citizen science project looking at what microbes live where in the home environment, and they have some tutorials for visualizing and exploring their microbiome species data.
posted by deludingmyself at 12:05 PM on March 13, 2015


my students are high school juniors

As someone who taught statistics to undergrads recently, I just want to add that you'll be doing these students a huge favor if you equally emphasize proper use of statistical techniques when you explore these data, especially if you're looking at big datasets!
posted by dialetheia at 12:09 PM on March 13, 2015 [5 favorites]


This is a little further from your core topics than the other suggestions, but the data are much easier to understand, well within the ken of a high-school student, and a lot of fun to boot.

Check out CDC WONDER: http://wonder.cdc.gov/

I'd go with the compressed mortality database to start.
posted by cgs06 at 12:49 PM on March 13, 2015


ChEMBL is what I use, but it might not be in a good enough format to use easily. Worth a shot to see if you can help them dig something out of it though.
posted by koolkat at 12:53 PM on March 13, 2015


Regarding evolution and/or paleontology, either FossilWorks or Faunmap might have something you'd like. For genetics info, you could use the UCSC genome browser. Google also has a whole bunch of public data up here; not sure if it'll have what you want but a quick search for health brought up some options involving disease surveillance, for instance.
posted by mismatched at 7:35 PM on March 13, 2015


Oh! And for ecology (I know you didn't list it, but I'm sure there could be evolutionary stuff you could look into here) check out Protected Planet.
posted by mismatched at 7:44 PM on March 13, 2015


thelastpolarbear: "Now I need data! Raw data is preferred, but plots are acceptable. It just needs to be quantitative and big. Lots of variables are nice so students can look for their own correlations."

My background is in CS, but I took a bioinformatics course in grad school. And now one of the projects I support is the Protein Geometry Database. Bioinformatics is the 'big data' of biology. And fortunately, the Wellcome Trust arm-twists researchers into publishing their datasets in exchange for grant funding.

However, one of the challenges you're going to find is that the datasets are very, very complex. Take a look at a Protein Data Bank (full sequence) example. It's attempting to describe every amino acid's relation to others, at various scales. I have not seen a general purpose tool that can even accept this data, let alone produce interpretations. From what I can tell, the tools you're describing are intended for tabular datasets with very few dimensions, like eCommerce product search pages.

The other challenge is that big data is also very, very big. A single sequenced human genome is around 4GB. The proteome is probably even bigger. Most free services won't support it, and even if they did, your school network connection, when shared across 30 students all uploading datasets to an analysis engine, will likely cause much frustration. There's a reason my campus is spending millions to upgrade the research area network.
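To put a rough number on that, here is a back-of-the-envelope sketch in Python. The 30 students and the roughly 4 GB per genome come from above; the 100 Mbit/s shared link is purely an assumption for illustration, and the function name is made up. It also assumes the link runs at full speed with no overhead, so a real classroom would do worse.

```python
def transfer_hours(num_students, gigabytes_each, link_mbit_per_s):
    """Hours needed to push everyone's data through one shared link.

    Assumes decimal units (1 GB = 8000 megabits) and an ideal link
    with no protocol overhead or competing traffic.
    """
    total_megabits = num_students * gigabytes_each * 8 * 1000
    seconds = total_megabits / link_mbit_per_s
    return seconds / 3600


# 30 students and 4 GB each are from the text; 100 Mbit/s is an assumed link speed.
hours = transfer_hours(num_students=30, gigabytes_each=4, link_mbit_per_s=100)
print(f"About {hours:.1f} hours to upload everything")
```

Under those assumptions the uploads alone take longer than two class periods, before any analysis happens.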

You probably want to work with pre-processed data. Not necessarily plots, but something a little more pre-chewed, classified and aggregated. Among the most popular datasets at the UCI ML repo that un petit cadeau reported, a few have biological themes you might be able to design a lesson plan around. Breast cancer diagnostics, abalone age, and heart disease all seem relevant in some fashion. And you should be able to find some correlations. Machine learning can do better on these multidimensional datasets, but running regressions as a first step is fairly common in data analysis.
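For a sense of what "looking for a correlation, then running a regression" means in practice, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not real abalone measurements; the function names are my own, and a spreadsheet or plot.ly would do the same job for students.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values, NOT taken from the UCI abalone dataset.
shell_length_mm = [70, 85, 90, 105, 110, 125]
ring_count = [7, 8, 9, 10, 11, 13]

r = pearson_r(shell_length_mm, ring_count)
slope, intercept = fit_line(shell_length_mm, ring_count)
print(f"r = {r:.2f}, rings = {slope:.3f} * length + {intercept:.2f}")
```

A value of r near 1 or -1 suggests a strong linear relationship; the fitted line is only a first look, not evidence that one variable causes the other.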

I don't know that it fits into a biology course, but if you're teaching a stats course, there's some amazing work that should be accessible to high school students regarding identifying patients in anonymized records. More or less, when you have date of birth, ZIP Code and sex in one dataset, you can usually find another dataset with the same three values that also gives you names and so on, which works around 87 percent of the time. It's a good way to talk about privacy and ethics in relation to big data.
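The linking trick itself is simple enough to show in a few lines. This is only a sketch with invented records: every name, date, ZIP Code, and diagnosis below is fictional, and the function name is my own. It matches an "anonymized" table against a public one on the three shared fields and keeps only the unambiguous hits.

```python
from collections import defaultdict


def link_records(anonymized, public):
    """Pair each anonymized row with a name when (dob, zip, sex) is unique.

    Rows whose key matches zero or several public records are skipped,
    because they cannot be tied to one person.
    """
    index = defaultdict(list)
    for person in public:
        key = (person["dob"], person["zip"], person["sex"])
        index[key].append(person["name"])

    matches = []
    for row in anonymized:
        key = (row["dob"], row["zip"], row["sex"])
        names = index.get(key, [])
        if len(names) == 1:
            matches.append((names[0], row))
    return matches


# Entirely made-up records for classroom illustration.
anonymized = [
    {"dob": "1998-04-02", "zip": "97201", "sex": "F", "diagnosis": "asthma"},
    {"dob": "1997-11-15", "zip": "97202", "sex": "M", "diagnosis": "flu"},
    {"dob": "1998-07-30", "zip": "97201", "sex": "M", "diagnosis": "sprain"},
]
public = [
    {"name": "Alex Example", "dob": "1998-04-02", "zip": "97201", "sex": "F"},
    {"name": "Sam Sample", "dob": "1997-11-15", "zip": "97202", "sex": "M"},
    {"name": "Pat Placeholder", "dob": "1997-11-15", "zip": "97202", "sex": "M"},
]

for name, row in link_records(anonymized, public):
    print(name, "->", row["diagnosis"])
```

Here only the first record is re-identified: the second matches two people and the third matches nobody. Students can see how adding or removing a field changes how many rows become unique.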
posted by pwnguin at 10:48 AM on March 14, 2015

