Where to find big, open datasets that are related to the life sciences?
March 13, 2015 11:26 AM
Science Hivemind: Datasets Needed!
I just learned about some INCREDIBLE tools for doing data analysis with students. Now I need raw data for my students to work with.
I just found out about plot.ly and import.io and Web Plot Digitizer. I feel kind of overwhelmed by the educational potential of these tools, and I need help figuring out some things to do with them.
I want my students to start doing investigations starting with a giant dataset - they ask an original question about the dataset ("Is there a correlation between x and y?" "What time of day does X tend to happen?" "How many cells make X protein?") and then use the tools to answer it.
Now I need data! Raw data is preferred, but plots are acceptable. It just needs to be quantitative and big. Lots of variables are nice so students can look for their own correlations.
Topics: cell biology in general, immune systems, genetic engineering, stem cells, genes in general, protein interactions, protein synthesis, body systems, genetics, (23andme?), chromosomes, cell communication, evolution - fossil record, evolution - ongoing
The Roadmap Epigenomics Consortium released a big pile of data earlier this year.
posted by foxfirefey at 11:35 AM on March 13, 2015 [1 favorite]
Response by poster: Just FYI, my students are high school juniors, not honors or AP, so things that involve more student-accessible variables are better.
posted by thelastpolarbear at 11:51 AM on March 13, 2015
Data.gov has thousands of open datasets.
posted by PercussivePaul at 11:56 AM on March 13, 2015 [1 favorite]
NCBI's Gene Expression Omnibus (GEO) has many different genomic data sets from microarray expression, sequencing, and other high-throughput genomic experiments.
posted by deludingmyself at 12:01 PM on March 13, 2015 [1 favorite]
Oh, perhaps better for your student level: the Wild Life of Our Homes project has microbial data from a citizen science project looking at what microbes live where in the home environment, and they have some tutorials for visualizing and exploring their microbiome species data.
posted by deludingmyself at 12:05 PM on March 13, 2015
my students are high school juniors
As someone who taught statistics to undergrads recently, I just want to add that you'll be doing these students a huge favor if you equally emphasize proper use of statistical techniques when you explore these data, especially if you're looking at big datasets!
posted by dialetheia at 12:09 PM on March 13, 2015 [5 favorites]
This is a little further from your core topics than the other suggestions, but the data are much easier to understand, well within the ken of a high-school student, and a lot of fun to boot.
Check out CDC WONDER: http://wonder.cdc.gov/
I'd go with the compressed mortality database to start.
posted by cgs06 at 12:49 PM on March 13, 2015
ChEMBL is what I use, but it might not be in a good enough format to use easily. Worth a shot to see if you can help them dig something out of it though.
posted by koolkat at 12:53 PM on March 13, 2015
Regarding evolution and/or paleontology, either FossilWorks or Faunmap might have something you'd like. For genetics info, you could use the UCSC genome browser. Google also has a whole bunch of public data up here; not sure if it'll have what you want, but a quick search for health brought up some options involving disease surveillance, for instance.
posted by mismatched at 7:35 PM on March 13, 2015
Oh! And for ecology (I know you didn't list it, but I'm sure there could be evolutionary stuff you could look into here) check out Protected Planet.
posted by mismatched at 7:44 PM on March 13, 2015
thelastpolarbear: "Now I need data! Raw data is preferred, but plots are acceptable. It just needs to be quantitative and big. Lots of variables are nice so students can look for their own correlations."
My background is in CS, but I took a bioinformatics course in grad school. And now one of the projects I support is the Protein Geometry Database. Bioinformatics is the 'big data' of biology. And fortunately, the Wellcome Trust arm-twists researchers into publishing their datasets in exchange for grant funding.
However, one of the challenges you're going to find is that the datasets are very, very complex. Take a look at a Protein Data Bank (full sequence) example. It's attempting to describe every amino acid's relation to others, at various scales. I have not seen a general-purpose tool that can even accept this data, let alone produce interpretations. From what I can tell, the tools you're describing are intended for tabular datasets with very few dimensions, like eCommerce product search pages.
The other challenge is that big data is also very, very big. A single sequenced human genome is around 4GB. The proteome is probably even bigger. Most free services won't support it, and even if they did, your school network connection, when shared across 30 students all uploading datasets to an analysis engine, will likely cause much frustration. There's a reason my campus is spending millions to upgrade the research area network.
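To put a rough number on the network problem, here is a back-of-the-envelope sketch. The 4 GB file size and the 30 students come from the paragraph above; the 100 Mbit/s shared uplink and the function name `upload_hours` are assumptions made up for illustration, so swap in whatever your school actually has.

```python
def upload_hours(file_gb: float, students: int, link_mbps: float) -> float:
    """Hours needed for every student to upload one file over a shared link."""
    total_bits = file_gb * 8e9 * students      # 1 GB taken as 10^9 bytes
    seconds = total_bits / (link_mbps * 1e6)   # link speed in bits per second
    return seconds / 3600


# 4 GB per genome and 30 students are from the comment above;
# the 100 Mbit/s uplink is an assumed figure, not a measurement.
print(f"{upload_hours(4, 30, 100):.1f} hours")  # prints "2.7 hours"
```

Even under that fairly generous assumption, the uploads alone would eat several class periods, and that ignores protocol overhead and everyone else using the network.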
You probably want to work with pre-processed data. Not necessarily plots, but something a little more pre-chewed, classified and aggregated. Among the most popular datasets at the UCI ML repo that un petit cadeau reported, a few have biological themes you might be able to design a lesson plan around. Breast cancer diagnostics, abalone age, and heart disease all seem relevant in some fashion. And you should be able to find some correlations. Machine learning can do better on these multidimensional datasets, but running regressions as a first step is fairly common in data analysis.
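As a minimal sketch of that "regression as a first step" idea, the snippet below computes a Pearson correlation and a least-squares line for two columns using only the Python standard library. The function names and the numbers are invented for illustration (loosely shaped like an abalone size-versus-rings question); they are not taken from the UCI files, so students would replace them with columns from whichever dataset they pick.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up example values, NOT real measurements.
shell_length_mm = [40, 55, 70, 85, 100]
ring_count = [5, 7, 8, 10, 12]

r = pearson_r(shell_length_mm, ring_count)
slope, intercept = fit_line(shell_length_mm, ring_count)
print(f"r = {r:.3f}, rings = {slope:.3f} * length + {intercept:.3f}")
```

A value of r near 1 or -1 suggests a strong linear relationship, but it says nothing about cause, which is a good discussion point in its own right.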
I don't know that it fits into a biology course, but if you're teaching a stats course, there's some amazing work that should be accessible to high school students on identifying patients in anonymized records. More or less, when one dataset has date of birth, ZIP Code and sex, you can find another dataset with the same three values that also gives you names and so on around 87 percent of the time. It's a good way to talk about privacy and ethics in relation to big data.
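The linking trick is simple enough to show in a few lines. This is only a sketch of the general idea, not the method from the research itself: the records, the field names, and the `link_records` helper are all invented for illustration.

```python
from collections import defaultdict


def link_records(anonymized, public, keys=("dob", "zip", "sex")):
    """Attach a name to each anonymized record whose key fields match
    exactly one person in the public dataset."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[tuple(person[k] for k in keys)].append(person["name"])

    linked = []
    for record in anonymized:
        names = names_by_key.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # keep only unambiguous matches
            linked.append({**record, "name": names[0]})
    return linked


# Entirely fictional records, for illustration only.
anonymized = [
    {"dob": "1970-01-01", "zip": "00001", "sex": "F", "diagnosis": "asthma"},
    {"dob": "1985-06-15", "zip": "00002", "sex": "M", "diagnosis": "flu"},
]
public = [
    {"name": "Person A", "dob": "1970-01-01", "zip": "00001", "sex": "F"},
    {"name": "Person B", "dob": "1985-06-15", "zip": "00002", "sex": "M"},
    {"name": "Person C", "dob": "1985-06-15", "zip": "00002", "sex": "M"},
]

for match in link_records(anonymized, public):
    print(match["name"], "->", match["diagnosis"])  # prints "Person A -> asthma"
```

The second record stays anonymous here because two people share its birth date, ZIP Code and sex. The point of the research is that in real data such collisions are rare.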
posted by pwnguin at 10:48 AM on March 14, 2015
This thread is closed to new comments.
posted by un petit cadeau at 11:31 AM on March 13, 2015