Future DNA modeling tech?
June 17, 2016 4:34 PM

Is it theoretically possible to design a computer program that can take an organism's DNA and produce a model of that organism, without being given any other information? (It does not need to be possible with current tech, just theoretically possible.) This is, of course, for a story.

Here is what I need to happen: Person puts a DNA sample into a 'scanner' which sequences the sample, and then sends this data to a program which constructs a model of the organism that this DNA is from. Assume processing power is not an issue.

At first I thought this concept was so simple that it wasn't going to need much explanation. But the more I think about it, the more I get hung up on issues of plausibility, and I'm not sure how to deal with them.

Here are the main issues I'm having:

-Environment. The program does not have any information on the physical environment that this sample is from. Atmosphere, gravity, chemical composition, climate, nothing. Imagine perfectly modeling a human but assuming an incredibly high-pressure environment, or a tropical plant in a freezing environment - a 'perfect' model of these organisms would appear deformed or broken, but if you had no idea what they were supposed to look like, how would you know that it was wrong? Is it possible that the program could extrapolate this environmental info from the DNA?

-Change over time. How would such a program account for this? A human fetus and a human toddler and a human adult have the same DNA; a caterpillar and a butterfly have the same DNA. In terms of modeling, I guess the program could display a birth-to-death animation, but would it be able to figure out, from only the DNA, that a caterpillar makes a cocoon and then turns into a butterfly? Or that deciduous trees shed their leaves every year?

Even if so, might such a program run into issues like: human fingernails never stop growing, rodent teeth never stop growing. Is there a plausible way to control for this, in order to prevent the program from making, say, a perfect image of a human with five-foot-long fingernails?

For that matter - what about diet, and how that influences development? Would a person with a copy of human DNA but no knowledge of humans or Earth be able to use such a program to figure out what an adult human looks like if they didn't already know the various macro- and micro-nutrients required for a human to live and grow to adulthood? Or would it be possible to extrapolate this info from the DNA?

-Epigenetic changes. I don't know a ton about epigenetics, but I know it involves 'switching off' chunks of DNA, for purposes ranging from "making different types of cells in one body" to "turning a worker bee grub into a queen bee." If you grabbed, say, a human liver cell, would the program be able to produce a complete human model from it?

This probably isn't even every possible problem, but this is what sticks out to me as a layperson.

As I said above: all I need my character to do is stick some unknown alien DNA into the program and then see an image of the alien. The actual technical workings of this program don't play into the story at all - it just has to work. But as the author, I need to have at least a vague understanding of how such a program would handle the above problems.

Any articles (or fiction) based on this concept would be awesome, but any input is more than welcome. I just don't want my critique group to be like "yeah I like this story but it fundamentally makes no sense."
posted by showbiz_liz to Grab Bag (29 answers total) 5 users marked this as a favorite
I should add: I'd also appreciate suggestions of where else online I might ask this - any forums for bored genetics students, etc?
posted by showbiz_liz at 4:41 PM on June 17, 2016

No. The environment during development definitely affects how the genetic information is expressed.
posted by Bruce H. at 4:44 PM on June 17, 2016 [1 favorite]

i believe that it's not possible to do directly, because the environment in which the embryo is raised is also important. so there's a chicken and egg problem.

but i don't see why you couldn't have something that starts by assuming a known environment (of, say, the existing animal with the closest matching dna) and then either tries to simulate the development of the given dna there, or iteratively modifies the starting animal's dna to get closer and closer to the target dna - effectively "evolving" (not really evolution, some kind of guided change) the animal in software through successive births with self-consistent dna and environment.

obviously there would be problems with computing power. not just simulating an entire organism at the molecular level, but simulating an evolving series, as they repeatedly give birth, is going to be way beyond the capacity of any computer now or in the foreseeable future.
posted by andrewcooke at 4:49 PM on June 17, 2016

well, more exactly, i imagine what i describe would be so difficult as to be close to impossible. but not so far removed from reality that it couldn't be used in a book.
posted by andrewcooke at 4:50 PM on June 17, 2016

and there would be a problem with the starting animal possibly introducing biases. so if you had alien dna but started with a human environment you might end up with an alien-human hybrid (in the sense that the human environment led to a self-consistent end product that expressed the alien dna differently from its actual host) that would be even more difficult for the hero to destroy :o)
posted by andrewcooke at 4:52 PM on June 17, 2016

You can hypothetically do this, as long as you have something to compare with that is reasonably related.

For example, we know what the gene for red hair looks like. If your alien has blue hair, and our software can't identify blue hair, we're out of luck. Similarly if the genetic information for scales is too different from the genetic data for scales on file, we won't know.

Environment impacts some things in some ways. Methylation patterns are a big deal because sometimes only one of two genes is used, or expressed at all. There are typically rules for these things and if your computer knew them, you'd be fine. There are methylation patterns associated with childhood stress (starvation), or you could accept size may be somewhat off.

If you had software that had seen something similar to the alien before, it could come up with a picture (size/facial features/coloring is all in our dna). You could add some level of rna to deal with expression or protein (this could distinguish between butterfly and caterpillar). Telomere length could help determine baby vs adult human type thing.

As for fingernail length, you are out of luck. Same with hair style. But you could have a program that assumes something reasonable based on prior knowledge of all aliens.

I think your idea is sound, but you need to accept that your image won't be perfect.
posted by Kalmya at 5:06 PM on June 17, 2016 [1 favorite]

I think your idea is sound, but you need to accept that your image won't be perfect.

It would also be acceptable if, rather than an image, the program could give an imperfect/incomplete description of the organism - like, "we have no idea what color this thing's hair is but we know it HAS hair" or "maybe it has skin or maybe it doesn't, but it definitely has eight limbs," or something along those lines. Given the responses above, maybe I need to switch my thinking a bit, and ask: in such a situation, given that the program could never make a perfect model, how much information could it give about the organism?

Maybe slightly more info about the nature of this story is required: the DNA in question is discovered after being "stored" in a switched-off form within the DNA of an unrelated organism millions of years prior. So the character in the story recognizes "hey, this massive chunk of switched-off DNA might have been put there on purpose," and on a hunch, she runs it through the program by itself. The result of this is just that we find out these aliens used to exist - there's no question of trying to 'bring them back' or anything, it's essentially just a message, like the Voyager Golden Record in DNA form. But no one could possibly have any information about what the original aliens looked like, or what they might have been similar to.
posted by showbiz_liz at 5:16 PM on June 17, 2016

The difficulty is that there's no way to directly tell which part of the DNA is in use and which part is "junk". In humans something between 70% and 90% (they keep changing their estimates) is unused. Some of that is genetic information which used to be needed but which isn't any more. A lot of it is simply garbage. (There's one sequence of about 300 base pairs which appears more than a million times. If it means anything no one has ever figured it out, and current theory is that it was inserted by a retrovirus.)

When the egg is formed, the cytoplasm contains a complex mix of enzymes, non-enzyme proteins, and RNA which implements and bootstraps the state machine that interprets the DNA and "executes the code". In essence, it knows where the code starts. Without that, you're lost at sea.

Could it be determined algorithmically without that information by examining the DNA alone? Unfortunately, no. It's a problem known in computer science as "dead code removal" and it's isomorphic to the Halting Problem.
posted by Chocolate Pickle at 5:18 PM on June 17, 2016 [1 favorite]

If you know enough about the organism to identify protein coding regions, you can infer function from them, and from that some stuff about the environment the organism is adapted for (heat tolerant, etc.) It's going to be easier to get stuff like that than a straight up BOLO style picture.

ETA: endosymbiont theory might be worth checking out given the "organism within an organism" conceit
posted by momus_window at 5:18 PM on June 17, 2016

This might be a good question for /r/AskScience.

I think you'd need to simulate a bit of universe to "run" the DNA in (and we're not very good at that), and that means making assumptions about the environment.

At best, I suppose your program could run a bunch of simulations, and see if they come out roughly the same. (This is what we do for weather predictions). This would imply that the DNA sample is stable over a wide range of environments. But you're never going to know that there's not, I don't know, a chemical excreted in the womb that makes the scales bright blue, or turns them into feathers.

A real-world example of the environment altering the result is temperature-dependent sex determination.
posted by Leon at 5:19 PM on June 17, 2016

The genetic information is not organized in any sense that a human would use the term. Active genes and junk DNA intermix freely, and in some cases there are junk segments in the middle of active genes, which get clipped out before the resulting RNA gets used. Genes which are parts of the same mechanism don't appear near one another and may not even be on the same chromosome.

A programmer would refer to it all as "content addressable memory". Active genes aren't found by any kind of address. Instead, genes which need to be activated have a header region, and the mechanism that turns them on and off finds them by searching.
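To make the "found by searching, not by address" idea concrete, here is a toy Python sketch. It assumes a single made-up header motif ("TATAAA", chosen purely for illustration) and a hypothetical function name; real regulatory regions are far messier than an exact string match.

```python
from typing import List


def find_header_sites(genome: str, header: str = "TATAAA") -> List[int]:
    """Return every start index where the header motif occurs (overlaps allowed)."""
    sites = []
    start = genome.find(header)
    while start != -1:
        sites.append(start)
        # Resume one position later so overlapping occurrences are not missed.
        start = genome.find(header, start + 1)
    return sites


# Made-up sequence: nothing here records where a gene lives,
# so the only way to locate one is to scan for its header.
genome = "GGCTATAAAGGCATTTACGTATAAACC"
print(find_header_sites(genome))
```

The point of the sketch is only that lookup cost grows with the amount of DNA scanned, since there is no index to jump to.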

And there are a lot of kinds of mechanisms for locating genes, turning them on, and turning them off. It's all ad hoc, which in fact is about what you'd expect from a system created by evolution, whose motto is "Whatever works!"
posted by Chocolate Pickle at 5:28 PM on June 17, 2016

Organisms purge unused DNA. Certainly over the course of millions of years, essentially nothing would be left. It would be quite detrimental, evolutionarily speaking, to have so much extra DNA for a number of reasons.

It takes energy to copy DNA.

Mix ups and duplications happen and these organisms would start expressing the other DNA if it was compatible with their genetic code. Otherwise they'd translate nonsense which could be quite deleterious.

So, no way, no how. This would never happen.
posted by Kalmya at 5:33 PM on June 17, 2016

Well, there's stuff like this and this, so you can, for example, sort of predict human facial features from DNA alone. More here. However, that requires being able to do a big experiment called a genome-wide association study, so if you have enough study subjects, you can infer stuff like that. The PLoS paper in my first link used roughly 5,000 subjects, which is actually sort of on the small side as far as GWASs go. This kind of analysis wouldn't be possible with n=1, though, and with a species whose genome you haven't fully characterized.

What you COULD do, though, is to analyze the DNA in such a way so as to attempt to identify protein-coding genes. Basically, you take the raw code of the alien DNA ('GGCTACG...') and try to find regions that 'line up' with the regions of human DNA that you already know code for proteins found in hair (e.g. keratins), eyes (opsins), etc. So depending on how different your aliens are from humans, something like that could be possible - you could do that kind of analysis and be able to say, hey, this alien has this sequence in their DNA that looks like this region in human that we already know codes for this protein found in eyes, so they could have eyes or, failing that, some kind of structure that detects light.
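As a rough illustration of the "line up" step, here is a minimal Python sketch of Smith-Waterman local alignment scoring, which is one standard way to ask whether any stretch of one sequence resembles any stretch of another. The scoring values and the short sequences are made up for the example, and real tools work on protein translations and much longer inputs.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman: best score over all local alignments of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical fragments: a known reference stretch and an alien stretch
# that happens to share the run "ACGT".
known_fragment = "GGACGTGG"
alien_fragment = "TTACGTTT"
print(local_alignment_score(alien_fragment, known_fragment))
```

A high score only says the two stretches look alike; whether the alien sequence does the same job is a separate inference.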

Tougher still would be to try and predict development or, really, any sort of dynamic process from DNA, because knowing the DNA sequence of something only gives you a sort of a snapshot of that organism. You would need to get an idea of all the different sorts of regulatory processes going on to figure out what would happen, since little bits of DNA can be turned on and off or chopped up in all sorts of weird ways to make a whole bunch of different proteins. Then you need to be able to measure those over time with something like RNASeq, which requires being able to take samples from the organism. There's a lot of randomness inherent to these processes which makes them difficult to model or to be able to say what'd happen, so even if you knew all the underlying mechanisms, it'd still be hard to predict the final phenotype of an organism.
posted by un petit cadeau at 5:38 PM on June 17, 2016 [1 favorite]

Kalmya, unfortunately, that isn't true. Most human DNA is unused, and that appears to be the case for all multicellular eukaryotes whose genes have been sequenced.

(For prokaryotes it's a different matter.)
posted by Chocolate Pickle at 5:38 PM on June 17, 2016

There's been some prelim work done in the forensics world on this but few could identify their co-worker from the resulting facial image

http://blogs.discovermagazine.com/d-brief/2015/02/25/dna-image-face/ -article in Discover
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004224 theoretical article in PLOS
posted by beaning at 5:39 PM on June 17, 2016

Oh, also, unused DNA changes relatively rapidly. These sequences would not make sense millions of years later. DNA is conserved in direct relation to its utility. So something very important would change maybe 10% in a million years. Something unused would change completely and be moved/deleted.
posted by Kalmya at 5:42 PM on June 17, 2016 [1 favorite]

Another problem is that sometimes genetic sequences are created on the fly. One problem which puzzled early genetic researchers was how immune systems learned to create antibodies.

It turns out that the B cells have a genetic toolbox of maybe a thousand genetic fragments, and when the B cell is faced with a new antigen which isn't "self" (and I don't think anyone knows yet how it determines that) then it starts piecing together pieces of DNA from the toolbox and creating random antibodies.

Once it lucks into one that fits the antigen and knows it (another unsolved problem) it starts reproducing like mad, and it also starts releasing a hormone that says "We're sick" to the rest of the immune system. The whole process is only partially understood, but the point for your purposes is that the antibodies (which are proteins) are not coded directly in the DNA.
posted by Chocolate Pickle at 5:44 PM on June 17, 2016

From the perspective of a spec-fic author/editor, I'd say it depends on how hard you need your SF to be. You could get away with this (especially the "we know this and this but not everything by any stretch" version) in almost any context. Only the hardest of hard SF fans would be particularly bothered by the idea of some sort of device that has this capability. I mean, if you have it work instantly, have perfect information, and I dunno start spitting out full-grown copies of the whoozit, then yeah, you're going to get some eyerolls, but most SF is pretty loose on the actual rules of science anyway. If you make a nod to the epigenetic/environmental influence problem and also lean more toward the "we know it has hair, but not how long etc." end of things, you'd be fine just about anywhere.
posted by Scattercat at 6:01 PM on June 17, 2016 [1 favorite]

Oh, also, unused DNA changes relatively rapidly. These sequences would not make sense millions of years later. DNA is conserved in direct relation to its utility. So something very important would change maybe 10% in a million years. Something unused would change completely and be moved/deleted.

It's not that straightforward. For one thing, there are "neutral mutations" which can happen to active genes without changing the protein they code for. So active genes can also change over time.

Anyway, there isn't any mechanism in eukaryotes to clip out and discard unneeded DNA from chromosomes.

Prokaryotes can do that, but it's crude: it's simply a copying mistake in the DNA. Usually that results in a failed unit, but since bacteria usually reproduce rapidly (as fast as once every half an hour if conditions are optimum) and exist in vast numbers, losing a few to reproductive mistakes doesn't matter.

And as you pointed out, losing useless DNA makes the cell operate more efficiently, so it gets selected for later. Genetic analysis of prokaryotes shows that in fact their DNA is used efficiently.

But none of that applies to multi-cellular eukaryotes. Because our reproduction rate is so slow and our numbers so few, that mechanism operates so slowly as to be evolutionarily irrelevant, and we carry around vast amounts of useless genetic information.

It's a common mistake to assume that evolution yields perfect solutions. This is such a case. It would be nice if multicellular eukaryotes could do this, but they don't.

And using the mutation rate to determine what code is active and what is not requires comparing the genes of many individuals. The OP's story idea only has DNA from a single individual.
posted by Chocolate Pickle at 7:09 PM on June 17, 2016 [2 favorites]

And using the mutation rate to determine what code is active and what is not requires comparing the genes of many individuals. The OP's story idea only has DNA from a single individual.

If it changes your answer, there's a whole planet full of these things.

(All of these answers are REALLY helpful, by the way!)
posted by showbiz_liz at 7:14 PM on June 17, 2016

But in all the individuals carrying this extra bundle of joy it would be unused and thus would mutate at the "unused" rate. That doesn't help distinguish which parts of it were in use by the original owner and which parts were not.
posted by Chocolate Pickle at 8:22 PM on June 17, 2016

It would also be acceptable if, rather than an image, the program could give an imperfect/incomplete description of the organism - like, "we have no idea what color this thing's hair is but we know it HAS hair"

Sure, if you don't mind it being sloppy. That's just statistics. Have some absurdly huge database of all the genomes of all the known creatures of the galaxy and your machine can note that 87% of the critters with this particular sequence have hair, and that 97% of the animals with this other sequence are radially-symmetric octopodes. The computer doesn't have to figure out what the dna leads to or understand or model it in any way, just note associations between genotypes and phenotypes. People trusting it too naively will make dangerous mistakes.
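A minimal Python sketch of that kind of bookkeeping, assuming a hypothetical database of records that each hold a genome string and a set of observed traits. The records, the motif, and the function name are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def trait_rates(database: List[dict], motif: str) -> Dict[str, float]:
    """Among genomes containing the motif, return the fraction showing each trait."""
    carriers = [entry for entry in database if motif in entry["genome"]]
    if not carriers:
        return {}
    counts = defaultdict(int)
    for entry in carriers:
        for trait in entry["traits"]:
            counts[trait] += 1
    return {trait: n / len(carriers) for trait, n in counts.items()}


# Made-up catalogue of four creatures.
database = [
    {"genome": "AAGGCTTACG", "traits": {"hair"}},
    {"genome": "TTGGCTTAAA", "traits": {"hair", "eight limbs"}},
    {"genome": "CCGGCTTACC", "traits": {"eight limbs"}},
    {"genome": "ACACACACAC", "traits": {"hair"}},
]

for trait, rate in sorted(trait_rates(database, "GGCTTA").items()):
    print(f"{rate:.0%} of carriers show: {trait}")
```

Nothing in it models biology; it only counts co-occurrences, which is why the output is a percentage rather than a picture.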
posted by ROU_Xenophobe at 9:21 PM on June 17, 2016 [2 favorites]

The junk vs. coding DNA problem isn't such a big deal compared to the total intractability of actually running a detailed simulation of life. A program capable of doing that would have to simulate folding of all of the proteins in a cell (already a very difficult problem) and then simulate their interactions (most proteins don't act in isolation), plus their interactions with small molecules, etc. Basically you're talking about simulating really complicated physics faster than physics itself can do the same things. If it were a multicellular organism the program would then have to try to guess at what proteins get expressed in which combinations as the organism grows and develops, particularly tricky since often signals from the environment are also important in development. And then you don't know what the organization of the cell is (are there, e.g., weird organelles? how do you bootstrap them if you don't have an egg/progenitor cell?), what environment it needs to survive (nutrients? chemical signals? temperatures? other organisms, like symbiotic microbes?!), etc. It's a mess.

A way more plausible approach is what ROU_X and momus_window allude to: if you have a library of DNA from even somewhat-distantly related species, you can use homology (relatedness of DNA sequences) to make guesses about that species' attributes. You could then have the program draw up some options that were consistent with those predicted features.

But you might get some stuff spectacularly wrong, too, because a lot of the properties of genes are governed by when and where and how much they get expressed. That's harder to predict; you can sometimes even get a big macro-level effect just from a few nucleotides' change. Genes encoding for an important metabolic reaction might get reused because they tend to form orderly crystals that can be used as eye lenses (true example). On the other hand, you might be able to guess that male platypuses were venomous just based on the sequence because they have repurposed specific proteins in similar ways to venomous reptiles, even though those proteins aren't actually related by common descent (i.e. convergent evolution).

Anyway, there isn't any mechanism in eukaryotes to clip out and discard unneeded DNA from chromosomes.

You do get shuffling of genetic material through recombination, transposition events in meiosis, etc. It's more typical for a retrotransposon's activity to be blocked or disrupted or co-opted than neatly cleaved out, for sure, particularly because there's not much fitness cost to our having a slightly bigger genome compared to a microbe (I think?) -- but genomic rearrangements absolutely do happen within eukaryotes and are relevant on evolutionary timescales.

posted by en forme de poire at 4:03 AM on June 18, 2016 [1 favorite]

So on reflection, the idea that this tech could only give a super-vague description of the alien is actually much better for the story than my original idea, from a thematic standpoint.

The general concept is, we've been looking for life in the universe and we've found a ton of it, but it's all very basic single-celled organisms, and we've revised our ideas about life to be like "life is abundant but complex life is rare and maybe only found on Earth." And then the field researcher in the story finds this DNA evidence showing that this intelligent lifeform used to exist but has been dead for millions of years, long enough that the evidence of its whole civilization has been crushed by geological forces by this time. But (whether on purpose or through some weird fluke) they left this 'message in a bottle' embedded in the DNA of a fairly stable single-celled organism for some future race to maybe find. "Kilroy was here" or whatever. So: yay, intelligent life can exist outside Earth, but boo, it's already gone, and will we ever find more?

So, if the best that the researcher can do is just show that some kind of complex organism used to exist, and we can't say much about them but we do know x, y, and z, and we know that they were intelligent, but we'll never even know what they looked like - that's even more poignant than if she could see an image of them!

And I can even repurpose some of the comments in this thread as an explanation for why we can definitely never bring them back, despite having the DNA. Someone in my crit group had asked about that and I wasn't sure how to answer it.

I'm really glad I asked this question! Thanks, all!
posted by showbiz_liz at 2:43 PM on June 18, 2016 [1 favorite]

I agree with the above answers that reconstructing an organism from DNA alone is basically impossible, though you could probably make some inferences about at least some aspects of their biology. But, if in your story, these aliens intentionally added their DNA to these organisms as a message, they could have also left additional information in the DNA. DNA is a digital code which is suitable for encoding arbitrary information. How to decode this information would not be obvious, but the encoders could have left clues to make it easier (analogous to the plot of Contact, in which aliens use mathematical patterns which do not occur in nature to guide the human researchers to the correct decoding scheme).

Note that the idea of an intentional message could also help address Kalmya's point. An "information payload" in a genome with no biological functions would degrade badly over evolutionary time, but a sophisticated genetic engineer could in theory also add in genes to stabilize the message-bearing portion of the genome. This could also help serve as a clue to the human biologists that something strange is going on -- "Why is this bacterium spending so much effort to reduce mutations in this portion of its genome which doesn't seem to have any biological function? Maybe we should take a closer look at it..."
posted by biogeo at 4:52 PM on June 18, 2016 [2 favorites]

This is probably too elaborate for your purposes but it just occurred to me: if these are alien life forms, and not the result of some ancient panspermia, then their genetic information may not mean the same thing ours does.

First, their DNA may not be chemically identical to our own. The four bases might not be the same.

But more important is that the triples may not mean the same thing. RNA bases are read off three at a time, as codons, and most of the triples encode for a specific amino acid. You can see the translation chart here. There are 64 possible combinations and three of them are "stop", with 61 encoding amino acids.
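For a concrete picture of what reading the triples means, here is a short Python sketch that builds the standard Earth chart and reads an RNA string three bases at a time. The function name and the example sequence are made up for illustration, and the sketch ignores start-codon rules and simply reads from the first base.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per triple, in UCAG order; "*" marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(triple): aa
               for triple, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna: str) -> str:
    """Read the RNA three bases at a time until a stop triple or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


stops = sum(1 for aa in CODON_TABLE.values() if aa == "*")
print(len(CODON_TABLE), "triples,", stops, "of them stop")
print(translate("AUGGCCUAA"))
```

Swapping in a different `AMINO` string is all it takes to make the same sequence spell a different protein, which is the problem the rest of this comment is about.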

Every living thing on Earth uses the same chart, which is the most important piece of evidence for the theory that all life here began with a single organism. But that chart is arbitrary; it could have been different. There are a titanic number of potential charts and no reason to prefer any of them to any other. (It's 21!*43^21 possible charts.)
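How many charts there are depends on what counts as a chart. The Python sketch below works through two simple conventions, under the assumption that a chart assigns each of the 64 triples one of 21 meanings (20 amino acids plus "stop"). These are illustrative counts under that stated assumption and need not match the figure quoted above; the function names are hypothetical.

```python
from math import comb


def count_all_charts(codons: int = 64, meanings: int = 21) -> int:
    """Every triple may take any meaning, including charts that skip some meanings."""
    return meanings ** codons


def count_charts_using_every_meaning(codons: int = 64, meanings: int = 21) -> int:
    """Charts in which each meaning is used at least once (inclusion-exclusion)."""
    return sum((-1) ** k * comb(meanings, k) * (meanings - k) ** codons
               for k in range(meanings + 1))


print(f"{count_all_charts():.3e} charts with no restriction")
print(f"{count_charts_using_every_meaning():.3e} charts that use every meaning")
```

Either way the number is astronomically large, which is all the argument needs.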

What establishes the chart is genetic descriptions of Transfer RNA which are used during protein synthesis. There are 61 of them, and for each one there's an enzyme that loads it with a single amino acid -- always the same amino acid.

This mechanism is mutation-proof because any change to it would cause so many proteins to be built differently that the chance of the organism still being viable is negligible. It was established at the very beginning of life and it's been the same ever since.

But it could have been something else, and if life develops somewhere else from scratch, even if it uses exactly the same DNA as we do chemically, it definitely will be different, unless panspermia is true. (Or creationism.)

If your hypothetical bundle of DNA is not from panspermia, then we might be able to sequence it but we may not be able to read it simply because we can't know what coding was used for it.

Put it this way: English uses all the characters in Latin (plus a few more). I don't know Latin, but I know all the characters. If someone gives me a page of Latin text, I can copy it, but I won't have the slightest idea what it means.

This situation could be comparable.
posted by Chocolate Pickle at 5:02 PM on June 18, 2016 [3 favorites]

Some bacteria form extremely long-lasting spores that can survive over geological time periods. They might be good candidates because they're not actively dividing during that time, so their genomes wouldn't be subject to the same kind of drift and selection.
posted by en forme de poire at 7:17 PM on June 18, 2016 [2 favorites]

(Actually that claim was challenged and seems pretty dubious now. But there are other claims about reviving spores from tens of thousands of years ago that seem to hold up... and if the bacteria were deliberately engineered to be resistant to aging it seems like you could plausibly stretch that yet further.)
posted by en forme de poire at 7:22 PM on June 18, 2016 [2 favorites]

Oooookay, I want to amend my last comment. Everything I said was true, but it's possible you could derive the alien chart if it was the same for the unknown species and for the microorganism which is carrying the payload. Since the microorganism is present now and alive and, presumably, could be produced in large quantities for experimental purposes, then its chart could be determined, and we posit that it's from the same origin of life as the unknown species, so the DNA could be read.
posted by Chocolate Pickle at 8:40 PM on June 18, 2016 [1 favorite]
