I was told there would be no math.
August 21, 2016 10:22 AM

I've crossed a bunch of plants with one another, and now have a bunch of berries on those plants. Each berry can contain way more seeds than I have room to pot up and grow out individually, so I'm trying to find the point of diminishing returns, where potting up additional seedlings stops giving me new and interesting results, but my math background is inadequate.

A previous cross has given me thirteen distinguishable colorations. I imagine this isn't typical, but it's all the information I have so let's roll with that.

In the previous cross, out of 75 seedlings, result A has occurred 22 times, B 15 times, C 7 times, D 5 times, E 5 times, F 4 times, G 3 times, H 3 times, I 3 times, J 3 times, K 2 times, L 2 times, and M 1 time.

Assuming these results to be typical, how would I:
• calculate the number of unique outcomes likely to result from potting up n seedlings
• calculate the number of seedlings to pot up in order to have an x% chance of result Y?
posted by Spathe Cadet to Science & Nature (13 answers total) 4 users marked this as a favorite
I'm not a plant geneticist, but unless the trait for the characteristic you are looking at has simple Mendelian inheritance patterns, this may be impossible to answer...

I think it's going to depend on knowing the trait and the plant you are working with before anyone can start helping with this...
posted by Tandem Affinity at 10:32 AM on August 21, 2016

To clarify:

The fact that we're talking about plant breeding specifically isn't relevant to the question. I know that whatever answers are generated aren't going to be "right," from a genetic standpoint.

If it helps, treat it as a gumball machine with 13 different flavors of gumball in it, and tell me how to calculate the number of unique flavors I'd get from buying n gumballs, and the number of gumballs I'd need to buy in order to have an x% chance of getting a watermelon one, where watermelon is present in the proportion Y/75.
posted by Spathe Cadet at 10:56 AM on August 21, 2016

The second question is the easier one to answer. If the probability that you get outcome Y is p_Y, then:
  • the probability that you do not get outcome Y in one event is 1 - p_Y;
  • the probability that you never get outcome Y in any of N events is (1 - p_Y)^N; and
  • the probability that you don't never get outcome Y (i.e., that you get it at least once) in N events is 1 - (1 - p_Y)^N.
In particular, if you want there to be a probability p_fail that you don't get outcome Y, then N must satisfy p_fail = (1 - p_Y)^N, which implies that N = log(p_fail)/log(1 - p_Y).
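
For anyone who would rather not do the logarithms by hand, here is a minimal Python sketch of that last formula. The function name seedlings_needed and its arguments are made up for illustration, and it assumes each seedling is an independent draw with the same probability p_Y. It rounds N up, since you can't pot a fraction of a seedling.

```python
import math

def seedlings_needed(p_y, chance):
    """Smallest whole N such that 1 - (1 - p_y)**N >= chance.

    p_y    : assumed probability that a single seedling is type Y
    chance : desired probability of seeing type Y at least once
    """
    if not (0 < p_y < 1) or not (0 < chance < 1):
        raise ValueError("p_y and chance must be strictly between 0 and 1")
    p_fail = 1 - chance  # acceptable probability of never seeing Y
    return math.ceil(math.log(p_fail) / math.log(1 - p_y))

# Rarest type in the question (M, 1 out of 75), with a 90% chance of seeing it
print(seedlings_needed(1 / 75, 0.90))
```

With the numbers from the question, this comes out to 172 seedlings for a 90% chance of at least one type M.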

I'll have to think about how to address the first question.
posted by Johnny Assay at 11:12 AM on August 21, 2016

It's easier just to simulate the first question, although I'm sure there's a formula.

I did 10^5 trials for each sample size from 1-50 and plotted the results.

Raw data.
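
A simulation along these lines might look like the sketch below. This is not the code actually used for the plot, just one plausible way to do it in Python. It assumes the "infinite gumball machine" model (each seedling drawn independently, with the proportions from the question), and the names COUNTS and simulate_mean_distinct are invented for the example.

```python
import random

# Observed counts for types A..M from the question (they sum to 75)
COUNTS = [22, 15, 7, 5, 5, 4, 3, 3, 3, 3, 2, 2, 1]

def simulate_mean_distinct(n, counts=COUNTS, trials=10_000, seed=0):
    """Estimate the average number of distinct types among n seedlings.

    Draws are made with replacement, weighted by the observed counts.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    types = range(len(counts))
    total = 0
    for _ in range(trials):
        sample = rng.choices(types, weights=counts, k=n)
        total += len(set(sample))  # number of distinct types in this trial
    return total / trials

for n in (5, 10, 20, 50):
    print(n, simulate_mean_distinct(n))
```

Raising trials gives smoother estimates at the cost of run time.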
posted by dilaudid at 11:17 AM on August 21, 2016 [2 favorites]

'Sampling effort' is what you are thinking about. Each seed planted and checked is a sample, and you want to do that enough times to cover most of the possibilities but not so many times as to waste effort. Entire books have been written on this. The urn model is one way to go: it will help you rule out insane over-effort, but I think it will generally overpredict what you need to get the bulk of the common outcomes while underpredicting the effort you need to get those really rare combos.

Also what kind of plants are we talking here? You could also send seeds around to interested parties as a way of increasing sample effort :)
posted by SaltySalticid at 12:34 PM on August 21, 2016

I'm wondering if you're combining traits in your outcomes, like yellow flowers and a double-leaf trait. As it stands, we don't have a big enough sample size to really tell you these answers. But if you can back out to individual traits, maybe you do.
posted by Kalmya at 1:06 PM on August 21, 2016

After further thought, I think that the best way to answer the "expected number of outcomes" question is via a simulation like dilaudid did.

It also occurred to me that you can ask another question: if you kept planting & raising seedlings until you got one of each type, what is the average number of seedlings that you can expect to plant? This is basically a version of the Coupon Collector's Problem with a non-uniform probability distribution of the "coupons". At the bottom of the page, there's a formula for the expected amount of time to get all the options, in terms of an integral; for your numbers, it works out to 105.8 plants.
posted by Johnny Assay at 1:12 PM on August 21, 2016

Look up the geometric distribution, and the cumulative geometric distribution. You should get decent info from an intro stats text, or just the Internet.
posted by Valancy Rachel at 1:32 PM on August 21, 2016

I have been informed that, in the case of the gumball situation, it's important to know how many balls of each flavor are in the machine.
(asked my mathematician friends to look at this one)

Edit: Or whether there are infinite gumballs
posted by Thisandthat at 1:34 PM on August 21, 2016

The plants in question are holiday cacti (Schlumbergera).

There are photos of them all here. Seedlings 003A to 114A are the ones I divide into 13 different color categories.[1] The ones after 114 may or may not be from the same parents; not enough of them have bloomed yet to be able to guess.

The berries I'm trying to plan for are a mix of crosses between store-bought varieties, store-bought varieties with my own seedlings, and my-seedling/my-seedling crosses.

Not sure how to answer the gumballs in the machine question. Each berry contains about 70-100 seeds, on average, but the number of possible seeds from a particular cross is astronomical. So I guess either it's a gumball machine that holds 70-100 gumballs with the specified distribution, or it's an infinite, frictionless, spherical gumball machine.


[1] (The casual observer will see a bunch of interchangeable orangeness, but as they are my babies, I'm better than most people at telling them apart, and I say there are 13 categories. We hit the point of diminishing returns with those a while ago, but naming them entertains me, and there hasn't been another batch of plants mature enough to bloom until very recently so there's been no particular reason to stop them from blooming and getting named.)
posted by Spathe Cadet at 2:08 PM on August 21, 2016

I can't work out from your question(s) whether you're asking about calculating the possibility of finding new colorations beyond your existing 13 variations, or just the likelihood of each color you've obtained so far given those frequencies. The latter is doable, the former not really. But even if you're 'ignoring' genetics to get the likelihood of each coloration, I think you should still do your math on how many to pot up per cross and not just in aggregate.
posted by deludingmyself at 2:51 PM on August 21, 2016


The ultimate question, which I am not asking here, is: how many seedlings do I need to plan for from each new cross I make, if I'm trying to minimize the amount of space each batch of seedlings takes up while maximizing the number of interesting (visually distinct) results?

The immediate question, the question I am asking here, is: assuming that the results I've gotten from the one batch are typical, what is the mathematical relationship between the number of seedlings I grow out and the number of visually distinct results they produce?

(The assumption that current and future batches of seedlings will be similarly variable is, no doubt, a bad one, but the current batch's information is the only information I can extrapolate from.)
posted by Spathe Cadet at 4:24 PM on August 21, 2016

I'm going to take for granted that you know the probability of occurrence of each type of seedling, though this is a pretty tenuous assumption.

As Johnny Assay said, if each seedling is type Y with probability p_Y (independently of the other seedlings), then the probability of getting at least one type Y seedling in N tries is 1 - (1 - p_Y)^N.

The average number of distinct types you'll get in N seedlings is simply the sum of 1 - (1 - p_Y)^N over all types Y. As an example, if you had three types which occur with probability 1/2, 1/3, and 1/6, and you planted 4 seedlings, then the average number of different types among those 4 seedlings would be [1 - (1/2)^4] + [1 - (2/3)^4] + [1 - (5/6)^4], which is 2.26. This summing trick works because of linearity of expected value (I can expand on this if you want to know more).
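
The sum above is short enough to write directly in code. Here is a minimal Python sketch; the function name expected_distinct is invented for the example, and it assumes the observed proportions from the question (counts out of 75) are the true probabilities, which is the same tenuous assumption as above.

```python
def expected_distinct(probs, n):
    """Average number of distinct types among n independent seedlings.

    Sums 1 - (1 - p)**n over every type's probability p.
    """
    return sum(1 - (1 - p) ** n for p in probs)

# The three-type example from the text: probabilities 1/2, 1/3, 1/6, N = 4
print(round(expected_distinct([1 / 2, 1 / 3, 1 / 6], 4), 2))  # 2.26

# The thirteen types from the question, taken as counts out of 75
counts = [22, 15, 7, 5, 5, 4, 3, 3, 3, 3, 2, 2, 1]
probs = [c / 75 for c in counts]
for n in (10, 20, 30, 50, 75):
    print(n, round(expected_distinct(probs, n), 2))
```

Printing the value for a range of n shows where the curve flattens out, which is one way to eyeball the point of diminishing returns.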

You can also calculate the average number of seedlings you'll need to plant to obtain the full set of types. This gets pretty gnarly, though. If the probabilities of all the types are p_1, p_2, ..., p_r, then the average number of seedlings needed to obtain a full set is S_1 - S_2 + S_3 - ..., where the signs alternate between + and -, and:

S_1 = 1/p_1 + 1/p_2 + ... + 1/p_r
S_2 = 1/(p_1+p_2) + 1/(p_1+p_3) + ... + 1/(p_(r-1)+p_r), with a term for every combination of two types
S_3 = 1/(p_1+p_2+p_3) + 1/(p_1+p_2+p_4) + ... + 1/(p_(r-2)+p_(r-1)+p_r), with a term for every combination of three types

This formula is derived from the geometric distribution (mentioned by Valancy Rachel above) and the maximum-minimums identity. Continuing the illustration above with three types of seedlings that occur at rates 1/2, 1/3, and 1/6, the average number of seedlings you'd need to "catch them all" would be
1/(1/2) + 1/(1/3) + 1/(1/6) - 1/(1/2 + 1/3) - 1/(1/2 + 1/6) - 1/(1/3 + 1/6) + 1/(1/2 + 1/3 + 1/6) = 7.3.
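
Here is a rough Python sketch of that alternating sum, looping over every combination of types with itertools.combinations. The function name expected_to_collect_all is made up for the example, and as before it assumes the observed proportions are the true probabilities. For 13 types there are 8191 combinations, which a computer handles instantly.

```python
from itertools import combinations

def expected_to_collect_all(probs):
    """Average number of seedlings needed to see every type at least once.

    Evaluates S_1 - S_2 + S_3 - ..., where S_k sums 1/(sum of p) over
    every combination of k types.
    """
    total = 0.0
    for k in range(1, len(probs) + 1):
        sign = 1 if k % 2 == 1 else -1  # + for odd k, - for even k
        total += sign * sum(1 / sum(combo) for combo in combinations(probs, k))
    return total

# The three-type example from the text
print(round(expected_to_collect_all([1 / 2, 1 / 3, 1 / 6]), 2))  # 7.3

# The thirteen types from the question, taken as counts out of 75
counts = [22, 15, 7, 5, 5, 4, 3, 3, 3, 3, 2, 2, 1]
print(expected_to_collect_all([c / 75 for c in counts]))
```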

Neat as it is, this formula is insanely unwieldy for 13 types and you will need a computer to evaluate it. At that point, you might just want to run a random simulation instead. (Edited to add: I think this formula is doing the same thing as the integral Johnny Assay also mentioned above. It looks about equally taxing to calculate.)
posted by aws17576 at 6:22 PM on August 21, 2016 [1 favorite]

This thread is closed to new comments.