How many songs do I have to listen to before I start again?
December 12, 2009 8:49 AM   Subscribe

Can you help me figure out this math question related to probability?

I feel like my college course in probability should have enabled me to figure this out for myself, but clearly, it did not.

I have an iTunes playlist with 500 songs on it. Once I've listened to a song, it disappears from the list, and is replaced by a new song from my library. Songs that have been on the list once can not reappear on the list, and for the purposes of this question, you can assume the library from which the replacement songs are drawn is infinite.

The list is selected randomly and then sorted in alphabetical order, and I start at the top and play my way through it alphabetically until I reach the 500th song. But because of the replacing of songs, which might appear in the list after the song I just listened to (and thus I will have to listen to them before I get to the end of list) or might appear in the list before the song I just listened to (and thus will not get listened to before I get to the end of the list) I actually listen to way more than 500 songs on the way through. What I'm trying to figure out is how many songs I'm likely to listen to in a single pass.

I know that when I'm listening to the first song, there's relatively high odds that it's replacement will belong later in the list. I assume this to be 499/500 -- if the replacement song gets slotted into the list pretty much at random, it could end up in any of 500 slots, and one of them is before the point I'm at in the playlist, while the other 499 after. And when I'm done listening to the 373rd song, then, the odds should be down to 227/500 that I'll have to listen to the replacement song. So, I can figure out the odds that I'll have to listen to any given replacement song.

But I can't get from there to an overall expected number, because I can't figure out how to account for new songs I've already had to listen to vs. the ones I haven't. Say I've listened to 100 songs, and 90 of their replacements ended up later in the list than where I was at the time. That means that while I've listened to 100 songs, I'm only at position 10 on my list -- a long way from the 100 I'd be at if I wasn't replacing the songs.

This is obviously not earth shattering, and I don't need to know this. But it's been nagging at me because I can't figure it out, and I feel like I should be able to.
posted by jacquilynne to Science & Nature (41 answers total) 5 users marked this as a favorite
 
Your problem sounds pretty hairy, I don't think you will be able to get an answer without some programming. Other people more knowledgeable than me will surely come in, but in the meantime I think you need to look up Markov chains.
posted by Dr Dracator at 9:18 AM on December 12, 2009


This is a pretty nasty question, and I'm guessing the easiest solution would be found numerically, not analytically.

Lets simplify the number of songs to 10, and make one very important assumption, a uniform distribution of song placement. This is a wrong assumption, but we would need to know more about your song library to get the true distribution (though for the purposes of an infinite replacement library, its not bad). The uniform distribution implies this, if I have played X songs, and there are Y songs remaining, then the odds of hearing a 'bonus song' are Y/(X+Y), and the odds of missing it are X/(X+Y).

Lets carry forward now into our analysis.

We have played song 1, and now it is replaced.

The odds are 9/10 that we will hear a 'bonus', and 1/10 that we won't. We take these two probabilities to get our expectation, and tag them with the song and bonus count so far.

1,1: 9/10
1,0: 1/10

Now we explore the possibilities for song 2. If we got a bonus track after song 1, then there is a 9/11 chance we will hear another bonus song, and a 2/11 chance we won't. If we didn't get a bonus track after song 1, there's an 8/10 chance we will get one this time, and a 2/10 chance we won't.

2,2: 9/10*9/11
2,1: 9/10*2/11
2,1: 9/10*8/10
2,0: 9/10*2/10

Whew, lets take a breath here. We've made it past two songs, and we have a table of four possibilities (really three), that show our likelihood distribution of hearing 0 bonus songs, 1 bonus song, or 2 bonus songs.

We can fold the 1 bonus song in like this.

2,2: 9/10*9/11
2,1: 9/10*(2/11 + *8/10)
2,0: 9/10*2/10

If we keep going, I expect that we'll see a predictable pattern increasing with a very long tail off to the right of bonus songs. In fact, you could conceivably have any number of bonus songs in a listen, just with astronomically small probability, the same way you could flip a coin and get heads a hundred times in a row.

I would solve this numerically using a program like Python or MATLAB in something akin to a breadth-first evaluation of the tree with a cutoff after the probabilities became sufficiently small. At each 'song depth' I would consider the different probability potentials and split each potential into two new potentials at depth+1. This would repeat until I'd reached a sufficient depth for plotting the results :)
posted by onalark at 9:34 AM on December 12, 2009


Just for clarification, are you asking how many songs you will have to play before no song on the original list of 500 remains? The possibilities vary from 500 to infinity. This like asking how many times you must flip a coin to get 500 heads. I think you will have to pick a number and rephrase the question: "What are the odds that no song from the original list will remain after 500 plays? 1000 plays?" Etc.
posted by weapons-grade pandemonium at 9:35 AM on December 12, 2009


Oooh, bad math there. Here are the corrected lines:

2,2: 9/10*9/11
2,1: 9/10*2/11
2,1: 1/10*8/10
2,0: 1/10*2/10

2,2: 9/10*9/11
2,1: 9/10*2/11 + 1/10*8/10
2,0: 1/10*2/10
posted by onalark at 9:38 AM on December 12, 2009


I am a mathematician, but haven't done probability in quite awhile, so I could be way off here.

You have 500 songs. Let's turn this into a different problem with multiple song lists, where each time you listen to one of the original 500 songs you either add a new song to the next song list or you don't. After playing song x in the first list, there is a (500-x)/500 probability that a new song is added to the second list. So, after the completion of the first list the second list has expected length \sum_{i=1}^{500} (500-x)/500 = \sum{i=1}^{500} (1-x/500) = 500 - (1/500)\sum_{i=1}^{500} (x) = 500 - (1/500) (501*250) = 249.5.

Basically what's happening is that about half of the songs you listen to on the original list will be replaced after, and half before, which means we'll get about 250 songs on our second list. Now we listen to the second list, and get a third list of 125 songs. Continue and you've listened to 500*\sum_[i=0}^{infinity} (.5)^i = 1000 songs.
posted by monkeymadness at 9:51 AM on December 12, 2009


Oh no, my approach is wrong. I was assuming we would ignore the song if it was placed behind (we don't). The total number of songs remaining should never grow.

I think the new numbers are:

1,1: 9/10
1,0: 1/10

----------------------------------------

2,2: 9/10*9/10
2,1: 9/10*1/10
2,1: 1/10*8/10
2,0: 1/10*2/10

=

2,2: 9/10*9/10
2,1: 1/10*(9/10+8/10)
2,0: 1/10*2/10

----------------------------------------

3,3: 9/10*9/10*9/10
3,2: 9/10*9/10*1/10
3,2: 1/10*(9/10+8/10)*8/10
3,1: 1/10*(9/10+8/10)*2/10
3,1: 1/10*2/10*7/10
3,0: 1/10*2/10*3/10

=

3,3: 9/10*9/10*9/10
3,2: 1/10*(9/10*9/10 + (9/10+8/10)*8/10)
3,1: 1/10*2/10*(9/10+8/10+7/10)
3,0: 1/10*2/10*3/10
posted by onalark at 9:55 AM on December 12, 2009


Well, I'm at least relieved to know that this isn't some triviality that I just can't figure out because I forgot everything I learned in probability the day after I took the exam.

FWIW, the practical answer, at least for this passthrough is looking to be about 1500. That's out of a total database of somewhere in the neighbourhood of 12000 songs, and it's taken me several months of iTunes listening to get this far. I've gotten as far as Y, and I'm at 463 on my list. I expect most of the rest of the list will just run straight through since there are so few songs in the larger collection that could end up later than Y on my list.

There is definitely a strong pattern. It took forever to get as far as 10 on my list, and as I get nearer the end, it gets much faster. It's a little bit like interest on a mortgage that way -- the first few payments are almost all interest, and you make only a tiny bite out of the principal, but as you get towards the end, the same payment is almost all principal.

I tried to figure out if there was some way to reduce the whole problem by realizing that while you're at song one, your probabilities are 499/500 of getting a repeater, while at song 499, they're 1/500, and so on for 498/500 and 2/500, but I couldn't quite make it happen.
posted by jacquilynne at 9:58 AM on December 12, 2009


Consider each initial song to be the i/500th element of the ordered sequence of songs. The probability that the song adds a new member to the queue that you listen to is (1-rank_i) (this is the probability than the new draw is higher than it) with the new song having a uniform position between (1-rank_i) and 1, so expected value (1-rank_i)/2+rank_i. The probability that the new song then adds a second new song is 1- rank_i -(1-rank_i)/2 = (1-rank_i)/2. This chain of daughters each with 1/2 the parent's probability of adding a new daughter is a geometric series (1/2) ; the expected number of extensions is 2(1-rank_i).

So, add that up over all the initial i and you get an expected 501 extensions and 500 original songs, for 1001 total songs heard.

To make this more valid, you could consider the ranks as draws from the uniform order statistics, but that makes it too complicated for askme.
posted by a robot made out of meat at 10:11 AM on December 12, 2009


To complicate matters even more, you're drawing from a distribution that's not normal. Songs are more likely to start with T or S than they are with Q or X. You could make some assumptions, as the above posters have done, and get a ballpark answer, but it stands a good chance of being pretty far off.

If you really want to know, the way to go would be to do some Monte Carlo simulations. That is, you gather all the song titles into a big list, then read them into a program that simulates the whole process, say, 10000 times and gives you the average number of songs you listen to. I'm actually kind of tempted to do this with my song library right now.
posted by chrisamiller at 10:21 AM on December 12, 2009


I know you're all dying to see what the distribution looks like using the uniform distribution approach, so I ran it through a 25,000 iteration expansion to see what it would do: Pretty boring, huh?

The blue line is probability of an expected song count, the green line is the density function (i.e. change that X number of songs or less will be played).

I'm pretty sure the reason we're seeing a higher average number of expected songs (~4,000) is the uniform distribution assumption, but it could be a problem with code as well :/
posted by onalark at 10:38 AM on December 12, 2009


This is a pretty fun problem. Hopefully someone who knows more statistics than me can do better, but here's a possible starting point:

Let each song have an associated value x representing its position in the library. Then we have a cumulative distribution function CDF(x) which gives the probability that a randomly-chosen song from the library has value <= x. Since the library is infinite, we might as well make this a continuous distribution, and have x range from 0 to 1.

Now let Emin(n, x) be the expected minimum of n values uniformly sampled from [x,1]. (Surely there is a nice simple formula for this, which I will try to look up.) This gives you the expected value of the next song, given that you have just played a song with value x, and that there are n songs remaining.

Now we can recursively define Esongs(n, x), the expected number of songs you will hear if you start playing a list that currently has n songs, and whose first song has value x:

Esongs(0, x) = 0
Esongs(n, x) = CDF(x) * Esongs(n-1, Emin(n-1, x)) + (1-CDF(x)) * Esongs(n, Emin(n,x))

And so your answer is just Esongs(500, Emin(500, 0)).

Obviously it just depends on the CDF; I have this nagging feeling that the whole thing should collapse into some simple formula involving the integral of the CDF or some such crap, but maybe not. Should be able to do it numerically, though, as onalark suggests.
posted by equalpants at 10:45 AM on December 12, 2009


the replacement song gets slotted into the list pretty much at random

That is, they aren't added alphabetically. Do I have that right?
posted by Obscure Reference at 10:45 AM on December 12, 2009


a robot, do you see anything wrong my expansion? I'm a little puzzled as to the difference in our answers, since your math makes pretty good sense to me.
posted by onalark at 10:46 AM on December 12, 2009


I also assumed that we were looking at "songs listened to" not "total list length", in which case the expected value is 3N for N starting songs. In fact, if you want the whole distribution it's 2N + N iid copies of a geometric RV whose parameter is drawn from the uniform 0,1. You don't need to use the add-them up version that I did, since for the expected value you can integrate out r_i.

Just to clarify: do songs actually get taken off the list as you hear them? That is, at the end the list consists entirely of songs you did not hear? If so what I did isn't right, but a very similar argument will get you the right answer.

onalark, no I didn't actually read yours and it's not super easy to follow.
posted by a robot made out of meat at 11:06 AM on December 12, 2009


Okay, I whipped up about 50 lines of ruby to do this from my music library, which is about 8500 songs. I assumed an initial playlist of 500 songs, sorted alphabetically, and a new song is played only if it comes after the current song alphabetically. After 1000 runs, which only took about 10 minutes on a fairly old computer:

Mean number of songs played: 857.56
Median number of songs played: 858
Standard Deviation: 19.17775
See the Histogram here. (it's more or less normally distributed, with just a few outliers)
posted by chrisamiller at 11:07 AM on December 12, 2009


Just one fast comment:

The answer cannot depend on any details of the music library, except for the number of songs it contains (and we can even get rid of that dependence by taking the limit where the number of songs goes to infinity). The actual names of the songs make no difference.
posted by em at 11:30 AM on December 12, 2009


The answer cannot depend on any details of the music library, except for the number of songs it contains

Yeah, now that I think about it that's correct - regardless of the names, they'll still have a unique order, and all that matters is where in that order the songs are drawn from.
posted by chrisamiller at 11:47 AM on December 12, 2009


That is, they aren't added alphabetically. Do I have that right?

Right. The new songs are selected based on the date I added them to my iTunes library, which should have no real correlation with the alphabet.

Songs are more likely to start with T or S than they are with Q or X.

Does this end up mattering? I went 10 rounds with myself on it and then decided that only the absolute position within the ordering matters, not what letter things start with.
posted by jacquilynne at 11:48 AM on December 12, 2009


Or, you know, what you and em concluded while I was baking cookies.
posted by jacquilynne at 11:48 AM on December 12, 2009


I call firsties on cookies for giving the first numerical answer.
posted by monkeymadness at 12:01 PM on December 12, 2009


monkeymadness: Except that it doesn't seem overly correct...

For what it's worth, my simulation replicates chrisamiller's result, but the observed 1500 result is so unlikely in this scenario that I think we're not simulating iTunes here.
posted by themel at 12:08 PM on December 12, 2009


robot, I see which of my assumptions was wrong now, I didn't assume any initial distribution of the rankings for the songs. I like your analysis, and I'm pretty sure it's the most right of the ones we've seen so far (chrisamiller's seems to be the better simulation :)
posted by onalark at 12:27 PM on December 12, 2009


It may also be worth noting that iTunes shuffle mode is NOT truly random. After their first generation of iPods, they got many complaints about the same artist being played twice in a row. Now, obviously, these people don't really understand randomness, but it was pervasive enough that Apple made some tweaks to the algorithm to make it seem more random.

I have no idea if these modifications affect the way that iTunes fills playlists randomly, and there's no good way to simulate it, since it's not open source and the algorithm is unpublished.
posted by chrisamiller at 12:40 PM on December 12, 2009


Well, I see everyone beat me to it, but my simulation also agrees with chrisamiller's. (Almost: from 50000 trials, I get a mean of 841.8 and a standard deviation of 19.3. Not sure where the difference is coming from.) Tried several probability distributions, too, and the result is the same, just like em said.

That 1500 is really really strange. What on earth could iTunes be doing?
posted by equalpants at 12:57 PM on December 12, 2009


Really interesting problem. Running a Monte Carlo simulation in Mathematica, I also get figures very similar to chrisamiller. In trying to think about a closed form solution, I reasoned that, averaged out across the whole data set, there would be a 50/50 chance that a song newly introduced to the list would have to be listened to, thus adding 250 new songs to the "must-hear" list. But those new songs are going to be skewed toward the back end of the alphabet, so the odds of being a "second-order" repeater have to be smaller than 50/50. I have no idea how much smaller, but I posited 1/3, thus bring a further 83 or so songs to the list. Carrying doggedly on, I guessed that the odds of being a third-order repeater would be 1/4 (so 83/4 = about 21). And in the same fashion, we get 4 fourth-order repeaters, and 1 repeater from all orders higher than 5.

500 + 250 + 83 + 21 + 4 + 1 = 859, which is tantalisingly close to the figures we're getting from the simulations, but I lack the mathematical chops to tell whether this is coincidence or not.
posted by muhonnin at 1:15 PM on December 12, 2009


I wouldn't put too much faith in the 1500 if it's not jibing with the theoretical numbers -- there are other things in my recently played file that didn't come out of this playlist, and I tried to guestimate them out, but I could be that I've listened to things other than this playlist more than I think. Plus I've undoubtedly added some new songs to the library since I started, which would have effed up the playlist away from the theoretical perfect version in this question. I'm really more interested in the figuring it out part than the final number.
posted by jacquilynne at 1:29 PM on December 12, 2009


Dangit, I worked the calculation out long and it's not a sum of geometrics. It's something that I don't know that there's a name for. I also wrote 15 lines to just simulate the problem assuming an infinite population and got similar answers to the above. The total list length was coming out around 1355, with the biggest out of a hundred 1400.
posted by a robot made out of meat at 2:12 PM on December 12, 2009


It seems to me that the average number of songs you end up listening to would be the number of songs in the list (the "500" here, let's call it n in general) times some constant C.

muhonnin's approach suggests the constant is 1 + 1/2 + 1/(2*3) + 1/(2*3*4) + ... = e - 1, or about 1.718, giving that you listen to 500(e-1) songs on average, or about 859.

As someone who actually thinks about this sort of thing and makes a (meager) living doing it (i. e. a grad student in math), I have to say that that seems like the right sort of constant -- it can be written down simply, and there's secretly a differential equation governing this which makes the appearance of e not all that surprising.
posted by madcaptenor at 2:33 PM on December 12, 2009


I knew madcaptenor would beat me to this! I didn't work this out carefully, but it seems to me that the difference equation (or differential equation, if madcap prefers) can be set up something like this.

At each moment let x in [0,1] represent how far along you are alphabetically in your iTunes library; so at the outset x = 0, when you're halfway through it's 1/2, etc.

And let n be the number of songs still ahead of you in the playlist. It seems to me these should be uniformly distributed among the (1-x) proportion of the library ahead of your current position.

Then with each play, n will either decrease by 1 (if the new song is before your current position) or stay the same (if it's after.) So it decreases by 1 with probability x and stays the same with probability 1-x. On average, you can loosely model this by saying that n gets replaced by n-x.

On the other hand, on average you are going to use up 1/n of the remaining (1-x) of the "alphabet line" by playing the current song: this replaces (1-x) with (1-x)(1-1/n), which is to say it replaces x by x + 1/n - x/n.

So you have a difference equation in two variables where (x,n) gets sent to (x+1/n-x/n, n-x). Starting with initial condition (0,500), I hit n=0 in... wait for it... 859 steps.

I'll bet madcaptenor or somebody else can solve this (or the affiliated differential equation) exactly!
posted by escabeche at 3:09 PM on December 12, 2009


It seems to me that the average number of songs you end up listening to would be the number of songs in the list (the "500" here, let's call it n in general) times some constant C.

Tried a bunch of different list sizes, and I believe this is right.

Also, I had a bug in my program, which is what screwed up my earlier numbers. I also get ~859.1 = 500 * (e - 1) now.
posted by equalpants at 3:10 PM on December 12, 2009


I did a different analysis, along the lines of equalpants', except discrete and with no CDFs. We're stepping alphabetically through the whole catalogue of, say, M songs. We keep track of the expected number of songs remaining on the playlist from the current point forward. The chance that we'll play the current song on the playlist is that value divided by the total number of songs in the catalogue from the current song forward.

Initially, the expected number of songs remaining is N, so E[1] = N. This value will decrease by 1 only when we play a song and the replacement comes earlier in the alphabet.

E[i+1] = E[i] - (chance song will be played)*(chance replacement is before current point) = E[i] - (E[i] / (M - i + 1)) * (i / M).

The number of songs played is the sum of E[i] / (M - i + 1) for i = 1 to M. I ran this in Python for M = 10000 and N = 500, and obtained 859.023. As a proportion of 500, that just happens to be only 0.0002 away from e - 1....
posted by parudox at 3:14 PM on December 12, 2009


Since others are looking at it this way, for a list of size one, with initial point r set up a series of iterated integrals to get the probability of having N songs:
Pr{N=1} =r = fail first time
Pr{N=2} =(1-r) Int(r_1 dr_1,from=r, to=1) = (1-r)^2 (1+r)/2 = go first time, fail second time, integrate out the second song's rank from the first song (which it must be greater than) to 1.
Pr(N=3} = go go fail =(1-r) Int( Int( r2 dr2, from=r1 to=1) dr_1 ,from=r, to=1 )
etc. And then you tack on a final integral over the uniform to get rid of r. If you look on the wiki for e, I bet this is one of its series expansions.
posted by a robot made out of meat at 3:54 PM on December 12, 2009


This is interesting, thanks. I could have sworn it was just the series robot and I came up with, but simulations surprise you sometimes.
posted by monkeymadness at 4:29 PM on December 12, 2009


It's differential equation time!

If you hate differential equations, ignore all the computations below except at the very end, when I'm going to make a somewhat underargued prediction which I hope one of you awesome simulators upthread will test.

Following on my previous post: let t be the number of songs played so far, x be the position in the iTunes library (between 0 and 1) and y the number of songs remaining on the playlist. (The variable here called y was called n in my previous answer.)

Then we can model the difference equation above by a system of differential equations

dx/dt = (1-x)/y
dy/dt = -x

From this it follows that

dy/dx = -(x/(1-x)) y

which is easy to solve: we find y = C (1-x)e^x for some constant C.

We know that y is 500 when x = 0, so C = 500.

Now plug this back in, except let's compute dt/dx instead of dx/dt:

dt/dx = y/(1-x) = 500e^x.

So t = 500 e^x + C for some constant C. When x = 0, t = 0, so this time C = -500 and we end up with

t = 500 e^x - 500.

In particular, when you hit the end of the iTunes library, at x = 1, you have t = 500(e-1), just as the simulations predict.

What's more, this gives you a complete prediction for how fast you go through the list: it suggests that after you've played t songs, your position in the iTunes library is

log(t/500 + 1)

and the number of songs remaining on your list is

(t+500)*(1 - log(t/500 + 1)).

So for instance after 100 plays, you should still have 491 songs left to play, and after 500 plays, you should still have about 307 songs left to play.

Does this agree with the data you guys gathered?
posted by escabeche at 5:58 PM on December 12, 2009 [1 favorite]


Yep.
posted by parudox at 6:25 PM on December 12, 2009


See, now that's why I couldn't solve this. I only passed calculus on the basis of a wicked curve.
posted by jacquilynne at 6:33 PM on December 12, 2009


So for instance after 100 plays, you should still have 491 songs left to play, and after 500 plays, you should still have about 307 songs left to play.

Yeah, that matches up with my data. Nice.
posted by chrisamiller at 8:03 PM on December 12, 2009


escabeche for the math and chrisamiller for the model!

If somebody wants to pick through why my approach doesn't work, it would be greatly appreciated.
posted by onalark at 9:08 PM on December 12, 2009


Looks like I'm late to the party here, but if you're looking for more information on problems like this, you might check out the (discrete-time) Ehrenfest chain, which is a simplified model for gas diffusion. Here's a page with some explanation--notice the similarity with your transition probabilities.

It's basically the way I'd approach this, by modeling your position in the queue as a discrete-time Markov chain. The question you're asking is how long on average should it take to transition from one state to another, which you can put in terms of the stationary distribution of the chain. Since the process you're dealing with is a birth-and-death process (meaning the value can only go up or down by at most 1 at each step), there are some nice tools to help figure out what the stationary distribution is, mostly using the detailed balance condition. It should reduce to a series of difference equations, which I believe others already pointed out.

Anyway, cool problem!
posted by albrecht at 9:36 PM on December 12, 2009


onalark, I think the problem with your approach is that the songs played and the songs remaining are actually distributed differently. E.g. when you've gotten to the point that you have the same number of songs played as you have "remaining", you're less than halfway through the alphabet and the chance of getting an extra song is much higher than not.
posted by parudox at 11:36 PM on December 12, 2009


Thanks parudox. Cool problem jacquilynne, thanks for posing it!
posted by onalark at 1:32 AM on December 13, 2009


« Older Need recommendation on FLASH program   |   The Best Way to Get Work Done Newer »
This thread is closed to new comments.