is this a good use case for a gamma distribution?
September 10, 2014 4:49 AM

I want to generate synthetic user-session data to predict how big a peak in application usage might be shortly before a weekly deadline (for timesheet submission - it's a time and labour tracking application). I've come up with a method that looks like it works - it involves a gamma distribution for the login time. But I don't have enough (in fact, any) statistical training to know whether I'm using that distribution meaningfully. Statisticians, please reassure me. Thanks! Excel functions inside...

So, imagine we have 10,000 people who all need to submit last week's timesheet before midday on Monday. We could expect a peak in user activity shortly before the deadline, and in order to size the application tier hardware I'd like to say what the maximum concurrency might be.

In my head I pictured a histogram with a big lump of concurrent sessions shortly before 12:00, tailing steeply down to nearly nothing at the deadline (assuming only a small number end up doing it too late). To the left, there's a long-ish tail of people who are organised enough to submit well ahead of the deadline. So I set out looking for a formula that could generate data with that kind of skew - and which I could also figure out easily how to use in Excel 2007 (my Excel skills are only slightly ahead of my stats knowledge). That's how I came up with the gamma distribution.

Let's say the mean start time for the sessions is 30 minutes before the deadline, with a standard deviation of 15 minutes (so only a small number miss the deadline). That gives me a variance of 225. So in the gamma distribution, I would have alpha = mean squared over variance = 900/225 = 4, and beta = variance over mean = 225/30 = 7.5.
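As a sanity check on that moment-matching arithmetic, here is the same calculation as a short Python sketch (the variable names are mine, not from the thread):

```python
# Moment matching for a gamma distribution, assuming the question's
# figures: mean = 30 minutes before deadline, standard deviation = 15.
mean, sd = 30.0, 15.0
variance = sd ** 2              # 225
alpha = mean ** 2 / variance    # shape: 900 / 225 = 4
beta = variance / mean          # scale: 225 / 30 = 7.5

# A gamma(alpha, beta) has mean alpha*beta and variance alpha*beta**2,
# so the round trip should recover the inputs exactly.
assert alpha * beta == mean             # 4 * 7.5 = 30
assert alpha * beta ** 2 == variance    # 4 * 56.25 = 225
print(alpha, beta)  # → 4.0 7.5
```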

I'm plugging those numbers into the GAMMAINV() function in Excel, so each session gets a start time of 12:00 minus GAMMAINV(RAND(), 4, 7.5) minutes. When I draw a graph of user sessions against time, it looks pretty much like the picture I had in my head. By tweaking the mean and SD, as well as the average length of the user sessions, I can obviously make the peak move around, and hence do some rudimentary what-if analysis that might inform how we'd educate the users.
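For readers outside Excel, a rough Python equivalent of that GAMMAINV(RAND(), 4, 7.5) recipe might look like this (a sketch under the question's assumptions; the constants are the poster's, the code is not):

```python
import random

# Each session starts some gamma-distributed number of minutes before
# the midday deadline; 10,000 users as in the question.
random.seed(1)
ALPHA, BETA = 4.0, 7.5          # shape and scale from the question
N_USERS = 10_000
DEADLINE_MIN = 12 * 60          # midday, in minutes since 00:00

# Equivalent to 12:00 - GAMMAINV(RAND(), ALPHA, BETA) per user.
starts = [DEADLINE_MIN - random.gammavariate(ALPHA, BETA)
          for _ in range(N_USERS)]

# Sanity check: the sample mean should sit close to 30 minutes
# before the deadline.
mean_before = DEADLINE_MIN - sum(starts) / N_USERS
print(round(mean_before, 1))    # close to 30
```

Binning `starts` into, say, 5-minute buckets and counting gives the skewed histogram described in the question.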

But what I don't have is any insight into whether there's a good statistical basis for doing it this way. In terms of how the gamma distribution works mathematically, is this a reasonable use case?

Please assume the knowledge of an intelligent layperson with little or no specialist training. All help very much appreciated! Thanks.
posted by rd45 to Technology (3 answers total)
 
Best answer: The gamma distribution will work reasonably well for you here, and probably makes more sense than the similarly shaped log-normal distribution.

The gamma distribution can be derived by a sum of exponential processes: i.e. each individual in your model is most likely to submit near the deadline, with the probability declining as you get further away in time. That's not a terrible model, and is what you are going for.
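That sum-of-exponentials property is easy to check empirically; here's a small Python sketch (mine, not the answerer's) comparing a sum of four exponentials against direct gamma draws:

```python
import random

# A gamma with integer shape K is the sum of K independent
# exponentials sharing the same scale.
random.seed(0)
K, SCALE, N = 4, 7.5, 50_000

# expovariate takes a rate, i.e. 1/scale.
sums = [sum(random.expovariate(1 / SCALE) for _ in range(K))
        for _ in range(N)]
gammas = [random.gammavariate(K, SCALE) for _ in range(N)]

# Both samples should agree on the mean (K * SCALE = 30).
mean_sums = sum(sums) / N
mean_gammas = sum(gammas) / N
print(round(mean_sums, 1), round(mean_gammas, 1))  # both near 30
```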

Note that a model is just that, a model, and without any data it doesn't have much real meaning. In particular, I suspect that your model would be really bad at modelling the time of registration for everyone who doesn't submit near the deadline: I'd expect your population to be split into two (or more) groups, early birds and late-deadline users.
posted by Cannon Fodder at 5:12 AM on September 10, 2014


Since you have no empirical data, I don't think the distribution you pick matters much. For sizing purposes, all that really matters is the peak transaction rate, and any distribution can be parameterized to give you any rate.

Have you modeled how long it takes to fill out the spreadsheet? Assuming there's a reasonable variance in how long that takes, the width of the peak is going to be at least the same order of magnitude as the submission time number. For capacity planning, assume it's no wider than that, and assume everybody submits their timesheets at the last minute. So if it takes 10 minutes to fill out, and there are 10k submissions, your worst-case peak rate is around 60k per hour, or roughly 17 per second. Of course, how many TPS that turns into depends on the architecture of your application.
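That back-of-envelope estimate can be written out in a few lines of Python (a sketch; the 10-minute fill-out window is the answer's assumption):

```python
# Worst case: all 10,000 users submit within the final fill-out
# window before the deadline.
N_USERS = 10_000
FILL_MINUTES = 10               # assumed time to fill out a timesheet

peak_per_minute = N_USERS / FILL_MINUTES    # 1,000 submissions/minute
peak_per_hour = peak_per_minute * 60        # 60,000/hour
peak_per_second = peak_per_minute / 60      # ~16.7/second

print(peak_per_hour, round(peak_per_second, 1))  # → 60000.0 16.7
```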
posted by mr vino at 5:40 AM on September 10, 2014


Response by poster: mr vino said: assume everybody submits their timesheets at the last minute

That's more or less where I started. I had two simplistic models (1. everyone all at once at the last minute, and 2. everyone spread out evenly over all the available time). Both of them were obviously quite unrealistic. That's why I started looking for a plausible-looking distribution with skew - to represent something in between the two obvious (and obviously wrong) extremes.

Cannon Fodder said: I suspect that your model would be really bad at modelling the time of registration for everyone who doesn't submit near the deadline

Yes, you're right, but I'm fine with that. I'm basically ignoring those users. My 10,000 users comprise that subset of the total user population who leave it until near the deadline. If I've sized for the peak, then - as long as they're spread out - anyone who submits outside the peak isn't going to threaten the overall capacity. If they're not spread out, it's just another peak whose size I can estimate in exactly the same way.

Thanks for the replies. I'm sufficiently reassured.
posted by rd45 at 12:56 AM on September 11, 2014

