# Is there actually a trend here, or just movement around the mean?

July 25, 2017 5:57 PM

'The sky is falling!' Or is it? I have a number for each of the last six years after the fold. Is there really a downward trend, or just entirely normal, expected variation, or are there not enough data to tell?

Year 4 137,000

Year 5 106,000

Year 6 91,000

That's a scary drop, right? But looking back to year one, we get:

Year 1 121,000

Year 2 121,000

Year 3 132,000

Year 4 137,000

Year 5 106,000

Year 6 91,000

Yeah, there's a three-year drop, but that's off a three-year rise. Ask people back in Year 4 and they'd be worried about the ever-upward trend in demand. Here we are heading into Year 7 and it's the opposite.

So, looming crisis or not enough information?

Um, what specifically are you referring to here, what is this data associated with, and where did you get this data from? It's not evident in your OP...

posted by Hermione Granger at 6:03 PM on July 25

It's demand for a particular kind of information. I can't say what the information is, or who is asking for it, or why.

There are no competitors for providing the information, and no alternative sources for the information.

The total number of people who might want the information is a couple of orders of magnitude higher than annual demand so I'm pretty sure it's not a case of 'everybody who might want the information already has the information'. New people will want this information all the time, and people who already have it will want it updated, because it changes regularly.

posted by obiwanwasabi at 6:35 PM on July 25

I made a chart. Whatever the product is, if it were my business, I'd be concerned.

posted by theora55 at 6:42 PM on July 25 [2 favorites]

obiwanwasabi: "*So, looming crisis or not enough information?*"

It could be both, i.e.: a looming crisis but also not enough information to prove it to any degree of certainty. With just six data points like this, it's going to be hard to say anything definitively.

The next steps will most likely involve gathering more information. You can go deeper into this dataset by looking into different subsets (e.g.: you mention new and returning customers, I'd want to know if the decline is across both sets or just in one). And you can go wider into other datasets (e.g.: do you have any measure of people who might have almost demanded this data but didn't quite pull the trigger a.k.a funnel analysis?)

posted by mhum at 7:07 PM on July 25

Other info is more important than analysis of 6 points. For example, who the buyers were and what else is going on in the industry.

It's a low probability pattern for variation around a mean, but there are lots of other possibilities.

posted by SemiSalt at 7:09 PM on July 25

'Customers' probably isn't the right word. Treat this as government, regulatory, scientific or medical information. People don't want it - they need it, or are required to have it. It's free, so there are no pricing pressures.

There isn't an industry. We're the only source of this information.

*It's a low probability pattern for variation around a mean*

Could I ask how you worked that out?

posted by obiwanwasabi at 7:22 PM on July 25

Whenever I see a negative trend in my business metrics*, the first thing I want to do is figure out **why** it is happening. It's literally impossible to determine whether or not to be concerned without knowing the why.

Let's say my sales are down 10% month over month. Why? Maybe it's because one of my competitors launched a much better product and is starting to steal my sales. Unless I can respond quickly, I'd consider that a concerning reason. On the other hand, perhaps it's because I'm coming off a seasonal peak that's predictable every year. Not so worrisome. Or maybe a single customer made some huge purchases for a big project, and we knew they'd be short-lived, and we're returning to the natural demand. Also not so bad. Where I start to get **really** worried is when I have no idea why it happened, and I can't figure it out. But honestly, I consider it a pretty bad sign for the business in general if we can't figure out why changes are happening via some analysis. If there's a change, it's because the customers have changed, the market has changed, or a combination of both. The absolute most important thing in business is knowing my customers and what they care about. Knowing the market (including competitors and outside factors driving demand) is also critically important. If I know those things well, I can usually dig in enough to find the root cause of a trend.

* I realize you said it's not an industry, and you don't have customers, but that's really semantics IMHO. You are talking about demand for a good or service, and you have people or entities who consume the good or service. The terminology isn't that important.

posted by primethyme at 7:32 PM on July 25 [4 favorites]

Would there be a reason for artificially-inflated numbers early in the program? People who couldn't access this info glad to finally have it but who only need it every 5 years, or a particular promotional program? It may also help you to get month-to-month numbers to compare across years and see if there is anything that strikes you there. In my very different area we saw annual losses, and when we split it by month we saw that a student program we partially funded resulted in a big spike in the summer months. Our usage aside from that short term program was constant.

If I needed to make hiring decisions based on the numbers, and without any other information, I'd assume continued losses. The drop in either of the last two years was larger than the gains in any other year.

posted by tchemgrrl at 8:01 PM on July 25
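
To make the year-over-year comparison above concrete, here is a minimal Python sketch that works out each annual change from the six figures in the question. The variable names are illustrative only and nothing here comes from the poster's own workings.

```python
# Annual figures from the question, Year 1 through Year 6.
demand = [121_000, 121_000, 132_000, 137_000, 106_000, 91_000]

# Change from each year to the next (positive = rise, negative = drop).
changes = [later - earlier for earlier, later in zip(demand, demand[1:])]

largest_gain = max(changes)
last_two_drops = changes[-2:]

for year, change in enumerate(changes, start=2):
    print(f"Year {year - 1} -> Year {year}: {change:+,}")

# True when each of the last two drops is bigger than the largest gain.
print(all(abs(drop) > largest_gain for drop in last_two_drops))
```

The two most recent changes are -31,000 and -15,000, against a largest single-year gain of +11,000, which is the comparison the comment relies on.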

Without more information I don't think anyone can tell you anything about these numbers. They could be a count of the number of rocks found in your left boot. A downward trend could be good -- fewer rocks -- but it might also mean you have a hole in your boot.

posted by runcibleshaw at 8:40 PM on July 25 [10 favorites]

*A downward trend could be good*

I'm not asking for analysis about whether the situation is good or bad. For example, I might think it's great if it's going down because it's not my core business and we provide the information for purely historical reasons, but the people who process the information might think it's terrible because they're out of a job.

I'm asking whether there is sufficient information here to indicate a trend, or whether it's likely this is just reversion to / entirely normal dancing about the mean. ('Nobody can tell you that; you're trying to make statistics do something they can't do; it's math, not magic' is a perfectly valid answer.) All I have to go on at the moment are unconvincing trend lines with low R^2 values. That might be the answer for all I know - 'you can't tell' - and that's fine.

*Without more information I don't think anyone can tell you anything about these numbers.*

I could tell you exactly what the numbers are, and it wouldn't help you. For example, I could say 'It's the number of people who ask for classified travel statistics from a state-funded security agency' or 'It's the number of people from an international development community who seek particular epidemiological data from a specialist NGO'. (It isn't either of those things.) That information gives you no additional information about whether there's a trend in those numbers or not. You can't say 'Well, if it's the security thing, there's a trend, but if it's health data, there isn't.'

I'm looking for a purely mathematical / statistical perspective, not a root cause analysis. There is a trend, statistically speaking; there isn't a trend, statistically speaking; or there isn't enough information to say if there's a trend, hopefully with an explanation of how you - a person who knows about statistics - knows or strongly suspects this is the case.

posted by obiwanwasabi at 11:47 PM on July 25

Okay, here's one statistical answer: do a linear regression in which your data is the outcome variable and year is the predictor. This will give you a line of best fit (i.e., a "trend line" with intercept and slope) and will also tell you how "significant" that line is. If it's not significant, this means that you can't reject the null hypothesis -- what that means is that if there is a real trend, you don't have enough power to detect it.

So, when I do this in R, I find that the line of best fit has an intercept of 137000 and a slope of -5429. The negative slope suggests a downward trend (which you can interpret as going down by 5429 per year on average). However, the p-value is 0.2111, which is not significant. So what that is saying is that it is going down but you don't have enough statistical power (i.e., you don't have enough data) to be able to conclude that this isn't just normal variation.

In case you're curious, here's the R code I used:

year <- c(1,2,3,4,5,6)

number <- c(121000,121000,132000,137000,106000,91000)

m <- lm(number ~ year)

summary(m)

posted by forza at 3:52 AM on July 26 [4 favorites]
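
For anyone without R to hand, the same fit can be sketched in plain Python. This is a rough equivalent under stated assumptions, not the poster's method: it computes ordinary least squares by hand, and the helper `t_cdf_4df` is a hypothetical name for the closed-form Student's t CDF that only holds for exactly 4 degrees of freedom (six points minus two fitted parameters).

```python
import math

years = [1, 2, 3, 4, 5, 6]
demand = [121_000, 121_000, 132_000, 137_000, 106_000, 91_000]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(demand) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, demand))
syy = sum((y - mean_y) ** 2 for y in demand)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r_squared = sxy ** 2 / (sxx * syy)

# Standard error of the slope and its t statistic (n - 2 = 4 degrees of freedom).
residual_ss = syy - slope * sxy
std_err = math.sqrt(residual_ss / (n - 2) / sxx)
t_stat = slope / std_err


def t_cdf_4df(t):
    """CDF of Student's t for exactly 4 degrees of freedom (closed form)."""
    u = t * t / (1 + t * t / 4)
    return 0.5 + 0.375 * (t / math.sqrt(1 + t * t / 4)) * (1 - u / 12)


# Two-sided p-value for the null hypothesis that the slope is zero.
p_value = 2 * (1 - t_cdf_4df(abs(t_stat)))

print(f"slope={slope:.0f} intercept={intercept:.0f} "
      f"R^2={r_squared:.3f} t={t_stat:.3f} p={p_value:.4f}")
```

Run on the six figures, this should agree with the numbers quoted above: a slope of about -5429, an intercept of 137000 and a p-value of about 0.211. The R^2 comes out at roughly 0.36, which matches the "low R^2" the asker mentioned.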

In general trying to do time series analysis -- what you're trying to do -- with six data points is not going to tell you very much. Note that you could have many more data points, if the data are available, simply by using monthly or weekly aggregations instead of annual, but those introduce their own problems.

*That information gives you no additional information about whether there's a trend in those numbers or not. You can't say 'Well, if it's the security thing, there's a trend, but if it's health data, there isn't.'*

Yeah, you can, maybe. Certainly there are particular kinds of time-serial data that are well-understood enough to make much better sense of variation and trends. Employment seasonality, for example. And a better-specified data generating process will usually lead to better null hypotheses.

posted by ROU_Xenophobe at 4:50 AM on July 26 [1 favorite]

The principled Bayesian way to do this is that you start with guesses of how likely each explanation of the data is before you look at any data at all (these are called your priors), then update your probabilities based on how likely the data that you've seen would be under each explanation.

If I knew that these numbers were generated by something very likely to be random, then sure, these numbers look random; they didn't make that hypothesis much less likely. (For example, if I generated the numbers by rolling 3 dice, adding them, and multiplying by 10000, this data certainly wouldn't make me think that the dice were becoming more biased over time.)

But if, given the nature of the context, it seemed a priori quite possible that there would be trends in the data (e.g., it's monthly sales or something), then this data supports that pretty strongly, and I'd continue to think that it was probable.

So there's really no answer based solely on these six numbers and no other context.

posted by dfan at 5:38 AM on July 26 [1 favorite]
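
The point about priors can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function name `posterior_trend` is made up, and the Bayes factor of 3 is a hypothetical number chosen for the example, not something estimated from the six data points.

```python
def posterior_trend(prior_trend, bayes_factor):
    """Posterior probability of 'real trend' versus 'no trend' via Bayes' rule.

    bayes_factor = P(data | trend) / P(data | no trend).
    """
    prior_odds = prior_trend / (1 - prior_trend)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# HYPOTHETICAL: suppose the six numbers are 3 times as likely under a
# trend as under pure noise. This value is invented for illustration.
ASSUMED_BAYES_FACTOR = 3.0

for prior in (0.01, 0.5, 0.9):
    post = posterior_trend(prior, ASSUMED_BAYES_FACTOR)
    print(f"prior={prior:.2f} -> posterior={post:.3f}")
```

With the same assumed evidence, a prior of 0.01 (the dice case) still leaves a trend very unlikely, while a prior of 0.5 moves to 0.75. That is why the six numbers alone don't settle the question.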

Not a formal analysis, just a quickie, but the mean for those six years is 118,000 with a standard deviation of 17,000 (ish - all numbers rounded).

1 standard deviation below the mean is 101,000, two standard deviations below the mean is 84,000.

So 91,000 (year 6) is more than a standard deviation from mean (but less than 2 standard devs). Depending on what you're measuring, the quick and dirty analysis is that the deviation is "interesting."

posted by porpoise at 2:19 PM on July 26
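
The quick calculation above can be reproduced with Python's standard `statistics` module. A small sketch, assuming the sample standard deviation (dividing by n - 1) is the one intended:

```python
import statistics

demand = [121_000, 121_000, 132_000, 137_000, 106_000, 91_000]

mean = statistics.mean(demand)
# Sample standard deviation (divides by n - 1).
sd = statistics.stdev(demand)

# How many standard deviations Year 6 sits from the six-year mean.
z_year6 = (demand[-1] - mean) / sd

print(f"mean={mean:,.0f} sd={sd:,.0f} z(Year 6)={z_year6:.2f}")
```

Year 6 is itself one of the six values used to compute the mean and standard deviation, so the score is only a rough guide.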

*It's a low probability pattern for variation around a mean*

Could I ask how you worked that out?

I wasn't thinking too clearly, I guess. Certainly not too precisely. And I didn't realize the first two observations were the same. But I was thinking in terms of coin tosses. Record a move up as "heads" and a move down as "tails". It's not that common to get all heads, then all tails. But, on reflection, the sample size is so small that maybe it's not so rare.

I can see another rough interpretation in terms of regression to the mean. If there is correlation between the terms (meaning high is likely to be followed by high and low by low, which seems likely), then these terms could be interpreted as a single perturbation followed by regression to the mean.

Not enough data for statistical analysis.

posted by SemiSalt at 3:45 PM on July 26
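
The coin-toss intuition above can be checked by brute force. This sketch rests on assumptions the comment only hints at: each of the five year-to-year moves is treated as an independent fair coin flip, and the flat Year 1 to Year 2 step would have to be counted as a move one way or the other.

```python
from itertools import product

# Every possible sequence of 5 year-to-year moves, U = up, D = down.
sequences = ["".join(moves) for moves in product("UD", repeat=5)]

# "Rises then falls" means no down move is ever followed by an up move.
# This also counts the all-up and all-down sequences.
rise_then_fall = [s for s in sequences if "DU" not in s]

share = len(rise_then_fall) / len(sequences)
print(f"{len(rise_then_fall)} of {len(sequences)} sequences ({share:.1%})")
```

Under that simple model, 6 of the 32 possible sequences have the "all ups, then all downs" shape, which fits the second thought that with so few points the pattern is not especially rare.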

Thanks all!

posted by obiwanwasabi at 7:20 PM on July 26

What this data is, I have no idea, so I wouldn't have the least clue how to interpret it.

posted by restless_nomad at 6:02 PM on July 25 [1 favorite]