# Math, sweet sweet math

October 16, 2014 1:03 PM

I have a set of data, D(t): 5000 samples. Scatter-graphing makes some patterns clear (D-mean increases with t, for instance). D and t are always positive. I want to characterize these patterns statistically.

I've tried blindly stumbling upon a distribution model that fits D to t, but that worked about as well as you'd expect.

So, how do I go about testing if P(D, t) is a function of t, t^2, log(t), t^x, etc... ? That is, given an interval of [D1:D2], what is the probability distribution for t? (And vice versa.)

Best answer: Another nonparametric option: Anderson-Darling test.
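For instance, a quick sketch with SciPy's `anderson` — the lognormal sample here is just a stand-in for real data, chosen so the raw values fail a normality test while their logs pass:

```python
import numpy as np
from scipy.stats import anderson

# Synthetic stand-in for the real D(t) samples (an assumption for
# illustration): 5000 lognormal draws, i.e. NOT normal on the raw scale.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Test the raw sample and its log against a normal distribution.
raw = anderson(sample, dist='norm')
logged = anderson(np.log(sample), dist='norm')

# A statistic above a critical value rejects normality at the
# corresponding significance level (in percent).
print('raw A^2:', raw.statistic, 'critical values:', raw.critical_values)
print('log A^2:', logged.statistic)
```

The raw sample's statistic blows past every critical value, while the logged sample's does not — which is one way to probe whether a log transform of D or t brings the data into a familiar distribution family.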

posted by a lungful of dragon at 1:49 PM on October 16, 2014

If you're looking for a simple answer (and I apologize if this isn't complicated enough and I'm overlooking part of your question), you could just find a best fit to the data for each of the functions you're considering over the interval you're interested in, calculate the R^2 value for each, and use the function that gives the best fit. Excel could handle this, though 5000 rows can get a bit unwieldy. The Excel Solver can find the unknown constants (e.g. A, B in D(t) = A*t^B) that minimize the sum of squared residuals — equivalently, maximize R^2. Other software tools (e.g. MATLAB, Stata) could do this as well. Another approach might be using system identification techniques to identify the system dynamics.
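A minimal sketch of that fit-each-candidate-and-compare-R^2 idea in Python/NumPy rather than Excel — the synthetic data (a known true relation D = 2*t^1.5 plus noise) and the candidate list are assumptions for illustration:

```python
import numpy as np

# Synthetic noisy data with a known answer (an assumption for
# illustration): D = 2 * t^1.5 plus Gaussian noise.
rng = np.random.default_rng(1)
t = np.linspace(1.0, 10.0, 5000)
D = 2.0 * t**1.5 + rng.normal(0.0, 1.0, t.size)

def r_squared(y, y_fit):
    """Coefficient of determination for a fitted curve."""
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Candidate functional forms, each fit as D = A*f(t) + B by
# ordinary linear least squares on the transformed t.
candidates = {
    't': t,
    't^2': t ** 2,
    'log(t)': np.log(t),
    't^1.5': t ** 1.5,
}
scores = {}
for name, f in candidates.items():
    A, B = np.polyfit(f, D, 1)          # best-fit slope and intercept
    scores[name] = r_squared(D, A * f + B)

best = max(scores, key=scores.get)
print('best model:', best, 'R^2:', scores[best])
```

The correct form wins with the highest R^2; a misspecified form like t^2 still scores respectably here, which is why the caveat below about knowing the underlying process matters.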

It depends what the data represent -- the underlying process might suggest what function is the best fit. Also, it depends what question you want to answer-- for some questions, another statistical test might be more useful than just curve-fitting and giving the R^2 value.

If, by any chance, your 5000 samples comprise the closing price for the Dow Jones Industrial Average over the past decade, and you're trying to curve-fit to make a pile of money in the stock market in the short term...well, it's a hard problem, requiring a more complicated model than just curve-fitting.

posted by sninctown at 1:51 PM on October 16, 2014 [1 favorite]

Response by poster: Thanks! These are great starts.

sninctown: If, by any chance, your 5000 samples comprise the closing price for the Dow Jones Industrial Average over the past decade...

Heh. Since the Dow is price-weighted instead of capitalization-weighted, IMO trying to characterize it is a fool's errand, anyway. But I digress...

posted by IAmBroom at 2:44 PM on October 16, 2014

If it's only 5000 samples, and you know that it is only a single power law function, you could easily take the brute force approach with a bit of Python/Numpy (or R, or Matlab, or whatever).

Basically, choose a range of index values you'd like to consider, and for each of these models calculate the reduced chi-square of the fit:

chi^2_red = (1 / (N - p)) * Sum_i [ ((D(t_i) - M(t_i)) / sigma_i)^2 ]

where M(t) = A*t^n, N is the number of samples, p is the number of free parameters, and sigma_i is the measurement uncertainty for each value. (If they're all equal, or the measurements are perfect, it's even simpler - just set all sigma_i to 1.)

Now just test it for n= [0.0,25.0] in steps of maybe 0.25 - a few seconds on a laptop, and you'll see where the right answer is for deeper digging.

If you want to allow arbitrary polynomial models (A + B*t + C*t^2 + D*t^3 + ...), things get more complicated, because now you have to do separate fits for each parameter set, keep track of the number of free parameters, and weight your results accordingly...

posted by RedOrGreen at 3:11 PM on October 16, 2014 [1 favorite]

Response by poster: Heh... klugey, but I kinda like it!

posted by IAmBroom at 4:26 PM on October 16, 2014

Best answer: If you want to find the exponent x, take the log of both sides:

D = t^x

log D = log(t^x) = x log t
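In other words, x is the slope of a straight-line fit to log D versus log t. A minimal NumPy sketch — the synthetic power-law sample with x = 1.6 and multiplicative noise is an assumption for illustration:

```python
import numpy as np

# Estimate x in D = t^x by fitting a line to log D vs. log t:
# log D = x * log t, so the fitted slope estimates x.
# Synthetic data with true x = 1.6 (an assumption for illustration).
rng = np.random.default_rng(3)
t = np.exp(rng.uniform(0.0, 3.0, 5000))
D = t**1.6 * np.exp(rng.normal(0.0, 0.1, t.size))  # multiplicative noise

slope, intercept = np.polyfit(np.log(t), np.log(D), 1)
print('estimated exponent:', slope)
```

If the relation has a prefactor, D = A*t^x, the intercept estimates log A; here A = 1, so the intercept comes out near zero.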

posted by SemiSalt at 5:45 PM on October 16, 2014

Depending on your purposes, the mutual information literature and the maximal information coefficient (MIC) might be useful.

posted by MrBobinski at 6:28 PM on October 16, 2014

Scratch that. MIC would let you know how closely D and t are related, but wouldn't necessarily help you decide what the form of the relationship is. The Bayesian information criterion (BIC) would help you with that.

posted by MrBobinski at 8:16 AM on October 17, 2014

Response by poster: SemiSalt: If you want to find the power of the exponent x, take the log of both sides.

That's unlikely to help, as the data is very noisy. Log/log of (VERY NOISY) vs. (VERY NOISY) is very noisy.

Kalman filtering just sprang to mind, but that generally explains trends, not distributions.

posted by IAmBroom at 9:35 AM on October 17, 2014

Response by poster: You know what, **SemiSalt**? I stand corrected. Aside from some quantization error in the lowest samples... that shows a pretty damn clear trend at about D = t^1.6 or so!

http://imgur.com/frEndsJ

posted by IAmBroom at 2:42 PM on October 18, 2014

This thread is closed to new comments.
