Excelfilter - How do I "fill in" missing data between two points?
January 5, 2015 11:33 AM Subscribe
I am stumped and hope you can help.
I have three columns of data (these aren't real numbers, so go with me, please):
Column A = Date1 Date2 Date3 Date4 Date5
Column B = 100 250 602 610
Column C = 50 51 49 53
What I hope you'll see here is that the data progresses in a non-linear fashion. It's random. Some days, it rises. Some days, it drops.
Is there a function that I can input that would tell Excel to take its best guess and fill in the missing data? I've looked up FORECAST and TREND via Google, but I don't know if that's what I'm looking for.
I have three columns of data (these aren't real numbers, so go with me, please):
Column A = Date1 Date2 Date3 Date4 Date5
Column B = 100 250
Column C = 50 51
What I hope you'll see here is that the data progresses in a non-linear fashion. It's random. Some days, it rises. Some days, it drops.
Is there a function that I can input that would tell Excel to take its best guess and fill in the missing data? I've looked up FORECAST and TREND via Google, but I don't know if that's what I'm looking for.
For clarification, are you trying to populate B and C for Date5?
(Also, answering questions like this is like 80% of statistics. There are countless ways to answer this, which may or may not be appropriate. For example, one "best guess" might just be the average of the known values.)
posted by paper chromatographologist at 11:40 AM on January 5, 2015
(Also, answering questions like this is like 80% of statistics. There are countless ways to answer this, which may or may not be appropriate. For example, one "best guess" might just be the average of the known values.)
posted by paper chromatographologist at 11:40 AM on January 5, 2015
Response by poster: Oy. My tea hasn't kicked in yet, evidently. I forgot to turn on the blockquote.
Let's try this again:
Make sense?
posted by zooropa at 11:44 AM on January 5, 2015
Let's try this again:
Column A = Date1 Date2 Date3 Date4 Date5Basically, I'm trying to fill in the two missing data points with a "best guess." I have the starting points and the end points, but not the middle.
Column B = 100 250 missing1 missing2 602 610
Column C = 50 51 missing1 missing2 49 53
Make sense?
posted by zooropa at 11:44 AM on January 5, 2015
Identifying patterns in complex data is difficult and it's impossible to ever be 100% sure you're right. Filling in the empty spaces is basically asserting that patterns exist. Are there any relationships between the columns, or are all three completely independent? For instance if the dates were out of order, would that affect what numbers you'd hope to see in your missing spaces? Is it important that column B is always even? Do you just want random numbers with kinda the right distribution?
Next question is what are you going to do with these made up numbers? Any analysis or visualization with them would tell you about the guessing process, instead of about the actual data.
posted by aubilenon at 12:08 PM on January 5, 2015 [3 favorites]
Next question is what are you going to do with these made up numbers? Any analysis or visualization with them would tell you about the guessing process, instead of about the actual data.
posted by aubilenon at 12:08 PM on January 5, 2015 [3 favorites]
It doesn't look like you've explicitly said whether your dates are equally spaced.
You have explicitly said that "the data" ... "(I)t's random"; yet you want a prediction. How about 4?
posted by achrise at 12:14 PM on January 5, 2015 [1 favorite]
You have explicitly said that "the data" ... "(I)t's random"; yet you want a prediction. How about 4?
posted by achrise at 12:14 PM on January 5, 2015 [1 favorite]
If you don't know anything about the structure of the data, the best thing to do is fill in missing values with the mean value.
If you do know something about the structure of the data, you can take advantage of interpolation to fill in the missing figures. Basically, fit a best-fit line through the data using whatever structure the data has (linear, quadratic, etc) and then extract the values at the line for the missing points.
posted by zug at 12:19 PM on January 5, 2015 [1 favorite]
If you do know something about the structure of the data, you can take advantage of interpolation to fill in the missing figures. Basically, fit a best-fit line through the data using whatever structure the data has (linear, quadratic, etc) and then extract the values at the line for the missing points.
posted by zug at 12:19 PM on January 5, 2015 [1 favorite]
You were right that the FORECAST function is what you're looking for.
I hope you remember high school algebra. As the linked article describes, this function takes as arguments an x value to predict (which would be your unknown date), your known y values (Column B or C where data is available) and your known x values (the dates where data is available).
You won't be able to make the formula work with the missing data in the middle. You'll have to sort the lines so that the lines with missing data are at the bottom.
Screenshot of the formula in action.
posted by zixyer at 12:22 PM on January 5, 2015
I hope you remember high school algebra. As the linked article describes, this function takes as arguments an x value to predict (which would be your unknown date), your known y values (Column B or C where data is available) and your known x values (the dates where data is available).
You won't be able to make the formula work with the missing data in the middle. You'll have to sort the lines so that the lines with missing data are at the bottom.
Screenshot of the formula in action.
posted by zixyer at 12:22 PM on January 5, 2015
Response by poster: FWIW, I know I don't need to be 100% sure it's right. I'm looking for a reasonable "best guess" so it doesn't look like I completely fudged the data. After all, that's Excel's job. ;)
The columns of data don't necessarily impact one another. For instance, whether column B is higher or lower doesn't impact column C. "Correlation doesn't equal causation" and all. They're independent of each other. That's why I said that it's random.
posted by zooropa at 1:53 PM on January 5, 2015
The columns of data don't necessarily impact one another. For instance, whether column B is higher or lower doesn't impact column C. "Correlation doesn't equal causation" and all. They're independent of each other. That's why I said that it's random.
posted by zooropa at 1:53 PM on January 5, 2015
How about taking the average of the two nearest datapoints?
posted by hiamelie at 12:30 AM on January 14, 2015
posted by hiamelie at 12:30 AM on January 14, 2015
This thread is closed to new comments.
posted by desjardins at 11:36 AM on January 5, 2015