help finding patterns in data
April 24, 2012 12:09 PM
Given a set of columnar data, some of which is categorical and some numerical, how can I identify which categorical columns are responsible for significant changes in one or more of the numerical columns?
posted by mulligan to computers & internet (7 answers total)
For example, if my columns (in reality there are many many more columns) are:
day, advertiser, domain, gameid, views, clicks
Suppose that at some point the aggregate views (summed over all the rows) suddenly spike; that is, relative to previous days, there is a sharp increase in the number of views for today. This could be a very popular gameid that accounts for all of it, or it could be attributed to a suddenly popular domain or a big advertiser.
I currently handle this by querying the data set, grouping by all the columns I think matter, and then ordering by the difference between the day2 and day1 aggregates. The next step is to look for column values that are consistent across the top rows. The guess is then that those consistent values are responsible for the biggest change.
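A minimal sketch of that procedure in plain Python, for concreteness. The sample rows, the helper names `rank_by_change` and `consistent_values`, and the `top_n` cutoff are all illustrative assumptions, not the actual queries or data:

```python
from collections import defaultdict

# Hypothetical sample rows: (day, advertiser, domain, gameid, views, clicks)
COLUMNS = ("day", "advertiser", "domain", "gameid", "views", "clicks")
ROWS = [
    ("day1", "acme", "a.com", "g1", 100, 5),
    ("day1", "acme", "b.com", "g2", 120, 6),
    ("day1", "zeta", "a.com", "g1", 80, 4),
    ("day1", "zeta", "b.com", "g2", 90, 3),
    ("day2", "acme", "a.com", "g1", 900, 40),
    ("day2", "acme", "b.com", "g2", 125, 6),
    ("day2", "zeta", "a.com", "g1", 700, 30),
    ("day2", "zeta", "b.com", "g2", 95, 4),
]


def rank_by_change(rows, group_cols, metric, day1, day2):
    """Sum `metric` per combination of `group_cols` for each day, then
    return (key, day2_total - day1_total) pairs, largest increase first."""
    idx = {name: i for i, name in enumerate(COLUMNS)}
    totals = defaultdict(lambda: {day1: 0, day2: 0})
    for row in rows:
        day = row[idx["day"]]
        if day not in (day1, day2):
            continue  # only the two days being compared matter here
        key = tuple(row[idx[c]] for c in group_cols)
        totals[key][day] += row[idx[metric]]
    deltas = [(key, t[day2] - t[day1]) for key, t in totals.items()]
    return sorted(deltas, key=lambda kv: kv[1], reverse=True)


def consistent_values(ranked, group_cols, top_n):
    """Return {column: value} for the columns whose value is identical
    across the first `top_n` rows of the ranking."""
    top_keys = [key for key, _ in ranked[:top_n]]
    shared = {}
    for i, col in enumerate(group_cols):
        values = {key[i] for key in top_keys}
        if len(values) == 1:
            shared[col] = values.pop()
    return shared


group_cols = ("advertiser", "domain", "gameid")
ranked = rank_by_change(ROWS, group_cols, "views", "day1", "day2")
shared = consistent_values(ranked, group_cols, top_n=2)
print(ranked)
print(shared)
```

In this made-up data the top two rows share the same domain and gameid but differ in advertiser, so the guess would be that the domain or the gameid (not the advertiser) drives the spike. The grouping key is the full combination of candidate columns, so the number of groups to scan can grow multiplicatively with each column added.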
As the number of columns increases, the above becomes less feasible.
Is there a statistical technique to identify which column, or columns, and which specific column values are responsible for the change from one day to the next?