help finding patterns in data
April 24, 2012 12:09 PM

Given a set of columnar data, some of which are categorical and others that are numerical, how can I identify which category columns are responsible for significant changes in one or more of the numerical columns?

For example, if my columns (in reality there are many many more columns) are:
day, advertiser, domain, gameid, views, clicks

And at some point, the aggregate views (summed over all the rows) suddenly spikes. That is, relative to previous days, there is a sharp increase in the number of views for today. Now, this could be a very popular gameid that accounts for all this, or it could be attributed to a suddenly popular domain or a big advertiser.

I currently handle this by querying the data set against all the columns I think matter, then ordering by the difference of the aggregates of day2 and day1. The next step involves trying to find all the columns whose values are consistent across the top rows. The guess is then that those consistent column values are responsible for the biggest change.
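
(For concreteness, here is a rough sketch of that manual per-column comparison in Python/pandas; the column names are the ones from the example above, and df is assumed to already hold the rows pulled from the database.)

    import pandas as pd

    # df is assumed to have columns like: day, advertiser, domain, gameid, views, clicks
    def top_movers(df, day1, day2, key, metric="views", n=10):
        """Sum `metric` by `key` on each day and rank values by the day-over-day change."""
        both = df[df["day"].isin([day1, day2])]
        by_day = (both.groupby([key, "day"])[metric]
                      .sum()
                      .unstack("day")
                      .fillna(0))
        by_day["delta"] = by_day[day2] - by_day[day1]
        return by_day.sort_values("delta", ascending=False).head(n)

    # Usage (one pass per candidate categorical column, then eyeball the output):
    # for col in ["advertiser", "domain", "gameid"]:
    #     print(top_movers(df, "2012-04-23", "2012-04-24", col))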

As the number of columns increases, the above becomes less feasible.
Is there a statistical technique to identify which column, or columns, and which specific column values are responsible for the change from one day to the next?
posted by mulligan to Computers & Internet (7 answers total) 1 user marked this as a favorite
 
Response by poster: more generally:
How can I find which subset of columns correlate closely to a significant change in the aggregate of another column?
posted by mulligan at 12:13 PM on April 24, 2012


Is this a dataset inside a DB (i.e., you are running some type of SQL query on it), or is it an xls or csv dump of data?

If it's a DB, there are probably ways to both optimize the dataset and optimize your queries to do this efficiently.

e.g. when you say "aggregate views ... spike", do you mean that for a given (day, advertiser) tuple, the sum of the views for that particular day and advertiser is larger than some running average?
posted by k5.user at 12:31 PM on April 24, 2012


Response by poster: k5.user:
Yes, the data is in a database.
Regarding your query about what I mean by aggregate, for the purposes of this question, I'm referring to the sum(views) over a day.
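
(As an aside, a minimal sketch of that daily aggregate plus a crude spike check in Python/pandas; the 7-day window and 1.5x threshold are arbitrary assumptions, and df is again assumed to hold the raw rows.)

    import pandas as pd

    def daily_view_spikes(df, window=7, threshold=1.5):
        """Sum views per day and flag days well above the trailing average."""
        daily = df.groupby("day")["views"].sum().sort_index()
        trailing = daily.shift(1).rolling(window, min_periods=1).mean()
        return daily[daily > threshold * trailing]

    # Usage: daily_view_spikes(df) returns the days whose total views exceed
    # 1.5x the average of the preceding week.
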
posted by mulligan at 12:50 PM on April 24, 2012


How many columns is "many, many"? What about something as simple as a stacked chart? Or a series of sparklines, one for each column? Then trends/spikes would jump out visually. Admittedly, this is more qualitative than quantitative, but maybe that would be helpful?
posted by misterbrandt at 1:44 PM on April 24, 2012


Response by poster: 84 columns, potentially more.
The issue isn't just the number of columns, though; the number of distinct values each column can take on makes graphing the data in order to visually find a pattern pretty much impossible for a human. For instance, there are more than 100k different gameids.
posted by mulligan at 9:17 PM on April 24, 2012


I am no expert, but we've had some success at work using a data mining tool to extract these sorts of relationships: check out Weka.
posted by bmosher at 9:41 PM on April 24, 2012


This is a machine learning problem. Without seeing the data it's hard to say which particular approach is going to be the most appropriate, but since you're trying to predict a continuous outcome from categorical variables, each with many possible categories, it sounds like you want a linear model with some kind of penalty for complexity. I would try something like forward stepwise/stagewise regression to start, and maybe also the LASSO.

To a very quick first approximation, both approaches try to fit an equation y = ax1 + bx2..., where xi corresponds to one of your values and is either 1 or 0 depending on whether that observation has that value (e.g., 1 if the user id is "foo", 0 if otherwise), a, b, ... are numbers the algorithm picks to optimize the fit, and y is the outcome. The LASSO tends to make lots of the a, b... go to zero, whereas in forward step/stagewise regression you add the ax1, bx2, ... terms one at a time until the fit isn't appreciably better.
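
(To make that concrete, here is a minimal sketch of the LASSO version in Python with scikit-learn; this is only an illustration, not necessarily the workflow the commenter has in mind. The column names are the hypothetical ones from the question, and df is assumed to already contain the data.)

    import pandas as pd
    from sklearn.linear_model import LassoCV

    def lasso_on_categories(df, categorical_cols=("advertiser", "domain", "gameid"),
                            target="views"):
        """One-hot encode the categorical columns and fit an L1-penalized linear model."""
        # 0/1 indicator columns, one per distinct value; with 100k+ gameids you
        # would want a sparse encoding (e.g. sklearn's OneHotEncoder) instead.
        X = pd.get_dummies(df[list(categorical_cols)].astype(str))
        y = df[target]
        model = LassoCV(cv=5).fit(X, y)
        # Nonzero coefficients point at the category values the model thinks matter.
        coefs = pd.Series(model.coef_, index=X.columns)
        return coefs[coefs != 0].sort_values(key=abs, ascending=False)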

You may be able to do one or both of these yourself in Weka (you can definitely do them in R). But it may be better to hire an expert, or to look through something like this book. Machine learning is easy to do wrong, and the more complicated the algorithm the more subtly misleading it can be.
posted by en forme de poire at 12:00 AM on April 25, 2012

