Looking to extract more info from this database
July 17, 2013 8:38 PM Subscribe
We are storing information about application hits in a database. Sadly my database/excel skills are pretty limited and I'd like some help getting more info out of the information we have.
Currently it stores the information in an access database like this:
AppName DateStamp Hour ServerName AccountName IPAddress
App1 17/07/2013 6:00:00PM SERVER1 User1 10.10.10.10
App1 17/07/2013 6:00:00PM SERVER1 User2 10.10.10.11
App1 17/07/2013 6:00:00PM SERVER2 User3 10.10.10.12
App2 17/07/2013 6:00:00PM SERVER3 User1 10.10.10.10
etc
Each hour it adds in each user running each app on each server from each IP address.
So from that I can see who is running what, and generate graphs via excel (yay for pivot tables). We have 1000+ apps however so sometimes its incomprehensible displaying every application. I have got some good graphs going already - for example, the top ten apps displayed in the graphs which does make it somewhat easier to visualise.
I'd like to be able to say "between 8pm and 9pm, this was the variance" where I could say "App1 fell by 10 users, app2 gained 3, app3 gained 32 etc". Or even be able to get excel to pull out the apps with the most variance, identify the "movers and shakers" as it were.
The reason I want to figure this out as we've been hitting record user numbers over the last few weeks and inquiring minds want to know "where is this increase coming from?"
My excel skills are limited to pivot tables and charts and whatever google can tell me. I know there's a variance function but I've no idea how to apply it to the data as stored in the example above.
I hope I've given enough information here but happy to respond to any queries for more information. Thanks!
Currently it stores the information in an access database like this:
AppName DateStamp Hour ServerName AccountName IPAddress
App1 17/07/2013 6:00:00PM SERVER1 User1 10.10.10.10
App1 17/07/2013 6:00:00PM SERVER1 User2 10.10.10.11
App1 17/07/2013 6:00:00PM SERVER2 User3 10.10.10.12
App2 17/07/2013 6:00:00PM SERVER3 User1 10.10.10.10
etc
Each hour it adds in each user running each app on each server from each IP address.
So from that I can see who is running what, and generate graphs via excel (yay for pivot tables). We have 1000+ apps however so sometimes its incomprehensible displaying every application. I have got some good graphs going already - for example, the top ten apps displayed in the graphs which does make it somewhat easier to visualise.
I'd like to be able to say "between 8pm and 9pm, this was the variance" where I could say "App1 fell by 10 users, app2 gained 3, app3 gained 32 etc". Or even be able to get excel to pull out the apps with the most variance, identify the "movers and shakers" as it were.
The reason I want to figure this out as we've been hitting record user numbers over the last few weeks and inquiring minds want to know "where is this increase coming from?"
My excel skills are limited to pivot tables and charts and whatever google can tell me. I know there's a variance function but I've no idea how to apply it to the data as stored in the example above.
I hope I've given enough information here but happy to respond to any queries for more information. Thanks!
I don't think variance is what you want. But let's take a step back.
The hardest and most important part of an exercise like this is really pinning down what it is you want to know--that is, you need to identify the right question before you can get a meaningful answer.
You mentioned they want to know where the increase is coming from--can you elaborate on that? Do you mean geographically (reverse IP lookup)? From which app? Both? Or (as is often the case) do they not know exactly what they want and it's really more about exploration, more about finding the question than answering it--"Hey our server guys say we're getting a bunch more traffic compared to last month, what's going on?"
There are certain things pivot tables do well but they're really pretty limited. You can aggregate into groups with both SUM or COUNT, for example, which seems great, but there's no way to count distinct users; this could be a problem if a user is recorded on two different servers during the same hour, for example (or if the data contains duplicate rows).
I've never used Access but I know Excel and SQL and do mobile app data analysis. I'd be happy to help take a swing at this. (If you post here and I'm slow responding please feel free to memail me... it's late and I'm tired and I may forget to check back tomorrow.)
As a starting point, can you formulate exactly one very specifically worded question? (For example: What is the number of distinct active users, grouped by app and by day?)
posted by kprincehouse at 10:29 PM on July 17, 2013 [1 favorite]
The hardest and most important part of an exercise like this is really pinning down what it is you want to know--that is, you need to identify the right question before you can get a meaningful answer.
You mentioned they want to know where the increase is coming from--can you elaborate on that? Do you mean geographically (reverse IP lookup)? From which app? Both? Or (as is often the case) do they not know exactly what they want and it's really more about exploration, more about finding the question than answering it--"Hey our server guys say we're getting a bunch more traffic compared to last month, what's going on?"
There are certain things pivot tables do well but they're really pretty limited. You can aggregate into groups with both SUM or COUNT, for example, which seems great, but there's no way to count distinct users; this could be a problem if a user is recorded on two different servers during the same hour, for example (or if the data contains duplicate rows).
I've never used Access but I know Excel and SQL and do mobile app data analysis. I'd be happy to help take a swing at this. (If you post here and I'm slow responding please feel free to memail me... it's late and I'm tired and I may forget to check back tomorrow.)
As a starting point, can you formulate exactly one very specifically worded question? (For example: What is the number of distinct active users, grouped by app and by day?)
posted by kprincehouse at 10:29 PM on July 17, 2013 [1 favorite]
Response by poster: Jasper Friendly Bear - yep thats exactly what I was trying to do, thanks very much! Still quite hard to visualise it but I can play around with different chart types.
kprincehouse - you are exactly right about asking the right questions first, and thats part of the problem, theres a lot of "what else can we get from this data?" style of questions at the moment.
I guess at the moment the exact question is "we know that overall our usage jumped from 600 to 700 users between 2 and 3pm. Which applications generated that growth?"
posted by Admira at 11:13 PM on July 17, 2013
kprincehouse - you are exactly right about asking the right questions first, and thats part of the problem, theres a lot of "what else can we get from this data?" style of questions at the moment.
I guess at the moment the exact question is "we know that overall our usage jumped from 600 to 700 users between 2 and 3pm. Which applications generated that growth?"
posted by Admira at 11:13 PM on July 17, 2013
Between 2pm and 3pm on a specific day, or aggregating all 2pm and all 3pm data together?
posted by kprincehouse at 12:09 AM on July 18, 2013
posted by kprincehouse at 12:09 AM on July 18, 2013
(I'm trying to understand how you want this grouped--that is, by day and hour, or just by hour. And the variation you're looking for truly is at the hour-to-hour level, not a larger-grained trend where day or even week may be a more relevant unit? I'm surprised because my first though would be to look more at the forest; if you keep the data finely chopped up into very small groups it will be hard to see larger trends.)
posted by kprincehouse at 12:16 AM on July 18, 2013
posted by kprincehouse at 12:16 AM on July 18, 2013
It's also worth noting that any user activity data I've ever seen has exhibited dramatic cyclicity--high during the day and low at night, as well as low during the week and high during the weekend. This (especially the weekend thing) may be different in your case... I look at game data, so weekends are naturally more popular.
Anway, back to your question. We can totally answer it. Let's just confirm this sounds right before going further:
Our dimensions are app, day, and hour. (We're "grouping" by these, and each permutation of them in the input table will be represented by exactly one row in the output table. Note that this will combine data from all servers.)
Our measure is number of distinct active users in each group. (Where distinct users are determined solely from the AccountName field; IP is not used.)
Sound good?
posted by kprincehouse at 12:25 AM on July 18, 2013
Anway, back to your question. We can totally answer it. Let's just confirm this sounds right before going further:
Our dimensions are app, day, and hour. (We're "grouping" by these, and each permutation of them in the input table will be represented by exactly one row in the output table. Note that this will combine data from all servers.)
Our measure is number of distinct active users in each group. (Where distinct users are determined solely from the AccountName field; IP is not used.)
Sound good?
posted by kprincehouse at 12:25 AM on July 18, 2013
Response by poster: Yep. I guess overall larger trends would be interesting, but like you said - the larger pattern tends to stay the same, busy during the day, quiet at night. Its these sudden jumps some days (and the jump usually lasts between 2 and 3 hours) that has everyone interested, and has exposed the gap in knowledge that we had that we hope this data will help us unveil.
I really appreciate the help ... I'm starting to see that getting the right questions, and very detailed, is the most important part of all.
posted by Admira at 12:36 AM on July 18, 2013
I really appreciate the help ... I'm starting to see that getting the right questions, and very detailed, is the most important part of all.
posted by Admira at 12:36 AM on July 18, 2013
Happy to help!
Interesting. So it sounds like the hour-to-hour is really important, and more specifically what you'll want to dig into these few-hour bursts of anomalous activity. I'll try to put together something on the bus to work tomorrow morning.
What version of Excel are you using?
And also, to help me understand the context, what was it that first indicated that these jumps are happening? (More rows per day in the table? Spikes in server activity? Etc.)
And finally, are there fewer than one million (2^20) rows in the table?
posted by kprincehouse at 1:25 AM on July 18, 2013
Interesting. So it sounds like the hour-to-hour is really important, and more specifically what you'll want to dig into these few-hour bursts of anomalous activity. I'll try to put together something on the bus to work tomorrow morning.
What version of Excel are you using?
And also, to help me understand the context, what was it that first indicated that these jumps are happening? (More rows per day in the table? Spikes in server activity? Etc.)
And finally, are there fewer than one million (2^20) rows in the table?
posted by kprincehouse at 1:25 AM on July 18, 2013
Response by poster: We are using Excel 2007 and its a access 2007 db currently (probably move to sql at some stage). The indication of the burst was the license server that all the servers draw a per user license from when they connect. Sadly it doesn't tell us what they are running, just that they took a license and we can see the spikes there.
Definitely less than a million rows.
posted by Admira at 1:39 AM on July 18, 2013
Definitely less than a million rows.
posted by Admira at 1:39 AM on July 18, 2013
(Got bogged down in some other stuff, will try to finish this up for you this evening.)
posted by kprincehouse at 1:30 PM on July 18, 2013
posted by kprincehouse at 1:30 PM on July 18, 2013
This thread is closed to new comments.
1. Make a pivot table with Hour in the column labels, AppName in the Row Labels, and AccountName (or some other field) as the values.
2. Click into one of the value cells of the pivot table, right click and select “Value Field Settings”.
3. In the Value Field Settings dialogue box, click on the second tab labeled “Show Values As”
4. In the dropdown box where it says “Show Values As”, select “Difference From”. It’s the second option under “Normal”
5. For the base field, select Hour
6. For the base item, select (previous)
7. Click OK
Your pivot table will now show the change in the number of users for each application from the previous hour.
To quickly identify when large changes are happening you could apply the data bars conditional formatting. This will show a bar graph on each cell.
posted by Jasper Friendly Bear at 10:26 PM on July 17, 2013 [1 favorite]