How to calculate Rate of Change
February 1, 2008 6:20 AM
How can I calculate (or at least make a good approximation of) how much some data has changed, in bytes?
I have a folder with (currently) 140GB of 300-400 Lotus Notes mail files. Whilst the total stays roughly around that 140GB mark, the actual number of bytes changing daily is harder to calculate.
All this data is on a NetApp SAN, by the way, in a LUN mounted on a Windows 2003 server.
I'm basically trying to calculate the Rate of Change (ROC) so I can work through this doc. Any help appreciated.
Why can't you create a cronjob (or "Scheduled Task" on Windows) that checks the folder size each day and reports the delta? You can have a flat text file that the script simply appends the current size (and date?) to each day. You can then write another simple script to parse the text file and give you the data in whatever format you want (daily deltas, rolling averages, etc.).
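For what it's worth, a minimal sketch of that daily logger in Perl (the folder and log paths here are placeholders, not anything from the question):
```perl
#!/usr/bin/perl
# Sum the sizes of everything under the mail folder and append one
# "date <tab> bytes" record to a flat log. Run it from a daily
# Scheduled Task; the deltas between records are the daily growth.
use strict;
use warnings;
use File::Find;
use POSIX qw(strftime);

my $root = 'E:/notes-mail';          # placeholder mail folder
my $log  = 'C:/logs/mailsize.txt';   # placeholder log file

my $total = 0;
find( sub { $total += -s $_ if -f $_ }, $root );

open my $out, '>>', $log or die "cannot open $log: $!";
print $out strftime( '%Y-%m-%d', localtime ), "\t$total\n";
close $out;
```
Parsing that log into deltas or rolling averages is then a one-liner.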
posted by jpdoane at 7:44 AM on February 1, 2008
Response by poster: The folder and file size isn't the issue. The mailboxes hover around the same overall size - let's say 100MB. But during the day, mail comes in, mail gets deleted, mail gets modified, and the overall file size stays the same. Potentially that file could rack up many times 100MB in changes - let's say I receive a 10MB mail into my 100MB mail file and have to delete old mail to make room for it (in my example, say, the file was 95% full) - the before-and-after file size is still 100MB.
After further looking, a partial answer is to use the SAN itself: the snapshots I can spin off daily give the actual deltas. It's just that I want to be able to calculate this for data that hasn't yet moved to the SAN. But in order to move it, I need to prepare the volumes and LUNs, and to do that I need the ROC... which I can't get until I move it! Vicious circle.
posted by daveyt at 7:55 AM on February 1, 2008
Couple of suggestions that really only apply if there are many files (as opposed to one or two large files). Diff is the obvious answer, but running it over 140 gigs isn't going to be fun.
A Perl script to md5-sum the files in (say) 64MB chunks and store those checksums should be pretty quick to write, and IO-bound to run on a modern CPU (and on a SAN that should be a pretty high bound). That would give you a reasonable guesstimate of your churn, though likely an inflated one unless you can split on some kind of internal division in the file - I don't know enough about how Lotus stores data internally to say.
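Something along these lines, maybe (a sketch only - the folder path, the .nsf extension filter, and the 64MB chunk size are all assumptions to adjust):
```perl
#!/usr/bin/perl
# Emit one "path <tab> offset <tab> md5" line per 64MB chunk of each
# mail file. Save each day's output, diff consecutive days, and
# (changed lines) x 64MB is an upper bound on the bytes that churned.
use strict;
use warnings;
use Digest::MD5;
use File::Find;

my $root  = 'E:/notes-mail';     # placeholder mail folder
my $chunk = 64 * 1024 * 1024;    # 64MB chunks, per the suggestion

find( sub {
    return unless -f $_ && /\.nsf$/i;     # assumed Notes databases
    my $name = $File::Find::name;
    open my $fh, '<:raw', $_
        or do { warn "skipping $name: $!"; return };
    my ( $offset, $buf ) = ( 0, '' );
    while ( my $n = read $fh, $buf, $chunk ) {
        print join( "\t", $name, $offset,
                    Digest::MD5->new->add($buf)->hexdigest ), "\n";
        $offset += $n;
    }
    close $fh;
}, $root );
```
Because a one-byte edit dirties a whole 64MB chunk, shrinking the chunk size trades runtime for a tighter estimate.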
Otherwise your SAN might even have some measurement tools that can help but I'm out of my depth there.
posted by Skorgu at 7:55 AM on February 1, 2008
A couple of other ideas: if the changes are all coming from one machine or a group of machines, you might be able to measure their traffic to the SAN and infer churn from that. Since these are mail spools, you should be able to get some of the statistics you need from the mailer, but that'd be a Lotus-specific question.
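If you want to try the traffic angle from the Windows box itself, here's a rough sketch that samples the perfmon write counter for the LUN's drive letter via typeperf (which ships with Windows 2003). The E: drive letter and the one-hour window are assumptions, and raw writes will overstate churn when the same blocks get rewritten:
```perl
#!/usr/bin/perl
# Sample "Disk Write Bytes/sec" on the LUN's drive once a second for
# an hour and integrate, giving a rough bytes-written figure for the
# window. Extrapolate (or run it for a full day) to estimate churn.
use strict;
use warnings;

my $counter = '\\LogicalDisk(E:)\\Disk Write Bytes/sec';  # E: assumed
open my $tp, qq{typeperf "$counter" -si 1 -sc 3600 |}
    or die "cannot run typeperf: $!";

my $bytes = 0;
while (<$tp>) {
    next unless /","([\d.]+)"/;   # CSV data rows: "timestamp","value"
    $bytes += $1;                 # 1s samples, so value ~ bytes/sample
}
close $tp;
print "approx bytes written this window: $bytes\n";
```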
posted by Skorgu at 7:58 AM on February 1, 2008
The page is titled "Calculating the size of a volume". It appears that ROC has to do with the amount your data is growing, not how much your data file is changing.
posted by mphuie at 9:09 AM on February 1, 2008
This thread is closed to new comments.