How to calculate Rate of Change
February 1, 2008 6:20 AM   Subscribe

How can I calculate (or at least make a good approximation of) how much some data has changed, in bytes?

I have a folder with (currently) 140GB of 300-400 Lotus Notes mail files. While the total stays roughly around the 140GB mark, the actual number of bytes changing daily is harder to calculate.

All this data is on a NetApp SAN, by the way, in a LUN mounted on a Windows 2003 server.

I'm basically trying to calculate the Rate of Change (ROC) so I can work through this doc. Any help appreciated.
posted by daveyt to Computers & Internet (6 answers total) 1 user marked this as a favorite
 
One thing you can try is making a backup of all 140 GB, then using rsync via Cygwin to sync your backup to the latest version. Rsync will tell you how many bytes actually changed.
posted by bertrandom at 7:00 AM on February 1, 2008
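(With a full backup in hand, `rsync --dry-run --stats` will report the transferred size without touching anything. The same byte-counting idea can also be sketched in pure Python; the file paths here are hypothetical, and this compares one backup/live pair of files block by block:)

```python
import os

def changed_bytes(old_path, new_path, block=1 << 16):
    """Count how many bytes differ between two versions of a file."""
    changed = 0
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        while True:
            a, b = old.read(block), new.read(block)
            if not a and not b:
                break
            # compare position by position; any length difference
            # counts as changed bytes too
            for x, y in zip(a, b):
                if x != y:
                    changed += 1
            changed += abs(len(a) - len(b))
    return changed
```

Run it over each mail file against yesterday's backup and sum the results to get the day's churn in bytes.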


Why can't you create a cronjob (or "Scheduled Task" on Windows) that checks the folder size each day and reports the delta? You can have a flat text file that the script simply appends the current size (and date?) to each day. You can then write another simple script to parse the text file and give you the data in whatever format you want (daily deltas, rolling averages, etc.).
posted by jpdoane at 7:44 AM on February 1, 2008
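(A minimal sketch of that scheduled-task approach; the log filename is hypothetical. Note daveyt's follow-up below: this measures growth in total size, not churn inside files that stay the same size.)

```python
import datetime
import os

LOG = "folder_sizes.txt"  # hypothetical flat-file log

def folder_size(path):
    """Total size in bytes of every file under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def log_size(path, log=LOG):
    """Append today's date and the folder size to the log."""
    with open(log, "a") as f:
        f.write("%s\t%d\n" % (datetime.date.today().isoformat(),
                              folder_size(path)))

def daily_deltas(log=LOG):
    """Parse the log and return (date, size change) pairs."""
    rows = []
    for line in open(log):
        date, size = line.split("\t")
        rows.append((date, int(size)))
    return [(d2, s2 - s1)
            for (_d1, s1), (d2, s2) in zip(rows, rows[1:])]
```

Schedule `log_size()` once a day and run `daily_deltas()` whenever you want the history.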


Response by poster: The folder and file size isn't the issue. The mailboxes hover around the same overall size - let's say 100MB. But during the day, mail comes in, mail gets deleted, mail gets modified, and the overall file size stays the same. Potentially that file could see many times 100MB in changes - say I receive a 10MB mail into my 100MB mail file and have to delete old mail to make room for it (in my example it was 95% full) - the before and after file size is still 100MB.

After further looking, a partial answer is to use the SAN itself: the snapshots I can spin off daily give the actual deltas. It's just that I want to be able to calculate this on data that hasn't yet moved to the SAN. But in order to move it, I need to prepare the volumes and LUNs, and to do that, I need the ROC... which I can't get until I move it! Vicious circle.
posted by daveyt at 7:55 AM on February 1, 2008


A couple of suggestions that really only apply if there are many files (as opposed to one or two large ones). diff is the obvious answer, but running it over 140 gigs isn't going to be fun.

A Perl script to md5-sum the files in (say) 64-meg chunks and store those checksums should be pretty quick to write, and IO-bound to run on a modern CPU (and on a SAN that bound should be pretty high). That would give you a reasonable guesstimate of your churn - though likely inflated unless you can split on some kind of internal division in the file; I don't know enough about how Lotus stores data internally to say.

Otherwise your SAN might even have some measurement tools that can help but I'm out of my depth there.
posted by Skorgu at 7:55 AM on February 1, 2008
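(The chunked-checksum idea above could be sketched like this in Python rather than Perl; the 64MB chunk size is the one suggested, and as noted the estimate is an upper bound since any edit marks its whole chunk as changed:)

```python
import hashlib

CHUNK = 64 * 1024 * 1024  # 64 MB chunks, per the suggestion above

def chunk_sums(path, chunk=CHUNK):
    """Return an md5 digest for each fixed-size chunk of a file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            sums.append(hashlib.md5(data).hexdigest())
    return sums

def churn_estimate(old_sums, new_sums, chunk=CHUNK):
    """Estimate changed bytes: one full chunk per differing checksum
    (inflated if edits straddle chunk boundaries)."""
    changed = sum(1 for a, b in zip(old_sums, new_sums) if a != b)
    changed += abs(len(old_sums) - len(new_sums))
    return changed * chunk
```

Store today's checksum lists, compare against tomorrow's, and sum `churn_estimate` across all the mail files for a daily ROC figure.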


Other ideas: if the changes are all coming from one machine or a group of machines, you might be able to measure their traffic to the SAN and infer churn from that. Since these are mail spools, you should be able to get some of the statistics you need from the mailer, but that'd be a Lotus-specific question.
posted by Skorgu at 7:58 AM on February 1, 2008


The page is titled "Calculating the size of a volume."

It appears that ROC has to do with the amount your data is growing, not how much your data file is changing.
posted by mphuie at 9:09 AM on February 1, 2008


This thread is closed to new comments.