Monitoring site-to-site internet latency how?
December 2, 2007 11:21 AM

I'm searching for a utility that monitors the speed of an internet connection from a hospital on this side of town to one on that side of town. There is a special consideration. (I didn't think it would be so special when I started looking, but apparently it is.)

The radiologists who read our X-rays are a partnership that also reads for the other hospital in town. At each hospital they have one XP-based reading station with a DSL connection and a VPN to the other hospital, in case they need to look at a particular study that wasn't done at the hospital where they're physically sitting. Lately they have been telling me that the response time for opening one of our studies on the PC at the other hospital has become unusably slow in the evenings, roughly 8pm to 10pm.

The rads know enough to turn the vpn off and test the system with one of the internet measure-your-DSL-speed sites, and they say all these sites say their connection speed is fine. But that only tests the speed over the internet path between that hospital and the test-your-speed site. If there were a slowdown at some node that isn't on that path (but is on the path packets take when they travel between the two hospitals) that kind of test won't find it. I am not a network engineer but it seems to me that this is a job for a continuous traceroute (or some alternative like pathping or Layer Four Traceroute) left running for a couple of days.

There are tons of programs that do this kind of thing. I've looked at examples like PingPlotter and Path Analyzer Pro and WinMTR and even some hard-to-find abandonware like NeoTrace. But the special ability that seems (surprisingly) to be uncommon is not just to average the speed over the time a program is collecting data but also to preserve the ups and downs so that after running, say, from 6PM today to 6PM tomorrow I could look and say, "well, it's fast enough at 6 and at 7 but hey whoops it really slowed down at 8 and didn't get well until after midnight." The only program I've seen that may do this is PingPlotter: there are a couple of screen captures on their site that show what looks like a smallish graph of latency and dropped packets over time. But I can't be sure, because the downloadable evaluation version is feature-limited and definitely doesn't do this. The eval version of Path Analyzer Pro is time-limited and has all the (very many, and cool) features of the full version, but if it preserves and displays speed-vs.-time data I can't find it.

At this point I'm reluctantly tempted to do this the mouthbreathing way, by writing a batch file that runs a command-line tracert to our site a few times and appends the output to a file, and then firing this batch file off every fifteen minutes for 24 hours with the scheduler. That makes a really ugly text file of data that I would have to clean up and parse with win32 sed and then import into excel to graph it up all pretty.

Can anyone point me to a better solution than that? Neither hospital will pay to solve this, each says it's the other one's problem. If there isn't any likely freeware for the job I could probably put 20-30 bucks of my own into it, but there's no use telling me how great SolarWinds is.

Thanks very much!
posted by jfuller to Computers & Internet (10 answers total)
 
If your VPN server is a multi-purpose box, is it possible that something is eating both bandwidth and CPU at that time? For example, a backup script?
posted by ydnagaj at 11:29 AM on December 2, 2007



> At this point I'm reluctantly tempted to do this the mouthbreathing way, by writing a batch file that runs a command-line tracert to our site a few times and appends the output to a file, and then firing this batch file off every fifteen minutes for 24 hours with the scheduler. That makes a really ugly text file of data that I would have to clean up and parse with win32 sed and then import into excel to graph it up all pretty.


This is not the 'mouth-breathing' way, this is the UNIX way. When I first saw your problem, I immediately thought of a few UNIX commands I would tie together to run this test.

Remember, all you need is a comma- or line-separated dump of what the current bandwidth through the VPN is at any given time (though you might also want to gather other diagnostic information like packets lost/packets sent). If you write this to a file, Excel will be able to import the whole set as a column of data that could match an easily-generated column of times, knowing your start time and end time.

This should take you less than two hours to write and test, let us know if you need further advice on writing the script.
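
A minimal sketch of that kind of logger, written in Python rather than as a .cmd file, and assuming Python 3.7+ can be run on (or next to) the reading station. It shells out to the Windows `ping -n` command, so it records latency and packet loss rather than bandwidth. The host name and log file name in the usage note are hypothetical, and the regular expressions assume English-language Windows ping output.

```python
import csv
import re
import subprocess
from datetime import datetime

# Per-reply round-trip times in Windows ping output, e.g. "time=23ms" or "time<1ms".
# A "<1ms" reply is recorded as 1 ms.
RTT_RE = re.compile(r"time[=<](\d+)\s*ms")
# Summary line, e.g. "Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),".
COUNT_RE = re.compile(r"Sent = (\d+), Received = (\d+)")


def summarize_ping(output):
    """Return (sent, received, average RTT in ms or None) from ping's text output."""
    rtts = [int(value) for value in RTT_RE.findall(output)]
    counts = COUNT_RE.search(output)
    if counts:
        sent, received = int(counts.group(1)), int(counts.group(2))
    else:
        sent, received = 0, 0
    average = sum(rtts) / len(rtts) if rtts else None
    return sent, received, average


def log_sample(host, csv_path, count=4):
    """Ping host once and append one timestamped row to csv_path."""
    # "-n" is the Windows packet-count flag; Unix-like pings use "-c" instead.
    result = subprocess.run(
        ["ping", "-n", str(count), host], capture_output=True, text=True
    )
    sent, received, average = summarize_ping(result.stdout)
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow(
            [stamp, host, sent, received, "" if average is None else round(average, 1)]
        )
```

Calling something like `log_sample("imageserver.example", "latency_log.csv")` from a Scheduled Task every fifteen minutes would append one row per run, timestamp included, and Excel opens the resulting CSV directly.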
posted by onalark at 11:44 AM on December 2, 2007


iperf is a great tool for this kind of thing. It's more accurate than tracert since it does actual data transfer.

Use windows scheduler to run it every 15 minutes for 30 seconds, dumping into a file and then do the excel data analysis later.
posted by cschneid at 12:01 PM on December 2, 2007


Response by poster: > If your VPN server is a multi-purpose box, is it possible that something is eating both bandwidth
> and CPU at that time? For example, a backup script?

I'm pretty sure it's for vpn only, but that's part of the problem--I work for radiology, not I.S., and don't have the inside knowledge or access that might help a lot. I have an mcse, for what that's worth, but no ccna and definitely no collection of switch or router passwords and can't go poking at them with snmp. Anything I find out will have to be discovered with tools available to any inquisitive busybody.

I've mentioned the problem to the I.S. folks several times, and they just say "everything looks fine on our end" and that's as far as it goes. I'm actually looking for evidence that can't be ignored, if the problem is actually on our end. If, for instance, I had several hours of site-to-site connection speed data (with the vpn up) from the rads' PC to my image server, and could show that the speed across the internet stays fast as far as the VPN server but then, at a certain time of day, gets slow from there to my own servers, that would be evidence that I wouldn't be scared to wave at the CIO, if that's what it took.


> This is not the 'mouth-breathing' way, this is the UNIX way.

Oh, amen to that. But the box they're complaining about is XP, and fo'sure doesn't have cygwin installed (and belongs to the radiologists, not to either of the hospitals, and I'm a bit reluctant to tamper with it in a major way) so any script will be a .cmd file, not .sh. I could probably slip in some of the djgpp tools while nobody's looking. Or more likely just copy the data file to a flash drive and carry it off to mumble over it elsewhere. (In the latter case, using actual *nix utilities and pipes and T's and so on would be a possibility.)
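
For the carry-it-off-and-parse-it-elsewhere route, here is a rough sketch of the parsing step in Python instead of sed. It assumes the batch file saved raw, English-language Windows `tracert` output, and that a timestamp for each run is recorded separately (for example by the batch file echoing the date and time before each trace). The function names are made up for illustration.

```python
import re

# A tracert hop line looks like:
#   "  2    12 ms    11 ms    13 ms  10.20.30.1"
#   "  3     *        *        *     Request timed out."
PROBE = r"(<?\d+ ms|\*)"
HOP_RE = re.compile(
    r"^\s*(\d+)\s+" + PROBE + r"\s+" + PROBE + r"\s+" + PROBE + r"\s+(.*\S)\s*$"
)


def parse_tracert(text):
    """Return a list of (hop number, [three RTTs in ms or None], address text)."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header, blank line, or "Trace complete."
        probes = [
            None if probe == "*" else int(probe.lstrip("<").split()[0])
            for probe in match.group(2, 3, 4)
        ]
        hops.append((int(match.group(1)), probes, match.group(5)))
    return hops


def to_rows(stamp, hops):
    """Flatten hops into CSV-ready rows: (timestamp, hop, address, best RTT or "")."""
    rows = []
    for number, probes, address in hops:
        answered = [p for p in probes if p is not None]
        rows.append((stamp, number, address, min(answered) if answered else ""))
    return rows
```

Writing the rows from every run into one CSV gives a hop-by-hop latency column per timestamp, which is the shape needed to see whether the slowdown starts at a particular hop.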
posted by jfuller at 12:06 PM on December 2, 2007


Response by poster: > iperf is a great tool for this kind of thing. It's more accurate than tracert since it does actual data transfer.

iperf looks really interesting for an unrelated problem I'm working on--the rads say the workstations on their hall all get slow around 5pm, and once again IS says they can't find anything wrong. Is there actually a problem here, or are the doctors all just getting the same subjective time dilation when it's almost time to go home? (That traffic doesn't cross the internet, the image servers and the display-station clients are all on the same in-house VLAN.)

Since I haven't yet had time to R the iperf FM much (though I will) can you tell me whether it measures path bandwidth only end to end (i.e. from the iperf server to the client) or whether it can give you any traceroute-like hint, for routed traffic, that getting as far as router A is fast but getting to router B on the other side of A suddenly becomes slow?
posted by jfuller at 1:06 PM on December 2, 2007


iperf goes from end to end only, but it is nice to narrow down exactly what time of day problems happen. It has all sorts of advanced features, but you're right, it probably wouldn't narrow down the exact step of the network that is slowing everything down.
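
Whatever tool produces the samples, narrowing down the time of day mostly means grouping them by hour instead of averaging the whole run. A small Python sketch of that step, assuming each sample is an ISO-style timestamp plus one number (throughput, latency, or transfer time); the function name is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def summarize_by_hour(samples):
    """Group (timestamp, value) samples by hour of day.

    samples: iterable of (ISO-8601 timestamp string, numeric value).
    Returns {hour: (sample count, median value, maximum value)}, ordered by hour.
    """
    buckets = defaultdict(list)
    for stamp, value in samples:
        buckets[datetime.fromisoformat(stamp).hour].append(float(value))
    return {
        hour: (len(values), median(values), max(values))
        for hour, values in sorted(buckets.items())
    }
```

The median is used rather than the mean so that one stray sample does not define the hour; the maximum is kept alongside it so the stray samples are still visible.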
posted by cschneid at 1:12 PM on December 2, 2007


Even with a VPN it makes sense that internet performance goes down during Happy Hour as long as the machines are communicating through consumer connections like DSL. Thing is, you can measure latency all day long for months on end, finding all sorts of conclusions about why things are slow, but at the end of it all you likely won't be able to do much about it. The problem in this case is not the latency (which makes the connection "unusable"), but the method by which you are connecting the machines. The latency is just the symptom.

For links across town you could get a PVC (permanent virtual circuit) or better yet a dry (alarm) pair of copper and run your own DSL. If you get two dry pairs you can roll your own ethernet, straight up.
posted by rhizome at 1:29 PM on December 2, 2007


FWIW, it sounds like your radiologists aren't having a problem with latency, but with throughput (at the packet level, that is— it's affecting the latency of downloading the entire file). So it might not even show up using something like ping or traceroute that mostly measures latency and reliability.

ydnagaj's suggestion is a good one too, I've had slowdowns that appeared network-related but turned out to be a backup, RAID reconstruction, or other batch job eating all the disk i/o bandwidth on the remote machine.

If I were you, I'd write a script that downloaded a typical radiology-relevant file once an hour for a week, printing the time at the start and end and accumulating in a log file; export all the transfer times to, say, excel (perhaps via a roundtrip to perl on a real computer to parse it out and convert to elapsed time), and graph them. That should make it pretty clear whether there is a problem, and once you know that, you can start working on figuring out what kind of problem it is.
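
A sketch of such a script in Python, under the assumption that a representative file can be fetched by URL (over HTTP, say) through the VPN; if studies are only reachable through the viewer software, the fetch step would need to be replaced. The URL and file names are hypothetical. It logs elapsed seconds directly, so no separate conversion pass is needed.

```python
import csv
import time
import urllib.request
from datetime import datetime


def timed_fetch(url, chunk_size=64 * 1024):
    """Download url, discarding the data; return (bytes received, elapsed seconds)."""
    start = time.monotonic()
    total = 0
    with urllib.request.urlopen(url, timeout=600) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start


def log_transfer(url, csv_path, fetch=timed_fetch):
    """Time one download and append (start time, url, bytes, seconds, status) to csv_path."""
    started = datetime.now().isoformat(timespec="seconds")
    try:
        size, elapsed = fetch(url)
        row = [started, url, size, round(elapsed, 2), "ok"]
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        # A failed transfer is evidence too, so it is logged rather than skipped.
        row = [started, url, 0, "", repr(exc)]
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow(row)
    return row
```

Scheduling a call such as `log_transfer("http://imageserver.example/sample-study", "transfer_log.csv")` once an hour for a week would produce one row per attempt, ready to graph.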
posted by hattifattener at 3:50 PM on December 2, 2007


I use two methods for monitoring throughput and latency. For throughput I configure MRTG to tcpblast the remote host. tcpblast is included in MRTG as C source. For latency I use smokeping with a standard icmp ping.

Both have a graphing capability and I'm able to detect changes in throughput and latency going back a year.
posted by dereisbaer at 5:17 PM on December 2, 2007


(The advantage of transferring an actual radiology file is it would detect the various other possible causes of a slowdown. If you're trying to measure how fast your radiologists get their data, measure that, don't measure something else instead unless you know for sure that the something else is the problem.)
posted by hattifattener at 7:13 PM on December 2, 2007

