Calling network engineers: help me understand latency and packet loss
May 19, 2014 10:32 AM
I’m trying to be more proactive with regard to monitoring the network at my workplace, addressing any issues that might arise, and answering that ever-present question: “My computer is so slow today! Is the problem on our end, or on their end?”
I’m the IT guy (largely self-taught) at a small non-profit organization. I had heard good things about a tool called SmokePing, so I recently installed it. It certainly generates interesting-looking graphs, but I end up scratching my head and wondering what it all means.
See, for example, these five graphs that I’ve combined into a single JPG file.
The host that’s shown at the bottom-right (labeled “Frontier”) is the next hop after our fiber-optic modem. The other four hosts correspond to the cloud apps that we use in our daily operations. The graphs cover the same time span. We access the Internet through a fiber-optic line that's roughly 50 megabits/second down and 3 up. We don't host our website (or anything else) internally.
Here are some questions that occur to me:
- What is the meaning of the slowly-rising hump that’s present in Gmail, Blackbaud, and PetPoint? And why doesn’t it show up in CyberSource?
- I often see sharp, step-wise increases in latency, followed some hours later by sharp, step-wise decreases. What does this mean?
- The particular graphs I included with this post don’t indicate much packet loss, but I sometimes see spikes in both packet loss and latency (the spikes will typically last for a few hours). What am I to make of this? How can I tell if the problem is with our ISP, or something downstream of us? And at what point do I start complaining to our ISP?
- In general, what do I need to learn in order to make the best use of the information that SmokePing gives me?
The "Frontier" graph just shows that your connection to your ISP's local office is basically rock-solid and low-latency. However, lately the problem with ISPs is that they are underprovisioning their peering with the networks used by content providers. That's why you see the contention that k5.user points out. The step increases of 20ms starting at midnight are weird to me. (And it does show up in CyberSource, but the scale compresses it. It seems consistently 20ms).
posted by wnissen at 10:52 AM on May 19, 2014 [1 favorite]
Best answer: Not a network engineer, but I do have some knowledge in the area, so I'll take a shot.
What is the meaning of the slowly-rising hump that’s present in Gmail, Blackbaud, and PetPoint? And why doesn’t it show up in CyberSource?
First off, those latency numbers don't look like a problem to me, so I wouldn't worry about them too much.
If you see the same graph every day, it could be increased traffic between your ISP and the rest of the Internet. Given the time of day, it could be people Netflixing at home. However, the fact that the latency went up and down so fast right at 6:00 PM and 12:00 AM tells me that it's something scheduled, possibly some sort of throttling or QoS by your ISP or, more likely, by downstream network providers between you and the cloud services you're looking at.
It doesn't show up in Cybersource (probably) because it's closer in hops to you and your traffic to them doesn't cross whatever network where management is going on.
I often see sharp, step-wise increases in latency, followed some hours later by sharp, step-wise decreases. What does this mean?
Again, assuming the graphs are accurate, it looks like some sort of network management at the ISP level. That's also assuming you're not running remote sites where doors open and close right at a certain time, where traffic would all jump and drop all at once.
The particular graphs I included with this post don’t indicate much packet loss, but I sometimes see spikes in both packet loss and latency (the spikes will typically last for a few hours). What am I to make of this? How can I tell if the problem is with our ISP, or something downstream of us? And at what point do I start complaining to our ISP?
Use tracert (traceroute on Unix-like systems) to find each hop from you to an external host and then configure your app to monitor each of those hops. You'll have a bunch of graphs in order of distance from you. When the latency or loss on the graphs starts to get big, you've found the slowdown. You can use a reverse DNS lookup to find a host name, or ARIN's WHOIS lookup (arin.net) to find out who owns the IP block, to try and figure out whether or not it's your ISP. (More simply, the more hops away it is, the less likely your ISP owns it.)
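As a rough sketch of the first step (turning a trace into a list of hops you could then add as monitoring targets), something like the following would work. The function name `hop_addresses` and the sample trace are made up for illustration, and the addresses come from the reserved documentation ranges; real tracert/traceroute output varies by platform, so treat the parsing as an assumption to adapt.

```python
import re

# Matches a dotted-quad IPv4 address anywhere on a line.
IPV4 = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def hop_addresses(trace_output):
    """Return (hop_number, address) for each hop line that shows an IPv4 address."""
    hops = []
    for line in trace_output.splitlines():
        fields = line.split()
        # Hop lines start with the hop number; skip headers and blank lines.
        if not fields or not fields[0].isdigit():
            continue
        match = IPV4.search(line)
        if match:  # hops that timed out ("* * *") show no address
            hops.append((int(fields[0]), match.group(1)))
    return hops


# Hypothetical sample output, not taken from the poster's network.
SAMPLE = """\
traceroute to example.com (203.0.113.10), 30 hops max
 1  192.168.1.1  1.2 ms  1.0 ms  1.1 ms
 2  198.51.100.1  8.9 ms  9.3 ms  9.0 ms
 3  * * *
 4  203.0.113.10  31.5 ms  30.8 ms  31.1 ms
"""

for number, address in hop_addresses(SAMPLE):
    print(f"hop {number}: {address}")
```

For the reverse DNS step, Python's `socket.gethostbyaddr(address)` will return a host name when the address has a PTR record and raise an error when it doesn't.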
In general, what do I need to learn in order to make the best use of the information that SmokePing gives me?
I'm not clear if you're seeing slowdowns with your cloud providers or Internet access. If you're not, I wouldn't worry about it. I've run several small networks with orders of magnitude less bandwidth than you have, hosting services (like e-mail) internally, and they were mostly fine.
posted by cnc at 10:52 AM on May 19, 2014
Best answer: It looks like the test points you've chosen are over a couple of different paths. It's hard to draw solid conclusions, but here are a few things to think about.
First, you should be much more concerned about packet loss than latency. If you're not a gamer the difference between 30 ms and 80 ms is minimal. When you get to 150 ms and up it starts to become noticeable for something like web browsing. You're below those thresholds.
Routers make poor test points because they prioritize moving packets over responding to pings, so the results tend to vary a lot. They also lack processing power compared to servers. That's probably the main reason for the spikiness in the Frontier graph. If Frontier gave you an IP address for a local DNS server, you would probably see more consistent results.
Regarding the Gmail and Blackbaud graphs: that looks like a path change. Latency due to congestion tends to fluctuate up and down, since it's caused by packets queueing on a busy interface. The latency changes based on the depth of the queue. Typical network usage doesn't slam a router queue for hours on end, at least not without causing significant loss. I would guess that someone upstream from you changed from a short path to Gmail to a longer one, and then switched back. If that happens regularly they are probably using some sort of route optimization product that adjusts routes based on a variety of factors (cost being a big one, so it's not always the "best path" from a performance perspective).
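To make the "clean step versus bouncing queue" distinction concrete, here is a minimal sketch that compares two windows of round-trip times. The function name `describe_shift`, the 3x multiplier, and the 5 ms floor are all assumptions chosen for illustration, not anything SmokePing itself computes.

```python
from statistics import median, pstdev


def describe_shift(before, after):
    """Compare two windows of round-trip times (ms) from before and after a change."""
    shift = median(after) - median(before)
    # Spread within each window: how much the samples bounce around.
    jitter = max(pstdev(before), pstdev(after))
    # Assumed rule of thumb: a shift much larger than the jitter looks like a
    # clean step (consistent with a path change); otherwise the samples are
    # fluctuating (consistent with queueing).
    if abs(shift) >= 5 and abs(shift) > 3 * jitter:
        return "step"
    return "fluctuating"


quiet = [30, 31, 30, 29, 30, 31]
rerouted = [50, 51, 50, 49, 50, 51]   # flat, just 20 ms higher
congested = [35, 70, 42, 90, 38, 60]  # higher on average, but all over the place

print(describe_shift(quiet, rerouted))
print(describe_shift(quiet, congested))
```

With these made-up samples the first comparison comes out as a step and the second as fluctuating, which is the same visual difference as between a flat plateau and a noisy hump on the graphs.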
Cybersource must be on a different path from Gmail and Blackbaud. If you compare a traceroute to the three sites you will probably see a divergence early on in your ISP's network, so it isn't affected by whatever is rerouting the other two.
The other thing to keep in mind is that sites that use a Content Delivery Network (CDN) can jump around in latency. CDNs have caches in multiple locations and use geolocation tricks to send you to the "closest" uncongested server. As load goes up your traffic can get shifted to another node that is further away but isn't busy. Not sure if you are using IP addresses or DNS names in SmokePing, but if you're doing a DNS lookup you can end up testing to a different endpoint.
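One way to check whether a monitored name can land on different endpoints is to log what it resolves to over time and compare. A minimal sketch, assuming Python's standard `socket` module; the function name `resolved_addresses` is made up for illustration.

```python
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses the resolver gives for hostname right now."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Usage idea: call this for each monitored name every few minutes and record the
# result. If the set of addresses changes around the time the latency steps,
# the graph may be showing a different server rather than a slower network.
```

If the answers do change, pinning the monitoring target to a fixed IP address keeps the test point the same, at the cost of no longer measuring whichever server users are actually sent to.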
The Petpoint.com graph looks much more like a congestion problem, or a server that's too busy to keep up with requests. Because other sites don't display that behavior I'd say the problem is closer to Petpoint than it is to you.
Generally what you want to look for is problems that affect multiple endpoints, as they will point to issues closer to your end. If you see sky-high latency or increased loss to multiple destinations you should report it to your ISP. You should also try to monitor your upstream usage - it's not hard to max out a 3 Mbps link.
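The "how many endpoints are affected" reasoning can be written down as a small rule. This is only a sketch: the function name `likely_fault_location`, the 2% loss threshold, and the sample numbers are assumptions for illustration, not measurements from the thread.

```python
def likely_fault_location(loss_by_target, loss_threshold=2.0):
    """Guess where a problem sits from per-target packet loss percentages."""
    affected = [name for name, loss in loss_by_target.items() if loss >= loss_threshold]
    if not affected:
        return "no significant loss"
    if len(affected) > 1 and len(affected) == len(loss_by_target):
        return "all targets affected: suspect your own link or ISP"
    if len(affected) > 1:
        return "several targets affected: suspect a shared upstream path"
    return f"only {affected[0]} affected: suspect that destination or its path"


# Hypothetical loss percentages for the monitored hosts.
print(likely_fault_location({"Gmail": 0.0, "Blackbaud": 0.1, "PetPoint": 6.0, "CyberSource": 0.0}))
print(likely_fault_location({"Gmail": 5.0, "Blackbaud": 4.2, "PetPoint": 6.0, "CyberSource": 3.9}))
```

The second case, where every destination shows loss at once, is the one worth taking to the ISP.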
Overall, those results aren't bad. If a customer of mine sent those to me I would be curious why the latency is so variable, but I wouldn't say they have a network problem. Petpoint needs to do some investigating, though.
Disclaimer: I am a network engineer, but I've never worked on cable/DSL/FIOS type services, so there may be something about those that explains the latency variations.
posted by five toed sloth at 1:58 PM on May 19, 2014 [2 favorites]
This thread is closed to new comments.
posted by k5.user at 10:46 AM on May 19, 2014