Sudden, odd network slowdowns
July 25, 2021 7:53 AM

At my small office we've been having very odd, sudden network slowdowns. They seem to come out of nowhere and are a real nuisance. They slow down the network CCTV (10 second lag), totally incapacitate the IP phone system, cause very slow web browsing, etc. We restart various switches and the main router, and eventually something seems to help and normal service is resumed, but if I'm not mistaken it isn't instant and takes an hour or so to 'recover' back to normal operation...

How would I go about diagnosing WHERE the bottleneck is? Is it possible the main (DSL) router is getting overheated and causing a slowdown in all traffic passing through it? How would I diagnose that? Thank you
posted by dance to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
Automated backups of some computer running on a schedule and flooding the network? Some other scheduled task?

Do you see other WiFi network names showing up? It could be interference.

Can you provide any more details about the network (wireless or wired?) and clients (Mac, Windows, IoT, point-of-sale, cameras....)?
posted by wenestvedt at 8:13 AM on July 25, 2021 [1 favorite]


Response by poster: Sorry!

Ethernet network.
Clients: Windows, IoT, POS, cameras, printers, IP PBX

All has run fine for years - but we've had two slowdowns in the last week and they're very disruptive and I'm figuring another will probably happen this week.

Not a backup flooding the network as there's nothing like that happening.

Could it be an external DDOS?

If it's something like a device (switch or router) overheating or simply getting (weirdly) overloaded - how would I diagnose? Pings from point to point?

thanks!!
posted by dance at 8:33 AM on July 25, 2021


It is absolutely possible that the router overheating is causing problems. I have experienced that before with cheap consumer-grade routers. I don’t know if I’d say it’s the most likely scenario, but it does happen. If it’s in an enclosed space, you could try taking it out, or moving it away from other heat-generating devices. Or even point a fan at it.

There’s an app called Ping Plotter that can be helpful in finding exactly where in path between you and the internet that things are getting bad (but if it’s within your internal network, it won’t really help with that).

Is it possible that your ISP is oversubscribed, and you’re seeing this problem at their peak time? You mention “CCTV” cameras; are those wholly within your network, or do they use some cloud service?
posted by primethyme at 8:36 AM on July 25, 2021 [1 favorite]


If you have DSL Internet, and you're routing all your traffic through the DSL modem, then I would feel pretty confident that that box is your bottleneck.

Have you segmented your network out, so the IoT devices are on one (wired!) switch, and your cameras are on another switch, and your money-related devices are on another? That shifts some traffic off of the router, and also improves security.

(If you haven't segmented your network yet, you really ought to retain a local nerd who can overhaul your network. These days, it's very dangerous to mix traffic, and also may not be PCI-compliant.)
posted by wenestvedt at 8:54 AM on July 25, 2021 [2 favorites]


If you are on DSL, maybe talk to your ISP and have them run some tests to make sure it's not the line that's having issues.

We used to only have access to DSL. When this happened in my small office, it was the main modem/router slowly going bad. Replacing it fixed most of the issues we were having. I think that being very systematic with how you shut down things when it slows down can tell you where the issue lies. So don't just reboot everything in an attempt to get back online. Just do the main modem/router and then see what happens. If that doesn't work, do the switches one at a time. That should help pinpoint it.

And yea, what wenestvedt said.
posted by gemmy at 9:08 AM on July 25, 2021 [1 favorite]


Do your switches have web GUIs that you can log into? Some switches will show you real-time information on network utilization by port number. You can see if there is some device on your network that is chewing-up all your bandwidth.
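If a switch only shows raw byte counters per port rather than a utilization figure, the arithmetic is simple enough to do by hand. Here is a minimal Python sketch, assuming you note the counters twice a known number of seconds apart. The port names, counter values, and 100 Mbit/s link speed are made up for illustration, and it ignores counter wraparound.

```python
from typing import Dict


def port_utilization(before: Dict[str, int], after: Dict[str, int],
                     interval_s: float, link_bps: float) -> Dict[str, float]:
    """Return per-port utilization as a fraction of link speed (0.0 to 1.0).

    Assumes the counters are cumulative byte counts and did not wrap
    around between the two readings.
    """
    result = {}
    for port, start_bytes in before.items():
        delta_bytes = after[port] - start_bytes
        # Bytes to bits, then divide by what the link could carry in the interval.
        result[port] = (delta_bytes * 8) / (interval_s * link_bps)
    return result


# Hypothetical byte counters read 10 seconds apart on a 100 Mbit/s switch.
before = {"port1": 1_000_000, "port2": 5_000_000}
after = {"port1": 1_500_000, "port2": 105_000_000}

usage = port_utilization(before, after, interval_s=10, link_bps=100_000_000)
busiest = max(usage, key=usage.get)
print(busiest, round(usage[busiest], 2))
```

A port sitting near 1.0 for long stretches is the one worth tracing back to a device.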

However, my suspicion is with the DSL line. DSL is a crappy, outdated technology. Certainly, my own experience with it has been pretty bad. You may want to consider switching to something else -- coaxial cable or fiber. Granted, it might be more expensive, but it's probably a worthwhile investment.
posted by alex1965 at 9:33 AM on July 25, 2021 [1 favorite]


I am a network technician but I am not your network technician.

I can't even begin to give you what would be considered a proper answer with this information. How many clients? Are there servers? How many cameras? How many phones? How many switches? What make and model of switches? What are their capabilities? How are they connected? How's the network laid out? Are there VLANs? I'm not asking you to provide this information now in this AskMe, because I'm not your network technician, but those are the questions I would ask you in the first 15 minutes as you're showing me around the site if I was. I've been on sites with problems like you describe and I'm in full-on inquisitor mode for a bit until they lead me upstairs to the back attic office that has a 3Com networking hub from 2006 with 2 phones, 2 PCs, a camera, and an 802.11ac AP plugged into it. "Well, there's part of your problem."

There's a ton of things that could be happening. It's a crap shoot. That said, DSL does suck, but as it stands we don't know if the problem is inside your network, outside your network, or somewhere at or near the demarc. Maybe your DSL modem is flaking out. You can start investigating where the problem lies by setting up some constant pings to various locations inside and outside your network, and looking at them to see if you can identify a pattern. On one PC I would start pings running to Cloudflare and Google DNS (1.1.1.1 and 8.8.8.8 respectively; on Windows use the -t option to run the ping continuously), and pings to some of the stuff on your internal network, like a server, a printer, and a workstation (and if you have multiple switches, ping some internal stuff that has to traverse those other switches as well). Then just pay attention to them and see if you get lag spikes affecting everything, or just the external sites, or just things connected to a different switch than the one the pinging machine is on, etc.
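To make the pattern-spotting concrete, here is a rough Python sketch of how the ping results could be read once you have them. It does not run ping itself; it assumes you have already collected round-trip times in milliseconds per target. The internal addresses, the sample numbers, and the 100 ms threshold are invented for illustration.

```python
from statistics import median
from typing import Dict, List


def slow_targets(samples: Dict[str, List[float]], threshold_ms: float) -> set:
    """Targets whose median round-trip time exceeds threshold_ms."""
    return {t for t, rtts in samples.items() if rtts and median(rtts) > threshold_ms}


def locate(samples: Dict[str, List[float]], internal: List[str],
           external: List[str], threshold_ms: float = 100.0) -> str:
    """Give a rough hint about where the lag is, based on which targets are slow."""
    slow = slow_targets(samples, threshold_ms)
    slow_internal = slow & set(internal)
    slow_external = slow & set(external)
    if not slow:
        return "no lag seen"
    if not slow_internal and slow_external:
        # Local hops are fine, so look at the router, modem, line, or ISP.
        return "external only"
    if slow_internal == set(internal):
        # Even local targets lag: look near the pinging machine or a core switch.
        return "everything"
    # Only part of the LAN lags: look at the switch or cable those targets share.
    return "some internal"


# Hypothetical samples taken during a slowdown (milliseconds).
samples = {
    "1.1.1.1": [450.0, 620.0, 510.0],
    "8.8.8.8": [480.0, 700.0, 530.0],
    "192.168.1.10": [1.0, 2.0, 1.0],   # assumed server
    "192.168.1.20": [1.0, 1.0, 3.0],   # assumed printer
}
hint = locate(samples,
              internal=["192.168.1.10", "192.168.1.20"],
              external=["1.1.1.1", "8.8.8.8"])
print(hint)
```

With these made-up numbers the internal targets stay fast while both external ones lag, which would point toward the router, modem, or line rather than the switches.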

Reiterating, there's a ton of potential things it could be. Maybe there's a short in a patch cable that only gets jostled once a week when someone takes the vacuum cleaner out of the "networking closet" where you also keep the janitorial supplies. I had a site where several machines were randomly dropping packets all at once. Beat my head against the wall on that for a while, checking patch cables, trying alternate NICs, until the receptionist pointed out that on one of the desks there was a VOIP phone that had "DO NOT USE" written on it with magic marker that I hadn't noticed. Turns out that phone had been struck by lightning, I was the one who had written on it, and I had replaced it a month previously, but someone pulled the phone back out of the trash after I left and hooked it back up. The fried NIC in it was sending out junk that was making the switch have a damn fit and knocking my other clients offline intermittently. As soon as I took that phone offline (and physically destroyed it so it couldn't be hooked up again) the problem went away for good.

That said, DSL does suck.
posted by glonous keming at 11:43 AM on July 25, 2021 [15 favorites]


Another issue could be Windows Update on multiple machines trying to download all at once.
Look at pausing updates to see if it helps, and limit updates to nighttime.
posted by nickggully at 11:44 AM on July 25, 2021 [1 favorite]


If you have an on-site PBX and your phones are losing connectivity to it (either you can't call the desk next to yours, or the phones just reboot all the time), then it's unlikely the problem is upstream since all the traffic from your phone to the PBX is taking place on the local network. If your DSL router is also the main switch between you and the PBX it could still be the router.

Since you imply multiple switches, I'm guessing you have an Ethernet loop somewhere in your office. Anything with two network cables plugged into it could be causing it. It's very easy to turn a PC into a virtual switch if you're using it to run VMs, for example.

The way to diagnose this would be to install Wireshark on a computer that you have admin rights to. (Even better if it's a laptop you can move around the office, plugging in to different switches/wall ports.) Wireshark is a network sniffer - it shows you what packets your computer is receiving and sending. Make a capture for thirty seconds or so and look at the traffic - you should see your own IP address as either the source or destination in most packets, and a little bit of traffic called "ARP" or "MDNS" or a few other things, addressed to "255.255.255.255" or "0.0.0.0". Those are used for discovery of other IP addresses or services.

When the loop is going again, do another capture. If the rate of packets has gone way way up, and most of them are NOT addressed to you, you have a loop. It's not easy to narrow these down, but generally you just unplug cables until it stops, being careful to note what you removed and where to put it back. Eventually you narrow it down to "when I unplug this cable everything is OK" and then you focus on what is on either end of that cable to see why it's part of a loop.
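The comparison between the two captures can be reduced to two numbers: packets per second, and the share of packets that don't involve your own machine. Here is a small Python sketch of that check, assuming you have exported each capture as a list of (source, destination) address pairs. The addresses, packet counts, and the 10x and 50% thresholds are illustrative guesses, not established cutoffs.

```python
from typing import List, Tuple


def capture_stats(packets: List[Tuple[str, str]], my_ip: str,
                  seconds: float) -> Tuple[float, float]:
    """Return (packets per second, fraction of packets not involving my_ip)."""
    if not packets:
        return 0.0, 0.0
    foreign = sum(1 for src, dst in packets if my_ip not in (src, dst))
    return len(packets) / seconds, foreign / len(packets)


def looks_like_loop(baseline: Tuple[float, float], incident: Tuple[float, float],
                    rate_factor: float = 10.0, foreign_share: float = 0.5) -> bool:
    """True if the rate jumped sharply AND most packets are not addressed to us."""
    base_rate, _ = baseline
    rate, foreign = incident
    return rate > base_rate * rate_factor and foreign > foreign_share


MY_IP = "192.168.1.50"  # assumed address of the capturing laptop

# Hypothetical 30-second captures: a quiet one and one taken during a slowdown.
quiet = [(MY_IP, "1.1.1.1")] * 25 + [("192.168.1.20", "255.255.255.255")] * 5
storm = [("192.168.1.20", "255.255.255.255")] * 2900 + [(MY_IP, "1.1.1.1")] * 100

baseline = capture_stats(quiet, MY_IP, seconds=30)
incident = capture_stats(storm, MY_IP, seconds=30)
print(baseline, incident, looks_like_loop(baseline, incident))
```

A positive result only says the traffic pattern fits a loop; finding the offending cable still comes down to unplugging things one at a time.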
posted by five toed sloth at 11:50 AM on July 25, 2021 [2 favorites]


Another issue with DSL I have had: the local network was throwing around large “giant” packets (a big MTU, or Maximum Transmission Unit), forcing the poor modem to do the extra work of fragmenting them. Try setting the MTU of all clients to a size the DSL router can handle.
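For a sense of the numbers involved, here is the arithmetic as a short Python sketch. It assumes the DSL link uses PPPoE, which is common but not something stated in this thread. PPPoE takes 8 bytes out of the standard 1500-byte Ethernet MTU, and a ping carries a 20-byte IPv4 header (without options) plus an 8-byte ICMP header.

```python
ETHERNET_MTU = 1500   # standard Ethernet MTU in bytes
PPPOE_OVERHEAD = 8    # PPPoE header plus PPP protocol field (assumes PPPoE is in use)
IPV4_HEADER = 20      # IPv4 header without options
ICMP_HEADER = 8       # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented packet of this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


pppoe_mtu = ETHERNET_MTU - PPPOE_OVERHEAD
print(pppoe_mtu, max_ping_payload(pppoe_mtu))   # 1492 1464
```

On Windows, ping's -f option sets the Don't Fragment flag and -l sets the payload size, so pinging an outside host with -f -l 1464 and then with -f -l 1472 is one way to check whether the path really tops out at 1492.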
posted by nickggully at 12:00 PM on July 25, 2021


If somebody inside your network has installed a BitTorrent client on their PC, that could easily bring a DSL-based Internet service to its knees on an irregular and unpredictable basis.

ADSL splitter/filters also regularly get eaten by lightning-induced voltage spikes on phone lines, and when they do, they really do a number on DSL connection quality.

If you can find the DSL SNR Margin and Line Attenuation numbers somewhere in your DSL modem's admin web interface, those would be worth keeping an eye on as well. See if the attenuation number gets worse when your network is having issues.

But really, there are endless faults that can cause this kind of thing and your fastest path to remediation is almost certainly going to involve paying some money for a glonous keming of your very own to come and investigate this on site.
posted by flabdablet at 3:35 PM on July 25, 2021 [1 favorite]


One culprit might be bufferbloat, which you can test for. This was a particular problem on older generations of network hardware, which incorporated large cache sizes but didn't manage them very intelligently. Try running the test at a time when the network isn't heavily loaded.
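What such a test measures is how much latency rises once the link is saturated. Here is a minimal Python sketch of that comparison, assuming you already have ping times (in milliseconds) taken while the line is idle and again while something is saturating it. The sample values are made up.

```python
from statistics import median
from typing import List


def latency_increase_ms(idle: List[float], loaded: List[float]) -> float:
    """Median latency under load minus median latency at idle."""
    return median(loaded) - median(idle)


# Hypothetical round-trip times in milliseconds.
idle_rtts = [20.0, 22.0, 21.0, 23.0, 20.0]
loaded_rtts = [180.0, 350.0, 400.0, 290.0, 310.0]

increase = latency_increase_ms(idle_rtts, loaded_rtts)
print(increase)
```

An increase of a few milliseconds is unremarkable; an increase of hundreds of milliseconds, as in these invented numbers, is the kind of result that would make VoIP unusable while a big transfer is running.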

The solution generally involves tweaking your router configuration (which may or may not involve replacing the router if it's a crappy one), enabling QoS and prioritizing latency-sensitive traffic like VoIP, and ensuring that the router isn't trying to shove more packets down the modem's throat than it can actually handle.
posted by Kadin2048 at 5:51 PM on July 25, 2021


Diagnosing intermittent network latency is hard! In 2011, I spent 6 months trying to figure out why my service's 99th-percentile latency was out of SLO. Before I found out what it was (rack switches getting overwhelmed and dropping packets when too many instances of a different, high-traffic service were located on the same rack) I found like six things that it could have been but wasn't.

The key thing with this is you have to log the hell out of everything. Good routers and switches will be able to send you continuous diagnostics, which you will then need a system to store and display. Because you don't know what it is, you want to get everything -- upstream and downstream traffic per client, dropped packets, latency, queue lengths -- and then hopefully you can find out which graphs spike when this happens which will give you a bit of a clue.
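As a rough illustration of "find out which graphs spike", here is a Python sketch that ranks logged metrics by how much higher they run during a slowdown than outside one. It assumes each metric is a list of samples taken at the same times, plus a matching list of flags marking which samples fall inside an incident. The metric names and values are invented.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_spikes(metrics: Dict[str, List[float]],
                incident: List[bool]) -> List[Tuple[str, float]]:
    """Rank metrics by (mean during incidents) / (mean outside incidents)."""
    ranked = []
    for name, values in metrics.items():
        during = [v for v, bad in zip(values, incident) if bad]
        normal = [v for v, bad in zip(values, incident) if not bad]
        if not during or not normal:
            continue  # nothing to compare against
        baseline = mean(normal)
        if baseline > 0:
            ratio = mean(during) / baseline
        else:
            # A zero baseline: any activity at all during the incident is a spike.
            ratio = float("inf") if mean(during) > 0 else 1.0
        ranked.append((name, ratio))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical samples; the fourth and fifth fall inside a slowdown.
metrics = {
    "dropped_packets": [0, 1, 0, 40, 60, 1],
    "latency_ms": [20, 22, 21, 300, 280, 23],
    "upstream_kbps": [500, 520, 510, 530, 490, 505],
}
incident = [False, False, False, True, True, False]

ranked = rank_spikes(metrics, incident)
for name, ratio in ranked:
    print(name, round(ratio, 2))
```

A ratio near 1 means the metric didn't move; the ones at the top of the list are where to look first, though a spike can be a symptom rather than the cause.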

If you are running your office on consumer grade network hardware, it is time to upgrade it for three reasons:
  • Consumer hardware won't have enough logging tools to figure out what's going on;
  • It's entirely possible that your problem will go away once you change out your routers/switches;
  • Spending thousands of dollars of your time to save a few hundred bucks in equipment cost is a losing proposition.
Good luck!
posted by goingonit at 6:43 PM on July 25, 2021 [2 favorites]

