Help me help the machines talk!
December 14, 2012 12:58 PM

Networking filter: I'm a young (still learning much) SysAdmin for a pretty large site. I inherited the network configuration, and now I'm trying to troubleshoot some strange issues: workstations losing connectivity without rhyme or reason, and slow web browsing.

We are having issues with some workstations on our domain losing connection to servers on the domain. We are also having issues browsing the web efficiently – pages load quickly after a 4-6 second delay.

The configuration is as follows:
- Cable Modem connected directly to Hardware Firewall
- Hardware Firewall – set as the Router / Primary Gateway
- Domain Controller – set as the DNS for the domain (running DNS and DHCP)
- Member Servers (4x) – each with a static IP, no DNS or DHCP roles
- 3x gigabit network switches (unsure of the config, done before my time at the site)
- Workstations – we have approx 90 workstations that get IPs from DHCP on the DC

The DHCP scope is set from to Devices with static IP are outside of the DHCP range. Some devices have DHCP reservations.

Domain resolution is slow in web browsers on client PCs. Seems to be an issue with DNS, because after about 4-6 seconds of stall time everything loads very quickly. If I run ipconfig /all on a client PC, I see the primary gateway as the firewall IP, and the DNS as the domain controller IP.

On the Domain Controller, for the active NIC, DNS is set manually to the DNS provided by the ISP.

The forwarding zone on the domain controller is set to use the ISP DNS IPs and Google’s DNS IPs.

On the hardware firewall, DNS is set to the ISP DNS IPs.

How can I troubleshoot? Are there error logs I can check to diagnose our network issues? Are there common points of failure I can check? Might this be an issue with our network switches?

Thanks MeFi! I much appreciate the help.
posted by roygbv to Computers & Internet (9 answers total) 1 user marked this as a favorite
Use nslookup to check the DNS servers in order. It sounds like the first one is down, so each new lookup tries the dead server, times out, and then moves on to the second. From the interactive nslookup prompt, point it at each server in turn and look up a hostname:

>server ip.for.dns.server
>host.to.look.up
>server other.dns.ip.addr
>host.to.look.up

If the first server times out and the second doesn't, drop the first one from the configuration.
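
If you'd rather script that check than type it into nslookup, here is a minimal Python sketch of the same idea: send one DNS query straight to a given server over UDP and time the answer. The function names and the 3-second timeout are my own assumptions, and the server addresses are whatever you pass in. It only checks that some reply came back, not that the answer is correct.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN
    return header + qname + struct.pack("!HH", 1, 1)


def time_lookup(server_ip, hostname, timeout=3.0):
    """Return seconds until server_ip sends any reply, or None if none arrives."""
    packet = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server_ip, 53))
            sock.recvfrom(512)
        except OSError:
            # Covers a timeout as well as an unreachable server
            return None
        return time.perf_counter() - start


def report(servers, hostname):
    """Print one line per server: response time in ms, or NO REPLY."""
    for server in servers:
        elapsed = time_lookup(server, hostname)
        status = "NO REPLY" if elapsed is None else f"{elapsed * 1000:.0f} ms"
        print(f"{server}: {status}")
```

Call report() with the list of DNS server IPs in the order they are configured, plus any hostname you like. A server that consistently shows NO REPLY while the next one answers in a few milliseconds is the one to drop.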
posted by bfranklin at 1:16 PM on December 14, 2012

Is the firewall doing web caching?
posted by empath at 2:22 PM on December 14, 2012

This sounds like a case where Wireshark is your friend. (Actually, Wireshark is almost always your friend, but this is an especially good time to take advantage of the friendship.)

You have a case where Something is happening for 4-6 seconds before traffic really starts to flow. The best way to understand what that Something is is to load Wireshark on an affected workstation and capture the traffic on the interface while it is occurring. You'll be able to see what DNS server it's trying to go to first, whether it gets a reply or gets redirected, whether it's sending a SYN but taking a long time to get the SYN/ACK back -- all the different ways a session can get held up. Packets are timestamped so you can see where the delays are coming from.

Packet capture is a powerful tool, and can be used for much mischief, so definitely check your organization's policies before you fire it up. However, that same power makes it the best resource for troubleshooting all sorts of network issues.
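
If a packet capture isn't an option right away, a rough first cut is to time the name lookup and the TCP handshake separately from an affected workstation. The Python sketch below is not a substitute for Wireshark. It just uses the system resolver and a plain TCP connect, and the function names and the default port 80 are assumptions of mine.

```python
import socket
import time


def timed(func, *args):
    """Run func(*args) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


def time_phases(hostname, port=80, timeout=10.0):
    """Time the DNS lookup and the TCP handshake to hostname separately."""
    # Phase 1: name resolution through the system resolver
    dns_seconds, infos = timed(
        socket.getaddrinfo, hostname, port, socket.AF_INET, socket.SOCK_STREAM
    )
    address = infos[0][4]  # (ip, port) of the first result
    # Phase 2: TCP connect (SYN out, SYN/ACK back) to the resolved address
    connect_seconds, conn = timed(socket.create_connection, address, timeout)
    conn.close()
    return {"dns": dns_seconds, "connect": connect_seconds, "ip": address[0]}
```

Run time_phases() against a site you haven't visited recently, since the workstation caches lookups. If "dns" eats the 4-6 seconds and "connect" is quick, the stall is in name resolution. If it's the other way around, look at the path to the internet instead.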
posted by five toed sloth at 2:48 PM on December 14, 2012 [1 favorite]

Sounds like DNS, and as mentioned above the first DNS lookup is failing and it's falling back to a secondary server.
posted by iamabot at 3:07 PM on December 14, 2012

Please note, Wireshark is not going to give you terribly much more than nslookup in this case. DNS lookups normally run over UDP, with TCP used mainly for zone transfers and oversized responses, so you'll just be seeing UDP packets disappearing into the ether.
posted by bfranklin at 5:18 PM on December 14, 2012

Make sure the DHCP server is setting option 046 (node type) to 0x8. Then make sure the Computer Browser service is disabled on all workstations, run a WINS server on the domain controller, and add it as DHCP scope option 044. This tells all your machines to use the WINS server for any Windows network lookups, instead of leaving them to randomly 'elect' something on the network as a browse master. This will likely cut down quite a lot on the network lookup voodoo.

You probably don't 'need to' set the DNS server on the DC to use the ISP servers. It can look things up on its own. That way you're not at the mercy of how well the outside DNS servers respond, and it's one less debugging step. In RARE circumstances an ISP might not allow DNS lookups through its network, forcing you to use their DNS servers. But it's pretty unlikely, because it adds a degree of fragility to their infrastructure for pretty much no benefit.

And, of course, all internal client machines should be configured to use the internal DNS server and NOT anything external. Mixing internal and external servers will cause headaches, as the external servers won't know about your internal hosts, and this will make for confusing come-and-go problems with the internal hosts.
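
One way to spot-check that last point on a client is to save the output of ipconfig /all and pull out the DNS servers it lists. The Python sketch below assumes English-language ipconfig output, IPv4 only, and an internal DNS server on a private address. The sample text and all addresses in it are made up for illustration.

```python
import ipaddress
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical excerpt of `ipconfig /all` output, for illustration only
SAMPLE = """
   Default Gateway . . . . . . . . . : 192.168.1.1
   DNS Servers . . . . . . . . . . . : 192.168.1.10
                                       8.8.8.8
   NetBIOS over Tcpip. . . . . . . . : Enabled
"""


def dns_servers(ipconfig_text):
    """Collect IPv4 addresses listed under 'DNS Servers' in ipconfig /all output."""
    servers = []
    in_block = False
    for line in ipconfig_text.splitlines():
        if "DNS Servers" in line:
            in_block = True       # the first address sits on this line
        elif " : " in line or not line.strip():
            in_block = False      # any other labeled or blank line ends the list
        if in_block:
            servers.extend(IPV4.findall(line))
    return servers


def external_servers(servers):
    """Return the servers whose addresses are not in a private or reserved range."""
    return [s for s in servers if not ipaddress.ip_address(s).is_private]
```

Anything external_servers() returns for a workstation is a DNS server that won't know about your internal hosts.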
posted by wkearney99 at 1:56 PM on December 15, 2012 [1 favorite]

There's nothing wrong with fiddling with wireshark but the firehose of data it collects will likely do more to waste your time than actually tell you anything you'd know how to read.

On top of that, the best way to run it is with a 100 Mb network HUB (not a switch) and a spare machine that has an extra Ethernet port just for Wireshark traffic. You put the hub in between the points you want to monitor and use the spare machine to handle the captures. That way you can also run a browser or other tools on that machine via the normal network. Just make sure you use an Ethernet connection that's supported by Wireshark.

A hub makes this possible whereas a switch won't. Just don't leave the hub in there long term as you'd likely want to be using a direct connection at gigE speeds.
posted by wkearney99 at 2:01 PM on December 15, 2012

Many managed switches, including relatively inexpensive ones, let you mirror one port to another for the purpose of using a packet sniffer (like Wireshark) to diagnose problems. Also, Wireshark provides tools for filtering captured packets and stringing them together to help deal with the firehose of data.

People seem to be ignoring the first set of symptoms described: that workstations are losing their connections to local servers.

What do you mean when you say "losing," and what are the "connections" being lost? How long does it last? What rectifies the situation? If these are file servers, and the workstations are complaining that they can't access shares they've already established connections to, then either the DNS issue is a symptom of a deeper problem or it is a separate problem. It isn't the cause of the workstation disconnections.

If DNS is somehow implicated in the workstation disconnection problems, then it isn't going to be a problem with a DNS server outside your control. Why would an ISP server be involved in the resolution of local server names, unless your DNS namespace and internal DNS server configuration are bizarre?

My starting assumption would be that there is one underlying problem that explains both sets of symptoms.

Where I'd look:
The switches, starting with the switch the PDC is connected to, then any other switches servers are connected to, then any switches that the clients having problems with dropouts have in common.
The NIC in the PDC.

What I'd look for: signs of intermittently failing equipment. Signs of temporarily overloaded network links (if you have 90 workstations behind a cable modem with an undocumented network config, I'd assume that the network wasn't particularly well engineered, or engineered at all). Signs of misconfigured equipment, such as equipment that hasn't negotiated link speed or duplex operation properly.

Techniques: review event logs on the PDC, the servers, and the workstations affected by the dropouts. Set up ping probes between the servers and the affected workstations. Also ping the firewall's internal and external interfaces, and the first hop to the ISP. Review switch logs and metrics. Set up perfmon on the PDC and servers to track network counters, and review them. Take packet sniffer traces.
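
The ping probes can be as simple as a script left running on a server and on one affected workstation. Here is a minimal Python sketch, assuming the stock ping command is on the PATH. The function names, the 10-second interval, and the round count are placeholders of mine. The exit code is only a rough signal, since some ping implementations report success on an "unreachable" reply.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_command(host, system=None):
    """Build a one-shot ping command for the current (or given) OS."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def ping_once(host, timeout=5.0):
    """Return True if a single ping to host exits cleanly within timeout seconds."""
    try:
        result = subprocess.run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def probe(hosts, interval=10.0, rounds=6):
    """Ping every host once per interval and print a timestamped line per failure."""
    for _ in range(rounds):
        for host in hosts:
            if not ping_once(host):
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} NO REPLY from {host}")
        time.sleep(interval)
```

Give probe() the PDC, the member servers, the firewall's internal and external interfaces, and the first hop to the ISP. If failures to the PDC and the firewall line up in time, look at what they share (a switch, an uplink). If only one host drops, look at that host's NIC and port.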
posted by Good Brain at 7:50 PM on December 15, 2012

I should add: if it were just the DNS issues, I'd do what people have already suggested and set something up to regularly test resolution of both internal and external DNS names, then review the logs for patches of timeouts or much higher response times. Among the causes, I'd consider that your firewall and/or cable connection is getting backed up by traffic spikes.

But I really think the browser symptoms are symptoms of a broader issue, given the details we've been given.
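
For the log review part of that DNS test, here is a small Python sketch of one way to pick out the bad patches. It assumes you have already logged each lookup as a (timestamp, seconds) pair, with None standing for a timeout. The factor of 5 over the median is an arbitrary threshold I picked, not a standard.

```python
from statistics import median


def slow_patches(samples, factor=5.0):
    """Group consecutive samples that timed out or ran far above the median.

    samples: list of (timestamp, seconds_or_None); None means a timeout.
    Returns a list of (first_timestamp, last_timestamp, sample_count).
    """
    answered = [seconds for _, seconds in samples if seconds is not None]
    baseline = median(answered) if answered else 0.0
    patches = []
    current = []
    for stamp, seconds in samples:
        bad = seconds is None or seconds > factor * baseline
        if bad:
            current.append(stamp)
        elif current:
            # A good sample closes the current run of bad ones
            patches.append((current[0], current[-1], len(current)))
            current = []
    if current:
        patches.append((current[0], current[-1], len(current)))
    return patches
```

If the patches cluster at particular times of day, that points toward traffic spikes on the firewall or cable connection. If they are scattered at random, that looks more like flaky equipment.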
posted by Good Brain at 7:57 PM on December 15, 2012
