I have a new Windows 2008 server (64-bit, running on a big Dell rack) that acts as a file server for about 70 people: standard Windows SMB, which people reach at \\servername from their XP desktops via login scripts.
Twice now, the thing has randomly 'frozen' and caused mass panic - all connected user machines freeze up when attempting to access their various shared drives on \\servername. Also, all their applications that rely on files on \\servername\mount also die/freeze/kill themselves/crash etc.
The thing is:
- All eventlogs on this box are clean. Zero. Nothing bad, no errors, application, system or otherwise.
- The network itself is fine. No high-bandwidth intruders, no heavy loads, no disconnects, no bad pings, nothing like that. Bandwidth is normal.
- Server processes are normal; nothing's eating up the local system's network, processor or HD capacity. No antivirus has kicked in, no special timed thing is happening, the processor shows 99% idle, network is at 1% usage, and so forth.
The *only* symptom:
A netstat on the server shows many, many, many CLOSE_WAIT connections being held for all connected users. So desktop 'UserGuy' might have the following entries when the thing is 'frozen':
(a dozen similar CLOSE_WAITS ...)
TCP 192.168.1.201:445 UserGuy:1300 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1302 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1304 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1306 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1308 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1310 ESTABLISHED 4
... and that sort of CLOSE_WAIT repeat occurs for all 70+ connected users. We reboot, log back in, and when it's fine and behaving we don't have this long litany of CLOSE_WAITs.
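To put numbers on this the next time it freezes, the netstat output can be saved to a file and tallied per client. What follows is only a rough sketch: it assumes the columns look like the sample above (protocol, local address, remote address, state, PID), and `count_states` and `SAMPLE` are names made up for illustration.

```python
from collections import Counter


def count_states(netstat_text, port=445):
    """Tally TCP states per remote host for connections to one local port.

    Assumes each relevant line looks like:
        TCP <local>:<port> <remote>:<port> <STATE> [PID]
    Lines that do not fit that shape are skipped.
    """
    counts = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Keep only connections to the SMB port on this server.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        host = remote.rsplit(":", 1)[0]
        counts.setdefault(host, Counter())[state] += 1
    return counts


# The sample lines from this post, used as test input.
SAMPLE = """\
TCP 192.168.1.201:445 UserGuy:1300 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1302 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1304 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1306 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1308 CLOSE_WAIT 4
TCP 192.168.1.201:445 UserGuy:1310 ESTABLISHED 4
"""

if __name__ == "__main__":
    for host, states in sorted(count_states(SAMPLE).items()):
        print(host, dict(states))
```

Running it against a capture taken while the server is healthy and another taken while it is 'frozen' would show whether the CLOSE_WAIT count per client really does climb before each freeze, or whether it only appears once the freeze has started.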
My initial hunch is that some limit on open TCP connections is being hit, which then freezes out further TCP connections. XP users used to raise a similar internal Windows limit when torrenting and so on.
My hours of Googling show you can raise this number with registry hacks, but to me that's treating a symptom - not the cause.
So what's holding the connections open in the first place? Why won't they die? What am I missing here?
Only reason I ask is that I recently became aware that you could indefinitely postpone a forced shutdown on a Windows box by pushing the system date back into the past; and I wonder if the Windows TCP stack is also using some dodgy time-of-day clock calculation for connection timeouts.
Don't waste time on my jumped-to conclusions if somebody else comes up with something more plausible.
posted by flabdablet at 7:50 PM on August 13, 2008