Why does Apache stop serving one file after I request another?
March 10, 2005 8:35 PM
I have run into an unpleasant Apache concurrency problem, whereby the ongoing download of a large file spontaneously and silently hangs ~90 seconds after another request to the server is made. Error logging set at the highest level reveals nothing. Any idea what may be causing this? (Apache/2.0.52, Win2K Professional)
1. This problem doesn't occur when the server is accessed via localhost/internal IP.
2. The format of the large file is irrelevant. The method of downloading is irrelevant. Mozilla, IE, and wget all fail in the same manner. No error is thrown in any of these clients; the data just stops serving.
3. The size or type of the second file is also irrelevant. The problem occurs whether I request a text file, a binary, a PHP script, or even an indexed directory.
4. If I never make the secondary request, the primary request completes without error. This indicates that the problem is not a timeout issue.
5. Most of the settings in httpd.conf are factory default.
6. I am behind a firewall, but the http port (334) is open to both tcp and udp traffic.
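For anyone who wants to reproduce the symptom from outside the network, the following is a minimal sketch, not a definitive test harness. It starts a large download in a background thread, issues a second request a few seconds later, and records how the first transfer progresses. The function names and the URLs passed in are hypothetical; only the Python standard library is used.

```python
import http.client
import threading
import time
import urllib.request


def download(url, progress, chunk_size=64 * 1024, timeout=120):
    """Stream url, appending (elapsed_seconds, total_bytes) after each chunk."""
    start = time.monotonic()
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            progress.append((time.monotonic() - start, total))
    return total


def longest_gap(progress):
    """Return the longest pause, in seconds, between consecutive chunks."""
    times = [t for t, _ in progress]
    return max((b - a for a, b in zip(times, times[1:])), default=0.0)


def reproduce(big_url, small_url, delay=5.0):
    """Start the large download, then make a second request after delay seconds."""
    progress = []
    result = {}

    def worker():
        try:
            result["bytes"] = download(big_url, progress)
        except (OSError, http.client.HTTPException) as exc:
            # A stalled transfer surfaces here as a timeout or a short read.
            result["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    time.sleep(delay)
    with urllib.request.urlopen(small_url, timeout=30) as resp:
        result["second_status"] = resp.status
    thread.join()
    return progress, result
```

Run it from a machine outside the firewall, for example `reproduce("http://your.external.address:334/big.iso", "http://your.external.address:334/index.html")` (both URLs are placeholders). If `result` contains `"error"`, or `progress` stops well short of the file size, the hang has been reproduced. `longest_gap(progress)` shows how long the transfer paused between chunks that did arrive.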
Well, if it works fine with internal (do you mean same box or same local network?) IPs, then that would point the finger at your firewall. What are you using for your firewall? Are we talking about a hardware firewall, a NAT gateway, or a software firewall? Does the 2nd request get served properly?
posted by boaz at 9:02 PM on March 10, 2005
Response by poster: maschnitz: I'm not sure how one would alternately verify or eliminate your first two points. That said, access.log was showing the request, but with only a portion of the total bytes served.
I did notice, however, that Apache threw a warning into error.log during my last attempt:
[info] (OS 10053)An established connection was aborted by the software in your host machine. : core_output_filter: writing data to the network
boaz: I'm using a Cisco 678 router in NAT mode with everything but the essentials filtered. Portscanning my IP from the outside shows only the ports I opened for Apache and BitTorrent (no torrents running at the moment.) I've never run into similar failures during either upload or download of files to/from remote machines, so I'm leaning away from a straight-up connectivity issue. The second request is served without incident.
posted by Danelope at 9:24 PM on March 10, 2005
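To see how often this warning shows up, error.log can be scanned for the `(OS nnnnn)` marker that appears in the line quoted above. This is a rough sketch that assumes every such entry has the same shape as that one line. The function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the "(OS 10053)" marker seen in the quoted error.log line.
OS_ERROR = re.compile(r"\(OS (\d+)\)")


def count_os_errors(lines):
    """Count error.log lines per Windows OS error code."""
    counts = Counter()
    for line in lines:
        match = OS_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Calling `count_os_errors(open("error.log"))` (the path is an assumption) returns a mapping of error code to number of occurrences, which makes it easy to check whether a 10053 entry appears for every stalled download.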
"An established connection was aborted by the software in your host machine" == "some other program next to me killed it". That means something in the kernel on the box, or something with kernel-level access, killed the connection.
I'm guessing it's a virus checker or a software firewall. Try turning all such things off. See if it goes away.
posted by maschnitz at 9:34 PM on March 10, 2005
Response by poster: I'm running neither a virus checker nor a software firewall at the moment. Googling for the last part of the aforementioned error led to this page, wherein they recommend disabling EnableSendfile and/or EnableMMAP as potential solutions. I'll give it a go.
posted by Danelope at 9:37 PM on March 10, 2005
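For reference, disabling the two directives named above would look roughly like this in httpd.conf. This is only a sketch of the change being tried; where the lines belong in a particular configuration is an assumption, and Apache needs a restart to pick them up.

```apache
# Fall back to plain read/write instead of the OS sendfile mechanism
EnableSendfile Off
# Do not memory-map files while delivering them
EnableMMAP Off
```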
Response by poster: Sadly, disabling one or both of the parameters above had no effect on the hanging.
posted by Danelope at 9:45 PM on March 10, 2005
this is a long shot, but do you have a server licence? back on nt, when i last cared about this, workstation licences would only allow a certain number of connections to the 'net. there was a registry hack workaround, iirc, but the idea was that you had to buy a more expensive server licence.
this is from very hazy memory, so treat with scepticism.
posted by andrew cooke at 4:58 AM on March 11, 2005
(related to that, have you looked in the windows event log, as well as the apache logs - sorry if that's completely, totally obvious :o)
posted by andrew cooke at 5:06 AM on March 11, 2005
Have you updated the firmware on the 678? Older revisions were quite buggy, and caused all sorts of strange errors. I'm actually surprised you are still using one, I haven't seen them in widespread use since Qwest was sending them to business DSL customers a few years ago.
It is sounding to me like the NAT (or PAT on the Cisco, I think) is failing, and the connection is being killed because of it.
posted by bh at 7:15 AM on March 11, 2005
From The Cisco 600 Series Installation and Operation Guide (pdf):
The Cisco 67x supports two timeout values: session and idle. The session timeout is based on the total uptime of the session...
To verify these values, enter:
show timeout
I trimmed a lot out, but you get the idea. Pages 114 and 115 of the PDF.
posted by bh at 7:41 AM on March 11, 2005
This is almost certainly your firewall; it sounds as if your firewall is maintaining a single record of a machine that's allowed to use the mapped port to your webserver, and when a second machine tries to use the port, then the firewall gives the second machine the mapped connection. You can verify this by cutting the firewall out of the equation, at least temporarily; the easiest way is to put three machines on the local network (behind the firewall), #1 being the webserver and #2 and #3 being clients. Start a download from #2, and then start a download from #3. If it works fine, then it's your firewall.
To expand a bit, Ciscos in NAT mode maintain total separation of the outside and inside network, and then create a translation table that establishes short-term links between the two. When a machine on the outside network tries to reach a machine on the inside network, the Cisco checks its rules governing such access, and if the machine's request matches the rules, then an entry is created in the translation table and the access is allowed. This connection has limits -- time limits and idle limits (as bh points out); most versions of the Cisco OS also have limits on the number of entries in the translation table. It sounds like your problem is similar to when a translation table fills up -- the oldest translation is dropped in order to create the newest translation, and thus the first machine's connection between the two networks dies. And I agree that a Cisco is unlikely to be able to handle only one translation -- my 501 can handle hundreds -- so I doubt it's that and bet it's more likely to be either buggy firmware or some misconfiguration of the firewall. If you can watch the logs on the firewall, that'd be the key.
posted by delfuego at 8:04 AM on March 11, 2005
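The eviction behaviour described above can be illustrated with a toy model. This is not how the Cisco firmware is actually implemented; it is a sketch that assumes a fixed-size table which drops its oldest entry to make room for a new one. The class name, addresses, and ports are all invented for the example.

```python
from collections import OrderedDict


class TranslationTable:
    """Toy NAT table that evicts the oldest entry when full (capacity >= 1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps (remote_ip, remote_port) to the inside host, oldest entry first.
        self.entries = OrderedDict()

    def open(self, remote, inside):
        """Add a translation; return the evicted remote endpoint, if any."""
        evicted = None
        if remote not in self.entries and len(self.entries) >= self.capacity:
            evicted, _ = self.entries.popitem(last=False)
        self.entries[remote] = inside
        return evicted

    def is_alive(self, remote):
        """Report whether traffic from remote still has a translation."""
        return remote in self.entries
```

With `TranslationTable(1)`, opening a second connection evicts the first, so the first client's download would stop receiving data while the second request is served normally. That matches the symptom in the question, although it does not prove this is the cause.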
fwiw, are you running the latest PHP? we traced down an obscure bug with PHP and "document contains no data" errors on Windows - some versions (forget which, but the latest are fixed) when used with Apache on Windows crash. the web server never actually dies completely, since it's just the thread or child process that dies and gets restarted by the parent process. we migrated to an (at the time) unstable version of PHP and the problem stopped. (this could also come up if you're using a threading MPM and you have PHP mods that aren't thread-safe enabled - not all of them are, and that's why they don't recommend you use PHP with Apache 2.) I agree though that it's probably the firewall, but PHP is an option too if you're running it.
posted by mrg at 9:51 PM on March 11, 2005
Response by poster: mrg: I'm using PHP 5.0.3, which appears to be the latest stable version. In fact, while trying to diagnose this issue yesterday, I tried the latest nightly, and it was crashing a child process every other request. Oof.
As far as the firewall goes, I'm still not clear on what I need to tweak. (I'm a designer/developer by trade, not a sysadmin.) I upgraded the 678 to the latest firmware, which had no effect, and have been reading various pieces of documentation and online chatter since, with nary a clue as to the solution.
Thanks to everyone for your help thus far.
posted by Danelope at 10:09 PM on March 11, 2005
Response by poster: The 678 is running in DMT mode, by the way, which apparently means that the individual connection timeout periods are disabled (and thus unchangeable.)
posted by Danelope at 10:11 PM on March 11, 2005
Backing up a step: Googling "an established connection was aborted by the software in your host machine" reveals a couple of things:
1. Pretty much everyone agrees that the "software" here is WinSock, the Windows TCP/IP stack.
2. The Windows error number corresponding to this message is 10053. There is a wealth of forum material off a "10053 Windows error" search.
3. MSDN is next to no help.
4. Assorted notes:
- "10053 errors are actually quite rare usually."
- One possible cause is the premature termination of the server.
- Another is a virus check error.
- "This error can occur when the local network system aborts a connection, such as when WinSock closes an established connection after data retransmission fails (receiver never acknowledges data sent on a datastream socket). Possibly due to a data transmission timeout or protocol error."
- An interesting solution to a 10053 - the guy reconfigured his network card and it went away.
- There are a lot of stories of upgrading the software to get rid of the error.
- This guy seems to know what he's talking about. He's got interesting pages on the virus check angle, but that's not that useful for you.
Signs still point to the firewalling because of the different behavior inside the network and outside. If I had to bet, your firewall is still misconfigured.
I'm not thoroughly convinced it's the firewall though. Packets look different to your server when they come from outside the firewall. The firewall could be doing its job correctly, but Windows, Apache, or the hardware on your machine is just confused by these new-fangled packets. Personally, I'm still suspicious of Apache 2 on Windows. It's not the most stable platform of all time. Other possibilities include timeout, faulty networking hardware, and a badly configured network card.
posted by maschnitz at 12:04 PM on March 12, 2005
BTW, one possible solution to all this is to replace a piece of the system, one at a time, until it works:
1. Swap out the firewall with a friend's firewall.
2. Try simulating your application using Apache 1.3 on the same machine
3. Swap network cards.
4. Install Apache 2 on a second machine and retry.
5. Install Apache 2 on an XP machine and retry.
6. Take the entire setup over to a friend's house or to work, reconfigure the networking, and retry.
etc.
A very work-intensive and sometimes impossible task, but it will eventually get the job done.
I think #1 in particular is worth the effort at this point.
posted by maschnitz at 12:14 PM on March 12, 2005
This thread is closed to new comments.
1. The network is getting severed. This could be for a couple of reasons: your firewall could be doing it, or the Windows firewall on the box could be doing it. God knows why.
2. The thread servicing the request is dying. This is most likely. I'm not entirely sure how to check, but there's gotta be an active thread count somewhere. This could be because of a misconfiguration, a bug, or a disagreement between Windows and Apache.
3. For some reason, Apache thinks it's done. This is easy to tell - does the log report a successful request, with all bytes returned?
posted by maschnitz at 8:59 PM on March 10, 2005
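The check in point 3 can be scripted. The sketch below assumes access.log uses Apache's Common Log Format, where the last field is the number of body bytes sent. The function names are hypothetical, and the expected size has to be supplied by whoever knows how large the file on disk is.

```python
import re

# host ident user [time] "request" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)')


def parse_clf(line):
    """Parse one Common Log Format line, or return None if it does not match."""
    match = CLF.match(line)
    if not match:
        return None
    host, when, request, status, size = match.groups()
    return {
        "host": host,
        "time": when,
        "request": request,
        "status": int(status),
        # CLF writes "-" when no body bytes were sent.
        "bytes": 0 if size == "-" else int(size),
    }


def truncated(line, expected_size):
    """True if the request was logged as 200 but sent fewer bytes than expected."""
    entry = parse_clf(line)
    return (
        entry is not None
        and entry["status"] == 200
        and entry["bytes"] < expected_size
    )
```

A line for which `truncated(...)` returns true means Apache logged the request as successful without sending the whole file, which argues against possibility 3 and towards 1 or 2.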