Help me find and resolve a traffic bottleneck at my colo.
February 14, 2006 4:16 PM

Help me find and resolve a traffic bottleneck at my co-location. We launched a new web-app (it registers users and delivers flash content) at my work and the server fell over at about 17 conn/sec. Help me prevent this from happening again.

The network is at a colo on a T1 -- our bandwidth WAS, as best I can tell, completely saturated when the server started to drop connections*. However, I'm concerned that there may be other issues in play (it seemed like it was still possible to hit our other servers -- and even other sites on this same server -- while connections to this application's site were getting dropped).

I had top running on the machine the entire time and it never seemed to go above 10%. Running netstat, when I was first called about dropped connections, showed tons of "TIME_WAIT" connections. Our application is pretty simple and does not have a lot of crazy queries -- it basically handles logins and does registrations. We deliver about 4 megs of flash content (in 100k swf blocks) after the login. MySQL connections are not explicitly closed after queries in the script.

The app seemed to still run pretty quickly *after* one was able to establish a connection (if you couldn't, you'd get a server timeout -- and a message saying that the server could not be found). After about a half hour the server settled down (conns dropped below 10-12/sec) and started running nicely again.

As far as I can tell, either apache is hobbling itself for some stupid reason or there's a bandwidth bottleneck. We've got the machine firewalled by a PIX 501, whose CPU and mem usage were steady throughout the traffic spike (supposedly it should be able to do 60 Mbps throughput, anyhow). There is a Linksys prosumer switch behind the PIX.

Here are the specs on the server hardware:

1 GIG RAM, CPU: Intel(R) Xeon(TM) CPU 2.40GHz

df returns:

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md1 9614052 1741272 7384412 20% /
tmpfs 518068 0 518068 0% /dev/shm
tmpfs 518068 12588 505480 3% /lib/modules/2.6.12-10-386/volatile
/dev/md0 45037 19160 23474 45% /boot
/dev/md2 66295352 355780 62571952 1% /var


and software:

LAMP - Breezy Badger Ubuntu/Apache 2/MySQL 4.0/PHP 4.0.

We're running ISPConfig on this machine. The directory that serves the site in question is set up via a vhost with ISPConfig.

We have root on this machine.

Here are the relevant (?) MaxClients entries from apache2.conf



StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 20
MaxRequestsPerChild 0



StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0



What can I do? What should I look at? I've already checked out the other server load question, but my situation is slightly different (I am planning on trying out the benchmarking utils after the current traffic dies down). Again, I want to think that more bandwidth will solve the problem, but we'll be in a hell of a spot if it doesn't, so I need to check all the potential problems.

Oh yeah, we've got about a week to get this right.

(CUE CRAZY MONTAGE MUSIC GO GO GO)

* We are definitely going to try to increase the bandwidth to the server -- we're on percentage now, but I think our speeds are capped at 1.5 Mbps or something, because we never do much above that, even during spikes.
posted by fishfucker to Computers & Internet (13 answers total)
 
Response by poster: stupid.. posting. thing.

those two apache2.conf code fragments should read as:

<IfModule prefork.c>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 20
MaxRequestsPerChild 0
</IfModule>

<IfModule worker.c>
StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>

posted by fishfucker at 4:21 PM on February 14, 2006


More on TIME_WAIT. Sounds like you're running out of TCBs (TCP control blocks). Bad, bad, bad. SO_LINGER may help.
posted by kcm at 4:22 PM on February 14, 2006


Best answer: I manage a server doing 3 million dynamic requests per day and have hit problems like this before.

Running netstat, when I was first called about dropped connections, showed tons of "TIME_WAIT" connections.

That's not necessarily bad. There's a timeout governing how long these sit around. My server routinely has a couple hundred of these at any one time.
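(If you want a rough count by connection state, a one-liner along these lines should do it -- on a stock Linux netstat the state is the sixth column, so adjust the field number if your output differs:

netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn

That gives you counts of ESTABLISHED, TIME_WAIT, and so on, which is handier than eyeballing the raw listing during a spike.)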

Also, are you in worker or prefork mode? I don't think you mentioned it, despite showing both portions of your Apache config. The prefork MaxClients looks, possibly, a little low given the situation.

MySQL connections are not explicitly closed after queries in the script.

That's not really good. MySQL does have a limit on max clients, and it can be a mega PITA when you hit it. Luckily it's easy enough to change, but you need to take memory usage into consideration when you do.
posted by wackybrit at 4:25 PM on February 14, 2006


Unless you've broken something on the apache or system configuration (and I've no reason to believe you have), it's simply lack of bandwidth. A well-optimized apache can handle several orders of magnitude more than a T1 can deliver.
posted by sohcahtoa at 4:28 PM on February 14, 2006


Response by poster: Thanks for the answers so far:

Here's a sample netstat (Taken a minute ago -- server is performing "normally")

My apache2.conf

My vhost conf for the site.

Anything else that would be helpful in analyzing the problem?
posted by fishfucker at 4:58 PM on February 14, 2006


Response by poster: MySQL doesn't seem to be eating up a lot of processor, so I can't say for sure whether or not it is causing an issue -- in fact, apache and mysql hardly even popped up in top while we were getting dropped connections.
posted by fishfucker at 4:59 PM on February 14, 2006


Response by poster: Also, small bandwidth chart from our colo for today.
posted by fishfucker at 5:03 PM on February 14, 2006


MySQL connections are not explicitly closed after queries in the script.

That's not really good. MySQL does have a limit on max clients


You don't have to explicitly close your MySQL connections when you're using PHP 4+, since non-persistent connections are closed automatically after script execution (persistent connections opened with mysql_pconnect() are the exception).
posted by Civil_Disobedient at 5:30 PM on February 14, 2006


Best answer: First: you don't say, but you HAVE to use Apache 2's prefork MPM with PHP (mod_php generally isn't safe under the threaded MPMs). If you used the worker or any other MPM, that could possibly be the problem. Issue "/usr/sbin/apache2 -V" to find out.

Second: Assuming you're using prefork mode, you have a MaxClients of 20. That means Apache can serve only 20 requests at a time. You're serving 4MB of data. On a 56k modem, that's going to take, what, ten minutes to download? So each 56k user is going to take up one of your 20 Apache clients for ten minutes... Broadband users will be done quicker, but it will only take a few dialup users to stop everyone else from connecting to your app while they....slowly....receive...4MB....of....data. You need to set MaxClients - the one in the prefork block, the others are ignored - way higher. Try 100 to start with.
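(Rough math on the dialup case: 4MB is about 32,000 kilobits, and a 56k modem really moves 40-50 kbit/s in practice, so that works out to roughly 640-800 seconds -- ten minutes or more per dialup user.)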

Then you need to keep an eye on your RAM, which will probably go before the CPU does. Each Apache client will use up X memory. If you have 100 clients active and you still have plenty of RAM, crank it up higher.

You should also set the MaxRequestsPerChild to something, say 1000. After 1000 requests, Apache will kill that child and start a new one to replace it. Avoids children growing to huge size due to memory leaks.
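(Purely as a sketch -- these numbers are illustrative, not tuned for your box, and they assume you stay on prefork -- the block might end up looking something like this; watch your RAM as you raise MaxClients:

<IfModule prefork.c>
StartServers 10
MinSpareServers 10
MaxSpareServers 20
MaxClients 100
MaxRequestsPerChild 1000
</IfModule>

Apache 2.0's default ServerLimit of 256 for prefork means you don't need to touch anything else until MaxClients goes above that.)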

Third: keep an eye on MySQL. That ten-minute download could also cause problems for MySQL, for the same basic reason - hold a connection open for ten minutes for each dialup user, and eventually you run out of connections to MySQL. Check and see, make sure MySQL isn't your bottleneck (it ISN'T, now, with Apache having MaxClients of 20, but it might become the bottleneck if you go to MaxClients 100 or 200). You can either close MySQL connections quickly (in your PHP script) or make sure MySQL can have a lot of connections open at once (in MySQL configuration).
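(To see where you stand, something like this from a shell on the box should work -- swap in whatever MySQL user you actually use; SHOW PROCESSLIST shows what's connected right now:

mysql -u root -p -e "SHOW VARIABLES LIKE 'max_connections'"
mysql -u root -p -e "SHOW PROCESSLIST"

The default max_connections is 100, if I remember right, which would line up uncomfortably with a MaxClients of 100. You raise it with a max_connections setting under [mysqld] in my.cnf -- older 4.0 releases may want the set-variable = form.)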

Fourth: bandwidth. You need to give 4000 kilobytes of data to each user who logs in, right? Your T1 bandwidth is 1500 kilobits/second. What this means is that you can completely serve ONE user in about 22 seconds (assuming his download speed is up to it). One user per 22 seconds. So if you ever hope to serve more than one user every 22 seconds, you need more bandwidth.
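(Spelling the arithmetic out: 4,000 kilobytes x 8 = 32,000 kilobits, and 32,000 kilobits / 1,500 kilobits per second is a shade over 21 seconds -- call it 22 once you allow for protocol overhead.)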

I hope this is a good primer on what you need to do to solve these problems. Nutshell: your machine is loafing. Bandwidth and MaxClients are limiting how much you can serve. If there's any way to shrink your Flash content below 4MB, that would be a good thing for several reasons.
posted by jellicle at 6:01 PM on February 14, 2006


I don't have any help to offer, but I'm intrigued as to what you mean by this:

We deliver about 4 megs of flash content (in 100k swf blocks)

Everyone gets four megs of Flash? The average user gets four megs? And how do you mean, 100k blocks? Each user gets forty different SWF files, each around 100k? And how do those files get called, directly by users' browsers, or by other bits of Flash?

Half the Flash I see online these days seems tiny when you look at the SWF file mentioned in the source, but it turns out to be a 4k file named "loader.swf" that's obviously calling up tons of other Flash, the location of which is thereby obscured.
posted by AmbroseChapel at 12:21 AM on February 15, 2006


Unless you've broken something on the apache or system configuration (and I've no reason to believe you have), it's simply lack of bandwidth.
I agree. It looks like it has nothing to do with software, sockets, TIME_WAIT, MaxClients[1], or any of that other stuff. Your netstat output confirms this. You're simply saturating your pipe, and as a result some packets get dropped.

[1] Although a setting of 20 is ridiculously small, as others have pointed out.
posted by Rhomboid at 12:33 AM on February 15, 2006


Response by poster: Yeah, it looks like we are using prefork -- and it looks like that needs to be upped. The 4 megs doesn't come all at once -- presumably, any user would dl up to four megs or more of swf content over a session -- but they only dl swf files of about 100k at a time, which contain links to the other 40 swf files. I can't really get more specific about the application.

We can def. take the size of the flash content down -- I found out our designer wasn't compressing the swf (which, without any other optimizing, would take the total content size down about 20%!). I'm glad to find out that the MaxClients is def. also a culprit -- we just doubled our bandwidth with our colo, and it would suck to find out next week that we're still dropping clients after 20.

There are very few MySQL connections after the content is delivered (1 short script doing one insert every time a swf is loaded), so I don't think we'll have a MySQL problem, but I guess we could if there are 100 people trying to do swf inserts at once.

Thanks for all the help everyone. FWIW, here is the apache2 -V output that jellicle asked about.

Lastly -- any recs for load testing? How is ApacheBench (ab), etc.? Will this give me some realistic info? I have 6 Mbps of dl power at my office, so theoretically I could try to swamp our server at the colo, right?
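(I was figuring on something along these lines with ab -- the URL is just a placeholder for one of our real 100k swf paths, -n is total requests and -c is how many run concurrently:

ab -n 500 -c 50 http://our.server.example/path/to/some-100k-file.swf

though I realize hammering it from one office connection mostly tests our pipe rather than Apache's concurrency.)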

My apache2 -V info:

Server version: Apache/2.0.54
Server built: Dec 5 2005 16:34:11
Server's Module Magic Number: 20020903:9
Architecture: 32-bit
Server compiled with....
-D APACHE_MPM_DIR="server/mpm/prefork"
-D APR_HAS_SENDFILE
-D APR_HAS_MMAP
-D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
-D APR_USE_SYSVSEM_SERIALIZE
-D APR_USE_PTHREAD_SERIALIZE
-D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
-D APR_HAS_OTHER_CHILD
-D AP_HAVE_RELIABLE_PIPED_LOGS
-D HTTPD_ROOT=""
-D SUEXEC_BIN="/usr/lib/apache2/suexec2"
-D DEFAULT_PIDLOG="/var/run/httpd.pid"
-D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
-D DEFAULT_LOCKFILE="/var/run/accept.lock"
-D DEFAULT_ERRORLOG="logs/error_log"
-D AP_TYPES_CONFIG_FILE="/etc/apache2/mime.types"
-D SERVER_CONFIG_FILE="/etc/apache2/apache2.conf"

posted by fishfucker at 11:00 AM on February 15, 2006


Response by poster: Ok -- I've updated the MaxClients and MaxRequestsPerChild -- I'm going to read up on the prefork directives and see what other optimization can be done there. I'll let you guys know how it goes. Thanks for the troubleshooting!
posted by fishfucker at 11:03 AM on February 15, 2006

