Surviving the Slashdot effect
September 21, 2005 6:02 PM   Subscribe

Help us survive a major-but-short-term traffic spike.

We're conducting a joint promotion with a MajorOnlineAuctionHouse. They have kindly offered to include our link in the newsletter which is sent to their entire (subscribed) user base. If even 10% click on the link, we could be looking at hundreds of thousands of visitors in a very short space of time. We were planning on hosting the site on a Verio VPS (!) but now I'm beginning to wonder if it can take the strain...

Last year we were simultaneously Slashdotted and Register-ed. The server just gave up - it was exactly as if they'd launched a DDoS attack on us. I don't want this to happen again.

What precautions can we put in place? I know about Coral Cache but I'm not sure it's appropriate for a commercial site. Are there any other load-balancing techniques we could use? Or is there a way to buy short-term bandwidth from a provider? We'd only need it for a month or two. The site/server is LAMP: I realise that Apache isn't ideal for this sort of thing, so would something like Zeus be a better solution? Would that require significant code changes or setup?

All advice would be appreciated. Thanks.
posted by blag to Computers & Internet (15 answers total) 1 user marked this as a favorite
 
Are you looking at needing to replicate all your dynamic content elsewhere? Could you make the primary page they go to static (or precached), thus reducing your shock-load to a bandwidth problem?
posted by effugas at 6:09 PM on September 21, 2005


Response by poster: We’ve already trimmed it down to the bare minimum of dynamic pages – about 70/30 static/dynamic content – but since it's a competition, there have to be some database writes in there. Maybe we could put the static and dynamic content on different servers, though. Good thought.
posted by blag at 6:20 PM on September 21, 2005


blag,

Have you experimented with load testing? There are all sorts of ways to simulate ungodly spikes of traffic. This does sound like a really valuable opportunity...
posted by effugas at 6:24 PM on September 21, 2005


I hate to be rude, but...

If you have to ask if a dynamic web site can handle the traffic, it can't.

If you are trying to retrofit your existing infrastructure to handle large bursts of traffic, it isn't going to work unless the real problem is that things weren't set up properly in the first place.

Apache is very well suited for this; it's your backend that is going to have problems. If it is important that your web pages stay up no matter what, find a way to go static on every possible page. If it is more important that things are dynamic, get used to your web site being down.

I've survived three /. attacks with static pages, on hardware some people might laugh at. All of my LAMP stuff is set to block anything from /., mefi, boingboing, etc. LAMP scales, but it takes a lot more hardware than people think.
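
For what it's worth, a crude PHP version of that kind of referrer check might look like the sketch below (here deflecting to a static page rather than blocking outright; the host list and fallback filename are placeholders - you could equally just return a 403):

<?php
// Deflect traffic arriving from high-volume aggregators to a cheap,
// pre-rendered static page instead of the full dynamic one.
$heavy_referrers = array('slashdot.org', 'metafilter.com', 'boingboing.net');
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
foreach ($heavy_referrers as $host) {
    if (stripos($referer, $host) !== false) {
        readfile('/var/www/static/fallback.html');  // lightweight static page
        exit;
    }
}
// ...normal dynamic page continues here...
?>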
posted by bh at 6:57 PM on September 21, 2005


Best answer: Are your images optimised?

I'm a bit of a compression junkie so I tend to overestimate the impact, but I did have one occasion where the web application (a course enrolment tool) was completely bogged down. It turned out that a series of "small" PNG files were 30k each rather than the 2-4k they should have been. After I fixed that, the drop in load was dramatic. We had no load problems for the rest of the enrolment period.

If you want to email me your website address, I can quickly see if there's any room for improvement when it comes to optimising your images.

I'm not sure if I've got any room left on the disposable account currently listed on my website, so feel free to email me at Sep05.5.krishaven@spamgourmet.com. (You have until five pieces of spam hit that address before it shuts itself off ;)
posted by krisjohn at 6:59 PM on September 21, 2005


Best answer: You should work on adding some amount of caching to your dynamic pages. Example: if you have a complex DB query, store the results along with a timestamp, so that you can reuse that result for the next, say, 5 minutes without having to re-run the query. MySQL has a query cache that can sort of achieve this, but it's limited - any write to a table invalidates every cached query that touches that table.
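
A rough sketch of that kind of time-based cache in PHP (file cache under /tmp; the function name, example query and TTL are only placeholders, and real code would want locking and error handling):

<?php
// Reuse a query result for $ttl seconds before hitting MySQL again.
function cached_query($sql, $ttl = 300) {
    $file = '/tmp/qcache_' . md5($sql);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));  // still fresh, skip the DB
    }
    $rows = array();
    $res = mysql_query($sql);
    while ($row = mysql_fetch_assoc($res)) {
        $rows[] = $row;
    }
    file_put_contents($file, serialize($rows));        // refresh the cache file
    return $rows;
}

// e.g. $top = cached_query("SELECT name, score FROM entries ORDER BY score DESC LIMIT 10");
?>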

Also, move all images and static content off to another web server that is not your main Apache. The idea here is that each Apache worker will take a certain memory footprint and may occupy a database connection. If you use up a lot of these workers on simple static content, you will needlessly exhaust memory. Install a simple static-only server on a different IP or an alternate port and use it to serve all the static stuff, leaving your Apache workers to do the dynamic content. This way you will be able to serve more concurrent long-running dynamic requests at once.

And ideally, you should really consider the possibility of going 100% static if things get out of control. It is better to serve a page listing ~30-minute-old data than no page at all because the server crashed.
posted by Rhomboid at 7:08 PM on September 21, 2005


Also, how optimised are your database calls?

Here's a mysql example:

select foo.fookey,bar.barkey
from foo
left join bar on foo.foobar=bar.foobar
where foo.set='blah' and bar.subset='whatever'

is generally slower than

select foo.fookey,bar.barkey
from foo
left join bar on foo.foobar=bar.foobar and bar.subset='whatever'
where foo.set='blah'

...though they won't always return the same rows: moving the bar.subset condition into the LEFT JOIN keeps unmatched foo rows (with NULLs) instead of filtering them out, so you'd have to do some testing. I saved a heap of CPU time adjusting a bunch of SQL statements to use the second technique.

Back when I was first doing development on PCs, the "rule" was: "Develop on high-end, test on low-end". Does anyone do this anymore? If you set up a copy of your server on some old PC that spends most of its time swapping to the hard drive, you'd be able to detect bottlenecks with a much smaller load.
posted by krisjohn at 7:39 PM on September 21, 2005


If it's short term, and you have money to spend on the problem, you could consider Akamai; they've got a large distributed caching network you can redirect through, and the process is pretty seamless to set up. A couple of years ago, the company I worked for had a similar opportunity and just a couple of days to ramp up; it turned out to work well for us.

Needless to say, of course, you need to make sure that what you're directing these folks to is almost all static content; nothing will help you if it's dynamically generated for each user.
posted by bemis at 7:49 PM on September 21, 2005


Best answer:
bemis: Needless to say, of course, you need to make sure that what you're directing these folks to is almost all static content; nothing will help you if it's dynamically generated for each user.
That's not entirely true. It's unclear from the details given why the system would collapse under high load. Sometimes it's purely network bandwidth, sometimes it's a lack of CPU power to compute the dynamic elements under load, and sometimes it's an inability to handle so many concurrent users - either because the TCP stack can't cope with that many concurrent connections, or because of the overhead of managing threads for that many concurrent users.

In either of the last two cases, an Akamai or Savvis type service can be pricey, but they may have reasonable prices on short-term (month-long) deals. These outfits don't just handle static content (use a separate DNS name for anything static, so it can be off-loaded and load-balanced differently from the dynamic stuff, with more attention to bandwidth and to holding everything in memory on the server(s)); they also do TCP aggregation and distribute the network load. The benefit is that your dynamic server is reduced to a much more efficient device serving dynamic fragments to the same few Akamai IPs in the provider's distributed cloud, while their thousands of nodes absorb the TCP connection setup and tear-down, keepalives, static content caching and delivery, and the rest of the overhead of many thousands of users. When people test dynamic apps, it's amazing how often they point one or a few servers running high-load generators at the site, which hides the very real cost of that network-level overhead on the server's CPU.


blag: We haven't heard what your budget is, though, what kind of dynamic generation you do (is it read-only, and thus easy to set up redundant copies of the DBs to handle high traffic?), or how long a lead time you have to put remediation in place. It's hard to assess what the real bottleneck is, but if the concern is that your server tips over and throws 500 errors or just times out, you could almost do as well to buy or rent a bunch of very inexpensive PCs and whip up a load-balanced farm.

From a purely optimizing perspective, things to do include:
  • Make sure all static content is well-organized and kept in its own folders or, preferably, under its own DNS name, so that load stays separate from the dynamic load. The server processes shouldn't have to compete between sending a simple .gif and doing a DB call. Users are far more likely to notice a full-page timeout or a slow render of the key text content than a footer image or some other element taking longer to download.
  • Use content expiration headers to ensure client-side caching, reducing repeat requests for the same gif as users navigate around your site (see the sketch after this list).
  • If bandwidth is part of the bottleneck, consider higher compression on gif/jpg elements, and pre-compress static content (gzip for HTTP/1.1 clients) so downloads are smaller without a real-time CPU hit.
  • Reduce keepalive times to a bare minimum, so users still get the benefit on the repeated calls for additional page elements during page load, but connections aren't held open unnecessarily. Combined with moving static content to another DNS name, it's possible to turn keepalives off on the dynamic servers only, since users likely won't be requesting a new dynamic page within the next few seconds anyway.
  • If your front end is a simple rendering layer and your back end more complex, but there's reason to believe the load will hit the front end harder (TCP connections, page downloads, HTML rendering of backend content, etc.), there's no crime in buying or renting a bunch of cheapie PCs to handle the load.
  • If your backend is read-only to the end user (probably unlikely if you're doing auction-related work), consider spinning up multiple copies of it on those same cheapie PCs and load-balancing them behind the front-end web servers.
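
As a rough illustration of the expiration and compression points above - for static files this belongs in the web server config, but for anything that has to stay PHP, something like this works (the one-hour lifetime is an example value only):

<?php
// Illustrative only: caching headers plus gzip output for a semi-static PHP page.
header('Cache-Control: public, max-age=3600');
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');
if (!ob_start('ob_gzhandler')) {   // gzip if zlib is available and the client accepts it
    ob_start();
}
// ...render the page...
?>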
Those are just some thoughts. Without more information, it's a bit of a shot in the dark as to what you should do.
posted by hincandenza at 8:40 PM on September 21, 2005


I'm with bh here. If you have a few weeks to rearchitect your dynamic site software, I'm sure you can make it scale. But if you don't already know how to load balance and set up failover, it's not going to go well. The only quick fix is to make as much of your site static as possible and get a few Apache hosts running in parallel with some simple load balancing.

You estimated a 10% clickthrough rate on your email promotion. I think that's awfully optimistic, particularly since the first thing everyone is told on joining the Internet is not to click on links in email from MajorOnlineAuctionHouse, since 99% of the time it's a phishing attack.
posted by Nelson at 12:22 AM on September 22, 2005


Response by poster: Chaps - I'm incredibly grateful for all of your input. As requested, here are some more details about the application. We've got a month to get this sorted so there's time and space to work out a good solution, thankfully.

Basically, it's a quiz which visitors will answer and submit. All of the intro/FAQ pages are static HTML. Currently, all of the question pages are PHP with (hashed) variables passed between pages and a single write to a MySQL DB right at the end. I realise that this isn't the most secure method but we're trying to limit DB interaction as much as possible. Since the content of the quiz won't change, we may be able to make these pages static and just have one dynamic page which does the writing - I'd need to go back and look over the code, though. It's not up yet so I'm afraid I can't show you the final app. Thanks for the offer though, krisjohn.
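
(For the curious, the general shape of that hash-and-pass-along approach is roughly the sketch below - simplified, not our actual code, and the secret, field names and final INSERT are placeholders:)

<?php
// Carry quiz answers between pages in signed hidden fields, so the DB
// is only touched once, at the very end. Placeholder names throughout.
$secret = 'change-me';

function sign_state($state, $secret) {
    return md5($secret . $state);   // 2005-era; an HMAC would be stronger
}

$state  = isset($_POST['state'])  ? $_POST['state']  : '';
$sig    = isset($_POST['sig'])    ? $_POST['sig']    : '';
$answer = isset($_POST['answer']) ? $_POST['answer'] : '';

if ($state !== '' && $sig !== sign_state($state, $secret)) {
    die('Tampered form data.');     // someone edited the hidden fields
}
$state .= '|' . urlencode($answer);
$sig    = sign_state($state, $secret);

// Emit the accumulated state for the next question page:
echo '<input type="hidden" name="state" value="' . htmlspecialchars($state) . '">';
echo '<input type="hidden" name="sig" value="' . $sig . '">';

// On the final page only: a single INSERT of the accumulated answers, e.g.
// mysql_query("INSERT INTO entries (answers) VALUES ('" . mysql_real_escape_string($state) . "')");
?>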

As per recommendations, we will certainly be splitting the dynamic and static content between different servers/DNS. Since there are going to be few (if any) DB reads and only one DB write per user, I suspect the static content will be the bottleneck. Caching, compression and keepalives will all be important but I'm still wary about the number of concurrent threads. Do you guys know of any load testing applications that could simulate a huge influx of visitors?

Nelson: point taken about the optimistic clickthrough rate but we're trying to plan for a worst case scenario...

Thanks again for all your help
posted by blag at 7:29 AM on September 22, 2005


For load testing, I can't speak to *nix tools, since I'm a Windows guy. But you could use the free Windows tools, such as the Web Capacity Analysis Tool, to simulate the load of hundreds of browsers hitting your site at the same time and "clicking through" a quiz.


Like Nelson, I do question if you've actually done the math to see just how big a spike you're looking at: you say "hundreds of thousands" of users, but you also say this could last a month or so. In terms of concurrent users or users per second, that's actually very little. A lightweight, dynamic server-side page doing a basic quiz, serving up static content, should have little problem handling a couple of hundred users simultaneously on a fairly lightweight server/decent PC, if not more.

Do a back-of-the-envelope calculation based on possible total user visits over time, skewed so the bulk of users arrive during certain times of day (lunch hours, or whenever they're likely to show up), to get a ballpark for your hoped-for/expected peak traffic in Mb/s as well as concurrent users/sessions (you'll have to guesstimate how long a quiz takes to complete), and decide whether your server's NIC, or your ISP uplink, can handle the bandwidth at peak time.

As I mentioned, you might be surprised to find that you can handle a lot of traffic of this type easily. Even changing nothing in your code right now, users will hit a page, spend a little while filling out that page's questions, then click next, and so on. Let's say your quiz has 3 pages and averages around ~60 seconds to complete, with ~20 seconds between page clicks. You could have 100 new users arriving every second, and at peak you'd have a sustained maximum of ~6000 users working on your quiz at any one time - at any given second, the people who arrived about a minute earlier are finishing and leaving your site, while a fresh ~100 have arrived to replace them. Each of those users clicks 3 times in their minute-long quiz, so each cohort of 100 new users generates 300 requests spread over the following minute; with a new cohort arriving every second, that works out to about 300 requests a second, or 18,000 pages a minute. At 10KB a page (assuming it's clean HTML, minimizing unnecessary page elements like arrow.gif when a simple | > is perfectly good instead), the bandwidth will be considerable.


Even with those numbers, your lone server will only have around 9000 TCP connections open at a time (allowing some time for connections to be torn down after users leave your site and don't return), and around 300 requests per second. This might run a decent server (say dual 2+GHz, 1GB RAM) hot, but it probably won't tip over. The bandwidth of 18,000 pages of 10K each per minute is a lot - about 24Mb/s. But then, if you're in a hosted environment they can probably handle that level of traffic, although at a pretty high price.
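
If it helps to play with the assumptions, the arithmetic above is easy to redo; this little sketch just re-derives the same example figures:

<?php
// Back-of-the-envelope load estimate using the example numbers above.
$new_users_per_sec = 100;  // arrival rate
$pages_per_quiz    = 3;
$quiz_duration_sec = 60;
$page_size_kb      = 10;

$concurrent_users = $new_users_per_sec * $quiz_duration_sec;        // ~6000 in the quiz at once
$requests_per_sec = $new_users_per_sec * $pages_per_quiz;           // ~300 req/s at steady state
$bandwidth_mbps   = $requests_per_sec * $page_size_kb * 8 / 1000;   // ~24 Mb/s

printf("%d concurrent users, %d req/s, %.1f Mb/s\n",
       $concurrent_users, $requests_per_sec, $bandwidth_mbps);
?>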

All that said, while those bandwidth numbers may seem daunting, they rest on pretty high expectations: 100 new users per second works out to 360,000 people an hour, or roughly 1.4 million quiz-takers over a 4-hour lunch window stretching from EST to PST. Even the extreme case - this MajorOnlineAuctionHouse mailing 120 million people in a single day with a full 10% clicking through during that window - would actually put you well beyond those rates, at something like 12 million quiz-takers in 4 hours.

If you get those numbers, it's a nice problem to have, but it's highly unlikely - and you'd easily be able to afford an Akamai or Savvis at that point. Much more likely is that the real numbers are at least an order of magnitude lower, which means you could handle a rate of 10 new users per second, with ~600 concurrent users, on a couple of Dell Dimensions and a 2Mb up/down DSL line. :)
posted by hincandenza at 6:49 PM on September 22, 2005


Best answer: Good luck, blag. Get a solution in place for running your PHP app on several machines and you're in good shape. Cookie-based load balancing could help you here, but if your quiz app stores all its state client-side (i.e. no server-side sessions) then just about any kind of load balancing will do.

There are a couple of load testing tools I like because they are simple. Not comprehensive, but so what? One is apache bench, the other is Microsoft's stress test tool. Load testing is a subtle art, but this will give you some idea how you'll do.
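
For apache bench, even a one-liner gives you a first approximation (the URL and numbers are just examples):

ab -n 5000 -c 100 http://www.example.com/quiz/intro.html

-n is the total number of requests, -c is how many to keep in flight at once; add -k to test with keepalives on. It won't walk a multi-page quiz for you, but it will show roughly where the box starts to choke.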
posted by Nelson at 7:49 AM on September 23, 2005


Response by poster: Thanks again. Will try out some of these suggestions and come back to mark best answers. Good work.
posted by blag at 3:40 PM on September 25, 2005


Response by poster: Hi folks - in case anyone is interested, we survived with nary a hitch. I suspect the trick of placing all static content on one server and letting the main server handle the dynamic content alone was the biggest factor. Thanks again to all who helped.
posted by blag at 9:05 AM on December 2, 2005

