Can I get high availability with a modest budget?
December 20, 2008 9:34 PM

How can I set up the virtual server environment I want, using as much open source and as little proprietary software as possible?

I'm the technician / sysadmin at a primary school, which will be updating its on-campus server boxes early next year.

Both our existing boxes run Windows Server 2003. Admin1.admin.local runs SQL Server Express and the school administrative database and serves files to four admin workstations; curricserver.curric.local is pretty much a pure file server for 70 classroom workstations, making extensive use of NTFS permissions to manage access control. All workstations run Windows XP.

Admin1 is the PDC for admin.local (10.145.172.0/23), and curricserver is the PDC for curric.local (10.129.172.0/23). An upstream-managed router connects both subnets to the Internet, and also allows admin.local hosts to send UDP datagrams and establish TCP connections to curric.local, though not the other way around. No broadcast traffic transits the router. The router's link to the admin.local subnet is 10 Mbit/s, which is fine for Internet access but sucks for cross-subnet file serving.

Admin1 and curricserver each have a UPS. I use external USB drives for backup. I am pretty happy with the way each of the servers is currently set up.

Next year, we'll also be replacing our existing MS Access-based student reports package with a new web-based one from the same company. I have no reason to believe that this will go well. The company recommends hosting this thing on a dedicated box, which will basically be running SQL Server Express and IIS and not much else. I have no wish to install IIS on either admin1 or curricserver, so I'm happy to agree with them.

Now, rather than buy three new boxes and a new UPS and another set of backup drives next year, I'd rather buy two new boxes with i7's and loads of RAM, run a Linux on them with something like Heartbeat in it, and create the three W2k3 servers I need virtually.

It seems to me that doing this would allow me to (a) keep using the same virtual servers year after year after year, while updating the underlying Heartbeat cluster as often as necessary to suit the school's hardware management policy (b) avoid single points of failure for all servers (c) centralize my backup task (d) use solid Unix system administration tools for disk snapshot and backup management instead of whatever some random commercial vendor claims to have invented this year (e) save the school some money (f) easily bypass the upstream-managed router's connection between our two subnets, allowing me to set up the same safe one-way routing policy at gigabit speeds.

Questions:

1. Is this idea Wrong in any important respect? I haven't had much hands-on VM experience, but I have enough Windows expertise to migrate our existing server setups to other hardware (even virtual hardware), probably without needing to do a Windows reinstall, a long enough beard and enough open source happy drink to see the project through, and the intention to document it thoroughly enough to stop it turning all white-elephant for the next guy.

2. Will Heartbeat in fact let me set up three VM's that will normally run on Tweedledum but reboot themselves automagically on Tweedledee if Tweedledum dies?

3. Which VM environments will let me put a Windows VM in charge of a physical network adapter, so I can serve files from a Windows VM over gigabit Ethernet without undue performance penalty?

4. Am I correct in assuming that I want N+2 physical network adapters in each physical box (one for each of N virtual machines to connect to an appropriate network switch, plus one to do a point-to-point link for DRBD, plus one to talk to the host OS via an appropriate network switch)? Or, since the virtual replacement for admin1 is only going to be talking to four workstations, do I actually need a dedicated physical network adapter for that VM?

5. Is there any good technical reason for my visceral unwillingness to install IIS on anything except its own dedicated (physical or virtual) box?

6. What's the obvious question I've completely forgotten to ask?
posted by flabdablet to Computers & Internet (11 answers total) 5 users marked this as a favorite
 
Why not use Xen, or spring for VMware ESX? Either of those will work very well for swapping out the underlying hardware as needed.

Virtualization has brought a number of new challenges to network security and VLAN isolation, and there are a number of ways to go about this. I recommend dedicating physical interfaces to VMs that require dedicated hardware performance, but most of the time you can get away with VLAN tagging down to the host machine and handling it from there.
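
To make that concrete: on a Linux host, the VLAN-tagging approach is just the stock 8021q module plus the bridge utilities - something along these lines, where the interface name and VLAN ID are made up and need matching to whatever your switch is trunking:

    modprobe 8021q
    vconfig set_name_type DEV_PLUS_VID_NO_PAD
    vconfig add eth0 172          # tagged sub-interface eth0.172 for VLAN 172
    brctl addbr br172             # bridge for guest virtual NICs on that VLAN
    brctl addif br172 eth0.172
    ifconfig eth0.172 up
    ifconfig br172 up

Guests that really need the throughput get a whole physical NIC handed to them instead of a virtual interface on one of those bridges.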

I don't recommend running anything really I/O-intensive and CPU-intensive on virtual hardware (usually your DB); everything else runs pretty smoothly. I'm a big fan of VMware: their product is very mature, it makes backup and recovery a snap, and it makes migration to new hardware (or even to older hardware that isn't being used properly) very, very easy.

VMware isn't open source, but you can do much of what you can do in VMware products with the Xen hypervisor and some other tools.

If/when you buy the big new boxes, make sure you also get some fast disk for them. The RAM and CPU are shiny and nice, but with virtualization I think you're really going to want quick disk underneath as well. In our utility/virtualization environments we encourage folks to go to a very fast SAN, but you can get plenty of performance out of 10k or 15k rpm disks if you keep your disk size sane and go with multiple spindles.
posted by iamabot at 10:51 PM on December 20, 2008


Response by poster: Can Xen or VMWare ESX, on their own, do the kind of auto-failover stuff that Heartbeat does? If not, can I configure Heartbeat to control Xen-based instances? I'm only at the school one day per week, and I would really not like to end up with something that can take the entire school down due to a single hardware failure.

I'm a little disheartened by the idea that I/O intensiveness and CPU intensiveness are both yellow lights for virtualization - I mean, in a server, what else is there?

On the disk speed issue: both our existing boxes have enterprise-grade SATA drives in RAID-1 (admin1 has a single RAID-1 pair, curricserver has two) and performance has been acceptable; our old 100mbit/s backbone has been the bottleneck more often than not. If I make sure that the files being served by the new virtual curricserver are all on a separate RAID-1 array from everything else, and that the physical machine has bags of RAM so that the host OS will cache the virtual system partitions nicely, shouldn't that end up performing at least as well as what we have now?

Also, are you suggesting I won't get enough new grunt from shiny new i7 boxes to offset the inefficiencies of virtualization? Admin1 currently has only 512MB RAM in it, which is clearly not enough for W2k3 + SQL Server Express + proprietary admin software, but it only has to serve four workstations so it's actually quite acceptable. We're not replacing the old boxes because they fail to perform - we're retiring them solely because running two-years-out-of-warranty hardware is against policy.
posted by flabdablet at 12:12 AM on December 21, 2008


What strikes me is that in order to add redundancy under each server you are adding multiple new layers that have to function properly in order to have any of your services available.

You are trading the straightforward and well-understood failure modes of a few independent servers for a cluster with a bunch of interdependent layers that you don't have any experience with. Instead of worrying about your free-standing machines failing, you have to make sure that you've got virtualization, volume snapshots & volume management, shared storage, and maybe a clustered filesystem all down solid, along with failover and recovery.

If going that route would let you consolidate a dozen or so servers down to just two, it would seem worthwhile, but it doesn't seem like a big win with 2-3 servers. It seems like it would be a better use of time to optimize your backup/recovery & re/installation procedures. Similarly, converting your existing installations into virtual machines seems like the wrong path when you just have a few servers. If your configurations aren't well documented enough that you can easily recreate them on a fresh install then this is an opportunity to correct that deficiency, rather than enshrine it.

If you decide to go forward with your plan because your job isn't interesting enough, and you want the challenge and the chance to learn new skills, you have a few choices to consider.

VMware ESX is a mature, full-featured product with good performance and good support for Windows guests, but it's not open source, it's not inexpensive, and it's actually an operating system unto itself. Oh, and my understanding is that the list of officially supported hardware for the host is pretty constrained.

VMWare Server is now free, and I guess the older version is open source, but the community for the open source version seems a little scattered. It seems to be a relatively solid and full featured product, with decent performance, and good support for Windows guests.

VirtualBox is another option. There are commercial & open source versions. My impression is that performance with Windows guests may be a bit below VMware Server's, but that it's otherwise pretty competent.

Xen is open source. The core technology seems really solid (a lot of web hosts, including Amazon EC2, are using it), and there is a growing amount of open source management software. Best performance comes when the guest kernel can be "paravirtualized." This isn't possible with Windows, but there are paravirtualized drivers available that help. I don't know all the requirements, but under some circumstances it is possible to migrate a running Xen instance to another machine without shutting it down, which is a pretty cool trick. I think Xen also lets you map PCI devices into guest VMs, so they can access them (more or less) directly.
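
I haven't done the PCI trick with a Windows guest myself, but to give a flavour of what's involved: a Xen 3.x HVM guest is defined by a small config file under /etc/xen/, roughly like the sketch below. The disk path, bridge name and PCI address are placeholders, and the PCI device has to be hidden from dom0 (via the pciback driver) before a guest can claim it:

    # /etc/xen/curricserver -- rough sketch of an HVM (Windows) guest definition
    name         = "curricserver"
    builder      = "hvm"
    kernel       = "/usr/lib/xen/boot/hvmloader"
    device_model = "/usr/lib/xen/bin/qemu-dm"
    memory       = 2048
    vcpus        = 2
    disk         = [ 'phy:/dev/drbd1,hda,w' ]   # virtual disk backed by a host block device
    vif          = [ 'bridge=xenbr0' ]          # virtual NIC attached to a host bridge
    pci          = [ '03:00.0' ]                # hand a physical NIC (bus:slot.func from lspci) to the guest
    boot         = "c"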

DBs don't generally run well on virtual machines, but it doesn't sound like your deployment will be that intensive, so it is probably a non-issue.

Things to look into, besides Xen:
* ConVirt provides a management interface for a cluster of Xen hosts, though I'm sure there are other options.
* DRBD (Distributed Replicated Block Device) allows the mirroring of a block device over a network connection. Virtual hard disks on DRBD could be used to provide failover without the cost of a SAN or some other form of shared storage (rough config sketch after this list).
* The Linux Logical Volume Manager (LVM) can be used for taking snapshots of filesystems holding virtual hard disks as part of a backup/recovery strategy (also sketched below).
* The XFS filesystem (I think JFS allows this too) can be resized online, which complements the ability of LVM to provision additional storage to a partition/logical volume on the fly.
* I've been curious about whether using iSCSI within the virtual machine would provide better performance than a virtual hard disk on an emulated SCSI driver.
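
To make the DRBD and LVM items above a bit more concrete, a minimal two-node DRBD resource in /etc/drbd.conf looks something like this - hostnames, partitions and the crossover-link addresses are all invented:

    resource vmdisks {
      protocol C;                      # synchronous replication
      on tweedledum {
        device    /dev/drbd0;
        disk      /dev/sdb1;           # local partition being mirrored
        address   192.168.200.1:7788;  # point-to-point link to the other box
        meta-disk internal;
      }
      on tweedledee {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.200.2:7788;
        meta-disk internal;
      }
    }

The LVM side of a backup is basically: lvcreate --snapshot --size 5G --name winvm-snap /dev/vg0/winvm, copy the frozen image off somewhere (dd it or mount it read-only), then lvremove the snapshot.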
posted by Good Brain at 1:40 AM on December 21, 2008


Oh, virtual machines seem to do pretty well on CPU-intensive stuff, as long as it doesn't involve a bunch of system calls (I/O or memory allocation). Unless the OS is paravirtualized, those calls involve a lot of fancy emulation to fake out the guest kernel into thinking that it's running on the bare metal.
posted by Good Brain at 1:44 AM on December 21, 2008


VMWare ESXi (the lower-footprint version of the ESX product that uses busybox instead of a full RHEL distribution for a console) is now free.

It's got a limited list of supported hardware, but lots of people have had luck getting it to work on "white box" hardware.
posted by mrbill at 2:06 AM on December 21, 2008


I don't have any experience with VMware ESX, Heartbeat, or DRBD. I do use VMware Server a lot at work to run Linux (SuSE Linux Enterprise Server) and NetWare, however.

You don't need a separate network card for each VM unless you think that VM will be using a majority of the bandwidth often (say, 80% is a number I feel comfortable with). What happens is that you run the VMs in bridged network mode, so to the switch it looks like there are N+1 MAC addresses on that port (N virtual machines, plus 1 for the physical card itself).

You will, however, want one network card for each subnet.

One thing you've forgotten: what happens when you need to patch your guest OS? All this fancy failover stuff helps if the host dies, but when you need to take the guest down for routine maintenance there will be a service interruption.

I do believe that Windows Server has failover cluster support, but I have never used it. So I'm going to describe what we do with NetWare for our printing.

We set up three VMs: PRINT1, PRINT2, and STORAGE. PRINT1 & PRINT2 are exactly the same: same software, same configurations, just different names and IPs. STORAGE is an iSCSI server that hosts a shared volume where the printing database (current jobs, drivers, etc) will be hosted. PRINT1 and PRINT2 each have two virtual network adaptors. One is bridged to the physical adaptor, and one to a private virtual network. STORAGE then only has an adaptor on the private virtual network so that it isn't visible from the outside.

In case you're getting confused, here's an example setup (NetWare doesn't use ethX, but just indulge me):

STORAGE:
- eth0: 10.10.1.1

PRINT1:
- eth0: 208.77.188.167 (print1.example.org)
- eth1: 10.10.1.2

PRINT2:
- eth0: 208.77.188.168 (print2.example.org)
- eth1: 10.10.1.3

Now, what happens when the cluster starts up? The heartbeat detects that the PRINTING resource isn't running anywhere and one of the two nodes (PRINT1 or PRINT2) claims the master resource. It then starts up the PRINTING resource, which involves the following steps:

1) bind the printing.example.org IP (208.77.188.166) to eth0.
2) mount the DATA volume on STORAGE (which contains current jobs, drivers, etc).
3) start up the print server software.

When we're migrating the PRINTING resource from one node (say, PRINT1) to another (say, PRINT2) then these steps are performed in order:

PRINT1:
1) stop the print server software
2) unmount the DATA volume on STORAGE.
3) unbind the printing.example.org IP.

PRINT2:
1) bind the printing.example.org IP (208.77.188.166) to eth0.
2) mount the DATA volume on STORAGE (which contains current jobs, drivers, etc).
3) start up the print server software.

So why go through all this trouble? Because when we want to apply a NetWare patch or service pack then we can apply it on the old inactive node (PRINT2), migrate the service, and then apply it on the new inactive node (PRINT1). The users see maybe 10 seconds of downtime instead of an hour.

MY POINT: you can't do something like this if you're migrating a single guest OS between physical hardware. You need support for it inside the guest OS itself, which usually means two copies acting in parallel. In our case, we don't even host the nodes on separate physical machines because that type of redundancy isn't important to us. But we do need to be able to minimize downtime while patching.
posted by sbutler at 2:31 AM on December 21, 2008


Couple more thoughts:

Patching isn't the main reason we do this cluster setup, although it is a really nice side effect. We do it because NetWare is not the most stable OS ever designed. It's marginally better than Windows 3.1 or OS 9.

But Windows has its faults too. It gets more patches than NetWare, plus more viruses. What happens if IIS, SQL Server, or Windows itself gets compromised? Migrating the VM between machines doesn't help solve that. But migrating IIS or SQL Server between VMs will help you minimize downtime while you rebuild the infected VM.

Ultimately, I guess I'm saying that instead of using high availability clusters (Linux Heartbeat) to manage VMs, you should be using VMs to manage high availability clusters (Windows Server Clustering).

Downside is that with Windows you'll double your licensing costs. :( But this is how I'd do it if I were you.
posted by sbutler at 2:53 AM on December 21, 2008


Response by poster: These are all really helpful and well-considered answers, and thank you all for taking a considerable chunk out of your day to help educate me. Much appreciated.

in order to add redundancy under each server you are adding multiple new layers that have to function properly in order to have any of your services available

That's quite true, which is why I want to be reasonably sure this stuff is actually going to work before working up a serious budget proposal for this option. My thinking has generally been that if it's good enough for Amazon it's good enough for me :-) The point that they are running thousands of instances while I'm proposing to run three is well taken.

You are trading the straightforward and well-understood failure modes of a few independent servers for a cluster with a bunch of interdependent layers that you don't have any experience with.

Yes. I want to find out whether I can reasonably expect what does strike me as a somewhat Rube Goldberg arrangement to actually improve overall reliability for roughly the same money provided I set it up properly. I'm not fussed about the experience thing. I figure if it does turn out that I'm too dim to make this work, the worst that can happen is a few wasted weekends and having to buy a third box. And the nice thing about being a semi-retired programmer working one day a week as a sysadmin is that I do have time for this.

It seems like it would be a better use of time to optimize your backup/recovery & re/installation procedures

There's obviously no such thing as a backup strategy incapable of improvement, optimization of these things is ongoing, and I'm hoping the VM thing could open up some interesting new possibilities.

If you decide to go forward with your plan because your job isn't interesting enough, and you want the challenge and the chance to learn new skills, you have a few choices to consider

OK, now we're on the same page :-)

under some circumstances it is possible to migrate a running Xen instance to another machine without shutting it down, which is a pretty cool trick

Sounds like it might be fun to play with, but unless there's also a reverse temporal perception kernel module I don't yet know about, I can't see it helping much with failover :-)

I think Xen also lets you map PCI devices into guest VMs, so they can access them (more or less) directly

That's a tick for Xen, then. I'll start looking there.

Things to look into, besides Xen:
* ConVirt looks very interesting - thanks.
* DRBD is what got me thinking about this in the first place and it's supposed to tie in nicely with Heartbeat, so: yes.
* LVM I already use on my home server and am quite impressed by. By and large it Just Works.
* XFS can be resized online (quick sketch below)... I believe ext3 can now do this too, at least for embiggening. How resilient is XFS compared to ext3 after a crash-stop?
* iSCSI is something I've never played with at all. Should I?
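
(For my own notes, my understanding is that growing a mounted XFS filesystem that sits on LVM is just the two commands below, with names and sizes invented, and that resize2fs can now do the same for a mounted ext3. Shrinking is another story.)

    lvextend --size +20G /dev/vg0/curricfiles
    xfs_growfs /srv/curricfiles       # grows the mounted filesystem to fill the enlarged LV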

All this fancy failover stuff helps if the host dies, but when you need to take the guest down for routine maintenance there will be a service interruption

That suits me. This is a school, not a 24x365 business backend, so I can easily do my routine maintenance outside school hours as I have been until now. What I'm trying to avoid is more of those panicked phone calls that say "Stephen, nobody can get onto the server and you need to come in and get it going because we've got a video conference with Japan before lunch" when I'm off cleaning viruses out of somebody else's winbox. The mix of half-arsed proprietary gunk we run on these servers used to lock them up with annoying regularity until I added a weekly scheduled restart; memory leaks, I think. Curricserver has a really stupid BIOS that occasionally fails to let it restart without being power cycled. I'm quite looking forward to having something that isn't bloody Windows and will stay up long enough to do effective remote admin on.

Thanks for the migration examples. Lots to think about there.

instead of using high availability clusters (Linux Heartbeat) to manage VMs, you should be using VMs to manage high availability clusters (Windows Server Clustering)

That would indeed make more sense given a close-to-100% uptime requirement, but that's not what I have; I have a close-to-100%-uptime-when-I'm-not-on-site requirement, and a strong desire to give Microsoft as little money as their idiot licensing policy allows.

I think I'll give this a whirl. I'll set up a couple of old machines at home and play with Xen and Heartbeat and see what they can do.
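
From the docs I've skimmed so far, the old-style Heartbeat setup looks pleasingly small: an /etc/ha.d/ha.cf and authkeys on each node, plus a haresources file naming what fails over. Roughly the sketch below, with every name and address invented and none of it tested yet - in particular, whether the stock xendomains init script behaves itself as a Heartbeat resource is exactly what the old machines at home are for:

    # /etc/ha.d/ha.cf (on tweedledum; tweedledee's copy points back the other way)
    node tweedledum
    node tweedledee
    ucast eth3 192.168.200.2      # dedicated crossover link, shared with DRBD
    keepalive 2
    deadtime 30
    auto_failback off

    # /etc/ha.d/haresources
    tweedledum drbddisk::vmdisks Filesystem::/dev/drbd0::/vmdisks::xfs xendomains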
posted by flabdablet at 6:17 AM on December 21, 2008


Yeah, I think your answer is Xen. Despite my notes on the performance constraints, the biggest thing to worry about is really the disk I/O, but my guess is you're not even going to get close to hitting the I/O limits, and if you are you can probably look at some really speedy flash disks for the stuff that's getting hammered. I should make a point of clarification: when I reference virtual infrastructures I am talking about their performance in very large, very intensive environments - e-commerce for major retailers, airlines, hospitals - where they get really heavily used.

If you can go Xen and spend some extra money on a solid backup system you'll be in good shape in general.

Xen and ESX run very well on HP hardware, specifically the DL360/380s and my favorite for VMware farms, the DL580.
posted by iamabot at 11:35 AM on December 21, 2008


Am I correct in assuming that I want N+2 physical network adapters in each physical box (one for each of N virtual machines to connect to an appropriate network switch, plus one to do a point-to-point link for DRBD, plus one to talk to the host OS via an appropriate network switch)? Or, since the virtual replacement for admin1 is only going to be talking to four workstations, do I actually need a dedicated physical network adapter for that VM?

My 2 cents - and I understand that this is not what you want to hear - is that you're adding multiple layers of complexity without enough organizational benefit to justify it. Unless you've got really old servers that aren't worth extending the warranty, it sounds like (at most) you'll need one new server, not three.

There's also no reason you couldn't install IIS on one of the existing boxes. The worst thing that will happen is that the IIS executable will crash. It's not going to bring down the server. You could also stick the IIS stuff into a VMware Server instance, which only adds a few services to your existing server.

I'm not sure if this is somewhere you plan to stay forever, but I can tell you that I wouldn't be happy to come into what should be a simple, two or three server setup and have to contend with Linux, Xen, Heartbeat, multiple NICs, etc., etc. in addition to having to contend with Windows, IIS, SQL Express, etc.

I started down this road at a small company where I worked. We had some hardware-related server failures, and I got it in my head that I would virtualize the five servers, with hot failover, etc., etc. I eventually realized that I was just bored and that the complexity and expense weren't worth the benefit. Instead, I bought better warranties, hot spare hard drives and implemented a better monitoring solution for servers and services.

Last thing: if you're intent on doing this (and it sounds like you are), spend the time to document it thoroughly. I've been on the consulting side of things like this, and it would be pretty tough to unravel what you're proposing here if it wasn't fully documented.
posted by cnc at 12:37 AM on December 22, 2008


Response by poster: 2 cents of experience gratefully accepted as an informed answer to my very first question. I'm not really intent on doing this yet, but I'm still keen to learn about it. Complexity is certainly an issue, but I have yet to be convinced it's necessarily a show-stopper.

I eventually realized that I was just bored and that the complexity and expense weren't worth the benefit

How far down the implementation road did you get before coming to that point of view?

Unless you've got really old servers that aren't worth extending the warranty

Admin1 is a Viewmaster box about seven years old; curricserver is an Optima box, about six. Curricserver was just being installed as I walked into this job. Neither server is particularly wonderful. Admin1 doesn't even have ECC RAM, and curricserver has the world's most annoying BIOS (for example, it won't even boot if the USB backup drives aren't both plugged in, because the BIOS drive ordering changes and it forgets which drive has Windows on it). They're both old beaters, basically, and will certainly be replaced by mid-2009 with something.

I've been on the consulting side of things like this

Me too.

and it would be pretty tough to unravel what you're proposing here if it wasn't fully documented

You can bet your arse that if I did go ahead with this, I would certainly prepare a printed manual with comprehensive organizational diagrams, a step-by-step bare-metal total disaster recovery procedure, copies of every document I consulted while putting it all together, and a frequently updated page of web links. Basically, the documentation I'd work up is what I would like to have if I were walking into the place cold.

Also, I'd give each virtual Windows server its own dedicated USB backup drive, and the files on that drive would be in standard Windows Backup / ASR format and sufficient to bring up a functionally identical server on dedicated hardware, in case my successor is not inclined to maintain the virtualizing environment.

It still seems to me that if I limit the Linux-side complexity to VM/failover management plus a little bit of inter-subnet routing, and make sure there's a nice GUI available to manage that stuff, but leave the DNS and DHCP and Active Directory stuff on the Windows boxes where a Windows admin would expect to find them, I ought to be able to build something that a competent Windows admin would find reasonably easy to drive. But I am certainly taking the naysayers' views seriously, and I'm quite prepared to give the idea away if the learning hump looks like it would be too high for any of the other local technicians I know.
posted by flabdablet at 3:30 AM on December 22, 2008

