How can I do a multiple datacenter failover / load balance and have a shared filesystem available in both locations?
July 25, 2007 3:41 PM

I need to set up my Linux servers to access a shared filesystem from multiple datacenters and to have the data redundant should one of the datacenters go offline. Differing technologies seem to handle only half the problem. GFS or AFS will allow the filesystem to be distributed across the datacenters but don't provide data redundancy should a datacenter go offline... what does? What product or technology should I be looking at?
posted by dirtylittlemonkey to Computers & Internet (9 answers total)
Speaking from recent experience, GFS is a lot more fragile than Red Hat makes it out to be. First... GFS2 is what currently ships with EL5, but GFS1 is what's supported (it ships too, but patches appear to be slow, and the initial 5.0 release was buggy). You'll need to run the clustering suite in order to get GFS working, and with power fencing (which you want/need) you really want the boxes on the same subnet as the power switches that they'll be using to fence misbehaving nodes.

Additionally, unless you're paying for some really fat pipes between the redundant datacenters, everything is going to be dog slow whenever any of the servers updates the shared storage. You can probably achieve what you're trying to do by strapping together multiple GFS clusters, but it's going to be an awful, headache-inducing, sticky mess, one that you'll likely own forever or be cursed for by your replacement's children's children.

A good place to ask this question if you're still not turned off from GFS is here.

In terms of what else I'd consider...
  • Store everything in a database, replicate the master(s) to slaves at each location.
  • MogileFS
  • s3
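A rough sketch of the first option, keeping file content in a replicated database instead of a shared filesystem, using Python's sqlite3 as a stand-in (the table and function names here are invented; in production the connection would point at a replicated MySQL/PostgreSQL master):

```python
import sqlite3

def open_store(path=":memory:"):
    """Open a blob store; in production this would be a replicated DB."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, body BLOB)")
    return db

def put_file(db, path, data):
    # Writes go to the master; each datacenter reads from a local slave.
    db.execute("INSERT OR REPLACE INTO files (path, body) VALUES (?, ?)",
               (path, data))
    db.commit()

def get_file(db, path):
    row = db.execute("SELECT body FROM files WHERE path = ?", (path,)).fetchone()
    return row[0] if row else None

db = open_store()
put_file(db, "/app/config.xml", b"<config/>")
print(get_file(db, "/app/config.xml"))  # b'<config/>'
```

Reads can then be served from a local replica in each datacenter, with writes funneled to the master.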

posted by togdon at 3:58 PM on July 25, 2007

You could do Snapshots and then use some technology like SnapMirror to move all your data over...

But in reality most large systems like this rely on a database replication scheme to do this.

Why? It's a bandwidth issue shipping all that data over constantly, and databases are much further advanced than filesystems at figuring out the minimum deltas they need to ship over in order to reconstruct everything at the far end.
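The delta point can be made concrete with a toy sketch (this is only an illustration of the idea, not how database replication logs actually encode changes):

```python
import difflib

# A 1000-line "file" with a single line edited, standing in for a big
# dataset where only a tiny fraction changes between syncs.
old = ["line %d\n" % i for i in range(1000)]
new = list(old)
new[500] = "line 500, edited\n"

# Shipping the whole file means 1000 lines; shipping a delta means only
# the changed line plus a little surrounding context.
delta = list(difflib.unified_diff(old, new, lineterm="\n"))

print(len(old))    # 1000
print(len(delta))  # a handful of lines
```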
posted by vacapinta at 4:13 PM on July 25, 2007

I don't know that much about this but I know that my company uses SRDF and NFS for business continuity purposes similar to what you are describing (you might try googling "business continuity" to get some more general ideas about how people handle this sort of thing).
posted by ch1x0r at 5:27 PM on July 25, 2007

rsync is free.

A great deal depends upon what you are syncing, and how much of it you need to sync. Also, are your uptime needs really that great? If you had a day of downtime because a datacenter went out (and it took that long to recover from tape), what would the real business impact be?

Datacenters do not go down very often, and there are very few things in life that need real-time failover between cities. These kinds of recovery times cost a lot of money, for something that probably will never happen. If you're willing to wait a day for your recovery, you can save an astronomical amount of money, in most cases.
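As a rough illustration of the idea behind rsync, copying only what actually changed, here is a toy Python sketch (the function names are mine; real rsync also computes block-level deltas within large files and handles much more):

```python
import hashlib
import os
import shutil

def _digest(path):
    """Content hash of a file, used to decide whether it changed."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def sync_changed(src, dst):
    """Copy only regular files from src whose content differs in dst."""
    os.makedirs(dst, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src)):
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if not os.path.isfile(s):
            continue  # a real tool would recurse into subdirectories
        if not os.path.exists(d) or _digest(s) != _digest(d):
            shutil.copy2(s, d)  # copy2 preserves mtimes, like rsync -a
            copied.append(name)
    return copied
```

Run from cron every night, something in this spirit is often all the "disaster recovery" a site really needs.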
posted by popechunk at 6:45 PM on July 25, 2007

As far as I understand, DRBD fulfills your requirements, but I've never used it and can't attest to its reliability and/or performance, especially if the link between the two sites is slow. Personally, I've only ever heard about it in cases where the machines are next to each other in a LAN.
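For concreteness, a DRBD resource pairing a block device on each host is declared in a config roughly like this (an illustrative DRBD 8.x-style fragment; the hostnames, disks, and addresses are invented, and protocol A, asynchronous replication, is what you would pick over a slow inter-site link rather than the default synchronous protocol C):

```
resource r0 {
  protocol A;              # asynchronous; C (synchronous) would stall on a WAN
  on alice {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.1.1.31:7789;
    meta-disk internal;
  }
  on bob {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.2.1.31:7789;
    meta-disk internal;
  }
}
```

Note that classic DRBD is two-node and single-primary: the standby side can't mount the filesystem until it's promoted, so it covers the failover half of the question but not simultaneous access from both sites.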
posted by themel at 10:45 PM on July 25, 2007

There has got to be an easier way of accomplishing what you're trying to do. Geographically dispersed filesystems are not easy or simple and it's likely that not everything needs to be shared.

Funny that this question shows up the day after the SF outage.
posted by rhizome at 11:21 PM on July 25, 2007

what is the higher level problem you are trying to solve with your proposed architecture?
posted by Good Brain at 2:26 AM on July 26, 2007

"...What product or technology should I be looking at?"

Object database management systems (ODBMS) are one bleeding-edge technology for what you describe. Versant, Progress, and other companies sell specialized commercial systems you might want to investigate, but cost and complexity are not trivial for the benefits you seek. These systems are often not Linux-based, simply because they require high-end hardware capabilities (NUMA, high-performance I/O, virtualized storage, etc.) that Linux can't deliver on commodity hardware.

If really high reliability is your issue, you're possibly a mid-range system or mainframe customer. Seriously. The mainframe has stuck around for several primary reasons, six 9s+ reliability being a principal one. Virtual Linux servers on a mid-range or mainframe machine (depending on your workload), or, if multi-site is a must, on mid-range or mainframe partitions at physically separate sites, may be your best solution. The ability of mid-range and mainframe systems to abstract storage and handle data access efficiently across different systems and locations, as operating system services, vastly simplifies the problem you describe and ensures reliability better than your data links will in multi-site situations.
posted by paulsc at 5:56 AM on July 26, 2007

Response by poster: Thanks everyone, you've reaffirmed my views and brought up many new systems I'd not considered.

The real problem is that our applications were architected to rely on a common shared filesystem. If/when we port them to be DB-driven, most of these issues are easily resolved.

We are investing in setting up servers overseas to support a customer and are looking to maximize the investment by providing cross-datacenter load balancing... easier said than done.

I'll probably play the 'Modify Expectations' card. They can have failover, but not load balancing.

rhizome: It's no coincidence.
posted by dirtylittlemonkey at 11:36 AM on July 26, 2007
