How do I manage a single database with multiple web servers?
July 28, 2011 3:41 PM   Subscribe

How do I ensure data integrity when using multiple load-balancing servers?

I'm sure this is a simple question for some of you, but this is a new area for me. The basic setup is this: I've got a single database server that's used to create giant lists of emails to be sent out (not spam! I promise!). In order to increase throughput, I intend to create multiple load balance servers on my rackspace cloud. So basically each of these servers will do nothing but query the database and get a bunch of information about what to send, then send it, then report back to the db about what's been sent. The question is this: what's the most effective means for keeping the load balanced servers from stepping on each others' toes, datawise (i.e., how do I stop two servers from accidentally grabbing the same data between the SELECT and UPDATE queries)? I'm sure there's a simple answer to this, but I don't know what it is.
posted by vraxoin to Computers & Internet (5 answers total)
 
Transactions and row locking. SELECT ... FOR UPDATE. MySQL has a similar feature on InnoDB.
posted by sbutler at 3:51 PM on July 28, 2011 [1 favorite]


If you've got two web front ends and one database on the backend, you shouldn't have anything to worry about. If your web app can deal with one web front end and multiple simultaneous visitors, there is no difference between "visitor A and visitor B on one web front end doing task A and task B" and "visitor A on web front end A doing task A and visitor B on web front end B doing task B" really close together. If you're worried about task A interfering with task B, then the problem you have isn't load balancer specific, it's a fundamental flaw in your web app. (See transactions and locking as already mentioned.)

Now if you're going to have 1+ web front ends (though I'm assuming at least 2+ because it sounds like you've got significant traffic or maybe you just want HA) and 2+ database servers on the backend, you want replication. If you do MySQL, you'll want to read up on synchronous versus asynchronous replication. Off the top of my head, if you want basic HA, just opt for MySQL master master replication, but don't hold me to that last line.
posted by Brian Puccio at 6:05 PM on July 28, 2011


What do you call "giant"? When things get big, there is nothing faster than a flat file.

Sending emails usually doesn't need to have the record status updated in real time and if that is the case, I wouldn't even bother with having the process(es) connecting to the database.

Give each one its own data to process (the said flat files). If you need to update the status in the db, use separate log files that track progress and have another process read the logs to update the db.

If you really need the mailing process to use the db, provide each process with its own table to process.

Lots of possibilities, once you realize that redundancy isn't always bad.
posted by w.fugawe at 3:03 AM on July 29, 2011


Transactions: FWIW your question comes off as slightly inconsistent; at one point you state:
So basically each of these servers will do nothing but query the database and get a bunch of information...
but then, further on, you state:
(i.e., how do I stop two servers from accidentally grabbing the same data between the SELECT and UPDATE queries)?
Which is it: do you need concurrent read-only access or concurrent read-write access? This is a vital piece of information, but I suspect you need concurrent read-write access, and sbutler already gave the right answer. The main reason people use relational databases is for their awe-inspiring ACID-compliance and transactional abilities.
posted by asymptotic at 4:07 AM on July 29, 2011


Transactions are so fundamental to relational databases, and some of the answers posted are so shockingly poor, that I'm going to copy-paste some choice excerpts about what a transaction is.

From the earlier PostgreSQL link I posted:
Transactions are a fundamental concept of all database systems. The essential point of a transaction is that it bundles multiple steps into a single, all-or-nothing operation. The intermediate states between the steps are not visible to other concurrent transactions, and if some failure occurs that prevents the transaction from completing, then none of the steps affect the database at all.
and from the Wikipedia article:
A database transaction, by definition, must be atomic, consistent, isolated and durable. Database practitioners often refer to these properties of database transactions using the acronym ACID.

Transactions provide an "all-or-nothing" proposition, stating that each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage.
posted by asymptotic at 4:11 AM on July 29, 2011


« Older Vista attempting to update a phantom program   |   Best cell phone camera? Newer »
This thread is closed to new comments.