Split Brain Server Conundrum

A typical Application Server set up looks something like below :

No alt text provided for this image

In Figure 1, various client applications are accessing application servers via a load balancer and hitting the database. This database can be your single point of failure in this entire architecture. So to minimize risk, you add a failover database. That database can be replicated synchronously or asynchronously depending on the nature of the business and the velocity of transactional data. So, in the above scenario all transactions are hitting the main database and is known as the Master and the replicated db is known as the slave. None of the web servers direct any transactions to the slave db.So when the Master goes down the Slave is commissioned to handle the interim transactions till the Master is commissioned again after replicating the interim transactions

No alt text provided for this image

Now think of the possibility of the web servers having the capability of transacting with both the database servers to avoid congestion. In that case both servers will be replicating each other in real time and exchanging messages(transactions).

No alt text provided for this image

So at a given point of time both database servers are in the same state and if one fails the other takes over the mantle of handling all the transactions with minimal lag time. This event kicks off when one of the server fails and the other stops getting messages from the peer server and assumes that it’s the only working database server now. But imagine a scenario when the router in between the two server fails. In that case both servers loose communication and assume the mantle of primary server and commits transaction without getting transaction receipt from the other server.

Let’s look at 2 scenarios :

Account holder John Doe has a balance of $1000

Scenario 1 – Both servers working

John Doe withdraws $700

Transaction hit database server1 it updates the a/c balance sends message to database server2 which also updates it’s a/c balance and sends receipt back. Both are in same state and commit the transaction.

John Doe tries to withdraw additional $500

Now the transaction hits database server2. It sees that John does not have the requisite balance as it had already updated the copy of its database from transaction of server1

So it fails the request. 

The system works as planned.

Scenario 2 The router between the two database server fails and each assumes other has gone down

John Doe withdraws $700

Transaction hits database server1, it updates the a/c balance and does not send message to db server2 as it has assumed it has gone down and goes ahead and commits the transaction. Right now the database copies are in different state. In one DB Server1John Doe has a balance of $300 and in DB Server2 John Doe has a dance of $1000

John Doe tries to withdraw additional $500

Now the transaction hits database server2. It sees that John does has the requisite balance as it does not have an#Distru updated the copy of its database from transaction of server1So it approves the request. 

The system fails.

This scenario is known as the split brain conundrum where the communication is lost between the 2 brains.

One of the ways to resolve this is as follows:

Introduce a third db server like shown below:

No alt text provided for this image

So you see in the diagram above we introduce a third db server which serves as a slave to both Master 1 and Master 2 and gets replicated from both the servers and are in constant communication in realtime.

So now let’s see how this resolves our problem above.

Scenario 3 The router between the two database server fails and assumes other has gone down but the communication between the 2 master databases and Distribution database is still intact

Let’s say before any transaction is being processed John Doe’s a/c is in state S-0 in all the 3 DB servers.

After Jon withdraws $700, the transaction hits Master1 Db. Master1 communicates the same to Dist db. The Dist DB sees that both are in the same state and a new transaction came in so it updates it’s DB and sends a success communication back to Master1 (commits). Dist db and Master1 are now in state S-1 after commit. But since the Master2 communication with Master1 has failed Master2 is still in state S-0.

Now John tries to withdraw $500. The transaction hits Master2. It updates it’s DB and goes into state S-2. Now it communicates to Dist server to update it’s DB. The Dist server sees that they are in two different state (S-1 & S-2) and communicates the same to Master 2 with a failed receipt. Master2 rolls back it’s transaction. Updates it’s db from Dist db to go to state S1. It now sees insufficient funds and fails the request.

Viola!!! we took care of the split brain problem.This solution is known as distributed consensus (which is whole another topic in itself) and can be resolved with different algorithms like 2 phase commit or Multi Version concurrency control. A topic for another blog.

Hope I resolved some design issues. Have any question ask in the comment section.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s