Amlan Dasgupta
What is Distributed Caching
A simple application network architecture looks something like this:

Users hit the application server with data requests. The app server in turn hits the database server with those requests and then processes the results to serve those user requests. In this scenario, when the number of users increases, so do the network calls to the db server, increasing traffic and I/O time. Computational requests also increase, which proves costly over a period of time.
To resolve this we use a memory cache. Below is a simplified architecture for the same:

The first use case is that you get a lot of requests for profile information for a certain set of users, resulting in a lot of network calls. In this case you save the profile in a memory cache as a key-value pair and serve it directly from the cache, thus reducing network calls.
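A minimal sketch of this lookup in Python; the in-memory dict and the db.fetch_profile helper are illustrative assumptions, not a specific library:

```python
# Profile lookup backed by an in-memory key-value cache.
profile_cache = {}  # e.g. {"user:42": {...profile fields...}}

def get_profile(user_id, db):
    key = f"user:{user_id}"
    profile = profile_cache.get(key)
    if profile is not None:
        return profile                    # cache hit: no network call
    profile = db.fetch_profile(user_id)   # cache miss: hypothetical db call
    profile_cache[key] = profile          # store it for the next request
    return profile
```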
The second use case is that you get a lot of common computational requests, like quantity of products sold by region. That can prove very costly if you go to the db, get the different products, group them by region and compute the totals every time a request comes in. What you can do is compute it once, store the result as a key-value pair in the cache and return the result from the cache every time a similar request comes in, thus avoiding network calls and repetitive computations.
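A similar sketch for the computed result, again with a hypothetical db.run_query helper standing in for the real database call:

```python
# Cache an expensive aggregation once and serve it from memory afterwards.
aggregate_cache = {}

def products_sold_by_region(db):
    key = "sales:quantity_by_region"
    if key in aggregate_cache:
        return aggregate_cache[key]       # reuse the stored computation
    rows = db.run_query(
        "SELECT region, SUM(quantity) FROM sales GROUP BY region"
    )
    result = {region: total for region, total in rows}
    aggregate_cache[key] = result         # compute once, serve many times
    return result
```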
The third use case is when there are lots of application servers hitting the same db server and the load on the db server increases. In that case you can create a distributed cache where you store different result sets across cache nodes, fetch them from the appropriate node and reduce the load on the db server. But you have to be very careful here, because memory cache is costlier than commodity hardware, so a thorough cost-benefit analysis has to be done, i.e. does the speed of response outweigh the cost.
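One simple (if naive) way to decide which cache node holds a given result set is to hash the key onto a node, so every app server agrees on where it lives. A rough sketch with made-up node addresses:

```python
import hashlib

# Hypothetical cache node addresses; in practice these would be Redis or
# Memcached instances spread across machines.
CACHE_NODES = ["cache-1:6379", "cache-2:6379", "cache-3:6379"]

def node_for_key(key: str) -> str:
    # Hash the key and map it to a node so every app server picks the same one.
    digest = hashlib.md5(key.encode()).hexdigest()
    return CACHE_NODES[int(digest, 16) % len(CACHE_NODES)]

print(node_for_key("sales:quantity_by_region"))  # e.g. "cache-2:6379"
```

Plain modulo hashing reshuffles most keys when a node is added or removed, which is why production systems usually use consistent hashing instead.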
Also, you cannot store infinite data in your cache, as that proves counterproductive. You can only store a subset of the data from your database, so you have to have policies for what to store in your cache and when to evict data from it. Here are some popular cache policies:
- Least Recently Used (LRU): You keep a queue of recently used values, for example those used in the last 5 seconds/minutes/hours, and bump the entry at the bottom of the queue off when you add a new one. In this way you keep the most recently used data values and discard the oldest, least recently used data (a sketch follows after this list).
- Least Frequently Used (LFU): This type of cache algorithm manages the cache by keeping track of the number of times a block is referenced in memory and, once the cache is full and needs more room, purging the one with the lowest reference frequency.
- Sliding Window Policy: A sliding window policy maintains a queue of a specified size, into which entries flow. When the queue is full and a new entry arrives, the oldest entry in the queue is removed from the window (FIFO).
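A minimal LRU sketch in Python, using collections.OrderedDict to track recency (the capacity here is arbitrary):

```python
from collections import OrderedDict

class LRUCache:
    # Tiny LRU cache: the first key in the OrderedDict is the least recently used.
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None                       # cache miss
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes most recently used
cache.put("c", 3)    # evicts "b", the least recently used entry
```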
It is very important to have the right eviction policy in your memory cache. Some of the problems you might face are as follows:
- Extra Call: With a bad eviction policy, the purpose of having a memory cache is defeated if for every request you have to go back to the data store to service that call. You are making that extra call and getting no benefit for the cost incurred.
- Thrashing: This occurs when the cache is too small, leading to entries being constantly loaded and evicted without their results ever being used.
- Data consistency: When a server makes an update to a record that is already in the cache, the change is not reflected in the cache entry unless the record is evicted and reloaded.
Now the question arises: where do you place the cache – A) near the app servers, or B) near the database?
- Placing it near the server makes responses faster, but it eats up a huge chunk of memory on the app servers, resulting in application performance lag. It can lead to data inconsistencies between the caches on different servers. Most important, if any app server goes down it takes its cached data to the graveyard.
- The second option is implementing a global cache near the database using caching tools like Redis, where you can scale the memory up and down. Data is more consistent, the cache is managed, and there is less chance of failure. The only drawback is that responses arrive a little later, which is negligible.
My personal choice would be option B (sketched below). It allows for data replication, which reduces the impact when a memory segment goes down, and also helps with data consistency. Tenancy rules can also be implemented at an enterprise level.
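A rough sketch of option B with the redis-py client; the host name and the 5-minute TTL are illustrative assumptions:

```python
import json
import redis

# Shared Redis instance that every app server uses as the global cache
# (the host name is a made-up example).
cache = redis.Redis(host="global-cache.internal", port=6379, db=0)

def get_profile(user_id, db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # served from the global cache
    profile = db.fetch_profile(user_id)           # miss: hypothetical db call
    cache.set(key, json.dumps(profile), ex=300)   # keep it for 5 minutes
    return profile
```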
Lastly, how do you keep the cache and database values consistent? There are two methods (both are sketched below):
- Write-Through Cache: When there is an update to a record, first check whether the record is in the cache. If it is, update that cache entry and then go ahead and update the same record in the database.
- Write-Back Cache: If there is an update to a record, do it first in the database. Then check whether the record exists in the cache, and if found, write the same update back to the cache entry.
These two methods will keep your cache and database consistent.
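A compact sketch of both strategies as described above, assuming hypothetical cache and db clients with get/set and update methods:

```python
def write_through(key, value, cache, db):
    # Update the cache entry first (if the record is cached), then the database.
    if cache.get(key) is not None:
        cache.set(key, value)
    db.update(key, value)

def write_back(key, value, cache, db):
    # Update the database first, then write the same change back to the cache.
    db.update(key, value)
    if cache.get(key) is not None:
        cache.set(key, value)
```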







