Application Modernization - Part 2: Transactions and events in microservices
Cloud usage has steadily increased in recent years, with most enterprises adopting cloud solutions to deliver new applications and migrate workloads out of proprietary data centers. However, application modernization towards the cloud has not yet started for many mission-critical systems and workloads. What's slowing this down? The reasons lie in the complexity of interactions within enterprise business platforms and, at the heart of it, the transactionality challenge of handling critical business data.
A move to the cloud needs to tackle this issue in some way to be effective. As a first step, let's define the problem in the context of existing legacy platforms.
Traditional Transactional applications
Some examples of transactional systems I have worked with include:
- Financial and banking systems that move money between accounts. Every operation must complete atomically, and failures should not cause side effects.
- Commercial platforms for Travel and Transportation, which ensure the consistent completion of the ticket purchase process: either the ticket is issued, the seat is reserved, and the payment is finalized as a single action, or nothing happens at all.
To achieve this, traditional transactional applications are built on key enabling technologies such as the Mainframe or the Relational Database, which provide facilities (e.g., two-phase commit, ACID properties) to simplify the creation and widespread use of transactional logic.
What do a DB and a Mainframe have in common? The presence of a centralized component that ensures the consistency of the system. But what happens if we move toward cloud-native solutions with multiple interacting components on a globally distributed scale?
Why not Distributed Transactions?
A distributed and partitioned system introduces new failure modes and additional network delays, and this makes the scenario more complex:
- Distributed transactions in SOA/Web Services solutions proved to be complex and brittle, with additional pain points created by tightly coupling systems.
- Middleware such as relational databases becomes vulnerable to networking issues (e.g., split brain).
Just as Newtonian physics encountered one of its limits in handling the speed of light, here the speed and availability of the communication network force us to rethink the approach in broader terms.
Enter the CAP (or Brewer's) theorem, which tells us that "any distributed system or data store can simultaneously provide only two of three guarantees: consistency, availability, and partition tolerance (CAP)".
A new theory must improve on an old one, so how does the CAP theorem apply to legacy systems?
As seen above, legacy systems are based on centralized, high-performing systems, such as relational DBMSs or Mainframe boxes. The key aspect of these solutions is the very limited use of partitioning and a strong focus on ACID consistency, with replication for High Availability and Disaster Recovery.
So, these systems are examples of the Consistent-Available (CA) type of solution allowed by the CAP Theorem.
Towards Eventual Consistency
Now, with the cloud, we are choosing to increase the partitioning and distribution of our systems.
So we are left with two types of possible systems:
- Consistent-Partition Tolerant (CP)
- Available-Partition Tolerant (AP)
While it would be nice to guarantee Consistency, the need to avoid compromising end-user service in a cloud-distributed solution composed of many moving parts, each of which can fail at any time, forces us to favor Availability.
But not all is lost on the Consistency side. While there is no guarantee that all clients see the same data at the same time, it is possible to converge quickly toward a shared data consensus (a.k.a. Eventual Consistency).
Tools to achieve it
Let's see how to combine some techniques to achieve this.
Component internal transactions
Fortunately, just as General Relativity did not invalidate the tools of Newtonian physics when applied in specific contexts, we can often continue to use traditional ACID transactions inside each component, since each component is tightly coupled in its implementation, with limited internal networking and partitioning.
This is a key criterion in Microservice Decomposition, which can be guided by the transactional boundaries of specific business entities to effectively partition a monolithic system. Domain-Driven Design (DDD) techniques are based on defining atomic operations on "aggregates", which group the business entity data that should be updated transactionally.
How to break down a system thus becomes a critical architectural decision, requiring an evaluation of the trade-offs between coarse-grained and fine-grained service decomposition, since services performing transactional coordination might be more effective with the former approach.
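As an illustration, here is a minimal sketch of a component-internal ACID transaction, assuming a hypothetical Order aggregate owned by a single service and a plain JDBC DataSource; the table and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical repository for an "Order" aggregate owned by a single microservice.
// All tables touched here belong to the same service-local database, so a plain
// ACID transaction is enough; no distributed coordination is involved.
public class OrderRepository {

    private final DataSource dataSource;

    public OrderRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void confirmOrder(long orderId, long paymentId) throws SQLException {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false); // start the local transaction
            try (PreparedStatement updOrder = con.prepareStatement(
                         "UPDATE orders SET status = 'CONFIRMED' WHERE id = ?");
                 PreparedStatement updPayment = con.prepareStatement(
                         "UPDATE payments SET status = 'CAPTURED' WHERE id = ?")) {
                updOrder.setLong(1, orderId);
                updOrder.executeUpdate();
                updPayment.setLong(1, paymentId);
                updPayment.executeUpdate();
                con.commit(); // both updates become visible atomically
            } catch (SQLException e) {
                con.rollback(); // nothing is applied on failure
                throw e;
            }
        }
    }
}
```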
Idempotency
When communicating across components, it's necessary to handle and resolve networking errors. Since messages can be lost either before or after the remote system receives them, a mechanism for safe retries is needed. This can be implemented using techniques that ensure idempotency. In short, idempotency means that duplicate calls are not processed multiple times, preventing errors or unwanted side effects in the system (e.g., the same payment must be processed only once).
This can be implemented with steps like these:
- The system receiving a message checks whether the message has been processed before (e.g., by comparing its unique identifier to a list of previously processed messages).
- If the message has not been processed before, the system processes the message and adds its outcome indexed by a unique identifier to the list of processed messages.
- If there's a need for a synchronous response, the outcome (generated by the current or the previous call) is returned.
This pattern is particularly effective for asynchronous notifications since:
- no response is needed (no step 3)
- the delivery guarantees of messaging systems (e.g., JMS, MQ, or Kafka), such as once-only or at-least-once delivery, can be used to simplify retries, limiting the impact on the overall process latency.
- messaging systems usually provide a unique identifier for messages automatically
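Here is a minimal sketch of such an idempotent consumer, assuming a hypothetical payment service; the processed-message store is simplified to an in-memory map, whereas a real implementation would persist it together with the business data:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal idempotent consumer: duplicate messages (same messageId) are detected
// and the previously computed outcome is returned instead of being reprocessed.
public class IdempotentPaymentConsumer {

    private final Map<String, String> processed = new ConcurrentHashMap<>();

    public String onMessage(String messageId, String payload) {
        // Steps 1 and 2: atomically check whether the message was already processed;
        // if not, process it and store the outcome under its unique identifier.
        String outcome = processed.computeIfAbsent(messageId, id -> processPayment(payload));
        // Step 3 (only if a synchronous response is needed): return the outcome,
        // whether it was produced by this call or by a previous one.
        return outcome;
    }

    private String processPayment(String payload) {
        // Placeholder for the actual business logic (e.g., charging the payment).
        return "PAYMENT-OK:" + payload;
    }
}
```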
Compensations
This technique is typical of Business Process Modelling (BPM). BPM solutions target process integrations across heterogeneous systems, a scenario in which distributed transactions are not possible because of the loose coupling of the systems. In instances of a Business Process, compensations occur when the process cannot complete a series of operations after some of them have already been completed, so the process must backtrack and undo the previously completed steps.
The price to pay is the additional complexity of implementing the reversal operations and applying them to execute the backtrack correctly.
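As a minimal sketch, each step of the process can pair its forward operation with the reversal that undoes it; the ProcessStep interface and the seat-reservation step below are hypothetical examples:

```java
// Each process step exposes both the forward operation and the compensation
// that undoes it during backtracking.
public interface ProcessStep {
    void execute();     // forward operation (e.g., reserve the seat)
    void compensate();  // reversal operation (e.g., release the seat)
}

// Example step: reserving a seat, with its compensation.
class ReserveSeatStep implements ProcessStep {
    private final String seatId;

    ReserveSeatStep(String seatId) {
        this.seatId = seatId;
    }

    @Override
    public void execute() {
        System.out.println("Reserving seat " + seatId);
    }

    @Override
    public void compensate() {
        System.out.println("Releasing seat " + seatId);
    }
}
```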
SAGA pattern
This pattern combines the three previous techniques into a cloud-native BPM implementation: a sequence of local transactions, each updating the data under its own responsibility (based on the service decomposition model), with each subsequent step activated by the completion of the previous one. Failures and interruptions in the process are identified (e.g., using timeouts) and trigger compensation logic, while external systems are notified of updates via event notifications.
The SAGA pattern can be implemented with different approaches, using:
- a central process Orchestrator service
- a Choreography based on message passing
- a combination of both
Here's an example of how I approached this in designing a solution for selling transportation tickets:
I based the main transaction loop on a more coarse-grained service orchestrating the completion of the transactional process, but I leveraged idempotency and event streaming to trigger notifications that update target systems able to tolerate more delay in reaching an eventually consistent state.
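To make the orchestration approach concrete, here is a minimal sketch of an orchestrator that runs the local transactions in sequence and compensates the completed ones in reverse order when a step fails; it reuses the hypothetical ProcessStep interface from the compensation sketch above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal orchestration-based SAGA: steps run in sequence, and if one fails
// the already-completed steps are compensated in reverse order.
public class SagaOrchestrator {

    public boolean run(List<ProcessStep> steps) {
        Deque<ProcessStep> completed = new ArrayDeque<>();
        for (ProcessStep step : steps) {
            try {
                step.execute();          // local transaction of one service
                completed.push(step);    // remember it for possible compensation
            } catch (RuntimeException e) {
                // Failure detected: backtrack by compensating completed steps
                // in reverse order, then report the saga as failed.
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        // All steps completed: downstream systems can be notified via events here.
        return true;
    }
}
```

In the ticketing example, issuing the ticket, reserving the seat, and capturing the payment would each be a ProcessStep, and the event-streaming notifications to downstream systems would be published once the saga completes.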
Conclusion: designing applications
In the end, it is a trade-off between requirements and complexity. Different client contexts might require different solutions, even in the same industry (e.g., banking in Europe tends to pursue ACID consistency, while banks in other countries, such as the US, are more tolerant of eventual consistency).
So, a crucial skill for Lead Architects and Designers is being able to shape requirements and solutions together with other business and technical stakeholders. This can be facilitated by adopting a structured approach such as Enterprise Design Thinking to converge on mutually agreed expectations regarding solutions.
Transactional cloud-native systems based on microservices can be realized effectively, but the increase in complexity has to be managed. On the requirements side, Design Thinking and Domain-Driven Design are good tools to enable this, but implementation complexity should also be managed. In the future, I'll look at techniques such as AsyncAPI and DevSecOps that enable managing larger and more complex solutions.