Application Modernization - Part 3: Unravel the data
Data in applications has a powerful gravitational pull, so transforming and modernizing applications is heavily impacted by existing data. Digital transformations have focused on enabling new cloud applications and front ends while remaining connected to existing transactional backends. This approach has highlighted issues in handling shifts of traffic load between reading and writing transactional data. I had direct experience with this on a couple of projects:
- Online ticketing: moving from traditional sales at stations to online apps and services leads to a significant reduction in the conversion rate from travel searches to tickets sold (e.g. an online user plans several trip options before committing to a purchase), increasing resource consumption.
- Online banking: the adoption of online banking greatly increased the volume of query operations (e.g. end users reading their account statements online) compared to actual financial transactions. To contain the increased load and cost on the backend transactional systems, a data replication approach (a.k.a. Copy Banking) can be used to offload query traffic from the expensive mainframe to a hybrid cloud containerized infrastructure.
To continue the cloud journey, we also need to refactor the backend transactional systems, with their persistent data (mission-critical and financial in nature), and make the new services able to scale adequately and cost-effectively.
Breaking down the monolith
Domain-Driven Design (DDD) is a very good tool to break down the business domain and identify the parts of the business applications that can be modernized and operated as independent microservice components. Analysis of the data is very helpful in this process, to identify contexts based on different data relationship types:
- Foreign key relationships: Group tables that are related to the same entity
- Transactional relationships: Group tables that are updated in the same transaction
These groups of related tables are indications of candidate DDD aggregates. A group is a good candidate when its tables are tightly coupled to each other but have few relationships with tables outside the group.
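As an illustration (not a tool referenced in this article), the sketch below groups tables connected by foreign-key or transactional relationships into connected components; each component is a candidate aggregate. All table names are hypothetical.

```java
import java.util.*;

public class AggregateCandidates {

    // Treat foreign-key (or same-transaction) relationships as undirected edges between tables
    static Map<String, Set<String>> graph = new HashMap<>();

    static void relate(String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Each connected component is a group of tightly related tables
    static List<Set<String>> connectedComponents() {
        List<Set<String>> components = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String table : graph.keySet()) {
            if (visited.contains(table)) continue;
            Set<String> component = new HashSet<>();
            Deque<String> stack = new ArrayDeque<>(List.of(table));
            while (!stack.isEmpty()) {
                String current = stack.pop();
                if (!visited.add(current)) continue;
                component.add(current);
                stack.addAll(graph.getOrDefault(current, Set.of()));
            }
            components.add(component);
        }
        return components;
    }

    public static void main(String[] args) {
        relate("ACCOUNT", "ACCOUNT_ENTRY");   // foreign key
        relate("ACCOUNT", "ACCOUNT_HOLDER");  // foreign key
        relate("TRIP_SEARCH", "TRIP_OPTION"); // updated in the same transaction
        // Components with few links to tables outside the group are the best aggregate candidates
        connectedComponents().forEach(System.out::println);
    }
}
```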
This is such a good strategy that the largest chapter of the "Monolith to Microservices" book by Sam Newman focuses explicitly on patterns for managing data handling.
The identified contexts can then be extracted (code and data together) using variations of the "Strangler Pattern" to create the independent replacement components.
Coexistence architecture
While the strangler pattern can be applied to the front-end and application components using HTTP proxy/routing techniques to dynamically replace sections of the application (as in the sketch below), the need to see consistent data requires other techniques.
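As a rough illustration of that routing idea, here is a minimal, simplified sketch using only the JDK: requests for an already-modernized context (a hypothetical "/tickets" path) go to the new service, everything else still hits the legacy backend. It forwards only simple GET requests and is not the cookbook's implementation.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StranglerProxy {
    private static final HttpClient client = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        HttpServer proxy = HttpServer.create(new InetSocketAddress(8080), 0);
        proxy.createContext("/", exchange -> {
            String path = exchange.getRequestURI().getPath();
            // Route modernized contexts to the new service, the rest to the legacy backend
            String target = path.startsWith("/tickets")
                    ? "http://modernized-service:9090"
                    : "http://legacy-backend:9091";
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(target + path)).GET().build();
                HttpResponse<byte[]> response =
                        client.send(request, HttpResponse.BodyHandlers.ofByteArray());
                exchange.sendResponseHeaders(response.statusCode(), response.body().length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(response.body());
                }
            } catch (InterruptedException e) {
                exchange.sendResponseHeaders(502, -1); // upstream call interrupted
                exchange.close();
            }
        });
        proxy.start();
    }
}
```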
Rolling out a modernized application is rarely a big-bang event; instead, there are long periods of parallel deployment and testing of the mission-critical applications. This need for "coexistence" requires a suitable architecture, and at IBM we have a reference cookbook that provides some key fundamental patterns.
These patterns help in moving out the data and managing the transition in either direction:
- Current to Modernized
- Modernized to Current
Let's have a look at these patterns, focusing on the Current to Modernized direction.
Change Data Capture
Current system events are discovered by consuming changes in the system's existing artifacts (typically database files and/or data sources). This pattern is used when changes to the Current programs are not affordable or strategically desirable; in that case, however, additional effort must usually be spent to adapt the events for the destination systems, with on-demand transformations that often duplicate existing legacy logic.
Key tools for implementing this pattern are:
- IBM® Change Data Capture (CDC Replication)
- Oracle GoldenGate
- Debezium, as an open source CDC option
Pros | Cons |
---|---|
No changes to the Current programs are required | On-demand transformations often have to duplicate existing legacy logic |
Changes are captured directly from existing artifacts (database files and/or data sources) | CDC emits every low-level data change, so downstream filtering is usually needed |
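As a minimal sketch of what consuming such change events can look like, the snippet below reads Debezium-style records from a Kafka topic. The topic name, broker address and group id are hypothetical, and Debezium's default JSON envelope (a "payload" holding "op", "before", "after") is assumed.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CdcConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "modernized-accounts");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.banking.accounts"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (record.value() == null) continue; // tombstone record, nothing to transform
                    JsonNode payload = mapper.readTree(record.value()).path("payload");
                    String op = payload.path("op").asText(); // c=create, u=update, d=delete, r=snapshot read
                    JsonNode after = payload.path("after");  // row image after the change
                    // The on-demand transformation to the modernized data model would happen here,
                    // which is where legacy logic tends to get duplicated.
                    System.out.printf("op=%s after=%s%n", op, after);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```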
Application Event Streaming
This pattern is used to replicate the Current system application's state through application changes that expose all business events to one or more event streams. Events are sourced to destination systems with on-demand transformations (usually limited to adapting model schemas, not replicating business logic).
Events are exposed after the business operation is completed, to avoid phantom events. To support this, it is advised to use a message-passing middleware with transactional guarantees (either local transactions or exactly-once/at-least-once delivery guarantees), as in the sketch at the end of this section. Such middleware can be:
- Queue-based systems (JMS, IBM MQ, etc...)
- Topic/Partition based (Kafka)
Further articles in this series will go more in-depth on events/messaging.
Pros | Cons |
---|---|
Exposes real business events; transformations are usually limited to adapting model schemas, without replicating business logic | Requires changes to the Current system's applications |
Phantom events are avoided by exposing events only after the business operation completes | Depends on a message-passing middleware with transactional guarantees |
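To illustrate the transactional-guarantee point with Kafka, here is a minimal sketch of a transactional producer: the business event becomes visible to read_committed consumers only once the send is committed. The topic name, key and payload are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BusinessEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "account-events-publisher");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Publish the business event only after the Current system has completed
                // the underlying business operation, to avoid phantom events.
                producer.send(new ProducerRecord<>("account-business-events",
                        "IT0000123", "{\"type\":\"PaymentExecuted\",\"amount\":100.00}"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // read_committed consumers never see the aborted event
            }
        }
    }
}
```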
Filtering & Transformation engine
This pattern is used to reduce event streams to the relevant events by applying transformation and filtering rules and by caching processed events.
Filtering rules are used to:
- Drop irrelevant events (CDC, for instance, sends every change)
- Route to different/multiple event destinations
Event cache is used to:
- Filter out duplicate events (e.g. coming from multiple sources)
- Guarantee idempotent retries of event transformations
- Aggregate low-level events
Transformations can become complex in both CDC and Event Streaming scenarios, especially if they need to duplicate the business logic of the Current system. EAI pattern-aware tools such as Camel/Fuse are particularly useful for these kinds of transformations.
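As an example of what such an engine can look like, here is a minimal sketch of an Apache Camel route, assuming Camel 3.x with the camel-kafka component; the topic names, header names and routing rule are hypothetical.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.support.processor.idempotent.MemoryIdempotentRepository;

public class FilterTransformRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("kafka:cdc-raw-events?brokers=kafka:9092")
            // Filtering rule: drop irrelevant events (CDC sends every change)
            .filter(header("eventType").isNotEqualTo("TECHNICAL"))
            // Event cache: skip duplicates from multiple sources and keep retries idempotent
            .idempotentConsumer(header("eventId"),
                    MemoryIdempotentRepository.memoryIdempotentRepository(10_000))
            // Transformation step: adapt the legacy record to the modernized schema
            .process(exchange -> {
                String legacy = exchange.getMessage().getBody(String.class);
                exchange.getMessage().setBody(toModernizedModel(legacy));
            })
            // Route to different/multiple event destinations
            .choice()
                .when(header("domain").isEqualTo("accounts")).to("kafka:modernized-accounts")
                .otherwise().to("kafka:modernized-other");
    }

    private static String toModernizedModel(String legacyPayload) {
        // Placeholder for the (possibly complex) transformation logic
        return legacyPayload;
    }
}
```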
Pros | Cons |
---|---|
Reduces event streams to the relevant events (filtering, routing, deduplication, aggregation) | Transformation logic can become complex, especially when it has to duplicate the Current system's business logic |
The event cache makes retries of event transformations idempotent | Adds another component (engine and event cache) to build and operate |
Choosing an approach
Ultimately, CDC is not a silver bullet for modernizing systems. All these patterns are tools for modernizing applications, each with its own strengths and complexity, and each, in the wrong context, can become an anti-pattern.
We need to choose the right balance of complexity for the solution we want to build. As a criterion for this litmus test, I would use the envisioned target architecture, which can either completely replace the old system or partially reuse it:
- Replace: The old system no longer has strategic value, so the goal is to move out of the legacy and transition completely to a new solution. In this scenario, CDC with Filtering and Transformation logic eases the transition to the new data model, with the goal of removing these duplicated-logic steps as soon as the transition is complete. This scenario fits better with a unidirectional data flow.
- Coexist: The old system still has strategic value, so the goal is to regain ownership of the old legacy application and integrate it with the new components. In this scenario, Event Streaming works best at decoupling the systems and replicating data based on a shared semantic model, but CDC might still be used effectively if the two data models do not require significant transformation logic to be maintained. It is also a more appropriate target solution for bidirectional data flows.
Especially with the second approach, it is better to avoid a two-speed-architecture organizational model and instead include the legacy application teams in the overall IT process, to foster an effective Agile/DevSecOps environment for maintaining the two systems and their data flows.