Automating Observability of Complex Systems with GenAI

Large cloud and microservices solutions are challenging to operate due to their dynamic and distributed nature. A holistic observability platform, integrated into DevOps and augmented by GenAI, can help handle the operational and security challenges introduced by the EU's Cyber Resilience Act (CRA) and by guidance from the USA's CISA.

Observability in large, complex systems built on cloud and microservices is challenging because of their dynamic and distributed nature. The traditional approach to the CMDB (Configuration Management Database), which focuses on a slow-changing picture of IT infrastructure based on physical assets, is not suited to these modern, continuously changing systems: with numerous components that can be added, removed, or updated at any time, it is difficult to know the state of the system at any given point in time.

However, the challenge extends beyond visibility into the underlying infrastructure; it also encompasses understanding how the application components are linked together and how they utilize that infrastructure. Without this holistic view, it is difficult to identify and diagnose issues that span multiple components or services, and harder still to optimize performance and ensure reliability.

Another challenge is the sheer volume of data generated by these systems. With numerous components and services, there is a large volume of code, documentation, reports, logs, metrics, and traces to sift through. Without the right tools and processes in place, extracting meaningful insights from this data is difficult, which delays, or outright prevents, the prevention, identification, and resolution of issues, impacting the overall performance and reliability of the system.

Solving the Data Collection problem

Industry and de facto standards for collecting metrics and data on IT systems solve the data collection problem across multiple heterogeneous sources by establishing a common format and methodology, making it easier to aggregate and analyze data from different origins. These standards cover data formats, data collection methods, and data transmission protocols.

Examples of data-collecting standards include:

  • CVE (Common Vulnerabilities and Exposures), which provides a dictionary of standardized names for security vulnerabilities and exposures
  • Static source code scans (such as SonarQube), which provide a standard way to analyze and report on potential security issues in source code
  • Application logs, which provide a standard format for recording application events and errors
  • Telemetry standards such as OpenTelemetry, which provide a standard way to collect and transmit data about application performance and behavior
  • Monitoring data from APM (Application Performance Management) systems, which provide a standard way to monitor and report on application performance metrics
  • Infrastructure and networking information, which can be collected using standards like SNMP (Simple Network Management Protocol) and NetFlow.
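As a minimal illustration of how such heterogeneous sources can be aggregated, the sketch below normalizes source-specific records into one common envelope before analysis. The `Observation` schema is a hypothetical example for this article, not an industry standard:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical common envelope for observations from any source
# (CVE feeds, code scans, logs, traces, APM, SNMP, ...).
@dataclass
class Observation:
    source: str      # e.g. "cve", "sonarqube", "otel", "snmp"
    kind: str        # "vulnerability", "log", "metric", "trace", ...
    subject: str     # component or host the observation refers to
    timestamp: str   # ISO-8601, UTC
    payload: dict    # source-specific details, kept as-is

def normalize_cve(entry: dict) -> Observation:
    """Map a CVE-style record into the common envelope."""
    return Observation(
        source="cve",
        kind="vulnerability",
        subject=entry["product"],
        timestamp=datetime.now(timezone.utc).isoformat(),
        payload={"id": entry["id"], "severity": entry.get("severity")},
    )

obs = normalize_cve({"id": "CVE-2021-44228", "product": "log4j", "severity": "critical"})
print(asdict(obs)["payload"]["id"])  # CVE-2021-44228
```

A similar `normalize_*` adapter per source keeps downstream aggregation and analysis code identical regardless of where a record originated.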

Standardization in data collection is being pushed by regulations such as the EU Cyber Resilience Act (CRA) and by guidance from the USA's Cybersecurity and Infrastructure Security Agency (CISA). These aim to improve the visibility of the digital supply chain through an SBOM (Software Bill of Materials) approach, which requires the disclosure of all software components used in a system, including their dependencies and known vulnerabilities. This approach can help organizations better understand the risks associated with their software supply chain and take the necessary steps to mitigate them.
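To make the SBOM idea concrete, here is a sketch that joins the vulnerabilities section of a simplified CycloneDX-style document back onto its component list. The document below follows CycloneDX conventions (components with `purl` identifiers, a `vulnerabilities` array) but is heavily reduced; real SBOMs carry many more fields:

```python
import json

# Simplified CycloneDX-style SBOM (illustrative, not a complete document).
sbom_json = """
{
  "components": [
    {"name": "openssl", "version": "1.1.1k", "purl": "pkg:generic/openssl@1.1.1k"},
    {"name": "log4j-core", "version": "2.14.1",
     "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"}
  ],
  "vulnerabilities": [
    {"id": "CVE-2021-44228",
     "affects": [{"ref": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"}]}
  ]
}
"""

def vulnerable_components(sbom: dict) -> list:
    """Return (component name, CVE id) pairs by resolving each
    vulnerability's 'affects' refs against the component purls."""
    by_purl = {c["purl"]: c["name"] for c in sbom.get("components", [])}
    hits = []
    for vuln in sbom.get("vulnerabilities", []):
        for affected in vuln.get("affects", []):
            name = by_purl.get(affected["ref"])
            if name:
                hits.append((name, vuln["id"]))
    return hits

print(vulnerable_components(json.loads(sbom_json)))
# [('log4j-core', 'CVE-2021-44228')]
```

This join, applied across every SBOM an organization receives, is what turns supply-chain disclosure into an actionable risk inventory.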

Structuring Semantic Relationships

A hub-and-spoke platform leveraging the data collection approach described earlier can provide a powerful solution for visualizing and understanding the state of IT infrastructure, applications, and operations. By using the semi-structured information provided by SBOMs and other input data, the platform can correlate and organize information, creating a complex graph of relationships that can be navigated and analyzed using semantic web principles and techniques.
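A minimal sketch of the graph idea follows: dependency edges correlate applications, services, libraries, and hosts, and a reverse traversal answers the question "which applications are exposed if this component is compromised?". All node names are illustrative:

```python
from collections import defaultdict, deque

# Directed "depends_on" edges: application -> service -> library / host.
edges = defaultdict(set)
for src, dst in [
    ("checkout-app", "payment-svc"),
    ("checkout-app", "catalog-svc"),
    ("payment-svc", "log4j-core"),
    ("payment-svc", "vm-prod-07"),
    ("catalog-svc", "vm-prod-08"),
]:
    edges[src].add(dst)

def impacted_by(target: str) -> set:
    """Walk the dependency graph backwards to find every node
    that transitively depends on `target`."""
    reverse = defaultdict(set)
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].add(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in reverse[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(impacted_by("log4j-core")))  # ['checkout-app', 'payment-svc']
```

In a real platform the edges would come from SBOMs, traces, and infrastructure inventories rather than a hand-written list, and the graph would typically live in a graph or triple store rather than in memory.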

Figure: Displaying and navigating the system structure graph (based on IBM Concert)

This approach enables the platform to build an end-to-end view of the IT landscape, from network and infrastructure to application code. Specific use cases can be tailored to multiple levels of the IT hierarchy, from operators to managers and executives, by applying different analysis lenses (e.g. a security, resiliency, or compliance focus) to the common data, helping organizations make informed decisions, optimize performance, and ensure reliability.
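One way to read "analysis lenses" concretely is as different predicate functions applied to the same shared data: each audience sees a filtered projection of one dataset rather than a separate silo. The record fields and lens definitions below are assumptions for illustration:

```python
# One shared set of findings; each lens is just a different filter.
findings = [
    {"subject": "payment-svc", "kind": "vulnerability", "severity": "critical"},
    {"subject": "catalog-svc", "kind": "spof", "replicas": 1},
    {"subject": "payment-svc", "kind": "license", "license": "AGPL-3.0"},
]

LENSES = {
    "security":   lambda f: f["kind"] == "vulnerability",
    "resiliency": lambda f: f["kind"] == "spof",        # single points of failure
    "compliance": lambda f: f["kind"] == "license",
}

def apply_lens(name: str) -> list:
    """Project the shared findings through one analysis lens."""
    return [f for f in findings if LENSES[name](f)]

print([f["subject"] for f in apply_lens("security")])  # ['payment-svc']
```

Because every lens reads the same underlying records, a fix that lands in the data (say, a patched component) is immediately reflected in all views.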

Leveraging GenAI for Insight Generation

A centralized data platform is an ideal environment for integrating Generative AI (GenAI), particularly for summarization, trend extraction, and action proposal. Here's why:

  • Unified Data Access: A centralized platform provides a single source of truth, allowing Gen AI to access all relevant data in one place. This holistic view helps the AI in generating more accurate summaries and trend analyses, as it doesn't miss any vital data points scattered across different systems.
  • Data Quality and Security: The data collection integration process comes with data format standardization, quality checks, and robust security measures for sensitive data. GenAI, operating in this context, can deliver better-quality results while better protecting privacy as it performs its tasks.
  • Efficient Processing: By having all data in one place, the computational efficiency of the Gen AI is significantly increased. It can process and analyze data faster, leading to quicker generation of summaries and trend detection.
  • Customized Analysis: When multiple analysis lenses are applied, smaller, tailored models such as IBM's Granite family can be trained to focus on specific areas of analysis. This specialization allows for more precise summarization and identification of trends in those domains.
  • Enhanced Decision Making: The AI, after understanding the data trends, can propose actionable insights. It can suggest strategies based on historical data, predict future outcomes, and recommend preventive or corrective measures, thereby aiding in better decision-making processes.
  • Automating Audit Trails and Compliance: Centralized platforms maintain comprehensive audit trails, which are crucial for regulatory compliance. Every action taken by the GenAI can be tracked and reviewed, ensuring transparency and accountability.
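The pattern behind several of the points above can be sketched as: gather context from the central store, build a grounded prompt, and ask a model to summarize and propose actions. In this sketch `generate()` is only a placeholder for whatever model endpoint the platform uses (for example, a Granite model); it is not a real API call:

```python
def build_summary_prompt(findings: list) -> str:
    """Assemble a grounded prompt from centrally collected findings,
    so the model summarizes real data instead of guessing."""
    lines = [f"- {f['subject']}: {f['kind']} ({f.get('detail', 'n/a')})"
             for f in findings]
    return ("Summarize the operational risks below and propose one "
            "remediation per item.\n\nFindings:\n" + "\n".join(lines))

def generate(prompt: str) -> str:
    # Placeholder: a real platform would call its model endpoint here.
    return f"[model response to {len(prompt)} prompt chars]"

prompt = build_summary_prompt([
    {"subject": "payment-svc", "kind": "vulnerability", "detail": "CVE-2021-44228"},
])
print(generate(prompt))
```

The important design point is that the model only ever sees data that has already passed the platform's standardization and quality checks, which is what makes the resulting summaries auditable.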

In essence, integrating Gen AI into a centralized data platform provides a conducive environment for efficient, secure, and accurate data analysis, paving the way for informed decision-making and strategic planning.

DevOps as a key tool to achieve an application CMDB

To make this approach effective, the key element is seamless integration into the DevOps chain, which provides the building blocks for an application-aligned Configuration Management Database (CMDB):

  • Data Quality Assurance: Incoming data is standardized and validated before entering the system.
  • Real-time Data Processing: In a cloud-native environment, data is generated at an unprecedented rate and must be processed as it arrives.
  • Automated Configuration Management: An application-aligned CMDB with accurate, up-to-date information about the components in your IT environment and their relationships can trigger automated remediation actions, positively contributing to the DevOps loop.
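The last bullet can be sketched as a reconciliation step: compare the desired state recorded in the CMDB with the observed state and emit remediation actions back into the DevOps loop. Component names, fields, and actions here are illustrative:

```python
# Desired (CMDB) vs observed state for one component; any drift
# triggers remediation actions fed back into the DevOps loop.
desired  = {"component": "payment-svc", "image": "payment-svc:2.4.1", "replicas": 3}
observed = {"component": "payment-svc", "image": "payment-svc:2.3.9", "replicas": 2}

def plan_remediation(desired: dict, observed: dict) -> list:
    """Diff observed state against desired state and list corrective actions."""
    actions = []
    if observed["image"] != desired["image"]:
        actions.append(f"redeploy {desired['component']} with {desired['image']}")
    if observed["replicas"] < desired["replicas"]:
        actions.append(f"scale {desired['component']} to {desired['replicas']} replicas")
    return actions

for action in plan_remediation(desired, observed):
    print(action)
```

In practice the "actions" would be pipeline triggers or operator calls rather than strings, but the accurate, up-to-date CMDB is what makes the diff trustworthy enough to automate.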

By weaving GenAI into the DevOps chain, observability platforms (such as IBM Concert) enable organizations to effectively manage the complexity and speed of cloud-native development and operations.