Today, the hype surrounding generative artificial intelligence (GenAI) is dominating software engineering. Technologies such as AI-assisted programming and testing are already changing the way we build and deliver software. Developers use intelligent tools which generate code automatically or identify and eliminate errors in real time.
Holistic AI-driven software delivery, however, requires more than this. If we only focus on development, we miss out on the enormous opportunities in the areas of verification and operation, which are crucial for stable and reliable applications.
AIOps (artificial intelligence for IT operations) refers to the use of AI and machine learning (ML) to automate and optimize IT operations. This involves analyzing large volumes of data from various sources, recognizing patterns, identifying anomalies, and suggesting or automatically initiating countermeasures. The aim is to increase the efficiency, stability, and scalability of application and system operations. The following data sources, among others, are crucial here: analysis results from (functional and non-functional) tests, observability and monitoring data from test and production systems, and information from IT service management and incident management.
Without AIOps, this amount of data is difficult to handle, since manual analysis and monitoring of the data is not only time-consuming, but also error-prone. By recognizing trends and patterns that indicate potential problems and, ideally, their cause, AIOps enables proactive monitoring of the IT infrastructure. In addition, by analyzing log data or metrics, the utilization and performance of systems can be improved and capacity bottlenecks avoided. This leads to significantly shorter downtimes and higher system availability and reliability.
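To make the idea of avoiding capacity bottlenecks through trend analysis more concrete, the following is a minimal sketch in Python. It assumes that utilization samples are available as (hour, percent) pairs and that a simple linear trend is good enough for illustration. The function names `fit_line` and `hours_until_threshold`, the sample data, and the 90 percent threshold are hypothetical; a real AIOps platform would typically use richer forecasting models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need samples from at least two distinct points in time")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def hours_until_threshold(samples, threshold):
    """Estimate how many hours remain until the linear trend reaches threshold.

    samples is a list of (hour, utilization_percent) pairs (an assumed format).
    Returns None when the trend is flat or falling.
    """
    hours = [h for h, _ in samples]
    values = [v for _, v in samples]
    slope, intercept = fit_line(hours, values)
    if slope <= 0:
        return None
    crossing_hour = (threshold - intercept) / slope
    # Never report a negative lead time if the threshold is already exceeded.
    return max(0.0, crossing_hour - hours[-1])


# Hypothetical disk utilization, growing by two percentage points per hour.
disk_usage = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]
print(hours_until_threshold(disk_usage, threshold=90.0))  # 12.0
```

A lead time like this could feed a proactive alert or a scaling workflow well before the bottleneck actually occurs.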
To introduce AIOps successfully, careful planning and an iterative approach are required. First, the use cases that AIOps can improve or speed up are identified and prioritized. The relevant data must then be collected from various sources such as logs, metrics, traces, and events and, if necessary, transformed and consolidated in big data systems. At this point, ML algorithms come into play: they analyze the collected data and use learned models to identify patterns and anomalies. The next step is to implement automation workflows that resolve the identified problems and optimize system performance. This sequence is repeated iteratively for each use case. To ensure continuous improvement, it is essential to regularly review and adapt the AIOps strategy based on the findings and feedback from the implemented use cases.
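The steps of collecting, analyzing, and automating can be sketched for a single use case as follows. This is an illustrative toy pipeline, not a reference implementation: the record layout (`ts`, `service`, `latency_ms`), the two data sources, the playbook, and all function names are assumptions made for this example. In place of a trained ML model, it uses a simple robust statistic, the modified z-score based on the median absolute deviation with the commonly used cutoff of 3.5.

```python
from statistics import median


def consolidate(*sources):
    """Merge record lists from several sources into one timeline."""
    merged = [record for source in sources for record in source]
    return sorted(merged, key=lambda record: record["ts"])


def find_anomalies(records, field, cutoff=3.5):
    """Return records whose modified z-score for field exceeds cutoff."""
    values = [record[field] for record in records]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread in the data, so nothing can be ranked as unusual
    return [
        record
        for record, v in zip(records, values)
        if 0.6745 * abs(v - center) / mad > cutoff
    ]


def plan_actions(anomalies, playbook, default="open_incident"):
    """Map each anomalous record to an automation step from the playbook."""
    return [(a["service"], playbook.get(a["service"], default)) for a in anomalies]


# Hypothetical latency records from two monitoring sources.
gateway_metrics = [
    {"ts": 0, "service": "checkout", "latency_ms": 100},
    {"ts": 2, "service": "checkout", "latency_ms": 98},
    {"ts": 4, "service": "checkout", "latency_ms": 99},
    {"ts": 6, "service": "checkout", "latency_ms": 500},
    {"ts": 8, "service": "checkout", "latency_ms": 97},
]
apm_traces = [
    {"ts": 1, "service": "search", "latency_ms": 102},
    {"ts": 3, "service": "search", "latency_ms": 101},
    {"ts": 5, "service": "search", "latency_ms": 100},
    {"ts": 7, "service": "search", "latency_ms": 103},
    {"ts": 9, "service": "search", "latency_ms": 100},
]

timeline = consolidate(gateway_metrics, apm_traces)
anomalies = find_anomalies(timeline, "latency_ms")
print(plan_actions(anomalies, {"checkout": "restart_pod"}))
# [('checkout', 'restart_pod')]
```

Keeping the three stages as separate functions mirrors the iterative approach: each use case can swap in its own data sources, detection logic, or playbook without touching the other stages.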
By using big data and ML, AIOps is changing the processes, tasks, and requirements of the IT operations organization and the roles involved. An understanding of ML algorithms and data analysis, as well as the ability to apply the findings to problem analysis in complex IT systems, is essential for IT operations professionals. Because the degree of automation changes, the introduction of AIOps tools must therefore go hand in hand with measures to build up knowledge and to review processes and responsibilities, both in operations and in the preceding test stages.
A particularly interesting application of AIOps is continuous verification (CV), an automated process that constantly runs system checks. This ensures performance, security, and compliance while increasing the efficiency and speed of software development processes.
What does it take to successfully introduce AIOps in this context?
First of all, powerful ML models must be developed and continuously trained to distinguish between normal and abnormal operating states. These models should be updated regularly to adapt to changes in the IT environment and thus always deliver precise results.
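As a rough illustration of a model that separates normal from abnormal states and keeps adapting to its environment, the sketch below maintains an exponentially weighted moving average and variance of a metric. It is a deliberately simple stand-in for the ML models described above, not a recommendation for production use. The class name `AdaptiveBaseline` and the parameters `alpha`, `k`, and `warmup` are assumptions chosen for this example.

```python
import math


class AdaptiveBaseline:
    """Exponentially weighted baseline that is updated with every observation."""

    def __init__(self, alpha=0.1, k=3.0, warmup=10):
        self.alpha = alpha    # weight of the newest observation
        self.k = k            # tolerated deviation in standard deviations
        self.warmup = warmup  # observations to see before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def observe(self, value):
        """Return True if value is abnormal, then fold it into the baseline."""
        abnormal = False
        if self.mean is None:
            self.mean = float(value)
        else:
            diff = value - self.mean
            if self.count >= self.warmup:
                abnormal = abs(diff) > self.k * math.sqrt(self.var)
            # Incremental update of the weighted mean and variance.
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.count += 1
        return abnormal


baseline = AdaptiveBaseline()
for i in range(50):
    baseline.observe(99 if i % 2 == 0 else 101)  # steady state around 100

print(baseline.observe(150))  # True: far outside the learned range

for _ in range(30):
    baseline.observe(150)     # the environment settles at a new level

print(baseline.observe(150))  # False: the baseline has adapted
```

The second result shows the trade-off behind regular updates: a model that adapts to legitimate changes will also, given enough time, come to treat a persistent fault as normal, which is one reason why updates and alert thresholds need to be reviewed together.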
In addition, close integration with existing DevOps and CI/CD tools is required. This is the only way to seamlessly embed CV into the software development cycle. AIOps extends these tools so that they can run tests automatically and perform rollbacks if necessary. By monitoring service level indicators (SLIs) against service level objectives (SLOs), teams can ensure that the quality and reliability of the applications meet the defined standards even when delivery speeds up.
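How an SLI can be checked against an SLO to decide between promoting and rolling back a release might look like the following sketch. It assumes an availability SLI computed as the share of successful requests and a hypothetical pipeline step that acts on the returned decision. The names `Slo`, `availability_sli`, and `gate_release`, as well as the request counts, are invented for illustration and do not correspond to any particular CI/CD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # required share of good requests, e.g. 0.999


def availability_sli(good_requests, total_requests):
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("no traffic observed, so the SLI is undefined")
    return good_requests / total_requests


def gate_release(sli, slo):
    """Return the decision a delivery pipeline could act on after a deployment."""
    return "promote" if sli >= slo.target else "rollback"


slo = Slo(name="checkout-availability", target=0.999)

# Hypothetical request counts measured after two different deployments.
print(gate_release(availability_sli(99_950, 100_000), slo))  # promote
print(gate_release(availability_sli(99_000, 100_000), slo))  # rollback
```

Raising an error when no traffic was observed is a deliberate choice in this sketch: an empty measurement window should not be read as either a pass or a failure.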
Lastly, comprehensive visualization and accurate reporting are crucial. Dashboards and reports need to provide clear insights into current system statuses and trends to support IT teams in their decision-making. This transparency helps to maintain an overview of compliance with SLOs and to take action if necessary.
Particularly in combination with CV, AIOps enables proactive monitoring and assurance of software quality, resulting in faster and more reliable software delivery. Thanks to advanced AI and ML technologies, IT teams can work more efficiently and ensure that new software releases are stable and secure.