AI for Debugging and Incidents

AI can automate parts of the incident lifecycle, including triage, root cause analysis, and resolution. This is particularly valuable for building self-healing systems that automatically recover from runtime failures and for augmenting human operators in high-volume environments like Security Operations Centers (SOCs) and cloud support.

Agentic Operations Support

AI agents can serve as support tools for on-call engineers and technical operators by ingesting and reasoning over diverse operational data sources.

The Archi framework is an open-source system deployed for the Computing Operations team at CERN's LHC to support technical operators (Archi: Agentic Operations at the CMS Experiment).
It uses configurable agents to retrieve and analyze information from heterogeneous sources, including documentation, historical data, and live monitoring systems (Archi: Agentic Operations at the CMS Experiment).
The system proved effective at resolving real-world queries from operators during production usage (Archi: Agentic Operations at the CMS Experiment).
A key finding was that locally hosted open-source models performed competitively, which makes it possible to keep sensitive operational data fully private (Archi: Agentic Operations at the CMS Experiment).
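
To make the idea of retrieving from heterogeneous sources concrete, the following is a minimal sketch and not Archi's actual implementation. It assumes each source (documentation, historical data, monitoring) is just a list of text snippets and ranks them by naive keyword overlap with the operator's query. The names `Hit`, `overlap`, `retrieve`, and `SOURCES`, as well as the sample snippets, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    source: str  # which operational source the snippet came from
    text: str    # the snippet itself
    score: int   # number of query words found in the snippet


def overlap(query: str, text: str) -> int:
    """Count distinct query words that also occur in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, sources: Dict[str, List[str]], top_k: int = 2) -> List[Hit]:
    """Score every snippet from every source and return the best matches."""
    hits = [
        Hit(name, doc, overlap(query, doc))
        for name, docs in sources.items()
        for doc in docs
    ]
    hits = [h for h in hits if h.score > 0]             # drop irrelevant snippets
    hits.sort(key=lambda h: h.score, reverse=True)      # stable sort, best first
    return hits[:top_k]


# Hypothetical sample data standing in for real operational sources.
SOURCES = {
    "documentation": ["restart the transfer service when the queue is stuck"],
    "history": ["disk full on storage node resolved by cleanup"],
    "monitoring": ["transfer queue length 950 and rising"],
}

if __name__ == "__main__":
    for hit in retrieve("transfer queue stuck", SOURCES):
        print(f"[{hit.source}] {hit.text}")
```

In a real deployment the keyword overlap would presumably be replaced by a proper retriever, and the ranked snippets would be passed to a language model as context. The sketch only shows the step of merging several sources into one ranked view.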

Automated Diagnosis and Root Cause Analysis

Differential Debugging for Silent Errors

Silent errors, where system output quality degrades without explicit error signals, are a significant challenge in complex software like LLM serving frameworks (Ekka: Automated Diagnosis of Silent Errors in LLM Inference).

The Ekka system automates the diagnosis of these errors by framing it as a differential debugging problem (Ekka: Automated Diagnosis of Silent Errors in LLM Inference).
It identifies root causes by systematically aligning and comparing intermediate execution states between a target system and a semantically correct reference implementation (Ekka: Automated Diagnosis of Silent Errors in LLM Inference).
On a benchmark of real-world silent errors from popular serving frameworks, Ekka achieved 80% pass@1 diagnosis accuracy (Ekka: Automated Diagnosis of Silent Errors in LLM Inference).
Ekka also diagnosed four new silent errors in serving frameworks that were subsequently confirmed by developers (Ekka: Automated Diagnosis of Silent Errors in LLM Inference).
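
The core idea of differential debugging can be illustrated with a toy example. This is a sketch under simplifying assumptions, not Ekka's algorithm: both systems are modeled as pipelines of named stages that are already aligned by name, and a state is a plain list of floats. The helpers `trace`, `max_abs_diff`, and `first_divergence`, and the two example pipelines, are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Stage = Tuple[str, Callable[[Vector], Vector]]


def trace(stages: Sequence[Stage], x: Vector) -> List[Tuple[str, Vector]]:
    """Run a pipeline and record the intermediate state after every stage."""
    states = []
    for name, fn in stages:
        x = fn(x)
        states.append((name, x))
    return states


def max_abs_diff(a: Vector, b: Vector) -> float:
    """Largest element-wise difference; a shape mismatch counts as infinite."""
    if len(a) != len(b):
        return float("inf")
    return max((abs(p - q) for p, q in zip(a, b)), default=0.0)


def first_divergence(reference: Sequence[Stage], target: Sequence[Stage],
                     x: Vector, tol: float = 1e-6) -> Optional[str]:
    """Return the first aligned stage where the target departs from the reference."""
    ref_states = dict(trace(reference, x))
    for name, state in trace(target, x):
        if name not in ref_states:
            continue  # stage has no aligned counterpart in the reference
        if max_abs_diff(ref_states[name], state) > tol:
            return name
    return None


# Reference pipeline: scale, shift, then clip negatives to zero.
REFERENCE: List[Stage] = [
    ("scale", lambda v: [2.0 * e for e in v]),
    ("shift", lambda v: [e + 1.0 for e in v]),
    ("clip", lambda v: [max(e, 0.0) for e in v]),
]

# Target pipeline with a silent bug: the shift constant is slightly off.
BUGGY: List[Stage] = [
    ("scale", lambda v: [2.0 * e for e in v]),
    ("shift", lambda v: [e + 1.01 for e in v]),
    ("clip", lambda v: [max(e, 0.0) for e in v]),
]

if __name__ == "__main__":
    print(first_divergence(REFERENCE, BUGGY, [1.0, -2.0]))
```

The buggy pipeline raises no exception and still produces plausible output, yet comparing intermediate states localizes the fault to the `shift` stage. In real serving frameworks the hard part is the alignment itself, since the target and reference rarely expose identically named stages.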

Key References