KubeCon NA 2025 - Salesforce's Approach to Self-Healing Using AIOps and Agentic AI
Briefly

KubeCon NA 2025 - Salesforce's Approach to Self-Healing Using AIOps and Agentic AI
"The operational scale of their K8s platform includes 1400 K8s clusters, millions of pods, thousands of compute nodes, 40+ operators and integrations, and 200+ monitoring plugins. The speakers highlighted that they estimate the capacity to increase five times in the next couple of years. The overall goal of the solution is to let application teams focus on business requirements, not get bogged down with infrastructure overhead."
"They discussed the approaches to Kubernetes platform operations, leveraging generative AI and multi-agent collaboration to create a cluster management system to troubleshoot Kubernetes clusters, reducing mean time to identify (MTTI) and mean time to resolve (MTTR) for critical cluster issues. Agentic AI solution consists of AI Agents that have specific goals to help with AIOps platform and tools to retrieve data from telemetry platform. Agents perform actions against their K8s environment like rolling back upgrades in case any issues during the upgrade process."
Salesforce developed an AIOps architecture to enable self-healing for the Hyperforce Kubernetes Platform, a managed multi-cloud namespace-as-a-service. The platform operates at scale with about 1,400 clusters, millions of pods, thousands of nodes, 40+ operators and integrations, and 200+ monitoring plugins, with projected fivefold capacity growth. The solution uses generative AI and multi-agent collaboration to automatically analyze cluster health, diagnose issues, and orchestrate remediation actions. AI Agents retrieve telemetry, execute actions such as rollbacks during faulty upgrades, and reduce mean time to identify and resolve critical cluster issues while enforcing communication protocols, guardrails, and security permissions.
Read at InfoQ
Unable to calculate read time
[
|
]