The rapid growth of applications and frequent changes in clusters added further layers of complexity. Engineers often experienced alert fatigue due to the overwhelming volume of data sources and alerts.
To enhance detection capabilities, Intuit implemented a system called 'Cluster Golden Signals,' which provides a consolidated view of a cluster's health by filtering out noise and focusing on critical signals for alerting.
Core components of Kubernetes clusters are monitored through dashboards that aggregate metrics into a single health indicator-Healthy, Degraded, or Critical-using Prometheus expressions.
For deeper debugging, Intuit integrated an open-source tool called K8sGPT. This tool scans Kubernetes clusters to diagnose and triage issues by leveraging knowledge codified from Site Reliability Engineers.
Collection
[
|
...
]