QCon London 2026: Ontology-Driven Observability: Building the E2E Knowledge Graph at Netflix Scale
Briefly

"End-to-End (E2E) Observability is defined as the ability to monitor, understand, and debug an entire state of a complex system from the frontend user experience on one end, through backend services, down to the underlying cloud infrastructure on the opposite end."
"In a recent incident investigation at Netflix, it took four hours from the initial alert of the incident to its resolution. In between, there was triage, debugging and identification of the root cause. Resources included a total of nine teams of more than 30 engineers to resolve this incident and three related incidents."
"The concept of Connectedness includes bridging gaps and breaking silos. At Netflix, connected data in its E2E observability includes: enriching data for a single source of truth; minimizing duplication of effort; the ability to triage and troubleshoot complex issues that deliver aggregated insights and root causes; and improved accuracy with diagnostics."
Netflix engineers Prasanna Vijayanathan and Renzo Sanchez-Silva presented an ontology-driven observability system that builds an end-to-end knowledge graph modeling Netflix's entire ecosystem. End-to-end (E2E) observability means being able to monitor, understand, and debug a complex system from the user-facing frontend, through backend services, down to the underlying cloud infrastructure. In one recent incident, four hours passed between the initial alert and resolution, and more than 30 engineers across nine teams were needed to resolve it along with three related incidents. The challenges named include siloed data sources, disconnected alerting, complex troubleshooting, and inadequate detection methods. The proposed remedy is connectedness: enriching data into a single source of truth, minimizing duplicated effort, supporting triage and troubleshooting of complex issues, and improving diagnostic accuracy. The MELT Layer (Metrics, Events, Logs, Traces) provides a unified observability framework across users, devices, and services.
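To make the idea of connected observability data concrete, here is a minimal Python sketch of a dependency graph that carries MELT signals. It does not show Netflix's implementation: the class names (`Entity`, `KnowledgeGraph`), the `triage` method, the three layer labels, and the sample entities are all hypothetical and chosen only for illustration. It assumes that triage can be approximated as a breadth-first walk from a frontend entity down its dependencies, collecting every signal attached along the way.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field

# The four MELT signal kinds named in the talk summary.
MELT_KINDS = {"metric", "event", "log", "trace"}


@dataclass(frozen=True)
class Entity:
    """Hypothetical graph node: a device, service, or infrastructure unit."""
    name: str
    layer: str  # assumed labels: "frontend", "backend", "infrastructure"


@dataclass
class KnowledgeGraph:
    """Hypothetical graph: 'depends on' edges plus MELT signals per entity."""
    depends_on: dict = field(default_factory=lambda: defaultdict(set))
    signals: dict = field(default_factory=lambda: defaultdict(list))

    def link(self, upstream: Entity, downstream: Entity) -> None:
        # Record that `upstream` depends on `downstream`.
        self.depends_on[upstream].add(downstream)

    def attach(self, entity: Entity, kind: str, detail: str) -> None:
        # Attach one MELT signal to an entity, rejecting unknown kinds.
        if kind not in MELT_KINDS:
            raise ValueError(f"unknown MELT kind: {kind}")
        self.signals[entity].append((kind, detail))

    def triage(self, start: Entity) -> list:
        # Breadth-first walk along dependency edges, collecting signals
        # in the order entities are reached.
        seen = {start}
        queue = deque([start])
        findings = []
        while queue:
            entity = queue.popleft()
            for kind, detail in self.signals.get(entity, []):
                findings.append((entity.name, kind, detail))
            deps = self.depends_on.get(entity, ())
            for dep in sorted(deps, key=lambda e: e.name):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return findings


def demo() -> list:
    # Made-up entities spanning frontend, backend, and infrastructure.
    tv = Entity("tv-app", "frontend")
    api = Entity("playback-api", "backend")
    node = Entity("compute-node-7", "infrastructure")

    graph = KnowledgeGraph()
    graph.link(tv, api)
    graph.link(api, node)
    graph.attach(tv, "metric", "playback start failures up")
    graph.attach(node, "log", "disk errors")
    return graph.triage(tv)
```

Calling `demo()` walks from the frontend entity to the infrastructure node and returns the two attached signals in traversal order. With all signals in one graph, a single query replaces separate lookups in siloed tools, which is the kind of connectedness the summary describes.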
Read at InfoQ