
"Metrics showed the forest but not the trees; logs showed individual trees but made it nearly impossible to trace a path between them. Distributed tracing with OpenTelemetry filled that gap by providing a hierarchical structure that groups events from a single operation, offering visibility into cause and effect across service boundaries."
"While HTTP tracing often works automatically, queues require custom work to maintain context. The team implemented OpenTelemetry's context propagation standard by creating wrappers for their queue clients, attaching trace IDs and parent span IDs as message metadata, ensuring the full journey of a request remained intact."
"Just as Google Maps alerts drivers based on expected delay rather than the number of cars on the road, the Gearset team shifted to alerting on latency. A thousand items on a queue might be processed instantly, while five items could be significantly delayed. Latency is more stable and directly reflects customer experience."
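The contrast between queue size and latency in the quote above can be sketched in a few lines. This is only an illustration, not Gearset's implementation: it uses the age of the oldest waiting item as a stand-in for latency, and the names `QueuedItem`, `depth_alert`, `latency_alert`, and both threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedItem:
    enqueued_at: float  # seconds on some shared clock (assumed for illustration)


def oldest_age_seconds(items, now):
    """Age of the longest-waiting item; 0.0 for an empty queue."""
    return max((now - item.enqueued_at for item in items), default=0.0)


def depth_alert(items, max_depth):
    """Infrastructure-style alert: fires on how many items are waiting."""
    return len(items) > max_depth


def latency_alert(items, now, max_age_seconds):
    """Experience-style alert: fires on how long an item has been waiting."""
    return oldest_age_seconds(items, now) > max_age_seconds


now = 10_000.0
# A thousand items that arrived half a second ago: busy, but healthy.
busy_queue = [QueuedItem(enqueued_at=now - 0.5) for _ in range(1000)]
# Five items that have been waiting ten minutes: quiet, but unhealthy.
stuck_queue = [QueuedItem(enqueued_at=now - 600.0) for _ in range(5)]

# Hypothetical thresholds: 100 items, 60 seconds.
busy_result = (depth_alert(busy_queue, 100), latency_alert(busy_queue, now, 60.0))
stuck_result = (depth_alert(stuck_queue, 100), latency_alert(stuck_queue, now, 60.0))
```

With these assumed thresholds, the depth alert fires only for the busy but healthy queue, while the latency alert fires only for the queue whose items are actually delayed.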
Gearset engineers struggled to diagnose a delayed backup job despite having comprehensive dashboards, metrics, and logs. Metrics lacked granularity, while logs couldn't trace cause-and-effect relationships across services. Distributed tracing with OpenTelemetry solved this by providing a hierarchical structure that groups the events of a single operation across service boundaries. For queue-based systems, custom context propagation that carried trace IDs and parent span IDs as message metadata kept each request visible end to end. The team also shifted from infrastructure-focused alerting to Service Level Objectives based on customer experience, specifically latency rather than queue size. This approach proved more stable, directly reflected actual customer impact, and reduced the need for constant re-tuning as system characteristics evolved.
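The queue wrapper idea described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the team's code and not the OpenTelemetry SDK: `SpanContext`, `new_span`, `inject`, `extract`, and `TracedQueue` are hypothetical names, and the in-memory deque stands in for a real queue client. The metadata value follows the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`), which is OpenTelemetry's default propagation format. In a real service, the SDK's own `inject` and `extract` functions in `opentelemetry.propagate` would normally do this work.

```python
import secrets
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex characters, shared by every span in one trace
    span_id: str   # 16 hex characters, unique per span


def new_span(parent=None):
    """Start a span: reuse the parent's trace ID, or begin a new trace."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    return SpanContext(trace_id, secrets.token_hex(8))


def inject(ctx, metadata):
    """Write the context into message metadata in traceparent layout."""
    metadata["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"
    return metadata


def extract(metadata):
    """Read the context back; return None if it is missing or malformed."""
    parts = metadata.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return SpanContext(parts[1], parts[2])


class TracedQueue:
    """Hypothetical wrapper around a queue client (here, an in-memory deque)."""

    def __init__(self):
        self._items = deque()

    def publish(self, body, ctx):
        # Attach the producer's trace ID and span ID as message metadata.
        self._items.append({"body": body, "metadata": inject(ctx, {})})

    def consume(self):
        message = self._items.popleft()
        parent = extract(message["metadata"])
        # The consumer's span is a child of the producer's span.
        return message["body"], parent, new_span(parent)


producer_span = new_span()
queue = TracedQueue()
queue.publish("backup-job", producer_span)
body, parent_ctx, consumer_span = queue.consume()
```

Because the consumer's span reuses the producer's trace ID and records the producer's span as its parent, both sides of the queue appear in the same trace.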
#distributed-tracing #opentelemetry #service-level-objectives #observability #queue-context-propagation
Read at InfoQ