
Logs provide timestamped records of discrete events such as function calls, requests, and errors, but high-traffic systems generate massive volumes, and correlating them across services requires discipline and tooling. Metrics provide aggregated numerical data such as request rate, error rate, latency percentiles, and CPU usage; they are cheap to store and well suited to alerting and dashboards, but aggregation removes the detail needed to identify where and why issues occur. Traces provide a causal record of a single request’s path through services, including hop timing and error locations, and distributed tracing built on standards such as OpenTelemetry can speed root-cause analysis. Together, these tools work well for common failures, but they struggle with less canonical failure modes in modern distributed systems.
"It's a tidy framework. Yet it turns out to be incomplete in ways that only become obvious once you're actually trying to debug a production incident with it. This article isn't an argument against logs, metrics and traces; you need all three. However, there's a growing set of failure modes in modern distributed systems that the three-pillar model struggles to explain - and understanding why is the first step toward building observability that actually works."
" give you a timestamped record of discrete events: A function was called, a request came in, an error was thrown. They're rich in detail and easy to add to code. The challenge is volume - a high-traffic service can generate millions of log lines per minute, and correlating across services requires discipline and tooling."
" give you aggregated numerical data over time: Request rate, error rate, latency percentiles, CPU usage. They're cheap to store, easy to alert on and ideal for dashboards. The tradeoff is that aggregation loses information - a p99 latency of two seconds tells you something is slow, but not where or why."
" give you a causal record of how a single request moved through your system - which services it touched, how long each hop took, where errors occurred. Distributed tracing, using standards like OpenTelemetry , has matured considerably and can dramatically accelerate root cause analysis. Together, these three tools cover a lot of ground. For the canonical failure modes - a slow database query, a misconfigured cache, a crashing pod - they work well. The question is what happens when the failure mode is less canonical."
Read at DevOps.com