Flipkart Scales Prometheus to 80 Million Metrics Using Hierarchical Federation
Briefly

"Flipkart engineers recently published a detailed case study describing how they overcame severe scalability limits in monitoring by adopting a hierarchical federation design in Prometheus. The migration was driven by their API Gateway layer, where approximately 2,000 instances each emitted roughly 40,000 metrics, resulting in a staggering 80 million time-series data points being produced simultaneously. Initially, Flipkart used StatsD for metrics aggregation, but they found it failed to scale. Queries over longer durations choked storage and made historical analytics impractical."
"The heart of their scaling solution lies in hierarchical federation: local Prometheus servers ingest metrics from services, apply recording rules to drop high-cardinality instance labels, and expose aggregated series via /federate endpoints. Federated servers then scrape selected aggregated metrics upward, writing them to long-term storage and dashboards. This tiered design dramatically reduces metric cardinality and load on central servers. To reduce cardinality further, Flipkart adopted strategies such as dropping the instance label for stable dimensions like service or cluster."
Flipkart experienced extreme metric cardinality when about 2,000 API Gateway instances each emitted roughly 40,000 metrics, producing about 80 million simultaneous time series. StatsD-based aggregation failed to scale, and long-duration queries overloaded storage, hindering historical analytics. Flipkart migrated to Prometheus and implemented hierarchical federation: local Prometheus servers ingest service metrics, apply recording rules that drop the high-cardinality instance label, and expose the aggregated series via /federate endpoints; federated servers scrape the selected aggregates and write them to long-term storage and dashboards. Additional tactics included dropping per-instance labels along stable dimensions and publishing latency summaries instead of per-instance percentiles, collapsing the raw series count to the tens of thousands, while acknowledging tradeoffs around per-instance debugging and for smaller deployments.
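For the latency tactic in particular, one common way to publish service-level percentiles instead of per-instance ones is to merge histogram buckets across instances on the leaf servers and derive quantiles from the merged buckets; a sketch under that assumption, with illustrative metric and label names:

    groups:
      - name: gateway_latency
        rules:
          # Merge per-instance histogram buckets into one set per service;
          # the 'le' bucket label is kept so quantiles stay computable.
          - record: service:http_request_duration_seconds_bucket:rate5m
            expr: sum without (instance) (rate(http_request_duration_seconds_bucket[5m]))
          # Service-level p99 derived from the aggregated buckets.
          - record: service:http_request_duration_seconds:p99
            expr: histogram_quantile(0.99, service:http_request_duration_seconds_bucket:rate5m)

Series named with the service: prefix would also need to be included in the global server's match[] selector, and the tradeoff noted above applies: once only merged buckets are federated, a single slow instance can no longer be singled out from the federated view alone.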
Read at InfoQ