LinkedIn Re-Architects Service Discovery: Replacing Zookeeper with Kafka and xDS at Scale
"In a recent LinkedIn Engineering Blog post, Bohan Yang describes the project to upgrade the company's legacy ZooKeeper-based service discovery platform. Facing imminent capacity limits with thousands of microservices, LinkedIn needed a more scalable architecture. The new system leverages Apache Kafka for writes and the xDS protocol for reads, enabling eventual consistency and allowing non-Java clients to participate as first-class citizens. To ensure stability, the team implemented a "Dual Mode" strategy that allowed for an incremental, zero-downtime migration."
"The team identified critical scaling problems with the legacy Apache ZooKeeper-based system. Direct writes from app servers and direct reads/watches from clients meant that large application deployments caused massive write spikes and subsequent 'read storms,' leading to high latency and session timeouts. Additionally, since ZooKeeper enforces strong consistency (strict ordering), a backlog in read requests could block writes, causing healthy nodes to fail health checks. The team estimated that the current system would reach its maximum capacity in 2025."
"The new system separates the write path (via Kafka) from the read path (via an Observer service). The Service Discovery Observer consumes Kafka events to update its in-memory cache and pushes updates to clients via the xDS protocol, which is compatible with Envoy and gRPC. The use of the xDS standard enables LinkedIn to deploy clients in many languages beyond Java."
In summary, the new architecture adopts eventual consistency, separating writes via Kafka from reads via Observers that maintain an in-memory cache and serve clients over xDS. This enables non-Java clients, future Envoy and service-mesh integration, and a zero-downtime Dual Mode migration, and the benchmarks reported in the post show strong Observer performance at scale.
Read at InfoQ