Yelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra Nodes

"Yelp's upgrade of its Apache Cassandra infrastructure involved more than 1,000 nodes and was executed without any service downtime, showcasing a successful approach to managing stateful systems at scale."

"The rolling upgrade strategy allowed the team to incrementally upgrade nodes while ensuring that applications continued to read and write data uninterrupted, addressing one of the most complex challenges in distributed systems."

"By upgrading nodes in controlled batches and allowing the cluster to rebalance and repair between steps, Yelp minimized the risk of cascading failures, aligning with best practices in Cassandra upgrades."

"The investment in automation and observability ensured that each phase of the upgrade could be monitored and validated in real time, significantly reducing the likelihood of human error."

Yelp completed a significant upgrade of its Apache Cassandra infrastructure involving over 1,000 nodes without service downtime. The Database Reliability Engineering team implemented a rolling upgrade strategy to maintain cluster availability and data consistency. Strict adherence to compatibility and incremental change principles minimized risks. Automation and observability were crucial, allowing real-time monitoring and validation. Continuous health checks helped detect anomalies early. The upgrade demonstrated the complexities of managing distributed databases and the importance of careful coordination during such processes.

#yelp #cassandra #database-upgrade #infrastructure #automation

Read at InfoQ

Unable to calculate read time

Collection

[

...

]

Yelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra NodesYelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra Nodes Briefly

Yelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra Nodes
Yelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra Nodes
Briefly