Stripe's Zero-Downtime Data Movement Platform Migrates Petabytes with Millisecond Traffic Switches
Briefly

"At QCon San Francisco 2025, Jimmy Morzaria, Staff Software Engineer at Stripe, presented the company's Zero-Downtime Data Movement Platform, a system enabling petabyte-scale database migrations with traffic cutovers that typically complete in milliseconds. The platform supports Stripe's infrastructure, handling 5 million database queries per second across 2,000-plus MongoDB-based shards while maintaining 99.9995% reliability for $1.4 trillion in annual transactions."
"A data migration starts with a "migration registration" step that updates the routing metadata service to register new target shards and their key ranges. This step establishes the intended destination for data before any movement occurs. The bulk data import phase then transfers the primary dataset using an optimized import service. Morzaria explained that the team reordered inserts to align with MongoDB's B-tree storage engine, sorting items by each shard's most-used indexes to improve write performance roughly tenfold over standard imports. Next, during async replication, a dedicated replication service maintains bidirectional synchronization between source and target shards. This crucial phase captures ongoing changes to source data while simultaneously replicating modifications back to source shards. The bidirectional approach enables complete migration rollbacks if issues emerge, providing a critical safety mechanism for financial data."
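The index-aligned bulk import described above can be sketched as follows. This is a minimal illustration, not Stripe's implementation: the idea is to sort documents by the shard's most-used index key before batch insertion, so writes land in roughly ascending B-tree order rather than random order. The field name `merchant_id` and the batching helper are hypothetical.

```python
# Sketch (assumed, not Stripe's code): order documents by the most-used
# index key so bulk inserts produce mostly sequential B-tree writes.

def order_for_bulk_import(docs, index_key):
    """Sort documents by the index key to align inserts with B-tree order."""
    return sorted(docs, key=lambda d: d[index_key])

def batched(items, size):
    """Yield fixed-size batches for insert_many-style bulk writes."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example with a hypothetical "merchant_id" index:
docs = [{"merchant_id": 42}, {"merchant_id": 7}, {"merchant_id": 19}]
ordered = order_for_bulk_import(docs, "merchant_id")
for batch in batched(ordered, 2):
    pass  # with a real driver: collection.insert_many(batch, ordered=False)
```

With a real MongoDB driver, each batch would go through a bulk write call; the sorting step is where the talk's reported write-throughput gain comes from.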
The zero-downtime data movement platform enables petabyte-scale database migrations with traffic cutovers that typically complete in milliseconds. The platform handles 5 million queries per second across more than 2,000 MongoDB shards while maintaining 99.9995% reliability for $1.4 trillion in annual transactions. Migrations follow a six-phase blueprint built around three principles: keep downtime shorter than a node failover, minimize impact on live queries, and support shards ranging from small datasets to tens of terabytes. Stages include migration registration, an optimized bulk import that reorders inserts to match MongoDB's B-tree storage engine for roughly 10x throughput, and bidirectional async replication that captures ongoing changes and permits full rollback.
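The phased lifecycle above can be sketched as a simple state machine. Only the phases named in the summary are modeled, and the phase names are illustrative, not Stripe's; the talk describes six phases in total, so this sketch is necessarily partial.

```python
from enum import Enum, auto

# Hypothetical model of the migration phases named in the summary.
class Phase(Enum):
    REGISTRATION = auto()       # register target shards + key ranges in routing metadata
    BULK_IMPORT = auto()        # index-ordered import of the primary dataset
    ASYNC_REPLICATION = auto()  # bidirectional source <-> target synchronization
    TRAFFIC_CUTOVER = auto()    # millisecond routing switch to target shards
    COMPLETE = auto()

FORWARD = [Phase.REGISTRATION, Phase.BULK_IMPORT, Phase.ASYNC_REPLICATION,
           Phase.TRAFFIC_CUTOVER, Phase.COMPLETE]

def next_phase(current: Phase) -> Phase:
    """Advance one phase along the happy path."""
    return FORWARD[FORWARD.index(current) + 1]

def can_roll_back(current: Phase) -> bool:
    """Bidirectional replication keeps the source current, so any phase
    before completion can fall back to the source shards."""
    return current is not Phase.COMPLETE
```

The design point the sketch captures: because replication runs in both directions, rollback stays available until the migration is fully complete, which is the safety property the talk emphasizes for financial data.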
Read at InfoQ