Yelp Publishes Blueprint for Managing S3 Server-Access Logs at Massive Scale
Briefly

"In essence, Yelp now writes terabytes of daily access logs but converts them into compact, parquet-formatted archives that are easy to query with tools like Amazon Athena. Through a process of periodic "compaction," raw plaintext log objects are merged into fewer, larger Parquet files, reducing storage usage by about 85% and cutting the number of objects by more than 99.99%. This transformation makes analytics efficient and cost-effective, enabling quick lookups for permission debugging, cost attribution, incident investigation, and data retention analysis."
"Behind the scenes, the architecture leverages AWS Glue Data Catalog for managing schemas across multiple AWS accounts, and a mix of scheduled batch jobs, Lambda functions, and partition-projection-based tables for robust, automated log ingestion. The system is designed to tolerate delayed or duplicate log delivery, something SAL inherently allows, by making inserts idempotent, and tagging old log objects for lifecycle expiration once their contents are safely archived."
"Yelp's system also supports key operational use-cases. For debugging, engineers can query whether a particular object was accessed (or denied) at a given time. For cost analysis, it is possible to aggregate API usage by IAM role to understand which services or teams generate the most traffic. For data hygiene, combining access logs with S3 inventory allows the team to identify and safely delete objects that haven't been accessed for defined periods."
Yelp writes terabytes of daily S3 server-access logs and converts them into compact Parquet-formatted archives to enable efficient querying with tools like Amazon Athena. Periodic compaction merges raw plaintext log objects into fewer, larger Parquet files, reducing storage by about 85% and object count by more than 99.99%. The ingestion pipeline uses AWS Glue Data Catalog for multi-account schemas, scheduled batch jobs, Lambda functions, and partition-projection tables. The system tolerates delayed or duplicate deliveries by making inserts idempotent and tagging original log objects for lifecycle expiration after archival. Engineers use the logs for permission debugging, cost attribution, incident investigations, and safe data deletion.
Read at InfoQ