The author needed to validate large volumes of data and looked for an automated solution in a language he already knew, preferably Scala. He found Deequ, a library from Amazon for data quality validation in distributed environments. By building a framework on top of Apache Spark and Deequ, he created a validation layer that can be plugged into different data processing pipelines. The framework wraps complex validations behind a reusable interface that multiple teams can share, and because it runs on Spark it scales to very large datasets.
The solution I developed is a small internal framework built on top of Apache Spark and Deequ, with Scala as the main language.
This framework acts as a validation layer that can be easily integrated into any data processing pipeline.
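
To give a rough idea of what such a validation layer can look like, here is a minimal sketch, not the internal framework itself. The `ValidationSketch` object, the `validate` helper, the `orderChecks` list, and the `order_id` and `amount` columns are hypothetical names chosen for illustration. The Deequ calls (`VerificationSuite`, `Check`, `isComplete`, `isUnique`, `isNonNegative`) come from the library's public API, and the sketch assumes Spark and Deequ are already on the classpath.

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ValidationSketch {

  // Hypothetical rule set: in a real framework each team would supply its own checks.
  val orderChecks: Seq[Check] = Seq(
    Check(CheckLevel.Error, "orders integrity")
      .isComplete("order_id")   // no nulls in the key column
      .isUnique("order_id")     // no duplicate keys
      .isNonNegative("amount")  // amounts must be >= 0
  )

  // Hypothetical entry point: a pipeline step passes its DataFrame and checks here.
  def validate(df: DataFrame, checks: Seq[Check]): VerificationResult =
    VerificationSuite()
      .onData(df)
      .addChecks(checks)
      .run()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("validation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for a real pipeline output.
    val orders = Seq((1L, 10.0), (2L, 5.5)).toDF("order_id", "amount")

    val result = validate(orders, orderChecks)

    // Show one row per constraint with its status and message.
    VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate = false)

    // A pipeline could fail fast here instead of just logging.
    if (result.status != CheckStatus.Success) {
      println("Data quality checks failed for orders")
    }

    spark.stop()
  }
}
```

Keeping the checks as plain `Seq[Check]` values separate from the `validate` call is one way to make the layer reusable: the pipeline code stays the same while each dataset brings its own rules.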