
"In this series of blog posts, we explore Data Quality from both a theoretical perspective and a practical implementation standpoint using the Spark framework. We also compare several tools designed to support Data Quality assessments. Although the commercial market for Data Quality solutions is broad and full of capable products, the focus of this series is on open-source tools.In this part, we continue exploring the Airline dataset using the same Data Quality checks, this time with the Deequ library."
"Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. In short, it is a Spark library built by Amazon for expressing and evaluating data quality checks at scale. Besides "regular" checks and verifications, it ships some interesting features such as profiling, analyzers, and automatic suggestions, which will be demonstrated later in this post."
"Profiling provides the possibility to get a high-level view of a dataset without prior heavy lifting.All is needed is to pass the dataframe to the profile to com.amazon.deequ.profiles.ColumnProfilerRunner and run it.For the sake of brevity, the resulting profiles are limited to certain columns."
Deequ is a Spark library created by Amazon for defining unit tests for data and evaluating data quality at scale. The library supports profiling, analyzers, verifications, and automatic suggestions, with a main Scala implementation and a Python wrapper (PyDeequ). Implementation examples require JDK 17, sbt, and Scala 2.12. Profiling via com.amazon.deequ.profiles.ColumnProfilerRunner produces high-level column profiles, with more detailed profiling available for string and numeric types. Analyzers provide controlled statistical views of dataset content. The focus remains on open-source tools and practical, scalable Data Quality assessments using the Airline dataset as an example.
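As a rough illustration of analyzers, the sketch below computes a handful of metrics on demand rather than profiling everything. The analyzer choices and column names (ArrDelay, UniqueCarrier, DepDelay) are assumptions about the Airline dataset, and spark and df are assumed to be in scope.

```scala
import com.amazon.deequ.analyzers.{Completeness, Distinctness, Mean, Size}
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame

// Compute a chosen set of metrics instead of a full profile
val analysisResult = AnalysisRunner
  .onData(df)
  .addAnalyzer(Size())                        // number of rows
  .addAnalyzer(Completeness("ArrDelay"))      // fraction of non-null values
  .addAnalyzer(Distinctness("UniqueCarrier")) // ratio of distinct values
  .addAnalyzer(Mean("DepDelay"))              // average departure delay
  .run()

// Turn the resulting metrics into a Spark DataFrame for inspection
successMetricsAsDataFrame(spark, analysisResult).show(truncate = false)
```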