Data Quality on Spark, Part 4: Deequ
Briefly

"In this series of blog posts, we explore Data Quality from both a theoretical perspective and a practical implementation standpoint using the Spark framework. We also compare several tools designed to support Data Quality assessments. Although the commercial market for Data Quality solutions is broad and full of capable products, the focus of this series is on open-source tools.In this part, we continue exploring the Airline dataset using the same Data Quality checks, this time with the Deequ library."
"Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. In short, it is a Spark library built by Amazon for expressing and evaluating data quality checks at scale. Besides "regular" checks and verifications, it ships some interesting features such as profiling, analyzers, and automatic suggestions, which will be demonstrated later in this post."
In summary: Data Quality can be addressed both theoretically and through practical implementations on the Spark framework, and open-source tools are emphasized for evaluating data quality within large datasets such as the Airline dataset. Deequ, a Spark library developed by Amazon, enables defining unit tests for data and expressing and evaluating data quality checks at scale. Beyond regular checks and verifications, it provides profiling, analyzers, and automatic constraint suggestions; the core library is written in Scala, with a Python wrapper (PyDeequ) also available. Profiling offers a high-level view of dataset columns via com.amazon.deequ.profiles.ColumnProfilerRunner, while analyzers produce controlled dataset summaries. The examples target Scala 2.12, with JDK 17 and sbt as prerequisites.
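The profiling and analyzer features mentioned above can be sketched as follows. Again, `df`, `spark`, and the column names (`ArrDelay`, `Distance`) are assumptions for illustration; the `ColumnProfilerRunner` and `AnalysisRunner` APIs are Deequ's own.

```scala
import com.amazon.deequ.profiles.ColumnProfilerRunner
import com.amazon.deequ.analyzers.{Completeness, Mean, Size}
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame

// Profiling: a high-level view of every column (completeness,
// approximate distinct count, inferred data type, ...).
val profiles = ColumnProfilerRunner()
  .onData(df)
  .run()

profiles.profiles.foreach { case (column, profile) =>
  println(s"$column: completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}")
}

// Analyzers: a controlled summary, computing only the metrics you ask for.
val analysis = AnalysisRunner
  .onData(df)
  .addAnalyzer(Size())                    // row count
  .addAnalyzer(Completeness("ArrDelay"))  // fraction of non-null values
  .addAnalyzer(Mean("Distance"))          // average flight distance
  .run()

successMetricsAsDataFrame(spark, analysis).show()
```

Where profiling casts a wide net over all columns, analyzers let you pay only for the specific metrics you need, which matters at scale.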
Read at Medium