Data Quality on Spark, Part 4: Deequ
Briefly

"Introduction In this series of blog posts, we explore Data Quality from both a theoretical perspective and a practical implementation standpoint using the Spark framework. We also compare several tools designed to support Data Quality assessments. Although the commercial market for Data Quality solutions is broad and full of capable products, the focus of this series is on open-source tools.In this part, we continue exploring the Airline dataset using the same Data Quality checks, this time with the Deequ library."
"Deequ Deequ, as the documentation describes it, is: Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. In short, it is a Spark library built by Amazon for expressing and evaluating data quality checks at scale. Besides "regular" checks and verifications, it ships some interesting features such as profiling, analyzers, and automatic suggestions, which will be demonstrated later in this post.The main library is written in Scala, although a Python wrapper (PyDeequ) is also available.To keep the focus on a single implementation, the examples in this post are written in Scala."
"Profiling Profiling provides the possibility to get a high-level view of a dataset without prior heavy lifting.All is needed is to pass the dataframe to the profile to com.amazon.deequ.profiles.ColumnProfilerRunner and run it.For the sake of brevity, the resulting profiles are limited to certain columns. This application outputs the following profiles: The library supports more detailed profiling for strings and numerical types."
The content presents Deequ as an Apache Spark library for defining "unit tests for data" to measure and evaluate data quality at scale. Deequ provides regular checks, verifications, profiling, analyzers, and automatic suggestions. The main implementation is in Scala with a Python wrapper (PyDeequ) available. Required setup includes JDK 17, sbt, and Scala 2.12. Profiling uses ColumnProfilerRunner to produce high-level summaries of dataset columns, with support for more detailed profiling of string and numerical types. Analyzers offer a controlled, high-level overview of dataset content. Examples are implemented in Scala.
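As a sketch of that analyzer-based overview, assuming again illustrative column names (ArrDelay, Origin) rather than columns referenced in the post:

import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Mean, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import org.apache.spark.sql.{DataFrame, SparkSession}

// Compute a handful of metrics in one pass and show them as a regular Spark dataframe.
def analyze(spark: SparkSession, airlineDf: DataFrame): Unit = {
  val context: AnalyzerContext = AnalysisRunner
    .onData(airlineDf)
    .addAnalyzer(Size())                        // total number of rows
    .addAnalyzer(Completeness("ArrDelay"))      // fraction of non-null arrival delays
    .addAnalyzer(Mean("ArrDelay"))              // average arrival delay
    .addAnalyzer(ApproxCountDistinct("Origin")) // approximate number of distinct airports
    .run()

  successMetricsAsDataFrame(spark, context).show(false)
}

Turning the computed metrics back into a dataframe makes it easy to inspect or persist them with the usual Spark tooling.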
Read at Medium