Data Quality on Spark, Part 4: Deequ

"Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. In short, it is a Spark library built by Amazon for expressing and evaluating data quality checks at scale. Besides "regular" checks and verifications, it ships some interesting features such as profiling, analyzers, and automatic suggestions, which will be demonstrated later in this post.The main library is written in Scala, although a Python wrapper (PyDeequ) is also available.To keep the focus on a single implementation, the examples in this post are written in Scala."

"Profiling provides the possibility to get a high-level view of a dataset without prior heavy lifting.All is needed is to pass the dataframe to the profile to com.amazon.deequ.profiles.ColumnProfilerRunner and run it.For the sake of brevity, the resulting profiles are limited to certain columns. This application outputs the following profiles: The library supports more detailed profiling for strings and numerical types."

This series approaches data quality both theoretically and practically, using Apache Spark and open-source tools and comparing the solutions against each other. Deequ is a Spark library by Amazon for defining "unit tests for data", that is, for expressing and evaluating data quality checks at scale. Its features include regular checks, verifications, profiling, analyzers, and automatic suggestions. The core library is written in Scala, and a Python wrapper (PyDeequ) is available. The required environment consists of JDK 17, sbt, and Scala 2.12. Profiling uses ColumnProfilerRunner to produce column-level summaries and supports more detailed profiling for string and numeric columns.
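For the environment listed above, a build definition might look like the following sketch. The exact version numbers are assumptions, not taken from the original post; the Deequ artifact has to match the Spark version in use, and the JVM option is only relevant when Spark runs in-process on JDK 17.

```scala
// build.sbt (sketch; all version numbers are assumptions)
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  // Deequ is published per Spark line; pick the suffix matching your Spark version.
  "com.amazon.deequ" % "deequ" % "2.0.7-spark-3.5"
)

// Running Spark from sbt on JDK 17 may require opening JDK internals to it.
run / fork := true
run / javaOptions += "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
```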

#data-quality #deequ #apache-spark #profiling #pydeequ

Read at Medium
