
"Using df.count() to Check for Data Existence This is a classic mistake often made by entry-level engineers when writing data to a sink. It seems intuitive to check if a DataFrame is non-empty using df.count() > 0 before triggering a write. However, because Spark relies on lazy evaluation, calling .count() forces Spark to execute a full job just to count the rows."
"Applying dropDuplicates on All Columns While cleaning up duplicate records is standard practice in big data pipelines, using dropDuplicates() without specifying target columns is a massive performance killer. It forces a complete shuffle of the entire dataset across the cluster, leading to massive, unnecessary shuffle stages and high I/O overhead."
"Unnecessary or Excessive Caching It might sound strange to call caching an anti-pattern since avoiding re-evaluation is generally a good practice. However, caching can easily backfire. If the cached DataFrame is too large, it can trigger Out Of Memory (OOM) errors. It can also cause stale data issues if the underlying data updates during job execution."
"Leaving df.count(), df.show(), or display() in Production Pipelines Methods like .show(), .count(), and display() are incredibly helpful during active development in a DEV environment. However, they are actions that trigger new Spark stages. Leaving them inside production pipelines increases runtime and wastes expensive compute resources. For production tracking, rely instead on standard logging frameworks (like Log4j in Java/Scala)."
Spark includes built-in optimizations, but common anti-patterns can negate them and add overhead. Calling df.count() to check whether data exists forces Spark to execute a full job due to lazy evaluation. Using dropDuplicates() without specifying columns causes a complete shuffle across the cluster, increasing I/O and shuffle stages. Caching can become harmful when datasets are too large, leading to out-of-memory errors, or when underlying data changes during execution, causing stale results. Leaving actions like df.count(), df.show(), or display() in production triggers extra Spark stages and wastes compute. Creating tables with very wide schemas can also hurt performance and should be avoided.
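On the wide-schema point, a small illustrative sketch (paths, column names, and the target table name are all assumed) is to project down to the columns downstream consumers actually need before persisting a table, rather than writing the full wide schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-table").getOrCreate()
raw = spark.read.parquet("/path/to/raw")  # hypothetical source with many columns

# Keep only the columns that downstream consumers actually query.
curated = raw.select("customer_id", "order_ts", "amount")  # assumed column names
curated.write.mode("overwrite").saveAsTable("analytics.orders_curated")  # hypothetical table
```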