Apache Spark is an effective tool for big data processing, but its unit tests can be slow to run, in large part because of the time spent initializing Spark sessions. Sharing a single Spark session across multiple tests cut test durations substantially, from 4 minutes to 2 minutes for key jobs. With less idle time spent on session initialization, tests can be run more often and bugs resolved sooner, which improves overall productivity on Spark-based data pipelines.
If your tests take too long, you'll run them less often. And if you run them less often, bugs get shipped more often.
We implemented a SharedSparkSession trait that encapsulates a reusable Spark context, allowing multiple test suites or test cases to share the same SparkSession instance.
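The original trait is not shown here, so the following is only a minimal sketch of how such a trait might look, not the actual implementation. It assumes a local master, and the settings (`local[2]`, the app name, the disabled UI, the reduced shuffle partitions) as well as the `WordCountCheck` helper are illustrative choices rather than details from the source.

```scala
import org.apache.spark.sql.SparkSession

// Companion object holding the single session. `lazy val` means the session
// is created on first use and reused afterwards within the same JVM.
object SharedSparkSession {
  lazy val session: SparkSession = SparkSession.builder()
    .master("local[2]")                          // assumption: tests run locally
    .appName("shared-test-session")              // hypothetical name
    .config("spark.ui.enabled", "false")         // assumption: no web UI needed in tests
    .config("spark.sql.shuffle.partitions", "2") // assumption: small test data
    .getOrCreate()
}

// Mixing in the trait gives a suite access to the shared session
// without creating a new one.
trait SharedSparkSession {
  protected def spark: SparkSession = SharedSparkSession.session
}

// Hypothetical consumer, standing in for a test suite.
object WordCountCheck extends SharedSparkSession {
  def distinctCount(words: Seq[String]): Long = {
    val s = spark            // stable identifier required for the implicits import
    import s.implicits._
    words.toDS().distinct().count()
  }
}
```

Any suite that extends the trait reads `spark` instead of building its own session, so the startup cost is paid once per JVM rather than once per suite. The trade-off is that state such as temporary views and cached tables is also shared, so tests that rely on it should clean up after themselves.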