#spark-scala

from medium.com
1 month ago

Spark Scala Exercise 22: Custom Partitioning in Spark RDDs - Load Balancing and Shuffle

This exercise explores the lower-level RDD API in Spark Scala and focuses on implementing a custom partitioner. The DataFrame API handles partitioning automatically, but RDDs let you control it directly, which matters when you need to co-locate related keys, balance skewed loads, optimize reduce-side joins, or control how tasks are distributed.
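As a rough illustration of the idea (not the article's own code), a custom partitioner can be sketched by extending Spark's `Partitioner` class and passing it to `partitionBy` on a pair RDD. The class name `HotKeyPartitioner`, the sample data, and the choice to isolate a single known hot key in partition 0 are all assumptions made for this example.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: isolates one known hot key in partition 0 and
// hash-distributes every other key over the remaining partitions.
class HotKeyPartitioner(override val numPartitions: Int, val hotKey: String)
    extends Partitioner {
  require(numPartitions >= 2, "need one partition for the hot key plus at least one more")

  override def getPartition(key: Any): Int = key match {
    case k: String if k == hotKey => 0
    case null                     => 1
    // floorMod keeps the result non-negative even for negative hash codes
    case other                    => 1 + Math.floorMod(other.hashCode, numPartitions - 1)
  }

  // Spark compares partitioners to decide whether a shuffle can be skipped,
  // so equals and hashCode should reflect the partitioning logic.
  override def equals(other: Any): Boolean = other match {
    case p: HotKeyPartitioner => p.numPartitions == numPartitions && p.hotKey == hotKey
    case _                    => false
  }

  override def hashCode(): Int = 31 * numPartitions + hotKey.hashCode
}

object HotKeyPartitionerDemo {
  def main(args: Array[String]): Unit = {
    // Local master and sample data are assumptions for the sketch
    val conf = new SparkConf().setAppName("HotKeyPartitionerDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sales = sc.parallelize(Seq(("US", 10), ("US", 20), ("DE", 5), ("FR", 7), ("US", 1)))
      val partitioned = sales.partitionBy(new HotKeyPartitioner(4, "US"))

      // Show which keys ended up in which partition
      partitioned
        .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.map(_._1).toList)))
        .collect()
        .foreach { case (idx, keys) => println(s"partition $idx: $keys") }
    } finally sc.stop()
  }
}
```

Giving the skewed key its own partition is only one possible strategy. Salting the key or using a range-based scheme may suit other workloads better.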
Data science
#data-analysis
from medium.com
1 month ago
Data science

Spark Scala Exercise 7: Advanced Group By and Aggregations (with Rollup, Cube, and Multi-level

from medium.com
1 month ago

Spark Scala Exercise 9: Joining Two Datasets in Spark - Mastering Inner, Left, Right, and Outer

Join operations are foundational to data analysis: they connect disparate datasets in Spark Scala, which is essential for implementing business logic and deriving insights from data.
Data science