#rdd

from Medium
1 week ago

RDD vs DataFrame vs Dataset in Apache Spark: Which One Should You Use and Why

Spark offers three main APIs (RDD, DataFrame, and Dataset), each with distinct advantages: RDDs provide low-level control, DataFrames gain performance from query optimization, and Datasets add compile-time type safety.
Data science
from medium.com
2 months ago

Spark Scala Exercise 22: Custom Partitioning in Spark RDDs: Load Balancing and Shuffle

Implementing a custom partitioner in Spark Scala allows for co-locating related keys, balancing skewed loads, and optimizing reduce-side joins, giving control over task distribution.
Data science