Resurrecting Scala in Spark: Another tool in your toolbox when Python and Pandas suffer
Briefly

Pandas UDFs in Spark can handle complex logic over grouped records, but performance suffers when there are many groups that each contain only a few records.
The slowdown comes from excessive data movement and serialization/deserialization between the JVM and the Python worker processes, reminiscent of the marshalling overhead in older programming paradigms.
applyInPandas is flexible enough for complex grouped operations in Spark, but its efficiency degrades when it has to process large numbers of small groups.
Pandas UDFs may need optimization in Spark environments such as Databricks on AWS, particularly on single-node clusters, which can exacerbate the performance problem.
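The fixed cost per group described above can be illustrated with a small, Spark-free sketch. It is only an analogy, not how Spark is implemented: the standard-library pickle module stands in for the Arrow-based transfer that Pandas UDFs use, and every function name, the group count, and the group size are hypothetical choices made for the illustration. The sketch counts hand-offs and serialized bytes when each tiny group is shipped on its own versus when all rows are shipped as one batch.

```python
import pickle
from itertools import groupby
from operator import itemgetter


def make_rows(n_groups, rows_per_group):
    """Build (group_key, value) rows, already sorted by group key."""
    return [(g, float(i)) for g in range(n_groups) for i in range(rows_per_group)]


def round_trip_per_group(rows):
    """Serialize and deserialize each group on its own.

    Stand-in (assumption) for one JVM -> Python hand-off per group.
    Returns (number_of_hand_offs, total_serialized_bytes).
    """
    calls = 0
    total_bytes = 0
    for _, group in groupby(rows, key=itemgetter(0)):
        payload = pickle.dumps(list(group))
        pickle.loads(payload)  # the receiving side has to decode it again
        calls += 1
        total_bytes += len(payload)
    return calls, total_bytes


def round_trip_single_batch(rows):
    """Serialize and deserialize all rows in a single hand-off."""
    payload = pickle.dumps(rows)
    pickle.loads(payload)
    return 1, len(payload)


# Many small groups: 10,000 groups of 3 rows each (hypothetical sizes).
rows = make_rows(n_groups=10_000, rows_per_group=3)
per_group_calls, per_group_bytes = round_trip_per_group(rows)
batch_calls, batch_bytes = round_trip_single_batch(rows)

print(f"per group:    {per_group_calls} hand-offs, {per_group_bytes} bytes")
print(f"single batch: {batch_calls} hand-off, {batch_bytes} bytes")
```

The row data is identical in both cases; only the number of hand-offs changes. Each hand-off pays its own framing overhead, so the per-group path moves more bytes and makes far more calls. A real Spark job adds its own per-group costs on top of this, so the sketch shows the shape of the problem rather than its actual magnitude.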
Read at Medium