I wrote a book for O'Reilly on scaling machine learning with Spark. My second book is coming out: the second edition of High Performance Spark. I started my career in the machine learning space 15 years ago, moved into data infrastructure and batch processing, and a year and a half ago I moved into the data streaming space, which I think is what's going to help us pave the future of data.
In a two-part blog series, Soam Acharya, Rainie Li, William Tom and Ang Zhang describe how the Pinterest Big Data Platform team considered alternatives for their next-generation, massive-scale data processing platform as the limits of the existing Hadoop-based system, known internally as Monarch, became clear. They present Moka as the outcome of that search: an EKS-based, cloud-native data processing platform that now runs production workloads at Pinterest scale.
Microsoft has bought Osmos, an AI-assisted data engineering platform, in a bid to enrich its Fabric data platform, encroaching on so-called partners' markets. Founded in 2019, Osmos was already making its pipeline and upload products available on Fabric, based around open source Apache Spark. In a blog post, Bogdan Crivat, Microsoft corporate veep for Azure Data Analytics, said the purchase will support Fabric's mission to give customers an approach to "unify all data and analytics into a single, secure platform."
Customers have been using Spark for a long time to process data and get it ready for use in analytics or in AI. Running separate systems with different compute engines is a burden that adds complexity in governance and infrastructure.
Data skew in Apache Spark is a performance issue where a few keys dominate the data distribution, leading to uneven partitions and slow queries, especially during operations that require shuffling.
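To make the effect concrete, here is a minimal sketch in plain Python (not the Spark API) that simulates hash partitioning of a skewed key set and then applies key salting, one common mitigation. The dataset, the `hot` key, the partition count, the salt count, and the CRC32-based partitioner are all assumptions made for illustration; Spark's own partitioner uses a different hash function.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for a hash partitioner (assumption: CRC32,
    # chosen only so the example gives the same result on every run).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    # Count how many rows land in each simulated shuffle partition.
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, num_salts):
    # Append a rotating suffix so each original key is split into
    # up to num_salts distinct keys.
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


# Hypothetical dataset: one hot key owns 9,000 of 10,000 rows.
keys = ["hot"] * 9000 + [f"user_{i}" for i in range(1000)]
num_partitions = 8

skewed = partition_sizes(keys, num_partitions)
salted = partition_sizes(salt_keys(keys, 64), num_partitions)

print("rows per partition before salting:", skewed)
print("rows per partition after salting: ", salted)
```

Before salting, every row with the hot key hashes to the same partition, so one task does most of the work while the others sit nearly idle. After salting, the hot key's rows are spread over several partitions. In a real job the salt has to be undone afterwards: an aggregation needs a second pass that combines the per-salt partial results, and a join needs the other side replicated once per salt value.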