#apache-spark

from Techzine Global
1 week ago

Snowflake launches Snowpark Connect to run Spark code natively

Snowpark Connect facilitates Apache Spark code execution directly within Snowflake warehouses, eliminating the need for separate Spark clusters and associated complexities like data movement.
Data science
from The Register
2 weeks ago

Snowflake builds Spark clients for its own analytics engine

Customers have been using Spark for a long time to process data and get it ready for use in analytics or in AI. The burden of running in separate systems with different compute engines creates complexity in governance and infrastructure.
Data science
from InfoQ
1 month ago

Databricks Contributes Spark Declarative Pipelines to Apache Spark

Databricks is contributing the technology behind Delta Live Tables (DLT) to the Apache Spark project as Spark Declarative Pipelines, simplifying the development of streaming pipelines.
Data science
from Medium
2 months ago

Leveraging Broadcast Joins in Apache Spark (Scala)

Broadcast joins optimize Spark for faster dataset joins by broadcasting smaller datasets, avoiding costly shuffle operations.
Scala
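A minimal sketch of the broadcast-hint pattern the article describes, using hypothetical `orders` and `customers` tables (names are illustrative, not from the article):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("broadcast-join-demo").getOrCreate()
import spark.implicits._

// Hypothetical tables: a large fact table and a small dimension table.
val orders    = Seq((1, 100), (2, 200), (3, 100)).toDF("order_id", "customer_id")
val customers = Seq((100, "Alice"), (200, "Bob")).toDF("customer_id", "name")

// broadcast() hints Spark to ship the small table to every executor,
// replacing a shuffle join with a local hash join on each partition.
val joined = orders.join(broadcast(customers), Seq("customer_id"))
joined.show()
```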
from Medium
1 month ago

From Frustrating to Fast: Speeding Up Spark Tests Using Shared Sessions

Using a shared Spark session significantly reduces the execution time for unit tests in Spark jobs.
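One common way to implement this (a sketch, not necessarily the article's exact helper) is a trait backed by a single lazily-created local session, so every suite reuses one JVM-wide SparkSession instead of paying startup cost per test class:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical test helper: suites mix this in to share one session.
trait SharedSparkSession {
  lazy val spark: SparkSession = SharedSparkSession.instance
}

object SharedSparkSession {
  // Created once per JVM; local[2] and a disabled UI keep tests light.
  lazy val instance: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
}
```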
from Medium
2 months ago

RDD vs DataFrame vs Dataset in Apache Spark: Which One Should You Use and Why

Knowing when to use each of Spark's APIs — RDD, DataFrame, and Dataset — saves development time and avoids performance pitfalls in big data processing.
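The three APIs side by side, as a brief sketch (the case class and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

case class User(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").appName("spark-apis").getOrCreate()
import spark.implicits._

// RDD: low-level, functional, no Catalyst query optimization.
val rdd = spark.sparkContext.parallelize(Seq(User(1, "Alice"), User(2, "Bob")))

// DataFrame: Row-based with a schema; queries go through Catalyst.
val df = rdd.toDF()

// Dataset: compile-time types plus Catalyst optimization.
val ds = df.as[User]
ds.filter(_.name == "Alice").show()
```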
from Medium
2 months ago

Frequent Spark Interview Questions, Part 2

Both cache() and persist() keep an RDD/DataFrame/Dataset around to avoid recomputation. cache() is shorthand for a default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames/Datasets), while persist() lets you choose the level explicitly.
Scala
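A short sketch of the difference in practice:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("caching").getOrCreate()

// cache() uses the default storage level for DataFrames (MEMORY_AND_DISK).
val df = spark.range(1000000L).toDF("n")
df.cache()

// persist() lets you pick the level explicitly, e.g. serialized
// in-memory blocks that spill to disk when memory is tight.
val df2 = spark.range(1000000L).toDF("n").persist(StorageLevel.MEMORY_AND_DISK_SER)

df.count()     // first action materializes the cache
df.count()     // served from cache, skipping recomputation
df.unpersist()
df2.unpersist()
```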
#data-engineering
from Medium
2 months ago
Data science

Day 6: Sessionization of Web Logs Using Time Difference | Apache Spark Interview Problem
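The standard approach to this interview problem is a window function: compute the gap to the previous event per user with lag(), flag gaps above a threshold as session starts, then take a running sum to number the sessions. A sketch with an assumed (user_id, ts) log schema and a 30-minute threshold:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("sessionize").getOrCreate()
import spark.implicits._

// Hypothetical log data; a new session starts after a > 30-minute gap.
val logs = Seq(
  ("u1", "2024-01-01 10:00:00"),
  ("u1", "2024-01-01 10:10:00"),
  ("u1", "2024-01-01 11:30:00")   // > 30 min gap: new session
).toDF("user_id", "ts").withColumn("ts", to_timestamp($"ts"))

val byUser = Window.partitionBy($"user_id").orderBy($"ts")

val sessions = logs
  .withColumn("prev_ts", lag($"ts", 1).over(byUser))
  .withColumn("gap_s", unix_timestamp($"ts") - unix_timestamp($"prev_ts"))
  .withColumn("new_session", when($"gap_s".isNull || $"gap_s" > 1800, 1).otherwise(0))
  .withColumn("session_id", sum($"new_session").over(byUser)) // running sum

sessions.show()
```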

from Medium
3 months ago
Data science

Understanding the load() Function in Apache Spark: Syntax, Examples, and Best Practices


from Medium
3 months ago

Apache Spark: Fix data skew issue using salting technique (practical example)

Data skew in Apache Spark is a performance issue where a few keys dominate the data distribution, leading to uneven partitions and slow queries, especially during operations that require shuffling.
Data science
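The salting technique in brief: append a random salt to the skewed side so a hot key is split across several partitions, and explode the other side over all salt values so every salted key still finds its match. A minimal sketch with hypothetical tables:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("salting").getOrCreate()
import spark.implicits._

val numSalts = 8

// Hypothetical skewed join: the "hot" key dominates `events`.
val events = Seq(("hot", 1), ("hot", 2), ("cold", 3)).toDF("key", "value")
val dims   = Seq(("hot", "H"), ("cold", "C")).toDF("key", "label")

// Spread each key across numSalts partitions with a random salt...
val saltedEvents = events.withColumn("salt", (rand() * numSalts).cast("int"))

// ...and replicate the dimension side once per salt value.
val saltedDims = dims.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

val joined = saltedEvents
  .join(saltedDims, Seq("key", "salt"))
  .drop("salt")
joined.show()
```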
from Medium
3 months ago

Scala #15: Spark: Text Feature Transformers

Tokenization and HashingTF are essential steps in preparing text data for machine learning in Spark.
Scala
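The two transformers chain naturally: Tokenizer splits text into words, and HashingTF turns the word lists into fixed-width term-frequency vectors. A sketch with made-up documents:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("text-features").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0, "spark makes big data simple"),
  (1, "tokenize then hash")
).toDF("id", "text")

// Tokenizer lowercases and splits each sentence on whitespace.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(docs)

// HashingTF hashes each token to a bucket and counts term frequencies,
// producing a fixed-width sparse feature vector.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("features").setNumFeatures(1024)
hashingTF.transform(words).select("id", "features").show(truncate = false)
```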
from Medium
3 months ago

Data Quality Verification with Deequ: A Practical Approach Using Scala

Utilizing Deequ and Scala for efficient and automated data validation is highly effective for managing large datasets.
Scala
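A sketch of the declarative style Deequ brings, assuming the com.amazon.deequ library is on the classpath (table and constraint names are illustrative):

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("deequ-demo").getOrCreate()
import spark.implicits._

val users = Seq((1L, "alice@example.com"), (2L, "bob@example.com")).toDF("id", "email")

// Declare constraints once; Deequ compiles them into Spark jobs.
val result = VerificationSuite()
  .onData(users)
  .addCheck(
    Check(CheckLevel.Error, "basic integrity")
      .isComplete("id")    // no nulls
      .isUnique("id")      // primary-key style
      .isComplete("email"))
  .run()

assert(result.status == CheckStatus.Success)
```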
from Medium
4 months ago

Apache Spark and the Big Data Ecosystem

Apache Spark simplifies Big Data processing with its robust architecture, enhancing efficiency in managing vast data resources.
Data science
from Medium
4 months ago

Big Data for the Data Science-Driven Manager 03: Apache Spark Explained for Managers

Apache Spark is crucial for efficiently processing large datasets in modern enterprises.