#apache-spark tag

Data science

Day 6-Sessionization of Web Logs using Time Difference | Apache Spark Interview Problem.

Data science

Understanding the load() Function in Apache Spark: Syntax, Examples, and Best Practices

from InfoQ

5 days ago

Data science

Reliable Data Flows and Scalable Platforms: Tackling Key Data Challenges

Software development

from InfoQ

1 week ago

How to Use Apache Spark to Craft a Multi-Year Data Regression Testing and Simulations Framework

Apache Spark can be used unconventionally to perform planetary-scale, multi-year data regression testing for billing system migrations and scalability rewrites.

#scala

Software development

Java Developers, Here Are 4 Superpowers You Gain by Learning Scala

Software development

Why Scala is preferred for Big Data Processing over Java?

Data science

Basics of Big Data and Streaming

from Techzine Global

3 months ago

Snowflake launches Snowpark Connect to run Spark code natively

Snowpark Connect facilitates Apache Spark code execution directly within Snowflake warehouses, eliminating the need for separate Spark clusters and associated complexities like data movement.

Data science

from The Register

3 months ago

Snowflake builds Spark clients for its own analytics engine

Customers have been using Spark for a long time to process data and get it ready for use in analytics or in AI. The burden of running in separate systems with different compute engines creates complexity in governance and infrastructure.

Data science

from InfoQ

Databricks Contributes Spark Declarative Pipelines to Apache Spark

Databricks is contributing the technology behind Delta Live Tables (DLT) to the Apache Spark project as Spark Declarative Pipelines, simplifying the development of streaming pipelines.

Data science

Leveraging Broadcast Joins in Apache Spark (Scala)

Broadcast joins optimize Spark for faster dataset joins by broadcasting smaller datasets, avoiding costly shuffle operations.

Scala
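
The idea behind a broadcast join can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the function name `broadcast_hash_join` and the sample data are hypothetical, and a list of lists stands in for the partitions of the larger dataset. The smaller dataset is turned into a lookup table once and reused for every partition, so the larger dataset never has to be regrouped by key.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join partitioned (key, value) rows against a small (key, value) table.

    Hypothetical helper for illustration; it is not part of any Spark API.
    """
    # Build the lookup table once from the small side. In a cluster, this
    # table is what would be copied ("broadcast") to every worker.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    # Each partition is joined locally against the lookup table, so rows of
    # the large side stay where they are (no shuffle by key).
    for partition in large_partitions:
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined


# Two partitions of a larger "orders" dataset, keyed by customer id.
orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
# A small "customers" dataset, keyed by the same id.
customers = [(1, "DE"), (2, "FR")]

result = broadcast_hash_join(orders, customers)
```

Customer 3 has no match in the small table, so that order is dropped, as in any inner join. In Spark itself the equivalent request is usually expressed with the `broadcast()` hint on the smaller DataFrame.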

From Frustrating to Fast: Speeding Up Spark Tests Using Shared Sessions

Using a shared Spark session significantly reduces the execution time for unit tests in Spark jobs.

Data science

RDD vs DataFrame vs Dataset in Apache Spark: Which One Should You Use and Why

Understanding Spark's APIs—RDD, DataFrame, and Dataset—saves time and boosts efficiency in big data processing.

Frequent Spark Interview Questions Part 2

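Both cache() and persist() store an RDD/DataFrame/Dataset in memory (or on disk) to avoid recomputation. For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); for DataFrames and Datasets, the default storage level is MEMORY_AND_DISK. persist() offers more control by letting you choose the storage level explicitly.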

Scala

Apache Spark: Fix data skew issue using salting technique (practical example)

Data skew in Apache Spark is a performance issue where a few keys dominate the data distribution, leading to uneven partitions and slow queries, especially during operations that require shuffling.

Data science
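
A minimal plain-Python sketch of the salting idea, not Spark code and not taken from the linked article: the names `add_salt` and `NUM_SALTS`, the sample keys, and the round-robin choice of salt are all assumptions made for illustration (Spark examples typically pick the salt at random). A dominant key is split into several salted keys, counted in a first stage, and then merged back by stripping the salt in a second stage.

```python
from collections import Counter

NUM_SALTS = 8  # assumed number of salt values; tune to the degree of skew


def add_salt(key, row_index, num_salts=NUM_SALTS):
    """Return a (key, salt) pair; the salt is assigned round-robin by row index."""
    return (key, row_index % num_salts)


# Skewed input: one "hot" key dominates 100 rows.
keys = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2

# Stage 1: aggregate on the salted key, so the hot key is spread over
# NUM_SALTS smaller groups instead of one large group.
partial_counts = Counter(add_salt(k, i) for i, k in enumerate(keys))

# Stage 2: strip the salt and combine the partial results per original key.
final_counts = Counter()
for (key, _salt), count in partial_counts.items():
    final_counts[key] += count

largest_plain_group = max(Counter(keys).values())    # rows in the biggest unsalted group
largest_salted_group = max(partial_counts.values())  # rows in the biggest salted group
```

The largest group a single task would have to process shrinks from 90 rows to 12, while the final per-key counts are unchanged. The cost is the extra second aggregation stage.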

Scala

Scala #15: Spark: Text Feature Transformers

Tokenization breaks sentences into individual tokens, a crucial step in natural language data processing. Together with HashingTF, it is essential for preparing text data for machine learning in Spark.

Scala

Data Quality Verification with Deequ: A Practical Approach Using Scala

Utilizing Deequ and Scala for efficient and automated data validation is highly effective for managing large datasets.

Scala

7 months ago

Apache Spark and the Big Data Ecosystem

Apache Spark simplifies Big Data processing with its robust architecture, enhancing efficiency in managing vast data resources.

Data science