Basics of Big Data and Streaming
Briefly

Scala

Scala = "Scalable Language".
Runs on the Java Virtual Machine (JVM), so it can interoperate with all Java libraries.
Designed to combine the best of object-oriented programming (OOP) and functional programming (FP).
Created by Martin Odersky (who also worked on Java generics), with its first release in 2004.

Why relevant: Spark's internals and most of its advanced features are built in Scala, so using Scala gives you the most direct, efficient access to them. It is often preferred in production Spark jobs for performance and type safety.
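To make the OOP + FP blend and the Java interop concrete, here is a minimal sketch. The `Reading` type, the `ScalaBlend` object, and the sample values are all hypothetical, invented for illustration only.

```scala
// Hypothetical domain type, invented for this sketch.
final case class Reading(sensor: String, celsius: Double) {
  // OOP side: behaviour lives on the type itself.
  def fahrenheit: Double = celsius * 9.0 / 5.0 + 32.0
}

object ScalaBlend {
  // FP side: a pure function built from higher-order collection methods.
  def averageBySensor(readings: List[Reading]): Map[String, Double] =
    readings
      .groupBy(_.sensor)
      .map { case (sensor, rs) => sensor -> rs.map(_.celsius).sum / rs.size }

  def main(args: Array[String]): Unit = {
    val readings = List(Reading("a", 20.0), Reading("a", 22.0), Reading("b", 30.0))
    println(averageBySensor(readings))

    // Java interop: a JDK collection used directly from Scala.
    val ids = new java.util.ArrayList[String]()
    readings.foreach(r => ids.add(r.sensor))
    println(ids.size())
  }
}
```

The case class carries data and methods the way a Java class would, while `groupBy` and `map` express the aggregation without mutable state. The same collection style carries over to Spark's Dataset API.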
Apache Spark

What it is: A distributed data processing engine (batch + streaming).
Why it's used: Handles large-scale data processing with speed (in-memory computation).
Languages: APIs in Scala, Java, Python, and R; Scala is the "native" language since Spark is written in Scala.

Kafka

What it is: A distributed messaging system for high-throughput event streaming.
Why it's used: Acts as a data ingestion layer. Spark can consume data from Kafka in real time (Spark Structured Streaming).
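As a rough sketch of how Spark can consume a Kafka topic through Structured Streaming, the job below reads records and prints them to the console. Everything specific here is an assumption for illustration: the broker address `localhost:9092`, the topic name `events`, the object name `KafkaToConsole`, and the `local[*]` master. It also assumes the `spark-sql` and `spark-sql-kafka-0-10` artifacts are on the classpath and that a Kafka broker is reachable.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToConsole {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster the master is
    // normally supplied by the launcher rather than hard-coded.
    val spark = SparkSession.builder()
      .appName("KafkaToConsole")
      .master("local[*]")
      .getOrCreate()

    // Kafka source: key and value arrive as binary columns,
    // so cast them to strings before using them.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                        // assumed topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Console sink: prints each micro-batch, which is handy for a first test.
    val query = events.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

In a real pipeline the console sink would typically be swapped for a durable one (files, a table, or another Kafka topic) with a checkpoint location configured, but that goes beyond what this overview covers.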
Amazon EMR

What it is: AWS's managed big data platform.
Why it's used: Lets you run Spark clusters (and Hadoop, Hive, Presto, etc.) without worrying about cluster setup and maintenance.
Spark + Kafka on EMR → scalable pipelines for both real-time streaming and batch analytics.
Scala is a JVM language combining object-oriented and functional programming, interoperating with Java libraries and offering performance and type safety. Apache Spark is a distributed engine for batch and streaming workloads that uses in-memory computation and exposes APIs in Scala, Java, Python, and R, with Scala as its native language. Kafka is a distributed messaging system for high-throughput event streaming and serves as a data ingestion layer that Spark Structured Streaming can consume in real time. Amazon EMR is a managed AWS platform that runs Spark clusters and simplifies cluster setup and maintenance, enabling scalable pipelines.