Basics of Spark
SparkContext vs SparkSession
SparkContext is the entry point for accessing Spark features through RDDs, while SparkSession, introduced in Spark 2.0, provides a single unified entry point for DataFrames, streaming, and Hive functionality, subsuming the older SQLContext, HiveContext, and StreamingContext.
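As a minimal sketch (assuming a local PySpark installation; the app name is made up for illustration), here is how the two entry points relate in practice:

```python
from pyspark.sql import SparkSession

# SparkSession is the unified entry point since Spark 2.0.
spark = (
    SparkSession.builder
    .appName("entry-points-demo")  # hypothetical app name
    .getOrCreate()
)

# The older SparkContext is still available for RDD work via the session.
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

print(rdd.count())  # RDD API through SparkContext
df.show()           # DataFrame API through SparkSession
```

Note that you rarely need to construct a SparkContext directly anymore; the session exposes one whenever RDD-level access is required.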
Lazy Evaluation
When you apply a transformation like `sort` to your data in Spark, it's crucial to understand that Spark operates under a principle known as Lazy Evaluation. This means that when you call a transformation, Spark doesn't immediately manipulate your data. Instead, Spark queues up a series of transformations, building a plan for how it will eventually execute these operations across the cluster when necessary.
Lazy evaluation in Spark ensures that transformations like `sort` don't trigger any immediate action on the data. The transformation is registered as part of the execution plan, but Spark waits to execute any operations until an action (such as `collect`, `count`, or `saveAsTextFile`) is called.
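The sketch below (assuming the `spark` session from the earlier example) illustrates the idea: the transformations only build a plan, and nothing runs until the action on the last line.

```python
# Lazy evaluation sketch: transformations only extend the plan.
df = spark.range(1_000_000)                  # no job runs yet
sorted_df = df.sort("id", ascending=False)   # still lazy; the plan grows

# An action forces Spark to execute the accumulated plan on the cluster.
print(sorted_df.count())                     # this line triggers the job
```

You can also inspect the plan Spark has built, without running it, by calling `sorted_df.explain()`.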
Here's a table that summarizes the key differences between RDDs and DataFrames in Apache Spark; a short code example contrasting the two APIs follows the table:
| Feature | RDDs | DataFrames |
| --- | --- | --- |
| Abstraction Level | Low-level, providing fine-grained control over data and operations. | High-level, providing a more structured and higher-level API. |
| Optimization | No automatic optimization; operations are executed as they are defined. | Uses the Catalyst optimizer for logical and physical plan optimizations. |
| Ease of Use | Requires more detailed and complex syntax for operations. | Simplified, more intuitive operations similar to SQL. |
| Typed and Untyped APIs | Provides a type-safe API (especially in Scala and Java). | Offers mostly untyped APIs, except when using Datasets in Scala and Java. |
| Interoperability | Primarily functional programming interfaces in Python, Scala, and Java. | Supports SQL queries plus Python, Scala, Java, and R APIs. |
| Performance | May require manual optimization, such as coalescing partitions. | Generally faster due to the Catalyst optimizer and Tungsten execution engine. |
| Fault Tolerance | Achieved through lineage information, allowing lost data to be recomputed. | Also fault-tolerant through Spark SQL's execution engine. |
| Use Cases | Suitable for complex computations, fine-grained transformations, and unstructured data. | Best for standard processing of structured or semi-structured data, benefiting from built-in optimization. |
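As a hedged illustration of the difference in style (assuming the `spark` session from earlier; the sample data is made up), the same word count looks like this in each API:

```python
from pyspark.sql import functions as F

sample = ["spark is fast", "rdds are low level"]  # hypothetical sample data

# RDD API: explicit functional steps, no automatic optimization.
rdd_counts = (
    spark.sparkContext.parallelize(sample)
         .flatMap(lambda line: line.split(" "))
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.collect())

# DataFrame API: declarative; Catalyst plans and optimizes the execution.
df_counts = (
    spark.createDataFrame([(s,) for s in sample], ["line"])
         .select(F.explode(F.split("line", " ")).alias("word"))
         .groupBy("word")
         .count()
)
df_counts.show()
```

The RDD version spells out every step, while the DataFrame version describes the desired result and leaves the execution strategy to the optimizer.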