Posts

Spark Streaming

Spark Structured Streaming is internally a micro-batch engine. At every interval, as data arrives, a small micro-batch is created, which gives the illusion of real-time streaming; internally, however, the data is processed as micro-batches. Answering the following questions requires an understanding of the different types of triggers:

- What is the size of a micro-batch?
- When is a micro-batch triggered?

Types of Triggers

1. Unspecified (default). Suppose the first micro-batch contains two files, File1 and File2. Once its processing is complete, the second micro-batch is triggered, provided there is some data that needs to be processed; a subsequent micro-batch fires only when there are files waiting. Here, the second micro-batch is triggered when File3 arrives. If File4 arrives while File3 is still being processed, the third micro-batch will be triggered soon after the 2nd micro-b...

Basics of Spark

SparkContext vs SparkSession

While SparkContext is used to access Spark features through RDDs, SparkSession provides a single point of entry for DataFrames, streaming, and Hive features, subsuming HiveContext, SQLContext, and StreamingContext. SparkSession was introduced in Spark 2.0 as the new entry point for Spark applications.

Lazy Evaluation

When you apply a transformation like sort to your data in Spark, it's crucial to understand that Spark operates under a principle known as lazy evaluation. This means that when you call a transformation, Spark doesn't immediately manipulate your data. Instead, Spark queues up a series of transformations, building a plan for how it will eventually execute these operations across the cluster when necessary. Lazy evaluation ensures that transformations like sort don't trigger any immediate action on the data. The transformation is registered as a part of the ...

Spark File Format

File Formats

In the realm of data storage and processing, file formats play a pivotal role in defining how information is organized, stored, and accessed. Here's a table summarizing the characteristics, advantages, limitations, and use cases of the mentioned file formats:

| Format | Characteristics | Advantages | Limitations | Use Cases |
|---|---|---|---|---|
| Comma-Separated Values (CSV) | Simple, all data as strings, line-based. | Simplicity, human readability. | Space inefficient, slow I/O for large datasets. | Small datasets, simplicity required. |
| XML/JSON | Hierarchical, complex structures, human-readable. | Suitable for complex data, human-readable. | Non-splittable, can be verbose. | Web services, APIs, configurations. |
| Avro | Row-based, schema in header, write-optimized. | Best for schema evolution, write efficiency, landing-zone suitability (1), predicate pushdown (2). | Slower reads for a subset of columns. | Data lakes, ETL operations, microservices architecture, schema-evolving environments. |
| Optimized Row Columnar (ORC) | Columna... |
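The CSV and JSON rows of the table can be demonstrated with the standard library alone: CSV round-trips every value as a string (types are lost), while JSON preserves basic types at the cost of a more verbose, non-splittable encoding. The sample records are made up for illustration.

```python
import csv
import io
import json

rows = [{"id": 1, "price": 9.99}, {"id": 2, "price": 4.5}]

# CSV: everything becomes text on the way out...
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)

# ...and comes back as strings, so the schema must be re-applied by hand.
back = list(csv.DictReader(io.StringIO(buf.getvalue())))
# back[0]["id"] == "1"  (a string, not an int)

# JSON: basic types survive the round trip, but every record repeats
# its field names, which is why JSON is verbose at scale.
restored = json.loads(json.dumps(rows))
# restored[0]["id"] == 1  (int preserved)
```

This is the trade-off the table summarizes: CSV is simple but typeless; JSON carries structure and types but pays for it in size.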