Spark Streaming

Spark Structured Streaming is internally a micro-batch engine. As data arrives, a small micro-batch is created at each interval, which gives the illusion of real-time streaming. Internally, however, the data is processed as a series of micro-batches.

Two questions need to be answered:

- What is the size of a micro-batch?

- When is a micro-batch triggered?

Answering them requires an understanding of the different types of triggers.

Types of Triggers -

1. Unspecified (Default)

With the default trigger, once the first micro-batch (say it contains two files, File1 and File2) finishes processing, the next micro-batch is triggered immediately, provided there is data to process. A subsequent micro-batch is triggered only when there are new files to be processed.

For example, the second micro-batch is triggered when File3 arrives. If File4 arrives while File3 is still being processed, the third micro-batch is triggered as soon as the second micro-batch completes.
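
Below is a minimal PySpark sketch of a query that relies on the default trigger (no .trigger() call at all); the source path, schema, and output locations are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("default-trigger-demo").getOrCreate()

# Read a stream of CSV files from a landing folder (hypothetical path and schema)
orders_df = (spark.readStream
    .format("csv")
    .option("header", "true")
    .schema("order_id INT, amount DOUBLE")
    .load("/data/landing/orders/"))

# No .trigger() specified: each micro-batch starts as soon as the previous one
# finishes, provided new files have arrived in the source folder.
query = (orders_df.writeStream
    .format("parquet")
    .option("path", "/data/output/orders/")
    .option("checkpointLocation", "/data/checkpoints/orders/")
    .outputMode("append")
    .start())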

2. Fixed Interval

Each micro-batch starts at the specified fixed interval, provided there is data to be processed.

Syntax :

.trigger(processingTime="<time interval, e.g. 5 minutes>")

Example

For instance, suppose the fixed interval is set to 5 minutes.

Scenario 1 - If the 1st micro-batch completes its processing in 2 minutes, Spark waits 3 more minutes to meet the fixed interval. At the 5-minute mark, if a new file has arrived that needs to be processed, the 2nd micro-batch is triggered. If there are no files to be processed, the 2nd micro-batch is not triggered.

Scenario 2 - If the 1st micro-batch takes 8 minutes, exceeding the specified fixed interval, the 2nd micro-batch is triggered immediately, provided there is data to process. Spark does not wait until the 10-minute mark, because the current processing has already exceeded the interval.

Note: Triggers are specified on the writeStream side.
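
As a sketch, reusing the hypothetical orders_df stream from the earlier example, a fixed interval trigger would be attached on the writeStream side roughly as follows:

# The trigger is attached on the writeStream side.
query = (orders_df.writeStream
    .format("parquet")
    .option("path", "/data/output/orders/")
    .option("checkpointLocation", "/data/checkpoints/orders/")
    .outputMode("append")
    .trigger(processingTime="5 minutes")  # start a new micro-batch every 5 minutes, if data is available
    .start())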

3. Available Now

With this trigger, the query processes all the data that is available when it starts and then stops automatically.

Syntax :

.trigger(availableNow = True)

Example - Say three files, File1, File2 and File3, are available when the query starts. The Available Now trigger will process all of them and then the query will stop.
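
A minimal sketch of the Available Now trigger (available in Spark 3.3 and later), again reusing the hypothetical orders_df stream and paths from above:

# Process everything that is currently available, then stop the query.
query = (orders_df.writeStream
    .format("parquet")
    .option("path", "/data/output/orders/")
    .option("checkpointLocation", "/data/checkpoints/orders/")
    .outputMode("append")
    .trigger(availableNow=True)
    .start())

query.awaitTermination()  # returns once all available data has been processed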

Note :

- Available Now may look similar to Fixed Interval. However, with a fixed interval trigger, cluster resources are held even when the query is idle; with Available Now, the resources are released as soon as the processing is complete.

- Available Now also resembles batch processing, since the job has to be scheduled for each subsequent run. However, Available Now automatically handles incremental processing, which plain batch processing does not.

Deciding how and when to trigger a new micro-batch is an important design choice, and it depends on the business requirement.
