Apache Airflow Tutorial

Apache Airflow is a tool for authoring, monitoring & scheduling pipelines. As a result, it is an ideal fit for ETL & MLOps pipelines.

Example Use Cases:

  1. Extracting data from many sources, aggregating it, transforming it, and storing it in a data warehouse.

  2. Extracting insights from data and displaying them in an analytics dashboard.

  3. Training, validating, and deploying machine learning models.

Key Components:

1. Webserver: The webserver is Airflow's user interface (UI), which allows you to interact with it without the need for a CLI or an API. From there you can execute and monitor pipelines, create connections with external systems, inspect datasets, and much more.

2. Scheduler: The scheduler is responsible for executing different tasks at the correct time, re-running pipelines, backfilling data, ensuring task completion, etc.

3. Executors: Executors are the mechanism by which pipelines actually run. There are several types that run pipelines locally, on a single machine, or in a distributed fashion.

4. Database: The metadata database stores all metadata related to pipelines and their runs. Airflow supports PostgreSQL, MySQL, and other databases as well (a sample configuration follows this list).
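As a rough illustration, the executor and the metadata database are set in airflow.cfg (or the matching AIRFLOW__* environment variables). The connection string below is a placeholder, and note that the sql_alchemy_conn key lives under [database] in recent Airflow versions ([core] in older ones):

[core]
executor = LocalExecutor

[database]
# placeholder connection string for the metadata database
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow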

Airflow Basic Concepts:

1. DAGs: 
All pipelines are defined as directed acyclic graphs (DAGs). Any time we execute a DAG, an individual run is created. Each DAG run is separate from the others and carries a status describing the execution stage of the DAG. This means that the same DAG can be executed many times in parallel.
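As a minimal sketch (the dag_id, dates, and task id are placeholder values), a DAG is usually declared with the DAG context manager:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator in older Airflow versions

# Each run of this DAG gets its own DAG run object with its own state.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    start = EmptyOperator(task_id="start")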

2. Tasks:
Each node of the DAG represents a Task, meaning an individual piece of code. Each task may have some upstream and downstream dependencies. These dependencies express how tasks are related to each other and in which order they should be executed. Whenever a new DAG run is initialized, all tasks are initialized as Task instances. This means that each Task instance is a specific run for the given task.
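A hedged sketch using the TaskFlow API (the dag_id, function names, and values are illustrative): each decorated function becomes a task, and every DAG run creates a fresh task instance of it.

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

@task
def extract():
    # Runs once per DAG run as its own task instance.
    return [1, 2, 3]

@task
def transform(values):
    # Downstream task; receives the upstream task's return value.
    return [v * 2 for v in values]

with DAG(dag_id="taskflow_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    transform(extract())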


3. Operators:

Operators can be viewed as templates for predefined tasks because they encapsulate boilerplate code and abstract much of their logic. Some common operators are the BashOperator, PythonOperator, MySqlOperator, and S3FileTransformOperator. As you can tell, operators help you define tasks that follow a specific pattern. For example, the MySqlOperator creates a task to execute a SQL query and the BashOperator executes a bash script.

Operators are defined inside the DAG context manager.
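For instance, here is a hedged sketch of two operator-backed tasks defined inside the DAG context manager (the dag_id, task ids, and commands are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _say_hello():
    print("hello from Airflow")

with DAG(dag_id="operator_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    # BashOperator runs a shell command; PythonOperator calls a Python function.
    run_script = BashOperator(task_id="run_script", bash_command="echo 'running'")
    say_hello = PythonOperator(task_id="say_hello", python_callable=_say_hello)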

4. Task Dependencies:

To form the DAG’s structure, we need to define dependencies between each task. One way is to use the >> symbol as shown below:

task1 >> task2 >> task3

Note that one task may have multiple dependencies:

task1 >> [task2, task3]

The other way is through the set_downstream and set_upstream functions:

t1.set_downstream([t2, t3])

5. XComs:

XComs, or cross-communications, are responsible for communication between tasks. XCom objects can push or pull data between tasks; more specifically, they push data into the metadata database, from which other tasks can pull it. That is why there is a limit to the amount of data that can be passed through them.
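A hedged sketch of pushing and pulling a small value via XCom with PythonOperator callables (the dag_id, task ids, key, and value are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _push(ti):
    # Stores a small value in the metadata database under the key "row_count".
    ti.xcom_push(key="row_count", value=42)

def _pull(ti):
    # Retrieves the value pushed by the "push_rows" task.
    print(ti.xcom_pull(task_ids="push_rows", key="row_count"))

with DAG(dag_id="xcom_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    push_rows = PythonOperator(task_id="push_rows", python_callable=_push)
    print_rows = PythonOperator(task_id="print_rows", python_callable=_pull)
    push_rows >> print_rows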

6. Scheduling:

Scheduling jobs is one of the core features of Airflow. This can be done using the schedule_interval argument, which receives a cron expression, a datetime.timedelta object, or a predefined preset such as @hourly or @daily. A more flexible approach is to use the recently added timetables, which let you define custom schedules using Python.


Once we define a DAG, we set a start date and a schedule interval. If catchup=True, Airflow will create DAG runs for all schedule intervals from the start date until the current date. If catchup=False, Airflow will schedule runs only from the current date onward.
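A hedged sketch tying these pieces together (the dag_id, dates, and task id are placeholders): a daily schedule with catchup enabled, so Airflow creates a run for every daily interval since the start date.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_catchup_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",   # cron expression, timedelta, or preset
    catchup=True,                 # set to False to skip past intervals
) as dag:
    noop = EmptyOperator(task_id="noop")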

Backfilling extends this idea by enabling us to create past runs from the CLI irrespective of the value of the catchup parameter:

airflow backfill -s <START_DATE> -e <END_DATE> <DAG_NAME>


Connections & Hooks:

Airflow provides an easy way to configure connections with external systems or services. Connections can be created through the UI, as environment variables, or in a config file. They usually require a URL, authentication info, and a unique id. Hooks are an API that abstracts communication with these external systems.
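As a hedged sketch (assuming the Postgres provider package is installed and a connection with the id "my_postgres" has already been created), a hook looks up the stored connection and hides the client boilerplate:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def _count_rows():
    # The hook reads host, credentials, and port from the "my_postgres" connection.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    records = hook.get_records("SELECT count(*) FROM my_table")  # placeholder query
    print(records)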

Airflow Best Practices:

  1. Idempotency: DAGs and tasks should be idempotent. Re-executing the same DAG run with the same inputs should always have the same effect as executing it once.

  2. Atomicity: Tasks should be atomic. Each task should be responsible for a single operation and be independent of the other tasks.

  3. Top-level code: Top-level code should be avoided unless it creates operators or DAGs, because it slows down DAG parsing and loading. Expensive imports, database access, and heavy computations should live inside tasks (see the sketch after this list).

  4. Complexity: DAGs should be kept as simple as possible because high complexity may impact performance or scheduling.
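A hedged illustration of the top-level code guideline (pandas and the file path are placeholders): expensive imports and heavy work are kept inside the task callable so they only run when the task executes, not every time the scheduler parses the DAG file.

def _train_model():
    # Imported here, at task run time, instead of at the top of the DAG file.
    import pandas as pd

    df = pd.read_csv("/data/training_data.csv")  # placeholder path
    print(len(df))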


