Apache Airflow Tutorial
Apache Airflow is a tool for authoring, scheduling, and monitoring data pipelines. That makes it an ideal fit for ETL and MLOps pipelines.
Example Use Cases:
- Extracting data from many sources, aggregating it, transforming it, and storing it in a data warehouse
- Extracting insights from data and displaying them in an analytics dashboard
- Training, validating, and deploying machine learning models
Operators can be viewed as templates for predefined tasks because they encapsulate boilerplate code and abstract much of the underlying logic. Some common operators are the BashOperator, PythonOperator, MySqlOperator, and S3FileTransformOperator. As the names suggest, operators help you define tasks that follow a specific pattern: the MySqlOperator creates a task that executes an SQL query against a MySQL database, and the BashOperator creates a task that runs a bash command or script.
Operators are defined inside the DAG context manager.
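Here is a minimal sketch of this pattern, assuming Airflow 2.x; the dag_id, dates, and the callable are illustrative placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _print_hello():
    print("hello from a PythonOperator task")


with DAG(
    dag_id="example_operators_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A task that runs an arbitrary bash command
    say_date = BashOperator(task_id="say_date", bash_command="date")

    # A task that runs an arbitrary Python callable
    say_hello = PythonOperator(task_id="say_hello", python_callable=_print_hello)
```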
4. Task Dependencies:
To form the DAG's structure, we need to define dependencies between tasks. One way is to use the >> operator, as shown below:
task1 >> task2 >> task3
Note that one task may be linked to multiple downstream tasks at once:
task1 >> [task2, task3]
The other way is through the set_downstream and set_upstream methods:
task1.set_downstream([task2, task3])
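Both styles describe the same structure. Here is a small sketch, assuming Airflow 2.3+ for the EmptyOperator (the dag_id and task names are placeholders), where task1 runs first and task2 and task3 then run in parallel:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_dependencies",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")
    task3 = EmptyOperator(task_id="task3")

    # Bit-shift style: task1 first, then task2 and task3 in parallel
    task1 >> [task2, task3]

    # Equivalent to the line above:
    # task1.set_downstream([task2, task3])
```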
5. XComs:
XComs, or cross-communications, are responsible for communication between tasks. XCom objects can push or pull data between tasks; more specifically, they push data into the metadata database, from which other tasks can pull it. That is also why there is a limit to the amount of data that can be passed through them.
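A minimal sketch of this mechanism, assuming Airflow 2.x (the dag_id and task names are placeholders): the first task pushes a value implicitly through its return value, and the second pulls it via the task instance.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _extract():
    # The returned value is pushed to XCom under the default key "return_value"
    return {"rows": 42}


def _report(ti):
    # Pull the value that the "extract" task pushed into the metadata database
    stats = ti.xcom_pull(task_ids="extract")
    print(f"extracted {stats['rows']} rows")


with DAG(
    dag_id="example_xcom",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    report = PythonOperator(task_id="report", python_callable=_report)
    extract >> report
```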
6. Scheduling:
Scheduling jobs is one of the core features of Airflow. This is done through the schedule_interval argument, which accepts a cron expression, a datetime.timedelta object, or a preset such as @hourly or @daily. A more flexible approach is the more recently added timetables, which let you define custom schedules in Python.
Once we define a DAG, we set a start date and a schedule interval. If catchup=True, Airflow will create DAG runs for all schedule intervals from the start date up to the current date. If catchup=False, Airflow skips the missed intervals and only schedules runs from the current date onwards.
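A minimal sketch, assuming Airflow 2.x, of a daily schedule with catchup disabled; the dag_id and dates are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_scheduling",
    start_date=datetime(2023, 1, 1),
    # A preset; a cron string like "0 0 * * *" or timedelta(days=1) also work
    schedule_interval="@daily",
    # With catchup=True, Airflow would create a run for every missed daily
    # interval since start_date; with catchup=False it only schedules new runs
    catchup=False,
) as dag:
    BashOperator(task_id="echo_run_date", bash_command="echo {{ ds }}")
```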
Backfilling extends this idea by enabling us to create past runs from the CLI irrespective of the value of the catchup parameter:
airflow dags backfill -s <START_DATE> -e <END_DATE> <DAG_NAME>
Connections & Hooks:
Airflow provides an easy way to configure connections to external systems or services. Connections can be created through the UI, as environment variables, or in a configuration file. They usually require a URL, authentication info, and a unique connection id. Hooks are APIs that abstract the communication with these external systems by building on top of these connections.
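As an illustration, the sketch below assumes Airflow 2.x with the Postgres provider installed; the connection id "my_postgres" and the users table are hypothetical and would have to exist in your environment:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_user_count() -> int:
    # Typically called from inside a task, e.g. a PythonOperator callable.
    # "my_postgres" is a hypothetical connection id created beforehand in the
    # UI or as the environment variable AIRFLOW_CONN_MY_POSTGRES.
    hook = PostgresHook(postgres_conn_id="my_postgres")

    # The hook handles connecting and authenticating using the stored connection
    records = hook.get_records("SELECT COUNT(*) FROM users;")
    return records[0][0]
```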
Airflow Best Practices:
Idempotency: DAGs and tasks should be idempotent. Re-executing the same DAG run with the same inputs should always have the same effect as executing it once.
Atomicity: Tasks should be atomic. Each task should be responsible for a single operation and be independent of other tasks.
Top-level code: Code that does not create operators or DAGs should be avoided at the top level of a DAG file because it hurts parsing performance and loading time. Expensive work such as heavy imports, database access, and heavy computations should live inside task callables (see the sketch after this list).
Complexity: DAGs should be kept as simple as possible because high complexity may degrade parsing and scheduling performance.
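A small sketch of the top-level code guideline, assuming Airflow 2.x; the dag_id and the pandas usage are illustrative. The heavy import happens inside the task callable, so the scheduler can parse the DAG file cheaply:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _transform():
    # Imported at run time, not at DAG-parsing time
    import pandas as pd  # hypothetical heavy dependency

    df = pd.DataFrame({"value": [1, 2, 3]})
    print(df["value"].sum())


with DAG(
    dag_id="example_top_level_code",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Module-level code only builds the DAG structure; the heavy work runs
    # inside the task when the DAG is executed
    PythonOperator(task_id="transform", python_callable=_transform)
```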