Mastering Apache Airflow, Part 4: DAG Runs, Task Lifecycles, and Execution Dynamics
Understanding DAG Runs, Task Dependencies, Execution States, and Failover Mechanisms
Hello fellow Substackers! 👋
Happy Easter, everyone! 🐣🌸 Whether you're spending the day with family or simply taking some time to recharge, I hope you’re enjoying the holiday to the fullest.
Now, without wasting any more of your time, let's dive in! In today's post, we're going deep into some core elements of Apache Airflow that, while easy to set up, are absolutely crucial for keeping your workflows running smoothly. These features might seem simple on the surface, but they are the backbone of efficient task orchestration and scheduling. 🌟
So, why should you care about these details? Well, when you understand how DAG Runs work and how to set up task dependencies, you gain total control over your pipeline. This means you can fine-tune the execution, handle failures with ease, and ensure that tasks retry when necessary without worrying about missing a beat. With task retries, SLAs, and failover mechanisms in place, you’ll avoid bottlenecks and optimize your workflows for performance. 🚀
Let's break them down in simple terms so you can use these tools to make your Airflow pipelines reliable and scalable. Ready? Let's get started!
🌀 DAG Runs in Apache Airflow
A DAG Run represents a specific execution of your DAG, associated with a unique timestamp and a fixed time window called the data interval. Each time your DAG is scheduled or manually triggered, a new DAG Run is created. All tasks defined within the DAG will be executed as part of that run, independently from any other runs.
DAG Run Status
The status of a DAG Run is evaluated based on the final state of its leaf tasks—these are tasks with no downstream dependencies. Once all leaf tasks reach a terminal state, the DAG Run is marked accordingly:
✅ Success: All leaf nodes are either success or skipped
❌ Failed: At least one leaf node is failed or upstream_failed
⚠️ Note: Be cautious of tasks with non-standard trigger rules. For instance, a leaf node with the all_done trigger rule executes regardless of previous task outcomes. If it completes successfully, it may cause the entire DAG Run to be marked as successful—even if other tasks have failed.
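To make that concrete, here is a minimal sketch (the DAG id, task ids, and commands are made up for illustration) of a leaf task whose all_done trigger rule can mask an upstream failure:

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    "trigger_rule_demo",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
):
    extract = BashOperator(task_id="extract", bash_command="exit 1")  # always fails
    # Leaf task that runs regardless of upstream outcomes; if it succeeds,
    # the DAG Run can still be marked successful despite the failed extract.
    cleanup = BashOperator(
        task_id="cleanup",
        bash_command="echo cleaning up",
        trigger_rule=TriggerRule.ALL_DONE,
    )
    extract >> cleanup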
Starting with Airflow 2.7, the UI provides tabs to view Running and Failed DAGs based on their latest run status.
⏱ Data Intervals and Logical Dates
Each DAG Run operates over a data interval, defining the window of time the DAG is responsible for processing. This interval is separate from the time the DAG actually runs.
For example, a DAG scheduled with @daily and a start_date of 2023-01-01 will generate:
a DAG Run with execution date 2023-01-01, covering data from 2023-01-01 00:00 to 2023-01-02 00:00
this run will be scheduled shortly after 2023-01-02 00:00
The execution date (or logical date) corresponds to the start of the data interval, not when the DAG is triggered. Airflow will never run the DAG for the current in-progress interval.
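If a task needs those interval boundaries, they are exposed as template variables. Here is a tiny, illustrative snippet (it assumes a dag object defined elsewhere, and the task id is a placeholder):

from airflow.operators.bash import BashOperator

BashOperator(
    task_id="process_interval",
    bash_command=(
        "echo processing data from {{ data_interval_start }} "
        "to {{ data_interval_end }}"
    ),
    dag=dag,
)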
🧠 Use timetables if cron expressions or timedelta-based scheduling isn’t expressive enough for your use case.
🔁 Catchup Behavior
Catchup controls whether Airflow should backfill all missing DAG Runs between the start_date and the current date.
catchup=True (default): All past intervals are scheduled. Useful for time-partitioned or batch processing pipelines.
catchup=False: Only the latest interval is scheduled. Ideal for real-time or event-driven use cases.
Example:
import pendulum
from airflow import DAG

# With catchup disabled, only the most recent past interval is scheduled
dag = DAG(
    "my_dag",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
)
If the DAG is picked up on 2023-01-05, Airflow will only create a run for 2023-01-04.
💡 Disabling catchup avoids flooding your scheduler with multiple historical runs if you're only interested in the latest.
🕳️ Backfilling DAG Runs
Backfilling manually runs a DAG for historical intervals, even when catchup=False. This is commonly used when:
A dataset was reloaded or updated
A DAG was introduced late but needs historical data processed
airflow dags backfill \
--start-date 2023-01-01 \
--end-date 2023-01-05 \
my_dag
This command schedules and executes all missing runs for the specified range.
🔄 Re-running DAGs and Tasks
DAG Re-runs
A failed DAG Run can be retried by:
Clicking Clear on the DAG Run from the UI
Triggering a backfill or manual run via CLI
Task Re-runs
Tasks can be retried individually:
Navigate to the Graph or Tree view
Select the failed task
Click Clear to rerun it
Options include:
Upstream / Downstream reruns
Past / Future runs of the same task
Failed tasks only
Recursive clearing (for SubDAGs)
airflow tasks clear my_dag \
--task-regex "^my_task$" \
--start-date 2023-01-01 \
--end-date 2023-01-03
📂 When a task is cleared, a new TaskInstance is created with try_number=1, and it will rerun.
Task Instance History
Every time a task is retried, Airflow preserves the history. The Grid view allows inspecting previous tries via a dropdown. You can also access logs for each attempt.
However:
XComs, rendered templates, and some runtime artifacts are not persisted between tries.
Only task metadata and logs are saved.
⚡ Triggering DAGs Manually (External Triggers)
You can manually trigger a DAG Run via:
airflow dags trigger --exec-date 2023-01-01 my_dag
This creates a DAG Run with the specified logical date, useful for event-driven workflows or ad-hoc processing.
Passing Parameters to Triggered DAGs
You can pass configuration parameters as JSON:
airflow dags trigger \
--conf '{"region": "us-east", "run_mode": "test"}' \
my_parameterized_dag
Inside your DAG, these values are accessible via the dag_run.conf dictionary:
from airflow.operators.bash import BashOperator

# Echo the "region" value passed with --conf when the DAG was triggered
BashOperator(
    task_id="print_param",
    bash_command="echo {{ dag_run.conf['region'] }}",
    dag=dag,
)
In the UI, you can define params in your DAG for a more user-friendly form-based input experience.
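For example, here is a rough sketch of how those same fields could be declared as params so the trigger form renders them with defaults and validation (the DAG id and values mirror the --conf example above and are purely illustrative):

import pendulum
from airflow import DAG
from airflow.models.param import Param
from airflow.operators.bash import BashOperator

with DAG(
    "my_parameterized_dag",
    schedule=None,  # trigger manually
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    params={
        "region": Param("us-east", type="string"),
        "run_mode": Param("test", enum=["test", "prod"]),
    },
):
    # params are available in templates just like dag_run.conf values
    BashOperator(task_id="print_param", bash_command="echo {{ params.region }}")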
🔍 The Life of an Airflow Task
Picture this: you’re a conductor orchestrating a complex symphony. Each instrument plays a different part—some keep rhythm, others handle solos, and many wait for their cue. In Airflow, Tasks are your instruments, and the DAG is your sheet music. You decide who plays, when, and in what order.
Welcome to the secret life of Airflow tasks—where Python meets orchestration, and every second counts.
🎭 What is a Task in Airflow?
At its core, a Task is just a blueprint for doing something—run a script, query a database, send an email, wait for a file to land on S3. But behind this humble façade is a powerful runtime mechanism.
A Task becomes real only when a DAG run is triggered. Then it turns into a TaskInstance, which is like a live actor performing their role in a particular show (run). Multiple actors may play the same role on different nights—just as multiple task instances can run the same task logic on different dates.
🧱 The Three Faces of a Task
Airflow lets you define tasks in three primary ways, each suited to different use cases:
1. Operators: The Lego Blocks 🧩
Operators are predefined, configurable building blocks for common task types.
Want to run a shell command? Use BashOperator.
Call a Python function? Use PythonOperator.
Send an alert? There’s an EmailOperator for that too.
Operators make it ridiculously fast to build DAGs, but they’re also flexible—you can subclass them to create custom logic if needed.
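As a rough illustration of subclassing (the operator name and logic are invented for this example), a custom operator only needs a constructor and an execute method:

from airflow.models.baseoperator import BaseOperator

class GreetOperator(BaseOperator):
    """Toy operator that logs a greeting and pushes it to XCom."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        message = f"Hello, {self.name}!"
        self.log.info(message)
        return message  # return values are stored as XCom automatically

You would then use it like any built-in operator, e.g. GreetOperator(task_id="greet", name="Airflow") inside a DAG.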
2. Sensors: The Watchdogs 🕵️
Sensors are special operators that wait. They don't do; they observe.
For example:
S3KeySensor waits for a file to show up.
ExternalTaskSensor waits for another DAG to finish.
Sensors can operate in:
poke mode (check every X seconds while holding a worker slot—resource-intensive), or
reschedule mode (free the worker slot between checks—efficient, recommended for long waits).
They’re perfect for data readiness workflows.
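Here is a sketch of a sensor in reschedule mode, assuming the Amazon provider is installed and a dag object is defined elsewhere (the bucket, key, and exact import path depend on your provider version):

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_input_file",
    bucket_name="my-bucket",            # placeholder bucket
    bucket_key="incoming/input.csv",    # placeholder key
    mode="reschedule",     # release the worker slot between checks
    poke_interval=300,     # check every 5 minutes
    timeout=6 * 60 * 60,   # give up after 6 hours
    dag=dag,
)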
3. TaskFlow API: Native Python Magic ✨
With the @task decorator, Airflow introduced a more intuitive way to define tasks as pure Python functions.
No need to wrap logic in an operator—just write a function, decorate it, and Airflow turns it into a task.
Example:
from airflow.decorators import task

@task
def transform_data():
    ...  # your logic here
It’s clean, readable, and integrates well with type hinting and XComs (more on those later).
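To show how return values flow between decorated tasks, here is a small end-to-end sketch (the DAG id and task logic are illustrative):

import pendulum
from airflow.decorators import dag, task

@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
)
def taskflow_demo():
    @task
    def extract() -> list:
        return [1, 2, 3]

    @task
    def transform_data(values: list) -> int:
        # the return value is passed to the next task via XCom
        return sum(values)

    @task
    def load(total: int) -> None:
        print(f"Loaded total: {total}")

    load(transform_data(extract()))

taskflow_demo()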
🔗 Dependencies: Who Waits for Whom?
Airflow is a Directed Acyclic Graph system. That means no task can loop back on itself—and every task must know who it’s waiting for.
You define task dependencies with:
extract >> transform >> load
Or:
transform.set_upstream(extract)
This creates an implicit contract: "I will not run until all my upstream tasks succeed."
Unless you override it with Trigger Rules like all_failed, none_skipped, or one_success.
And when you want conditional logic, Airflow gives you branching, so you can say:
If a file exists → run Path A
Else → run Path B
All with native branching operators or custom logic in Python.
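For instance, here is a rough sketch using the @task.branch decorator (the DAG id, task ids, and file path are placeholders):

import os

import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.operators.empty import EmptyOperator

with DAG(
    "branching_demo",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
):
    path_a = EmptyOperator(task_id="path_a")
    path_b = EmptyOperator(task_id="path_b")

    @task.branch
    def choose_path() -> str:
        # Return the task_id of the branch to run; the other branch is skipped
        return "path_a" if os.path.exists("/data/input.csv") else "path_b"

    choose_path() >> [path_a, path_b]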
🧬 From Task to TaskInstance: When the DAG Runs
Once a DAG is triggered, every task becomes a TaskInstance—a runtime object representing one execution of that task on a specific date and data interval.
Each instance has its own state, tracked in the metadata database:
none: Not yet evaluated
scheduled: Chosen for execution but not yet queued
queued: Waiting for an executor to pick it up
running: Currently executing
success: Finished successfully
failed: Failed with an exception
skipped: Skipped (manually or via branching)
up_for_retry: Failed, but eligible to retry
upstream_failed: Didn’t run because a parent task failed
deferred: Paused, waiting for a trigger (e.g., async sensor)
up_for_reschedule: Waiting (e.g., sensor in reschedule mode)
removed: No longer exists in the DAG definition
These states are crucial for scheduling logic, failure recovery, and dashboard visibility.
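Since up_for_retry only appears when retries are configured, here is a small, illustrative snippet of per-task retry settings (the task id and command are placeholders, and a dag object is assumed to exist):

from datetime import timedelta
from airflow.operators.bash import BashOperator

BashOperator(
    task_id="flaky_call",
    bash_command="curl https://example.com/api || exit 1",
    retries=3,                          # move to up_for_retry instead of failed
    retry_delay=timedelta(minutes=5),   # wait between attempts
    retry_exponential_backoff=True,     # increase the wait on each attempt
    dag=dag,
)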
⏱️ Timeouts vs. SLAs: Two Kinds of Deadlines
You can tell Airflow:
“If this task takes too long, kill it.”
That’s a timeout.
Example:
from datetime import timedelta
from airflow.operators.bash import BashOperator

# Kill the task if it runs longer than 20 minutes
BashOperator(
    task_id="crunch_numbers",
    bash_command="run-heavy-computation.sh",
    execution_timeout=timedelta(minutes=20),
)
You can also say:
“If this task takes more than 5 minutes after the DAG starts, let me know.”
That’s an SLA (Service Level Agreement).
SLAs don’t kill the task—they notify you of performance degradation. You can configure Airflow to send SLA violation emails or run a custom callback like:
def alert_sla_violation(dag, task_list, blocking_task_list, slas, blocking_tis):
    # send to Slack or PagerDuty
    print(f"SLA missed by: {task_list}")
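One way to wire this up, assuming the alert_sla_violation callback above (the DAG id and command are illustrative): the callback is registered on the DAG, while the sla deadline is set per task.

from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "sla_demo",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    sla_miss_callback=alert_sla_violation,  # called when any task misses its SLA
):
    BashOperator(
        task_id="crunch_numbers",
        bash_command="run-heavy-computation.sh",
        sla=timedelta(minutes=5),  # notify if not done 5 minutes after the run starts
    )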
🚨 Failing Fast, Skipping Smart
Sometimes you want to manually control what happens during runtime:
Use AirflowSkipException in your task to gracefully skip it.
Use AirflowFailException to fail the task immediately.
Example:
import os
from airflow.decorators import task
from airflow.exceptions import AirflowSkipException

@task
def check_inputs():
    if not os.path.exists("/data/input.csv"):
        raise AirflowSkipException("No input file today!")
These exceptions integrate tightly with branching logic and dependencies.
🧟 Zombie and Undead Tasks
Not all tasks die gracefully. Sometimes the worker process crashes, the pod disappears, or a task gets stuck without heartbeats.
Airflow has two cool mechanisms to keep things clean:
Zombie tasks: Task is marked as running but its job is gone. The scheduler will fail or retry it after checking heartbeats and job status.
Undead tasks: Task is running but shouldn’t be—like a ghost from a past DAG definition. The system will terminate it forcibly.
These are detected through:
Missing heartbeats (scheduler checks every 60s).
Invalid job IDs or orphaned processes.
Logic inside LocalTaskJob.heartbeat() and SchedulerJob._check_zombies().
This safety net is crucial in long-running DAGs, Kubernetes environments, or when scaling.
✨ TL;DR – Tasks Are Your DAG’s Pulse
Every DAG is a story. Every task is a scene in that story. Knowing how they behave—when they run, how they fail, how they interact—gives you the power to write reliable, scalable, and elegant data pipelines.
Whether you’re launching a quick Bash script or orchestrating an ML model training flow with 50+ tasks, everything in Airflow starts with a task.
🏁 Conclusion
And there you have it, folks!!
We've taken a deep dive into some of the essential components that help Apache Airflow maintain order in what could otherwise be a chaotic sea of tasks and dependencies. From DAG Runs and task retries to catchup behavior and task states, these features are critical to keeping your workflows running smoothly and efficiently.
By understanding how task instances, dependencies, and trigger rules interact, you're setting yourself up to build robust, scalable, and easily maintainable pipelines. Whether you're working with sensors, operators, or leveraging the TaskFlow API with Python, it's all about orchestrating those tasks to work together like a finely tuned orchestra. 🎶
As your workflows grow, mastering these tools will ensure your system can handle failures gracefully, track performance with SLAs, and scale without hitting a bottleneck. So, keep these principles in mind as you continue to craft your Airflow pipelines—they’re the foundation for success.
Wishing you all a joyful Easter filled with happiness and maybe a bit of time to relax and refresh. 🌟🚀
Feel free to reach out with any questions or feedback! Until next time, happy orchestrating! 🚀💻