
Explore Airflow KubernetesExecutor on AWS and kops

Chengzhi Zhao
Data Engineering Space
3 min read · Oct 9, 2018


With the recent launch of Apache Airflow 1.10 came some exciting changes. One impressive feature is the KubernetesExecutor, which lets users execute each task inside a Kubernetes cluster; this way, you get self-healing pods and can quickly scale your DAGs using Kubernetes.

The Bloomberg team originally started the KubernetesExecutor and contributed it back to the Airflow community.

From Airflow official docs:

The kubernetes executor is introduced in Apache Airflow 1.10.0. The Kubernetes executor will create a new pod for every task instance.

The Airflow team also has an excellent tutorial on how to use minikube to play with the KubernetesExecutor in your local environment. But you may want a different setup than minikube for production, and I hope this blog gives you some ideas for running the KubernetesExecutor in production.

After exploring the new KubernetesExecutor, there are two things to notice here:

  • A new pod for every task instance: even if your task is as simple as a print statement, it still follows the full lifecycle of a pod: create the pod -> execute the code -> destroy the pod (see the sketch after this list).
  • Tasks take time to start executing: creating the container takes time, and on minikube I usually saw a 30-second to 1-minute wait before a task finished.
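
To make the first point concrete, here is a minimal sketch of a trivial DAG; the DAG id, schedule, and callable below are placeholders of mine, not from the original post. Even a task that only prints a line gets its own pod that is created, runs the code, and is destroyed.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_hello():
    # Even a task this trivial follows the full pod lifecycle:
    # create a pod -> execute the code -> destroy the pod.
    print("hello from a brand new pod")


dag = DAG(
    dag_id="kubernetes_executor_demo",  # hypothetical DAG id
    start_date=datetime(2018, 10, 1),
    schedule_interval=None,
    catchup=False,
)

hello = PythonOperator(
    task_id="print_hello",
    python_callable=print_hello,
    dag=dag,
)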

Step 0: set up kops on AWS and a Dockerfile for Airflow

Setting up kops on AWS and a Dockerfile for Airflow is out of scope for this post. You can refer to more information here.

Step 1: change the executor in airflow.cfg

Find the following line in your airflow.cfg (it lives under the [core] section) and set it to:

executor = KubernetesExecutor
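
If you want to confirm which executor Airflow picked up after editing the file, a quick check from a Python shell might look like the sketch below; this verification step is my addition rather than part of the original walkthrough.

from airflow.configuration import conf

# Should print "KubernetesExecutor" once airflow.cfg has been updated.
print(conf.get("core", "executor"))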

Step 2: update the [kubernetes] section in airflow.cfg

worker_container_repository = $IMAGE
worker_container_tag = $VERSION
dags_volume_claim = airflow-dags
logs_volume_claim = airflow-logs

worker_container_repository and worker_container_tag are the default image and tag used when you don’t specify them in your Airflow operator. Since we are…
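
As a rough illustration of that per-operator override (the original text is truncated here), the KubernetesExecutor in Airflow 1.10 reads an executor_config from the task; the image name below is a placeholder, and the task assumes the dag object from the earlier sketch.

from airflow.operators.python_operator import PythonOperator


def heavy_job():
    print("running with a custom worker image")


# executor_config lets a single task override the default image set by
# worker_container_repository / worker_container_tag.
heavy = PythonOperator(
    task_id="heavy_job",
    python_callable=heavy_job,
    executor_config={
        "KubernetesExecutor": {
            "image": "my-registry/airflow-worker:custom",  # placeholder image
        }
    },
    dag=dag,  # assumes the DAG defined in the earlier sketch
)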
