Using local infrastructure to work with Big Data is often expensive and inefficient: tasks that run only a few hours a week still require substantial computing power that must be purchased, configured, and maintained. That is why many companies are moving big data processing to the cloud, where in a matter of minutes you can get a fully configured and optimized data cluster with per-second billing for the resources you actually use.
Another reason why working with Big Data is preferable in the cloud is the ability to use Kubernetes as a Service. The main advantages of running Big Data workloads on Kubernetes are flexible scaling and isolation of environments. The first automatically adjusts the resources allocated in the cloud as the load changes; the second uses containerization to keep different versions of libraries and applications compatible within the same cluster.
Since AirFlow is designed to orchestrate ETL processes in Big Data and Data Science, it can be launched in the cloud, and doing so is even recommended. AirFlow also works well with Kubernetes. The ways to launch AirFlow in Kubernetes were briefly mentioned above; here we describe them in more detail:
- Using the KubernetesPodOperator – in this case, only some AirFlow tasks are moved to Kubernetes, namely those that use this operator. The operator creates a dedicated pod inside Kubernetes for each such task, while the standard CeleryExecutor continues to serve as the Executor (a minimal sketch of this operator is shown after this list).
- With the help of the KubernetesExecutor – in this case, a separate Worker pod is created inside Kubernetes for each AirFlow task, and that Worker spawns new pods as needed. If you use the KubernetesPodOperator together with the KubernetesExecutor, the Worker pod is created first, and it then creates the task pod and launches the AirFlow task on it.
- The advantage of this approach is that pods are created only on demand, which saves resources and allows the system to scale as needed. However, you need to account for the delay in starting new pods. So if you have many tasks that run for only a few minutes, it is better to stick with the CeleryExecutor and move only the most resource-intensive tasks to Kubernetes with the KubernetesPodOperator.
- With the help of the CeleryKubernetesExecutor – in this case, the CeleryExecutor and KubernetesExecutor are used together (see the routing sketch after this list). This approach is recommended in the following cases:
- Most tasks are small enough to run under the CeleryExecutor, but some resource-intensive tasks also require the KubernetesExecutor.
- Relatively few tasks require an isolated environment.
- Expected peak loads significantly exceed what the Kubernetes cluster can comfortably handle.
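To make the first option more concrete, here is a minimal sketch of a DAG that sends a single heavy task to Kubernetes via the KubernetesPodOperator. It assumes a recent Airflow 2.x with the cncf.kubernetes provider installed; the DAG id, namespace, and image are illustrative, not from the original article.

```python
from datetime import datetime

from airflow import DAG
# In older provider versions the module is ...operators.kubernetes_pod instead of ...operators.pod.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical DAG: only the heavy step runs in its own Kubernetes pod,
# while the rest of the tasks keep running on the regular Celery workers.
with DAG(
    dag_id="k8s_pod_operator_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    heavy_transform = KubernetesPodOperator(
        task_id="heavy_transform",
        name="heavy-transform",      # name of the pod created for this task
        namespace="airflow",         # assumed namespace; adjust to your cluster
        image="python:3.11-slim",    # any image containing the code you need
        cmds=["python", "-c"],
        arguments=["print('resource-intensive work runs inside its own pod')"],
        get_logs=True,               # stream pod logs back into the Airflow UI
    )
```

And for the third option, the sketch below shows how tasks might be routed once the CeleryKubernetesExecutor is enabled in airflow.cfg (executor = CeleryKubernetesExecutor). Tasks sent to the Kubernetes queue are handled by the KubernetesExecutor, everything else stays on Celery; the queue name "kubernetes" is assumed to be the default, and the task names are invented for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="celery_kubernetes_routing_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # A small, frequent task: leave it on the default Celery queue.
    quick_report = BashOperator(
        task_id="quick_report",
        bash_command="echo 'fast task on a Celery worker'",
    )

    # A heavy task: route it to the Kubernetes queue so it gets its own pod.
    heavy_batch = BashOperator(
        task_id="heavy_batch",
        bash_command="echo 'resource-intensive task in its own pod'",
        queue="kubernetes",  # assumed default queue name for CeleryKubernetesExecutor
    )

    quick_report >> heavy_batch
```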
Why Is AirFlow Required?
Of course, AirFlow is far from the only solution of this kind on the market. There are many other tools for planning and monitoring ETL processes, both commercial and open source. In the simplest cases you can get by entirely with the standard Cron scheduler, setting up workflows through Crontab.
Here are some typical scenarios where AirFlow might be the best choice:
- Cron's capabilities are no longer enough to schedule your tasks, and more automation is required.
- The team already has sufficient expertise in Python programming.
- The project uses batch processing rather than stream processing. AirFlow is designed for batch jobs; Apache NiFi is a better fit for streaming data.
- The project's tasks have dependencies that can be expressed as a DAG (directed acyclic graph); a minimal sketch of such a definition follows this list.
- A transition to the cloud is planned or has already been completed, and a reliable orchestrator is needed that supports Cloud-Native principles.
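Here is a minimal sketch of what expressing dependencies as a DAG looks like in AirFlow, assuming a recent Airflow 2.x; the extract/transform/load task names are illustrative, not from the original article.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling raw data")


def transform():
    print("cleaning and aggregating")


def load():
    print("writing to the warehouse")


with DAG(
    dag_id="etl_dependencies_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The >> operator declares the edges of the graph: extract -> transform -> load.
    t_extract >> t_transform >> t_load
```

Because the dependencies are plain Python, AirFlow can render this graph in its UI and retry or backfill individual nodes without rerunning the whole pipeline.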