airflow practice

Posted by lightsong on 2024-08-05

airflow_course

https://github.com/fanqingsong/airflow_course

Sample code for Harry's Airflow online training course

You can find the videos on YouTube or Bilibili.

I am working on adding the following:

  1. the slide PDF files (done)
  2. another video about creating custom operators
  3. docker-compose with the CeleryExecutor

Dynamic loading of DAG files

Dynamic code, dynamic generation

https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#registering-dynamic-dags

from datetime import datetime
from airflow.decorators import dag, task

configs = {
    "config1": {"message": "first DAG will receive this message"},
    "config2": {"message": "second DAG will receive this message"},
}

# Each loop iteration defines and registers one DAG whose dag_id is derived
# from the config name (dynamic_generated_dag_config1, ..._config2).
for config_name, config in configs.items():
    dag_id = f"dynamic_generated_dag_{config_name}"

    @dag(dag_id=dag_id, start_date=datetime(2022, 2, 1))
    def dynamic_generated_dag():
        @task
        def print_message(message):
            print(message)

        print_message(config["message"])

    # Calling the decorated function instantiates the DAG so Airflow can
    # discover it when parsing this file.
    dynamic_generated_dag()

Static code, dynamic loading

https://www.cnblogs.com/woshimrf/p/airflow-first-dag.html

Deploying the DAG

Upload the hello.py above to the DAG folder. Airflow automatically detects the file change, parses the .py file, and imports the DAG definition into the database.

Open the Airflow web UI and refresh the page to see our DAG.
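
The linked article's hello.py is not reproduced here; as a rough sketch (the DAG id, schedule, and task below are placeholders of my own), a minimal file dropped into the DAG folder could look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_world",             # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A single task that just echoes a greeting.
    say_hello = BashOperator(task_id="say_hello", bash_command="echo 'hello world'")

No restart is needed after the file lands in the DAG folder; the DAG file processor picks it up on its next parse.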

https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dagfile-processing.html

DAG File Processing refers to the process of turning Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.

There are two primary components involved in DAG file processing. The DagFileProcessorManager is a process executing an infinite loop that determines which files need to be processed, and the DagFileProcessorProcess is a separate process that is started to convert an individual file into one or more DAG objects.

The DagFileProcessorManager runs user code. As a result, you can decide to run it as a standalone process on a different host than the scheduler process. If you decide to run it as a standalone process, you need to set the configuration AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True and run the airflow dag-processor CLI command; otherwise, starting the scheduler process (airflow scheduler) also starts the DagFileProcessorManager.

[Figure: DAG file processing diagram]

Monitoring files and triggering a DAG

https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/example_dags/example_sensors.html

https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/sensors/filesystem.html

https://airflow.apache.org/docs/apache-airflow/2.6.0//_modules/airflow/triggers/file.html

FileSensor

https://stackoverflow.com/questions/65019365/triggering-an-airflow-dag-based-on-filesystem-changes

from datetime import datetime

import psycopg2

from airflow import DAG
# Airflow 2.x import paths, replacing the deprecated airflow.contrib /
# airflow.operators.* paths used in the original answer.
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.sensors.filesystem import FileSensor

with DAG('Write_data_to_PG', description='This DAG is for writing data to postgres.',
         schedule_interval='*/5 * * * *',
         start_date=datetime(2018, 11, 1), catchup=False) as dag:
    # Create the target table (uses the default Postgres connection).
    create_table = PostgresOperator(
        task_id='create_table',
        sql="""CREATE TABLE users(
            id integer PRIMARY KEY,
            email text,
            name text,
            address text
        )
        """,
    )


    def my_func():
        # Load the CSV into the users table once the sensor has seen the file.
        print('Loading data into the database.')
        conn = psycopg2.connect("host=localhost dbname=testdb user=testuser")
        print(conn)

        cur = conn.cursor()
        print(cur)

        with open('test.csv', 'r') as f:
            next(f)  # Skip the header row.
            cur.copy_from(f, 'users', sep=',')

        conn.commit()
        print(conn)
        print('DONE!')


    # Wait until test.csv appears under the path configured in the
    # 'my_file_system' filesystem connection, checking every 10 seconds.
    file_sensing_task = FileSensor(task_id='sense_the_csv',
                                   filepath='test.csv',
                                   fs_conn_id='my_file_system',
                                   poke_interval=10)

    python_task = PythonOperator(task_id='populate_data', python_callable=my_func)

    create_table >> file_sensing_task >> python_task

https://www.restack.io/docs/airflow-faq-howto-operator-file-01

Understanding and using FileSensor in Apache Airflow - FAQ November 2023

The FAQ covers the FileSensor's purpose and usage, the default 'fs_default' connection, how to define connections and pass a connection id via the 'fs_conn_id' parameter, detecting files on the local filesystem, examples, and the sensor's limitations.
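
As a sketch of the connection handling described above (the DAG id, path, and timeout are made-up values, and it assumes the usual behaviour of the Filesystem connection type, where the connection's extra "path" acts as a base directory for relative filepaths):

from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_incoming_csv",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # With a custom connection whose extra is {"path": "/data"}, a relative
    # filepath would be resolved against /data; with the built-in fs_default
    # connection the filepath is used as given.
    wait_for_csv = FileSensor(
        task_id="wait_for_csv",
        fs_conn_id="fs_default",
        filepath="/data/incoming/test.csv",   # hypothetical path
        poke_interval=30,
        timeout=60 * 60,                      # fail the task after one hour
        # On Airflow 2.6+ the sensor can reportedly also be deferred
        # (deferrable=True), which uses the FileTrigger linked above.
    )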

https://www.qubole.com/blog/understand-apache-airflows-modular-architecture

Airflow Scheduler:

The scheduler is at the core of Airflow and manages everything related to DAG runs and task runs, including parsing and storing DAGs, while also taking care of other aspects such as worker pool management, SLAs, and more.

The scheduler is a multi-threaded Python process that continuously checks and parses all the code present in the Airflow DAGs folder. Based on the configuration, each DAG gets a number of processes or pools on which it can run.

Note: The Scheduler parses all DAG files every few minutes, at an interval controlled by the scheduler's parsing-interval setting. This means all top-level code (code written outside methods/classes/operators, in the global scope) in a DAG file runs every time the Scheduler parses the file, which slows down DAG parsing and increases memory and CPU usage. To avoid this, move operations into operators (for example, pass methods to a PythonOperator or define classes) instead of writing code in the global scope unless absolutely necessary, as in the sketch below.
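
A minimal sketch of that advice (the DAG, function, and sleep are stand-ins of my own): the same expensive call is harmless inside the task callable but would run on every Scheduler parse if left at module level.

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def expensive_call():
    # Stand-in for a slow import, API call, or database query.
    time.sleep(5)
    return list(range(10))


# BAD: top-level call; it would run every time the Scheduler parses this file.
# rows = expensive_call()


def load_rows():
    # GOOD: the expensive work runs only when the task itself is executed.
    rows = expensive_call()
    print(f"loaded {len(rows)} rows")


with DAG("parse_friendly_dag", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="load_rows", python_callable=load_rows)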

The Scheduler then decides which DAG should run on which pool and, based on the Executor configuration, delegates the actual task run to the Executor. It then keeps watch on which tasks are running and which tasks are up for execution or retries. It also updates all state transitions in the metadata database.

Over the years Airflow’s Scheduler has matured a lot, improving stability and error handling. In the past, there was a need to restart the Scheduler process every few hours, or sometimes it even went into a zombie state where the process was actually running but not processing anything. These issues have been fixed over the years and since version 1.10.0, the Scheduler is a very stable process that can run for days or months without crashing or the need to restart it manually.

Airflow Executor:

The executor in Apache Airflow is the actual entity that runs the tasks. There are various types of Executors in Airflow and any one of them can be selected using the configuration file based on requirements for parallel processing. Let’s dive into some commonly used executors in Airflow:

    • SequentialExecutor will only run one task instance at a time. This type of Executor is suited only for debugging or testing a DAG locally before pushing it to a test environment. It is also the only executor that can be used with SQLite, since SQLite doesn't support multiple connections.
    • LocalExecutor runs tasks by spawning processes in a controlled fashion in different modes. Ideally, the number of tasks a LocalExecutor can run is unlimited if the parallelism parameter is set to 0, but in practice it is limited by the configuration of the machine running the tasks and the memory and CPU available. Arguably, SequentialExecutor could be thought of as a LocalExecutor with parallelism limited to just one worker, i.e. self.parallelism = 1. This option could lead to unifying the locally-running executor implementations into just one LocalExecutor with multiple modes. Again, this type of Executor is suitable only for running small workloads on a single machine; for higher parallelism we need to be able to distribute tasks to multiple worker processes running on different machines.
    • CeleryExecutor is based on Python Celery, which is widely used to process asynchronous tasks. Celery is an asynchronous task queue/job queue based on distributed message passing. For the CeleryExecutor, one needs to set up a queue (Redis, RabbitMQ, or any other task broker supported by Celery) on which all running Celery workers keep polling for new tasks to run. With the CeleryExecutor, the Scheduler adds all tasks to the task queue that we configure. From the queue, a Celery worker picks up the task and runs it. After the execution is complete, the worker reports the status of the task in the database. The Scheduler knows from the database when a task has been completed and then runs the next set of tasks or processes alerts based on what is configured in the DAG (see the sketch after this list for how a task is routed to a specific Celery queue).

      [Figure: Flower (the UI for Celery) showing a list of recent task runs]

    • KubernetesExecutor provides a way to run Airflow tasks on Kubernetes: Kubernetes launches a new pod for each task. Kubernetes takes care of the pod lifecycle (as Celery took care of task processing) while the Scheduler keeps polling Kubernetes for task status. Kubernetes provides a native way to run tasks in a queue, and Airflow takes advantage of this mechanism by delegating the actual task processing to Kubernetes. With KubernetesExecutor, each task runs in a new pod within the Kubernetes cluster, which isolates the environment for each task, improves dependency management through Docker images, and allows workers to scale up or down as needed, even all the way down to zero worker processes.
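
As a sketch of the queue routing mentioned in the CeleryExecutor bullet above (the DAG id, queue name, and command are made up; it assumes the CeleryExecutor is configured and a worker has been started listening on that queue, e.g. with airflow celery worker -q reporting):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("celery_queue_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # With the CeleryExecutor, the Scheduler pushes this task onto the
    # 'reporting' Celery queue; only workers subscribed to that queue
    # pick it up, run it, and report the result back to the metadata database.
    BashOperator(
        task_id="build_report",
        bash_command="echo 'building report'",
        queue="reporting",        # hypothetical queue name
    )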

airflow-mlops

https://www.astronomer.io/docs/learn/airflow-mlops

Best practices for orchestrating MLOps pipelines with Airflow

Machine Learning Operations (MLOps) is a broad term encompassing everything needed to run machine learning models in production. MLOps is a rapidly evolving field with many different best practices and behavioral patterns, with Apache Airflow providing tool agnostic orchestration capabilities for all steps.
