airflow practice

Posted by lightsong on 2024-08-05

airflow_course

https://github.com/fanqingsong/airflow_course

Sample code for Harry's Airflow online training course

You can find the videos on YouTube or Bilibili.

I am working on adding the following:

  1. the slide PDF files (done)
  2. another video about creating custom operators
  3. docker-compose with the CeleryExecutor

Dynamic loading of DAG files

Dynamic code, dynamic generation

https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#registering-dynamic-dags

from datetime import datetime
from airflow.decorators import dag, task

configs = {
    "config1": {"message": "first DAG will receive this message"},
    "config2": {"message": "second DAG will receive this message"},
}

# Each loop iteration defines and registers one DAG whose dag_id is derived
# from the config name (dynamic_generated_dag_config1, ..._config2).
for config_name, config in configs.items():
    dag_id = f"dynamic_generated_dag_{config_name}"

    @dag(dag_id=dag_id, start_date=datetime(2022, 2, 1))
    def dynamic_generated_dag():
        @task
        def print_message(message):
            print(message)

        print_message(config["message"])

    # Calling the decorated function instantiates the DAG so Airflow can
    # discover it when parsing this file.
    dynamic_generated_dag()

Static code, dynamic loading

https://www.cnblogs.com/woshimrf/p/airflow-first-dag.html

Deploying the DAG

Upload the hello.py above to the DAG folder. Airflow automatically detects the file change, parses the .py file, and imports the DAG definition into the database.

Open the Airflow web UI and refresh the page to see our DAG.
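
The linked article's hello.py is not reproduced here; as a rough sketch (the DAG id, schedule, and task below are placeholders of my own), a minimal file dropped into the DAG folder could look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_world",             # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A single task that just echoes a greeting.
    say_hello = BashOperator(task_id="say_hello", bash_command="echo 'hello world'")

No restart is needed after the file lands in the DAG folder; the DAG file processor picks it up on its next parse.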

https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dagfile-processing.html

DAG File Processing refers to the process of turning Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.

There are two primary components involved in DAG file processing. The DagFileProcessorManager is a process executing an infinite loop that determines which files need to be processed, and the DagFileProcessorProcess is a separate process that is started to convert an individual file into one or more DAG objects.

The DagFileProcessorManager runs user code. As a result, you can decide to run it as a standalone process on a different host than the scheduler process. If you decide to run it as a standalone process, you need to set the configuration AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True and run the airflow dag-processor CLI command; otherwise, starting the scheduler process (airflow scheduler) also starts the DagFileProcessorManager.

[Figure: DAG file processing diagram]

Monitoring files and triggering a DAG

https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/example_dags/example_sensors.html

https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/sensors/filesystem.html

https://airflow.apache.org/docs/apache-airflow/2.6.0//_modules/airflow/triggers/file.html

FileSensor

https://stackoverflow.com/questions/65019365/triggering-an-airflow-dag-based-on-filesystem-changes

from datetime import datetime

import psycopg2

from airflow import DAG
# Airflow 2.x import paths, replacing the deprecated airflow.contrib /
# airflow.operators.* paths used in the original answer.
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.sensors.filesystem import FileSensor

with DAG('Write_data_to_PG', description='This DAG is for writing data to postgres.',
         schedule_interval='*/5 * * * *',
         start_date=datetime(2018, 11, 1), catchup=False) as dag:
    # Create the target table (uses the default Postgres connection).
    create_table = PostgresOperator(
        task_id='create_table',
        sql="""CREATE TABLE users(
            id integer PRIMARY KEY,
            email text,
            name text,
            address text
        )
        """,
    )


    def my_func():
        # Load the CSV into the users table once the sensor has seen the file.
        print('Loading data into the database.')
        conn = psycopg2.connect("host=localhost dbname=testdb user=testuser")
        print(conn)

        cur = conn.cursor()
        print(cur)

        with open('test.csv', 'r') as f:
            next(f)  # Skip the header row.
            cur.copy_from(f, 'users', sep=',')

        conn.commit()
        print(conn)
        print('DONE!')


    # Wait until test.csv appears under the path configured in the
    # 'my_file_system' filesystem connection, checking every 10 seconds.
    file_sensing_task = FileSensor(task_id='sense_the_csv',
                                   filepath='test.csv',
                                   fs_conn_id='my_file_system',
                                   poke_interval=10)

    python_task = PythonOperator(task_id='populate_data', python_callable=my_func)

    create_table >> file_sensing_task >> python_task

https://www.restack.io/docs/airflow-faq-howto-operator-file-01

Understanding and using FileSensor in Apache Airflow - FAQ November 2023

The FAQ covers the FileSensor's purpose and usage, the default 'fs_default' connection, how to define connections and pass a connection id via the 'fs_conn_id' parameter, detecting files on the local filesystem, examples, and the sensor's limitations.
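
As a sketch of the connection handling described above (the DAG id, path, and timeout are made-up values, and it assumes the usual behaviour of the Filesystem connection type, where the connection's extra "path" acts as a base directory for relative filepaths):

from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_incoming_csv",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # With a custom connection whose extra is {"path": "/data"}, a relative
    # filepath would be resolved against /data; with the built-in fs_default
    # connection the filepath is used as given.
    wait_for_csv = FileSensor(
        task_id="wait_for_csv",
        fs_conn_id="fs_default",
        filepath="/data/incoming/test.csv",   # hypothetical path
        poke_interval=30,
        timeout=60 * 60,                      # fail the task after one hour
        # On Airflow 2.6+ the sensor can reportedly also be deferred
        # (deferrable=True), which uses the FileTrigger linked above.
    )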

https://www.qubole.com/blog/understand-apache-airflows-modular-architecture

Airflow Scheduler:

The scheduler is at the core of Airflow and manages everything related to DAG runs and task runs, including parsing and storing DAGs, while also taking care of other aspects such as worker pool management, SLAs, and more.

The scheduler is a multi-threaded Python process that continuously checks and parses all the code present in the Airflow DAGs folder. Based on the configuration, each DAG gets a number of processes or pools on which it can run.

Note: The Scheduler parses all DAG files every few minutes, at an interval controlled by the scheduler's parsing-interval setting. This means all top-level code (code written outside methods/classes/operators, in the global scope) in a DAG file runs every time the Scheduler parses the file, which slows down DAG parsing and increases memory and CPU usage. To avoid this, move operations into operators (for example, pass methods to a PythonOperator or define classes) instead of writing code in the global scope unless absolutely necessary, as in the sketch below.
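
A minimal sketch of that advice (the DAG, function, and sleep are stand-ins of my own): the same expensive call is harmless inside the task callable but would run on every Scheduler parse if left at module level.

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def expensive_call():
    # Stand-in for a slow import, API call, or database query.
    time.sleep(5)
    return list(range(10))


# BAD: top-level call; it would run every time the Scheduler parses this file.
# rows = expensive_call()


def load_rows():
    # GOOD: the expensive work runs only when the task itself is executed.
    rows = expensive_call()
    print(f"loaded {len(rows)} rows")


with DAG("parse_friendly_dag", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="load_rows", python_callable=load_rows)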

The Scheduler then decides which DAG should run on which pool and, based on the Executor configuration, delegates the actual task run to the Executor. It then keeps watch on which tasks are running and which tasks are up for execution or retries. It also updates all state transitions in the metadata database.

Over the years Airflow’s Scheduler has matured a lot, improving stability and error handling. In the past, there was a need to restart the Scheduler process every few hours, or sometimes it even went into a zombie state where the process was actually running but not processing anything. These issues have been fixed over the years and since version 1.10.0, the Scheduler is a very stable process that can run for days or months without crashing or the need to restart it manually.

Airflow Executor:

The executor in Apache Airflow is the actual entity that runs the tasks. There are various types of Executors in Airflow and any one of them can be selected using the configuration file based on requirements for parallel processing. Let’s dive into some commonly used executors in Airflow:

    • SequentialExecutor will only run one task instance at a time. This type of Executor is suited only for debugging or testing a DAG locally before pushing it to a test environment. It is also the only executor that can be used with SQLite, since SQLite doesn't support multiple connections.
    • LocalExecutor runs tasks by spawning processes in a controlled fashion in different modes. Ideally, the number of tasks a LocalExecutor can run is unlimited if the parallelism parameter is set to 0, but in practice it is limited by the configuration of the machine running the tasks and the memory and CPU available. Arguably, SequentialExecutor could be thought of as a LocalExecutor with parallelism limited to just one worker, i.e. self.parallelism = 1. This option could lead to unifying the locally-running executor implementations into just one LocalExecutor with multiple modes. Again, this type of Executor is suitable only for running small workloads on a single machine; for higher parallelism we need to be able to distribute tasks to multiple worker processes running on different machines.
    • CeleryExecutor is based on Python Celery, which is widely used to process asynchronous tasks. Celery is an asynchronous task queue/job queue based on distributed message passing. For the CeleryExecutor, one needs to set up a queue (Redis, RabbitMQ, or any other task broker supported by Celery) on which all running Celery workers keep polling for new tasks to run. With the CeleryExecutor, the Scheduler adds all tasks to the task queue that we configure. From the queue, a Celery worker picks up the task and runs it. After the execution is complete, the worker reports the status of the task in the database. The Scheduler knows from the database when a task has been completed and then runs the next set of tasks or processes alerts based on what is configured in the DAG (see the sketch after this list for how a task is routed to a specific Celery queue).

      [Figure: Flower (the UI for Celery) showing a list of recent task runs]

    • KubernetesExecutor provides a way to run Airflow tasks on Kubernetes: Kubernetes launches a new pod for each task. Kubernetes takes care of the pod lifecycle (as Celery took care of task processing) while the Scheduler keeps polling Kubernetes for task status. Kubernetes provides a native way to run tasks in a queue, and Airflow takes advantage of this mechanism by delegating the actual task processing to Kubernetes. With KubernetesExecutor, each task runs in a new pod within the Kubernetes cluster, which isolates the environment for each task, improves dependency management through Docker images, and allows workers to scale up or down as needed, even all the way down to zero worker processes.
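
As a sketch of the queue routing mentioned in the CeleryExecutor bullet above (the DAG id, queue name, and command are made up; it assumes the CeleryExecutor is configured and a worker has been started listening on that queue, e.g. with airflow celery worker -q reporting):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("celery_queue_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # With the CeleryExecutor, the Scheduler pushes this task onto the
    # 'reporting' Celery queue; only workers subscribed to that queue
    # pick it up, run it, and report the result back to the metadata database.
    BashOperator(
        task_id="build_report",
        bash_command="echo 'building report'",
        queue="reporting",        # hypothetical queue name
    )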

airflow-mlops

https://www.astronomer.io/docs/learn/airflow-mlops

Best practices for orchestrating MLOps pipelines with Airflow

Machine Learning Operations (MLOps) is a broad term encompassing everything needed to run machine learning models in production. MLOps is a rapidly evolving field with many different best practices and behavioral patterns, with Apache Airflow providing tool agnostic orchestration capabilities for all steps.
