poetry 下執行 dbt(qbit)

qbit發表於2022-12-30

前言

  • dbt 是 Data Build Tool 的簡稱,目前由 dbt labs 公司負責開發和維護,公司的前身是Fishtown Analytics。
  • dbt 主要處理 ETL 中的 T(transform)
  • dbt 主要分為兩部分

    dbt Core: cli 命令列工具
    dbt Cloud: 雲服務
  • dbt Core 的 github 地址是:https://github.com/dbt-labs/d...
  • qbit 個人覺得 dbt Core 的官方入門教程不友好,一開始就要求谷歌的 BigQuery 服務、GitHub 賬號等,其實 dbt Core 可以不聯網完全離線使用。
  • 本文主要參照:【大資料架構之旅】2 從零起步學 dbt

技術棧

Windows    10
Python     3.8.10
poetry     1.3.1
git        2.35.1.windows.2
PostgreSQL 15.1

poetry 專案配置

  • .../test_dbt/pyproject.toml

    [tool.poetry]
    name = "test-dbt"
    version = "0.1.0"
    description = ""
    authors = ["qbit <q@bit.cn>"]
    readme = "README.md"
    packages = [{include = "test_dbt"}]
    
    [[tool.poetry.source]]
    name = "aliyun"
    url = "https://mirrors.aliyun.com/pypi/simple/"
    default = true
    
    [tool.poetry.dependencies]
    python = "^3.8"
    dbt-core = "~1.3.1"
    dbt-postgres = "~1.3.1"
    
    [build-system]
    requires = ["poetry-core"]
    build-backend = "poetry.core.masonry.api"
  • 本文中的 poetry 專案資料夾.../test_dbt/
  • 本文中的 dbt 專案資料夾.../test_dbt/dbt_demo/

建立資料庫與初始化專案

  • 在 PostgreSQL 裡面新建資料庫 dbt-demo
  • 初始化 dbt 專案,在 .../test_dbt/ 目錄下執行以下命令

    poetry run dbt init dbt_demo
  • 按提示執行完後,.../test_dbt/ 的目錄結構如下

    $ tree
    .
    ├── dbt_demo
    │   ├── analyses
    │   ├── dbt_project.yml
    │   ├── macros
    │   ├── models
    │   │   └── example
    │   │       ├── my_first_dbt_model.sql
    │   │       ├── my_second_dbt_model.sql
    │   │       └── schema.yml
    │   ├── README.md
    │   ├── seeds
    │   ├── snapshots
    │   └── tests
    ├── logs
    │   └── dbt.log
    ├── poetry.lock
    └── pyproject.toml
  • 在個人使用者目錄下找到檔案 ~/.dbt/profiles.ymlqbit 的絕對路徑是 C:\Users\qbit\.dbt\profiles.yml,將以下內容複製貼上到檔案,中括號裡面的內容按自己的實際情況填寫

    dbt_demo:
    outputs:
    
      dev:
        type: postgres
        threads: 1
        host: [host]
        port: [port]
        user: [dev_username]
        pass: [dev_password]
        dbname: dbt_demo
        schema: dev_schema
    
      prod:
        type: postgres
        threads: 1
        host: [host]
        port: [port]
        user: [prod_username]
        pass: [prod_password]
        dbname: dbt_demo
        schema: prod_schema
    
    target: dev
  • 下文命令若無特殊說明,都在 .../test_dbt/dbt_demo/ 目錄下執行

自帶模型

  • 進入 .../test_dbt/dbt_demo/ 目錄後執行以下命令檢查環境資訊

    $ poetry run dbt debug
    06:41:09  Running with dbt=1.3.1
    dbt version: 1.3.1
    python version: 3.8.10
    python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exe
    os info: Windows-10-10.0.19045-SP0
    Using profiles.yml file at C:\Users\qbit\.dbt\profiles.yml
    Using dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.yml
    
    Configuration:
    profiles.yml file [OK found and valid]
    dbt_project.yml file [OK found and valid]
    
    Required dependencies:
     - git [OK found]
    
    Connection:
    host: 192.168.1.52
    port: 5432
    user: postgres
    database: dbt_demo
    schema: dev_schema
    search_path: None
    keepalives_idle: 0
    sslmode: None
    Connection test: [OK connection ok]
    
    All checks passed!
  • 生成自帶模型

    $ poetry run dbt run
    06:47:04  Running with dbt=1.3.1
    06:47:04  Partial parse save file not found. Starting full parse.
    06:47:04  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
    06:47:04
    06:47:05  Concurrency: 1 threads (target='dev')
    06:47:05
    06:47:05  1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]
    06:47:05  1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s]
    06:47:05  2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]
    06:47:05  2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s]
    06:47:05
    06:47:05  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s).
    06:47:05
    06:47:05  Completed successfully
    06:47:05
    06:47:05  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
  • 檢查 PostgreSQL 資料庫,應該可以看到一張名為 my_first_dbt_model 的表,和一個名為 my_second_dbt_model 的檢視

外部資料

建立底表

  • 加州大學歐文分校下載共享單車的資料集,將 hour.csvday.csv 放到 .../test_dbt/dbt_demo/seeds 資料夾下。
  • .../test_dbt/dbt_demo/seeds 目錄下建立檔案 bike_share.yml,內容如下:

    version: 2
    seeds:
    - name: hour
      config:
        column_types: 
          dteday: date
    - name: day
      config:
        column_types:
          dteday: date
  • 這裡指定 date 型別是為了避免 dbt 自動推斷型別出錯
  • 關於資料列的型別,可參考官方文件 column_types
  • 執行以下命令匯入基礎資料

    $ poetry run dbt seed
    07:19:26  Running with dbt=1.3.1
    07:19:26  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
    07:19:26
    07:19:26  Concurrency: 1 threads (target='dev')
    07:19:26
    07:19:26  1 of 2 START seed file dev_schema.day .......................................... [RUN]
    07:19:31  1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s]
    07:19:31  2 of 2 START seed file dev_schema.hour ......................................... [RUN]
    07:21:24  2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s]
    07:21:24
    07:21:24  Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s).
    07:21:24
    07:21:24  Completed successfully
    07:21:24
    07:21:24  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
  • 檢查 PostgreSQL 資料庫,應該可以看到新建了 2 張表:hourday

自定義模型

  • 建立模型目錄 .../test_dbt/dbt_demo/models/bike_share,下面 sql 檔案中 ref('day') 是引用的上面匯入的 day資料庫表
  • .../models/bike_share 目錄內建立 weather_trend.sql,內容如下

    /* 天氣趨勢表 */
    
    {{ config(materialized='table') }}
    
    with weather_trend as (
      select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by weathersit
    )
  • .../models/bike_share 目錄內建立 season_trend.sql,內容如下

    /* 季節趨勢表 */
    
    {{ config(materialized='table') }}
    
    with season_trend as (
      select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by season
    )
  • .../models/bike_share 目錄內建立 month_trend.sql,內容如下

    /* 月度趨勢表 */
    
    {{ config(materialized='table') }}
    
    with month_trend as (
      select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by mnth
    )
  • .../models/bike_share 目錄內建立 weekday_trend.sql,內容如下

    /* 星期幾趨勢表 */
    
    {{ config(materialized='table') }}
    
    with weekday_trend as (
      select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by weekday
    )
  • 執行以下命令生成模型

    $ poetry run dbt run
    07:43:59  Running with dbt=1.3.1
    07:43:59  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
    07:43:59
    07:43:59  Concurrency: 1 threads (target='dev')
    07:43:59
    07:43:59  1 of 6 START sql table model dev_schema.month_trend ............................ [RUN]
    07:44:00  1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s]
    07:44:00  2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]
    07:44:00  2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s]
    07:44:00  3 of 6 START sql table model dev_schema.season_trend ........................... [RUN]
    07:44:00  3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s]
    07:44:00  4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN]
    07:44:00  4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s]
    07:44:00  5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN]
    07:44:00  5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s]
    07:44:00  6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]
    07:44:00  6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s]
    07:44:00
    07:44:00  Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s).
    07:44:00
    07:44:00  Completed successfully
    07:44:00
    07:44:00  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

文件生成和檢視

  • 執行以下命令生成文件

    $ poetry run dbt docs generate
    07:49:09  Running with dbt=1.3.1
    07:49:09  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
    07:49:09
    07:49:09  Concurrency: 1 threads (target='dev')
    07:49:09
    07:49:09  Done.
    07:49:10  Building catalog
    07:49:10  Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
  • 執行以下命令,會啟動 http 服務,並開啟預設瀏覽器檢視文件

    $ poetry run dbt docs serve
    07:49:53  Running with dbt=1.3.1
    07:49:53  Serving docs at 0.0.0.0:8080
    07:49:53  To access from your browser, navigate to:  http://localhost:8080
    07:49:53
    07:49:53
    07:49:53  Press Ctrl+C to exit.
    127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 -
    127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 -
    127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
  • 網頁介面
    image.png
  • 點選右下角按鈕檢視血緣關係圖(Lineage Graph)
    image.png

目錄結構

  • 最後再看一眼目錄結構

    $ tree ./dbt_demo/ -L 3
    ./dbt_demo/
    ├── analyses
    ├── dbt_packages
    ├── dbt_project.yml
    ├── logs
    │   └── dbt.log
    ├── macros
    ├── models
    │   ├── bike_share
    │   │   ├── month_trend.sql
    │   │   ├── season_trend.sql
    │   │   ├── weather_trend.sql
    │   │   └── weekday_trend.sql
    │   └── example
    │       ├── my_first_dbt_model.sql
    │       ├── my_second_dbt_model.sql
    │       └── schema.yml
    ├── README.md
    ├── seeds
    │   ├── bike_share.yml
    │   ├── day.csv
    │   └── hour.csv
    ├── snapshots
    ├── target
    │   ├── catalog.json
    │   ├── compiled
    │   │   └── dbt_demo
    │   ├── graph.gpickle
    │   ├── index.html
    │   ├── manifest.json
    │   ├── partial_parse.msgpack
    │   ├── run
    │   │   └── dbt_demo
    │   └── run_results.json
    └── tests
本文出自 qbit snap

相關文章