前言
- dbt 是 Data Build Tool 的簡稱,目前由 dbt labs 公司負責開發和維護,公司的前身是Fishtown Analytics。
- dbt 主要處理
ETL
中的T
(transform) dbt 主要分為兩部分
dbt Core: cli 命令列工具 dbt Cloud: 雲服務
- dbt Core 的 github 地址是:https://github.com/dbt-labs/d...
- qbit 個人覺得 dbt Core 的官方入門教程不友好,一開始就要求谷歌的 BigQuery 服務、GitHub 賬號等,其實
dbt Core
可以不聯網完全離線使用。 - 本文主要參照:【大資料架構之旅】2 從零起步學 dbt
技術棧
Windows 10
Python 3.8.10
poetry 1.3.1
git 2.35.1.windows.2
PostgreSQL 15.1
poetry 專案配置
.../test_dbt/pyproject.toml
[tool.poetry] name = "test-dbt" version = "0.1.0" description = "" authors = ["qbit <q@bit.cn>"] readme = "README.md" packages = [{include = "test_dbt"}] [[tool.poetry.source]] name = "aliyun" url = "https://mirrors.aliyun.com/pypi/simple/" default = true [tool.poetry.dependencies] python = "^3.8" dbt-core = "~1.3.1" dbt-postgres = "~1.3.1" [build-system] requires = ["poetry-core"] build-backend = "poetry.core.masonry.api"
- 本文中的
poetry 專案資料夾
指.../test_dbt/
- 本文中的
dbt 專案資料夾
指.../test_dbt/dbt_demo/
建立資料庫與初始化專案
- 在 PostgreSQL 裡面新建資料庫
dbt-demo
初始化 dbt 專案,在
.../test_dbt/
目錄下執行以下命令poetry run dbt init dbt_demo
按提示執行完後,
.../test_dbt/
的目錄結構如下$ tree . ├── dbt_demo │ ├── analyses │ ├── dbt_project.yml │ ├── macros │ ├── models │ │ └── example │ │ ├── my_first_dbt_model.sql │ │ ├── my_second_dbt_model.sql │ │ └── schema.yml │ ├── README.md │ ├── seeds │ ├── snapshots │ └── tests ├── logs │ └── dbt.log ├── poetry.lock └── pyproject.toml
在個人使用者目錄下找到檔案
~/.dbt/profiles.yml
,qbit
的絕對路徑是C:\Users\qbit\.dbt\profiles.yml
,將以下內容複製貼上到檔案,中括號裡面的內容按自己的實際情況填寫dbt_demo: outputs: dev: type: postgres threads: 1 host: [host] port: [port] user: [dev_username] pass: [dev_password] dbname: dbt_demo schema: dev_schema prod: type: postgres threads: 1 host: [host] port: [port] user: [prod_username] pass: [prod_password] dbname: dbt_demo schema: prod_schema target: dev
- 下文命令若無特殊說明,都在
.../test_dbt/dbt_demo/
目錄下執行
自帶模型
進入
.../test_dbt/dbt_demo/
目錄後執行以下命令檢查環境資訊$ poetry run dbt debug 06:41:09 Running with dbt=1.3.1 dbt version: 1.3.1 python version: 3.8.10 python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exe os info: Windows-10-10.0.19045-SP0 Using profiles.yml file at C:\Users\qbit\.dbt\profiles.yml Using dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: 192.168.1.52 port: 5432 user: postgres database: dbt_demo schema: dev_schema search_path: None keepalives_idle: 0 sslmode: None Connection test: [OK connection ok] All checks passed!
生成自帶模型
$ poetry run dbt run 06:47:04 Running with dbt=1.3.1 06:47:04 Partial parse save file not found. Starting full parse. 06:47:04 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics 06:47:04 06:47:05 Concurrency: 1 threads (target='dev') 06:47:05 06:47:05 1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN] 06:47:05 1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s] 06:47:05 2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN] 06:47:05 2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s] 06:47:05 06:47:05 Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s). 06:47:05 06:47:05 Completed successfully 06:47:05 06:47:05 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 檢查 PostgreSQL 資料庫,應該可以看到一張名為
my_first_dbt_model
的表,和一個名為my_second_dbt_model
的檢視
外部資料
建立底表
- 從加州大學歐文分校下載共享單車的資料集,將
hour.csv
和day.csv
放到.../test_dbt/dbt_demo/seeds
資料夾下。 在
.../test_dbt/dbt_demo/seeds
目錄下建立檔案bike_share.yml
,內容如下:version: 2 seeds: - name: hour config: column_types: dteday: date - name: day config: column_types: dteday: date
- 這裡指定
date
型別是為了避免dbt
自動推斷型別出錯 - 關於資料列的型別,可參考官方文件 column_types
執行以下命令匯入基礎資料
$ poetry run dbt seed 07:19:26 Running with dbt=1.3.1 07:19:26 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:19:26 07:19:26 Concurrency: 1 threads (target='dev') 07:19:26 07:19:26 1 of 2 START seed file dev_schema.day .......................................... [RUN] 07:19:31 1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s] 07:19:31 2 of 2 START seed file dev_schema.hour ......................................... [RUN] 07:21:24 2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s] 07:21:24 07:21:24 Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s). 07:21:24 07:21:24 Completed successfully 07:21:24 07:21:24 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 檢查 PostgreSQL 資料庫,應該可以看到新建了 2 張表:
hour
和day
自定義模型
- 建立模型目錄
.../test_dbt/dbt_demo/models/bike_share
,下面 sql 檔案中ref('day')
是引用的上面匯入的day
資料庫表 在
.../models/bike_share
目錄內建立weather_trend.sql
,內容如下/* 天氣趨勢表 */ {{ config(materialized='table') }} with weather_trend as ( select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by weathersit )
在
.../models/bike_share
目錄內建立season_trend.sql
,內容如下/* 季節趨勢表 */ {{ config(materialized='table') }} with season_trend as ( select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by season )
在
.../models/bike_share
目錄內建立month_trend.sql
,內容如下/* 月度趨勢表 */ {{ config(materialized='table') }} with month_trend as ( select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by mnth )
在
.../models/bike_share
目錄內建立weekday_trend.sql
,內容如下/* 星期幾趨勢表 */ {{ config(materialized='table') }} with weekday_trend as ( select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by weekday )
執行以下命令生成模型
$ poetry run dbt run 07:43:59 Running with dbt=1.3.1 07:43:59 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:43:59 07:43:59 Concurrency: 1 threads (target='dev') 07:43:59 07:43:59 1 of 6 START sql table model dev_schema.month_trend ............................ [RUN] 07:44:00 1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s] 07:44:00 2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN] 07:44:00 2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s] 07:44:00 3 of 6 START sql table model dev_schema.season_trend ........................... [RUN] 07:44:00 3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s] 07:44:00 4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN] 07:44:00 4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s] 07:44:00 5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN] 07:44:00 5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s] 07:44:00 6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN] 07:44:00 6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s] 07:44:00 07:44:00 Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s). 07:44:00 07:44:00 Completed successfully 07:44:00 07:44:00 Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
文件生成和檢視
執行以下命令生成文件
$ poetry run dbt docs generate 07:49:09 Running with dbt=1.3.1 07:49:09 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:49:09 07:49:09 Concurrency: 1 threads (target='dev') 07:49:09 07:49:09 Done. 07:49:10 Building catalog 07:49:10 Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
執行以下命令,會啟動 http 服務,並開啟預設瀏覽器檢視文件
$ poetry run dbt docs serve 07:49:53 Running with dbt=1.3.1 07:49:53 Serving docs at 0.0.0.0:8080 07:49:53 To access from your browser, navigate to: http://localhost:8080 07:49:53 07:49:53 07:49:53 Press Ctrl+C to exit. 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 - 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 - 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
- 網頁介面
- 點選右下角按鈕檢視血緣關係圖(Lineage Graph)
目錄結構
最後再看一眼目錄結構
$ tree ./dbt_demo/ -L 3 ./dbt_demo/ ├── analyses ├── dbt_packages ├── dbt_project.yml ├── logs │ └── dbt.log ├── macros ├── models │ ├── bike_share │ │ ├── month_trend.sql │ │ ├── season_trend.sql │ │ ├── weather_trend.sql │ │ └── weekday_trend.sql │ └── example │ ├── my_first_dbt_model.sql │ ├── my_second_dbt_model.sql │ └── schema.yml ├── README.md ├── seeds │ ├── bike_share.yml │ ├── day.csv │ └── hour.csv ├── snapshots ├── target │ ├── catalog.json │ ├── compiled │ │ └── dbt_demo │ ├── graph.gpickle │ ├── index.html │ ├── manifest.json │ ├── partial_parse.msgpack │ ├── run │ │ └── dbt_demo │ └── run_results.json └── tests
本文出自 qbit snap