Causal Inference Theory Study Series - Tree Based - Causal Forest

Posted by real-zhouyc on 2024-04-18

Original URL: https://www.cnblogs.com/zhouyc/p/18144726

Generalized Random Forests

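Before looking at causal forest, we first need to understand the forest it is built on: Generalized Random Forests [6] (GRF).
GRF is a generalization of random forests. A classic random forest can only estimate the label Y and cannot be used for more complex targets such as a causal effect, so the authors behind Causal Tree and Causal Forest extended it. First, define the target parameter through a moment condition: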

\[\tag{1} \mathbb E\left[\psi_{\theta(x), \upsilon(x)}(O_i)\mid X=x\right]=0 \]

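Here $\psi$ is the score function, i.e. the metric being measured; $\theta(x)$ is the parameter we actually have to estimate at $x$ (for example a conditional mean or a treatment effect); and $\upsilon(x)$ (written $\nu$ from equation (3) onward) is an optional nuisance parameter. $O_i$ denotes the observed quantities used in the computation, such as the supervision signal: for a response model $O_i=\{Y_i\}$, and for a causal model $O_i=\{Y_i, W_i\}$, where $W$ denotes a treatment.
In practice, the parameters are estimated by minimizing a weighted empirical version of this condition: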

\[\tag{2} \left(\hat \theta(x), \hat \upsilon(x)\right)\in \operatorname{argmin}_{\theta, \upsilon}\left\|\sum_i\alpha_i(x)\psi_{\theta, \upsilon}(O_i)\right\|_2 \]

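Here $\alpha_i(x)$ is the weight given to training sample $i$ when estimating at $x$. It is obtained by averaging per-tree weights; assuming $B$ trees are learned in total: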

\[\alpha_i(x)=\frac{1}{B}\sum_{b=1}^{B}\alpha_{bi}(x) \]

\[\alpha_{bi}(x)=\frac{\mathbf 1(\{X_i\in L_b(x)\})}{|L_b(x)|} \]

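Here $L_b(x)$ is the set of training samples that fall in the same leaf as $x$ in tree $b$. In essence, the weight measures how similar a training sample is to the inference (test) sample: if training sample $X_i$ falls into leaf $L_b(x)$, and we accept that samples within a leaf are homogeneous, then that sample can be regarded as similar to $x$ in tree $b$.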

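By this formula, a large $L_b(x)$ means many training samples ended up in that leaf, i.e. the partition is still coarse, so each of those samples receives only a small weight $1/|L_b(x)|$ from this tree. Conversely, a small leaf gives each of its samples a large weight.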
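The weight formulas above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `forest_weights`, the leaf-id array layout, and the toy numbers are all made up here and do not come from any library.

```python
import numpy as np

def forest_weights(train_leaves, query_leaves):
    """Similarity weights alpha_i(x) of n training samples for one query point x.

    train_leaves: (B, n) array, leaf id of each training sample in each tree.
    query_leaves: (B,) array, leaf id that x falls into in each tree.
    """
    train_leaves = np.asarray(train_leaves)
    query_leaves = np.asarray(query_leaves)
    # same_leaf[b, i] is True when sample i shares a leaf with x in tree b
    same_leaf = train_leaves == query_leaves[:, None]
    # |L_b(x)|; max(..., 1) guards against a leaf with no training samples
    leaf_sizes = np.maximum(same_leaf.sum(axis=1, keepdims=True), 1)
    alpha_b = same_leaf / leaf_sizes   # per-tree weights alpha_bi(x)
    return alpha_b.mean(axis=0)        # average over the B trees

# Toy input (made up): 2 trees, 4 training samples
alpha = forest_weights([[0, 0, 1, 1], [0, 1, 1, 1]], [0, 1])
y = np.array([1.0, 2.0, 3.0, 4.0])
# For the simple score psi = Y - theta, solving (2) gives the weighted mean
theta_hat = float(alpha @ y)
```

In this toy case the weights sum to 1, and the sample that shares a leaf with $x$ in both trees gets the largest weight.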

Splitting criterion framework

For each tree, the parameters at a parent node $P$ are estimated by solving the following problem before the node is split:

\[\tag{3}\left(\hat{\theta}_P, \hat{\nu}_P\right)(\mathcal{J}) \in \operatorname{argmin}_{\theta, \nu}\left\{\left\|\sum_{\left\{i \in \mathcal{J}: X_i \in P\right\}} \psi_{\theta, \nu}\left(O_i\right)\right\|_2\right\} . \]

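Here $\mathcal{J}$ denotes the training set. The split of $P$ into two child nodes $C_1$ and $C_2$ is chosen to minimize the squared error between the estimate and the true value: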

\[\tag{4}\operatorname{err}\left(C_1, C_2\right)=\sum_{j=1,2} \mathbb{P}\left[X \in C_j \mid X \in P\right] \mathbb{E}\left[\left(\hat{\theta}_{C_j}(\mathcal{J})-\theta(X)\right)^2 \mid X \in C_j\right] \]

which amounts to maximizing the heterogeneity between the child nodes:

\[\tag{5}\Delta\left(C_1, C_2\right):=n_{C_1} n_{C_2} / n_P^2\left(\hat{\theta}_{C_1}(\mathcal{J})-\hat{\theta}_{C_2}(\mathcal{J})\right)^2 \]

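However, solving for $\hat\theta_{C_j}$ exactly at every candidate split is expensive, so it is approximated with a gradient step from the parent estimate: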

\[\tag{6}\tilde{\theta}_C=\hat{\theta}_P-\frac{1}{\left|\left\{i: X_i \in C\right\}\right|} \sum_{\left\{i: X_i \in C\right\}} \xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right) \]

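Here $\hat \theta_P$ is obtained from (3), $\xi$ is a vector that picks out the $\theta$ coordinate from $(\theta, \nu)$, and $A_P$ is the gradient of the score function: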

\[\tag{7}A_P=\frac{1}{\left|\left\{i: X_i \in P\right\}\right|} \sum_{\left\{i: X_i \in P\right\}} \nabla \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right), \]

The gradient-based split consists of two steps:

Step 1: the labeling step, which produces pseudo-outcomes

\[\tag{8}\rho_i=-\xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right) \in \mathbb{R}. \]

Step 2: the regression step, which passes these pseudo-outcomes as the signal to the split function. The split is chosen to maximize the following criterion:

\[{\Delta}\left(C_1, C_2\right)=\sum_{j=1}^2 \frac{1}{\left|\left\{i: X_i \in C_j\right\}\right|}\left(\sum_{\left\{i: X_i \in C_j\right\}} \rho_i\right)^2 \]
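The two steps can be sketched for the simplest case. Assume the score is $\psi_\theta(Y_i)=Y_i-\theta$ (plain regression): then $\hat\theta_P$ is the parent mean, $A_P=-1$, and the pseudo-outcome reduces to $\rho_i = Y_i - \bar Y_P$. The function name `best_split` and the toy data are assumptions made for illustration, not part of any GRF implementation.

```python
import numpy as np

def best_split(x, rho):
    """Return (threshold, criterion) of the split on feature x that maximizes
    sum_j (sum of rho in C_j)^2 / |C_j|; (None, -inf) if no split is possible."""
    x = np.asarray(x, dtype=float)
    rho = np.asarray(rho, dtype=float)
    order = np.argsort(x, kind="stable")
    xs, rs = x[order], rho[order]
    n = len(xs)
    n_left = np.arange(1, n)            # left child holds the first k samples
    n_right = n - n_left
    left_sum = np.cumsum(rs)[:-1]
    right_sum = rs.sum() - left_sum
    crit = left_sum**2 / n_left + right_sum**2 / n_right
    # a split is only possible between two distinct feature values
    crit = np.where(xs[:-1] < xs[1:], crit, -np.inf)
    if not np.isfinite(crit).any():
        return None, -np.inf
    k = int(np.argmax(crit))
    return 0.5 * (xs[k] + xs[k + 1]), float(crit[k])

# Toy data (made up): the outcome jumps between x = 2 and x = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 2.0, 2.0])
rho = y - y.mean()                      # Step 1: labeling (regression case)
threshold, value = best_split(x, rho)   # Step 2: regression split on rho
```

For this toy input the chosen threshold falls between 2 and 3, where the pseudo-outcomes change sign.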

The following are some applications of GRF:

Causal Forest

Causal forest uses Causal Tree as its base learner and does not change the estimator in any way.

Just as a single tree evolves into an ensemble, causal forest [7] follows the classic bagging-style random forest and extends one causal tree to many:

\[\hat \tau(x)=\frac{1}{B}\sum_{b=1}^{B} \hat \tau_b(x) \]

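Here each subtree $\hat \tau_b$ is a Causal Tree. One advantage of extending with a random forest is that the causal tree itself needs no modification, which also makes it clearly cheaper than boosting-style GBM.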

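Note that the random forest used here is the generalized random forest introduced above: a classic random forest can only estimate the label Y, not complex targets such as a causal effect, which is why the authors of Causal Tree and Causal Forest extended it.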

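On the implementation side, leaving GRF aside, a single-machine version can directly subclass sklearn's forest classes and override the fit method. A distributed version can build directly on Spark ML's forest.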

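# Excerpt from the fit method of a forest class built on sklearn's forest base class.
# Names not defined here (CausalTreeRegressor, control_name, criterion, groups_cnt,
# n_more_estimators, random_state, n_samples_bootstrap, _joblib_parallel_args,
# _parallel_build_trees) come from the surrounding class and from sklearn internals.
from joblib import Parallel, delayed

# Use a causal tree as the base estimator of the forest.
self._estimator = CausalTreeRegressor(
    control_name=control_name,
    criterion=criterion,
    groups_cnt=groups_cnt,
)

# Create the not-yet-fitted trees.
trees = [
    self._make_estimator(append=False, random_state=random_state)
    for i in range(n_more_estimators)
]

# Fit all trees in parallel.
trees = Parallel(
    n_jobs=self.n_jobs,
    verbose=self.verbose,
    **_joblib_parallel_args,
)(
    delayed(_parallel_build_trees)(
        t,
        self,
        X,
        y,
        sample_weight,
        i,
        len(trees),
        verbose=self.verbose,
        class_weight=self.class_weight,
        n_samples_bootstrap=n_samples_bootstrap,
    )
    for i, t in enumerate(trees)
)

# Register the fitted trees on the forest.
self.estimators_.extend(trees)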
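The excerpt above depends on its surrounding class, so here is a self-contained toy sketch of the same bagging idea, $\hat\tau(x)=\frac{1}{B}\sum_b \hat\tau_b(x)$. It is not the implementation quoted above: each "tree" is replaced by a one-split stump at the median of a random feature, and every name (`leaf_effect`, `fit_stump`, `fit_forest`, `predict_effect`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def leaf_effect(y, w):
    """Difference in mean outcome between treated (w == 1) and control (w == 0)."""
    if (w == 1).sum() == 0 or (w == 0).sum() == 0:
        return 0.0  # assumption: fall back to zero effect when a group is empty
    return y[w == 1].mean() - y[w == 0].mean()

def fit_stump(X, y, w, rng):
    """Toy stand-in for a causal tree: one split at the median of a random feature."""
    j = int(rng.integers(X.shape[1]))
    thr = float(np.median(X[:, j]))
    left = X[:, j] <= thr
    return j, thr, leaf_effect(y[left], w[left]), leaf_effect(y[~left], w[~left])

def fit_forest(X, y, w, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (bagging)
        forest.append(fit_stump(X[idx], y[idx], w[idx], rng))
    return forest

def predict_effect(forest, X):
    """tau_hat(x) = (1/B) * sum_b tau_hat_b(x)."""
    per_tree = [np.where(X[:, j] <= thr, tau_l, tau_r)
                for j, thr, tau_l, tau_r in forest]
    return np.mean(per_tree, axis=0)

# Synthetic data (made up): the treatment effect is 2 when x > 0 and 0 otherwise
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(2000, 1))
w = rng.integers(0, 2, size=2000)
y = X[:, 0] + 2.0 * w * (X[:, 0] > 0) + rng.normal(0, 0.1, size=2000)
forest = fit_forest(X, y, w)
tau_hat = predict_effect(forest, np.array([[-0.5], [0.5]]))
```

With a single feature and randomized treatment, the averaged estimates should land near 0 for the first query point and near 2 for the second.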

CAPE: causal effect estimation for continuous treatments

Conditional Average Partial Effects(CAPE)

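GRF provides a framework: given any score function, it keeps splitting subtrees in the direction that maximizes the heterogeneity between nodes. As with response models, which rely on quantities such as the Gini index or entropy, we need an estimate with which to evaluate the change in the score function before and after a split, and computing that estimate requires an estimand. For a continuous treatment the estimand is defined as: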

\[\theta(x)=\xi^{\top} \operatorname{Var}\left[W_i \mid X_i=x\right]^{-1} \operatorname{Cov}\left[W_i, Y_i \mid X_i=x\right] \]

The estimand takes part in guiding the splits, but in the end the leaf nodes still store the expectation of the outcome.
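For a single scalar treatment, the estimand above can be sketched as a weighted covariance divided by a weighted variance, using similarity weights $\alpha_i(x)$ as the local weights. The function name `local_partial_effect` and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def local_partial_effect(w, y, alpha):
    """Weighted Cov(W, Y) / Var(W) for a scalar treatment W."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / alpha.sum()          # normalize the weights
    w_bar = alpha @ w                    # weighted mean of the treatment
    y_bar = alpha @ y                    # weighted mean of the outcome
    cov_wy = alpha @ ((w - w_bar) * (y - y_bar))
    var_w = alpha @ ((w - w_bar) ** 2)
    return float(cov_wy / var_w)

# Toy data (made up): the outcome is exactly linear in the treatment, slope 2
w = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * w
alpha = np.array([0.4, 0.3, 0.2, 0.1])
theta_hat = local_partial_effect(w, y, alpha)
```

Because the toy outcome is exactly linear in the treatment, the estimate recovers the slope regardless of the weights chosen.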

The motivation here comes from instrumental variables and linear regression:

\[y=f(x)=wx+b \]

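Suppose here that $x$ is the treatment and $y$ is the outcome. The parameter $w$ then simply describes the direct effect of applying the treatment on the outcome. To find it we need a metric for how good a parameter value is, i.e. a loss; as with causal tree, MSE is the usual choice: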

\[L(w, b) = \frac{1}{2}\sum(f(x)-y)^2 \]

To find the optimal $w$ and $b$, take the partial derivatives of the loss and set them to 0:

\[\tag{9}\frac{\partial L}{\partial w}=\sum(f(x)-y)x=\sum(wx+b-y)x=0 \]

\[ \tag{10} \begin{aligned} \frac{\partial L}{\partial b} & = \sum(f(x)-y)=\sum(wx+b-y)=0 \\ & \Rightarrow \sum b= \sum y-\sum wx \\ & \Rightarrow b = E(y)-wE(x) = \bar y - w\bar x \end{aligned} \]

Substituting (10) into (9) gives:

\[ \begin{aligned} \frac{\partial L}{\partial w}=0 & \Rightarrow \sum(wx+\bar y-w\bar x-y)x =0 \\ &\Rightarrow w=\frac{\sum xy-\bar y\sum x}{\sum x^2-\bar x\sum x} \\ &\Rightarrow w=\frac{\sum(x-\bar x)(y-\bar y)}{\sum(x-\bar x)^2}\\ &\Rightarrow w=\frac{Cov(x,y)}{Var(x)} \end{aligned} \]

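So the parameter $w$ simplifies to the covariance of treatment and outcome divided by the variance of the treatment, which is the same form as the estimand $\theta(x)$ above. As for $\xi$, it only selects the coordinate (or contrast) of interest from the resulting vector; with a single scalar treatment it reduces to 1 and drops out.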
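The result $w = Cov(x,y)/Var(x)$ can be checked numerically against an ordinary least-squares fit. This is a minimal sketch on synthetic data; the true slope 3, intercept 1, and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                               # treatment
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=200)    # outcome

# Least-squares line y = w * x + b
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)

# Closed form from the derivation: w = Cov(x, y) / Var(x), b = mean(y) - w * mean(x)
slope_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept_cov = y.mean() - slope_cov * x.mean()
```

Both routes give the same slope and intercept up to floating-point error, since they solve the same first-order conditions.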

refs

[1] https://hwcoder.top/Uplift-1
[2] Tool: scikit-uplift
[3] Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning
[4] Athey, Susan, and Guido Imbens. "Recursive partitioning for heterogeneous causal effects." Proceedings of the National Academy of Sciences 113.27 (2016): 7353-7360.
[5] https://zhuanlan.zhihu.com/p/115223013
[6] Athey, Susan, Julie Tibshirani, and Stefan Wager. "Generalized random forests." The Annals of Statistics 47.2 (2019): 1148-1178.
[7] Wager, Stefan, and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association 113.523 (2018): 1228-1242.
[8] Rzepakowski, Piotr, and Szymon Jaroszewicz. "Decision trees for uplift modeling with single and multiple treatments." Knowledge and Information Systems 32 (2012): 303-327.
[9] Rößler, Jannik, Richard Guse, and Detlef Schoder. "The best of two worlds: using recent advances from uplift modeling and heterogeneous treatment effects to optimize targeting policies." International Conference on Information Systems, 2022.
