Regularization-Based Federated Multi-Task Learning

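1. Introduction

Multi-task learning methods can be roughly divided into two kinds according to how they are implemented. One is neural-network-based multi-task learning [1][2][3][4], which has seen many applications in CV and NLP.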

Figure: Neural-network-based multi-task learning

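Tracing things back to their roots, however, multi-task learning did not start out with neural networks. It began with another, classical approach: regularization-based multi-task learning, which is the main subject of this article. Why should we still study these traditional methods now that deep learning has become mainstream?

For one thing, classical multi-task learning and neural-network-based multi-task learning both rest on the idea of knowledge sharing, and this idea has been widely applied in other fields such as federated learning (see my article "Distributed Multi-Task Learning and Personalized Federated Learning", 《分散式多工學習及聯邦學習個性化》). Moreover, classical regularization-based multi-task learning is easier to distribute, so most federated multi-task learning papers in fact draw their inspiration from the classical methods. My own research area is federated learning, so the rest of this article focuses on classical multi-task learning.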

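2. Introduction to Multi-Task Learning

2.1 Multi-task learning: an extension of transfer learning and knowledge representation

Multi-task learning (MTL) has been widely applied in recent years in CV, NLP, recommender systems, and other fields. Like transfer learning, multi-task learning relies on knowledge transfer, that is, on generalizing knowledge across tasks. The two differ as follows:
- Transfer learning may have multiple source domains, whereas multi-task learning has no source domain, only multiple target domains.
- Transfer learning focuses on improving performance on the target task and does not care about performance on the source tasks (knowledge flows from the source tasks to the target task).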

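The figure below illustrates the difference between transfer learning and multi-task learning from the perspective of knowledge flow:

Figure: Transfer learning and multi-task learning

Loosely speaking, the goal of multi-task learning is to exploit the useful information contained in multiple related learning tasks so as to learn all of these tasks jointly.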

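2.2 The two main ways multi-task learning is implemented today

According to how the data is collected, current multi-task learning can be roughly divided into two kinds: centralized computation and distributed computation. See my article "Distributing Multi-Task Learning and Federated Learning" (《多工學習分散式化及聯邦學習》).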

3. Regularization-Based Multi-Task Learning

3.1 Formal statement of regularization-based multi-task learning

Formally, given

(here

However, if we directly take the sum of the per-task loss functions

Here
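
For orientation, a generic form of the regularized multi-task objective that is common in the literature can be written as below. This is a sketch under assumed notation, not necessarily the exact formulation intended in this section: the number of tasks $T$, the per-task sample count $m_t$, the parameter matrix $W$ with one column $\mathbf{w}_t$ per task, the loss $\mathcal{L}$, the regularizer $g$, and the weight $\lambda$ are all symbols introduced here for illustration.

```latex
\min_{W}\; \sum_{t=1}^{T} \frac{1}{m_t} \sum_{i=1}^{m_t}
  \mathcal{L}\big(y_{ti},\, f(\mathbf{x}_{ti}; \mathbf{w}_t)\big)
  \;+\; \lambda\, g(W),
\qquad
W = [\mathbf{w}_1, \dots, \mathbf{w}_T] \in \mathbb{R}^{d \times T}
```

Without the coupling term $g(W)$, the sum separates across tasks and each $\mathbf{w}_t$ could be fitted on its own; the regularizer is what allows the tasks to share information.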

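3.2 Categories of regularization-based multi-task learning

Regularization-based multi-task learning relies on regularization to capture the relatedness between tasks. It can be roughly divided into feature-based and model-based approaches.

3.2.1 Feature-based multi-task learning

a. Feature transformation

This approach builds a shared feature representation from the original features through a linear or nonlinear transformation. The idea goes back to the seminal multi-task learning paper, which used a multilayer feedforward network (Caruana, 1997) [9], as shown in the figure below: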

Figure: A multilayer feedforward network producing a shared feature representation

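In this example, all tasks are assumed to share the same input. The hidden-layer outputs of the multilayer feedforward network are treated as the feature representation shared by all tasks, and the outputs of the output layer are treated as the predictions for the individual tasks

Here

In this case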

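b. Joint feature learning

A shared subset of the original features is obtained through feature selection and used as the shared feature representation. A common approach is to take the parameter matrix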
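
One regularizer often used for this kind of shared feature selection in the multi-task feature selection literature is the l2,1 norm of the parameter matrix (the sum of the Euclidean norms of its rows), which pushes whole rows to zero so that all tasks drop the same features. The sketch below assumes that choice, which may or may not be the exact regularizer intended here; the function names and the matrix layout (rows are features, columns are tasks) are illustrative.

```python
import math


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (rows = features, columns = tasks)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


def prox_l21(W, threshold):
    """Row-wise group soft-thresholding: the proximal operator of threshold * ||W||_{2,1}.

    Each row is shrunk toward zero by `threshold` in Euclidean norm; rows whose
    norm is at most `threshold` become exactly zero, which removes that feature
    for every task at once.
    """
    result = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        result.append([scale * v for v in row])
    return result


def selected_features(W, tol=1e-12):
    """Indices of features (rows) that remain non-zero for at least one task."""
    return [j for j, row in enumerate(W) if any(abs(v) > tol for v in row)]


# Hypothetical 3-feature, 2-task parameter matrix.
W = [[3.0, 4.0],   # row norm 5.0: kept, but shrunk
     [0.3, 0.4],   # row norm 0.5: below the threshold, zeroed
     [0.0, 2.0]]   # row norm 2.0: kept, but shrunk
W_new = prox_l21(W, threshold=1.0)
print(selected_features(W_new))
```

In a proximal gradient loop, `prox_l21` would be applied after each gradient step on the data-fitting loss, with `threshold` equal to the step size times the regularization weight.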

3.2.2 Model-based multi-task supervised learning

a. Shared subspace learning

This approach assumes that the parameter matrix

Chen et al. (2009) [18], by imposing on

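The nuclear norm is a tight convex relaxation of the rank function (Fazel et al., 2001) [19], and problems regularized by it can be solved with proximal gradient descent.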
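
As a sketch of how proximal gradient descent handles a nuclear-norm regularizer, assume the objective is $F(W) + \lambda \|W\|_*$ with $F$ a smooth loss. The symbols $F$, $\lambda$, the step size $\eta$, and the iteration index $k$ are notation introduced here. Each iteration takes a gradient step on $F$ and then soft-thresholds the singular values of the result:

```latex
\begin{aligned}
Z^{(k)} &= W^{(k)} - \eta\, \nabla F\big(W^{(k)}\big), \\
Z^{(k)} &= U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^{\top}
  \quad \text{(singular value decomposition)}, \\
W^{(k+1)} &= \operatorname{prox}_{\eta\lambda \|\cdot\|_*}\big(Z^{(k)}\big)
  = U \,\mathrm{diag}\big(\max(\sigma_i - \eta\lambda,\, 0)\big)\, V^{\top}.
\end{aligned}
```

Singular values at or below $\eta\lambda$ are set to zero, so the iterates tend toward low rank, which matches the assumption that the task parameters lie in a shared low-dimensional subspace.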

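b. Clustering methods

This approach is inspired by clustering. The basic idea is to divide the tasks into several clusters such that tasks within a cluster are more similar in their model parameters.
Thrun and O'Sullivan (1996) [20] proposed the first task clustering algorithm, which has two stages. In the first stage, tasks are clustered according to the models learned separately in the single-task setting, which determines the task clusters. In the second stage, all training data from the tasks in the same cluster is pooled to learn the models for those tasks. Because task clustering and parameter learning are split into two stages, the result may not be optimal, so later work learns the task clusters and the model parameters simultaneously.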
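
The two-stage procedure can be sketched as follows. This only illustrates the two-stage idea and is not the TC algorithm of [20] itself: it assumes one-dimensional linear tasks y ≈ w·x, clusters tasks by the distance between their individually fitted slopes using a hand-picked threshold `max_gap`, and then refits one slope per cluster on the pooled data. All function names are made up for this example.

```python
def fit_slope(samples):
    """Least-squares slope w for the model y = w * x (no intercept)."""
    sxy = sum(x * y for x, y in samples)
    sxx = sum(x * x for x, _ in samples)
    return sxy / sxx


def cluster_by_slope(slopes, max_gap):
    """Stage 1: group tasks whose single-task slopes are close.

    Tasks are sorted by slope and a new cluster starts whenever the gap to
    the previous slope exceeds max_gap (an assumed, hand-picked threshold).
    """
    order = sorted(range(len(slopes)), key=lambda t: slopes[t])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if slopes[cur] - slopes[prev] > max_gap:
            clusters.append([])
        clusters[-1].append(cur)
    return clusters


def two_stage_task_clustering(tasks, max_gap):
    """Return (clusters, pooled slope per cluster)."""
    slopes = [fit_slope(samples) for samples in tasks]   # single-task models
    clusters = cluster_by_slope(slopes, max_gap)         # stage 1: cluster tasks
    pooled = []
    for members in clusters:                             # stage 2: pool and refit
        data = [pair for t in members for pair in tasks[t]]
        pooled.append(fit_slope(data))
    return clusters, pooled


# Four toy tasks: two with slope near 1, two with slope near 3.
tasks = [
    [(1.0, 1.0), (2.0, 2.2)],
    [(1.0, 0.9), (2.0, 2.0)],
    [(1.0, 3.1), (2.0, 6.0)],
    [(1.0, 2.9), (2.0, 6.1)],
]
clusters, pooled = two_stage_task_clustering(tasks, max_gap=0.5)
print(clusters, pooled)
```

Because the clusters are fixed before the pooled models are fitted, a poor stage-1 grouping cannot be corrected later, which is the weakness noted above.
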
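Bakker and Heskes (2003) [21] proposed a multi-task Bayesian neural network whose structure is the same as the multilayer network shown earlier. Its distinguishing feature is that it groups tasks by placing a Gaussian mixture model on the weights that connect the hidden layer to the output layer. Given the dataset

This is a Gaussian distribution with mean

Here each Gaussian component can be viewed as describing one task cluster. Equation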

Figure: Multi-task Bayesian neural network

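Xue et al. (2007) [22] group tasks by applying a Dirichlet process (a Bayesian model widely used for data clustering) to the model parameters.
Besides methods that rely on Bayesian models, some regularization methods have also been used to group tasks. For example, Jacob et al. (2008) [23] proposed a regularizer that takes both within-cluster and between-cluster differences into account to help learn the task clusters. The objective function is: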

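Here the first regularization term measures the magnitude of the average weight vector over all tasks, and the second captures the differences within task clusters and between clusters.

Kang et al. (2011) [24] take Equation

Here

Kumar and Daumé (2012) [26] and Barzilai and Crammer (2015) [27] both proposed

where

This regularizer tries to assign each task to only a single task cluster. Here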

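In summary, the idea behind clustering methods is as follows: the tasks are divided into separate clusters, each cluster lies in a low-dimensional space, and the tasks in a cluster share one model. The cluster assignment weights and the per-cluster model weights can be learned by alternating iterations. With this approach the tasks are strongly coupled, which makes parallelization very difficult; we will return to this in detail when we discuss how to parallelize clustering-based methods.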
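
The alternating scheme can be sketched as below. This is a minimal hard-assignment variant (similar in spirit to k-means), assuming one-dimensional linear tasks y ≈ w·x and squared loss; the methods cited above use richer models and soft or regularized assignments. The names `alternate_clustering`, `init_slopes`, and `n_rounds` are hypothetical.

```python
def fit_slope(samples):
    """Least-squares slope w for the model y = w * x (no intercept)."""
    sxx = sum(x * x for x, _ in samples)
    return sum(x * y for x, y in samples) / sxx


def task_loss(samples, w):
    """Sum of squared errors of slope w on one task's data."""
    return sum((y - w * x) ** 2 for x, y in samples)


def alternate_clustering(tasks, init_slopes, n_rounds=10):
    """Alternate between assigning tasks to clusters and refitting cluster models."""
    slopes = list(init_slopes)
    assignment = [None] * len(tasks)
    for _ in range(n_rounds):
        # Step 1: with cluster models fixed, give each task to the best-fitting cluster.
        new_assignment = [
            min(range(len(slopes)), key=lambda k: task_loss(samples, slopes[k]))
            for samples in tasks
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the models would not change either
        assignment = new_assignment
        # Step 2: with assignments fixed, refit each cluster model on its pooled data.
        for k in range(len(slopes)):
            pooled = [p for t, samples in enumerate(tasks) if assignment[t] == k
                      for p in samples]
            if pooled:  # keep the old slope if a cluster received no tasks
                slopes[k] = fit_slope(pooled)
    return assignment, slopes


# Four toy tasks: two with slope near 1, two with slope near 3.
tasks = [
    [(1.0, 1.0), (2.0, 2.2)],
    [(1.0, 0.9), (2.0, 2.0)],
    [(1.0, 3.1), (2.0, 6.0)],
    [(1.0, 2.9), (2.0, 6.1)],
]
assignment, slopes = alternate_clustering(tasks, init_slopes=[0.0, 5.0])
print(assignment, slopes)
```

Step 1 can be evaluated per task, but step 2 needs the data of every task in a cluster, which is one way to see the coupling that makes these methods hard to parallelize.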

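This concludes our introduction to regularization-based multi-task learning methods. Later we will look at different ways of running these methods in a distributed, parallel fashion.

References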

[1] Long M, Cao Z, Wang J, et al. Learning multiple tasks with multilinear relationship networks[J]. arXiv preprint arXiv:1506.02117, 2015.
[2] Misra I, Shrivastava A, Gupta A, et al. Cross-stitch networks for multi-task learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 3994-4003.
[3] Hashimoto K, Xiong C, Tsuruoka Y, et al. A joint many-task model: Growing a neural network for multiple nlp tasks[J]. arXiv preprint arXiv:1611.01587, 2016.
[4] Kendall A, Gal Y, Cipolla R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7482-7491.
[5] Evgeniou T, Pontil M. Regularized multi-task learning[C]//Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004: 109-117.
[6] Zhou J, Chen J, Ye J. Malsar: Multi-task learning via structural regularization[J]. Arizona State University, 2011, 21.
[7] Zhou J, Chen J, Ye J. Clustered multi-task learning via alternating structure optimization[J]. Advances in neural information processing systems, 2011, 2011: 702.
[8] Ji S, Ye J. An accelerated gradient method for trace norm minimization[C]//Proceedings of the 26th annual international conference on machine learning. 2009: 457-464.
[9] Caruana R. Multitask learning[J]. Machine learning, 1997, 28(1): 41-75.
[10] Argyriou A, Evgeniou T, Pontil M. Multi-task feature learning[J]. Advances in neural information processing systems, 2007, 19: 41.
[11] Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning[J]. Machine learning, 2008, 73(3): 243-272.
[12] Maurer A, Pontil M, Romera-Paredes B. Sparse coding for multitask and transfer learning[C]//International conference on machine learning. PMLR, 2013: 343-351.
[13] Obozinski G, Taskar B, Jordan M. Multi-task feature selection[J]. Statistics Department, UC Berkeley, Tech. Rep, 2006, 2(2.2): 2.
[14] Obozinski G, Taskar B, Jordan M I. Joint covariate selection and joint subspace selection for multiple classification problems[J]. Statistics and Computing, 2010, 20(2): 231-252.
[15] Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery[C]//Proceedings of the 26th Annual International Conference on Machine Learning. 2009: 649-656.
[16] Gong P, Ye J, Zhang C. Multi-stage multi-task feature learning[J]. arXiv preprint arXiv:1210.5806, 2012.
[17] Ando R K, Zhang T, Bartlett P. A framework for learning predictive structures from multiple tasks and unlabeled data[J]. Journal of Machine Learning Research, 2005, 6(11).
[18] Chen J, Tang L, Liu J, et al. A convex formulation for learning shared structures from multiple tasks[C]//Proceedings of the 26th Annual International Conference on Machine Learning. 2009: 137-144.
[19] Fazel M, Hindi H, Boyd S P. A rank minimization heuristic with application to minimum order system approximation[C]//Proceedings of the 2001 American Control Conference.(Cat. No. 01CH37148). IEEE, 2001, 6: 4734-4739.
[20] Thrun S, O'Sullivan J. Discovering structure in multiple learning tasks: The TC algorithm[C]//ICML. 1996, 96: 489-497.
[21] Bakker B J, Heskes T M. Task clustering and gating for Bayesian multitask learning[J]. Journal of Machine Learning Research, 2003, 4: 83-99.
[22] Xue Y, Liao X, Carin L, et al. Multi-task learning for classification with dirichlet process priors[J]. Journal of Machine Learning Research, 2007, 8(1).
[23] Jacob L, Bach F, Vert J P. Clustered multi-task learning: A convex formulation[J]. Advances in neural information processing systems, 2008, 21.
[24] Kang Z, Grauman K, Sha F. Learning with whom to share in multi-task feature learning[C]//ICML. 2011.
[25] Han L, Zhang Y. Learning multi-level task groups in multi-task learning[C]//Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
[26] Kumar A, Daume III H. Learning task grouping and overlap in multi-task learning[J]. arXiv preprint arXiv:1206.6417, 2012.
[27] Barzilai A, Crammer K. Convex multi-task learning by clustering[C]//Artificial Intelligence and Statistics. PMLR, 2015: 65-73.
[28] Yang Q, et al. Transfer Learning (遷移學習)[M]. China Machine Press (機械工業出版社), 2020.