Paper Reading, Third-Generation GCN: 《Semi-Supervised Classification with Graph Convolutional Networks》

Posted by cute_Learner on 2022-01-21

Paper Information

Title:《Semi-Supervised Classification with Graph Convolutional Networks》
Authors:Thomas Kipf, M. Welling
Source:ICLR 2017
Paper:Download 
Code:Download 


A tribute to Thomas Kipf

    

  I had assumed that the person who popularized GCN must be a senior scholar, since influencing the whole world with a theory surely requires deep reserves of knowledge (just my subjective intuition). It turns out I was completely wrong: the person who made GCN famous is Thomas Kipf, a PhD student who graduated in 2020. Peer pressure from one's own generation...

  His PhD thesis is 《Deep learning with graph-structured representations》.

  His personal homepage: Thomas Kipf

  Summary: comparisons only bring pain; I hope one day I can get there too... haha

  This post mainly covers: fundamentals + second-generation GCN + third-generation GCN.


0 Knowledge review

0.1 Convolution

  Definition of convolution: $(f*g)(t)$ denotes the convolution of $f$ and $g$.

  Continuous form

    $(f * g)(t)=\int_{R} f(x) g(t-x) d x$

  Discrete form

    $(f * g)(t)=\sum\limits _{R} f(x) g(t-x)$

0.2 Fourier transform

  Core idea: express a function as a linear combination of a set of orthogonal basis functions.

  Fourier transform

    $F\{f\}(v)=\int_{R} f(x) e^{-2 \pi i x \cdot v} d x$

  Where

    $e^{-2 \pi i x \cdot v}$ is the Fourier basis function, which is also an eigenfunction of the Laplace operator.

  Inverse Fourier transform

    $F^{-1}\{\hat{f}\}(x)=\int_{R} \hat{f}(v) e^{2 \pi i x \cdot v} d v$

0.3 Relation between the Fourier transform and convolution

  Let $h$ be the convolution of $f$ and $g$; then

    $\begin{aligned}\mathcal{F}\{f * g\}(v)&=\mathcal{F}\{h\}(v)\\&=\int_{R} h(x) e^{-i 2 \pi v x} d x \\&=\int_{R} \int_{R} f(\tau) g(x-\tau) e^{-i 2 \pi v x} d \tau d x \\&=\int_{R} \int_{R}\left[f(\tau) e^{-i 2 \pi v \tau} d \tau\right]    \left[g(x-\tau) e^{-i 2 \pi v(x-\tau)} d x\right] \\&=\int_{R}\left[f(\tau) e^{-i 2 \pi v \tau} d \tau\right] \int_{R}\left[g\left(x^{\prime}\right) e^{-i 2 \pi v x^{\prime}} d x^{\prime}\right]\\&=\mathcal{F}\{f \}(v)\mathcal{F}\{g \}(v)\end{aligned}$

  Applying the inverse transform $F^{-1}$ to both sides gives

    $f * g=\mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}$
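
  A quick numerical check of this convolution theorem (my own sketch, not from the post's references), using NumPy's FFT as the discrete Fourier transform: the circular convolution of two signals equals the inverse transform of the product of their transforms.

```python
import numpy as np

# Two discrete signals (treated as periodic for the circular convolution).
f = np.array([1.0, 2.0, 0.5, -1.0])
g = np.array([0.3, 0.7, 0.0, 0.2])

# Direct circular convolution: (f*g)(t) = sum_x f(x) g(t-x).
n = len(f)
direct = np.array([sum(f[x] * g[(t - x) % n] for x in range(n)) for t in range(n)])

# Via the convolution theorem: F{f*g} = F{f} . F{g}.
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.allclose(direct, via_fft))  # True
```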

0.4 Laplace operator

  Definition: a second-order differential operator in Euclidean space, defined as the divergence of the gradient. It can be written as $\Delta$, $\nabla^{2}$, or $\nabla \cdot \nabla$.

  Example: the one-dimensional case

    $\begin{aligned}\Delta f(x) &=\frac{\partial^{2} f}{\partial x^{2}} \\&=f^{\prime \prime}(x) \\& \approx f^{\prime}(x)-f^{\prime}(x-1) \\& \approx[f(x+1)-f(x)]-[f(x)-f(x-1)] \\&=f(x+1)+f(x-1)-2 f(x)\end{aligned}$

  That is, the Laplace operator gives the total gain obtained from small perturbations along all degrees of freedom.

0.5 Laplacian matrix

  The information gain produced by small perturbations over all nodes is:

    $\Delta f = (D-W) f = L f$

  By analogy, on a graph the Laplace operator can be replaced by the Laplacian matrix $L$.

  Where

    $L$ is the graph Laplacian.

0.6 Spectral decomposition of the Laplacian matrix

    $L \mu_{k}=\lambda_{k} \mu_{k}$

  Since the Laplacian matrix $L$ is real and symmetric,

    $L=U \Lambda U^{-1}=U \Lambda U^{T}$

  Where

    • $\Lambda$ is the diagonal matrix of eigenvalues;
    • $U$ is the orthogonal matrix of eigenvectors.

0.7 Graph Fourier transform

   Applying the Laplace operator $\bigtriangleup$ to the Fourier basis functions gives:

    $\begin{array}{c}\Delta e^{-2 \pi i x \cdot v}=\frac{\partial^{2}}{\partial v^{2}} e^{-2 \pi i x \cdot v}=-{(2 \pi x)}^2 e^{-2 \pi i x \cdot v} \\\Updownarrow \\L \mu_{k}=\lambda_{k} \mu_{k}\end{array}$

  Where

    • The Laplace operator $\bigtriangleup$ and the Laplacian matrix $L$ are "equivalent" (both measure information gain).
    • The orthogonal basis function $e^{-2 \pi i x \cdot v}$ corresponds to the orthogonal eigenvector $\mu_{k}$ of the Laplacian matrix.
    • By the Helmholtz equation $\bigtriangleup f= \nabla^{2} f=-k^{2} f$, $-(2 \pi x)^{2}$ corresponds to $\lambda_{k}$.

  Comparing with the Fourier transform:

    $F\{f\}(v)=\int_{R} f(x) e^{-2 \pi i x \cdot v} d x$

  we can analogously obtain the graph Fourier transform:

    $F\{f\}\left(\lambda_{l}\right)=F\left(\lambda_{l}\right)=\sum \limits _{i=1}^{n} u_{l}^{*}(i) f(i)$

  Where 

    • $\lambda_{l}$ denotes the $l$-th eigenvalue;
    • $n$ denotes the number of vertices in the graph;
    • $f(i)$ is the value of the function at vertex $i$.

  Furthermore, collecting all nodes together:

    • The signal vector

      $f=\left[\begin{array}{c}f(0) \\f(1) \\\cdots \\f(n-1)\end{array}\right]$

    • The matrix formed by the $n$ eigenvectors

      $U^{T}=\left[\begin{array}{c}\overrightarrow{u_{0}} \\\overrightarrow{u_{1}} \\\cdots \\\overrightarrow{u_{n-1}}\end{array}\right]=\left[\begin{array}{cccc}u_{0}^{0} & u_{0}^{1} & \cdots & u_{0}^{n-1} \\u_{1}^{0} & u_{1}^{1} & \cdots & u_{1}^{n-1} \\\cdots & \cdots & \cdots & \cdots \\u_{n-1}^{0} & u_{n-1}^{1} & \cdots & u_{n-1}^{n-1}\end{array}\right]$

      where $\overrightarrow{u_{0}}$ is the eigenvector corresponding to eigenvalue $\lambda_{0}$, and similarly for $\overrightarrow{u_{1}}$, $\overrightarrow{u_{2}}$, $\ldots$

    • The matrix form of the graph Fourier transform:

       $F(\lambda)=\left[\begin{array}{c}\hat{f}\left(\lambda_{0}\right) \\\hat{f}\left(\lambda_{1}\right) \\\cdots \\\hat{f}\left(\lambda_{n-1}\right)\end{array}\right]=\left[\begin{array}{cccc}u_{0}^{0} & u_{0}^{1} & \cdots & u_{0}^{n-1} \\u_{1}^{0} & u_{1}^{1} & \cdots & u_{1}^{n-1} \\\cdots & \cdots & \cdots & \cdots \\u_{n-1}^{0} & u_{n-1}^{1} & \cdots & u_{n-1}^{n-1}\end{array}\right] \cdot\left[\begin{array}{c}f(0) \\f(1) \\\cdots \\f(n-1)\end{array}\right]$

      The common compact form of the transform (used in the derivation below) is:

        $\hat{f}=U^{T} f$

      The inverse graph Fourier transform of $f$:

        $f=U \hat{f}$
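
  As a minimal illustration (my own sketch, assuming a small 3-node path graph), the graph Fourier transform pair above can be computed directly from the eigendecomposition of the graph Laplacian:

```python
import numpy as np

# Adjacency matrix of a small undirected graph (a 3-node path).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                        # combinatorial graph Laplacian

# Spectral decomposition L = U diag(lambda) U^T (eigh: L is real symmetric).
lam, U = np.linalg.eigh(L)

f = np.array([1.0, -2.0, 3.0])   # a signal on the 3 vertices
f_hat = U.T @ f                  # graph Fourier transform: f_hat = U^T f
f_rec = U @ f_hat                # inverse transform: f = U f_hat

print(np.allclose(f_rec, f))     # True
```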

   Now for the main body of the paper.


Abstract

   The convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions.

1 Introduction

  • Target: classify nodes in a graph.
  • Method: a graph-based semi-supervised learning approach.
  • Type of loss function: graph-based regularization, of the following form:

    $\begin {array}{l}\mathcal{L}=\mathcal{L}_{0}+\lambda \mathcal{L}_{\text {reg }}\\\quad \text { with } \quad \mathcal{L}_{\text {reg }}=\sum \limits _{i, j} A_{i j}\left\|f\left(X_{i}\right)-f\left(X_{j}\right)\right\|^{2}=f(X)^{\top} \Delta f(X)\end{array}$

   Where

    • $\mathcal{L}_{0}$ denotes the supervised loss, i.e. the error on the labeled part of the dataset.
    • $\mathcal{L}_{\text {reg }}$ enforces smoothness between nodes: if two nodes are similar (connected with a large weight $A_{ij}$), their values $f\left(X_{i}\right)$ and $f\left(X_{j}\right)$ should be close.
    • $f$ is the label function on the graph;
    • $X$ is a matrix of node feature vectors $X_i$;
    • $\bigtriangleup  = D − A$ is the unnormalized graph Laplacian;
    • Adjacency matrix $A \in \mathbb{R}^{N \times N}$ (binary or weighted; the common case is a weighted adjacency matrix).
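
  As a small check of the identity behind $\mathcal{L}_{\text {reg }}$ (my own sketch, not from the paper): the quadratic form $f^{\top} \Delta f$ with $\Delta=D-A$ equals the weighted sum of squared differences over edges, counting each unordered pair once (summing over all ordered pairs $i, j$ gives twice this value).

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # symmetric adjacency matrix
D = np.diag(A.sum(axis=1))
Delta = D - A                            # unnormalized graph Laplacian

f = np.array([0.2, 1.5, -0.7])           # scalar "label function" values per node

quad_form = f @ Delta @ f
pairwise = sum(A[i, j] * (f[i] - f[j]) ** 2
               for i in range(len(f)) for j in range(i + 1, len(f)))

print(np.isclose(quad_form, pairwise))   # True (ordered-pair sum would be 2x)
```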

2 Fast approximate convolutions on graphs

  The goal of this model is to learn a function of signals/features on a graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$.

  It takes feature matrix $X$ and adjacency matrix $A$ as input:

    • A feature description $x_{i}$ for every node $i$, summarized in an $N \times D$ feature matrix $X$ ($N$: number of nodes, $D$: number of input features)
    • A representative description of the graph structure in matrix form; typically in the form of an adjacency matrix $A$ (or some function thereof)

   It produces a node-level output $Z$ (an $N×F$ feature matrix, where $F$ is the number of output features per node). Graph-level outputs can be modeled by introducing some form of pooling operation.

  Every neural network layer can then be written as a non-linear function

    $H^{(l+1)}=f\left(H^{(l)}, A\right),$

  with $H^{(0)}=X$ and $H^{(L)}=Z$ (or $z$ for graph-level outputs), $L$ being the number of layers. The specific models then differ only in how $f(\cdot, \cdot)$ is chosen and parameterized.

  A simple GCN propagation model:

    

  First, consider the definition of this simple propagation model:

    $f\left(H^{(l)}, A\right)=\sigma\left(A H^{(l)} W^{(l)}\right)$

  In this formula, $AH$ means that every node in the graph aggregates information from its neighbours.

  Where 

    • $W^{(l)}$ is a weight matrix for the $l$-th neural network layer.
    • $\sigma(\cdot)$  is a non-linear activation function like the ReLU. 

  Notice, however, that $AH$ does not take a node's own features into account (since $A_{ii}=0$).

  Improvement 1: add self-connections

  A fix for this problem is to replace $A$ with $\tilde{A}=A+I_{N}$ (adding self-loops),

  with degree matrix $\tilde{D}_{ii}=\sum\limits_{j} \tilde{A}_{i j}$.

  Summary: self-loops are added because a node's own features are often important as well.

  Improvement 2: normalize the adjacency matrix

   Different nodes have different numbers of neighbours and different edge weights. If a node has many neighbours, its feature values after aggregating neighbour information become much larger than those of nodes with few neighbours, so we need normalization. $\tilde{A}$ becomes:

    $\tilde{D}^{-1} \tilde{A}$

  Summary: to prevent nodes with large degrees/edge weights from producing overly large feature values that distort information propagation, we apply normalization.

  Improvement 3: symmetric normalization

  The normalization above only accounts for the degree of the aggregating node $i$; it does not account for the neighbours $j$, i.e. the information propagated by each neighbour $j$ is not normalized. (Here we assume that every node sends out the same total amount of information over its edges, so the more edges a node has, the less information each edge carries, similar to an even split.)

  $\tilde{A}$ then becomes:

    $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$

  In this paper, the following layer-wise propagation rule is used (a small sketch follows after the definitions below):

    $H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$

  Where

    • $\tilde{A}=A+I_{N}$;
    • $\tilde{D}_{i i}=\sum\limits _{j} \tilde{A}_{i j}$;
    • $\sigma(\cdot)$ is  an activation function;
    • $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l$-th layer, and $H^{(0)}=X$.
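
  Below is a minimal NumPy sketch of one such propagation layer (my own illustration, not the authors' reference implementation):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer with self-loops, symmetric normalization and ReLU."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                        # add self-connections
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))   # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # D~^{-1/2} A~ D~^{-1/2}
    return np.maximum(A_hat @ H @ W, 0.0)          # sigma = ReLU

# Toy example: 3-node path graph, 2 input features, 4 hidden units.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.random.randn(3, 2)          # H^{(0)} = X
W0 = np.random.randn(2, 4)         # W^{(0)}
H1 = gcn_layer(A, X, W0)
print(H1.shape)                    # (3, 4)
```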

2.1 Spectral graph convolutions

2.1.1 Foundations of spectral graph convolutions

  We consider spectral convolutions on graphs:

  Let $x$ be the signal on the graph and $g$ the convolution kernel; the graph convolution is then:

    $(g * x)=F^{-1}[F[g] \odot F[x]]$

    $\left(g * x\right)=U\left(U^{T} x \odot U^{T} g\right)=U\left(U^{T} g \odot U^{T} x\right)$

  Treating $U^{T} g$ as a whole as the learnable convolution kernel:

    $U^{T} g=\left[\begin{array}{c}\hat{g}_{\theta}\left(\lambda_{0}\right) \\\hat{g}_{\theta}\left(\lambda_{1}\right) \\\ldots \\\hat{g}_{\theta}\left(\lambda_{n-1}\right)\end{array}\right]$

  where $\theta$ denotes the parameters of $g$.

  We then obtain:

    $\begin{array}{l}\left(U^{T} g\right) \odot\left(U^{T} x\right)&=\left[\begin{array}{c}\hat{g}_{\theta}\left(\lambda_{0}\right) \\\hat{g_{\theta}}\left(\lambda_{1}\right) \\\cdots \\\hat{g_{\theta}}\left(\lambda_{n-1}\right)\end{array}\right] \odot\left[\begin{array}{c}\hat{x}\left(\lambda_{0}\right) \\\hat{x}\left(\lambda_{1}\right) \\\cdots \\\hat{x}\left(\lambda_{n-1}\right)\end{array}\right]\\&=\left[\begin{array}{c}\hat{g}_{\theta}\left(\lambda_{0}\right) \cdot \hat{x}\left(\lambda_{0}\right) \\\hat{g}_{\theta}\left(\lambda_{1}\right) \cdot \hat{x}\left(\lambda_{1}\right) \\\cdots \\\hat{g}_{\theta}\left(\lambda_{n-1}\right) \cdot \hat{x}\left(\lambda_{n-1}\right)\end{array}\right]\\&=\left[\begin{array}{cccc}\hat{g}_{\theta}\left(\lambda_{0}\right) & 0 & \cdots & 0 \\0 & \hat{g}_{\theta}\left(\lambda_{1}\right) & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \hat{g}_{\theta}\left(\lambda_{n-1}\right)\end{array}\right] \cdot\left[\begin{array}{c}\hat{x}\left(\lambda_{0}\right) \\\hat{x}\left(\lambda_{1}\right) \\\cdots \\\hat{x}\left(\lambda_{n-1}\right)\end{array}\right]\\&=g_{\theta}(\Lambda)U^{T} x\end{array}$

  The final convolution formula on the graph is:

    $\begin{array}{l}\left(g * x\right)&=U\left(U^{T} x \odot U^{T} g\right)\\&=U\left(U^{T} g \odot U^{T} x\right)\\&=g_{\theta}\left(U \Lambda U^{T}\right) x\\&=U g_{\theta}(\Lambda) U^{T} x\end{array}$

  Note: the last two steps of the derivation refer to the Courant-Fischer min-max theorem: $\underset{\operatorname{dim}(U)=k}{min} \;\;\underset{x \in U,\|x\|=1}{max}\; x^{H}Ax= \lambda_{k}$.

  Where

    • Symmetric normalized Laplacian: $L^{\text {sym }}=D^{-\frac{1}{2}} L D^{-\frac{1}{2}}=D^{-\frac{1}{2}}(D-A) D^{-\frac{1}{2}}=I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}=U \Lambda U^{T}$;
    • $U$ is the matrix of eigenvectors of the symmetric normalized Laplacian;
    • $\Lambda$ is the diagonal matrix of its eigenvalues;
    • $U^{\top} x$ is the graph Fourier transform of $x$;
    • $g_{\theta}$ can be understood as a function of the eigenvalues of $L$, i.e. $g_{\theta}(\Lambda)$.

  The spectral-domain graph convolution methods introduced next all build on $g_{\theta}(\Lambda)$; the parameters $\theta$ are exactly the convolution kernel parameters the model needs to learn.

    $g_{\theta}(\Lambda)=\left[\begin{array}{cccc}\hat{g}_{\theta}\left(\lambda_{0}\right) & 0 & \cdots & 0 \\0 & \hat{g}_{\theta}\left(\lambda_{1}\right) & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \hat{g}_{\theta}\left(\lambda_{n-1}\right)\end{array}\right]$
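
  As an illustration (my own sketch on a 3-node path graph), spectral filtering $U g_{\theta}(\Lambda) U^{T} x$ can be written out directly, with one free parameter $\hat{g}_{\theta}\left(\lambda_{i}\right)$ per eigenvalue:

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
L_sym = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt   # I - D^{-1/2} A D^{-1/2}

lam, U = np.linalg.eigh(L_sym)                    # L_sym = U diag(lam) U^T

x = np.array([1.0, 0.0, -1.0])                    # graph signal
theta = np.array([0.5, 1.0, -0.3])                # one learnable value per eigenvalue

filtered = U @ np.diag(theta) @ U.T @ x           # U g_theta(Lambda) U^T x
print(filtered)
```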

2.1.2 Improvement 1 :Polynomial parametrization for localized filters

  The second-generation GCN adopts:

    $g_{\theta}(\Lambda)=\sum \limits_{k=0}^{K-1} \theta_{k} \Lambda^{k}$

  which is

     $\hat{g}_{\theta}\left(\lambda_{i}\right)=\sum \limits _{k=0}^{K-1} \theta_{k} \lambda_{i}{ }^{k}$

   Equivalent form

    $g_{\theta}(\Lambda)=\left[\begin{array}{cccc}\sum\limits _{k=0}^{K-1} \theta_{k} \lambda_{0}{ }^{k} & 0 & \ldots & 0 \\0 & \sum\limits_{k=0}^{K-1} \theta_{k} \lambda_{1}{ }^{k} & \ldots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \sum\limits_{k=0}^{K-1} \theta_{k} \lambda_{n-1}{ }^{k}\end{array}\right]=\sum\limits_{k=0}^{K-1} \theta_{k} \Lambda^{k}$

  Note that all eigenvalues in the above formula share the same set of parameters $\theta_{k}$, which achieves parameter sharing.

  Continuing the derivation:

     $\begin{array}{l}U g_{\theta}(\Lambda) U^{T}&=U \sum \limits _{k=0}^{K-1} \theta_{k} \Lambda^{k} U^{T}\\&=\sum \limits_{k=0}^{K-1} \theta_{k} U \Lambda^{k} U^{T}\\&=\sum \limits _{k=0}^{K-1} \theta_{k}L^{k}\end{array}$

   The layer output is then

    $y=\sigma\left(\sum \limits _{k=0}^{K-1} \theta_{k}(L)^{k} x\right)$

   It is easy to see the following (a small numerical check of the polynomial filter is given after this list):

  • The filter has only $K$ parameters, and usually $K$ is much smaller than $n$, so the parameter complexity is greatly reduced.

  • After this matrix transformation, no eigendecomposition of the Laplacian is required; the Laplacian $L$ is used directly. However, computing powers of $L$ still costs $O\left(n^{2}\right)$.
  • The filter has good spatial locality: each convolution computes a weighted sum over the features of the central vertex's $K$-hop neighbours, with weights $\theta_{k}$.
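
  A small numerical check (my own) that the polynomial filter $\sum_{k} \theta_{k} L^{k} x$ matches the spectral form $U\left(\sum_{k} \theta_{k} \Lambda^{k}\right) U^{T} x$, so no eigendecomposition is needed:

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                # graph Laplacian
lam, U = np.linalg.eigh(L)

x = np.array([1.0, 2.0, 3.0])
theta = [0.2, -0.5, 0.1]                      # K = 3 polynomial coefficients

# Spectral form: U (sum_k theta_k Lambda^k) U^T x
spectral = U @ np.diag(sum(t * lam ** k for k, t in enumerate(theta))) @ U.T @ x

# Spatial form: sum_k theta_k L^k x  (no eigendecomposition needed)
spatial = sum(t * np.linalg.matrix_power(L, k) @ x for k, t in enumerate(theta))

print(np.allclose(spectral, spatial))         # True
```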

2.1.3 Improvement 2 : Recursive formulation for fast filtering

  First, recall the Chebyshev polynomial recurrence:

    $\begin{array}{l}T_{0}(x)=1 \\T_{1}(x)=x \\T_{k}(x)=2 x T_{k-1}(x)-T_{k-2}(x)\end{array}$

  The convolution kernel is approximated with Chebyshev polynomials:

    $g_{\theta}(\Lambda)=\sum\limits _{k=0}^{K} \beta_{k} T_{k}(\hat{\Lambda})$

  Where

    • $T_{k}(\cdot)$ denotes the Chebyshev polynomial;
    • $\beta_{k}$ are the parameters the model needs to learn;
    • $\hat{\Lambda}$ is the rescaled diagonal eigenvalue matrix, $\hat{\Lambda}=2 \Lambda / \lambda_{\max }-I$, since the input of the Chebyshev polynomials must lie in $[-1,1]$.

  Then

    $\begin{array}{l} U g_{\theta}(\Lambda) U^{T} x&=U \sum\limits_{k=0}^{K} \beta_{k} T_{k}(\hat{\Lambda}) U^{T} x\\&=\sum\limits_{k=0}^{K} \beta_{k} T_{k}\left(U \hat{\Lambda} U^{T}\right) x\\&=\sum\limits _{k=0}^{K} \beta_{k} T_{k}(\hat{L}) x\end{array}$

   Where

    • $\hat{L}=2 L / \lambda_{\max }-I$

   Remark (used in Section 2.2): $g*x=U g_{\theta}(\Lambda) U^{T} x=\sum\limits_{k=0}^{K} \beta_{k} T_{k}(\hat{L})x$

  In practice, this is usually computed recursively:

      $\begin{array}{l}T_{0}(\hat{L})=I\\ T_{1}(\hat{L})=\hat{L}\\T_{k}(\hat{L})=2 \hat{L} T_{k-1}(\hat{L})-T_{k-2}(\hat{L})\end{array}$

  For example:

    

  From the example graph (a 3-node path), we can read off:

    $A=\left[\begin{array}{ccc}0 & 1 & 0 \\1 & 0 & 1 \\0 & 1 & 0\end{array}\right]\quad D=\left[\begin{array}{ccc}1 & 0 & 0 \\0 & 2 & 0 \\0 & 0& 1\end{array}\right]\quad L=D-A=\left[\begin{array}{ccc}1 & -1 & 0 \\-1 & 2 & -1 \\0 & -1 & 1\end{array}\right]$

  • When $K=0$, the filter is

    $\begin{array}{l} g_{\theta}(\Lambda)&=\beta_{0} * T_{0}(\hat{L})\\&=\beta_{0} * I\\&=\left[\begin{array}{ccc}\beta_{0} & 0 & 0 \\0 & \beta_{0} & 0 \\0 & 0 & \beta_{0}\end{array}\right]\end{array}$ 

    From this we see that when $K=0$, the filter only considers each node's own attributes.

  • When $K=1$, the filter is

    $\begin{array}{l} g_{\theta}(\Lambda)&=\beta_{0} * T_{0}(\hat{L})+\beta_{1} * T_{1}(\hat{L})\\&=\left[\begin{array}{ccc}\beta_{0}+\beta_{1} & -\beta_{1} & 0 \\-\beta_{1} & \beta_{0}+2 \beta_{1} & -\beta_{1} \\0 & -\beta_{1} & \beta_{0}+\beta_{1}\end{array}\right]\end{array}$

    From this we see that when $K=1$, the filter attends to each node itself and its one-hop neighbours.

  • When $K=2$, the Chebyshev polynomial $T_{2}$ is

    $T_{2}(\hat{L})=2 \hat{L} T_{1}(\hat{L})-T_{0}(\hat{L}) $

    and the filter is:

     $\begin{array}{l} g_{\theta}(\Lambda)&=\beta_{0} * T_{0}(\hat{L})+\beta_{1} * T_{1}(\hat{L})+\beta_{2} * T_{2}(\hat{L}) \\&=\left[\begin{array}{ccc}\beta_{0}+\beta_{1}+3 \beta_{2} & -\beta_{1}-6 \beta_{2} & 2 \beta_{2} \\-\beta_{1}-6 \beta_{2} & \beta_{0}+2 \beta_{1}+11 \beta_{2} & -\beta_{1}-6 \beta_{2} \\2 \beta_{2} & -\beta_{1}-6 \beta_{2} & \beta_{0}+\beta_{1}+3 \beta_{2}\end{array}\right]\end{array}$

    From this we see that when $K=2$, the filter attends to each node itself and its one-hop and two-hop neighbours.

  • When $K=N$, the Chebyshev polynomial $T_{N}$ is

    $T_{N}(\hat{L})=2 \hat{L} T_{N-1}(\hat{L})-T_{N-2}(\hat{L})$

     and the filter is

    $g_{\theta}(\Lambda)=\sum\limits _{k=0}^{N} \beta_{k} T_{k}(\hat{\Lambda})$

       By the same reasoning, when $K=N$ the filter attends to each node itself and all nodes within $N$ hops.

   Summary:

    • The complexity of this convolution is $O(K|\varepsilon|)$. Compared with the polynomial filter of the previous subsection, the Chebyshev filter computes $T_{k}(\widetilde{L})$ through a recurrence involving only additions, subtractions and matrix-vector products, which greatly reduces the computational cost. (A small sketch follows below.)
    • Parameter sharing: terms of the same order share the same parameter, while different orders use different parameters.
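
  The following is a sketch (mine, not the authors' code) of the recursive Chebyshev filtering described in this subsection; $T_{k}(\hat{L}) x$ is built with the three-term recurrence, so only (sparse) matrix-vector products are required:

```python
import numpy as np

def chebyshev_filter(L, x, beta, lam_max):
    """Compute sum_k beta_k T_k(L_hat) x via T_k = 2 L_hat T_{k-1} - T_{k-2}."""
    N = L.shape[0]
    L_hat = 2.0 * L / lam_max - np.eye(N)      # rescale eigenvalues into [-1, 1]
    Tx_prev, Tx = x, L_hat @ x                 # T_0(L_hat) x = x,  T_1(L_hat) x = L_hat x
    out = beta[0] * Tx_prev
    if len(beta) > 1:
        out = out + beta[1] * Tx
    for k in range(2, len(beta)):
        Tx_prev, Tx = Tx, 2.0 * L_hat @ Tx - Tx_prev
        out = out + beta[k] * Tx
    return out

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lam_max = np.linalg.eigvalsh(L).max()          # in practice this is estimated
x = np.array([1.0, 0.0, -1.0])
print(chebyshev_filter(L, x, beta=[0.3, 0.5, -0.1], lam_max=lam_max))
```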

2.2 Layer-wise linear model

2.2.1 Application

  Overfitting on local neighborhood structures:

    • Graphs with very wide node degree distributions.

   Recall the Chebyshev filter from Section 2.1.3:

    $g_{\theta}(\Lambda)=\sum\limits _{k=0}^{K} \beta_{k} T_{k}(\hat{\Lambda})$ 

  so the graph convolution is (see the remark in Section 2.1.3):

    $g*x=U g_{\theta}(\Lambda) U^{T} x=\sum\limits_{k=0}^{K} \beta_{k} T_{k}(\hat{L})x$

   Where

    • $\hat{L}=2 L / \lambda_{\max }-I$

  To alleviate the above problem of overfitting on local neighborhood structures, we consider the case $K=1$, i.e. only a node's own attributes and its one-hop neighbours are used.

  The graph convolution $g*x$ then simplifies to:

     $\begin{array}{l} g*x&=\beta_{0} x+\beta_{1} \hat{L}x\\&=\beta_{0} x+\beta_{1}(2 L / \lambda_{\max }-I_{N})x\\&\approx \beta_{0} x+\beta_{1} (L-I_{N})x\\&\approx \beta_{0} x+\beta_{1} (L^{sym}-I_{N})x\\&=\beta_{0} x-\beta_{1} (D^{-\frac{1}{2}} A D^{-\frac{1}{2}})x\end{array}$

2.2.2 Simplification

  Setting $\beta=\beta_{0}=-\beta_{1}$, the graph convolution becomes
    $g*x=\beta( I_{N}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}} )x$

  The eigenvalues of $I_{N}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ now lie in the range $[0,2]$.

   This operator can therefore lead to numerical instabilities and exploding/vanishing gradients when used in a deep neural network model.

  Hence the following renormalization trick is applied:

    $I_{N}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$

  Where

    • $\tilde{A}=A+I_{N}$
    • $\tilde{D}_{i i}=\sum_{j} \tilde{A}_{i j}$

2.2.3 Generalizing the above definition

  Now consider multi-channel input:

  We can generalize this definition to a signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e. a $C$ -dimensional feature vector for every node) and $F$ filters or feature maps as follows:

    $Z=\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta$

  where

    • $\Theta \in \mathbb{R}^{C \times F}$ is now a matrix of filter parameters.
    • $Z \in \mathbb{R}^{N \times F}$ is the convolved signal matrix.

  This filtering operation has complexity $\mathcal{O}(|\mathcal{E}| F C)$ , as $\tilde{A} X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.
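
  A sketch of this generalized filtering (my own, using SciPy sparse matrices rather than the authors' TensorFlow implementation); the sparse-dense product keeps the cost at $\mathcal{O}(|\mathcal{E}| F C)$:

```python
import numpy as np
import scipy.sparse as sp

def normalized_adj(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} as a sparse matrix."""
    A_tilde = A + sp.eye(A.shape[0], format="csr")
    d = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

N, C, F = 5, 3, 2
A = sp.csr_matrix(np.array([[0, 1, 0, 0, 1],
                            [1, 0, 1, 0, 0],
                            [0, 1, 0, 1, 0],
                            [0, 0, 1, 0, 1],
                            [1, 0, 0, 1, 0]], dtype=float))
X = np.random.randn(N, C)          # C input channels per node
Theta = np.random.randn(C, F)      # filter parameter matrix

A_hat = normalized_adj(A)
Z = A_hat @ (X @ Theta)            # sparse-dense product
print(Z.shape)                     # (5, 2)
```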


3 Semi-supervised node classification

  Having introduced a flexible model $f(X, A)$ for efficient information propagation on graphs, we can return to the problem of semi-supervised node classification.

  The overall model, a multi-layer GCN for semi-supervised learning, is schematically depicted in Figure 1.

     

  In the figure above, Figure 1(a) is a schematic of a GCN: the input layer has $C$ channels, there are several hidden layers in between, and the output layer has $F$ feature maps; the graph structure (edges shown as black lines) is shared across layers; labels are denoted by $Y_{i}$. Figure 1(b) is a visualization of the hidden-layer activations of a two-layer GCN trained on the Cora dataset (using 5% of the labels), where colors denote document class.

3.1 Example

  We consider a two-layer GCN for semi-supervised node classification on a graph with a symmetric adjacency matrix $A$ (binary or weighted).

  Our forward model then takes the simple form

    $Z=f(X, A)=\operatorname{softmax}\left(\hat{A} \operatorname{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$

  Where

    • $\hat{A}=\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$  in a pre-processing step.
    • $W^{(0)} \in \mathbb{R}^{C \times H}$ is an input-to-hidden weight matrix for a hidden layer with $H$ feature maps.
    • $W^{(1)} \in \mathbb{R}^{H \times F}$  is a hidden-to-output weight matrix. 
    • Softmax activation function: $\operatorname{softmax}\left(x_{i}\right)=\frac{1}{\mathcal{Z}} \exp \left(x_{i}\right)$ with $\mathcal{Z}=\sum\limits _{i} \exp \left(x_{i}\right)$, applied row-wise.

  For semi-supervised multiclass classification, we then evaluate the cross-entropy error over all labeled examples:

     $\mathcal{L}=-\sum\limits _{l \in \mathcal{Y}_{L}} \sum\limits_{f=1}^{F} Y_{l f} \ln Z_{l f}$

  Where

    • $\mathcal{Y}_{L}$ is the set of node indices that have labels. 
    • The neural network weights $W^{(0)}$ and $W^{(1)}$ are trained using gradient descent.
    • We perform batch gradient descent using the full dataset for every training iteration.
    • Using a sparse representation for $A$, memory requirement is $O(|E|)$, i.e. linear in the number of edges.
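
  A minimal NumPy sketch (mine) of the two-layer forward model and the cross-entropy evaluated only over labeled nodes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))   # row-wise, numerically stable
    return e / e.sum(axis=1, keepdims=True)

def forward(A_hat, X, W0, W1):
    H = np.maximum(A_hat @ X @ W0, 0.0)            # hidden layer with ReLU
    return softmax(A_hat @ H @ W1)                 # Z: N x F class probabilities

def masked_cross_entropy(Z, Y, labeled_idx):
    """L = - sum_{l in Y_L} sum_f Y_lf ln Z_lf, only over labeled nodes."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + 1e-12))

# Toy setup: 3 nodes, 2 input features, 4 hidden units, 2 classes, 1 labeled node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_tilde = A + np.eye(3)
d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt          # pre-processed once

X = np.random.randn(3, 2)
W0, W1 = np.random.randn(2, 4), np.random.randn(4, 2)
Y = np.array([[1, 0], [0, 0], [0, 0]], dtype=float)  # one-hot labels (node 0 labeled)

Z = forward(A_hat, X, W0, W1)
print(masked_cross_entropy(Z, Y, labeled_idx=[0]))
```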

4 Related work

   Omitted.


5 Experiments

  Do some experiments:

  • semi-supervised document classification in citation networks;
  • semi-supervised entity classification in a bipartite graph extracted from a knowledge graph;
  • an evaluation of various graph propagation models;
  • a run-time analysis on random graphs.

5.1 Datasets

  

  In the citation network datasets (Citeseer, Cora and Pubmed):

    • Nodes are documents.
    • Edges are citation links.
    • Label rate denotes the number of labeled nodes that are used for training divided by the total number of nodes in each dataset.

  NELL  is a bipartite graph dataset extracted from a knowledge graph with 55,864 relation nodes and 9,891 entity nodes.

5.2 Experimental set-up

  • We train a two-layer GCN as described in Section 3.1.
  • Evaluate prediction accuracy on a test set of 1,000 labeled examples.
  • We provide additional experiments using deeper models with up to 10 layers.
  • We choose the same dataset as in Yang et al. (2016) with an additional validation set of 500 labeled examples for hyperparameter optimization.
    • Dropout rate for all layers.
    • L2 regularization factor for the first GCN layer.
    • Number of hidden units.

5.3 Baselines 

  • Label propagation (LP) (Zhu et al., 2003).
  • Semi-supervised embedding (SemiEmb) (Weston et al., 2012).
  • Manifold regularization (ManiReg) (Belkin et al., 2006) .
  • Skip-gram based graph embeddings (DeepWalk) (Perozzi et al., 2014).
  • Iterative classification algorithm (ICA) proposed in Lu & Getoor (2003).
  • Planetoid.

6 Results 

6.1 Semi-supervised node classification

   Results are summarized in Table 2.

    

6.2 Evaluation of propagation model 

   We compare different variants of our proposed per-layer propagation model on the citation network datasets.

  Results are summarized in Table 3.

    

  Our original GCN propagation model is the one denoted "renormalization trick" (in bold).

6.3 Training time per epoch

  Here, we report results for the mean training time per epoch (forward pass, cross-entropy calculation, backward pass) for 100 epochs on simulated random graphs.

  We compare results on a GPU and on a CPU-only implementation.

  Results are summarized in Figure 2.

    


7 Discussion

7.1 Semi-supervised model 

  Our model's advantages:

    • It considers both node properties and global graph structure.
    • Easy to optimize.

7.2 Limitations and future work

  • Memory requirement.
  • Directed edges and edge features.
  • Limiting assumptions.

8 Conclusion

  • We have introduced a novel approach for semi-supervised classification on graph-structured data.
  • Our GCN model uses an efficient layer-wise propagation rule that is based on a first-order approximation of spectral convolutions on graphs.
  • The proposed GCN model is capable of encoding both graph structure and node features in a way useful for semi-supervised classification.
  • In this setting, our model outperforms several recently proposed methods by a significant margin, while being computationally efficient.

Reference

  [1] GRAPH CONVOLUTIONAL NETWORKS

  [2] 《Graph Neural Network Fundamentals I: Fourier Series and the Fourier Transform》

  [3] 《Graph Neural Network Fundamentals II: Spectral Graph Theory》

  [4] Paper reading: 《The Emerging Field of Signal Processing on Graphs》

  [5] The min-max theorem

  [6] Paper reading, first-generation GCN: 《Spectral Networks and Locally Connected Networks on Graphs》

  [7] Paper reading, second-generation GCN: 《Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering》

  [8] A summary of spectral clustering

  Last modified: 2022-01-21 16:59:18


 
