# Paper Information

Title: Symmetric Graph Convolutional Autoencoder for Unsupervised Graph Representation Learning
Authors: Jiwoong Park, Minsik Lee, H. Chang, Kyuewang Lee, J. Choi
Source: 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
Others: 44 Citations, 44 References

# Abstract

This paper proposes a fully symmetric autoencoder in which the decoder is designed based on Laplacian sharpening and the encoder on Laplacian smoothing. In addition, it designs a cost function based on the latent representation and a latent affinity matrix.

# 1. Introduction

In machine learning and data mining, graph data are used for studies such as:

• node clustering
• dimensionality reduction
• social network analysis
• chemical property prediction of a molecular graph
• image segmentation

Why earlier convolution methods on graphs are limited:

• Low computational efficiency due to eigendecomposition or singular value decomposition.
• They capture only shallow relationships between nodes.

In recent years, an emerging field called geometric deep learning has generalized deep neural network models to non-Euclidean domains such as meshes, manifolds, and graphs. Within this field, finding deep latent representations of the geometrical structures of graphs with the autoencoder framework is attracting growing attention.

The first attempt was VGAE, which consists of a Graph Convolutional Network (GCN) encoder and a matrix outer-product decoder, as shown in Figure 1(a).

As a variant of VGAE, ARVGA applies an adversarial approach to VGAE. However, both VGAE and ARVGA reconstruct the affinity matrix $A$ rather than the node feature matrix $X$; as a result, the decoder part is not learnable and graphical features cannot be used in the decoder.

Later, MGAE was proposed; it uses stacked single-layer graph autoencoders with linear activation functions and a marginalization process, as shown in Figure 1(b). However, since MGAE reconstructs the feature matrix of nodes without hidden layers, it cannot manipulate the dimension of the latent representation and only performs a linear mapping.

This paper proposes a new graph convolutional autoencoder framework with a symmetric architecture that uses both the graph structure and node attributes during encoding and decoding, as shown in Figure 1(c).

The encoder of VGAE can be interpreted as a special form of Laplacian smoothing, which computes the new representation of each node as a weighted local average of itself and its neighbors.

This paper expresses Laplacian sharpening in the form of a Chebyshev polynomial to construct a Laplacian-sharpening decoder.

The contributions of this paper are as follows:

• It proposes the first fully symmetric graph convolutional autoencoder, which exploits both the graph structure and node attributes throughout the whole encoding-decoding process.
• It derives a new numerically stable form of the decoder that prevents numerical instabilities in the neural network.
• It designs a computationally efficient subspace clustering cost that simultaneously finds the latent representation and a latent affinity matrix for image clustering tasks.

# 2. Preliminaries

## 2.1. Basic notations on graphs

• A graph is represented as  $\mathcal{G}=(\mathcal{V}, \mathcal{E}, A)$
• $\mathcal{V}$  denotes the node set with  $v_{i} \in \mathcal{V}$  and  $|\mathcal{V}|=n$.
• $\mathcal{E}$ denotes the edge set with  $\left(v_{i}, v_{j}\right) \in \mathcal{E}$.
• $A \in \mathbb{R}^{n \times n}$ denotes an affinity matrix which encodes pairwise affinities between nodes.
• $D \in \mathbb{R}^{n \times n}$  denotes a degree matrix which is a diagonal matrix with  $D_{i i}=\sum_{j} A_{i j}$ .
• Unnormalized graph Laplacian  $\Delta$  is defined by  $\Delta=D-A$  .
• Symmetric graph Laplacian  $L$  is defined by  $L=I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ .
• Random walk graph Laplacian  $L_{r w}$  is defined by  $L_{r w}=I_{n}-D^{-1} A$ .
• $I_{n} \in \mathbb{R}^{n \times n}$  denotes an identity matrix.
• $\Delta$, $L$  and  $L_{r w}$ are positive semidefinite matrices.
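
To make the notation above concrete, here is a minimal numpy sketch (ours, not from the paper) that builds these matrices for a toy path graph and checks that all three Laplacians are positive semidefinite:

```python
import numpy as np

# Toy unweighted path graph 0-1-2-3; A is the affinity matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

D = np.diag(A.sum(axis=1))                        # degree matrix, D_ii = sum_j A_ij
Delta = D - A                                     # unnormalized graph Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt       # symmetric graph Laplacian
L_rw = np.eye(n) - np.linalg.inv(D) @ A           # random walk graph Laplacian

# Delta and L are symmetric PSD; L_rw is similar to L, so its eigenvalues
# are also real and nonnegative.
for M in (Delta, L, L_rw):
    assert np.linalg.eigvals(M).real.min() > -1e-10
```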

## 2.2. Spectral convolution on graphs

The content of Section 2.2 has already been explained in detail in the《第三代GCN》note, so it is only summarized below.

A spectral convolution on a graph is the multiplication of an input signal $x \in \mathbb{R}^{n}$ with a spectral filter $g_{\theta}=\operatorname{diag}(\theta)$ parameterized by the vector of Fourier coefficients $\theta \in \mathbb{R}^{n}$ as follows:

$g_{\theta} * x=U g_{\theta} U^{T} x\quad \quad \quad (1)$

where

• $U$ is the matrix of eigenvectors of the symmetric graph Laplacian $L=U \Lambda U^{T}$ .
• $U^{T} x$ is the graph Fourier transform of the input $x$ ,
• $g_{\theta}$ can be regarded as a function of the eigenvalues of $L$, i.e., $g_{\theta}(\Lambda)$, where $\Lambda$ is the diagonal matrix of eigenvalues of $L$.

The problem with this general graph convolution: the operation is inappropriate for large-scale graphs since it requires an eigendecomposition to obtain the eigenvalues and eigenvectors of $L$.

The solution: the spectral filter $g_{\theta}(\Lambda)$ was approximated by a $K^{th}$-order Chebyshev polynomial expansion in previous works. By doing so, the spectral convolution on the graph can be approximated as

$g_{\theta} * x \approx U \sum\limits _{k=0}^{K} \theta_{k}^{\prime} T_{k}(\tilde{\Lambda}) U^{T} x=\sum\limits _{k=0}^{K} \theta_{k}^{\prime} T_{k}(\tilde{L}) x\quad \quad \quad (2)$

where

• $T_{k}(\cdot)$ denotes the $k^{th}$ Chebyshev polynomial.
• $\theta^{\prime}$ denotes the vector of Chebyshev coefficients.
• $\tilde{\Lambda}$ is $\frac{2}{\lambda_{\max }} \Lambda-I_{n}$.
• $\lambda_{\max }$ denotes the largest eigenvalue of $L$.
• $\tilde{L}$ is $U \tilde{\Lambda} U^{T}=\frac{2}{\lambda_{\max }} L-I_{n}$.
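
As a concrete illustration of Eq. (2), the following sketch (our own; it computes $\lambda_{\max}$ exactly, whereas in practice it is often approximated or simply set to $2$) applies the filter using the Chebyshev recurrence $T_{0}(x)=1$, $T_{1}(x)=x$, $T_{k}(x)=2 x T_{k-1}(x)-T_{k-2}(x)$, which avoids the eigendecomposition of Eq. (1):

```python
import numpy as np

def cheb_filter(L, x, theta):
    """Apply the K-th order Chebyshev filter of Eq. (2): sum_k theta[k] T_k(L~) x."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()        # in practice: approximate, or use 2
    L_tilde = (2.0 / lam_max) * L - np.eye(n)    # rescaled Laplacian, spectrum in [-1, 1]
    T_prev, T_curr = x, L_tilde @ x              # T_0(L~) x and T_1(L~) x
    out = theta[0] * T_prev
    for k in range(1, len(theta)):
        out = out + theta[k] * T_curr
        # Chebyshev recurrence, lifted to matrix polynomials acting on x.
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
    return out
```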

In the GCN, the Chebyshev approximation was simplified by setting $K=1$, $\lambda_{\max } \approx 2$ and $\theta=\theta_{0}^{\prime}= -\theta_{1}^{\prime}$. This simplifies the spectral convolution to:

$g_{\theta} * x \approx \theta\left(I_{n}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x \quad \quad \quad (3)$

However, repeated application of $I_{n}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ can cause numerical instabilities in neural networks, since the spectral radius of $I_{n}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is $2$ while the Chebyshev polynomials form an orthonormal basis only when the spectral radius is $1$. To circumvent this issue, the GCN uses the renormalization trick:

$I_{n}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \quad \quad \quad (4)$

where $\tilde{A}=A+I_{n}$ and $\tilde{D}_{i i}=\sum_{j} \tilde{A}_{i j}$. Since adding self-loops to the affinity matrix does not change the spectral radius of the corresponding graph Laplacian, this renormalization trick provides a numerically stable form of $I_{n}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ while maintaining the meaning of each element, as follows:

$\left(I_{n}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right)_{i j}=\left\{\begin{array}{ll}1 & i=j \\A_{i j} / \sqrt{D_{i i} D_{j j}} & i \neq j\end{array}\right.\quad \quad \quad (5)$

$\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\right)_{i j}=\left\{\begin{array}{ll}1 /\left(D_{i i}+1\right) & i=j \\A_{i j} / \sqrt{\left(D_{i i}+1\right)\left(D_{j j}+1\right)} & i \neq j\end{array}\right.\quad \quad \quad (6)$

Finally, the forward path of the GCN can be expressed by

$H^{(m+1)}=\xi\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(m)} \Theta^{(m)}\right)\quad \quad \quad (7)$

where $H^{(m)}$ is the activation matrix in the $m^{th}$ layer and $H^{(0)}$ is the input node feature matrix $X$, $\xi(\cdot)$ is a nonlinear activation function such as $\operatorname{ReLU}(\cdot)=\max (0, \cdot)$, and $\Theta^{(m)}$ is a trainable weight matrix. The GCN provides a computationally efficient convolutional process (under the assumption that $\tilde{A}$ is sparse) and improves on state-of-the-art accuracy in the semi-supervised node classification task by using the features of nodes and the geometric structure of the graph simultaneously.
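
Putting Eqs. (4)-(7) together, a single GCN layer can be sketched in a few lines of numpy (a toy dense-matrix sketch of ours; real implementations use sparse operations):

```python
import numpy as np

def renormalized_operator(A):
    """D~^{-1/2} (A + I) D~^{-1/2} from Eq. (4), with entries as in Eq. (6)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # add self-loops
    d_tilde = A_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A, H, Theta):
    """One forward step of Eq. (7) with ReLU as the nonlinearity xi."""
    return np.maximum(0.0, renormalized_operator(A) @ H @ Theta)

# Toy usage: 4 nodes with 3-dim features mapped to 2-dim activations.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
rng = np.random.default_rng(0)
H1 = gcn_layer(A, rng.normal(size=(4, 3)), rng.normal(size=(3, 2)))
```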

## 2.3. Laplacian smoothing

Li et al. showed that the GCN is a special form of Laplacian smoothing.

Laplacian smoothing equation:

$x_{i}^{(m+1)}=(1-\gamma) x_{i}^{(m)}+\gamma \sum\limits_{j} \frac{\tilde{A}_{i j}}{\tilde{D}_{i i}} x_{j}^{(m)}\quad \quad \quad (8)$

Where:

• $\tilde{A}=A+I_{n}$.
• $\tilde{D}=D+I_{n}$.
• $x_{i}^{(m+1)}$ is the new representation of $x_{i}^{(m)}$.
• $\gamma\ (0<\gamma \leq 1)$ is a regularization parameter which controls the weighting between a node itself and its neighbors.

Rewriting the above equation in matrix form:

$\begin{aligned}X^{(m+1)} &=(1-\gamma) X^{(m)}+\gamma \tilde{D}^{-1} \tilde{A} X^{(m)} \\&=X^{(m)}-\gamma\left(I_{n}-\tilde{D}^{-1} \tilde{A}\right) X^{(m)} \\&=X^{(m)}-\gamma \tilde{L}_{r w} X^{(m)}\end{aligned}\quad \quad \quad (9)$

Tips:

Random walk normalized Laplacian

$L_{r w}=D^{-1} \Delta=I_{n}-D^{-1} A$

Symmetric normalized Laplacian

$L=D^{-\frac{1}{2}} \Delta D^{-\frac{1}{2}}=I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$

Setting $\gamma=1$ and replacing $\tilde{L}_{r w}$ with its symmetric counterpart $\tilde{L}$, Eq. (9) becomes

$X^{(m+1)}=\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(m)}$

This equation is the same as the renormalized version of spectral convolution in Eq. (7).
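
A quick way to see the smoothing behavior (a toy demo of ours, not an experiment from the paper): repeatedly applying $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ monotonically shrinks the Dirichlet energy $\operatorname{tr}(X^{T} \tilde{L} X)$, which measures how much neighboring nodes disagree.

```python
import numpy as np

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
n = A.shape[0]
A_tilde = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
S = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]   # smoothing operator
L_tilde = np.eye(n) - S                                   # renormalized Laplacian

X = np.random.default_rng(1).normal(size=(n, 2))
for step in range(5):
    energy = np.trace(X.T @ L_tilde @ X)       # disagreement between neighbors
    print(step, round(energy, 4))              # decreases at every step
    X = S @ X                                  # one layer of (linear) smoothing
```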

# 3. The proposed method

Our model is GALA (Graph convolutional Autoencoder using Laplacian smoothing and sharpening).

GALA has $M$ layers in total, where $M$ is an even number: the first to the $\frac{M}{2}$th layers form the encoder, and the $\left(\frac{M}{2}+1\right)$th to the $M$th layers form the decoder.

• The encoder part of GALA is designed to perform the computationally efficient spectral convolution on the graph with the numerically stable form of Laplacian smoothing in Eq. (7).
• The decoder part is designed as a special form of Laplacian sharpening. With this decoder, GALA reconstructs the feature matrix of nodes directly, instead of yielding an affinity matrix as in the existing VGAE-related algorithms, whose decoder parts are incomplete.

## 3.1. Laplacian sharpening

The encoder performs Laplacian smoothing that makes the latent representation of each node similar to those of its neighboring nodes.

Laplacian sharpening is a process that makes the reconstructed feature of each node farther away from the centroid of its neighbors.

Laplacian sharpening equation:

$x_{i}^{(m+1)}=(1+\gamma) x_{i}^{(m)}-\gamma \sum\limits _{j} \frac{A_{i j}}{D_{i i}} x_{j}^{(m)}\quad \quad \quad (10)$

Where:

• $x_{i}^{(m+1)}$ is the new representation of $x_{i}^{(m)}$.
• $\gamma\ (0<\gamma \leq 1)$ is a regularization parameter which controls the weighting between a node itself and its neighbors.

The matrix form of Eq. (10):

$\begin{aligned}X^{(m+1)} &=(1+\gamma) X^{(m)}-\gamma D^{-1} A X^{(m)} \\&=X^{(m)}+\gamma\left(I_{n}-D^{-1} A\right) X^{(m)} \\&=X^{(m)}+\gamma L_{r w} X^{(m)}\end{aligned}\quad \quad \quad (11)$

Setting $\gamma=1$ and replacing $L_{r w}$ with its symmetric counterpart $L$, Eq. (11) becomes

$X^{(m+1)}=(2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}})X^{(m)}$

We express Laplacian sharpening in the form of a Chebyshev polynomial and simplify it by setting

• $K=1$
• $\lambda_{\max } \approx 2$
• $\theta=\frac{1}{2} \theta_{0}^{\prime}=\theta_{1}^{\prime}$

Then, a decoder layer can be expressed by

$H^{(m+1)}=\xi\left(\left(2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) H^{(m)} \Theta^{(m)}\right)\quad \quad \quad (12)$

Where:

• $H^{(m)}$  is the matrix of the activation in the  $m^{t h}$  layer,
• $2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$  is a special form of Laplacian sharpening,
• $\xi(\cdot)$  is the nonlinear activation function like  $\operatorname{ReLU}(\cdot) =\max (0, \cdot)$ ,
• $\Theta^{(m)}$  is a trainable weight matrix.

Since the spectral radius of $2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is $3$, repeated application of this operator can be numerically unstable.

Recall the definition of the spectral radius: if $\boldsymbol{A}=\left(a_{i j}\right)$ is an $n \times n$ matrix over the complex field and $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n}$ are all of its eigenvalues, then

$\rho(\boldsymbol{A})= \underset{1 \leq i \leq n}{max} |\lambda_{i}|$

From the previous discussion, we know that the eigenvalues of $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ lie in $[-1,1]$, so the eigenvalues of $2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ lie in $[1,3]$.

So we need to find a  numerically stable form of Laplacian sharpening.
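
The instability is easy to reproduce numerically. The sketch below (ours) uses a small bipartite graph, for which the eigenvalues of $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ reach $-1$, so the naive sharpening operator of Eq. (12) attains spectral radius $3$; the signed-graph operator introduced in the next subsection stays within radius $1$.

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)    # path graph (bipartite)
n = A.shape[0]
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
naive = 2 * np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # Eq. (12) operator

A_hat = 2 * np.eye(n) - A                                 # signed affinity (Sec. 3.2)
dh_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1) + 2.0)          # D^_ii = D_ii + 2
stable = dh_inv_sqrt[:, None] * A_hat * dh_inv_sqrt[None, :]  # Eq. (14) operator

print(spectral_radius(naive))    # 3.0
print(spectral_radius(stable))   # <= 1.0
```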

## 3.2. Numerically stable Laplacian sharpening

The goal is to find a new representation of Laplacian sharpening whose spectral radius is $1$.

A signed graph is denoted by  $\Gamma=(\mathcal{V}, \mathcal{E}, \hat{A})$  which is induced from the unsigned graph  $\mathcal{G}=(\mathcal{V}, \mathcal{E}, A)$ .

Each element of $\hat{A}$ has the same absolute value as the corresponding element of $A$, but its sign may be flipped to minus or kept plus.

The degree matrix $\hat{D}$ can be defined as $\hat{D}_{i i}=\sum\limits _{j}\left|\hat{A}_{i j}\right|$ .

Thus, we can construct:

• unnormalized graph Laplacian  $\hat{\Delta}=\hat{D}-\hat{A}$.
• symmetric graph Laplacian  $\hat{L}=I_{n}-\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$  of the signed graph.

The eigenvalues of $\hat{L}$ lie in $[0,2]$; thus the spectral radius of $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ is at most $1$ for any choice of $\hat{A}$.

Using this result, we use the following decoder layer instead of Eq. (12):

$H^{(m+1)}=\xi\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(m)} \Theta^{(m)}\right)\quad \quad \quad (13)$

Where:

• $\hat{A}=2 I_{n}-A$
• $\hat{D}=2 I_{n}+D$

The entries of $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ are computed as follows:

$\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}\right)_{i j}=\left\{\begin{array}{ll}2 /\left(D_{i i}+2\right) & i=j \\-A_{i j} / \sqrt{\left(D_{i i}+2\right)\left(D_{j j}+2\right)} & i \neq j\end{array}\right.\quad \quad \quad (14)$

which has the same meaning as the original operator given by

$\left(2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right)_{i j}=\left\{\begin{array}{ll}2 & i=j \\-A_{i j} / \sqrt{D_{i i} D_{j j}} & i \neq j\end{array}\right.\quad \quad \quad (15)$

Purpose:

• Make the spectral radius equal to $1$.
• $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ preserves the properties of $2 I_{n}-D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.

In summary, the numerically stable decoder layer of GALA is given as

$H^{(m+1)}=\xi\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(m)} \Theta^{(m)}\right),\left(m=\frac{M}{2}, \ldots, M-1\right)\quad \quad \quad (16)$

Where

• $\hat{A}=2 I_{n}-A$
• $\hat{D}=2 I_{n}+D$

The encoder layer of GALA is given as

$H^{(m+1)}=\xi\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(m)} \Theta^{(m)}\right),\left(m=0, \ldots, \frac{M}{2}-1\right)\quad \quad \quad (17)$

Where

• $\tilde{A}=I_{n}+A$
• $\tilde{D}=I_{n}+D$

The encoder and decoder propagation functions, Eqs. (17) and (16), both cost $\mathcal{O}(m p c)$ (a dense-matrix sketch of the full forward path follows the list below), where:

• $m$ is the cardinality of edges in the graph.
• $p$ is the feature dimension of the previous layer.
• $c$ is the feature dimension of the current layer.
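
As referenced above, here is the forward-path sketch (ours; the paper assumes a sparse $\tilde{A}$, and the weights below are random placeholders): $\frac{M}{2}$ encoder layers of Eq. (17) followed by $\frac{M}{2}$ decoder layers of Eq. (16).

```python
import numpy as np

def smoothing_op(A):                               # encoder operator of Eq. (17)
    n = A.shape[0]
    A_t = A + np.eye(n)
    d = 1.0 / np.sqrt(A_t.sum(axis=1))
    return d[:, None] * A_t * d[None, :]

def sharpening_op(A):                              # stable decoder operator of Eq. (16)
    n = A.shape[0]
    A_h = 2 * np.eye(n) - A
    d = 1.0 / np.sqrt(A.sum(axis=1) + 2.0)
    return d[:, None] * A_h * d[None, :]

def gala_forward(A, X, enc_weights, dec_weights):
    relu = lambda Z: np.maximum(0.0, Z)
    S, T = smoothing_op(A), sharpening_op(A)
    H = X
    for W in enc_weights:                          # layers 1 .. M/2
        H = relu(S @ H @ W)
    latent = H                                     # latent representation H*
    for W in dec_weights:                          # layers M/2+1 .. M
        H = relu(T @ H @ W)
    return latent, H                               # H approximates X (Eq. 18)

# Toy usage with M = 4 (two encoder and two decoder layers).
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
rng = np.random.default_rng(0)
enc = [rng.normal(size=(8, 4)), rng.normal(size=(4, 2))]
dec = [rng.normal(size=(2, 4)), rng.normal(size=(4, 8))]
latent, X_bar = gala_forward(A, rng.normal(size=(4, 8)), enc, dec)
```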

Table 1 shows, through node clustering experiments, why Laplacian smoothing is not appropriate for the decoder and why the numerically stable Laplacian sharpening is necessary.

• The Laplacian smoothing decoder (Eq. 7) shows the lowest performance, since Laplacian smoothing, which makes the representation of each node similar to those of its neighboring nodes, conflicts with the purpose of the reconstruction cost.
• The numerically unstable form of the Laplacian sharpening decoder (Eq. 12) performs better than the smoothing decoder, because the role of Laplacian sharpening coincides with reconstructing the node features.
• The proposed numerically stable Laplacian sharpening decoder (Eq. 16) performs significantly better than the others, since it solves the instability issue of the neural network while maintaining the meaning of the original Laplacian sharpening.

The basic cost function of GALA is given by

$\min _{\bar{X}} \frac{1}{2}\|X-\bar{X}\|_{F}^{2}\quad \quad \quad (18)$

Where

• $\bar{X}$  is the reconstructed feature matrix of nodes.
• each column of $\bar{X}$ corresponds to the decoder output for the input feature of one node.
• $\|\cdot\|_{F}$  denotes the Frobenius norm.

## 3.3. Subspace clustering cost for image clustering

For image clustering tasks, we add a subspace clustering term to the proposed cost:

$\underset{\bar{X}, H, A_{H}}{min} \frac{1}{2}\|X-\bar{X}\|_{F}^{2}+\frac{\lambda}{2}\left\|H-H A_{H}\right\|_{F}^{2}+\frac{\mu}{2}\left\|A_{H}\right\|_{F}^{2}\quad \quad \quad (19)$

Where

• $H \in \mathbb{R}^{k \times n}$  denotes the latent representations (i.e., the output of the encoder).
• $A_{H} \in \mathbb{R}^{n \times n}$  denotes the affinity matrix which is a new latent variable for subspace clustering,
• $\lambda$, $\mu$  are the regularization parameters.

The second term of Eq. (19) corresponds to the self-expressive model of subspace clustering, and the third term regularizes $A_{H}$.

If we minimize with respect to $A_{H}$ only, the problem becomes:

$\underset{A_{H}}{min} \frac{\lambda}{2}\left\|H-H A_{H}\right\|_{F}^{2}+\frac{\mu}{2}\left\|A_{H}\right\|_{F}^{2}\quad \quad \quad (20)$

We can easily obtain the analytic solution

$A_{H}^{*}=(H^{T} H+\frac{\mu}{\lambda} I_{n})^{-1} H^{T} H$

By using this analytic solution and the singular value decomposition, we derive a computationally efficient subspace clustering cost function as follows:

$\underset{\bar{X}, H}{min} \frac{1}{2}\|X-\bar{X}\|_{F}^{2}+\frac{\mu \lambda}{2} \operatorname{tr}\left(\left(\mu I_{k}+\lambda H H^{T}\right)^{-1} H H^{T}\right)\quad \quad \quad (21)$

This reformulation reduces the computational burden significantly (see the sketch after this list):

• The problem can be solved with a $k \times k$ matrix inversion instead of an $n \times n$ one, reducing the complexity from $\mathcal{O}\left(n^{3}\right)$ to $\mathcal{O}\left(k^{3}\right)$.
• $k$ is the dimension of the latent representation.
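
The sketch referenced above (ours) evaluates the subspace clustering cost once through the $n \times n$ closed form $A_{H}^{*}$ and once through the $k \times k$ trace form of Eq. (21); the two values agree.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, lam, mu = 4, 30, 1.0, 0.5
H = rng.normal(size=(k, n))                        # latent representations (columns = nodes)

# n x n route: A_H* = (H^T H + (mu/lam) I_n)^{-1} H^T H, plugged into Eq. (20).
G = H.T @ H
A_star = np.linalg.solve(G + (mu / lam) * np.eye(n), G)
cost_nn = 0.5 * lam * np.linalg.norm(H - H @ A_star, 'fro')**2 \
        + 0.5 * mu * np.linalg.norm(A_star, 'fro')**2

# k x k route: the trace term of Eq. (21), cheaper when k << n.
K = H @ H.T
cost_kk = 0.5 * mu * lam * np.trace(np.linalg.solve(mu * np.eye(k) + lam * K, K))

assert np.isclose(cost_nn, cost_kk)                # O(n^3) vs O(k^3), same value
```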

## 3.4. Training

• We train GALA to minimize Eq. (18) by using the ADAM algorithm.
• We train GALA deterministically by using the full batch in each training epoch and stop when the cost is converged.
• We set the learning rate to $1.0 \times 10^{-4}$ and train GALA in an unsupervised way without any cluster labels.

• First, in the pre-training stage for image clustering, the training method is the same as that of minimizing Eq. (18).
• After pre-training, we fine-tune GALA to minimize Eq. (21) using ADAM.

After training is over, we construct a $k$-nearest-neighbor graph from the obtained latent representations $H^{*}$. Then we perform spectral clustering on it to obtain the clustering results.

In the case of image clustering, after all training processes are over, we construct the optimal affinity matrix $A_{H}^{*}$ described in the previous subsection using the latent representation matrix $H^{*}$ obtained from GALA. We then perform spectral clustering on this affinity matrix to obtain the clustering that is optimal with respect to our cost function.
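
As an illustration, the two post-training pipelines could look as follows (a sketch of ours using scikit-learn; the paper does not specify an implementation, and the symmetrization steps are common conventions rather than details from the paper):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

def cluster_nodes(H, n_clusters, n_neighbors=10):
    """Node clustering: spectral clustering on a k-NN graph of the rows of H*."""
    knn = kneighbors_graph(H, n_neighbors=n_neighbors, mode='connectivity')
    W = 0.5 * (knn + knn.T).toarray()              # symmetrize the k-NN graph
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return sc.fit_predict(W)

def cluster_images(H, n_clusters, lam=1.0, mu=0.5):
    """Image clustering: spectral clustering on the optimal affinity A_H*."""
    G = H.T @ H                                    # H is k x n (columns = samples)
    A_star = np.linalg.solve(G + (mu / lam) * np.eye(G.shape[0]), G)
    W = 0.5 * np.abs(A_star + A_star.T)            # symmetric nonnegative affinity
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return sc.fit_predict(W)
```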

# 4. Experiments

## 4.1. Datasets

Four network datasets

• Cora
• Citeseer
• Wiki
• Pubmed

Every network dataset has the feature matrix $X$ and the affinity matrix $A$ .

Three image datasets

• COIL20
• YALE
• MNIST

Every image dataset has the feature matrix $X$ only.

Node clustering is performed on the network datasets, and image clustering on the image datasets.

## 4.2. Experimental settings

Three metrics:

• accuracy (ACC)
• normalized mutual information (NMI)
• adjusted Rand index (ARI)

Results are averaged over 50 runs.

## 4.3. Comparing methods

• Using features only:
  • Kmeans
• Using network structures only:
  • Spectral
  • Big-Clam
  • DeepWalk
  • GraEnc
  • DNGR
• Using both:
  • Circles
  • RTM
  • RMSC
• Using both with spectral convolution on graphs:
  • GAE
  • VGAE
  • MGAE
  • ARGA
  • ARVGA

## 4.4. Node clustering results

The experimental results of node clustering are presented in Table 2.

• For every dataset, the methods that use features and network structure simultaneously perform better than the methods that use only one of them.
• Among the methods that use both, algorithms with neural network models that exploit spectral convolution on graphs show outstanding performance, since they can learn deeper relationships between nodes than the methods that do not use spectral convolution.
• GALA shows superior performance to the other methods.
• GALA outperforms MGAE, the state-of-the-art graph convolutional autoencoder algorithm, by about 6.15%, 6.56% and 8.68% in ACC, NMI and ARI, respectively.
• The better performance of GALA comes from the decoder design based on the numerically stable form of Laplacian sharpening and from fully utilizing the graph structure and node attributes in the whole autoencoder framework.

We conduct another node clustering experiment on a large network dataset (Pubmed), and the results are reported in Table 5.

• GALA outperforms every baseline and the state-of-the-art graph convolution algorithms.

## 4.5. Image clustering results

The experimental results of image clustering are presented in Table 4.

We report GALA's performance both with the reconstruction cost only and with the subspace clustering cost added (GALA+SCC).

• GALA outperforms the baselines and the state-of-the-art graph convolution algorithms in most cases.
• The proposed subspace clustering cost term contributes to improving the image clustering performance.

## 4.6. Ablation studies

We validate the effectiveness of the proposed stable decoder and the subspace clustering cost by image clustering experiments on the three image datasets (COIL20, YALE and MNIST).

There are four configurations as shown in Table 6.

• The numerically stable form of Laplacian sharpening and the subspace clustering cost are both helpful for finding good latent representations.
• The stable decoder with the reconstruction cost only already outperforms the state-of-the-art algorithms in most cases.

## 4.7. Link prediction

For the link prediction task, the reconstruction cost is augmented with an affinity reconstruction term:

$\underset{\bar{X}, H}{min} \frac{1}{2}\|X-\bar{X}\|_{F}^{2}-\gamma\, \mathbb{E}_{H}[\log p(\hat{A} \mid H)]$

where

• $H$  is the latent representation,
• $\hat{A}=\operatorname{sigmoid}\left(H H^{T}\right)$  is the reconstructed affinity matrix and
• $\gamma$  is the regularization parameter.

The results are shown in Table 7.

• Our model outperforms the compared methods in terms of the link prediction task as well as the node clustering task.

## 4.8. Visualization

• The encoder makes the features of each node become similar to those of its neighbors.
• The decoder makes the features of each node distinguishable from those of its neighbors by using the geometrical structure of the graph.

We visualize the distribution of learned latent representations and the input features of each node in two-dimensional space using t-SNE [23] as shown in Figure 3.

• From the visualization, we can see that GALA clusters the data well according to their corresponding labels, even though GALA is trained in an unsupervised manner.
• From the red dotted line in the embedding results of the latent representation on YALE, we can see that GALA embeds the node representations better than the compared methods by minimizing inter-cluster affinity and maximizing intra-cluster affinity.

# 5. Conclusions

• We designed a symmetric graph convolutional autoencoder architecture where the encoder performs Laplacian smoothing while the decoder performs Laplacian sharpening.
• To prevent numerical instabilities, we designed a new representation of Laplacian sharpening with spectral radius one by incorporating the concept of the signed graph.
• To enhance the performance of image clustering tasks, we added a subspace clustering cost term to the reconstruction cost of the autoencoder.