Transformers for Graph Representation

Published by Ethan_CW on 2021-06-20

Do Transformers Really Perform Bad for Graph Representation?

microsoft/Graphormer: the official implementation of "Do Transformers Really Perform Bad for Graph Representation?" (github.com)

1 Introduction

The authors find that the key issue is how to recover the graph structural information that the Transformer's self-attention layers throw away. Unlike sequence data (NLP, speech) or grid data (CV), structural information is a property unique to graph data, and it plays an important role in predicting graph properties.

There have been many attempts to bring Transformers into the graph domain, but so far the only effective way has been to replace some key modules (e.g., feature aggregation) in classic GNN variants with softmax attention [47, 7, 22, 48, 58, 43, 13].

  • [47] Graph attention networks. ICLR, 2018.
  • [7] Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
  • [22] Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
  • [48] Direct multi-hop attention based graph neural network. arXiv preprint arXiv:2009.14332, 2020.
  • [58] Graph-BERT: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
  • [43] Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33, 2020.
  • [13] A generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021.
For each node, self-attention only computes the semantic similarity between that node and the other nodes, without considering the structural information of the graph reflected on the nodes or the relations between node pairs.
Based on this observation, the researchers propose Graphormer for graph prediction tasks: a standard Transformer model equipped with three structural encodings (Centrality Encoding, Spatial Encoding, and Edge Encoding) that help the model encode the structural information of graph data.
  • Centrality Encoding: captures node importance in the graph. In particular, degree centrality is used for the centrality encoding: a learnable vector is assigned to each node according to its degree and added to the node features in the input layer.
  • Spatial Encoding: captures the structural relation between nodes.
  • Edge Encoding: incorporates edge features into the attention computation.
With the above encodings, the authors further prove mathematically that Graphormer has strong expressive power, since many popular GNN variants are just special cases of it.
 

2 Graphormer

2.1 Structural Encodings in Graphormer

2.1.1 Centrality Encoding

In Graphormer, we use the degree centrality, which is one of the standard centrality measures in the literature, as an additional signal to the neural network. To be specific, we develop a Centrality Encoding which assigns each node two real-valued embedding vectors according to its indegree and outdegree.
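For concreteness, here is a minimal PyTorch sketch of this degree-based encoding. The module name, the max_in_degree / max_out_degree cutoffs, and the tensor shapes are illustrative assumptions, not the official implementation:

```python
import torch.nn as nn

class CentralityEncoding(nn.Module):
    """Add learnable in-/out-degree embeddings to the input node features."""

    def __init__(self, hidden_dim, max_in_degree=512, max_out_degree=512):
        super().__init__()
        # One learnable vector per possible degree value (assumed cutoffs).
        self.z_in = nn.Embedding(max_in_degree, hidden_dim)
        self.z_out = nn.Embedding(max_out_degree, hidden_dim)

    def forward(self, x, in_degree, out_degree):
        # x: [n_nodes, hidden_dim]; in_degree, out_degree: [n_nodes] (long)
        # h_i^(0) = x_i + z^-_{deg^-(v_i)} + z^+_{deg^+(v_i)}
        return x + self.z_in(in_degree) + self.z_out(out_degree)
```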

2.1.2 Spatial Encoding

An advantage of the Transformer is its global receptive field.

Spatial Encoding:

In this paper, we choose φ(vi, vj) to be the distance of the shortest path (SPD) between vi and vj if the two nodes are connected. If not, we set the output of φ to a special value, i.e., -1. We assign each (feasible) output value a learnable scalar, which serves as a bias term in the self-attention module. Denoting A_ij as the (i,j)-element of the Query-Key product matrix A, we have:

A_ij = (h_i W_Q)(h_j W_K)^T / √d + b_φ(vi,vj)

where b_φ(vi,vj) is the learnable scalar indexed by φ(vi, vj) and shared across all layers.
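A minimal PyTorch sketch of such SPD-indexed attention biases follows; the bucketing scheme (index 0 reserved for unreachable pairs, i.e., φ = -1) and the max_dist cap are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class SpatialEncoding(nn.Module):
    """Map shortest-path distances to learnable per-head attention biases."""

    def __init__(self, num_heads, max_dist=20):
        super().__init__()
        # One learnable scalar per (distance bucket, head);
        # bucket 0 is the special "unreachable" value.
        self.b = nn.Embedding(max_dist + 1, num_heads)

    def forward(self, spd):
        # spd: [n_nodes, n_nodes] shortest-path distances, -1 if unreachable.
        idx = torch.where(
            spd < 0,
            torch.zeros_like(spd),                           # phi = -1 -> bucket 0
            (spd + 1).clamp(max=self.b.num_embeddings - 1),  # distance d -> bucket d+1
        )
        # [num_heads, n_nodes, n_nodes] bias, added to A before the softmax.
        return self.b(idx).permute(2, 0, 1)
```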

2.1.3 Edge Encoding in the Attention

In many graph tasks, edges also have structural features.

In the first method, the edge features are added to the associated nodes' features [21, 29].

  • [21] Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  • [29] DeeperGCN: All you need to train deeper GCNs. arXiv preprint arXiv:2006.07739, 2020.

In the second method, for each node, its associated edges’ features will be used together with the node features in the aggregation [15,51,25].

  • [51] How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
  • [25] Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

However, such ways of using edge features only propagate the edge information to the associated nodes, which may not be an effective way to leverage edge information in the representation of the whole graph.

Graphormer therefore proposes a new edge encoding: for each node pair (vi, vj), it takes the edges e_1, ..., e_N on the shortest path SP_ij between the two nodes, computes the dot-product of each edge feature x_{e_n} with a learnable embedding w_n^E for the n-th position, and averages the results. This average c_ij is added to A_ij as a bias term alongside the spatial encoding:

c_ij = (1/N) Σ_{n=1}^{N} x_{e_n} (w_n^E)^T
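Below is a minimal single-head PyTorch sketch of this idea. It assumes the shortest-path edge features are pre-computed and zero-padded to a fixed max_path_len; the per-head weights used in the paper are omitted for brevity:

```python
import torch
import torch.nn as nn

class EdgeEncoding(nn.Module):
    """Average edge features along shortest paths into attention biases."""

    def __init__(self, edge_dim, max_path_len=20):
        super().__init__()
        # w_n^E: one learnable weight vector per position n on the path.
        self.edge_weights = nn.Parameter(torch.randn(max_path_len, edge_dim))

    def forward(self, path_edge_feats, path_lens):
        # path_edge_feats: [n, n, max_path_len, edge_dim], zero-padded
        #                  features of the edges on SP_ij.
        # path_lens:       [n, n], number of edges N on each shortest path.
        dots = (path_edge_feats * self.edge_weights).sum(-1)  # [n, n, L]
        # c_ij = (1/N) * sum_n x_{e_n} . w_n^E
        c = dots.sum(-1) / path_lens.clamp(min=1)             # [n, n]
        return c  # added to A_ij together with the spatial bias
```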

2.2 Implementation Details of Graphormer

Graphormer Layer:

  • MHA: multi-head self-attention
  • FFN: the feed-forward blocks
  • LN: layer normalization, applied before MHA and FFN (pre-LN) instead of after
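Formally, with LN applied before the sub-layers, one Graphormer layer computes:

h'(l) = MHA(LN(h(l-1))) + h(l-1)
h(l) = FFN(LN(h'(l))) + h'(l)

A minimal pre-LN block in PyTorch might look as follows; it uses the stock nn.MultiheadAttention (with the structural biases passed in as a float attn_mask) rather than the paper's bias-augmented attention, and the dimensions are illustrative:

```python
import torch.nn as nn

class GraphormerLayer(nn.Module):
    """Pre-LN Transformer block: LayerNorm before MHA and FFN."""

    def __init__(self, hidden_dim=768, ffn_dim=768, num_heads=32):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_dim)
        self.mha = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, hidden_dim),
        )

    def forward(self, h, attn_bias=None):
        # h'(l) = MHA(LN(h(l-1))) + h(l-1)
        x = self.ln1(h)
        h = h + self.mha(x, x, x, attn_mask=attn_bias, need_weights=False)[0]
        # h(l) = FFN(LN(h'(l))) + h'(l)
        return h + self.ffn(self.ln2(h))
```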

Special Node:

A virtual node [VNode] is created and connected to every node in the graph; since this connection is not a real edge, its spatial encodings to all nodes are set to a distinct learnable scalar.
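A sketch of how such a virtual node could be prepended to the node sequence; the vnode_embedding parameter and the batched shapes are assumptions:

```python
import torch

def add_vnode(h, vnode_embedding):
    """Prepend a graph-level [VNode] token to every graph in the batch.

    h:               [batch, n_nodes, hidden_dim] node representations.
    vnode_embedding: learnable [hidden_dim] parameter.
    """
    v = vnode_embedding.expand(h.size(0), 1, -1)  # [batch, 1, hidden_dim]
    # [VNode] attends to and from all nodes; its spatial bias uses a
    # distinct learnable scalar rather than a real shortest-path distance.
    return torch.cat([v, h], dim=1)
```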

3 Experiments

3.1 OGB Large-Scale Challenge

3.2 Graph Representation

 
