Do Transformers Really Perform Badfor Graph Representation?
1 Introduction
作者們發現關鍵問題在於如何補回Transformer模型的自注意力層丟失掉的圖結構資訊!不同於序列資料(NLP, Speech)或網格資料(CV),圖的結構資訊是圖資料特有的屬性,且對圖的性質預測起著重要的作用。
There are many attempts of leveraging Transformer into the graph domain, but the only effective way is replacing some key modules (e.g., feature aggregation) in classic GNN variants by the softmax attention[47,7,22,48,58,43,13]
- [47] Graph attention networks. ICLR, 2018.
- [7] Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
- [22] Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
- [48] Direct multi-hop attention based graph neural network.arXiv preprint arXiv:2009.14332, 2020.
- [58] Graph-bert: Only attention is needed forlearning graph representations.arXiv preprint arXiv:2001.05140, 2020.
- [43] Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information ProcessingSystems, 33, 2020.
- [13] generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021
- Centrality Encoding: capture the node importance in the graph. In particular, we leverage the degree centrality for the centrality encoding, where a learnable vectoris assigned to each node according to its degree and added to the node features in the input layer.
- Spatial Encoding: capture the structural relation between nodes.
- Edge Encoding
2 Graphormer
2.1 Structural Encodings in Graphormer
2.1.1 a Centrality Encoding
In Graphormer, we use the degree centrality, which is one of the standard centrality measures inliterature, as an additional signal to the neural network. To be specific, we develop a Centrality Encoding which assigns each node two real-valued embedding vectors according to its indegree and outdegree.
2.1.2 a Centrality Encoding
An advantage of Transformer is its global receptive field.
Spatial Encoding:
In this paper, we choose φ(vi,vj) to be the distance of the shortest path (SPD) between vi and vj if the two nodes are connected. If not, we set the output ofφto be a special value, i.e., -1. We assign each (feasible) output value a learnable scalar which will serve as a bias term in the self-attention module. Denote Aij as the (i,j)-element of the Query-Key product matrix A, we have:
2.1.3 Edge Encoding in the Attention
In many graph tasks, edges also have structural features.
In the first method, the edge features areadded to the associated nodes’ features [21,29].
- [21] Open graph benchmark: Datasets for machine learning on graphs.arXiv preprintarXiv:2005.00687, 2020.
- [29] Deepergcn: All you need to train deepergcns.arXiv preprint arXiv:2006.07739, 2020
In the second method, for each node, its associated edges’ features will be used together with the node features in the aggregation [15,51,25].
- [51] How powerful are graph neural networks?InInternational Conference on Learning Representations, 2019.
- [25] Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016
However, such ways of using edge feature only propagate the edge information to its associated nodes, which may not be an effective way to leverage edge information in representation of the whole graph.
a new edge encoding method in Graphormer:
3.2 Implementation Details of Graphormer
Graphormer Layer:
- MHA: multi-head self-attention (MHA)
- FFN: the feed-forward blocks
- LN: the layer normalization
Special Node:
生成一個VNODE連線圖中所有的點,而它與所有節點的 spatial encodings 是 a distinct learnable scalar
3 Experiments
3.1 OGB Large-Scale Challenge
3.2 Graph Representation