Graph Attention Networks
Authors: Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio
Outline
- Goal
- Challenges
- Method
- Discussion (advantages)
- Experiments
- Future work (limitations)
- Summary
Goal
- Build a model for node classification on graph-structured data
- Compute a hidden representation of each node by attending over its neighbors
Challenges
- Limitations of existing work:
  - Depends on the specific graph structure, so it cannot be applied directly to unseen graphs
  - Or samples a fixed-size neighborhood of each node (e.g., GraphSAGE)
- Contributions of the proposed model:
  - Generalizes to completely unseen graphs
  - Applies to nodes with different degrees
Method: graph attention layer
- Input: a set of node features
- Output: a new set of node features that incorporates neighborhood information
- Attention coefficient $e_{ij}$ indicates the importance of node $j$'s features to node $i$
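Written out in the paper's notation, the layer maps $F$-dimensional input features to $F'$-dimensional output features, and the unnormalized coefficients come from a shared attentional mechanism $a$:

```latex
\mathbf{h} = \{\vec{h}_1, \dots, \vec{h}_N\}, \; \vec{h}_i \in \mathbb{R}^{F}
\;\longrightarrow\;
\mathbf{h}' = \{\vec{h}'_1, \dots, \vec{h}'_N\}, \; \vec{h}'_i \in \mathbb{R}^{F'},
\qquad
e_{ij} = a\left(\mathbf{W}\vec{h}_i,\, \mathbf{W}\vec{h}_j\right)
```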
Attention coefficients
- $\mathbf{W}$ is a shared weight matrix (a linear transformation); $a$ is a shared attentional mechanism
- For node $i$, attention is computed over its first-order neighbors (including $i$ itself)
- After the LeakyReLU nonlinearity and softmax normalization, the coefficients are
  $$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T}[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T}[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_k]\right)\right)}$$
- $\Vert$ denotes the concatenation operation
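To make the masked-softmax computation concrete, here is a minimal NumPy sketch of these coefficients. The function and variable names are illustrative assumptions of this summary, not taken from the authors' implementation:

```python
# Minimal NumPy sketch of the (single-head) attention coefficients above.
# Names are illustrative, not from the authors' code.
import numpy as np

def attention_coefficients(H, W, a, adj):
    """H: (N, F) node features; W: (F, F') weight matrix;
    a: (2F',) attention vector; adj: (N, N) adjacency with self-loops.
    Returns alpha: (N, N); alpha[i, j] is nonzero only for j in N_i."""
    Wh = H @ W                                    # transformed features, (N, F')
    Fp = Wh.shape[1]
    # a^T [Wh_i || Wh_j] splits into two dot products, broadcast to (N, N)
    s = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
    e = np.where(s > 0, s, 0.2 * s)               # LeakyReLU, negative slope 0.2
    e = np.where(adj > 0, e, -1e9)                # mask: attend only to neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    return alpha / alpha.sum(axis=1, keepdims=True)   # softmax over each row
```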
Multi-head attention
- A single attention head produces, for node $i$,
  $$\vec{h}'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \vec{h}_j\right)$$
- With $K$ independent attention heads, the head outputs are concatenated:
  $$\vec{h}'_i = \big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \vec{h}_j\right)$$
- For the final (prediction) layer, concatenation is replaced with averaging:
  $$\vec{h}'_i = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \vec{h}_j\right)$$
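The per-head aggregation and the concatenate-vs-average choice can be sketched the same way. Again, this is a NumPy illustration with hypothetical names, not the authors' implementation:

```python
# Minimal NumPy sketch of multi-head aggregation (illustrative names only).
import numpy as np

def gat_head(H, W, a, adj):
    """One attention head: returns sum_j alpha_ij * W h_j for every node i."""
    Wh = H @ W
    Fp = Wh.shape[1]
    s = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
    e = np.where(s > 0, s, 0.2 * s)               # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                # neighbors only
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ Wh                             # weighted neighborhood sum, (N, F')

def multi_head_layer(H, Ws, As, adj, final=False):
    """Ws, As: lists of K per-head parameters."""
    outs = [gat_head(H, W, a, adj) for W, a in zip(Ws, As)]
    if final:                                     # output layer: average heads;
        return np.mean(outs, axis=0)              # apply softmax/sigmoid afterwards
    out = np.concatenate(outs, axis=1)            # hidden layers: concatenate heads
    # ELU is elementwise, so applying it after concatenation is equivalent
    # to applying sigma per head before concatenating
    return np.where(out > 0, out, np.exp(out) - 1)
```

Setting `final=True` corresponds to the paper's output layer, where head outputs are averaged before the final softmax or sigmoid.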
Advantages
- Computationally efficient (parallelizable); the time complexity of a single head is $O(|V| F F' + |E| F')$
- Assigns different weights to nodes within the same neighborhood
- Does not require the graph to be undirected
- Operates on the entire neighborhood (no sampling needed)
- Does not assume any ordering of the neighboring nodes
Datasets for the experiments
- Tested on four datasets: the Cora, Citeseer, and Pubmed citation networks (transductive) and the PPI protein-protein interaction dataset (inductive)
Experiment setup and evaluation metrics
- Transductive learning: a two-layer GAT model
  - Cora and Citeseer:
    - First layer: K = 8, F' = 8, ELU (exponential linear unit)
    - Second layer (classifier): K = 1, F' = number of classes, softmax
  - Pubmed: the only change is K = 8 in the classification layer
  - Metric: mean classification accuracy
- Inductive learning: a three-layer GAT model
  - First two layers: K = 4, F' = 256, ELU (exponential linear unit)
  - Third layer (classifier): K = 6, F' = 121, logistic sigmoid
  - Metric: micro-averaged F1
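To make the setup above easier to scan, here is the same configuration restated as plain Python dicts. This is a summary device of this write-up, not the authors' configuration format:

```python
# The experimental architectures above, restated as plain Python dicts
# (a summary device, not the authors' configuration format).
transductive_gat = {                 # Cora / Citeseer: two layers
    "layer1": {"K": 8, "F_out": 8, "activation": "ELU"},
    "layer2": {"K": 1, "F_out": "num_classes", "activation": "softmax"},
}
pubmed_gat = {                       # Pubmed: same, but K = 8 output heads (averaged)
    "layer1": {"K": 8, "F_out": 8, "activation": "ELU"},
    "layer2": {"K": 8, "F_out": "num_classes", "activation": "softmax"},
}
inductive_gat = {                    # PPI: three layers
    "layer1": {"K": 4, "F_out": 256, "activation": "ELU"},
    "layer2": {"K": 4, "F_out": 256, "activation": "ELU"},
    "layer3": {"K": 6, "F_out": 121, "activation": "logistic sigmoid"},
}
```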
Experiment results (mean classification accuracy)
- [Results table not reproduced here] GAT matches or outperforms the state of the art (e.g., GCN) on Cora, Citeseer, and Pubmed
Experiment results (micro-averaged F1)
- [Results table not reproduced here] GAT outperforms the GraphSAGE baselines on the inductive PPI benchmark
Future work
- In practice, handle larger batch sizes, since the current implementation only supports sparse matrix multiplication for rank-2 tensors
- Use the attention weights over neighboring nodes for model interpretability
- Incorporate edge features
Summary
- Graph attention networks, built from graph attention layers
- Deal with different-sized neighborhoods
- Do not need to know the entire graph structure upfront