Graph Attention Networks
Graph Attention Networks (GAT) stack layers in which nodes attend over the features of their neighborhoods, which lets the model (implicitly) assign different weights to different nodes in a neighborhood without any costly matrix operation (such as inversion) and without requiring the graph structure to be known up front. In this way, GAT addresses several key challenges of spectral-based graph neural networks simultaneously and is readily applicable to inductive as well as transductive problems.
GAT achieves or matches state-of-the-art results on four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction (PPI) dataset in which the test graphs remain unseen during training.
The multi-head attention mechanism is illustrated in the figure below:
Left: the attention mechanism $a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j)$ employed by the model, parametrized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$ and applying a LeakyReLU activation.
Right: an illustration of multi-head attention (with $K = 3$ heads) by node 1 on its neighborhood. Different arrow styles and colors denote independent attention computations. The aggregated features from each head are concatenated or averaged to obtain $\vec{h}'_1$.
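As a concrete illustration of the final concatenate-or-average step, here is a minimal sketch; the shapes and random head outputs are assumptions for demonstration only:

```python
import torch

# Hypothetical setup: N = 5 nodes, K = 3 heads, F' = 8 output features per head
N, K, F_out = 5, 3, 8
head_outputs = [torch.randn(N, F_out) for _ in range(K)]   # aggregated features per head

# Hidden layers: concatenate the K head outputs -> [N, K * F']
h_concat = torch.cat(head_outputs, dim=-1)                  # shape [5, 24]

# Final (prediction) layer: average the K head outputs -> [N, F']
h_avg = torch.stack(head_outputs, dim=0).mean(dim=0)        # shape [5, 8]
```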
t-SNE visualization of the results on the Cora dataset:
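A minimal sketch of how such a visualization could be produced; `embeddings` and `labels` are assumed to hold the hidden features of a trained model and the class of each Cora node, and scikit-learn plus matplotlib are used here, which the original post does not specify:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: [num_nodes, hidden_dim] tensor from a trained GAT layer
# labels:     [num_nodes] class indices of the Cora nodes
coords = TSNE(n_components=2).fit_transform(embeddings.detach().cpu().numpy())
plt.scatter(coords[:, 0], coords[:, 1], c=labels.cpu().numpy(), s=5, cmap='tab10')
plt.title('t-SNE of GAT hidden features on Cora')
plt.show()
```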
Algorithm: PyTorch Geometric provides an open-source implementation:
torch_geometric.nn — pytorch_geometric documentation (pytorch-geometric.readthedocs.io)
The graph attentional operator from the “Graph Attention Networks” paper:
$$\mathbf{x}^{\prime}_i = \alpha_{i,i}\mathbf{\Theta}\mathbf{x}_{i} + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\mathbf{\Theta}\mathbf{x}_{j},$$
where the attention coefficients $\alpha_{i,j}$ are computed as
$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top} [\mathbf{\Theta}\mathbf{x}_i \, \Vert \, \mathbf{\Theta}\mathbf{x}_j]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{ i \}} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top} [\mathbf{\Theta}\mathbf{x}_i \, \Vert \, \mathbf{\Theta}\mathbf{x}_k]\right)\right)}.$$
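A minimal, dense (adjacency-matrix based) sketch of these two formulas for a single attention head; the function name, shapes, and the $[N, N]$ adjacency formulation are illustrative assumptions, not how PyTorch Geometric implements the layer (PyG uses sparse message passing):

```python
import torch
import torch.nn.functional as F

def gat_layer(x, adj, Theta, a, negative_slope=0.2):
    """Single-head GAT layer on a dense adjacency matrix.
    x:     [N, F_in]     input node features
    adj:   [N, N]        0/1 adjacency matrix with self-loops already added
    Theta: [F_in, F_out] shared linear transformation
    a:     [2 * F_out]   attention vector for [Θx_i || Θx_j]"""
    z = x @ Theta                                   # Θ x_i for every node, [N, F_out]
    F_out = z.size(1)
    # e_{i,j} = LeakyReLU(a^T [Θx_i || Θx_j]); the concatenation splits into two dot products
    e = F.leaky_relu(
        (z @ a[:F_out]).unsqueeze(1) + (z @ a[F_out:]).unsqueeze(0),
        negative_slope,
    )                                               # [N, N]
    # softmax over j ∈ N(i) ∪ {i} only: mask non-edges before normalizing
    e = e.masked_fill(adj == 0, float('-inf'))
    alpha = torch.softmax(e, dim=1)                 # α_{i,j}, each row sums to 1
    return alpha @ z                                # x'_i = α_{i,i} Θx_i + Σ_j α_{i,j} Θx_j
```

For example, `out = gat_layer(x, adj, torch.randn(F_in, F_out), torch.randn(2 * F_out))` produces the updated features for all nodes at once.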
If the graph has multi-dimensional edge features $\mathbf{e}_{i,j}$, the attention coefficients $\alpha_{i,j}$ are computed as
$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top} [\mathbf{\Theta}\mathbf{x}_i \, \Vert \, \mathbf{\Theta}\mathbf{x}_j \, \Vert \, \mathbf{\Theta}_{e}\mathbf{e}_{i,j}]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{ i \}} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top} [\mathbf{\Theta}\mathbf{x}_i \, \Vert \, \mathbf{\Theta}\mathbf{x}_k \, \Vert \, \mathbf{\Theta}_{e}\mathbf{e}_{i,k}]\right)\right)}.$$
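In the edge-feature variant only the un-normalized score changes: a third block $\mathbf{\Theta}_e\mathbf{e}_{i,j}$ is concatenated, so the attention vector $\mathbf{a}$ now has length $3F'$. A hedged sketch in the same dense style as above (`gat_scores_with_edges` and the $[N, N, D]$ edge-feature tensor are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def gat_scores_with_edges(z, edge_feat, Theta_e, a, negative_slope=0.2):
    """Un-normalized attention scores with edge features.
    z:         [N, F_out]    transformed node features (Θ x_i)
    edge_feat: [N, N, D]     edge features e_{i,j} (zeros where no edge exists)
    Theta_e:   [D, F_out]    edge-feature weight matrix
    a:         [3 * F_out]   attention vector for [Θx_i || Θx_j || Θ_e e_{i,j}]"""
    F_out = z.size(1)
    ze = edge_feat @ Theta_e                              # Θ_e e_{i,j}, [N, N, F_out]
    e = (z @ a[:F_out]).unsqueeze(1) \
        + (z @ a[F_out:2 * F_out]).unsqueeze(0) \
        + ze @ a[2 * F_out:]                              # a^T [· || · || ·], [N, N]
    return F.leaky_relu(e, negative_slope)                # masked softmax follows as before
```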
PARAMETERS
- in_channels (int or tuple) – Size of each input sample, or `-1` to derive the size from the first input(s) to the forward method. A tuple corresponds to the sizes of source and target dimensionalities.
- out_channels (int) – Size of each output sample.
- heads (int, optional) – Number of multi-head-attentions. (default: `1`)
- concat (bool, optional) – If set to `False`, the multi-head attentions are averaged instead of concatenated. (default: `True`)
- negative_slope (float, optional) – LeakyReLU angle of the negative slope. (default: `0.2`)
- dropout (float, optional) – Dropout probability of the normalized attention coefficients which exposes each node to a stochastically sampled neighborhood during training. (default: `0`)
- add_self_loops (bool, optional) – If set to `False`, will not add self-loops to the input graph. (default: `True`)
- edge_dim (int, optional) – Edge feature dimensionality (in case there are any). (default: `None`)
- fill_value (float or Tensor or str, optional) – The way to generate edge features of self-loops (in case `edge_dim != None`). If given as `float` or `torch.Tensor`, edge features of self-loops will be directly given by `fill_value`. If given as `str`, edge features of self-loops are computed by aggregating all features of edges that point to the specific node, according to a reduce operation (`"add"`, `"mean"`, `"min"`, `"max"`, `"mul"`). (default: `"mean"`)
- bias (bool, optional) – If set to `False`, the layer will not learn an additive bias. (default: `True`)
- `**kwargs` (optional) – Additional arguments of `conv.MessagePassing`.

The corresponding constructor and `forward` signatures are:

```python
def __init__(
    self,
    in_channels: Union[int, Tuple[int, int]],
    out_channels: int,
    heads: int = 1,
    concat: bool = True,
    negative_slope: float = 0.2,
    dropout: float = 0.0,
    add_self_loops: bool = True,
    edge_dim: Optional[int] = None,
    fill_value: Union[float, Tensor, str] = 'mean',
    bias: bool = True,
    **kwargs,
):
    ...

def forward(
    self,
    x: Union[Tensor, OptPairTensor],
    edge_index: Adj,
    edge_attr: OptTensor = None,
    size: Size = None,
    return_attention_weights=None,
):
    ...
```
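A short usage sketch of this layer; the two-layer architecture, hidden size, and dropout rate below follow a common Cora-style setup and are illustrative choices rather than something prescribed by the documentation:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_channels: int, hidden_channels: int, num_classes: int):
        super().__init__()
        # hidden layer: 8 attention heads whose outputs are concatenated
        self.conv1 = GATConv(in_channels, hidden_channels, heads=8, dropout=0.6)
        # output layer: a single averaged head (concat=False)
        self.conv2 = GATConv(8 * hidden_channels, num_classes, heads=1,
                             concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        return self.conv2(x, edge_index)
```

Calling the layer with `return_attention_weights=True` additionally returns the attention coefficients together with the corresponding `edge_index`; with edge features, pass `edge_dim` at construction time and `edge_attr` to `forward`.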
Note: the paper distinguishes between transductive and inductive learning.
The following explanation is adapted from the CSDN blog post "转导学习 transductive learning" by TBYourHero (see references).
Induction is reasoning from observed training instances to general rules, which are then applied to test instances. Transduction is reasoning from observed, specific (training) instances to specific (test) instances.
What is the difference?
The main difference is that in transductive learning, both the training set and the test set are already available while the model is being trained. Inductive learning, by contrast, sees only the training data during training and applies the learned model to data it has never encountered.
Transduction does not build a predictive model: if a new data point is added to the test set, the whole algorithm has to be rerun from scratch to retrain the model and then predict the labels. Inductive learning, on the other hand, does build a predictive model, so new data points can be handled without rerunning the algorithm from scratch.
Simply put, inductive learning tries to build a general model in which any new data point is predicted from a set of observed training points; it can predict any point in the space, not only the currently unlabeled ones. Transductive learning instead builds a model that fits only the training and test points it has already observed, using the known labels and the additional information in the unlabeled points to predict their labels.
When new data points arrive, transductive learning can therefore be expensive, since everything has to be rerun each time a new point appears. Inductive learning builds a predictive model up front, so new points can be labeled quickly with little computation.
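In the graph setting this distinction maps onto how the benchmark datasets are loaded; a minimal sketch, assuming PyTorch Geometric's `Planetoid` (Cora) and `PPI` dataset wrappers:

```python
from torch_geometric.datasets import Planetoid, PPI

# Transductive (Cora): one graph containing every node, including test nodes;
# only the labels selected by train_mask contribute to the training loss.
cora = Planetoid(root='data/Cora', name='Cora')[0]
print(int(cora.train_mask.sum()), 'training nodes out of', cora.num_nodes)

# Inductive (PPI): the test graphs are entirely unseen during training.
train_graphs = PPI(root='data/PPI', split='train')
test_graphs = PPI(root='data/PPI', split='test')
print(len(train_graphs), 'training graphs,', len(test_graphs), 'held-out test graphs')
```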
References:
转导学习 transductive learning (TBYourHero的博客, CSDN)
torch_geometric.nn — pytorch_geometric documentation (pytorch-geometric.readthedocs.io)
[1710.10903] Graph Attention Networks (arxiv.org)