当前位置: 首页 > 工具软件 > DGL > 使用案例 >

【DGL】链接预测

谯德佑
2023-12-01

概述

链接预测任务也是一个长期存在的图学习问题,其目的是预测任何一对节点之间现在缺失或未来可能形成的链接。
这里仍然使用Cora数据集,在论文的引用网络中预测两篇论文之间是否存在引用关系。将其视为一个二分类问题:

  • 图中存在的边视为positive examples
  • 图中不存在的边,抽样出一部分视为negative examples
  • 将positive examples和negative examples划分出训练集和测试集
  • 采用AUC作评估指标

准备数据集

随机选择边数的10%当作测试集,其他用作训练集

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

u, v = g.edges()

eids = np.arange(g.num_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.num_edges()-test_size
#positive examples
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

#[u', v'] = 1  2708*2708
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
# 2708*2708 空边
adj_neg = 1-adj.todense()-np.eye(g.num_nodes())

neg_u, neg_v = np.where(adj_neg!=0)
neg_eids = np.random.choice(len(neg_u), g.number_of_edges())

#negative examples
test_neg_u, test_neg_v = (neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]])
train_neg_u, train_neg_v = (neg_u[neg_eids[test_size:]], neg_v[neg_eids[test_size:]])

#从原图中删去用作测试集当中的positive examples
train_g = dgl.remove_edges(g, eids[:test_size])

搭建模型

搭建一个含两层GraphSAGE的模型

class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

利用函数(比如MLP、点积)来计算两个相关节点的表示来预测它们之间边的存在概率:
y ^ u ∼ v = f ( h u , h v ) \hat{y}_{u\sim v} =f(h_u, h_v) y^uv=f(hu,hv)

构建新图

在节点分类问题中需要计算节点表示,而在链接预测问题中需要计算边(即一对节点)的表示。
在DGL中,鼓励我们将一对节点视作另一个图,因为可以用一条边来描述。将所有positive examples构建成一个新图positive graph,将所有negative examples构建成一个新图negative graph,使得在传递节点特征时更加便捷,它们的节点数和原图的节点数保持一致。

#positive graph
train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.number_of_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.number_of_nodes())
#negative graph
test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.number_of_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.number_of_nodes())

apply_edges

将一对节点视作一个图的优势是可以使用内置的DGLGraph.apply_edges方法,根据相关节点特征和原来的边特征来便捷地计算新的边特征。
比如dgl.function.u_dot_v,计算节点间的点积当作边的特征:

import dgl.function as fn
class DotPredictor(nn.Module):
    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            #计算节点u的'h'特征和节点v的'h'特征的点积,来当作边u_v的'score'特征
            g.apply_edges(fn.u_dot_v("h", "h", "score"))
            # u_dot_v returns a 1-element vector for each edge so you need to squeeze it.
            return g.edata["score"][:, 0]

可以自己实现更复杂的形式:

class MLPPredictor(nn.Module):
    def __init__(self, h_feats):
        super().__init__()
        self.W1 = nn.Linear(h_feats * 2, h_feats)
        self.W2 = nn.Linear(h_feats, 1)

    def apply_edges(self, edges):#将节点u的'h'特征和节点v的'h'特征拼接起来然后传入MLP
        h = torch.cat([edges.src["h"], edges.dst["h"]], 1)
        return {"score": self.W2(F.relu(self.W1(h))).squeeze(1)}

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            g.apply_edges(self.apply_edges)
            return g.edata["score"]

apply_edges和update_all中的消息传递函数具有相同的形式

训练

损失函数

采用二分类的交叉熵:
L = − ∑ u ∼ v ∈ D ( y u ∼ v log ⁡ ( y ^ u ∼ v ) + ( 1 − y u ∼ v ) log ⁡ ( 1 − y ^ u ∼ v ) ) ) \mathcal{L}=-\sum_{u \sim v \in \mathcal{D}}\left(y_{u \sim v} \log \left(\hat{y}_{u \sim v}\right)+\left(1-y_{u \sim v}\right) \log \left(1-\hat{y}_{u \sim v}\right)\right)) L=uvD(yuvlog(y^uv)+(1yuv)log(1y^uv)))
利用AUC指标衡量分类效果

model = GraphSAGE(train_g.ndata['feat'].shape[1], 16)
# pred = MLPPredictor(16)
pred = DotPredictor()
def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    )
    return F.binary_cross_entropy_with_logits(scores, labels)

from sklearn.metrics import roc_auc_score
def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    ).numpy()
    return roc_auc_score(labels, scores)

训练过程

optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), pred.parameters()), lr=0.01
)
all_logits = []
for e in range(100):
    #计算train_g中节点的新的表征  2708*16
    h = model(train_g, train_g.ndata['feat'])
    #计算positive graph中边的得分  label应该为1
    pos_score = pred(train_pos_g, h)
    #计算positive graph中边的得分  label应该为1
    neg_score = pred(train_neg_g, h)
    #计算损失函数
    loss = compute_loss(pos_score, neg_score)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if e%5==0:
        print("In epoch {}, loss: {}".format(e, loss))

with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print('AUC', compute_auc(pos_score, neg_score))
In epoch 0, loss: 0.6929953694343567
In epoch 5, loss: 0.667258620262146
In epoch 10, loss: 0.5942424535751343
In epoch 15, loss: 0.5307831168174744
In epoch 20, loss: 0.4933420717716217
In epoch 25, loss: 0.4626922905445099
In epoch 30, loss: 0.43797579407691956
In epoch 35, loss: 0.41242456436157227
In epoch 40, loss: 0.3902527689933777
In epoch 45, loss: 0.3677825331687927
In epoch 50, loss: 0.34458068013191223
In epoch 55, loss: 0.3225224018096924
In epoch 60, loss: 0.30010712146759033
In epoch 65, loss: 0.27759820222854614
In epoch 70, loss: 0.25534024834632874
In epoch 75, loss: 0.2331291139125824
In epoch 80, loss: 0.21092979609966278
In epoch 85, loss: 0.18891064822673798
In epoch 90, loss: 0.16841943562030792
In epoch 95, loss: 0.14776507019996643
AUC 0.8631683924440152

参考

https://docs.dgl.ai/tutorials/blitz/4_link_predict.html

 类似资料: