Link prediction is another long-standing graph learning problem: given any pair of nodes, predict whether a link between them is currently missing or likely to form in the future.
We again use the Cora dataset, and the task is to predict whether a citation relationship exists between two papers in the citation network. We treat this as a binary classification problem:
randomly select 10% of the edges as the test set and use the remaining edges as the training set.
import itertools

import dgl
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]
u, v = g.edges()

# Shuffle the edge IDs and hold out 10% of the edges as the test set
eids = np.arange(g.num_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.num_edges() - test_size

# Positive examples: edges that actually exist in the graph
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

# Dense 2708 x 2708 adjacency matrix of the citation graph
adj = sp.coo_matrix(
    (np.ones(len(u)), (u.numpy(), v.numpy())),
    shape=(g.num_nodes(), g.num_nodes()),
)
# Entries equal to 1 here are node pairs with no edge (self-pairs excluded)
adj_neg = 1 - adj.todense() - np.eye(g.num_nodes())
neg_u, neg_v = np.where(adj_neg != 0)
neg_eids = np.random.choice(len(neg_u), g.num_edges())

# Negative examples: sampled node pairs with no edge between them
test_neg_u, test_neg_v = neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]]
train_neg_u, train_neg_v = neg_u[neg_eids[test_size:]], neg_v[neg_eids[test_size:]]

# Remove the positive test edges from the original graph
train_g = dgl.remove_edges(g, eids[:test_size])
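As a purely illustrative sanity check (not part of the original code), you can verify that the training graph is missing exactly the held-out positive edges:

# Illustrative check: train_g should contain all edges except the test split
assert train_g.num_edges() == g.num_edges() - test_size
print(train_size, test_size)  # a roughly 90/10 split of the edge set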
Build a model with two GraphSAGE layers:
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
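As a quick shape check (an illustrative snippet; the model used for training is instantiated later), the two SAGE layers map Cora's input features down to 16-dimensional node embeddings:

# Illustrative shape check for the encoder (hypothetical instance)
sage = GraphSAGE(train_g.ndata['feat'].shape[1], 16)
emb = sage(train_g, train_g.ndata['feat'])
print(emb.shape)  # torch.Size([2708, 16])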
Apply a function (e.g., an MLP or a dot product) to the representations of the two incident nodes to predict the probability that an edge exists between them:
\hat{y}_{u \sim v} = f(h_u, h_v)
In node classification we need to compute node representations, whereas in link prediction we need to compute representations of edges, i.e., of pairs of nodes.
DGL encourages treating the set of node pairs as another graph, since each pair can be described by an edge. Build one new graph from all positive examples (the positive graph) and one from all negative examples (the negative graph); both keep the same node set as the original graph, which makes passing node features between graphs convenient.
# Positive and negative graphs for training
train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.num_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.num_nodes())
# Positive and negative graphs for testing
test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.num_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.num_nodes())
The advantage of treating node pairs as a graph is that we can use the built-in DGLGraph.apply_edges method, which conveniently computes new edge features from the features of the incident nodes and the original edge features. For example, dgl.function.u_dot_v computes the dot product of the two endpoints' representations and stores it as the edge feature:
import dgl.function as fn

class DotPredictor(nn.Module):
    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            # Compute the dot product between the 'h' features of nodes u
            # and v, and store it as the 'score' feature of edge (u, v)
            g.apply_edges(fn.u_dot_v("h", "h", "score"))
            # u_dot_v returns a 1-element vector for each edge, so squeeze it
            return g.edata["score"][:, 0]
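For intuition, an illustrative call (using the hypothetical `emb` from the shape check above) returns one raw score, i.e., a logit, per edge of whatever graph it is given:

# Illustrative usage: one logit per edge of the positive training graph
scores = DotPredictor()(train_pos_g, emb)
print(scores.shape)  # (train_pos_g.num_edges(),)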
You can also implement something more elaborate yourself:
class MLPPredictor(nn.Module):
    def __init__(self, h_feats):
        super().__init__()
        self.W1 = nn.Linear(h_feats * 2, h_feats)
        self.W2 = nn.Linear(h_feats, 1)

    def apply_edges(self, edges):
        # Concatenate the 'h' features of nodes u and v,
        # then feed the result through a two-layer MLP
        h = torch.cat([edges.src["h"], edges.dst["h"]], 1)
        return {"score": self.W2(F.relu(self.W1(h))).squeeze(1)}

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            g.apply_edges(self.apply_edges)
            return g.edata["score"]
The message function passed to apply_edges has the same form as the message functions used with update_all; a hand-written equivalent of u_dot_v is sketched below.
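A minimal sketch of such a user-defined message function (the name `dot_message` is hypothetical), reproducing fn.u_dot_v by reading the source and destination node features of each edge:

def dot_message(edges):
    # Hand-written equivalent of fn.u_dot_v("h", "h", "score"):
    # per-edge dot product of the endpoints' 'h' features
    return {"score": (edges.src["h"] * edges.dst["h"]).sum(dim=1, keepdim=True)}

# g.apply_edges(dot_message) would then fill g.edata["score"]
# exactly like g.apply_edges(fn.u_dot_v("h", "h", "score"))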
Use the binary cross-entropy loss for this binary classification problem:
\mathcal{L} = -\sum_{u \sim v \in \mathcal{D}} \left( y_{u \sim v} \log\left(\hat{y}_{u \sim v}\right) + \left(1 - y_{u \sim v}\right) \log\left(1 - \hat{y}_{u \sim v}\right) \right)
Use the AUC metric to measure classification performance.
model = GraphSAGE(train_g.ndata['feat'].shape[1], 16)
# pred = MLPPredictor(16)
pred = DotPredictor()

def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    )
    return F.binary_cross_entropy_with_logits(scores, labels)
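Note that binary_cross_entropy_with_logits applies a sigmoid to the raw scores internally, so it matches the loss formula above with y-hat = sigmoid(score). A small illustrative check:

# Illustrative: the "with_logits" variant applies a sigmoid internally
s = torch.randn(4)                       # raw edge scores (logits)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])  # labels
a = F.binary_cross_entropy_with_logits(s, y)
b = F.binary_cross_entropy(torch.sigmoid(s), y)
assert torch.allclose(a, b)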
from sklearn.metrics import roc_auc_score

def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    ).numpy()
    return roc_auc_score(labels, scores)
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), pred.parameters()), lr=0.01
)

all_logits = []
for e in range(100):
    # Compute new representations for the nodes of train_g (2708 x 16)
    h = model(train_g, train_g.ndata['feat'])
    # Score the edges of the positive graph; their labels should be 1
    pos_score = pred(train_pos_g, h)
    # Score the edges of the negative graph; their labels should be 0
    neg_score = pred(train_neg_g, h)
    # Compute the loss and update the parameters
    loss = compute_loss(pos_score, neg_score)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if e % 5 == 0:
        print("In epoch {}, loss: {}".format(e, loss))

with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print('AUC', compute_auc(pos_score, neg_score))
In epoch 0, loss: 0.6929953694343567
In epoch 5, loss: 0.667258620262146
In epoch 10, loss: 0.5942424535751343
In epoch 15, loss: 0.5307831168174744
In epoch 20, loss: 0.4933420717716217
In epoch 25, loss: 0.4626922905445099
In epoch 30, loss: 0.43797579407691956
In epoch 35, loss: 0.41242456436157227
In epoch 40, loss: 0.3902527689933777
In epoch 45, loss: 0.3677825331687927
In epoch 50, loss: 0.34458068013191223
In epoch 55, loss: 0.3225224018096924
In epoch 60, loss: 0.30010712146759033
In epoch 65, loss: 0.27759820222854614
In epoch 70, loss: 0.25534024834632874
In epoch 75, loss: 0.2331291139125824
In epoch 80, loss: 0.21092979609966278
In epoch 85, loss: 0.18891064822673798
In epoch 90, loss: 0.16841943562030792
In epoch 95, loss: 0.14776507019996643
AUC 0.8631683924440152
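Once trained, the same pieces can score arbitrary candidate pairs. A minimal sketch (the node IDs and the candidate-graph construction are hypothetical, not part of the original tutorial):

# Score hypothetical candidate links, e.g. node pairs (0, 1) and (5, 42)
cand_u = torch.tensor([0, 5])
cand_v = torch.tensor([1, 42])
cand_g = dgl.graph((cand_u, cand_v), num_nodes=g.num_nodes())
with torch.no_grad():
    h = model(train_g, train_g.ndata['feat'])
    logits = pred(cand_g, h)
    probs = torch.sigmoid(logits)  # probability that each candidate link exists
print(probs)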