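Early on I studied graph convolutional networks with the DGL library and wrote a GCN demo. Later, PyTorch's geometric extension library came out, and I noticed that many academic papers are implemented on top of PyG, so here I learn how to use it. PyG is short for PyTorch Geometric.
Both libraries are very practical, so how should one choose between PyG and DGL? Neither is strictly better. Judging only from the tooling ecosystem, the GitHub fork and star counts of the two libraries suggest that PyG is somewhat more actively maintained and more popular.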
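Installing PyG
PyG (PyTorch-Geometric) is a library built on top of PyTorch for graph-structured data. It speeds up the computations in graph learning algorithms, for example on sparse graphs.
Reference: https://github.com/rusty1s/pytorch_geometric
For PyTorch 1.5.0, install the following libraries: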
    $ pip install torch-scatter==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.5.0.html
    $ pip install torch-sparse==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.5.0.html
    $ pip install torch-cluster==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.5.0.html
    $ pip install torch-spline-conv==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.5.0.html
    $ pip install torch-geometric
Here ${CUDA} depends on your local PyTorch installation and can be replaced with cpu, cu92, cu101, or cu102.
If all of the commands above complete, torch_geometric has been installed successfully.
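One way to pick the value for ${CUDA} is to derive it from the CUDA version that the local PyTorch build reports. The sketch below is a hypothetical helper (`cuda_suffix` is not part of PyG or PyTorch). It assumes the version string looks like `'10.1'`, which is the form `torch.version.cuda` takes on CUDA builds, and that it is `None` on CPU-only builds.

```python
def cuda_suffix(cuda_version):
    """Map a CUDA version string such as '10.1' to a wheel suffix such as 'cu101'.

    Hypothetical helper for illustration; pass torch.version.cuda as the argument.
    """
    if cuda_version is None:
        # CPU-only PyTorch builds report no CUDA version.
        return "cpu"
    major, minor = cuda_version.split(".")[:2]
    return "cu{}{}".format(major, minor)


print(cuda_suffix("10.1"))  # cu101
print(cuda_suffix("9.2"))   # cu92
print(cuda_suffix(None))    # cpu
```

The returned string can then be substituted for ${CUDA} in the pip commands, provided it is one of the suffixes listed above.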
Problem
In theory everything should work once the installation finishes, but importing torch_geometric raised an error:
    RuntimeError: Tried to access nonexistent attribute or method 'true_divide_' of type 'Tensor'.:
      File "/usr/local/lib/python3.6/dist-packages/torch_scatter/scatter.py", line 53
            count = broadcast(count, out, dim)
            if torch.is_floating_point(out):
                out.true_divide_(count)
                ~~~~~~~~~~~~~~~~ <--- HERE
            else:
                out.floor_divide_(count)
The problem is still unresolved; I have filed an issue: https://github.com/rusty1s/pytorch_scatter/issues/140
Getting Started
Following the official tutorial: https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html
This post walks through the official examples, covering:
Data Handling of Graphs
Common Benchmark Datasets
Mini-batches
Data Transforms
Learning Methods on Graphs
The experiments below are easier to follow in Jupyter.
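Data Handling of Graphs
A graph models the entities (nodes) and relations (edges) of a real-world system. In PyG a single graph is usually stored as an instance of torch_geometric.data.Data, which has the following attributes:
data.x: node feature matrix with shape [num_nodes, num_node_features]
data.edge_index: edges stored in COO format, with shape [2, num_edges]
data.edge_attr: edge attributes, with shape [num_edges, num_edge_features]
data.y: training targets of arbitrary shape, for example [num_nodes, *] for node-level targets or [1, *] for graph-level targets
data.pos: node position matrix with shape [num_nodes, num_dimensions]
The code below illustrates these attributes on an unweighted, undirected graph with three nodes. Each undirected edge is stored in both directions, so its two edges give four entries in edge_index: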
    import torch
    from torch_geometric.data import Data

    edge_index = torch.tensor([[0, 1, 1, 2],
                               [1, 0, 2, 1]], dtype=torch.long)
    x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

    data = Data(x=x, edge_index=edge_index)
Note: if the edges are given as a list of node pairs rather than in COO format, transpose the tensor with t() and then call contiguous(). For the example above:
    edge_index = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 1]], dtype=torch.long)
    edge_index.t().contiguous()
tensor([[0, 1, 1, 2],
[1, 0, 2, 1]])
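What the transpose does can be seen without PyTorch: turning a list of (source, target) pairs into two parallel rows gives exactly the COO layout. A minimal pure-Python sketch (`pairs_to_coo` is a hypothetical helper, not a PyG function):

```python
def pairs_to_coo(pairs):
    """Turn [(src, dst), ...] into [[src, ...], [dst, ...]] (COO layout)."""
    sources = [src for src, _ in pairs]
    targets = [dst for _, dst in pairs]
    return [sources, targets]


pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(pairs_to_coo(pairs))  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```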
Data(edge_index=[2, 4], x=[3, 1])
tensor([[-1.],
[ 0.],
[ 1.]])
tensor([[0, 1, 1, 2],
[1, 0, 2, 1]])
    for key, item in data:
        print(key, item)
edge_index tensor([[0, 1, 1, 2],
[1, 0, 2, 1]])
x tensor([[-1.],
[ 0.],
[ 1.]])
For this graph, the node count, edge count, and number of features per node are:
3
4
1
    data.contains_isolated_nodes()
False
    data.contains_self_loops()
False
False
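Structural checks like these are easy to state directly in terms of edge_index. The pure-Python sketch below re-implements three of them for illustration only; the function names are hypothetical, and PyG's own versions work on tensors and may differ in details.

```python
def has_isolated_nodes(edge_index, num_nodes):
    """A node is isolated if no edge (other than a self-loop) touches it."""
    touched = set()
    for src, dst in zip(edge_index[0], edge_index[1]):
        if src != dst:
            touched.update((src, dst))
    return len(touched) < num_nodes


def has_self_loops(edge_index):
    """A self-loop is an edge whose source equals its target."""
    return any(src == dst for src, dst in zip(edge_index[0], edge_index[1]))


def is_directed(edge_index):
    """The graph is directed if some edge (i, j) lacks the reverse edge (j, i)."""
    edges = set(zip(edge_index[0], edge_index[1]))
    return any((dst, src) not in edges for src, dst in edges)


edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
print(has_isolated_nodes(edge_index, num_nodes=3))  # False
print(has_self_loops(edge_index))                   # False
print(is_directed(edge_index))                      # False
```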
    # Move the data object to the GPU when one is available.
    if torch.cuda.is_available():
        data = data.to(torch.device('cuda'))
Common Benchmark Datasets
Planetoid datasets
Graph classification datasets
QM7 and QM9 datasets
3D mesh/point cloud datasets
TUDataset
    from torch_geometric.datasets import TUDataset

    dataset = TUDataset(root='./data', name='ENZYMES')
600
6
    dataset.num_node_features
3
    dataset.num_edge_attributes
0
Data(edge_index=[2, 168], x=[37, 3], y=[1])
False
Split the dataset 90/10 into training and test sets:
    train_dataset = dataset[:540]
    test_dataset = dataset[540:]
    train_dataset
ENZYMES(540)
Dataset permutation
    dataset = dataset.shuffle()
    dataset[0]
Data(edge_index=[2, 104], x=[30, 3], y=[1])
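If the stored order of a dataset might correlate with its labels, it is safer to permute before splitting rather than after. Below is a sketch of a shuffled 90/10 split over indices using only the standard library; the helper name and the fixed seed are assumptions made for illustration.

```python
import random


def shuffled_split(num_items, train_percent=90, seed=0):
    """Return (train_indices, test_indices) after a seeded random permutation."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    cut = num_items * train_percent // 100  # integer arithmetic avoids rounding surprises
    return indices[:cut], indices[cut:]


train_idx, test_idx = shuffled_split(600)
print(len(train_idx), len(test_idx))  # 540 60
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models.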
Cora dataset
    from torch_geometric.datasets import Planetoid

    dataset = Planetoid(root='./data', name='Cora')
1
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
True
tensor([ True, True, True, ..., False, False, False])
    data.train_mask.sum().item()
140
    data.val_mask.sum().item()
500
    data.test_mask.sum().item()
1000
tensor([3, 4, 4, ..., 3, 3, 3])
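train_mask, val_mask, and test_mask are boolean vectors over all 2708 nodes, and the counts above are simply their numbers of True entries. The sketch below builds illustrative masks in plain Python; the index ranges and the placeholder labels are assumptions chosen to reproduce the counts, not values read from the dataset.

```python
num_nodes = 2708

# Assumed layout for illustration: first 140 train, next 500 val, last 1000 test.
train_mask = [i < 140 for i in range(num_nodes)]
val_mask = [140 <= i < 640 for i in range(num_nodes)]
test_mask = [i >= num_nodes - 1000 for i in range(num_nodes)]

# Summing booleans counts the True entries.
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 140 500 1000

# A mask selects rows: keep only the labels of the training nodes.
labels = [i % 7 for i in range(num_nodes)]  # placeholder labels
train_labels = [y for y, keep in zip(labels, train_mask) if keep]
print(len(train_labels))  # 140
```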
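train_mask: the nodes used for training
val_mask: the nodes used for validation
test_mask: the nodes used for testing
Mini-batches
Neural networks are usually trained in batches. PyG parallelizes over a mini-batch by building a sparse block-diagonal adjacency matrix: the adjacency matrix of each graph, defined by the edge_index of its Data instance, is placed on the diagonal, and the node feature matrices of all graphs are concatenated row-wise. This lets graphs with different numbers of nodes and edges be trained together. PyG's torch_geometric.data.DataLoader takes care of this concatenation.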
$$
\mathbf{A}=\left[\begin{array}{ccc} \mathbf{A}_{1} & & \\ & \ddots & \\ & & \mathbf{A}_{n} \end{array}\right], \quad \mathbf{X}=\left[\begin{array}{c} \mathbf{X}_{1} \\ \vdots \\ \mathbf{X}_{n} \end{array}\right], \quad \mathbf{Y}=\left[\begin{array}{c} \mathbf{Y}_{1} \\ \vdots \\ \mathbf{Y}_{n} \end{array}\right]
$$
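In terms of edge_index, the block-diagonal layout amounts to shifting each graph's node indices by the number of nodes that came before it and then concatenating everything. A dependency-free sketch under that reading (`collate` is a hypothetical helper; PyG's DataLoader does the equivalent on tensors):

```python
def collate(graphs):
    """Merge several graphs into one large disconnected graph.

    Each graph is a dict with 'x' (a list of node feature rows) and
    'edge_index' (two lists: sources and targets).
    """
    x, sources, targets = [], [], []
    offset = 0
    for graph in graphs:
        x.extend(graph["x"])
        # Shift node indices so they point into the concatenated feature matrix.
        sources.extend(src + offset for src in graph["edge_index"][0])
        targets.extend(dst + offset for dst in graph["edge_index"][1])
        offset += len(graph["x"])
    return {"x": x, "edge_index": [sources, targets]}


g1 = {"x": [[1.0], [2.0]], "edge_index": [[0, 1], [1, 0]]}
g2 = {"x": [[3.0], [4.0], [5.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]}
big = collate([g1, g2])
print(big["edge_index"])  # [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
```

Because no edge crosses between the shifted index ranges, message passing on the merged graph never mixes information from different graphs.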
    from torch_geometric.datasets import TUDataset
    from torch_geometric.data import DataLoader

    dataset = TUDataset(root='./data', name='ENZYMES', use_node_attr=True)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    len(loader)
19
    for batch in loader:
        print(batch)
Batch(batch=[1034], edge_index=[2, 3968], x=[1034, 21], y=[32])
Batch(batch=[1057], edge_index=[2, 4196], x=[1057, 21], y=[32])
Batch(batch=[1034], edge_index=[2, 4078], x=[1034, 21], y=[32])
Batch(batch=[1167], edge_index=[2, 4186], x=[1167, 21], y=[32])
Batch(batch=[1096], edge_index=[2, 4132], x=[1096, 21], y=[32])
Batch(batch=[1176], edge_index=[2, 4286], x=[1176, 21], y=[32])
Batch(batch=[1070], edge_index=[2, 4226], x=[1070, 21], y=[32])
Batch(batch=[1077], edge_index=[2, 4114], x=[1077, 21], y=[32])
Batch(batch=[1085], edge_index=[2, 4276], x=[1085, 21], y=[32])
Batch(batch=[949], edge_index=[2, 3546], x=[949, 21], y=[32])
Batch(batch=[1049], edge_index=[2, 3944], x=[1049, 21], y=[32])
Batch(batch=[940], edge_index=[2, 3690], x=[940, 21], y=[32])
Batch(batch=[1298], edge_index=[2, 4444], x=[1298, 21], y=[32])
Batch(batch=[935], edge_index=[2, 3650], x=[935, 21], y=[32])
Batch(batch=[982], edge_index=[2, 3866], x=[982, 21], y=[32])
Batch(batch=[1022], edge_index=[2, 3908], x=[1022, 21], y=[32])
Batch(batch=[916], edge_index=[2, 3514], x=[916, 21], y=[32])
Batch(batch=[982], edge_index=[2, 3840], x=[982, 21], y=[32])
Batch(batch=[711], edge_index=[2, 2700], x=[711, 21], y=[24])
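Each batch holds 32 graphs, but the graphs differ in size. In the output above, the 32 graphs of the first batch contain 1034 nodes and 3968 edge entries in total. Since shuffle=True, these numbers change from run to run.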
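The 19 batches and the final batch of only 24 graphs follow directly from the dataset size. A quick check, assuming the loader keeps the incomplete last batch (the default drop_last=False of PyTorch's DataLoader):

```python
import math

num_graphs, batch_size = 600, 32

# Number of batches when the incomplete last batch is kept.
num_batches = math.ceil(num_graphs / batch_size)
# Graphs left over for the final batch.
last_batch = num_graphs - (num_batches - 1) * batch_size

print(num_batches, last_batch)  # 19 24
```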
torch_geometric.data.Batch inherits from torch_geometric.data.Data and adds one extra attribute, batch. batch is a column vector that maps every node to the index of the graph it belongs to:
$$
\text{batch}=\left[\begin{array}{ccccccccc} 0 & \cdots & 0 & 1 & \cdots & n-2 & n-1 & \cdots & n-1 \end{array}\right]^{\top}
$$
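Concretely, the batch vector repeats each graph's index once per node of that graph. A small sketch in plain Python (`batch_vector` is a hypothetical helper, and the node counts are made up):

```python
def batch_vector(nodes_per_graph):
    """Repeat graph index g once for every node of graph g."""
    batch = []
    for graph_id, count in enumerate(nodes_per_graph):
        batch.extend([graph_id] * count)
    return batch


# Three graphs with 3, 2, and 4 nodes.
print(batch_vector([3, 2, 4]))  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```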
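With batch, node features can be averaged per graph. This uses the torch_scatter library: for each graph, the feature vectors of all of its nodes are averaged, so a full batch yields a result of shape [32, 21], one row per graph.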
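The per-graph mean is a scatter operation: each node's row is added into the slot given by its batch entry, and each slot is then divided by its node count. A plain-Python sketch of that idea (`scatter_mean_rows` is a hypothetical stand-in, not the torch_scatter implementation):

```python
def scatter_mean_rows(x, batch, num_groups):
    """Average the rows of x that share the same index in batch."""
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for row, group in zip(x, batch):
        counts[group] += 1
        for j, value in enumerate(row):
            sums[group][j] += value
    # Guard against empty groups to avoid dividing by zero.
    return [[s / max(c, 1) for s in total] for total, c in zip(sums, counts)]


x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
batch = [0, 0, 1]  # the first two nodes belong to graph 0, the last to graph 1
print(scatter_mean_rows(x, batch, num_groups=2))  # [[2.0, 3.0], [10.0, 20.0]]
```

The output has one row per graph and one column per feature, which is why a batch of 32 graphs with 21 features gives [32, 21].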
    from torch_scatter import scatter_mean
    from torch_geometric.datasets import TUDataset
    from torch_geometric.data import DataLoader

    dataset = TUDataset(root='./data', name='ENZYMES', use_node_attr=True)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    for data in loader:
        print(data)
        print(data.num_graphs)
        # Average node features per graph, grouped by the batch vector.
        x = scatter_mean(data.x, data.batch, dim=0)
        print(x.size())
Batch(batch=[971], edge_index=[2, 3762], x=[971, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1214], edge_index=[2, 4626], x=[1214, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1102], edge_index=[2, 4184], x=[1102, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1076], edge_index=[2, 3630], x=[1076, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1044], edge_index=[2, 3978], x=[1044, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1152], edge_index=[2, 4216], x=[1152, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1053], edge_index=[2, 4152], x=[1053, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1018], edge_index=[2, 3876], x=[1018, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1076], edge_index=[2, 4082], x=[1076, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[941], edge_index=[2, 3586], x=[941, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1119], edge_index=[2, 4292], x=[1119, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1026], edge_index=[2, 3818], x=[1026, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[934], edge_index=[2, 3774], x=[934, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1036], edge_index=[2, 3882], x=[1036, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[983], edge_index=[2, 3734], x=[983, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1003], edge_index=[2, 3918], x=[1003, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[1140], edge_index=[2, 4506], x=[1140, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[924], edge_index=[2, 3572], x=[924, 21], y=[32])
32
torch.Size([32, 21])
Batch(batch=[768], edge_index=[2, 2976], x=[768, 21], y=[24])
24
torch.Size([24, 21])
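Data Transforms
Data transforms in PyG are similar to the image transforms and augmentations in torchvision. Several transforms can be chained with torch_geometric.transforms.Compose.
The example below uses the ShapeNet dataset (17,000 3D point clouds in 16 shape categories), first loaded as raw point clouds and then turned into graphs with kNN: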
    from torch_geometric.datasets import ShapeNet

    dataset = ShapeNet(root='./data/ShapeNet', categories=['Airplane'])
    dataset
ShapeNet(2349, categories=['Airplane'])
Data(category=[1], pos=[2518, 3], x=[2518, 3], y=[2518])
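A graph can be built from each point cloud with kNN by passing a pre_transform. The pre_transform is applied once, before the processed data is written to disk, so the graph is not rebuilt the next time the dataset is loaded. Because the processed files are cached, a dataset that was already processed without pre_transform (as in the previous cell) is probably reused as-is, which would explain why the output below shows no edge_index.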
    import torch_geometric.transforms as T

    dataset = ShapeNet(root='./data/ShapeNet', categories=['Airplane'],
                       pre_transform=T.KNNGraph(k=6),
                       transform=T.RandomTranslate(0.01))
    dataset[0]
Data(category=[1], pos=[2518, 3], x=[2518, 3], y=[2518])
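To make the kNN graph construction concrete, here is a brute-force sketch in plain Python: every point gets an incoming edge from each of its k nearest neighbours. `knn_graph` here is a hypothetical helper written for illustration; the edge direction (neighbour to centre) is a convention chosen for this sketch, and a real implementation would avoid the O(n^2) distance loop.

```python
def knn_graph(points, k):
    """Connect every point to its k nearest neighbours (brute force)."""
    sources, targets = [], []
    for i, p in enumerate(points):
        distances = []
        for j, q in enumerate(points):
            if i != j:
                squared = sum((a - b) ** 2 for a, b in zip(p, q))
                distances.append((squared, j))
        distances.sort()
        for _, j in distances[:k]:
            sources.append(j)  # edge from neighbour j ...
            targets.append(i)  # ... to the centre point i
    return [sources, targets]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (5.0, 5.0, 5.0)]
print(knn_graph(points, k=1))  # [[1, 0, 0, 2], [0, 1, 2, 3]]
```

Note that the result is not symmetric in general: point 3 picks point 2 as its nearest neighbour, but not the other way round.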
Learning Methods on Graphs
Having covered PyG's data handling and transforms, the next step is to implement a GNN with PyG.
    from torch_geometric.datasets import Planetoid

    dataset = Planetoid(root='./data', name='Cora')
    dataset
Cora()
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv


    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            # Two-layer GCN: input features -> 16 hidden units -> class scores.
            self.conv1 = GCNConv(dataset.num_node_features, 16)
            self.conv2 = GCNConv(16, dataset.num_classes)

        def forward(self, data):
            x, edge_index = data.x, data.edge_index

            x = self.conv1(x, edge_index)
            x = F.relu(x)
            x = F.dropout(x, training=self.training)
            x = self.conv2(x, edge_index)

            return F.log_softmax(x, dim=1)
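For reference, each GCNConv layer applies the propagation rule of the GCN paper (Kipf and Welling), as far as I understand the layer's documentation. In the notation below, $\mathbf{A}$ is the adjacency matrix, $\hat{\mathbf{A}}$ adds self-loops, $\hat{\mathbf{D}}$ is the degree matrix of $\hat{\mathbf{A}}$, and $\mathbf{\Theta}$ holds the learnable weights; the symbol names are my own choice here.

```latex
\mathbf{X}' = \hat{\mathbf{D}}^{-1/2}\,\hat{\mathbf{A}}\,\hat{\mathbf{D}}^{-1/2}\,\mathbf{X}\,\mathbf{\Theta},
\qquad \hat{\mathbf{A}} = \mathbf{A} + \mathbf{I},
\qquad \hat{D}_{ii} = \sum_{j} \hat{A}_{ij}
```

With 1433 input features and 16 hidden units on Cora, the first layer's $\mathbf{\Theta}$ has shape $1433 \times 16$, and the second maps the 16 hidden units to one score per class.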
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = Net().to(device)
    data = dataset[0].to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

    model.train()
    for epoch in range(200):
        optimizer.zero_grad()
        out = model(data)
        # Only the training nodes contribute to the loss.
        loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
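The loss combines log_softmax (from the model's forward pass) with nll_loss restricted to the training nodes. The sketch below spells out that computation in plain Python on made-up scores; the function names are hypothetical and only mirror what I take the PyTorch calls to compute with their default mean reduction.

```python
import math


def log_softmax(row):
    """Log-probabilities for one row of class scores."""
    peak = max(row)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_norm for v in row]


def masked_nll(scores, labels, mask):
    """Mean negative log-likelihood over the rows where mask is True."""
    losses = [-log_softmax(row)[y]
              for row, y, keep in zip(scores, labels, mask) if keep]
    return sum(losses) / len(losses)


scores = [[2.0, 0.0], [0.0, 0.0], [0.0, 5.0]]
labels = [0, 1, 0]
mask = [True, True, False]  # the third node is not a training node
print(masked_nll(scores, labels, mask))
```

The third node is confidently wrong, yet it does not affect the loss because the mask excludes it; this is how the unlabeled part of the graph stays out of the training signal while still taking part in message passing.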
    model.eval()
    _, pred = model(data).max(dim=1)
    correct = float(pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
    acc = correct / data.test_mask.sum().item()
    print('Accuracy: {:.4f}'.format(acc))
Accuracy: 0.7970
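An accuracy of 0.7970 over the 1000 test nodes corresponds to 797 correct predictions. The evaluation logic itself is small enough to sketch in plain Python on made-up predictions (`masked_accuracy` is a hypothetical helper):

```python
def masked_accuracy(pred, target, mask):
    """Fraction of correct predictions among the entries where mask is True."""
    hits = [p == t for p, t, keep in zip(pred, target, mask) if keep]
    return sum(hits) / len(hits)


pred = [3, 4, 4, 0, 2]
target = [3, 4, 1, 0, 0]
test_mask = [False, True, True, True, True]  # the first node is not a test node
print('Accuracy: {:.4f}'.format(masked_accuracy(pred, target, test_mask)))  # Accuracy: 0.5000
```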
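This post briefly covered how to install PyG and introduced its basics through simple examples. The next step is to learn how to build custom graph convolution operators.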