Compound molecules: building graphs with ogb and dgl, and training a GNN model

June 28, 2023, 3:22 PM • Artificial Intelligence • 113 views

References: https://towardsdatascience.com/learn-to-smell-molecules-with-graph-convolutional-neural-networks-62fa5a826af5
https://github.com/snap-stanford/ogb
https://github.com/dmlc/dgl/blob/master/examples/pytorch

https://programtalk.com/python-more-examples/ogb.utils.features.atom_to_feature_vector/
https://keras.io/examples/generative/wgan-graphs/

The ogb (Open Graph Benchmark) package provides utilities for loading and processing graph data.

1. Building graphs from SMILES with ogb and dgl

import torch
import dgl
import torch_geometric
from ogb.utils.features import atom_to_feature_vector, bond_to_feature_vector, get_atom_feature_dims, \
    get_bond_feature_dims
from rdkit import Chem
from rdkit.Chem.rdmolops import GetAdjacencyMatrix
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F
from scipy.constants import physical_constants
from typing import List, Tuple

def graph_only_collate(batch: List[Tuple]):
    return dgl.batch(batch)

class InferenceDataset(Dataset):

    def __init__(self, smiles_txt_path, device='cuda:0', transform=None, **kwargs):
        with open(smiles_txt_path) as file:
            lines = file.readlines()
            smiles_list = [line.rstrip() for line in lines]
        atom_slices = [0]
        edge_slices = [0]
        all_atom_features = []
        all_edge_features = []
        edge_indices = []  # edges of each molecule in coo format
        total_atoms = 0
        total_edges = 0
        n_atoms_list = []
        for mol_idx, smiles in tqdm(enumerate(smiles_list)):
            # parse the SMILES string read from the text file into an RDKit molecule
            mol = Chem.MolFromSmiles(smiles)
            # add explicit hydrogen atoms, which SMILES strings normally leave implicit
            mol = Chem.AddHs(mol)
            n_atoms = mol.GetNumAtoms()

            atom_features_list = []
            for atom in mol.GetAtoms():
                atom_features_list.append(atom_to_feature_vector(atom))
            all_atom_features.append(torch.tensor(atom_features_list, dtype=torch.long))

            edges_list = []
            edge_features_list = []
            for bond in mol.GetBonds():
                i = bond.GetBeginAtomIdx()
                j = bond.GetEndAtomIdx()
                edge_feature = bond_to_feature_vector(bond)
                # add edges in both directions
                edges_list.append((i, j))
                edge_features_list.append(edge_feature)
                edges_list.append((j, i))
                edge_features_list.append(edge_feature)
            # Graph connectivity in COO format with shape [2, num_edges]
            edge_index = torch.tensor(edges_list, dtype=torch.long).T
            edge_features = torch.tensor(edge_features_list, dtype=torch.long)

            edge_indices.append(edge_index)
            all_edge_features.append(edge_features)
            total_edges += len(edges_list)
            total_atoms += n_atoms
            edge_slices.append(total_edges)
            atom_slices.append(total_atoms)
            n_atoms_list.append(n_atoms)

        self.n_atoms = torch.tensor(n_atoms_list)
        self.atom_slices = torch.tensor(atom_slices, dtype=torch.long)
        self.edge_slices = torch.tensor(edge_slices, dtype=torch.long)
        self.edge_indices = torch.cat(edge_indices, dim=1)
        self.all_atom_features = torch.cat(all_atom_features, dim=0)
        self.all_edge_features = torch.cat(all_edge_features, dim=0)

    def __len__(self):
        return len(self.atom_slices) - 1

    def __getitem__(self, idx):

        e_start = self.edge_slices[idx]
        e_end = self.edge_slices[idx + 1]
        start = self.atom_slices[idx]
        # convert to a plain int so it can be passed as num_nodes
        n_atoms = int(self.n_atoms[idx])
        edge_indices = self.edge_indices[:, e_start: e_end]
        g = dgl.graph((edge_indices[0], edge_indices[1]), num_nodes=n_atoms)
        g.ndata['feat'] = self.all_atom_features[start: start + n_atoms]
        g.edata['feat'] = self.all_edge_features[e_start: e_end]
        return g

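# `device` and `args` come from the surrounding script; the file at args.smiles_txt_path holds one SMILES string per line
test_data = InferenceDataset(device=device, smiles_txt_path=args.smiles_txt_path)

# torch.utils.data.DataLoader; graph_only_collate merges each batch into one batched DGL graph
test_loader = DataLoader(test_data, batch_size=2, collate_fn=graph_only_collate)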
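
The dataset above keeps the atoms and edges of all molecules in flat, concatenated tensors and recovers a single molecule through the offsets in atom_slices and edge_slices. A minimal library-free sketch of that bookkeeping, using plain lists and made-up atom labels (the helper names `pack` and `unpack` are hypothetical and not part of ogb or dgl):

```python
from itertools import accumulate

def pack(per_molecule):
    """Concatenate per-molecule lists and record cumulative offsets."""
    flat = [item for mol in per_molecule for item in mol]
    # slices[i] is where molecule i starts; slices[-1] is the total length
    slices = [0] + list(accumulate(len(mol) for mol in per_molecule))
    return flat, slices

def unpack(flat, slices, idx):
    """Recover molecule idx from the flat storage."""
    return flat[slices[idx]:slices[idx + 1]]

# Made-up atom labels for three molecules of different sizes
atoms = [["C", "O"], ["C", "C", "N"], ["O"]]
flat, slices = pack(atoms)
print(slices)                    # [0, 2, 5, 6]
print(unpack(flat, slices, 1))   # ['C', 'C', 'N']
print(len(slices) - 1)           # 3, the value __len__ would return
```

Storing offsets instead of one object per molecule keeps memory compact, and each lookup is a single slice.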

Single-molecule test

smiles_list = ["OC1CC1(O)CC1CC1"]

atom_slices = [0]
edge_slices = [0]
all_atom_features = []
all_edge_features = []
edge_indices = []  # edges of each molecule in coo format
total_atoms = 0
total_edges = 0
n_atoms_list = []
for mol_idx, smiles in tqdm(enumerate(smiles_list)):
    # parse the SMILES string from the list into an RDKit molecule
    mol = Chem.MolFromSmiles(smiles)
    # add explicit hydrogen atoms, which SMILES strings normally leave implicit
    mol = Chem.AddHs(mol)
    print(Chem.MolToSmiles(mol))
    n_atoms = mol.GetNumAtoms()
    print(n_atoms)

    atom_features_list = []
    for atom in mol.GetAtoms():
        atom_features_list.append(atom_to_feature_vector(atom))
    all_atom_features.append(torch.tensor(atom_features_list, dtype=torch.long))

    edges_list = []
    edge_features_list = []
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()
        edge_feature = bond_to_feature_vector(bond)
        # add edges in both directions
        edges_list.append((i, j))
        edge_features_list.append(edge_feature)
        edges_list.append((j, i))
        edge_features_list.append(edge_feature)
    # Graph connectivity in COO format with shape [2, num_edges]
    edge_index = torch.tensor(edges_list, dtype=torch.long).T
    edge_features = torch.tensor(edge_features_list, dtype=torch.long)

    edge_indices.append(edge_index)
    all_edge_features.append(edge_features)
    total_edges += len(edges_list)
    total_atoms += n_atoms
    edge_slices.append(total_edges)
    atom_slices.append(total_atoms)
    n_atoms_list.append(n_atoms)

n_atoms = torch.tensor(n_atoms_list)
atom_slices = torch.tensor(atom_slices, dtype=torch.long)
edge_slices = torch.tensor(edge_slices, dtype=torch.long)
edge_indices = torch.cat(edge_indices, dim=1)
all_atom_features = torch.cat(all_atom_features, dim=0)
all_edge_features = torch.cat(all_edge_features, dim=0)

e_start = edge_slices[0]
e_end = edge_slices[0 + 1]
start = atom_slices[0]
n_atoms = n_atoms[0]
edge_indices = edge_indices[:, e_start: e_end]
g = dgl.graph((edge_indices[0], edge_indices[1]), num_nodes=n_atoms)
g.ndata['feat'] = all_atom_features[start: start + n_atoms]
g.edata['feat'] = all_edge_features[e_start: e_end]
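
Each bond is stored as two directed edges, and the edge list is then transposed into COO layout with shape [2, num_edges]. A small sketch of that step in plain Python, assuming a hypothetical bond list for a three-atom chain (the function name `bonds_to_coo` is illustrative only):

```python
def bonds_to_coo(bonds):
    """Turn undirected bonds (i, j) into directed edges in COO layout."""
    edges = []
    for i, j in bonds:
        edges.append((i, j))  # forward direction
        edges.append((j, i))  # reverse direction
    # Transpose the [num_edges, 2] list into [2, num_edges]
    src = [e[0] for e in edges]
    dst = [e[1] for e in edges]
    return [src, dst]

# Hypothetical bond list for a 3-atom chain 0-1-2
coo = bonds_to_coo([(0, 1), (1, 2)])
print(coo)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

The first row holds source nodes and the second row holds destination nodes, which is the pair that dgl.graph receives in the code above.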

2. GNN model training code

row.csv

SMILES,SENTENCE
C/C=C/C(=O)C1CCC(C=C1C)(C)C,"fruity,rose"
COC(=O)OC,"fresh,ethereal,fruity"
Cc1cc2c([nH]1)cccc2,"resinous,animalic"
C1CCCCCCCC(=O)CCCCCCC1,"powdery,musk,animalic"
CC(CC(=O)OC1CC2C(C1(C)CC2)(C)C)C,"coniferous,camphor,fruity"
CCC[C@H](CCO)SC,tropicalfruit
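
The SENTENCE column holds a comma-separated list of odor descriptors, and the training code reduces it to a binary "fruity or not" label. A standard-library sketch of that reduction on the sample rows above (the function name `fruity_labels` is hypothetical):

```python
import csv
import io

# The sample rows shown above, inlined so the sketch needs no file on disk
ROW_CSV = '''SMILES,SENTENCE
C/C=C/C(=O)C1CCC(C=C1C)(C)C,"fruity,rose"
COC(=O)OC,"fresh,ethereal,fruity"
Cc1cc2c([nH]1)cccc2,"resinous,animalic"
C1CCCCCCCC(=O)CCCCCCC1,"powdery,musk,animalic"
CC(CC(=O)OC1CC2C(C1(C)CC2)(C)C)C,"coniferous,camphor,fruity"
CCC[C@H](CCO)SC,tropicalfruit
'''

def fruity_labels(csv_text, target="fruity"):
    """Return 1 for rows whose descriptor list contains target exactly, else 0."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [1 if target in row["SENTENCE"].split(",") else 0 for row in reader]

labels = fruity_labels(ROW_CSV)
print(labels)  # [1, 1, 0, 0, 1, 0]
```

Because the match is made against the split list rather than the raw string, "tropicalfruit" does not count as "fruity".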

from rdkit import Chem
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
import dgl
from ogb.utils.features import atom_to_feature_vector, bond_to_feature_vector, get_atom_feature_dims, \
    get_bond_feature_dims

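def smiles2graph(smiles_string):
    """
    Builds a DGL graph whose nodes are atoms and whose edges are bonds
    """
    mol = Chem.MolFromSmiles(smiles_string)
    # symmetric adjacency matrix, so every bond yields an edge in both directions
    A = Chem.GetAdjacencyMatrix(mol)
    src, dst = np.nonzero(A)
    # pass num_nodes explicitly so atoms without bonds still get a node
    g = dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=mol.GetNumAtoms())
    return g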

def feat_vec(smiles_string):
    """
    Returns atom features for a molecule given a smiles string
    """
    # atoms
    mol = Chem.MolFromSmiles(smiles_string)
    atom_features_list = []
    for atom in mol.GetAtoms():
        atom_features_list.append(atom_to_feature_vector(atom))
    x = np.array(atom_features_list, dtype=np.int64)
    return x

df = pd.read_csv("row.csv")

# Binary label: 1 if 'fruity' is one of the odor descriptors, else 0
lista_senten = df['SENTENCE'].to_list()
labels = []

for olor in lista_senten:
    olor = olor.split(",")
    if 'fruity' in olor:
        labels.append(1)
    else:
        labels.append(0)

lista_mols = df['SMILES'].to_list()

graphs = []
exceptions = []  # indices of molecules whose features could not be attached
for j, mol in enumerate(lista_mols):
    g_mol = smiles2graph(mol)
    try:
        g_mol.ndata['feat'] = torch.tensor(feat_vec(mol))
    except Exception:
        exceptions.append(j)
    graphs.append(g_mol)

from dgl.data import DGLDataset

class SyntheticDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='synthetic')

    def process(self):
        self.graphs = graphs
        self.labels = torch.LongTensor(labels)

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)

dataset = SyntheticDataset()

from dgl.dataloading import GraphDataLoader
from torch.utils.data.sampler import SubsetRandomSampler

num_examples = len(dataset)
num_train = int(num_examples * 0.8)

train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))

train_dataloader = GraphDataLoader(
    dataset, sampler=train_sampler, batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(
    dataset, sampler=test_sampler, batch_size=5, drop_last=False)
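
The two samplers above split the dataset by position: the first 80% of indices go to training and the rest to testing, with random order only inside each subset. A plain-Python sketch of those index sets, plus a hypothetical shuffled variant (`shuffled_split` is an illustration, not something the original code does):

```python
import random

def positional_split(num_examples, train_fraction=0.8):
    """Mirror the split above: leading indices train, trailing indices test."""
    num_train = int(num_examples * train_fraction)
    return list(range(num_train)), list(range(num_train, num_examples))

def shuffled_split(num_examples, train_fraction=0.8, seed=0):
    """Hypothetical variant: shuffle first so the split is not tied to file order."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_train = int(num_examples * train_fraction)
    return indices[:num_train], indices[num_train:]

train_idx, test_idx = positional_split(6)
print(train_idx, test_idx)  # [0, 1, 2, 3] [4, 5]
```

With the six sample rows, only the last two molecules are ever used for testing, so a shuffled split may be worth considering when the CSV is sorted in some meaningful way.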

from dgl.nn import GraphConv
from torch import nn
import torch.nn.functional as F

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        g.ndata['h'] = h
        return dgl.mean_nodes(g, 'h')
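
The final dgl.mean_nodes call is the readout step: it averages the per-node outputs within each graph of the batch, giving one score vector per molecule. A library-free sketch of that averaging with made-up numbers (`mean_readout` is a hypothetical helper that takes the node counts per graph explicitly):

```python
def mean_readout(node_values, nodes_per_graph):
    """Average per-node vectors within each graph of a batch."""
    out = []
    start = 0
    for n in nodes_per_graph:
        rows = node_values[start:start + n]
        # Column-wise mean over this graph's nodes
        out.append([sum(col) / n for col in zip(*rows)])
        start += n
    return out

# Made-up 2-class scores for a batch of two graphs with 2 and 3 nodes
h = [[1.0, 3.0], [3.0, 5.0], [0.0, 6.0], [3.0, 0.0], [6.0, 3.0]]
print(mean_readout(h, [2, 3]))  # [[2.0, 4.0], [3.0, 3.0]]
```

Averaging makes the output independent of molecule size, which matters here because molecules in one batch have different numbers of atoms.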

model = GCN(9, 8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(20):
    for batched_graph, labels in train_dataloader:
        pred = model(batched_graph, batched_graph.ndata['feat'].float())
        #print(pred,labels)
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

num_correct = 0
num_tests = 0
for batched_graph, labels in test_dataloader:
    pred = model(batched_graph, batched_graph.ndata['feat'].float())
    num_correct += (pred.argmax(1) == labels).sum().item()
    num_tests += len(labels)

print('Test accuracy:', num_correct / num_tests)
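
The training loop minimizes cross-entropy on the raw model outputs, and the evaluation loop counts how often the highest-scoring class matches the label. A plain-Python sketch of both quantities on made-up scores (these helpers only illustrate the math and are not how PyTorch implements it):

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits)."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[y]
    return total / len(labels)

def accuracy(logits, labels):
    """Fraction of rows whose first highest-scoring class equals the label."""
    hits = sum(1 for row, y in zip(logits, labels) if row.index(max(row)) == y)
    return hits / len(labels)

# Made-up scores for three molecules and two classes
logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labels = [0, 1, 1]
print(cross_entropy(logits, labels))
print(accuracy(logits, labels))
```

A row with equal scores, like the third one, contributes log(2) to the loss whatever its label is.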

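# Save the trained weights (args.save_dir and args.name come from the surrounding script)
import os

torch.save(model.state_dict(), os.path.join(args.save_dir, args.name))

# To load: build the model architecture first, then load the weights
model.load_state_dict(torch.load(os.path.join(args.save_dir, args.name)))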

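3. Installing the PyG graph framework

Reference: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

PyG (PyTorch Geometric) is native to PyTorch, while DGL, used above, was first developed at Amazon and is also fairly general-purpose.

Installing PyG is error-prone. It is best to create a conda environment with python==3.9. If installing PyG's companion packages fails with errors such as "Microsoft Visual C++ 14.0 or greater is required. Get it wit", download the prebuilt wheels and install those directly.

The wheels are hosted at https://pytorch-geometric.com/whl. The setup described here used torch=1.12.1 on Windows.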

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric

Original: https://blog.csdn.net/weixin_42357472/article/details/127819539
Author: loong_XL
Title: Compound molecules: building graphs with ogb and dgl, and training a GNN model
