CORE: Automatic Molecule Optimization using Copy and Refine Strategy (AAAI 2020)

Paper page: CORE: Automatic Molecule Optimization Using Copy & Refine Strategy (Papers With Code)

Paired dataset: iclr19-graph2graph/diff_vae in the wengong-jin/iclr19-graph2graph repository on GitHub

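The core idea of this molecule optimization model: at each generation step, CORE decides whether to copy a substructure from the input molecule (Copy) or add a new substructure (Refine).

What I found useful in this paper:

It points to paired molecule data that can be used for supervised learning: the diff_vae paired dataset in wengong-jin/iclr19-graph2graph on GitHub.
It makes clear what molecule optimization has to do: while keeping the structure similar to the original molecule (about 80% of substructures retained), copy substructures from the input molecule (Copy) or add new substructures (Refine).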

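Background

Because the number of drug-like molecules is enormous, estimated at between 10^23 and 10^60, traditional methods such as high-throughput screening (HTS) have limitations. One task in drug discovery is lead optimization: researchers first identify candidate molecules (hits) through HTS, then use lead optimization to find lead compounds with better properties than the original hits. To cast lead optimization as a machine learning problem, the training data consist of molecule pairs (X, Y): X is the input molecule and Y is the target molecule that X maps to, with more desirable properties. The goal of training is a model that, given an input molecule, generates a target molecule with better properties. This is a supervised learning setup.

Because molecules can be represented as SMILES strings, early work framed molecule generation as a sequence generation problem. However, many of these algorithms produce a large number of invalid SMILES strings that do not correspond to any valid molecule.

To address this, researchers proposed methods based on molecular graphs (not molecule images). These methods recast molecule generation as a graph-to-graph translation problem, which avoids generating SMILES strings. Their core idea is to partition the input molecular graph into a scaffolding tree of substructures (e.g., rings, atoms, and bonds) and learn to generate that tree. Graph generation methods still have shortcomings, though: the large number of possible tree nodes means a large number of possible substructures. For example, the ZINC database has about 800 unique substructures.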

**Comparison between input and target molecules on 4 datasets/tasks.**

Table 1: Stable principle: Row 1 shows the percentage of the target molecule's substructures that come from the input molecule, which is about 80% or more and indicates that many original substructures are kept in the newly generated targets.
Novelty principle: Row 2 shows the percentage of targets that have any new substructure not belonging to the input molecule, which is also high and indicates the need to include new substructures in the targets.
Row 3 lists the total number of substructures.
Row 4 lists the average number of substructures per molecule.

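This poses a challenge for the model. On the one hand, at each generation step the model must choose the substructure to add from a large set of possible substructures. On the other hand, from the actual data the authors observed the following two principles about target molecules:

Stable principle: the vast majority of substructures in the target molecule come from the input molecule (about 80% or more);
Novelty principle: most target molecules contain new substructures (about 80%).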
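
To make the two principles concrete, here is a minimal sketch of how they could be measured for one (input, target) pair. It assumes each molecule has already been decomposed into a list of substructure labels; the function names and the multiset reading of "percentage kept" are illustrative assumptions, not the paper's code.

```python
from collections import Counter


def stable_fraction(input_subs, target_subs):
    """Fraction of the target's substructures that also occur in the input.

    Substructures are compared as multisets (assumption): a label that
    appears twice in the target counts twice only if the input has it twice.
    """
    inp, tgt = Counter(input_subs), Counter(target_subs)
    kept = sum(min(count, inp[name]) for name, count in tgt.items())
    return kept / sum(tgt.values())


def has_novel_substructure(input_subs, target_subs):
    """True if the target contains a substructure absent from the input."""
    return bool(set(target_subs) - set(input_subs))


# Hypothetical substructure labels for one training pair.
x = ["benzene", "C-N", "C-O", "pyridine", "C-C"]
y = ["benzene", "C-N", "C-O", "pyridine", "C-F"]

print(stable_fraction(x, y))         # 4 of the 5 target substructures are copied
print(has_novel_substructure(x, y))  # "C-F" is new
```

Averaging the first quantity over a dataset gives a Row 1 style statistic, and averaging the second gives a Row 2 style statistic.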

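Based on these observations, the authors propose a new molecule optimization method called Copy & Refine (CORE). Its core idea: at each generation step, CORE decides whether to copy a substructure from the input molecule (Copy) or add a new substructure (Refine).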

Figure 1: The encoder works at both the graph level and the scaffolding tree level.
Decoding is split into two parts: the scaffolding tree decoder, which generates the tree greedily by depth-first search with a topological prediction and a substructure prediction at each node, and the graph decoder, which enumerates all possible combinations to assemble the nodes of the scaffolding tree into a molecule.

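Method

• molecular graph G is the graph structure for a molecule;

• scaffolding tree $T_{G}$ is the skeleton of the molecular graph G, obtained by partitioning the original graph into substructures (subgraphs) and connecting those substructures into a tree.

Given a molecule pair (input X and target Y), an encoder is first trained to embed both the molecular graph G and the scaffolding tree $T_{G}$ of the input X into vector representations, using message passing on the graph (or tree). A two-level decoder then creates a new scaffolding tree and the corresponding molecular graph. CORE's main contribution lies in the decoder module, where it produces molecules that follow the novelty and stable principles.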

1. Encoder

To construct cycle-free structures, the scaffolding tree $T_{G}$ is generated by contracting certain vertices of G into a single node.
By viewing the scaffolding tree as a graph, both input molecular graphs and scaffolding trees can be encoded via graph Message Passing Networks (MPN) (Dai, Dai, and Song 2016; Jin, Barzilay, and Jaakkola 2018).
The encoder yields an embedding vector for each node in either the scaffolding tree or the input molecular graph.

More formally, on the node level, $f_{v}$ denotes the feature vector for node v.

For atoms, $f_{v}$ includes the atom type, valence, and other atomic properties.

For nodes in the scaffolding tree representing substructures, $f_{v}$ is a one-hot vector indicating its substructure index.

On the edge level, $f_{uv}$ denotes the feature vector for edge (u, v) ∈ E.

N(v) denotes the set of neighbor nodes of node v.

$v_{uv}$ and $v_{vu}$ are the hidden variables that represent the message from node u to v and vice versa.

They are iteratively updated via a fully-connected neural network g1(·):

where $v_{uv}^{t}$ is the message vector at the t-th iteration, whose initialization is $v_{uv}^{0}=0$ . After T steps of iteration, another network g2(·) is used to aggregate these messages. Each vertex has a latent vector as:

where g2(·) is another fully-connected neural network.
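
The two update equations referred to above did not survive in this copy. Assuming the standard message-passing form of Jin, Barzilay, and Jaakkola (2018), on which this encoder is based, they would read roughly as follows; the exact arguments of g1(·) and g2(·) are an assumption here, not a quotation from the paper.

$$v_{uv}^{t} = g_{1}\Big(f_{u},\; f_{uv},\; \sum_{w \in N(u) \setminus v} v_{wu}^{t-1}\Big)$$

$$\mathbf{x}_{u} = g_{2}\Big(f_{u},\; \sum_{v \in N(u)} v_{vu}^{T}\Big)$$

In words: the message from u to v at step t combines u's own features, the edge features, and the messages that reached u at step t-1 from every neighbor except v; after T steps, the embedding of u aggregates all of its incoming messages.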

In summary, the encoder module yields embedding vectors for nodes in graph G and scaffolding tree $T_{G}$ , denoted $\mathbf{X}^{G}=\left\{\mathbf{x}_{1}^{G}, \mathbf{x}_{2}^{G}, \cdots\right\}$ and $\mathbf{X}^{\mathcal{T}_{G}}=\left\{\mathbf{x}_{1}^{\mathcal{T}_{G}}, \mathbf{x}_{2}^{\mathcal{T}_{G}}, \cdots\right\}$ , respectively.
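
The encoder's iterate-then-aggregate pattern can be illustrated with a toy, pure-Python sketch. It uses scalar features, omits edge features, and replaces the learned networks g1(·) and g2(·) with a plain `tanh` of a sum; all of these simplifications are assumptions made for readability, not the paper's architecture.

```python
import math


def message_passing(node_feat, edges, steps=3):
    """Toy scalar message passing on an undirected graph.

    node_feat: dict mapping node -> float feature f_v
    edges:     list of (u, v) pairs (edge features omitted for brevity)
    Returns a dict mapping node -> embedding x_v.
    """
    neighbors = {v: set() for v in node_feat}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # One message per directed edge, initialized to zero (v_uv^0 = 0).
    msg = {(u, v): 0.0 for u in neighbors for v in neighbors[u]}

    for _ in range(steps):
        new_msg = {}
        for (u, v) in msg:
            # Messages arriving at u from every neighbor except v.
            incoming = sum(msg[(w, u)] for w in neighbors[u] if w != v)
            new_msg[(u, v)] = math.tanh(node_feat[u] + incoming)  # stand-in for g1
        msg = new_msg

    # Aggregate all incoming messages into one embedding per node.
    return {
        u: math.tanh(node_feat[u] + sum(msg[(v, u)] for v in neighbors[u]))  # stand-in for g2
        for u in neighbors
    }


emb = message_passing({"a": 0.5, "b": -0.2, "c": 0.1}, [("a", "b"), ("b", "c")])
print(emb)
```

The same routine applies to the scaffolding tree, since a tree is just a graph whose nodes are substructures.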

2. Decoder

The decoder consists of a scaffolding tree decoder and a graph decoder; CORE's improvements target the scaffolding tree decoder.

Once the embedding vectors are constructed, decoding proceeds in two phases in a coarse-to-fine manner: (a) tree decoder; (b) graph decoder. We first discuss the scaffolding tree decoder. Our method improves the tree decoder in (Jin et al. 2019), so we describe the enhancement in detail.

A. Tree decoder

The objective of the scaffolding tree decoder is to generate a new scaffolding tree from the embeddings produced by the encoder. The overall idea is to start from an empty tree and generate one substructure at a time; at each step the model decides whether to expand the current node or backtrack to its parent (topological prediction) and, when expanding, which substructure to add (substructure prediction). Generation terminates once the condition to backtrack from the root is reached. More specifically, the tree decoder has two prediction tasks:

• Topological prediction:

When the decoder visits the node $i_{t}$ , the model has to make a binary prediction on either “expanding a new node” or “backtracking to the parent node of $i_{t}$ ”.

The idea is to first enhance the embedding for node $i_{t}$ via a tree-based RNN (Jin, Barzilay, and Jaakkola 2018), then use the enhanced embedding to predict whether to expand or backtrack. Given scaffolding tree $\mathcal{T}_{G}=(\mathcal{V}, \mathcal{E})$ , the tree decoder uses the tree-based RNN with attention mechanism to further improve the embedding information learned from the original message-passing embeddings. Since an RNN works on a sequence, the tree is converted into a sequence of nodes and edges via depth-first search. Specifically, let $\tilde{\mathcal{E}}=\left\{(i_{1}, j_{1}),(i_{2}, j_{2}), \cdots,(i_{m}, j_{m})\right\}$ be the edges traversed in depth-first search; each edge is visited twice, once in each direction, so we have $m=|\tilde{\mathcal{E}}|=2|\mathcal{E}|$ . Suppose $\tilde{\mathcal{E}}_{t}$ is the first t edges in $\tilde{\mathcal{E}}$ ; the message vector $h_{i_{t}, j_{t}}$ is updated as:

$$h_{i_{t}, j_{t}}=\operatorname{GRU}\left(f_{i_{t}},\left\{h_{k, i_{t}}\right\}_{(k, i_{t}) \in \tilde{\mathcal{E}}_{t},\, k \neq j_{t}}\right). \quad (3)$$

The probability of whether to expand or backtrack at node $i_{t}$ is computed by aggregating the embeddings $\mathbf{X}^{\mathcal{T}_{G}}$ , $\mathbf{X}^{G}$ and the current state $f_{i_{t}}$ , $\sum_{(k, i_{t}) \in \tilde{\mathcal{E}}_{t}} h_{k, i_{t}}$ using a neural network g3(·):

$$p_{t}^{\mathrm{topo}}=g_{3}\Big(f_{i_{t}}, \sum_{(k, i_{t}) \in \tilde{\mathcal{E}}_{t}} h_{k, i_{t}}, \mathbf{X}^{\mathcal{T}_{G}}, \mathbf{X}^{G}\Big), \quad \text{where } t=1, \cdots, m. \quad (4)$$

Concretely, first compute the context vector $c_{t}^{\mathrm{topo}}$ using the attention mechanism, then concatenate $c_{t}^{\mathrm{topo}}$ and $f_{i_{t}}$ , followed by a fully connected network with sigmoid activation.

• Substructure prediction:

If the decoder decides to expand, we have to select which substructure to add by either copying from original input or selecting from the global set of substructures.
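
The way the two prediction tasks interleave can be sketched as a small generation loop. The predictors are passed in as plain callables and scripted here with fixed answers; in the real model they would be the learned topological and substructure networks. The function and variable names are hypothetical.

```python
def decode_tree(predict_expand, predict_substructure, max_steps=100):
    """Greedy depth-first tree generation with stubbed predictors.

    predict_expand(node)       -> True to add a child, False to backtrack
    predict_substructure(node) -> label for the new node (node is None for the root)
    """
    root = {"label": predict_substructure(None), "children": []}
    stack = [root]  # path from the root to the node currently being visited
    for _ in range(max_steps):
        current = stack[-1]
        if predict_expand(current):            # topological prediction
            child = {"label": predict_substructure(current), "children": []}
            current["children"].append(child)  # substructure prediction
            stack.append(child)
        else:
            stack.pop()                        # backtrack to the parent
            if not stack:                      # backtracked from the root: done
                break
    return root


# Scripted decisions standing in for the learned predictors.
decisions = iter([True, False, True, False, False])
labels = iter(["ring", "C-N", "C-O"])
tree = decode_tree(lambda node: next(decisions), lambda node: next(labels))
print([child["label"] for child in tree["children"]])
```

The `max_steps` cap is a safety limit added for this sketch so that a predictor that never backtracks cannot loop forever.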

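Topological prediction

When the decoder visits a node, CORE must predict whether to expand a new node or backtrack to the node's parent. A tree-based RNN with attention first enhances the node's embedding, refining the information learned by the original message-passing embeddings; the enhanced embedding is then used for the expand-or-backtrack decision. The message vectors are updated with the GRU of Eq. (3), and the probability of expanding or backtracking at a node is computed as in Eq. (4).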
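
The depth-first edge sequence that feeds the tree-based RNN can be illustrated directly. The sketch below, with a hypothetical adjacency-list tree, lists the directed edges in visiting order and shows that every undirected edge appears twice, once going down and once coming back up.

```python
def dfs_edge_sequence(tree, root):
    """Return the directed edges (i_t, j_t) visited by a depth-first walk.

    tree: dict mapping node -> list of neighboring nodes (undirected tree)
    """
    sequence = []

    def visit(node, parent):
        for child in tree[node]:
            if child == parent:
                continue
            sequence.append((node, child))  # expand: move down to the child
            visit(child, node)
            sequence.append((child, node))  # backtrack: move back up

    visit(root, None)
    return sequence


# A small tree with 3 undirected edges: 0-1, 0-2, 1-3.
tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
seq = dfs_edge_sequence(tree, 0)
print(seq)
print(len(seq))  # twice the number of undirected edges
```

Each "down" edge corresponds to an expand decision and each "up" edge to a backtrack decision, which is why the topological predictor is queried once per entry of this sequence.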

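Substructure prediction

If the decoder decides to expand, it must choose the substructure to add, either by copying from the original input or by selecting from the global set of substructures. The authors find empirically that this step is the most challenging, because it is a major cause of reduced accuracy. First, an attention mechanism computes a context vector from the current message vector and the node embeddings.

Then a fully connected network with softmax activation, applied to the context vector and the message vector, predicts the substructure as a probability distribution over the global substructure set.

The larger a substructure's probability, the more likely it is to be added.

However, the number of possible substructures is usually very large, which makes prediction harder, especially for rare substructures. Inspired by pointer networks, the authors design a similar mechanism that copies some of the input's substructures to the output. A pointer network alone cannot handle targets that contain out-of-input (OOI) substructures, i.e., novel substructures that are not part of the input molecule. To address this, the authors borrow an idea from sequence-to-sequence models and predict a weight for generating a novel OOI substructure.

This weight is assumed to depend on both the input molecule (global information) and the current position in the decoder (local information), where z denotes the global information of the input molecule.

In addition to the OOI weight, each substructure in the input molecule has an attention weight (normalized, so the weights sum to 1) that measures its contribution to the decoder, i.e., the probability of selecting that substructure.

The prediction at the t-th iteration is a mixture of the two distributions, weighted by the OOI weight.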
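
The copy-versus-refine mixture can be sketched in a few lines. The exact formula is not reproduced in this copy, so the version below is an assumption in the style of pointer-generator models: the OOI weight scales the softmax over the global substructure set, and the remaining mass is spread over the input's substructures according to their attention weights. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_refine_distribution(vocab_scores, input_sub_ids, attn_scores, w_ooi):
    """Mix a global softmax with a copy distribution over input substructures.

    vocab_scores:  one score per substructure in the global set
    input_sub_ids: global-set index of each substructure in the input molecule
    attn_scores:   one attention score per input substructure
    w_ooi:         weight in [0, 1] for generating an out-of-input substructure
    """
    q_vocab = softmax(vocab_scores)   # "refine": pick from the global set
    attn = softmax(attn_scores)       # normalized attention, sums to 1
    q_copy = [0.0] * len(vocab_scores)
    for idx, weight in zip(input_sub_ids, attn):
        q_copy[idx] += weight         # "copy": mass only on input substructures
    return [w_ooi * g + (1.0 - w_ooi) * c for g, c in zip(q_vocab, q_copy)]


# Global set of 4 substructures; the input molecule contains ids 1, 3, and 1 again.
probs = copy_refine_distribution([0.1, 0.3, 0.0, -0.2], [1, 3, 1], [2.0, 0.5, 1.0], w_ooi=0.3)
print(probs)
```

With a small `w_ooi` most of the probability mass stays on substructures already present in the input, which is how a rare substructure in the input can still be selected even when the global softmax gives it little weight.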

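Graph decoder

The graph decoder assembles the nodes of the scaffolding tree into the correct molecular graph. During learning, all candidate molecular structures $\{G_i\}$ are enumerated and the task is cast as a classification problem whose objective is to maximize the scoring function of the correct structure $G_o$.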
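
Treating assembly as classification over enumerated candidates amounts to a softmax cross-entropy on the candidate scores. The sketch below assumes the scores have already been produced by some scoring function; the function name is hypothetical and the loss form is a common choice, not necessarily the paper's exact objective.

```python
import math


def assembly_loss(candidate_scores, correct_index):
    """Negative log-likelihood of the correct candidate under a softmax.

    candidate_scores: one score per enumerated candidate structure G_i
    correct_index:    position of the correct structure G_o in that list
    """
    m = max(candidate_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in candidate_scores))
    return log_z - candidate_scores[correct_index]


# The loss is small when the correct candidate has the highest score.
print(assembly_loss([5.0, 0.0, -1.0], correct_index=0))
print(assembly_loss([5.0, 0.0, -1.0], correct_index=2))
```

Minimizing this loss raises the score of the correct assembly relative to the other candidates.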

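Adversarial learning

Adversarial training further improves the model. The entire encoder-decoder architecture is treated as the generator G(·), the target molecules Y are treated as real samples, and a discriminator D(·) distinguishes real molecules from those generated by the decoder. D(·) is a multi-layer feedforward network.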

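Experiments

Molecule dataset

The data are 250K drug molecules extracted from the ZINC database; the table lists the basic statistics of the datasets.

Molecular properties

In drug development, certain properties are crucial for evaluating the effectiveness of the generated drug. This paper focuses on three:

DRD2: the DRD2 score measures a molecule's biological activity against the dopamine type 2 receptor (DRD2), a biological target; it ranges from 0 to 1.
QED: the QED score is an indicator of drug-likeness, ranging from 0 to 1.
Penalized LogP: a logP score that accounts for ring size and synthetic accessibility.

For each SMILES string in ZINC, the QED, DRD2, and LogP scores are generated with the RDKit package. For all three scores, higher is better. Thus, for a training pair (X, Y), X is the input molecule with the lower score and Y is the higher-scoring molecule generated from X.

Generating molecule pairs

In each training pair (X, Y), X is the input molecule and Y is the target molecule with the desired property. X and Y must satisfy two rules:

(1) they are sufficiently similar;

(2) Y improves significantly on X in the property of interest.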

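Rare substructures

Based on the authors' observations, a substructure that appears fewer than 2000 times in the training set is called a "rare substructure"; otherwise it is a "common substructure". The paper pays particular attention to predicting rare substructures.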
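
The rare/common split is a simple frequency cut. The sketch below assumes the training set is available as a list of scaffolding trees, each given as a list of substructure labels; the labels are made up, and the demo uses a threshold of 2 instead of 2000 so that the tiny example produces both groups.

```python
from collections import Counter


def split_rare_common(training_trees, threshold=2000):
    """Split substructures into rare (count < threshold) and common sets."""
    counts = Counter(sub for tree in training_trees for sub in tree)
    rare = {sub for sub, count in counts.items() if count < threshold}
    common = set(counts) - rare
    return rare, common


# Hypothetical training trees, each a list of substructure labels.
trees = [["benzene", "C-N"], ["benzene", "C-O"], ["benzene", "oxetane"]]
rare, common = split_rare_common(trees, threshold=2)
print(sorted(rare))
print(sorted(common))
```

A rare-substructure test subset can then be built by keeping the test molecules that contain at least one label from `rare`.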

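Baselines

JTVAE: a deep generative model that learns a latent space for generating desired molecules. Like CORE, it uses an encoder-decoder architecture at both the scaffolding tree and graph levels.

Graph-to-Graph: the graph-to-graph translation model mentioned earlier (Jin et al. 2019), which CORE builds on.

GCPN: uses graph convolutional networks to generate molecular structures with specific properties.

The authors also tried a sequence-to-sequence model on SMILES strings, but it produced too many invalid SMILES strings to be compared with the graph-based methods, which further supports graph generation as the more effective approach to molecule optimization.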

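Evaluation metrics

Similarity: the molecular similarity between the input molecule and the generated molecule, measured by Tanimoto similarity over Morgan fingerprints.
Property of generated molecules: the QED, DRD2, and LogP scores evaluated with RDKit.
Success rate (SR): a metric that considers similarity and property improvement together. Because the task is to generate a molecule that is similar to the input molecule and also has improved properties, a generated molecule counts as a success only if it meets both conditions:

(a) the input and generated molecules are sufficiently similar;

(b) the property improvement is large enough.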
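
The similarity and success-rate metrics can be sketched without any chemistry toolkit by representing a fingerprint as the set of its on-bit indices. The thresholds below are placeholders chosen for the example, since the values used in the paper are not given in this text, and the use of "greater than or equal" for both conditions is likewise an assumption.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_success(sim, prop_in, prop_out, sim_threshold, gain_threshold):
    """Both conditions must hold: similar enough and improved enough."""
    return sim >= sim_threshold and (prop_out - prop_in) >= gain_threshold


def success_rate(results, sim_threshold, gain_threshold):
    """results: list of (similarity, input_property, output_property) triples."""
    hits = sum(
        is_success(sim, p_in, p_out, sim_threshold, gain_threshold)
        for sim, p_in, p_out in results
    )
    return hits / len(results)


sim = tanimoto({1, 2, 3, 4}, {2, 3, 4, 5})  # 3 shared bits out of 5 distinct bits
results = [
    (sim, 0.50, 0.90),   # similar and clearly improved
    (0.2, 0.50, 0.95),   # improved, but too dissimilar
    (0.7, 0.80, 0.82),   # similar, but barely improved
]
print(success_rate(results, sim_threshold=0.4, gain_threshold=0.1))
```

Only the first generated molecule satisfies both conditions, so one of three counts as a success under these placeholder thresholds.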

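Among these metrics, similarity and property improvement are the most basic. For all metrics except running time and model size, larger values are better.

Results

CORE outperforms the other methods on all evaluation metrics. Specifically, on success rate (SR), CORE achieves an absolute improvement of about 2% over the best baseline. On SR2, it achieves a relative improvement of more than 10% on QED and DRD2.

The test subset with rare substructures is more challenging: performance drops on this subset for every method. On it, CORE achieves larger gains than on the full test set. Specifically, its success rate (SR2) on QED and DRD2 improves by 21% and 18% in relative terms, and SR (both SR1 and SR2) improves by more than 3% in absolute terms. In short, CORE gains more on rare substructures than on the full test set.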
