1. 类及其继承关系
Model / \ / \ MLP GeneralizedModel / \ / \ Node2VecModel SampleAndAggregate
首先看Model, GeneralizedModel, SampleAndAggregate这三个类的联系。
其中Model与 GeneralizedModel的区别在于,Model的build()函数中搭建了序列层模型,而在GeneralizedModel中被删去。self.ouput必须在GeneralizedModel的子类build()中被赋值。
class Model(object) 中的build()函数如下:
1 def build(self): 2 """ Wrapper for _build() """ 3 with tf.variable_scope(self.name): 4 self._build() 5 6 # Build sequential layer model 7 self.activations.append(self.inputs) 8 for layer in self.layers: 9 hidden = layer(self.activations[-1]) 10 self.activations.append(hidden) 11 self.outputs = self.activations[-1] 12 # 这部分sequential layer model模型在GeneralizedModel的build()中被删去 13 14 # Store model variables for easy access 15 variables = tf.get_collection( 16 tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name) 17 self.vars = {var.name: var for var in variables} 18 19 # Build metrics 20 self._loss() 21 self._accuracy() 22 23 self.opt_op = self.optimizer.minimize(self.loss)
序列层实现的功能是,给输入,通过layer()返回输出,又将这个输出再次作为输入到下一个layer()中,最终,取最后一层layer的结果作为output.
2. class SampleAndAggregate(GeneralizedModel)
1. __init__():
(1) self.features的由来:
para: features tf.get_variable()-> identity features | | self.features self.embeds --> At least one is not None \ / --> Concat if both are not None \ / \ / self.features
(2) self.dims:
self.dims是一个list, 每一位记录各个神经网络层的维数。
self.dims[0]的值相当于self.features的列数 (0 if features is None else features.shape[1]) + identity_dim),(注意:括号里features为传入的参数,而非self.features)
之后各位为各层output_dim,也就是hidden units的个数。
(3) __init()__函数代码
1 def __init__(self, placeholders, features, adj, degrees, 2 layer_infos, concat=True, aggregator_type="mean", 3 model_size="small", identity_dim=0, 4 **kwargs): 5 ''' 6 Args: 7 - placeholders: Stanford TensorFlow placeholder object. 8 - features: Numpy array with node features. 9 NOTE: Pass a None object to train in featureless mode (identity features for nodes)! 10 - adj: Numpy array with adjacency lists (padded with random re-samples) 11 - degrees: Numpy array with node degrees. 12 - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all 13 the recursive layers. See SAGEInfo definition above. 14 - concat: whether to concatenate during recursive iterations 15 - aggregator_type: how to aggregate neighbor information 16 - model_size: one of "small" and "big" 17 - identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy) 18 ''' 19 super(SampleAndAggregate, self).__init__(**kwargs) 20 if aggregator_type == "mean": 21 self.aggregator_cls = MeanAggregator 22 elif aggregator_type == "seq": 23 self.aggregator_cls = SeqAggregator 24 elif aggregator_type == "maxpool": 25 self.aggregator_cls = MaxPoolingAggregator 26 elif aggregator_type == "meanpool": 27 self.aggregator_cls = MeanPoolingAggregator 28 elif aggregator_type == "gcn": 29 self.aggregator_cls = GCNAggregator 30 else: 31 raise Exception("Unknown aggregator: ", self.aggregator_cls) 32 33 # get info from placeholders... 34 self.inputs1 = placeholders["batch1"] 35 self.inputs2 = placeholders["batch2"] 36 self.model_size = model_size 37 self.adj_info = adj 38 if identity_dim > 0: 39 self.embeds = tf.get_variable( 40 "node_embeddings", [adj.get_shape().as_list()[0], identity_dim]) 41 # self.embeds: record the neigh features embeddings 42 # number of features = identity_dim 43 # number of neighbors = adj.get_shape().as_list()[0] 44 else: 45 self.embeds = None 46 if features is None: 47 if identity_dim == 0: 48 raise Exception( 49 "Must have a positive value for identity feature dimension if no input features given.") 50 self.features = self.embeds 51 else: 52 self.features = tf.Variable(tf.constant( 53 features, dtype=tf.float32), trainable=False) 54 if not self.embeds is None: 55 self.features = tf.concat([self.embeds, self.features], axis=1) 56 self.degrees = degrees 57 self.concat = concat 58 59 self.dims = [ 60 (0 if features is None else features.shape[1]) + identity_dim] 61 self.dims.extend( 62 [layer_infos[i].output_dim for i in range(len(layer_infos))]) 63 self.batch_size = placeholders["batch_size"] 64 self.placeholders = placeholders 65 self.layer_infos = layer_infos 66 67 self.optimizer = tf.train.AdamOptimizer( 68 learning_rate=FLAGS.learning_rate) 69 70 self.build()
(2) sample(inputs, layer_infos, batch_size=None)
对于sample的算法描述,详见论文Appendix A, algorithm 2.
代码:
1 def sample(self, inputs, layer_infos, batch_size=None): 2 """ Sample neighbors to be the supportive fields for multi-layer convolutions. 3 4 Args: 5 inputs: batch inputs 6 batch_size: the number of inputs (different for batch inputs and negative samples). 7 """ 8 9 if batch_size is None: 10 batch_size = self.batch_size 11 samples = [inputs] 12 # size of convolution support at each layer per node 13 support_size = 1 14 support_sizes = [support_size] 15 16 for k in range(len(layer_infos)): 17 t = len(layer_infos) - k - 1 18 support_size *= layer_infos[t].num_samples 19 sampler = layer_infos[t].neigh_sampler 20 21 node = sampler((samples[k], layer_infos[t].num_samples)) 22 samples.append(tf.reshape(node, [support_size * batch_size, ])) 23 support_sizes.append(support_size) 24 25 return samples, support_sizes
sampler = layer_infos[t].neigh_sampler
当函数被调用时,layer_infos会被赋值,在unsupervised_train.py中,其中neigh_sampler被赋为UniformNeighborSampler,其在neigh_samplers.py中定义:class UniformNeighborSampler(Layer)。
目的是对于输入的samples[k] (即为上一步sample得到的节点,如上图依次得到黄色区域表示的samples[0],橙色区域表示的samples[1], 粉色区域表示的samples[2]。其中samples[k]是有由对samples[k - 1]中各节点的邻居采样而得),选取num_samples个数的邻居节点的序号(对应上图N(u))。(返回值是adj_lists, 即为被截断为num_samples列数的邻接矩阵。)
这里注意区别support_size与num_samples:
num_sample为当前深度每个节点u所选取的邻居节点的个数为num_samples;
support_size表示当前节点u的embedding受多少节点信息的影响。其既受当前层num_samples个直接邻居的影响,其邻居也受更先前深度num_samples个邻居的影响,以此类推。故support_size是到目前深度为止的各深度下num_samples的连乘积。则对于batch_size个输入节点,总的support个数为: support_size * batch_size。
最后将support_size存进support_sizes的数组中。
sample() 函数最终返回包含各深度下采样点的samples数组与各深度下各点受支持节点数目的support_sizes数组。
2. def _build(self):
1 self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler( 2 true_classes=labels, 3 num_true=1, 4 num_sampled=FLAGS.neg_sample_size, 5 unique=False, 6 range_max=len(self.degrees), 7 distortion=0.75, 8 unigrams=self.degrees.tolist()))
(1) tf.nn.fixed_unigram_candidate_sampler:
按照用户提供的概率分布进行采样。 如果类别服从均匀分布,我们就用uniform_candidate_sampler; 如果词作类别,我们知道词服从 Zipfian, 我们就用 log_uniform_candidate_sampler; 如果能够通过统计或者其他渠道知道类别满足某些分布,用 nn.fixed_unigram_candidate_sampler; 如果实在不知道类别分布,我们还可以用 tf.nn.learned_unigram_candidate_sampler。 (2) Paras: a. num_sampled: sampling_candidates的元素是在没有替换(如果unique = True)的情况下绘制的, 或者是从基本分布中替换(如果unique = False)。 unique = True 可以看作无放回抽样;unique = False 可以看作有放回抽样。 b. distortion: distortion used the word2vec freq energy table formulation f^(3/4) / total(f^(3/4)) in word2vec energy counted by freq; in graphsage energy counted by degrees so in unigrams = [] each ID recored each node's degree c. unigrams: 各个节点的度。 (3) Returns: a. sampled_candidates: A tensor of type int64 and shape [num_sampled]. The sampled classes. b. true_expected_count: A tensor of type float. Same shape as true_classes. The expected counts under the sampling distribution of each of true_classes. c. sampled_expected_count: A tensor of type float. Same shape as sampled_candidates. The expected counts under the sampling distribution of each of sampled_candidates.