TensorFlow GPU Cluster Training Configuration: ConfigProto

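The most common way to parallelize deep learning training is data parallelism: each TensorFlow task trains the same model on a different mini-batch of data, and the model's shared parameters are then updated on a parameter server. TensorFlow supports two training modes, synchronous and asynchronous.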

In asynchronous training, the task on each node trains independently, and no coordination between tasks is required.

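In synchronous training, the task on each node reads the shared parameters and computes gradients in parallel; the results from all tasks are then merged into a single update of the shared parameters.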
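The difference between the two modes can be sketched in plain Python, without TensorFlow. This is a toy simulation, not how TensorFlow implements parameter servers: the one-parameter model, the `grad` function, the learning rate, and the shard values are all assumptions made for illustration.

```python
# Toy data-parallel training of a single weight w for the model y = w * x.
# Everything here (model, data, learning rate) is a made-up illustration.

def grad(w, shard):
    """Mean gradient of the squared error 0.5 * (w*x - y)^2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def sync_step(w, shards, lr):
    """Synchronous: every worker reads the same w, the gradients are averaged,
    and the shared parameter is updated once."""
    grads = [grad(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)

def async_step(w, shards, lr):
    """Asynchronous: each worker applies its own update as soon as it is
    ready, so later workers see a parameter already changed by earlier ones.
    The worker order is fixed here only to keep the example deterministic."""
    for shard in shards:
        w = w - lr * grad(w, shard)
    return w

# Two workers, each holding a mini-batch drawn from y = 2 * x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w_sync = w_async = 0.0
for _ in range(200):
    w_sync = sync_step(w_sync, shards, lr=0.05)
    w_async = async_step(w_async, shards, lr=0.05)

# Both runs approach the true weight 2.0, along different update paths.
print(round(w_sync, 4), round(w_async, 4))
```

In this toy setting both modes reach the same answer; the point is only that the synchronous step waits for all gradients, while the asynchronous step lets each worker update the shared value on its own.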

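A TensorFlow cluster consists of a set of tasks that execute TensorFlow graph computations. Each task is associated with a TensorFlow server, which is used to create TensorFlow sessions and run graph computations. A cluster can also be divided into one or more jobs, each containing one or more tasks. In a typical cluster, one task runs on one machine. If a machine has several GPUs, multiple tasks can run on it, and the application controls which GPU each task uses.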
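A cluster with this job/task structure is usually described as a mapping from job names to task addresses. The sketch below uses a plain dictionary in the same shape that `tf.train.ClusterSpec` accepts; the host names, ports, and the `gpu_for_task` helper are hypothetical. Restricting a process to one GPU through the `CUDA_VISIBLE_DEVICES` environment variable is one common way for an application to control which GPU a task uses, not the only one.

```python
# Hypothetical cluster: one parameter-server job and one worker job.
# Both worker tasks share the machine "host-b", which is assumed to have 2 GPUs.
cluster = {
    "ps": ["host-a:2222"],
    "worker": ["host-b:2222", "host-b:2223"],
}

def list_tasks(cluster):
    """Return (job_name, task_index, address) for every task in the cluster."""
    return [(job, index, address)
            for job, addresses in sorted(cluster.items())
            for index, address in enumerate(addresses)]

def gpu_for_task(cluster, job, task_index):
    """Pick a GPU id for a task: tasks on the same host get consecutive ids."""
    host = cluster[job][task_index].split(":")[0]
    same_host = [i for i, addr in enumerate(cluster[job])
                 if addr.split(":")[0] == host]
    return same_host.index(task_index)

for job, index, address in list_tasks(cluster):
    print(job, index, address)

# Environment a launcher could set before starting worker task 1, so that the
# process only sees the GPU chosen for it.
env = {"CUDA_VISIBLE_DEVICES": str(gpu_for_task(cluster, "worker", 1))}
print(env)
```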

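Some TensorFlow operations have both a CPU and a GPU implementation. When such an operation is assigned to a device, the GPU takes priority. For example, matmul has both CPU and GPU kernels, so on a machine with cpu:0 and gpu:0, a matmul operation is assigned to gpu:0 by default.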

1. Logging device placement: tf.ConfigProto(log_device_placement=True)

Setting log_device_placement=True in tf.ConfigProto() shows which device (which CPU or which GPU) each operation and tensor is assigned to. The placement of every operation is printed to the terminal.

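import tensorflow as tf
# TF 1.x pads a value list that is shorter than the shape with its last element.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0], shape=[3, 4], name='a')
b = tf.constant([2.0, 4.0, 6.0, 8.0, 10.0], shape=[4, 3], name='b')
c = tf.matmul(a, b)
# sess = tf.Session()  # plain session, no placement logging
# Log the device that each operation is placed on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))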

2. Manual device placement

By default, the system assigns operations to CPU:0 or GPU:0; the GPU with the lowest ID is used first.

If you would rather place operations yourself than let the system do it, use with tf.device to create a device context. Every operation created inside that context runs on the device it names.

For example, to place the operations on GPU:1:

import tensorflow as tf
with tf.device('/gpu:1'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0], shape=[3, 4], name='a')
  b = tf.constant([2.0, 4.0, 6.0, 8.0, 10.0], shape=[4, 3], name='b')
  c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

>>>[[ 88. 94. 96.]
[150. 170. 180.]
[150. 170. 180.]]
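The printed result may look surprising, because only five values are given for a 12-element tensor. In TensorFlow 1.x, `tf.constant` fills the remaining entries with the last value in the list when the list is shorter than the shape requires. The following plain-Python sketch reproduces that padding and the matrix product without TensorFlow; `fill_constant` and `matmul` are illustrative helpers, not TensorFlow functions.

```python
def fill_constant(values, shape):
    """Mimic TF 1.x tf.constant padding: repeat the last value to fill shape."""
    rows, cols = shape
    flat = values + [values[-1]] * (rows * cols - len(values))
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

a = fill_constant([1.0, 2.0, 3.0, 4.0, 5.0], (3, 4))
b = fill_constant([2.0, 4.0, 6.0, 8.0, 10.0], (4, 3))
c = matmul(a, b)
print(c)  # [[88.0, 94.0, 96.0], [150.0, 170.0, 180.0], [150.0, 170.0, 180.0]]
```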

If the device you specify does not exist, you get an InvalidArgumentError:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'MatMul':

Operation was explicitly assigned to/device:GPU:2 but available devices are [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device. [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:2"](a, b)]]
The error reports that GPU:2 cannot be found, and it also lists the devices that do exist on the machine: CPU:0, GPU:0, and GPU:1.

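To avoid failing when a specified device does not exist, set allow_soft_placement to True when creating the session. TensorFlow is then allowed to pick an existing, supported device to run the operation automatically.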

sess = tf.Session(config=tf.ConfigProto(
      allow_soft_placement=True, log_device_placement=True))
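The placement rules described so far (GPU preferred by default, explicit assignment through `tf.device`, and fallback under `allow_soft_placement`) can be summarised in a toy model. This is a simplified sketch in plain Python, not TensorFlow's actual placer; the function `place` and its arguments are hypothetical, and it raises `ValueError` where TensorFlow would raise `InvalidArgumentError`.

```python
def place(op_kernels, available, requested=None, allow_soft_placement=False):
    """Choose a device for one operation.

    op_kernels: device types the op has kernels for, e.g. {"CPU", "GPU"}.
    available:  device names present on the machine, e.g. ["CPU:0", "GPU:0"].
    requested:  device name set explicitly by the user, or None.
    """
    def default_choice():
        # Prefer the lowest-numbered GPU when the op has a GPU kernel,
        # otherwise fall back to the lowest-numbered CPU.
        for kind in ("GPU", "CPU"):
            candidates = [d for d in available
                          if d.startswith(kind + ":") and kind in op_kernels]
            if candidates:
                return min(candidates, key=lambda d: int(d.split(":")[1]))
        raise ValueError("no available device supports this operation")

    if requested is None:
        return default_choice()
    kind = requested.split(":")[0]
    if requested in available and kind in op_kernels:
        return requested
    if allow_soft_placement:
        # Soft placement: ignore the invalid request and pick a valid device.
        return default_choice()
    raise ValueError("cannot assign a device for operation: " + requested)

devices = ["CPU:0", "GPU:0", "GPU:1"]
print(place({"CPU", "GPU"}, devices))                      # GPU:0
print(place({"CPU", "GPU"}, devices, requested="GPU:1"))   # GPU:1
print(place({"CPU", "GPU"}, devices, requested="GPU:2",
            allow_soft_placement=True))                    # GPU:0
```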

Using multiple GPUs

To run TensorFlow on several GPUs, you can build a multi-tower structure in which each tower is assigned to a different GPU.

import tensorflow as tf

c = []
for d in ['/gpu:0', '/gpu:1']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0], shape=[3, 4], name='a')
    b = tf.constant([2.0, 4.0, 6.0, 8.0, 10.0], shape=[4, 3], name='b')
    c.append(tf.matmul(a, b))  # a * b is computed once on /gpu:0 and once on /gpu:1

with tf.device('/cpu:0'):
  total = tf.add_n(c)  # element-wise sum of the two products, computed on CPU:0

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(total))

>>>[[176. 188. 192.]
[300. 340. 360.]
[300. 340. 360.]]
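In practice each tower usually processes a different slice of the batch rather than the same constants. A minimal sketch of that pattern in plain Python follows; `split_batch`, `tower_fn`, and the sum-of-squares workload are stand-ins chosen for illustration, and the device strings are only labels here, since nothing runs on a GPU.

```python
def split_batch(batch, num_towers):
    """Split a batch into num_towers contiguous, nearly equal shards."""
    size, extra = divmod(len(batch), num_towers)
    shards, start = [], 0
    for i in range(num_towers):
        end = start + size + (1 if i < extra else 0)
        shards.append(batch[start:end])
        start = end
    return shards

def tower_fn(shard):
    """Stand-in for the per-GPU computation: sum of squares of the shard."""
    return sum(x * x for x in shard)

devices = ['/gpu:0', '/gpu:1']
batch = [1.0, 2.0, 3.0, 4.0, 5.0]

# One partial result per tower, analogous to the list c in the example above.
shards = split_batch(batch, len(devices))
partials = {d: tower_fn(s) for d, s in zip(devices, shards)}
# Combining the partial results plays the role of tf.add_n on the CPU.
total = sum(partials.values())
print(partials, total)
```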

Controlling GPU resource usage

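By default, a TensorFlow session claims nearly all of the memory on the GPU. TensorFlow provides two configuration options to limit this. The first, config.gpu_options.allow_growth, allocates GPU memory on demand: the program starts with a small allocation and grows it as the run requires, but memory that has been allocated is never released.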

sess = tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True)))

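The other option, per_process_gpu_memory_fraction, caps the allocation at a fraction of GPU memory; for example, 0.5 allows the process to use 50% of the GPU memory.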

sess = tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5)))
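Both options live on the same `tf.GPUOptions` message, so they can be assembled in one place. The helper below is a hypothetical convenience function, not part of TensorFlow: it only validates the values and returns keyword arguments, which keeps the sketch runnable without a GPU. The commented lines show how the result would be passed to TensorFlow 1.x.

```python
def gpu_option_kwargs(allow_growth=False, memory_fraction=None):
    """Build keyword arguments for tf.GPUOptions.

    memory_fraction is the share of each visible GPU's memory that the
    process may use; it must lie in the interval (0, 1].
    """
    kwargs = {"allow_growth": bool(allow_growth)}
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        kwargs["per_process_gpu_memory_fraction"] = float(memory_fraction)
    return kwargs

kwargs = gpu_option_kwargs(allow_growth=True, memory_fraction=0.5)
print(kwargs)

# With TensorFlow 1.x installed, the options would be applied like this:
# import tensorflow as tf
# config = tf.ConfigProto(gpu_options=tf.GPUOptions(**kwargs))
# sess = tf.Session(config=config)
```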

Using TensorFlow with HDFS

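HDFS (Hadoop Distributed File System) is a core subproject of Hadoop. It is a highly fault-tolerant distributed file system that provides high-throughput data access, which makes it well suited to applications that work on large data sets.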

# Set up the Java and Hadoop environment variables
source $HADOOP_HOME/libexec/hadoop-config.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server

# Run the TensorFlow model with the Hadoop jars on the CLASSPATH
CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob) python tensorflow_model.py

import tensorflow as tf

# Define the queue of input files in the TensorFlow model
filename_queue = tf.train.string_input_producer(["hdfs://namenode:8020/path/to/fileA.csv", "hdfs://namenode:8020/path/to/fileB.csv"])

# Read one line from a file; value holds the contents of that line
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Decode value into columns; record_defaults sets the decoding format and the data type of each column
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(value, record_defaults=record_defaults)
# tf.pack was renamed to tf.stack in TensorFlow 1.0
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
  # Create a coordinator and start the threads that push the HDFS file names into the queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(1200):
    # Read one line of data from the file queue
    example, label = sess.run([features, col5])

  # Ask the queue threads (both enqueue and dequeue threads) to stop
  coord.request_stop()
  # Wait for the queue threads (both enqueue and dequeue threads) to finish
  coord.join(threads)
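The role of `record_defaults` is easy to miss: it fixes both the number of columns and the type of each column, and it supplies the value used when a field is empty. The plain-Python sketch below imitates that behaviour for simple, unquoted CSV lines; `decode_csv_line` is an illustrative helper, not the TensorFlow op, and it ignores CSV quoting rules. The sample line is made up.

```python
def decode_csv_line(line, record_defaults):
    """Parse one CSV line; each default gives the column's type and fallback."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d fields, got %d"
                         % (len(record_defaults), len(fields)))
    columns = []
    for field, (default,) in zip(fields, record_defaults):
        # An empty field takes the default; otherwise cast to the default's type.
        columns.append(default if field == "" else type(default)(field))
    return columns

record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = decode_csv_line("7,8,,10,0", record_defaults)
features = [col1, col2, col3, col4]   # same grouping as the features tensor above
label = col5
print(features, label)  # [7, 8, 1, 10] 0
```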

Reference: http://www.tensorfly.cn/