[Mask R-CNN] Errors encountered when training on multiple GPUs

1. Loaded runtime CuDNN library: 7103 (compatibility version 7100) but source was compiled with 7005 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
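Analysis: a CUDA/cuDNN version mismatch. The cuDNN library loaded at runtime (7103) is not the version the TensorFlow binary was compiled against (7005). Reinstall the CUDA toolkit and cuDNN with conda: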

conda install cudatoolkit=9.0

conda install cudnn=7.1.2

Installation reference: http://www.sohu.com/a/225953058_491081
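
The integer codes in the error message can be decoded to make the mismatch easier to read. The sketch below assumes the cuDNN 7.x encoding from `cudnn.h` (major * 1000 + minor * 100 + patch); `decode_cudnn_version` is a hypothetical helper written for illustration, not part of TensorFlow or cuDNN.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7.x integer version code into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


loaded = decode_cudnn_version(7103)    # runtime library named in the error
compiled = decode_cudnn_version(7005)  # version TensorFlow was compiled with

print("loaded:   %d.%d.%d" % loaded)
print("compiled: %d.%d.%d" % compiled)
print("same major.minor:", loaded[:2] == compiled[:2])
```

Read this way, the message says the runtime library is 7.1.3 while the binary expects 7.0.5, so the two differ in minor version.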

2. InvalidArgumentError (see above for traceback): Integer division by zero
         [[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/cpu:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
Analysis: upgrade tensorflow-gpu to version 1.7.

3. Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.13GiB. The caller indicates that this ...

Solutions:
Option 1: use a smaller batch_size.

Option 2 (I have not tried this):

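import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front (TensorFlow 1.x API)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)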

Reference: https://stackoverflow.com/questions/36927607/how-can-i-solve-ran-out-of-gpu-memory-in-tensorflow
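
For option 1, here is a minimal sketch of how the batch size might be reduced in a multi-GPU setup. It assumes a config convention in which the effective batch size is `GPU_COUNT * IMAGES_PER_GPU`; the post does not say which Mask R-CNN implementation was used, so the class and attribute names here are hypothetical stand-ins, not a real library API.

```python
class TrainingConfig(object):
    """Hypothetical stand-in for a Mask R-CNN training config."""
    GPU_COUNT = 2        # number of GPUs used for training
    IMAGES_PER_GPU = 2   # images processed on each GPU per step

    @property
    def batch_size(self):
        # Effective batch size across all GPUs (assumed convention)
        return self.GPU_COUNT * self.IMAGES_PER_GPU


class SmallBatchConfig(TrainingConfig):
    """Same setup, but with fewer images per GPU to lower memory use."""
    IMAGES_PER_GPU = 1


print(TrainingConfig().batch_size)
print(SmallBatchConfig().batch_size)
```

Lowering the per-GPU image count rather than the GPU count keeps every device in use while shrinking the memory each one needs per step.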

Reposted from blog.csdn.net/qq_30159015/article/details/83019001