Ubuntu 18.04 + CUDA 9.0 + MATLAB R2018a crash notes (unresolved, IRQ #211)

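Following up on my previous post: MATLAB was running fine until, apparently at the plotting stage, something went wrong. The error log is below: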

Sep 13 14:31:09 hp-server2 kernel: [ 1296.672108] NVRM: GPU at PCI:0000:21:00: GPU-7ce0c4e1-86a8-fe64-288b-da563f52cc95
Sep 13 14:31:09 hp-server2 kernel: [ 1296.672110] NVRM: GPU Board Serial Number:
Sep 13 14:31:09 hp-server2 kernel: [ 1296.672113] NVRM: Xid (PCI:0000:21:00): 62, 26c28(86c4) 00000000 00000000
Sep 13 14:31:13 hp-server2 kernel: [ 1300.661328] NVRM: Xid (PCI:0000:21:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0): Illegal Instruction Encoding
Sep 13 14:31:13 hp-server2 kernel: [ 1300.661339] NVRM: Xid (PCI:0000:21:00): 13, Graphics SM Global Exception on (GPC 4, TPC 0): Physical Multiple Warp Errors
Sep 13 14:31:13 hp-server2 kernel: [ 1300.661347] NVRM: Xid (PCI:0000:21:00): 13, Graphics Exception: ESR 0x524648=0x3d0009 0x524650=0x4 0x524644=0xd3eff2 0x52464c=0x17f
Sep 13 14:31:13 hp-server2 kernel: [ 1300.661378] NVRM: Xid (PCI:0000:21:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 2): Illegal Instruction Encoding
Sep 13 14:31:13 hp-server2 kernel: [ 1300.661387] NVRM: Xid (PCI:0000:21:00): 13, Graphics Exception: ESR 0x525648=0x3f0009 0x525650=0x0 0x525644=0xd3eff2 0x52564c=0x17f
Sep 13 14:31:13 hp-server2 kernel: [ 1300.661728] NVRM: Xid (PCI:0000:21:00): 13, Graphics Exception: ChID 0018, Class 0000c197, Offset 00002390, Data 42b60000
Sep 13 14:31:17 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
Sep 13 14:31:17 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) NVIDIA(0):     recover...
Sep 13 14:31:21 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (II) event2  - PixArt HP USB Optical Mouse: SYN_DROPPED event - some input events have been lost.
Sep 13 14:31:24 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) NVIDIA(GPU-0): Failed to initialize DMA.
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) NVIDIA(0): Failed to allocate push buffer
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) NVIDIA(0): Error recovery failed.
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) NVIDIA(0):  *** Aborting ***
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: Fatal server error:
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) Failed to recover from error!
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: Please consult the The X.Org Foundation support
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: #011 at http://wiki.x.org
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]:  for help.
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) Backtrace:
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4d) [0x55ffdd3208ad]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 1: /usr/lib/xorg/Xorg (0x55ffdd168000+0x1bc649) [0x55ffdd324649]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 2: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f1a4261e000+0x12890) [0x7f1a42630890]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 3: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (0x7f1a3ce52000+0xcd6a9) [0x7f1a3cf1f6a9]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 4: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (0x7f1a3ce52000+0xcd79f) [0x7f1a3cf1f79f]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 5: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (0x7f1a3ce52000+0xb935d) [0x7f1a3cf0b35d]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) 6: /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (0x7f1a3ce52000+0x5d5aa2) [0x7f1a3d427aa2]
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) Segmentation fault at address 0xc
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: FatalError re-entered, aborting
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE) Caught signal 11 (Segmentation fault). Server aborting
Sep 13 14:31:25 hp-server2 /usr/lib/gdm3/gdm-x-session[1888]: (EE)
Sep 13 14:31:31 hp-server2 nautilus-deskto[2320]: nautilus-desktop: Fatal IO error 11 (资源暂时不可用) on X server :1.
Sep 13 14:31:31 hp-server2 nautilus[3010]: nautilus: Fatal IO error 11 (资源暂时不可用) on X server :1.
Sep 13 14:31:31 hp-server2 at-spi-bus-launcher[2002]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":1"
Sep 13 14:31:31 hp-server2 at-spi-bus-launcher[2002]:       after 228 requests (228 known processed) with 1 events remaining.
Sep 13 14:31:31 hp-server2 gsd-wacom[2249]: gsd-wacom: Fatal IO error 11 (资源暂时不可用) on X server :1.
Sep 13 14:31:31 hp-server2 gsd-xsettings[2238]: gsd-xsettings: Fatal IO error 11 (资源暂时不可用) on X server :1.
Sep 13 14:31:31 hp-server2 gsd-clipboard[2258]: gsd-clipboard: Fatal IO error 11 (资源暂时不可用) on X server :1.
Sep 13 14:31:31 hp-server2 gsd-media-keys[2269]: gsd-media-keys: Fatal IO error 11 (资源暂时不可用) on X server :1.


Sep 13 14:31:56 hp-server2 kernel: [ 1343.593689] irq 211: nobody cared (try booting with the "irqpoll" option)
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593698] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           OE    4.15.0-34-generic #37-Ubuntu
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593700] Hardware name: HP HP Z6 G4 Workstation/81C6, BIOS P60 v01.61 06/18/2018
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593701] Call Trace:
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593703]  <IRQ>
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593711]  dump_stack+0x63/0x8b
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593717]  __report_bad_irq+0x35/0xc0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593721]  note_interrupt+0x24b/0x2a0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593726]  handle_irq_event_percpu+0x54/0x80
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593729]  handle_irq_event+0x3b/0x60
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593733]  handle_edge_irq+0x7c/0x190
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593737]  handle_irq+0x20/0x30
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593741]  do_IRQ+0x4e/0xd0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593746]  common_interrupt+0x84/0x84
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593747]  </IRQ>
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593754] RIP: 0010:cpuidle_enter_state+0xa7/0x2f0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593756] RSP: 0018:ffffaeba864d7e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593759] RAX: ffff8ae37f362880 RBX: 00000138d46c9fca RCX: 000000000000001f
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593761] RDX: 00000138d46c9fca RSI: fffffff6a56ebb6c RDI: 0000000000000000
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593763] RBP: ffffaeba864d7ea8 R08: 00000000ffffffff R09: 0000000000000004
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593765] R10: ffffaeba864d7e38 R11: 0000000000000006 R12: ffffceba7f742b00
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593766] R13: 0000000000000001 R14: ffffffffba571b58 R15: 0000000000000000
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593772]  ? cpuidle_enter_state+0x97/0x2f0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593776]  cpuidle_enter+0x17/0x20
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593782]  call_cpuidle+0x23/0x40
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593785]  do_idle+0x18c/0x1f0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593789]  cpu_startup_entry+0x73/0x80
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593794]  start_secondary+0x1ab/0x200
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593798]  secondary_startup_64+0xa5/0xb0
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593801] handlers:
Sep 13 14:31:56 hp-server2 kernel: [ 1343.593995] [<000000005cd2cbce>] nvidia_isr [nvidia] threaded [<00000000a6cb3a47>] nvidia_isr_kthread_bh [nvidia]
Sep 13 14:31:56 hp-server2 kernel: [ 1343.594186] Disabling IRQ #211

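This happened several times in a row, and each time the graphical desktop crashed. The only way to recover CUDA and the desktop environment was to reinstall CUDA (which also reinstalls the driver automatically) and then reboot.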
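Since the crash repeats, it can help to check whether the kernel disables the same IRQ each time. The following is only a minimal sketch: it assumes the two kernel message formats seen in the log above ("irq N: nobody cared" and "Disabling IRQ #N"), and the function name `find_disabled_irqs` is hypothetical.

```python
import re

# Patterns taken from the kernel messages in the log above.
NOBODY_CARED_RE = re.compile(r"irq (\d+): nobody cared")
DISABLING_RE = re.compile(r"Disabling IRQ #(\d+)")

def find_disabled_irqs(lines):
    """Return sorted IRQ numbers that were both reported and disabled."""
    reported, disabled = set(), set()
    for line in lines:
        m = NOBODY_CARED_RE.search(line)
        if m:
            reported.add(int(m.group(1)))
        m = DISABLING_RE.search(line)
        if m:
            disabled.add(int(m.group(1)))
    return sorted(reported & disabled)

# Two lines copied from the log above, used as sample input.
sample = [
    'Sep 13 14:31:56 hp-server2 kernel: [ 1343.593689] irq 211: nobody cared (try booting with the "irqpoll" option)',
    "Sep 13 14:31:56 hp-server2 kernel: [ 1343.594186] Disabling IRQ #211",
]
print(find_disabled_irqs(sample))
```

To scan a real log, pass an open file object (for example the system log) instead of the sample list.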

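The trigger is now clear: when MATLAB draws a figure, the graphical desktop crashes to a black screen. I have not found a proper fix. Fortunately, reinstalling CUDA via apt restores everything each time, so for now I have stopped using MATLAB on Ubuntu as a way around the problem.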

For reference, the MATLAB code I was running is:

https://github.com/yaksoy/SemanticSoftSegmentation

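The first few runs worked and the figures rendered, but after that every run crashed, so MATLAB is no longer usable on this Ubuntu install. Without CUDA and the NVIDIA driver, the open-source driver works and the desktop is stable, but MATLAB plotting always fails with a "low-level error" and draws nothing. So with the NVIDIA driver, plotting crashes the desktop; without it, plotting does not work at all. I will use MATLAB on Windows instead.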

Update:

After running a TensorFlow program several times, it crashed as well:

(py3-env) dww@hp-server2:~/workspace/python/SIGGRAPH18SSS$ bash run_extract_feat.sh
2018-09-14 15:43:33.909929: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-09-14 15:43:34.144907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:21:00.0
totalMemory: 10.91GiB freeMemory: 10.43GiB
2018-09-14 15:43:34.144936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-09-14 15:43:34.345380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-14 15:43:34.345417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0
2018-09-14 15:43:34.345423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N
2018-09-14 15:43:34.345644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10088 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:21:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/dww/workspace/python/SIGGRAPH18SSS/deeplab_resnet/hc_deeplab.py:140: calling expand_dims (from tensorflow.python.ops.array_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
 [*] Loading checkpoints...
 [*] Load SUCCESS
0 Processing ./samples/girl.png
2018-09-14 15:44:16.075987: E tensorflow/stream_executor/cuda/cuda_driver.cc:1078] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2018-09-14 15:44:16.076019: E tensorflow/stream_executor/cuda/cuda_timer.cc:55] Internal: error destroying CUDA event in context 0x555a2e20b070: CUDA_ERROR_LAUNCH_FAILED
2018-09-14 15:44:16.076025: E tensorflow/stream_executor/cuda/cuda_timer.cc:60] Internal: error destroying CUDA event in context 0x555a2e20b070: CUDA_ERROR_LAUNCH_FAILED
2018-09-14 15:44:16.076055: F tensorflow/stream_executor/cuda/cuda_dnn.cc:189] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
run_extract_feat.sh: 行 1:  8855 已放弃               (核心已转储) CUDA_VISIBLE_DEVICES=0 python main_hyper.py --data-dir ./samples

The corresponding system log entries are:

Sep 14 15:43:34 hp-server2 kernel: [76463.216598] NVRM: GPU at PCI:0000:21:00: GPU-7ce0c4e1-86a8-fe64-288b-da563f52cc95
Sep 14 15:43:34 hp-server2 kernel: [76463.216602] NVRM: GPU Board Serial Number:
Sep 14 15:43:34 hp-server2 kernel: [76463.216607] NVRM: Xid (PCI:0000:21:00): 62, 1d5e(356c) 00000000 00000000
Sep 14 15:43:44 hp-server2 kernel: [76473.110801] NVRM: Xid (PCI:0000:21:00): 31, Ch 00000019, engmask 00000101, intr 10000000
Sep 14 15:44:15 hp-server2 kernel: [76504.110368] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000003 intr 00800000
Sep 14 15:44:15 hp-server2 kernel: [76504.610361] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000003 intr 00800000
Sep 14 15:44:57 hp-server2 /usr/lib/gdm3/gdm-x-session[1575]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x000025f8, 0x00002600)
Sep 14 15:45:04 hp-server2 /usr/lib/gdm3/gdm-x-session[1575]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x000025f8, 0x00002600)
Sep 14 15:45:17 hp-server2 kernel: [76566.852288] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Sep 14 15:45:26 hp-server2 kernel: [76575.820933] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000987d:0:0
Sep 14 15:45:48 hp-server2 kernel: [76597.822133] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000987d:0:0
Sep 14 15:46:00 hp-server2 /usr/lib/gdm3/gdm-x-session[1575]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0006, 0x000025f8, 0x00002670)
Sep 14 15:46:07 hp-server2 /usr/lib/gdm3/gdm-x-session[1575]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0006, 0x000025f8, 0x00002670)
Sep 14 15:46:20 hp-server2 kernel: [76629.830225] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000987d:0:0
Sep 14 15:46:42 hp-server2 kernel: [76651.837598] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000987d:0:0

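Is this a hardware/software compatibility problem with the HP Z6 workstation? I previously used the same CUDA and TensorFlow versions without any problems. Or is there a bug in the NVIDIA driver?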

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.87  Tue Aug 21 12:33:05 PDT 2018
GCC version:  gcc version 6.4.0 20180424 (Ubuntu 6.4.0-17ubuntu1)

cat /etc/issue
Ubuntu 18.04.1 LTS \n \l

uname -a
Linux hp-server2 4.15.0-34-generic #37-Ubuntu SMP Mon Aug 27 15:21:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
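
To compare the two crashes, the Xid codes in the kernel log lines above can be tallied with a short script. This is a rough sketch, not a diagnostic tool: it assumes only the `NVRM: Xid (PCI:...): <code>,` line format seen in the logs above, and the names `XID_RE` and `count_xids` are my own.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   NVRM: Xid (PCI:0000:21:00): 62, 26c28(86c4) 00000000 00000000
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")

def count_xids(lines):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

# Lines copied from the logs above, used as sample input.
sample = [
    "Sep 13 14:31:09 hp-server2 kernel: [ 1296.672113] NVRM: Xid (PCI:0000:21:00): 62, 26c28(86c4) 00000000 00000000",
    "Sep 13 14:31:13 hp-server2 kernel: [ 1300.661328] NVRM: Xid (PCI:0000:21:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0): Illegal Instruction Encoding",
    "Sep 14 15:43:44 hp-server2 kernel: [76473.110801] NVRM: Xid (PCI:0000:21:00): 31, Ch 00000019, engmask 00000101, intr 10000000",
    "Sep 14 15:44:15 hp-server2 kernel: [76504.110368] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000003 intr 00800000",
    "Sep 14 15:44:15 hp-server2 kernel: [76504.610361] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000003 intr 00800000",
]

for (pci, code), n in sorted(count_xids(sample).items()):
    print(f"{pci}  Xid {code}: {n}")
```

In both crashes the first event is Xid 62, followed by different codes (13 in the MATLAB case; 31 and 32 in the TensorFlow case), so a per-code tally like this makes the two incidents easier to line up.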

Reposted from blog.csdn.net/u012911347/article/details/82688655