Notes on Analyzing a RAC Node Eviction

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/xxzhaobb/article/details/81877825

RDBMS 12.1.0.1

OS: Red Hat 6.6 x64

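Shortly after 10 p.m., I found that a node had been evicted. The alert logs of the two nodes are shown below, node 1 (instance 1) first: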

IPC Send timeout detected. Sender: ospid 56076 [oracle@XXXXX01 (QM03)]
Receiver: inst 2 binc 4 ospid 32706
Wed Jul 25 22:11:14 2018
IPC Send timeout detected. Sender: ospid 38149 [oracle@XXXXX01 (LMS1)]
Receiver: inst 2 binc 429439508 ospid 31767
Wed Jul 25 22:11:14 2018
IPC Send timeout detected. Sender: ospid 38153 [oracle@XXXXX01 (LMS2)]
Receiver: inst 2 binc 429439276 ospid 31771
Wed Jul 25 22:11:14 2018
IPC Send timeout detected. Sender: ospid 38145 [oracle@XXXXX01 (LMS0)]
Receiver: inst 2 binc 429438517 ospid 31763
Wed Jul 25 22:11:14 2018
IPC Send timeout to 2.1 inc 4 for msg type 65518 from opid 18
Wed Jul 25 22:11:14 2018
Communications reconfiguration: instance_number 2
Wed Jul 25 22:11:16 2018
IPC Send timeout to 2.2 inc 4 for msg type 65518 from opid 19
Wed Jul 25 22:11:16 2018
LMON (ospid: 38135) drops the IMR request from LMS1 (ospid: 38149) because inst 2 is dead.
Wed Jul 25 22:11:16 2018
IPC Send timeout to 2.3 inc 4 for msg type 65518 from opid 20
Wed Jul 25 22:11:16 2018
LMON (ospid: 38135) drops the IMR request from LMS2 (ospid: 38153) because inst 2 is dead.
Wed Jul 25 22:11:35 2018
IPC Send timeout detected. Sender: ospid 38129 [oracle@XXXXX01 (PING)]
Receiver: inst 2 binc 429437484 ospid 31747
Wed Jul 25 22:12:03 2018
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2 
Wed Jul 25 22:12:04 2018
Dumping diagnostic data in directory=[cdmp_20180725221204], requested by (instance=2, osid=31763 (LMS0)), summary=[abnormal instance termination].
Wed Jul 25 22:12:04 2018
Reconfiguration started (old inc 4, new inc 8)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen

 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out
Wed Jul 25 22:12:05 2018
Wed Jul 25 22:12:05 2018
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 0: 3 GCS shadows cancelled, 1 closed, 0 Xw survived
Wed Jul 25 22:12:05 2018
 LMS 2: 2 GCS shadows cancelled, 1 closed, 0 Xw survived
Wed Jul 25 22:12:05 2018
 LMS 1: 9 GCS shadows cancelled, 3 closed, 0 Xw survived

 Set master node info 
 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted
Wed Jul 25 22:12:06 2018
 Post SMON to start 1st pass IR
Wed Jul 25 22:12:06 2018
Instance recovery: looking for dead threads
Wed Jul 25 22:12:06 2018
Beginning instance recovery of 1 threads
 parallel recovery started with 32 processes
Wed Jul 25 22:12:08 2018
Started redo scan
Wed Jul 25 22:12:09 2018
 Submitted all GCS remote-cache requests

 Fix write in gcs resources
Wed Jul 25 22:12:09 2018
Completed redo scan
 read 55369 KB redo, 9589 data blocks need recovery
Wed Jul 25 22:12:10 2018

Reconfiguration complete (total time 5.9 secs) 

Node 2 (instance 2) alert log:

Wed Jul 25 22:11:10 2018
IPC Send timeout detected. Receiver ospid 32706 [oracle@XXXXX02 (PPA7)]
Wed Jul 25 22:11:10 2018
Errors in file /u01/oracle/diag/rdbms/XXXXX/XXXXX2/trace/XXXXX2_ppa7_32706.trc:
IPC Send timeout detected. Receiver ospid 31767 [oracle@XXXXX02 (LMS1)]
Wed Jul 25 22:11:14 2018
Errors in file /u01/oracle/diag/rdbms/XXXXX/XXXXX2/trace/XXXXX2_lms1_31767.trc:
IPC Send timeout detected. Receiver ospid 31771 [oracle@XXXXX02 (LMS2)]
Wed Jul 25 22:11:14 2018
Errors in file /u01/oracle/diag/rdbms/XXXXX/XXXXX2/trace/XXXXX2_lms2_31771.trc:
Wed Jul 25 22:12:03 2018
Detected an inconsistent instance membership by instance 1
Wed Jul 25 22:12:03 2018
Wed Jul 25 22:12:03 2018
Received an instance abort message from instance 1Wed Jul 25 22:12:03 2018
Received an instance abort message from instance 1

Received an instance abort message from instance 1
Wed Jul 25 22:12:03 2018
Wed Jul 25 22:12:03 2018
Received an instance abort message from instance 1Received an instance abort message from instance 1

Wed Jul 25 22:12:03 2018
Please check instance 1 alert and LMON trace files for detail.
Wed Jul 25 22:12:03 2018
Wed Jul 25 22:12:03 2018
Please check instance 1 alert and LMON trace files for detail.Wed Jul 25 22:12:03 2018

Please check instance 1 alert and LMON trace files for detail.Wed Jul 25 22:12:03 2018
Please check instance 1 alert and LMON trace files for detail.

Please check instance 1 alert and LMON trace files for detail.
Wed Jul 25 22:12:03 2018
LMS0 (ospid: 31763): terminating the instance due to error 481
Wed Jul 25 22:12:04 2018
System state dump requested by (instance=2, osid=31763 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/oracle/diag/rdbms/XXXXX/XXXXX2/trace/XXXXX2_diag_31743.trc
Wed Jul 25 22:12:08 2018
Instance terminated by LMS0, pid = 31763
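
To make the sequence of timeouts in logs like the ones above easier to read, the "IPC Send timeout detected" entries can be pulled out together with the timestamp line that precedes each of them. The following is only a minimal sketch: the function name `ipc_timeouts` and the regular expressions are illustrative assumptions based on the log lines shown here, not an Oracle-provided tool, and other alert log versions may format these lines differently.

```python
import re
from datetime import datetime

# Timestamp lines in this alert log look like "Wed Jul 25 22:11:14 2018".
TS_RE = re.compile(r"^\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")

# Matches both forms seen above (pattern is an assumption drawn from this excerpt):
#   "IPC Send timeout detected. Sender: ospid N [user@host (PROC)]"
#   "IPC Send timeout detected. Receiver ospid N [user@host (PROC)]"
TIMEOUT_RE = re.compile(
    r"IPC Send timeout detected\.\s*(Sender|Receiver):?\s*ospid (\d+) \[\S+ \((\w+)\)\]"
)


def ipc_timeouts(log_text):
    """Return (timestamp, role, ospid, process) for each IPC send timeout line.

    The timestamp is the most recent timestamp line seen before the entry,
    or None if the excerpt starts without one.
    """
    events = []
    current_ts = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            # %a and %b assume English day and month names (the default C locale).
            current_ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            role, ospid, proc = match.groups()
            events.append((current_ts, role, int(ospid), proc))
    return events


# A few lines copied from the two alert logs above.
sample = """\
IPC Send timeout detected. Sender: ospid 56076 [oracle@XXXXX01 (QM03)]
Receiver: inst 2 binc 4 ospid 32706
Wed Jul 25 22:11:14 2018
IPC Send timeout detected. Sender: ospid 38149 [oracle@XXXXX01 (LMS1)]
Receiver: inst 2 binc 429439508 ospid 31767
Wed Jul 25 22:11:10 2018
IPC Send timeout detected. Receiver ospid 32706 [oracle@XXXXX02 (PPA7)]
"""

for ts, role, ospid, proc in ipc_timeouts(sample):
    print(ts, role, ospid, proc)
```

Running the same function over both full alert logs would list which background processes (LMS0, LMS1, LMS2, PING and so on) reported timeouts and when, which helps line up the two nodes' views of the incident.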

Based on these symptoms, I searched MOS and found note 2008933.1, which recommends changing the following parameters:

Increase the values of the following kernel parameters as shown:
net.ipv4.ipfrag_high_thresh = 16M
net.ipv4.ipfrag_low_thresh = 15M
The values above are expressed in MB; sysctl itself expects the equivalent number of bytes.
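
Since the kernel reads these two parameters in bytes, the "16M" and "15M" figures have to be converted before they go into /etc/sysctl.conf. The sketch below shows that arithmetic. It assumes that "M" means 1024 * 1024 bytes, and the helper names `size_to_bytes` and `sysctl_lines` are made up for illustration.

```python
def size_to_bytes(text):
    """Convert a size such as '16M' to bytes (assumes binary units: 1M = 1024 * 1024)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # no suffix: already a plain byte count


# Values quoted from the MOS note above.
recommended = {
    "net.ipv4.ipfrag_high_thresh": "16M",
    "net.ipv4.ipfrag_low_thresh": "15M",
}


def sysctl_lines(settings):
    """Render the settings as 'key = bytes' lines in sysctl.conf style."""
    return [f"{key} = {size_to_bytes(value)}" for key, value in settings.items()]


for line in sysctl_lines(recommended):
    print(line)
# net.ipv4.ipfrag_high_thresh = 16777216
# net.ipv4.ipfrag_low_thresh = 15728640
```

The printed lines can be added to /etc/sysctl.conf on each node and loaded with `sysctl -p`.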

Next, I looked at the AWR report from one of the nodes.

[AWR report screenshot; image not preserved]

In the report, avg message sent queue time on ksxp (ms) is 176; ideally this value should not exceed 1. (Reference: Gao Bin (高斌), 《Oracle RAC核心技术详解》.)

[AWR report screenshot; image not preserved]

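From the AWR excerpts above, combined with the MOS note, my preliminary conclusion is that the network is the most likely cause. Since the OS-level network parameters were adjusted, no further anomalies have appeared in the month so far.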

END.
