RAC HAIP Troubleshooting Notes

Background:

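Before Oracle 11.2.0.2, private network redundancy was achieved through NIC bonding at the operating system level. Starting with 11.2.0.2, Oracle introduced a new feature called Highly Available IP, or HAIP for short. It replaces traditional OS-level NIC bonding, carries interconnect traffic in active-active mode, and provides both failover and load balancing (reducing performance problems caused by gc waits).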

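HAIP is an ohasd resource. Private network communication is especially important for an Oracle cluster, because the vast majority of node-to-node communication goes over the private network. This traffic falls roughly into two types. The first is cluster-level communication, which is continuous, has strict real-time requirements, and carries little data (for example, the network heartbeat between nodes). The second is communication between instances: Cache Fusion requires data to be transferred between the instances on different nodes, so this traffic is high-volume and needs to be fast.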

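Oracle has always recommended that the cluster private network be highly available and load balanced. However, 10g and 11gR1 did not provide such a capability, and Oracle advised users to handle it at the operating system level (bonding). Starting with Oracle 11.2.0.2, Oracle itself provides high availability and load balancing for the private network: HAIP.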

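How HAIP works:

An HAIP is simply an IP address. When the Oracle cluster software (grid) is deployed, an IP address in the 169.254.*.* range is automatically bound to each private NIC. This address is the HAIP, and database instances and ASM instances use it to communicate with each other. When one NIC fails, the HAIP bound to it moves to another private NIC on the same node, which keeps the private network highly available. Throughout the process the HAIP address itself remains in place and only the NIC it is bound to changes, so database and ASM instances keep running normally. Likewise, if a cluster has several private NICs, several HAIPs are bound across them and every NIC carries a share of the private traffic, which provides load balancing as well. HAIP is therefore more capable than traditional NIC bonding and also simpler to manage: you select the private NICs when installing the cluster, the HAIPs are bound automatically when the cluster starts, and private NICs can be added to an existing RAC.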
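Because HAIP addresses come from the 169.254.*.* link-local range, they are easy to tell apart from the statically configured private addresses on a node. The following is a minimal Python sketch of that check. The function names and the sample addresses are illustrative assumptions; in practice the address list would come from the operating system (for example, the output of `ip addr`).

```python
import ipaddress

# 169.254.0.0/16 is the link-local range that HAIP addresses are taken from.
HAIP_NETWORK = ipaddress.ip_network("169.254.0.0/16")


def is_haip_candidate(address: str) -> bool:
    """Return True if the address lies in the 169.254.0.0/16 range."""
    return ipaddress.ip_address(address) in HAIP_NETWORK


def split_addresses(addresses):
    """Split a list of IPv4 addresses into (HAIP candidates, all others)."""
    haip = [a for a in addresses if is_haip_candidate(a)]
    other = [a for a in addresses if not is_haip_candidate(a)]
    return haip, other


# Hypothetical addresses found on one node's private interface.
HAIP_ADDRS, OTHER_ADDRS = split_addresses(["192.168.2.220", "169.254.10.5"])
```

An address in this range is only a candidate: the check says nothing about whether the HAIP resource is actually online.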

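During installation, you only need to specify the private NICs.
[Screenshot: selecting the private network interfaces in the installer]
Problem: grid installation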

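Cluster installation pre-checks:
[Screenshot: pre-check results]
The errors shown above are not fatal to the cluster installation and were all ignored.
No problems occurred during cluster deployment. After installation the root scripts ran normally, and the cluster status afterwards was as follows:
[Screenshot: cluster status]
The cluster health check was as follows:
[Screenshot: cluster health check]
Problem description: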

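Local resources are normal, but the cluster resources ora.OCR.dg and ora.asm are abnormal. After shutting down the cluster and starting rac1 alone first, then rac2, the cluster resources run on rac1 and the resources on rac2 are offline. After shutting down again and starting rac2 first, then rac1, the cluster resources run on rac2 and the resources on rac1 are offline.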

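Judging from the state of the cluster resources:

On node rac1 the ASM instance is not started and ora.asm is offline. There are several directions to investigate in this situation: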

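  • The ASM spfile is corrupted

  • The ASM discovery string is incorrect, so the voting disk/OCR cannot be discovered

  • An ASMLib configuration problem

  • The ASM instances use different cluster_interconnect values; HAIP being OFFLINE on the first node prevents the ASM instance on the second node from starting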

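The CRS log on node rac2 shows the following:
[Screenshot: rac2 CRS log]
According to the cluster log, after the OHAS service starts an OKA error is reported (operating system version not supported), and the services on rac2 shut down immediately afterwards. However, the entries following the error appear to concern only the update of some system driver files, and a later review of the whole alert log shows that the error occurs only once, so it is probably not the cause of the problem.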

Cluster alert logs of the two nodes at the time of the failure:

rac1:
2020-12-23 00:15:35.374 [OHASD(18010)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 18010
2020-12-23 00:15:35.475 [OHASD(18010)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2020-12-23 00:15:35.487 [OHASD(18010)]CRS-2112: The OLR service started on node rac1.
2020-12-23 00:15:35.866 [OHASD(18010)]CRS-1301: Oracle High Availability Service started on node rac1.
2020-12-23 00:15:35.866 [OHASD(18010)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2020-12-23 00:15:36.858 [ORAAGENT(18109)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 18109
2020-12-23 00:15:36.877 [CSSDAGENT(18133)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 18133
2020-12-23 00:15:36.891 [ORAROOTAGENT(18119)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 18119
2020-12-23 00:15:36.897 [CSSDMONITOR(18140)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 18140
2020-12-23 00:15:37.453 [ORAAGENT(18222)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 18222
2020-12-23 00:15:37.817 [MDNSD(18249)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 18249
2020-12-23 00:15:37.823 [EVMD(18250)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 18250
2020-12-23 00:15:38.872 [GPNPD(18296)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 18296
2020-12-23 00:15:39.687 [GPNPD(18296)]CRS-2328: GPNPD started on node rac1. 
2020-12-23 00:15:39.909 [GIPCD(18382)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 18382
2020-12-23 00:15:43.853 [OSYSMOND(18646)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 18646
2020-12-23 00:15:43.826 [CSSDMONITOR(18639)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 18639
2020-12-23 00:15:44.310 [CSSDAGENT(18693)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 18693
2020-12-23 00:15:44.782 [OCSSD(18814)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 18814
2020-12-23 00:15:45.871 [OCSSD(18814)]CRS-1713: CSSD daemon is started in hub mode
2020-12-23 00:15:51.681 [OCSSD(18814)]CRS-1707: Lease acquisition for node rac1 number 1 completed
2020-12-23 00:15:52.792 [OCSSD(18814)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/grid_base/diag/crs/rac1/crs/trace/ocssd.trc
2020-12-23 00:15:52.793 [OCSSD(18814)]CRS-1617: The information required to do node kill for node rac1 is incomplete; details at (:CSSNM00004:) in /u01/app/grid_base/diag/crs/rac1/crs/trace/ocssd.trc
2020-12-23 00:15:52.795 [OCSSD(18814)]CRS-1605: CSSD voting file is online: /dev/mapper/asm-ocr1; details in /u01/app/grid_base/diag/crs/rac1/crs/trace/ocssd.trc.
2020-12-23 00:15:52.798 [OCSSD(18814)]CRS-1605: CSSD voting file is online: /dev/mapper/asm-ocr2; details in /u01/app/grid_base/diag/crs/rac1/crs/trace/ocssd.trc.
2020-12-23 00:15:52.802 [OCSSD(18814)]CRS-1605: CSSD voting file is online: /dev/mapper/asm-ocr3; details in /u01/app/grid_base/diag/crs/rac1/crs/trace/ocssd.trc.
2020-12-23 00:15:54.294 [OCSSD(18814)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2020-12-23 00:15:55.703 [OCSSD(18814)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2020-12-23 00:15:55.817 [OCTSSD(19540)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 19540
2020-12-23 00:15:56.609 [OCTSSD(19540)]CRS-2403: The Cluster Time Synchronization Service on host rac1 is in observer mode.
2020-12-23 00:15:58.058 [OCTSSD(19540)]CRS-2407: The new Cluster Time Synchronization Service reference node is host rac2.
2020-12-23 00:15:58.059 [OCTSSD(19540)]CRS-2401: The Cluster Time Synchronization Service started on host rac1.
2020-12-23 00:16:05.153 [CRSD(19758)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 19758
2020-12-23 00:16:07.642 [CRSD(19758)]CRS-1012: The OCR service started on node rac1.
2020-12-23 00:16:07.685 [CRSD(19758)]CRS-1201: CRSD started on node rac1.
2020-12-23 00:16:10.270 [OLOGGERD(19846)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 19846
2020-12-23 00:16:10.809 [ORAAGENT(19857)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 19857
2020-12-23 00:16:10.869 [ORAROOTAGENT(19872)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 19872
2020-12-23 00:16:11.953 [ORAAGENT(19857)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"
2020-12-23 00:16:16.265 [ORAAGENT(19962)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 19962
2020-12-23 00:16:22.888 [ORAAGENT(19962)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"
2020-12-23 00:17:18.286 [ORAAGENT(19962)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"
2020-12-23 00:17:18.289 [ORAAGENT(19962)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"
2020-12-23 00:17:18.293 [ORAAGENT(19962)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"
2020-12-23 00:17:20.352 [ORAAGENT(19962)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"
2020-12-23 00:34:08.898 [OHASD(18010)]CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
2020-12-23 00:34:16.284 [ORAROOTAGENT(19872)]CRS-5822: Agent '/u01/app/grid_home/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:4} in /u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_orarootagent_root.trc.
2020-12-23 00:34:19.158 [MDNSD(18249)]CRS-5602: mDNS service stopping by request.
2020-12-23 00:34:19.772 [MDNSD(18249)]CRS-8504: Oracle Clusterware MDNSD process with operating system process ID 18249 is exiting
2020-12-23 00:34:20.155 [OCTSSD(19540)]CRS-2405: The Cluster Time Synchronization Service on host rac1 is shutdown by user
2020-12-23 00:34:20.156 [OCTSSD(19540)]CRS-8504: Oracle Clusterware OCTSSD process with operating system process ID 19540 is exiting
2020-12-23 00:34:21.165 [OCSSD(18814)]CRS-1603: CSSD on node rac1 has been shut down.
2020-12-23 00:34:24.171 [GPNPD(18296)]CRS-2329: GPNPD on node rac1 shut down. 
2020-12-23 00:34:25.178 [OHASD(18010)]CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
2020-12-23 00:34:25.190 [ORAROOTAGENT(18119)]CRS-5822: Agent '/u01/app/grid_home/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:8} in /u01/app/grid_base/diag/crs/rac1/crs/trace/ohasd_orarootagent_root.trc.

rac2:
2020-12-23 00:34:15.639 [OCSSD(80012)]CRS-1625: Node rac1, number 1, was shut down
2020-12-23 00:34:15.652 [OCSSD(80012)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac2 .
2020-12-23 00:34:15.656 [CRSD(80768)]CRS-5504: Node down event reported for node 'rac1'.
2020-12-23 00:34:15.658 [CRSD(80768)]CRS-2773: Server 'rac1' has been removed from pool 'Free'.
2020-12-23 00:34:47.701 [OCSSD(80012)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2020-12-23 00:35:07.748 [CRSD(80768)]CRS-2772: Server 'rac1' has been assigned to pool 'Free'.
2020-12-23 00:36:12.142 [CRSD(80768)]CRS-2807: Resource 'ora.asmgroup' failed to start automatically.
2020-12-23 00:42:07.825 [CVUD(61146)]CRS-10051: CVU found following errors with Clusterware setup : PRVF-5622 : The 'search' entry does not exist in file "/etc/resolv.conf" on nodes: "rac1".
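When an alert log is this long, tallying the CRS message codes makes repeated failures such as CRS-5011 stand out. The following is a minimal Python sketch for that. The regular expression is an assumption based only on the line layout visible in the excerpts above (timestamp, process name and PID in brackets, CRS code, message), and the function names are illustrative. Lines that were hard-wrapped by the terminal need to be rejoined before parsing.

```python
import re
from collections import Counter

# Layout assumed from the excerpts: "<timestamp> [<PROC>(<pid>)]CRS-<n>: <message>"
ALERT_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<proc>\w+)\((?P<pid>\d+)\)\]"
    r"(?P<code>CRS-\d+): (?P<msg>.*)$"
)


def parse_alert_line(line):
    """Return a dict of fields for one alert log line, or None if it does not match."""
    match = ALERT_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_codes(lines):
    """Count how often each CRS message code appears in the given lines."""
    counts = Counter()
    for line in lines:
        entry = parse_alert_line(line)
        if entry:
            counts[entry["code"]] += 1
    return counts


# Three lines taken from the rac1 excerpt (wrapped lines rejoined).
SAMPLE = [
    '2020-12-23 00:16:07.685 [CRSD(19758)]CRS-1201: CRSD started on node rac1.',
    '2020-12-23 00:16:11.953 [ORAAGENT(19857)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"',
    '2020-12-23 00:16:22.888 [ORAAGENT(19962)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/grid_base/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc"',
]
CODE_COUNTS = count_codes(SAMPLE)
```

To run it against a real log, pass the open file object to `count_codes` and print `CODE_COUNTS.most_common()`.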


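Node 1 reports Check of resource "ora.asm" failed, after which the cluster begins shutting down the OHAS services and node 1 is evicted from the cluster. Following the pointer in the log, check the corresponding trc file:
[Screenshot: trc file contents]
At the corresponding time the trace contains a large number of agent communication errors caused by abnormal termination. At this point the problem can essentially be narrowed down to inter-node communication.
[Screenshot: unreachable IP address]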

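The IP above is the HAIP of node rac2, and it is unreachable. After the cluster starts, one node cannot communicate with the other, so the node that cannot communicate is evicted from the cluster.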
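Reachability of the peer's HAIP can be confirmed from the other node with a plain ping sent through the private interface. The following is a minimal Python sketch that wraps the Linux `ping` command. The interface name and target address in the usage note are hypothetical placeholders, not values from this environment, and the function names are illustrative.

```python
import subprocess


def build_ping_command(target, interface=None, count=3, timeout_s=2):
    """Build a Linux ping command: -c packet count, -W reply timeout, -I source interface."""
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s)]
    if interface:
        cmd += ["-I", interface]
    cmd.append(target)
    return cmd


def is_reachable(target, interface=None):
    """Return True if ping exits with status 0, i.e. at least one reply was received."""
    result = subprocess.run(
        build_ping_command(target, interface),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, `is_reachable("169.254.x.x", "eth1")` run on rac1, with the peer's actual HAIP and the local private interface substituted, should return False in the situation described above.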

Solution:

https://bbs.huaweicloud.com/blogs/173989

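At this point the cluster is already deployed and the ASM instance parameter file has been generated, so manually add the cluster_interconnects parameter to specify the IP used for cluster communication. Either generate a pfile, edit it, and restart with it, or make the change online from the surviving node: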

alter system set cluster_interconnects='192.168.2.220' sid='orcl1' scope=spfile;
alter system set cluster_interconnects='192.168.2.221' sid='orcl2' scope=spfile;
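With more than two instances, the statements can be generated from a mapping of SID to private IP instead of being typed by hand. The following is a minimal Python sketch that reproduces the two statements above. The helper name is illustrative, and the SIDs and addresses are simply the ones used in this post.

```python
import ipaddress


def build_interconnect_statements(sid_to_ip):
    """Build one ALTER SYSTEM statement per instance from a {sid: private_ip} mapping."""
    statements = []
    for sid, ip in sid_to_ip.items():
        ipaddress.ip_address(ip)  # raises ValueError if the address is malformed
        statements.append(
            f"alter system set cluster_interconnects='{ip}' sid='{sid}' scope=spfile;"
        )
    return statements


# SIDs and private IPs as used in this post.
STATEMENTS = build_interconnect_statements(
    {"orcl1": "192.168.2.220", "orcl2": "192.168.2.221"}
)
```

Note that pinning cluster_interconnects to a fixed private address means the instance uses that address rather than the HAIP, so redundancy for that NIC has to come from somewhere else, such as OS-level bonding.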

Restart the cluster; the cluster status returns to normal.
[Screenshot: cluster status after restart]

http://blog.chinaunix.net/uid-17069315-id-5714302.html

Solution:

https://support.oracle.com/CSP/main/article?cmd=show&id=1383737.1&type=NOT

HAIP:

https://support.oracle.com/CSP/main/article?cmd=show&id=1210883.1&type=NOT


Reposted from blog.csdn.net/qq_43250333/article/details/113173430