Troubleshooting 11gR2 Grid Infrastructure: Node Cannot Rejoin the Cluster After Eviction, Logs Show Disk and Network Heartbeat Failures

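A while ago I analyzed a problem where node 2 could not rejoin the cluster after being evicted. The logs pointed to a network communication problem, and the original eviction also showed voting disk (VD) errors, CRS-1615 "No I/O has completed", together with "Network communication missing": the disk heartbeat and the network heartbeat failed at the same time. Yet the storage and the private network each have redundant paths, and they do not share a switch. What could make both fail at once? A brief record follows.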

# node2 GI alert log

2019-09-17 02:10:06.619: 
[cssd(20050)]CRS-1612:Network communication with node node1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.870 seconds
2019-09-17 02:10:06.940: 
[cssd(20050)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/asm-diskj will be considered not functional in 7990 milliseconds
2019-09-17 02:10:06.940: 
[cssd(20050)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/asm-diskk will be considered not functional in 8110 milliseconds
2019-09-17 02:10:08.445: 
[cssd(20050)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/asm-diskj will be considered not functional in 6480 milliseconds
2019-09-17 02:10:08.445: 
[cssd(20050)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/asm-diskk will be considered not functional in 6600 milliseconds
2019-09-17 02:10:14.623: 
[cssd(20050)]CRS-1662:Member kill requested by node node1 for member number 1, group DBORCL
2019-09-17 02:11:41.644: 
[cssd(20050)]CRS-1611:Network communication with node node1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds
2019-09-17 02:11:45.645: 
[cssd(20050)]CRS-1610:Network communication with node node1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.620 seconds
2019-09-17 02:11:48.268: 
[cssd(20050)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:11:48.268: 
[cssd(20050)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log
2019-09-17 02:11:48.315: 

2019-09-17 02:13:57.761: 
[cssd(1248047)]CRS-1713:CSSD daemon is started in clustered mode
2019-09-17 02:14:13.489: 
[cssd(1248047)]CRS-1707:Lease acquisition for node node2 number 2 completed
2019-09-17 02:14:14.761: 
[cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diski; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:14.770: 
[cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diskj; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:14.779: 
[cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diskk; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:33.309: 
[cssd(1248047)]CRS-1612:Network communication with node node1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.040 seconds
2019-09-17 02:14:40.310: 
[cssd(1248047)]CRS-1611:Network communication with node node1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.040 seconds
2019-09-17 02:14:45.312: 
[cssd(1248047)]CRS-1610:Network communication with node node1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.040 seconds
2019-09-17 02:14:47.355: 
[cssd(1248047)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:47.355: 
[cssd(1248047)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log
2019-09-17 02:14:47.387: 
[cssd(1248047)]CRS-1603:CSSD on node node2 shutdown by user.
2019-09-17 02:14:47.589: 
[cssd(1248047)]CRS-1660:The CSS daemon shutdown has completed

Diagnosis
# on node1
ping node2-priv-ip
traceroute node2-priv-ip

# on node2
ping node1-priv-ip
traceroute node1-priv-ip

-- The network tests showed no problems either.

cat /etc/*release*|head -n 3
-- RHEL 6.6

This notoriously trouble-prone OS release again. It reminded me of an old problem: https://www.anbob.com/archives/2851.html

ping 11.11.11.11 -s 8192
-- Sure enough, no response. Even without jumbo frames, the packet should simply be fragmented and sent.
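
To see why the `ping -s 8192` test exercises fragment reassembly, the fragment count can be worked out by hand. The sketch below is only illustrative arithmetic: it assumes IPv4 with a 20-byte header (no options), an 8-byte ICMP header, and a default MTU of 1500; `frag_count` is a hypothetical helper name, not an existing tool.

```shell
# Number of IP packets needed for one ICMP echo with the given payload size.
# Assumptions: IPv4 header of 20 bytes (no options), ICMP header of 8 bytes.
frag_count() {
    payload=$1
    mtu=${2:-1500}
    total=$(( payload + 8 ))              # ICMP header is part of the IP payload
    if [ "$total" -le $(( mtu - 20 )) ]; then
        echo 1                            # fits in a single packet, no fragmentation
        return
    fi
    per_frag=$(( (mtu - 20) / 8 * 8 ))    # fragment data must be a multiple of 8 bytes
    echo $(( (total + per_frag - 1) / per_frag ))
}

frag_count 8192 1500   # prints 6
frag_count 1472 1500   # prints 1
```

With a 1500-byte MTU, an 8192-byte ping is split into 6 fragments that the receiver has to reassemble, while a 1472-byte ping travels as a single packet. That would explain why a plain `ping` succeeds while the large one gets no reply when reassembly is failing.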

netstat -s |grep reass
sleep 5
netstat -s |grep reass

-- The counters are increasing, so this is probably the cause.
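
The two `netstat -s | grep reass` samples can also be turned into a single number. This is a minimal sketch, assuming the counter line reads "packet reassembles failed" as printed by net-tools `netstat` (the wording may differ in other versions); `reass_failed` and `reass_failed_delta` are hypothetical helper names.

```shell
# Print the "packet reassembles failed" counter from `netstat -s` output on stdin.
# Prints 0 if the line is absent.
reass_failed() {
    awk '/packet reassembles failed/ { n = $1 } END { print n + 0 }'
}

# Sample the counter twice, INTERVAL seconds apart (default 5), and print the increase.
reass_failed_delta() {
    interval=${1:-5}
    before=$(netstat -s | reass_failed)
    sleep "$interval"
    after=$(netstat -s | reass_failed)
    echo $(( after - before ))
}
```

Running `reass_failed_delta 5` on each node prints how many reassembly failures occurred in that window; a value that keeps coming back non-zero while the large ping is running points at the fragment reassembly buffers.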

echo 16777216 > /proc/sys/net/ipv4/ipfrag_high_thresh
echo 15728640 > /proc/sys/net/ipv4/ipfrag_low_thresh
echo 60 > /proc/sys/net/ipv4/ipfrag_time

Everything returned to normal.
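
Values written under /proc/sys with `echo` do not survive a reboot. One possible way to keep them, sketched here under the assumption that the node reads /etc/sysctl.conf at boot, is to dump the running values in sysctl.conf syntax and append them after review. `ipfrag_conf_lines` is a hypothetical helper, and its optional directory argument exists only so it can be tried against sample files.

```shell
# Print the current fragment-reassembly settings in sysctl.conf syntax.
# Reads from /proc/sys/net/ipv4 unless another directory is given.
ipfrag_conf_lines() {
    dir=${1:-/proc/sys/net/ipv4}
    for name in ipfrag_high_thresh ipfrag_low_thresh ipfrag_time; do
        printf 'net.ipv4.%s = %s\n' "$name" "$(cat "$dir/$name")"
    done
}

# Review the output first, then (as root) persist and reload:
#   ipfrag_conf_lines >> /etc/sysctl.conf
#   sysctl -p
```

Because the helper reads whatever is currently active, it records the values that were actually verified on the running node rather than a retyped copy.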

If you cannot resolve the problem yourself, you can reach me through the contact details on the www.anbob.com home page.

