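A customer's Oracle 19c Data Guard environment kept showing large gaps on the standby side at intervals. At the same time the DB alert log reported "LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for N secs." The standby was not open for queries, multi-instance redo apply was disabled, system resources were idle, and the number of LMS processes was normal. When the other nodes were shut down and only the node applying redo was left running, the problem did not occur. DLM stands for Distributed Lock Manager, a core mechanism of the RAC architecture that coordinates resource sharing across nodes and sends its requests over the interconnect network. Below is a brief record of the case.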
db alert log
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_3_seq_13586.1479.1077669291
2021-07-12T20:25:29.643687+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_2_seq_14361.1072.1077669019
2021-07-12T20:29:38.183656+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 1 secs.
2021-07-12T20:29:48.137737+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:31:21.952345+08:00
rfs (PID:113884): Selected LNO:26 for T-2.S-14456 dbid 3902007743 branch 1037635587
2021-07-12T20:31:21.987333+08:00
rfs (PID:114704): Error ORA-235 occurred during an un-locked control file
rfs (PID:114704): transaction. This error can be ignored. The control
rfs (PID:114704): file transaction will be retried.
2021-07-12T20:31:43.532600+08:00
ARC2 (PID:106404): Archived Log entry 9591 added for T-2.S-14455 ID 0xe894b1bf LAD:1
2021-07-12T20:31:47.151671+08:00
rfs (PID:113882): Selected LNO:31 for T-3.S-13731 dbid 3902007743 branch 1037635587
2021-07-12T20:31:49.116049+08:00
rfs (PID:113880): Selected LNO:22 for T-1.S-13006 dbid 3902007743 branch 1037635587
2021-07-12T20:31:53.393547+08:00
ARC3 (PID:106408): Archived Log entry 9592 added for T-1.S-13005 ID 0xe894b1bf LAD:1
2021-07-12T20:32:02.346585+08:00
ARC2 (PID:106404): Archived Log entry 9593 added for T-3.S-13730 ID 0xe894b1bf LAD:1
2021-07-12T20:33:13.805344+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:33:13.805470+08:00
LGWR (ospid: 105521) is hung in an acceptable location (inwait 0x1.ffff).
2021-07-12T20:33:21.196764+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:33:31.310737+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:33:41.223781+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 1 secs.
2021-07-12T20:33:51.205776+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:34:01.307770+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:34:25.440231+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_2_seq_14362.1867.1077670807
2021-07-12T20:34:44.864009+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_3_seq_13587.691.1077670845
2021-07-12T20:34:45.204773+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_1_seq_12934.1156.1077670917
2021-07-12T20:36:09.378685+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:36:19.341635+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:36:28.416573+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:36:38.375742+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 1 secs.
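To see how often the warning fires, the alert log can be summarized per minute. This is a minimal sketch, not part of the original diagnosis: the function name `count_dlm_waits` is hypothetical, and it assumes the 19c alert log layout shown above, where each message is preceded by its own ISO 8601 timestamp line.

```shell
#!/usr/bin/env bash
# Count 'DLM cross inst call completion' warnings per minute in an alert log.
# Assumption: every message follows a timestamp line such as
# 2021-07-12T20:29:38.183656+08:00 (the layout shown in this post).
count_dlm_waits() {
  awk '
    # Remember the minute (YYYY-MM-DDTHH:MM) of the most recent timestamp line.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ { minute = substr($1, 1, 16); next }
    # Attribute each warning to that minute.
    /DLM cross inst call completion/ { hits[minute]++ }
    END { for (m in hits) print m, hits[m] }
  ' "$1" | sort
}
```

Calling `count_dlm_waits` with the path of the standby's alert log prints one line per minute that contained at least one warning, which makes it easier to line the waits up with the OS-level network samples.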
LGWR trace
*** 2021-07-12T20:33:43.793041+08:00 ((4))
Received ORADEBUG command (#235) 'dump KSTDUMPCURPROC 1' from process '105470'
-------------------------------------------------------------------------------
Trace Bucket Dump Begin: default bucket for process 47 (osid: 105521, LGWR)
CDB_NAME(CON_ID):CON_UID:TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP:SID:SERIAL#: [EVENT#:PID] DATA
-------------------------------------------------------------------------------
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1957:kjci_complete():4466:40278: freeing request 0x20fd651e8 (inst|inc|reqid)=(1|88|823031) with opcode=146 and completion status [DONE]
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1089:kjci_initreq():4466:40278: request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with group (type|id)=(1|1), opcode=146, flags=0x0, msglen=56, where=[kqlmClusterMessage] to target instances=
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1091:kjci_initreq():4466:40278: 1 2
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d4e8 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with opcode=146 from callee (inst|pid|psn)=(1|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d718 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with opcode=146 from callee (inst|pid|psn)=(2|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1957:kjci_complete():4466:40278: freeing request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with opcode=146 and completion status [DONE]
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1089:kjci_initreq():4466:40278: request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with group (type|id)=(1|1), opcode=146, flags=0x0, msglen=56, where=[kqlmClusterMessage] to target instances=
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1091:kjci_initreq():4466:40278: 1 2
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d4e8 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with opcode=146 from callee (inst|pid|psn)=(1|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d718 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with opcode=146 from callee (inst|pid|psn)=(2|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1957:kjci_complete():4466:40278: freeing request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with opcode=146 and completion status [DONE]
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1089:kjci_initreq():4466:40278: request 0x20fd651e8 (inst|inc|reqid)=(1|88|823034) with group (type|id)=(1|1), opcode=146, flags=0x0, msglen=56, where=[kqlmClusterMessage] to target instances=
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1091:kjci_initreq():4466:40278: 1 2
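KJCI ==> (kjci)_processcrq: kernel lock management communication cross instance call
This is cross-node communication, and no matching known bug was found on MOS, so the network was analyzed first. Another option is to start from the blocking process, for example by taking a systemstate dump (SSD) or reading the hang manager (hangmgr) trace. In Oracle 19c, the AHF framework that ships with CRS already includes OSWatcher (OSW).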
OSW netstat data
zzz ***Tue Jul 13 00:59:51 CST 2021
...
#kernel
IpInReceives 1456201695 0.0
IpInHdrErrors 0 0.0
IpInAddrErrors 0 0.0
IpForwDatagrams 0 0.0
IpInUnknownProtos 0 0.0
IpInDiscards 0 0.0
IpInDelivers 1085210966 0.0
IpOutRequests 1007206469 0.0
IpOutDiscards 5280 0.0
IpOutNoRoutes 8 0.0
IpReasmTimeout 6333500 0.0
IpReasmReqds 408470736 0.0
IpReasmOKs 37504539 0.0
IpReasmFails 8651478 0.0
IpFragOKs 29029579 0.0
Note:
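The number of failed IP reassemblies is high. This counter is cumulative, so the next step is to look at how it changes over time.
Checking the IP reassembly failures over time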
awk '/zzz/{d=$3"/"$4" "$5}/IpReasmFails/{curr=$2;diff=curr-prev;if(diff>5)print d,diff,prev,curr;prev=curr}' *.dat
Jul/13 00:00:16 8620039 8620039
Jul/13 00:00:46 185 8620039 8620224
Jul/13 00:01:16 242 8620224 8620466
Jul/13 00:01:46 324 8620466 8620790
Jul/13 00:02:16 279 8620790 8621069
Jul/13 00:02:46 325 8621069 8621394
Jul/13 00:03:16 325 8621394 8621719
Jul/13 00:03:46 247 8621719 8621966
Jul/13 00:04:16 246 8621966 8622212
Jul/13 00:04:46 210 8622212 8622422
Jul/13 00:05:16 327 8622422 8622749
Jul/13 00:05:46 247 8622749 8622996
Jul/13 00:06:16 238 8622996 8623234
Jul/13 00:06:46 219 8623234 8623453
Jul/13 00:07:16 262 8623453 8623715
Jul/13 00:07:46 254 8623715 8623969
Jul/13 00:08:16 179 8623969 8624148
Jul/13 00:08:46 294 8624148 8624442
Note:
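IP reassembly failures are clearly high during normal operation as well, at roughly 180 to 330 per 30-second sample. Next, ping is used to verify the network.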
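In the output of the awk one-liner, the first row reports the whole cumulative counter as a "difference" because no previous sample exists yet. A variant that treats the first sample as a baseline only could look like the sketch below. The function name `reasm_fail_deltas` and the threshold parameter are illustrative assumptions, and the OSWatcher netstat file layout is assumed to match the sample shown earlier.

```shell
#!/usr/bin/env bash
# Print per-sample increases of IpReasmFails from OSWatcher netstat archives.
# Usage: reasm_fail_deltas THRESHOLD FILE...
# Assumption: each sample starts with a "zzz ***Tue Jul 13 00:59:51 CST 2021"
# header line, as in the OSW output quoted in this post.
reasm_fail_deltas() {
  threshold=$1
  shift
  awk -v min="$threshold" '
    # Sample header: keep "Mon/DD HH:MM:SS" as the label for this sample.
    /^zzz/ { stamp = $3 "/" $4 " " $5 }
    $1 == "IpReasmFails" {
      # The first value seen is only a baseline, so it is never printed.
      if (seen && $2 - prev > min) print stamp, $2 - prev, prev, $2
      prev = $2
      seen = 1
    }
  ' "$@"
}
```

For example, `reasm_fail_deltas 5 *.dat` would reproduce the listing above without the misleading first row. A counter that goes backwards, for instance after a reboot, yields a negative difference and is skipped as well.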
Verifying with ping
-- on node1
ping -s 4000 {node2-privateIP}
Note:
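I forgot to keep the output, but ping reported 12% packet loss, which shows that the private interconnect (heartbeat) network was not healthy. The interconnect runs over a bond of two NICs in active-backup mode, so the next thing to try was switching to the other NIC.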
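Because the ping output was not kept, a small helper that records just the loss percentage may be useful for repeating the check. This is a sketch under assumptions: the function name `packet_loss_pct` is hypothetical, it expects the summary line format printed by Linux iputils ping, and the peer address and packet count in the usage comment are placeholders rather than values from this case. With a standard 1500-byte MTU (an assumption, the MTU was not recorded), a 4000-byte payload forces IP fragmentation, which is why this test exercises the reassembly path.

```shell
#!/usr/bin/env bash
# Extract the packet-loss percentage from the summary line of Linux ping, e.g.
#   "100 packets transmitted, 88 received, 12% packet loss, time 99123ms"
packet_loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (placeholder address and count, not taken from the original case):
#   ping -c 100 -s 4000 "$PEER_PRIVATE_IP" | packet_loss_pct
```

Running the same command before and after a change on the interconnect gives two numbers that can be compared directly.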
Switching the NIC
cat /proc/net/bonding/bond0
Note: The check shows that the currently active NIC is ens9f0. Switch to the backup NIC ens9f1.
ifenslave -c bond0 ens9f1
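To confirm which NIC is active before and after the switch, the relevant line can be pulled out of the bonding status file. A minimal sketch, assuming the usual `Currently Active Slave:` line that the Linux bonding driver writes for active-backup bonds; the function name `active_slave` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the currently active slave of a bond.
# The argument is a bonding status file; it defaults to bond0 as used in this post.
active_slave() {
  awk -F': ' '/^Currently Active Slave/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}
```

In this case the expectation would be `ens9f0` before the `ifenslave -c` call and `ens9f1` after it.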
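After the switch from the active NIC to the backup NIC, ping returned to normal, the IP reassembly failures stopped, 'DLM cross inst call completion' no longer appeared in the alert log, Data Guard synchronization caught up, and the problem was resolved.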