Quantcast
Channel: ANBOB
Viewing all articles
Browse latest Browse all 702

12c R2 RAC instance crash due to CKPT is hung, using Multipathing

$
0
0

前段时间朋友一套12C R2   3Nodes RAC on Linux using multipath 环境的crash总是不定时的crash(几天一次),  使用的是LINUX的多路径软件的默认配置。

— alert log

2017-11-21T21:52:48.474246+08:00
Thread 3 advanced to log sequence 26159 (LGWR switch)
  Current log# 17 seq# 26159 mem# 0: +DATADG/anbob/ONLINELOG/group_17
2017-11-21T21:55:42.614168+08:00
IMR0 (ospid: 53353) waits for event 'control file sequential read' for 58 secs.
2017-11-21T21:55:42.614240+08:00
IMR0 (ospid: 53353) is hung in an acceptable location (cfio 0x11.00).
2017-11-21T21:55:55.374988+08:00
IMR0 (ospid: 53353) waits for event 'control file sequential read' for 81 secs.
2017-11-21T21:55:58.554970+08:00
IMR has experienced some problems and can't re-enable, retry count 1
2017-11-21T21:56:20.580385+08:00
CKPT (ospid: 53295) waits for event 'control file parallel write' for 95 secs.
2017-11-21T21:56:22.776564+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_lmhb_53275.trc  (incident=824802) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process CKPT (OSID 53295) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob3/incident/incdir_824802/anbob3_lmhb_53275_i824802.trc
2017-11-21T21:56:25.265258+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_anbob by memno 0 
ERROR: Some process(s) is not making progress.
LMHB (ospid: 53275) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 53275): terminating the instance due to error 29770
2017-11-21T21:56:25.763179+08:00
System state dump requested by (instance=3, osid=53275 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_diag_53235_20171121215625.trc
2017-11-21T21:56:27.521552+08:00
License high water mark = 172
2017-11-21T21:56:31.945109+08:00
Instance terminated by LMHB, pid = 53275
2017-11-21T21:56:31.945889+08:00
Warning: 2 processes are still attach to shmid 458758:
 (size: 24576 bytes, creator pid: 53153, last attach/detach pid: 53235)
2017-11-21T21:56:32.530103+08:00
USER (ospid: 34072): terminating the instance
2017-11-21T21:56:32.535249+08:00
Instance terminated by USER, pid = 34072

— lmnb trace anbob3_lmhb_53275_i824802.trc

----- Beginning of Customized Incident Dump(s) -----
==================================================
=== CKPT (ospid: 53295) Heartbeat Report
==================================================
CKPT (ospid: 53295) has no heartbeats for 103 sec. (threshold 70) 
  : heartbeat state 0x11.00 (cfio) pso-flag 0x100
  : waiting for event 'control file parallel write' for 97 secs with wait_id 4255818.
===[ Wait Chain ]===
Wait chain is empty.
=[SNAPSHOT SUMMARY]=
Process moved according to collected snapshots
===[NO-HB PROCESS CALLSTACK]===
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-io_queue_run()+100<-skgfrwat()+196
<-ksfdwtio()+306<-ksfdwat_internal()+234<-kfk_reap_ufs_async_io()+166<-kfk_reap_ios_from_subsys()+1000<-kfk_reap_ios()
+409<-kfk_io1()+1449<-kfk_transitIO()+451<-kfioReapIO()+633<-kfioRequestPriv()+216<-kfioRequest()+337<-ksfdafRequest()
+742<-ksfdafWait()+371<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+325<-ksfdbio()+2910<-ksfdbio()+3035<-kccwbp()
+1035<-kccbmp_wdl()+257<-kccfhb()+144<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303<-kctrcp()+1942<-kcvcca()+189<-ksbcti()
+247<-ksbabs()+11524<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()
+262<-__libc_start_main()+253
==============================
==============================
Dumping PROCESS CKPT (ospid: 53295) States
==============================
===[ System Load State ]===
  CPU Total 56 Raw 56 Core 28 Socket 2
  Load normal: Cur 1144 Highmark 71680 (4.46 280.00)
===[ Latch State ]===
  Not in Latch Get
===[ Session State Object ]===
  ----------------------------------------
  SO: 0x23251ba70, type: 4, owner: 0x1f1d4c328, flag: INIT/-/-/-/0x00 if: 0x3 c: 0x3
   proc=0x1f1d4c328, name=session, file=ksu.h LINE:15737, pg=0, conuid=1
   SGA version=(1,0)
  (session) sid: 4720 ser: 27225 trans: (nil), creator: 0x1f1d4c328
            flags: (0x51) USR/- flags2: (0x409) -/-/INC
            flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
            DID: 0003-0027-0000000A0000-0000-00000000, short-term DID: 
            txn branch: (nil)
            con_id/con_uid/con_name: 1/1/CDB$ROOT
            con_logonuid: 1 con_logonid: 1
            con_scuid: 1 con_scid: 1
            edition#: 0            user#/name: 0/SYS
            oct: 0, prv: 0, sql: (nil), psql: (nil)
            stats: 0x1ffa5a0b0, PX stats: 0x1101de44
  service name: SYS$BACKGROUND
  Current Wait Stack:
   0: waiting for 'control file parallel write'
      files=0x1, block#=0x27, requests=0x1
      wait_id=4255818 seq_num=61824 snap_id=1
      wait times: snap=1 min 37 sec, exc=1 min 37 sec, total=1 min 37 sec
      wait times: max=infinite, heur=1 min 37 sec
      wait counts: calls=0 os=0
      in_wait=1 iflags=0x5a0
  There are 4 sessions blocked by this session.
  Dumping one waiter:
    inst: 3, sid: 3389, ser: 36254
    wait event: 'enq: CF - contention'
      p1: 'name|mode'=0x43460004
      p2: '0'=0x0
      p3: 'operation'=0x0
    row_wait_obj#: 213, block#: 1969, row#: 0, file# 1
    min_blocked_time: 92 secs, waiter_cache_ver: 59413
  Wait State:
    fixed_waits=0 flags=0x22 boundary=(nil)/-1
  Session Wait History:
      elapsed time of 0.000027 sec since current wait
   0: waited for 'control file parallel write'
      files=0x1, block#=0x29, requests=0x1
      wait_id=4255817 seq_num=61823 snap_id=1
      wait times: snap=0.000462 sec, exc=0.000462 sec, total=0.000462 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000029 sec of elapsed time
   1: waited for 'control file parallel write'
      files=0x1, block#=0x372, requests=0x1
      wait_id=4255816 seq_num=61822 snap_id=1
      wait times: snap=0.000589 sec, exc=0.000589 sec, total=0.000589 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000051 sec of elapsed time
   2: waited for 'control file sequential read'
      file#=0x0, block#=0x371, blocks=0x1
      wait_id=4255815 seq_num=61821 snap_id=1
      wait times: snap=0.000279 sec, exc=0.000279 sec, total=0.000279 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000010 sec of elapsed time
   3: waited for 'control file sequential read'
      file#=0x0, block#=0x2a, blocks=0x1
      wait_id=4255814 seq_num=61820 snap_id=1
      wait times: snap=0.000272 sec, exc=0.000272 sec, total=0.000272 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000004 sec of elapsed time
   4: waited for 'control file sequential read'
      file#=0x0, block#=0x28, blocks=0x1
      wait_id=4255813 seq_num=61819 snap_id=1
      wait times: snap=0.000280 sec, exc=0.000280 sec, total=0.000280 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000009 sec of elapsed time
   5: waited for 'control file sequential read'
      file#=0x0, block#=0x1, blocks=0x1
      wait_id=4255812 seq_num=61818 snap_id=1
      wait times: snap=0.000319 sec, exc=0.000319 sec, total=0.000319 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000070 sec of elapsed time
   6: waited for 'enq: CF - contention'
      name|mode=0x43460005, 0=0x0, operation=0x0
      wait_id=4255811 seq_num=61817 snap_id=1
      wait times: snap=0.000257 sec, exc=0.000257 sec, total=0.000257 sec
      wait times: max=0.000000 sec
      wait counts: calls=3 os=3
      occurred after 0.000800 sec of elapsed time
   7: waited for 'rdbms ipc message'
      timeout=0x12c, =0x0, =0x0
      wait_id=4255810 seq_num=61816 snap_id=1
      wait times: snap=3.002193 sec, exc=3.002193 sec, total=3.002193 sec
      wait times: max=3.000000 sec
      wait counts: calls=1 os=1
      occurred after 0.000124 sec of elapsed time
   8: waited for 'control file sequential read'
      file#=0x0, block#=0x34a, blocks=0x1
      wait_id=4255809 seq_num=61815 snap_id=1
      wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000005 sec of elapsed time
   9: waited for 'control file sequential read'
      file#=0x0, block#=0x2a, blocks=0x1
      wait_id=4255808 seq_num=61814 snap_id=1
      wait times: snap=0.000341 sec, exc=0.000341 sec, total=0.000341 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000006 sec of elapsed time
	  
	  ...
	  ...
*** 2017-11-21T21:56:24.918643+08:00
loadavg : 4.47 2.09 1.21
System user time: 0.00 sys time: 0.00 context switch: 19791
Memory (Avail / Total) = 107148.16M / 257929.37M
Swap (Avail / Total) = 32768.00M /  32768.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle   53295     1  0  80   0 - 19975755 read_e Nov13 ?     00:06:16 ora_ckpt_anbob3
Name:	ora_ckpt_anbob3
State:	S (sleeping)
Tgid:	53295
Pid:	53295
PPid:	1
TracerPid:	0
Uid:	602	602	602	602
Gid:	601	604	604	604
Utrace:	0
FDSize:	512
Groups:	601 602 603 604 606 
VmPeak:	79907024 kB
VmSize:	79903020 kB
VmLck:	       0 kB
VmHWM:	 3446908 kB
VmRSS:	 3438240 kB
VmData:	   25352 kB
VmStk:	     164 kB
VmExe:	  342688 kB
VmLib:	   30328 kB
VmPTE:	   14816 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	1/1031607
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000016400207
SigCgt:	00000003c9887cf8
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	ffffffffffffffff
Cpus_allowed:	ffffff,ffffffff
Cpus_allowed_list:	0-55
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:	0-1
voluntary_ctxt_switches:	3914186
nonvoluntary_ctxt_switches:	43390
Short stack dump: 
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-io_queue_run()+100<-skgfrwat()+196<-ksfdwtio()
+306<-ksfdwat_internal()+234<-kfk_reap_ufs_async_io()+166<-kfk_reap_ios_from_subsys()+1000<-kfk_reap_ios()+409<-kfk_io1()
+1449<-kfk_transitIO()+451<-kfioReapIO()+633<-kfioRequestPriv()+216<-kfioRequest()+337<-ksfdafRequest()+742<-ksfdafWait()
+371<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+325<-ksfdbio()+2910<-ksfdbio()+3035<-kccwbp()+1035<-kccbmp_wdl()
+257<-kccfhb()+144<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303<-kctrcp()+1942<-kcvcca()+189<-ksbcti()+247<-ksbabs()+11524
<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+253
 

Note:
从日志看是CKPT进程hang超过了70秒所以LMON进程终止了实例, 从CKPT的trace文件中Call Stack看I/O的可能性较大, 开始有说这个数据库使用的是linux的多路径软件,使用的默认配置, 通常来说I/O错误对于ASM镜像和多路径以上的应用进程在做I/O操作时是通明的。但是在I/O重试期间可能会导致进程hang, 对于像CKPT\LMON 长时间的HANG是致命的, 在MOS中没有找到相关的BUG个NOTE但有一篇Extend RAC因远程IO失败导致数据库crash的case, 参照尝试性的调整了多路径的配置参数如下:

polling_interval= 10
no_path_retry = 3
/sys/module/scsi_transport_fc/parameters/dev_loss_tmo = 30

polling_interval: Specifies the interval between two path checks in seconds. For properly functioning paths, the interval between checks will gradually increase to (4 * polling_interval). The default value is 5.

no_path_retry: A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.
A value of fail indicates immediate failure, without queuing.A value of queue indicates that queuing should not stop until the path is fixed.

The “dev_loss_tmo” option controls how long the SCSI layer waits after a SCSI device fails before marking it as failed.

未提供OS,和多路径日志。 但做了以前调整重启了多路径服务后,观察了1个月时间数据库实例没有再crash过。


Viewing all articles
Browse latest Browse all 702

Trending Articles