# db alert log
Warning: VKTM detected a time drift. Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Tue Apr 23 08:54:27 2019
WARNING: Heavy swapping observed on system in last 5 mins.
pct of memory swapped in [3.68%] pct of memory swapped out [13.12%].
Please make sure there is no memory pressure and the SGA and PGA are configured correctly. Look at DBRM trace file for more details.
Tue Apr 23 08:56:27 2019
Thread 1 cannot allocate new log, sequence 10395
Private strand flush not complete
  Current log# 2 seq# 10394 mem# 0: /hescms/oradata/anbob/redo02a.log
  Current log# 2 seq# 10394 mem# 1: /hescms/oradata/scms/redo02b.log
Thread 1 advanced to log sequence 10395 (LGWR switch)
  Current log# 3 seq# 10395 mem# 0: /hescms/oradata/scms/redo03a.log
  Current log# 3 seq# 10395 mem# 1: /hescms/oradata/scms/redo03b.log
Tue Apr 23 08:56:41 2019
Archived Log entry 10505 added for thread 1 sequence 10394 ID 0xaef43455 dest 1:
Tue Apr 23 09:08:37 2019
System state dump requested by (instance=1, osid=8886 (PMON)), summary=[abnormal instance termination].
Tue Apr 23 09:08:37 2019
PMON (ospid: 8886): terminating the instance due to error 471
System State dumped to trace file /ora/diag/rdbms/scms/scms/trace/scms_diag_8896_20190423090837.trc
Tue Apr 23 09:08:37 2019
opiodr aborting process unknown ospid (22614) as a result of ORA-1092
Tue Apr 23 09:08:38 2019
opiodr aborting process unknown ospid (27627) as a result of ORA-1092
Instance terminated by PMON, pid = 8886
Tue Apr 23 09:18:18 2019
Starting ORACLE instance (normal)
# OS log /var/log/messages
Apr 23 08:52:18 anbobdb kernel: NET: Unregistered protocol family 36
Apr 23 09:07:28 anbobdb kernel: oracle invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0, oom_score_adj=0
Apr 23 09:07:32 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb kernel: oracle cpuset=/ mems_allowed=0-4
Apr 23 09:07:47 anbobdb kernel: Pid: 22753, comm: oracle Not tainted 2.6.32-431.el6.x86_64 #1
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:47 anbobdb kernel: Call Trace:
Apr 23 09:07:47 anbobdb kernel: [] ? dump_header+0x90/0x1b0
Apr 23 09:07:47 anbobdb kernel: [] ? security_real_capable_noaudit+0x3c/0x70
Apr 23 09:07:47 anbobdb kernel: [] ? oom_kill_process+0x82/0x2a0
Apr 23 09:07:47 anbobdb kernel: [] ? select_bad_process+0xe1/0x120
Apr 23 09:07:47 anbobdb kernel: [] ? out_of_memory+0x220/0x3c0
Apr 23 09:07:47 anbobdb kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:48 anbobdb kernel: [] ? alloc_pages_current+0xaa/0x110
Apr 23 09:07:52 anbobdb kernel: [] ? pte_alloc_one+0x1b/0x50
Apr 23 09:07:52 anbobdb kernel: [] ? __pte_alloc+0x32/0x160
Apr 23 09:07:52 anbobdb kernel: [] ? handle_mm_fault+0x1c0/0x300
Apr 23 09:07:52 anbobdb kernel: [] ? down_read_trylock+0x1a/0x30
Note: the OS messages point to a resource shortage and an OOM killer event (TFA will collect this log).
# What is the OOM Killer?
The OOM killer, a feature enabled by default in the Linux kernel, is a self-protection mechanism the kernel employs when it is under severe memory pressure.
If the kernel cannot find memory to allocate when it is needed, it puts in-use user data pages on the swap-out queue to be swapped out. If the virtual memory (VM) subsystem cannot allocate memory and cannot swap out in-use memory, the out-of-memory killer may begin killing userspace processes: when all else fails, it sacrifices one or more processes in order to free up memory for the system.
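A quick way to see how a particular kernel is set up to react to memory exhaustion is to look at the overcommit- and OOM-related sysctl settings. A minimal sketch (the values you see are environment-specific):

```bash
# Inspect the kernel settings that govern overcommit and OOM behaviour.
# vm.overcommit_memory:        0 = heuristic, 1 = always allow, 2 = strict accounting
# vm.panic_on_oom:             1 = panic the whole system instead of killing a process
# vm.oom_kill_allocating_task: 1 = kill the task that triggered the failed allocation
sysctl vm.overcommit_memory vm.overcommit_ratio vm.panic_on_oom vm.oom_kill_allocating_task
```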
In principle, the OOM killer behaves as follows (see the sketch after this list for how the kernel scores its candidates):

- Lose the minimum amount of work done
- Recover as much memory as it can
- Do not kill any process that is not, by itself, using a lot of memory
- Kill the minimum number of processes (ideally only one)
- Try to kill the process the user would expect it to kill
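The kernel implements this selection through a per-process "badness" score exported in /proc, which can be listed to see which processes would be picked first. A minimal sketch (the sorting one-liner is illustrative, not an official tool):

```bash
# List the ten processes with the highest OOM badness score, i.e. the most
# likely victims. /proc/<pid>/oom_score is the kernel's current score;
# /proc/<pid>/oom_score_adj can be lowered (down to -1000) to protect a process.
for pid in /proc/[0-9]*; do
  score=$(cat "$pid/oom_score" 2>/dev/null) || continue
  printf '%8s  %-7s %s\n' "$score" "${pid##*/}" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -10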
Probable causes:
1. Spike in memory usage caused by a load event (additional processes are needed to handle the increased load).
2. Spike in memory usage caused by additional services being added or migrated to the system (another application or a new service was started on the system).
3. Spike in memory usage due to failed hardware, such as a DIMM memory module.
4. Spike in memory usage due to undersized hardware resources for the running application(s).
5. A memory leak in a running application.
If an application uses mlock() or HugeTLB pages (HugePages), that memory cannot be swapped out, because locked pages and HugePages are not swappable. When this happens, SwapFree may still show a very large value at the moment the OOM occurs. Overusing locked memory or HugePages, however, can exhaust system memory and leave the system with no other recourse.
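A quick way to confirm whether HugePages or locked memory played a role is to look at /proc/meminfo around the time of the incident. A minimal sketch:

```bash
# HugePages_* show memory pinned for huge pages (never swappable);
# Mlocked shows pages locked with mlock(). This is why SwapFree can stay
# high even while the OOM killer is firing.
grep -E 'HugePages_(Total|Free|Rsvd)|Hugepagesize|Mlocked|SwapTotal|SwapFree' /proc/meminfo
```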
# Troubleshooting
Check how often the out-of-memory (OOM) killer is firing:
$ egrep 'Out of memory:' /var/log/messages
Check how much memory the processes being killed were consuming:
$ egrep 'total-vm' /var/log/messages
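The two checks above can be combined into a single pass over the current and rotated logs. A minimal sketch (on this 2.6.32 kernel the "Killed process" lines carry the total-vm and rss figures; the exact message wording can differ on other kernel versions):

```bash
# When did the OOM killer fire, which process did it pick, and how big was it?
grep -hE 'invoked oom-killer|Out of memory:|Killed process' /var/log/messages*
```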
For further analysis, we can check the system activity reporter (sar) data to see what it captured about the OS.
Check swap statistics with the -S flag: a high %swpused indicates swapping and a memory shortage.
$ sar -S -f /var/log/sa/sa2
Check CPU and I/O wait statistics: a high %user or %system indicates a busy system, and a high %iowait means the system is spending significant time waiting on the underlying storage.
$ sar -f /var/log/sa/sa31
Check memory statistics: high %memused and %commit values tell us the system is using nearly all of its memory; memory committed to processes (a high %commit) is the more concerning of the two.
$ sar -r -f /var/log/sa/sa
Lastly, check the amount of memory on the system, and how much is free/available:
$ free -m, cat /proc/meminfo, or dmidecode -t memory
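For repeated use, the checks above can be wrapped in a small script. A minimal sketch (the script name and the default sa file are assumptions; pass the sar data file for the day of the incident):

```bash
#!/bin/bash
# oom_check.sh - quick post-mortem checklist for a suspected OOM incident.
# Usage: ./oom_check.sh [sar_data_file]
SA_FILE=${1:-/var/log/sa/sa23}   # assumed default; use the sa file for the relevant day

echo "== OOM killer events in syslog =="
grep -hE 'invoked oom-killer|Out of memory:|Killed process' /var/log/messages*

echo "== Swap usage (sar -S) =="
sar -S -f "$SA_FILE"

echo "== CPU / iowait (sar -u) =="
sar -u -f "$SA_FILE"

echo "== Memory usage (sar -r) =="
sar -r -f "$SA_FILE"

echo "== Current memory snapshot =="
free -m
```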
In an Oracle environment, first check whether the SGA and PGA are sized reasonably. In this case we later reduced these memory areas, reserved more available memory for the operating system, and configured HugePages (their benefits will not be elaborated here). If you increase the number of HugePages, also check that the corresponding shared memory still fits under the kernel.shmall limit, and check the application processes for memory leaks, including PGA leaks. A sketch of configuring HugePages on Linux follows.
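As a rough illustration of that last step, this is the usual shape of a HugePages setup on Linux for an Oracle SGA. The SGA size (roughly 8 GB), the page count, and the group id below are illustrative assumptions only; the real values must be derived from the SGA actually configured on the system (Oracle publishes a helper script for this in MOS note 401749.1).

```bash
# Illustrative values: 2 MB hugepage size, SGA of roughly 8 GB -> ~4096 pages,
# plus a small margin for shared-memory granule overhead.
cat >> /etc/sysctl.conf <<'EOF'
vm.nr_hugepages = 4200
# gid of the group that owns the Oracle shared memory (assumed; check /etc/group)
vm.hugetlb_shm_group = 54321
EOF
sysctl -p

# Allow the oracle user to lock that much memory (values are in KB).
cat >> /etc/security/limits.conf <<'EOF'
oracle soft memlock 8601600
oracle hard memlock 8601600
EOF

# After restarting the instance, confirm the SGA really landed in HugePages
# (HugePages_Free should drop and HugePages_Rsvd should be small).
grep -i hugepages /proc/meminfo
```

On 11.2 and later, setting the instance parameter use_large_pages=only is a common way to make sure the SGA does not silently fall back to regular pages.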
References: Linux: Out-of-Memory (OOM) Killer (Doc ID 452000.1) and the RHEL online documentation.