前几日,有客户报一个备用库实例中有多个回话hang住的问题,在hang住前本地有维护人员执行了truncate表的操作。
同事前往客户现场进一步确认了问题,并传回了当时hang状况下的266级systemstate文件。该实例的并发回话数量较少,所以实例状态抓取后总的信息量并不多,这种情形中通过分析systemstate信息往往要好于分析hanganalyze信息。
通过著名的源自于metalink的ass awk脚本可以很快找出各进程的状态,以及重要资源的持有者:
awk -f ass109.awk systemstate.txt
Starting Systemstate 1
…………………………………..
Ass.Awk Version 1.0.9 – Processing systemstate.txt
System State 1
~~~~~~~~~~~~~~~~
1:
2: last wait for ‘pmon timer’
3: waiting for ‘rdbms ipc message’ wait
4: waiting for ‘rdbms ipc message’ wait
5: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW0
6: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW1
7: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW2
8: waiting for ‘rdbms ipc message’ wait
9: waiting for ‘rdbms ipc message’ wait
10: waiting for ‘rdbms ipc message’ wait
11: waiting for ‘enq: RO – fast object reuse'[Enqueue RO-0001000B-00000001] wait
12: waiting for ‘rdbms ipc message’ wait
13: waiting for ‘rdbms ipc message’ wait
14: waiting for ‘rdbms ipc message’ wait
15: waiting for ‘rdbms ipc message’ wait
16: waiting for ‘SQL*Net message from client’ wait
17: waiting for ‘SQL*Net message from client’ wait
18: waiting for ‘SQL*Net message from client’ wait
19: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait
20: waiting for ‘SQL*Net message from client’ wait
21: waiting for ‘SQL*Net message from client’ wait
22: last wait for ‘SQL*Net message from client’ [DEAD]
Cmd: Delete
23: waiting for ‘SQL*Net message from client’ wait
24: waiting for ‘SQL*Net message from client’ wait
25:
26: waiting for ‘SQL*Net message from client’ wait
27: last wait for ‘ksdxexeotherwait’
28: waiting for ‘enq: TX – row lock contention'[Enqueue TX-00120003-00002CC0] wait
Cmd: Update
29: waiting for ‘SQL*Net message from client’ wait
30: waiting for ‘SQL*Net message from client’ wait
31: waiting for ‘SQL*Net message from client’ wait
32: waiting for ‘SQL*Net message from client’ wait
33: waiting for ‘SQL*Net message from client’ wait
34: waiting for ‘Streams AQ: qmn coordinator idle wait’ wait
35: for ‘Streams AQ: waiting for time management or cleanup tasks’ wait
36: waiting for ‘Streams AQ: qmn slave idle wait’ wait
37: waiting for ‘enq: RO – fast object reuse'[Enqueue RO-00010025-00000001] wait
38: waiting for ‘SQL*Net message from client’ wait
39: waiting for ‘SQL*Net message from client’ wait
41: waiting for ‘SQL*Net message from client’ wait
42: last wait for ‘enq: TX – row lock contention’ wait
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of ‘???’ implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 70000048c8c66b0 ??? Blocker
Enqueue RO-0001000B-00000001 10: waiting for ‘rdbms ipc message’
Enqueue RO-0001000B-00000001 11: 11: is waiting for 10: 11:
Enqueue TX-00120003-00002CC0 22: last wait for ‘SQL*Net message from client’
Enqueue RO-00010025-00000001 10: waiting for ‘rdbms ipc message’
Enqueue RO-00010025-00000001 37: 37: is waiting for 10: 37:
Object Names
~~~~~~~~~~~~
Latch 70000048c8c66b0 holding (efd=14) 70000048c8c66b0 Child ca
Enqueue RO-0001000B-00000001
Enqueue TX-00120003-00002CC0
Enqueue RO-00010025-00000001
首先注意到的是RO队列锁,RO即(REUSE OBJECT),该锁用以协调前台进程与后台进程DBWR和CKPT之间的工作,该队列一般只在drop或truncate对象时可见到。可以看到pid为11和37的进程均在等待pid为10的进程,因为pid=10的进程持有着它们锁需要的RO-0001000B-00000001和RO-00010025-00000001,分析dump文件可以发现该持有进程正是CKPT后台进程;
而该检查点进程也处于BUSY的非空闲等待中,其等待事件为’rdbms ipc message’,即它在等待另一个后台进程给它发送信息。这个时候我们来观察其他忙碌的后台进程可以发现,pid为5,6,7的进程均在等待同一个栓’latch: cache buffers chains'[Latch 70000048c8c66b0];这个三个进程均为DBWR进程,此外还有一个DBW3进程处于’rdbms ipc message’等待中,多个DBWR进程是由于设置了db_writer_processes参数;看起来是ckpt进程准备对需要truncate的对象做对象级别的检查点,以保证该对象所有脏块均已写到磁盘上,所以对dbwr进程发出需要写出的message,继而进入’rdbms ipc message’等待直到dbwr进程完成写出任务,但由于dbwr进程长期无法获取某脏块对应的latch: cache buffers chains,故写出工作一直处于pending状态,这样一个hang链就十分清晰了。
我们来分析’latch: cache buffers chains'[Latch 70000048c8c66b0]这个栓,ass分析systemstate dump时将该栓的holder归为’???’,即无法自dump文件中找到该栓的持有者;进一步直接分析dump文件可以发现:
PROCESS 2:
—————————————-
SO: 70000048f529d40, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 70000047d0e75a0/70000048f8956d0, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 504403177803376304 122 2
last post received-location: kslges
last process to post me: 70000048e533a88 1 6
last post sent: 0 0 148
last post sent-location: ktmpsm
last process posted by me: 70000048e535228 1 22
(latch info) wait_event=0 bits=2
holding (efd=14) 70000048c8c66b0 Child cache buffers chains level=1 child#=61214
Location from where latch is held: kcbgcur: kslbegin:
Context saved from call: 336311247
state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
waiters [orapid (seconds since: put on list, posted, alive check)]:
5 (480, 1278471465, 0)
6 (480, 1278471465, 0)
19 (480, 1278471465, 0)
7 (216, 1278471465, 0)
waiter count=4
Process Group: DEFAULT, pseudo proc: 70000048e5f91d8
O/S info: user: oracle, term: UNKNOWN, ospid: 74476
OSD pid info: Unix process pid: 74476, image: oracle@DR_570 (PMON)
可以看到pid=2的PMON进程holding 该cache buffers chains子栓,而dbwx进程即waiters 5,6,7;
一般情况下pmon是不会去持有类似于cache buffers chains子栓这样的低级栓的,除非在cleanup失败会话或关闭实例情况下。从ass分析信息来看,当时确实有一个进程处于DEAD状态,即22: last wait for ‘SQL*Net message from client’ [DEAD],而该会话最后所做的是delete from “LINC”.”MSDB_ACCGL” t操作。
虽然无法证明,但极有可能是22号进程在进行delete过程中发生会话失败,PMON进程尝试清理该进程,并获取了相关栓。但该进程始终没有被杀死,即便使用OS 命令:kill -9 处理该进程后仍可以在systemstate中找到该进程的信息。据同事分析,当时之前曾有一度PMON的CPU使用率达到100%,之后PMON进程进入’pmon timer’空闲等待,且一直没有释放对应子栓,令DBWR进程处于长期无法获得栓资源的进而hang住的状态。
在MOS上搜索PMON+cache buffers chains可以发现几个PMON长期持有该类子栓且从不释放的Bug,但版本为Oracle 8等较老版本,且都是不能reproduceable的case。谨以录之:
Hdr: 4126734 8.1.7.4.0 RDBMS 8.1.7.4.0 BUFFER CACHE PRODID-5 PORTID-59
Abstract: ORACLE PROCESS GOES TO CPU LOOP WHILE HOLDING “CHILD CACHE BUFFERS CHAINS” LATCH*** 01/17/05 07:08 pm ***
TAR:
—-
4249496.999PROBLEM:
——–
1. Clear description of the problem encountered:Oracle process goes to CPU loop while holding “Child cache buffers chains
level=1” latch. This condition causes other processes to wait on
either “latch free” or “Child cache buffers chains” which completely hangthe instance.
When looping Oracle process is killed, PMON also goes to CPU loop
cleaning
up that process. PMON also hold the same “Child cache buffers chains
level=1” latch while instance continues to hang.Hang disappear only after instance is bounced.
2. Pertinent configuration information (MTS/OPS/distributed/etc)
11i Apps Install (version 11.5.9.) running on Oracle 8.1.7.4.0 database.
3. Indication of the frequency and predictability of the problem
Process running pl/sql procedure doing dbms_lob.instr() and
dbms_lob.read()
seem to encounter this problem.4. Sequence of events leading to the problem
5. Technical impact on the customer. Include persistent after effects.
In a bad day, ct encounters 2 occurence of this problem a day causing
loss
of service in peak processing period.DIAGNOSTIC ANALYSIS:
——————–
1. Got 3 systemstate dumps while hang is occuring
2. Got 3 errorstack dumps of PMON while it is looping in CPUWORKAROUND:
———–
bounce instanceRELATED BUGS:
————-
bug 2361194REPRODUCIBILITY:
—————-
– always reproducible on test system at ct site.TEST CASE:
———-
– unable to reproduce on 8.1.7.4.0 test system in-houseSTACK TRACE:
————
PMON stack while looping:
ksedmp ksdxfdmp ksdxcb sspuser _sigreturn kggchk kcbso1 kcbpsod kcbsod
kssxdl kssdch ksudlc kssxdl ksudlp kssxdl ksuxdl ksuxda ksucln ksbrdp
opirip opidrv sou2o main $START$SUPPORTING INFORMATION:
———————–
Customers running Release 11i of the E-Business Suite will be entitled
to complimentary Extended Maintenance Support through July 31, 2005.