On a 4-node 11.2.0.1 RAC system running on Linux x86-64, the LMS GCS service process hit internal error ORA-00600 [kjbmprlst:shadow], which brought down the node's instance unexpectedly. The alert log reads:
Fri Jul 08 02:04:43 2011
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc (incident=1011732):
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/PROD/PROD1/incident/incdir_1011732/PROD1_lms1_536_i1011732.trc
Fri Jul 08 02:04:44 2011
Trace dumping is performing id=[cdmp_20110708020444]
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 536): terminating the instance due to error 484
Fri Jul 08 02:04:45 2011
opiodr aborting process unknown ospid (27387) as a result of ORA-1092
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_diag_513.trc
Fri Jul 08 02:04:54 2011
Termination issued to instance processes. Waiting for the processes to exit
Fri Jul 08 02:04:58 2011
ORA-1092 : opitsk aborting process
This ORA-00600 [kjbmprlst:shadow] error matches Bug 10121589 or Bug 9458781 on 11.2.0.1:
Bug 10121589  ORA-600 [kjbmprlst:shadow] can occur in RAC

Affects:
    Product (Component)                        Oracle Server (Rdbms)
    Range of versions believed to be affected  Versions BELOW 12.1
    Versions confirmed as being affected       11.2.0.1
    Platforms affected                         Generic (all / most platforms affected)

Fixed:
    This issue is fixed in
        12.1 (Future Release)
        11.2.0.2 Bundle Patch 2 for Exadata Database
        11.2.0.1 Bundle Patch 7 for Exadata Database

Symptoms:
    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow]
Related To:
    RAC (Real Application Clusters) / OPS

Description
    An ORA-600 [kjbmprlst:shadow] can occur if the fix for bug 9979039 is present.
    Note: One off patches for 10200390 should also include this fix.

Bug 9458781  Missing close message to master leaves closed lock dangling crashing the instance with assorted internal errors

Affects:
    Product (Component)                        Oracle Server (Rdbms)
    Range of versions believed to be affected  Versions >= 11.2.0.1 but BELOW 11.2.0.2
    Versions confirmed as being affected       11.2.0.1
    Platforms affected                         Generic (all / most platforms affected)

Fixed:
    This issue is fixed in
        11.2.0.2 (Server Patch Set)
        11.2.0.1 Bundle Patch 4 for Exadata Database

Symptoms:
    Instance May Crash
    Internal Error May Occur (ORA-600)
    ORA-600 [KJBMPRLST:SHADOW]
    ORA-600 [KJBMOCVT:RID]
    ORA-600 [KJBRREF:PKEY]
    ORA-600 [KJBRASR:PKEY]
Related To:
    RAC (Real Application Clusters) / OPS

Description
    A lock is closed without sending a message to the master. This causes a closed
    lock dangling at the master, crashing the instance with different internal
    errors. Reported internal errors so far are:
      - KJBMPRLST:SHADOW
      - KJBMOCVT:RID
      - KJBRREF:PKEY
      - KJBRASR:PKEY
The internal function kjbmprlst manages kjbm shadow-lock information (kjbm.o in libserver10.a). The defect is that a closed lock may fail to be messaged to the master node in time, leaving it dangling there. Apart from applying a patch, there is currently no verified workaround (disabling DRM appears to be ineffective):
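For reference, the commonly cited way to suppress DRM on 11g uses hidden parameters; the sketch below is an assumption, not a verified fix (underscore parameters are undocumented and version-dependent, so confirm with Oracle Support first), and as noted it did not appear to prevent this particular error:

```sql
-- Assumed 11g parameter names; verify on your version before use.
alter system set "_gc_policy_time"=0 scope=spfile sid='*';        -- disable policy-based (DRM) remastering
alter system set "_gc_undo_affinity"=false scope=spfile sid='*';  -- disable undo affinity remastering
-- All instances must then be restarted for the change to take effect.
```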
oradebug lkdebug (track resources, take dumps)
KCL history, KJBL history, KJL history

PCM (GCS) and non-PCM (GES) resources are kept separate and use separate code paths.
GES: Resource table: kjr and kjrt; Lock table: kjlt; Processes: kjpt
GCS: Resource table: kjbr; Lock table: kjbl

DLM Structures (continued)

/* PCM resource structure */
typedef struct kjbr {                 /* 68 bytes on sun4u */
  kjsolk hash_q_kjbr;                 /* hash list : hp */
  ub4 resname_kjbr[2];                /* the resource name */
  kjsolk scan_q_kjbr;                 /* chain to lmd scan q of grantable resources */
  kjsolk grant_q_kjbr;                /* list of granted resources */
  kjsolk convert_q_kjbr;              /* list of resources being converted */
  ub4 diskscn_bas_kjbr;               /* scn(base) known to be on disk */
  ub2 diskscn_wrap_kjbr;              /* scn(wrap) known to be on disk */
  ub2 writereqscn_wrap_kjbr;          /* scn(wrap) requested for write */
  ub4 writereqscn_bas_kjbr;           /* scn(base) requested for write */
  struct kjbl *sender_kjbr;           /* lock elected to send block */
  ub2 senderver_kjbr;                 /* version# of above lock */
  ub2 writerver_kjbr;                 /* version# of lock below */
  struct kjbl *writer_kjbr;           /* lock elected to write block */
  ub1 mode_role_kjbr;                 /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1 flags_kjbr;                     /* ignorewip, free etc. */
  ub1 rfpcount_kjbr;                  /* refuse ping counter */
  ub1 history_kjbr;                   /* resource operation history */
  kxid xid_kjbr;                      /* split transaction ID */
} kjbr;

/* kjbl - PCM lock structure
** Clients and most of the DLM will use the KJUSER* or KJ_* modes and kscns */
typedef struct kjbl {                 /* 52 bytes on sun4u */
  union {                             /* discriminate lock@master and lock@client */
    struct {                          /* for lock@master */
      kgglk state_q_kjbl;             /* link to chain to resource */
      kjbopqi *rqinfo_kjbl;           /* target bid */
      struct kjbr *resp_kjbl;         /* pointer to my resource */
    } kjbllam;                        /* KJB Lock Lock At Master */
    struct {                          /* for lock@client */
      ub4 disk_base_kjbl;             /* disk version(base) for replay */
      ub2 disk_wrap_kjbl;             /* disk version(wrap) for replay */
      ub1 master_node_kjbl;           /* master instance# */
      ub1 client_flag_kjbl;           /* flags specific to client locks */
      ub2 update_seq_kjbl;            /* last update to master */
    } kjbllac;                        /* KJB Lock Lock At Client */
  } kjblmcd;                          /* KJB Lock Master Client Discriminant */
  void *remote_lockp_kjbl;            /* pointer to client lock or shadow */
  ub2 remote_ver_kjbl;                /* remote lock version# */
  ub2 ver_kjbl;                       /* my version# */
  ub2 msg_seq_kjbl;                   /* client->master seq# */
  ub2 reqid_kjbl;                     /* requestid for convert */
  ub2 creqid_kjbl;                    /* requestid for convert that has been cancelled */
  ub2 pi_wrap_kjbl;                   /* scn(wrap) of highest pi */
  ub4 pi_base_kjbl;                   /* scn(base) of highest pi */
  ub1 mode_role_kjbl;                 /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1 state_kjbl;                     /* _L|_R|_W|_S, notify, which q, lock type */
  ub1 node_kjbl;                      /* instance lock belongs to */
  ub1 flags_kjbl;                     /* lock flag bits */
  ub2 rreqid_kjbl;                    /* save the reqid */
  ub2 write_wrap_kjbl;                /* last write request version(wrap) */
  ub4 write_base_kjbl;                /* last write request version(base) */
  ub4 history_kjbl;                   /* lock operation history */
} kjbl;

PCM DLM locks that are owned by the local instance are allocated and embedded in an LE structure.
PCM DLM locks that are owned by remote instances and mastered by the local instance are allocated in SHARED_POOL.
PCM Locks and Resources

Fields of interest in the kclle structure: kcllerls or releasing; kcllelnm or name (id1,id2); kcllemode or held-mode; kclleacq or acquiring; kcllelck or DLM lock.

Fields of interest in the kjbr structure: resname_kjbr[2] or resource name; grant_q_kjbr or grant queue; convert_q_kjbr or convert queue; mode_role_kjbr, which is a bitwise merge of grant mode and role: NULL (0x00), S (0x01), X (0x02); L0 Local (0x00), G0 Global without PI (0x08), G1 Global with PI (0x18).

The field mode_role_kjbl in kjbl is a bitwise merge of grant mode, request mode, and lock state:
0x00 if grant NULL; 0x01 if grant S; 0x02 if grant X;
0x04 if the lock has been opened at the master;
0x08 if global role (otherwise local);
0x10 if it has one or more PIs;
0x20 if request CR; 0x40 if request S; 0x80 if request X.

Someone has to keep a list of all buffers and where they are mastered. This list is called the Global Resource Directory (GRD), and it is present on all instances of the cluster. To find out the master of a block:

select b.dbablk, r.kjblmaster master_node
  from x$le l, x$kjbl r, x$bh b
 where b.obj     = &data_object_id   /* supply the object's data_object_id */
   and b.le_addr = l.le_addr
   and l.le_kjbl = r.kjbllockp;
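As a usage sketch of the GRD lookup above (the owner, table name, and data_object_id value are made up for illustration; run as SYS on one RAC instance):

```sql
-- 1. Resolve the object's data_object_id (SCOTT.EMP is a hypothetical table).
select data_object_id
  from dba_objects
 where owner = 'SCOTT'
   and object_name = 'EMP';

-- 2. Feed it into the x$ join; kjblmaster reports the mastering instance
--    (commonly a 0-based instance number).
select b.dbablk, r.kjblmaster master_node
  from x$le l, x$kjbl r, x$bh b
 where b.obj     = 73181     -- example data_object_id returned by step 1
   and b.le_addr = l.le_addr
   and l.le_kjbl = r.kjbllockp;
```

Only blocks currently present in the local buffer cache appear in x$bh, so the result reflects the local instance's view of the GRD.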
Oracle Support states that the bug is fixed by 11.2.0.2 (Server Patch Set) or by 11.2.0.1 Bundle Patch 4 for Exadata Database, but there are indications that this ORA-00600 [kjbmprlst:shadow] can still occur on 11.2.0.2, and the bug is more likely to appear in RAC systems with more than 2 nodes.
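Because the dangling-lock window is exercised by DRM activity, it can be useful to see how much dynamic remastering a suspect cluster is actually doing. One option (the view exists in 11g; treat the exact columns and output as version-dependent) is:

```sql
-- Sketch: inspect cumulative dynamic remastering statistics on this instance.
select * from v$dynamic_remaster_stats;
-- Frequent remaster operations on a >2 node 11.2.0.1 cluster widen the
-- exposure window for this class of ORA-600 errors.
```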
SO: 0xc91b29550, type: 4, owner: 0xc814385d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0xc814385d0, name=session, file=ksu.h LINE:11467, pg=0
(session) sid: 1847 ser: 1 trans: (nil), creator: 0xc814385d0
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x408) -/-
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 0.328702 sec ago
Wait State:
fixed_waits=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.328731 sec since last wait
0: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066846 seq_num=26443 snap_id=1
wait times: snap=0.037705 sec, exc=0.037705 sec, total=0.037705 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000026 sec of elapsed time
1: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066845 seq_num=26442 snap_id=1
wait times: snap=0.003826 sec, exc=0.003826 sec, total=0.003826 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000033 sec of elapsed time
2: waited for 'ges message buffer allocation'
pool=0x0, request=0x1, allocated=0x0
wait_id=2105066844 seq_num=26441 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000008 sec of elapsed time
3: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066843 seq_num=26440 snap_id=1
wait times: snap=0.004991 sec, exc=0.004991 sec, total=0.004991 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000019 sec of elapsed time
4: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066842 seq_num=26439 snap_id=1
wait times: snap=0.007574 sec, exc=0.007574 sec, total=0.007574 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000008 sec of elapsed time
5: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066841 seq_num=26438 snap_id=1
wait times: snap=0.000036 sec, exc=0.000036 sec, total=0.000036 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000019 sec of elapsed time
6: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066840 seq_num=26437 snap_id=1
wait times: snap=0.009855 sec, exc=0.009855 sec, total=0.009855 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000015 sec of elapsed time
7: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066839 seq_num=26436 snap_id=1
wait times: snap=0.000061 sec, exc=0.000061 sec, total=0.000061 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000014 sec of elapsed time
8: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066838 seq_num=26435 snap_id=1
wait times: snap=0.000071 sec, exc=0.000071 sec, total=0.000071 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000020 sec of elapsed time
9: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=2105066837 seq_num=26434 snap_id=1
wait times: snap=0.003144 sec, exc=0.003144 sec, total=0.003144 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000023 sec of elapsed time
Hdr: 9491134 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 ORA-600
Abstract: LMD CRASHES INSTANCE WITH ORA-600 [KJMSCNDSCQ:TIMEOUT]
*** 03/19/10 07:55 am ***
TAR:
----
PROBLEM:
--------
1. Clear description of the problem encountered:
- frequent instance crashes due to LMD crashing with :
ORA-600: internal error code, arguments: [kjmscndscq:timeout]
- 4 instances, 11.2.0.1 on Linux X86_64
- Oracle RDS/IP (generic)
- rds-tools Package 1.4-1 11-SEP-09
- Using Infiniband
- For example : Mar 18 at 02.06 inst#2 crashed :
Thu Mar 18 02:06:56 2010
Errors in file /u01/app/oracle/diag/rdbms/pr/PR2/trace/PR2_lmd0_7935.trc
(incident=124183):
ORA-600: internal error code, arguments: [kjmscndscq:timeout], [], [], [],
[], [], [], [], [], [], [], []
Incident details in:
/u01/app/oracle/diag/rdbms/pr/PR2/incident/incdir_124183/PR2_lmd0_7935_i124183
.trc
…
LMD0 (ospid: 7935): terminating the instance due to error 482
System state dump is made for local instance
System State dumped to trace file
/u01/app/oracle/diag/rdbms/pr/PR2/trace/PR2_diag_7921.trc
Thu Mar 18 02:07:04 2010
ORA-1092 : opitsk aborting process
Thu Mar 18 02:07:04 2010
License high water mark = 70
Thu Mar 18 02:07:09 2010
Instance terminated by LMD0, pid = 7935
2. Sequence of events leading to the problem
- Has seen it happen when :
1) When RMAN was backing up the FRA to tape (netbackup).
2) running a select on a large table that has thousands of inserts per
second.
3. Technical impact on the customer. Include persistent after effects.
- Instances restart.
DIAGNOSTIC ANALYSIS:
--------------------
- "PR2_lmd0_7935_i124183.trc"
ACTION NAME:() 2010-03-18 02:06:56.860
Dump continued from file:
/u01/app/oracle/diag/rdbms/pr/PR2/trace/PR2_lmd0_7935.trc
ORA-600: internal error code, arguments: [kjmscndscq:timeout], [], [], [],
[], [], [], [], [], [], [], []
- "PR2_lmon_7933.trc"
2010-03-18 02:04:04.149
CGS recovery timeout = 85 sec
Begin DRM(23) (swin 0)
--> file ends here
- "PR2_diag_7921.trc"
PROCESS 12:
…
last post sent-location: kji.h LINE:2816 ID:kjata: wake up enqueue owner
..
OSD pid info: Unix process pid: 7935, image:
oracle@orlandop02.tranquilidade.cst (LMD0)
…
Current Wait Stack:
Not in wait; last wait ended 0.302150 sec ago
Wait State:
fixed_waits=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.302169 sec since last wait
0: waited for 'ges LMD suspend for testing event'
=0x32, =0x0, =0x0
wait_id=944384 seq_num=26897 snap_id=1
wait times: snap=0.066003 sec, exc=0.066003 sec, total=0.066003 sec
wait times: max=0.060000 sec
wait counts: calls=1 os=1
occurred after 0.000021 sec of elapsed time
…
----------resource 0x197f9cda80----------------------
resname : [0x0][0x0],[RM]
…
CONVERT_Q:
lp 0x18254c5908 gl KJUSERNL rl KJUSEREX rp 0x197f9cda80 [0x0][0x0],[RM]
master 2 pid 7935 bast 0 rseq 1 mseq 0 history 0x55a5555a
- It looks like it's in mid-DRM and LMD is trying to get an RM enqueue, but I
can't see the holder of this resource on the other instances though.
WORKAROUND:
-----------
- Suggested ct. disable DRM for now to get some stability, waiting for
feedback on that.
RELATED BUGS:
-------------
- many bugs reported for this ORA-600 but didn't see clear match with this
case.
REPRODUCIBILITY:
----------------
TEST CASE:
----------
STACK TRACE:
------------
SUPPORTING INFORMATION:
-----------------------
- uploaded 'logs.tar.gz' containing four tarballs, one for each node's trace
dump location.
- Each one unzips into a folder 'diag' so rename them as they are extracted.
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------
DIAL-IN INFORMATION:
--------------------
IMPACT DATE:
------------
Hdr: 5624062 10.2.0.1.0 RDBMS 10.2.0.1.0 PRODID-5 PORTID-59 ORA-600
Abstract: LMS PROCESS DOWN OWING TO ORA-600 [KJMVALIDATE:!PCM], [2], [32], [0], []
*** 10/26/06 11:56 pm ***
TAR:
----
5910169.993
PROBLEM:
--------
NODE 2 >>
Thu Oct 26 22:01:35 2006
Thread 2 advanced to log sequence 6258
Current log# 6 seq# 6258 mem# 0: /dev/vg02/rredolog61
Current log# 6 seq# 6258 mem# 1: /dev/vg02/rredolog62
Thu Oct 26 22:04:08 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms1_13420.trc:
ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
[0], [], [], [], []
Thu Oct 26 22:04:10 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms1_13420.trc:
ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
[0], [], [], [], []
Thu Oct 26 22:04:10 2006
Trace dumping is performing id=[cdmp_20061026220410]
Thu Oct 26 22:04:17 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms1_13420.trc:
ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
[0], [], [], [], []
Thu Oct 26 22:04:17 2006
LMS1: terminating instance due to error 484
Thu Oct 26 22:04:17 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lmon_13387.trc:
ORA-484: LMS* process terminated with error
Thu Oct 26 22:04:17 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms0_13409.trc:
ORA-484: LMS* process terminated with error
Thu Oct 26 22:04:17 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lmd0_13389.trc:
ORA-484: LMS* process terminated with error
Thu Oct 26 22:04:17 2006
Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms2_13422.trc:
ORA-484: LMS* process terminated with error
Thu Oct 26 22:04:19 2006
System state dump is made for local instance
System State dumped to trace file
/oracle/app/oracle/admin/APOLLO/bdump/apollo2_diag_13383.trc
Thu Oct 26 22:04:22 2006
Instance terminated by LMS1, pid = 13420
Fri Oct 27 09:17:58 2006
Starting ORACLE instance (normal)
DIAGNOSTIC ANALYSIS:
--------------------
apollo2_lms1_13420.trc >>
MSG [32:KJX_B_OCR] [0x25d54.c0000] inc=26 len=352 sender=(0,0) seq=2521440
fg=p stat=KJUSERSTAT_DONE spid=10552 flg=xff
*** 22:04:08.761
ksedmp: internal or fatal error
ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
[0], [], [], [], []
----- Call Stack Trace -----
kgerinv kgeasnmierr kjmvalidate kjmpbmsg kjmsm ksbrdp
  kgerinv - KGE Record Internal (Named) error (with Va_list)
  ksbrdp  - KSB: run a detached (background) process
apollo2_diag_13383.trc >>
5: last wait for 'rdbms ipc message' (LMON)
6: last wait for 'ges remote message' (LMD0)
7: last wait for 'gcs remote message' (LMS0)
8: last wait for 'ges LMD suspend for testing event' (LMS1)
9: last wait for 'gcs remote message' (LMS2)
WORKAROUND:
-----------
none
RELATED BUGS:
-------------
REPRODUCIBILITY:
----------------
TEST CASE:
----------
STACK TRACE:
------------
apollo2_lms1_13420.trc >>
MSG [32:KJX_B_OCR] [0x25d54.c0000] inc=26 len=352 sender=(0,0) seq=2521440
fg=p stat=KJUSERSTAT_DONE spid=10552 flg=xff
*** 22:04:08.761
ksedmp: internal or fatal error
ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
[0], [], [], [], []
----- Call Stack Trace -----
kgerinv kgeasnmierr kjmvalidate kjmpbmsg kjmsm ksbrdp
  kgerinv - KGE Record Internal (Named) error (with Va_list)
  ksbrdp  - KSB: run a detached (background) process
Hdr: 9677815 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 9458781
Abstract: INSERT INTO TABLE HANGS IN ' LOCK DEADLOCK RETRY' - 'GES MESSAGE BUFFER ALLOCATI
*** 05/03/10 12:57 pm ***
----
PROBLEM:
--------
3 node rac system. The same insert statement executed only on instance 2
hung 2 times. No other session in cluster was accessing the same table.
The statement completed after the instance was restarted.
DIAGNOSTIC ANALYSIS:
--------------------
Trace file /u01/app/oracle/diag/rdbms/gdhp/GDHP2/trace/GDHP2_ora_30308.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage
Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: Linux
Release: 2.6.18-128.1.16.0.1.el5
Version: #1 SMP Tue Jun 30 16:48:30 EDT 2009
Machine: x86_64
Instance name: GDHP2
Redo thread mounted by this instance: 2
Oracle process number: 91
*** 11:33:55.997
*** ID:(282.5434) 2010-05-02 11:33:55.997
*** ID:() 2010-05-02 11:33:55.997
*** NAME:(GDHPETL) 2010-05-02 11:33:55.997
11:33:55.997
*** NAME:() 2010-05-02 11:33:55.997
Received ORADEBUG command (#1) 'event '10046 trace name context forever,
level 12'' from process 'Unix process pid: 19303, image: '
*** 11:33:56.004
Error encountered while processing ORADEBUG command (#1):
ORA-49100: Failed to process event statement ['10046 trace name context
forever, level 12']
*** 11:36:23.584
Received ORADEBUG command (#2) 'event 10046 trace name context forever, level
12' from process 'Unix process pid: 19303, image: '
*** 11:36:23.584
Finished processing ORADEBUG command (#2) 'event 10046 trace name context
forever, level 12'
WAIT #2: nam='lock deadlock retry' ela= 2838 p1=0 p2=0 p3=0 obj#=16540 tim=1272818183585014
WAIT #2: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=16540 tim=1272818183585051
WAIT #2: nam='lock deadlock retry' ela= 10719 p1=0 p2=0 p3=0 obj#=16540 tim=1272818183596011
WAIT #2: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=16540 tim=1272818183596068
WAIT #2: nam='lock deadlock retry' ela= 10682 p1=0 p2=0 p3=0 obj#=16540 tim=1272818183607055
WAIT #2: nam='ges message buffer allocation' ela= 3 pool=0 request=1 allocated=0 obj#=16540 tim=1272818183607105
WAIT #2: nam='lock deadlock retry' ela= 10600 p1=0 p2=0 p3=0 obj#=16540 tim=1272818183618017
WAIT #2: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=16540 tim=1272818183618069
WAIT #2: nam='lock deadlock retry' ela= 10742 p1=0 p2=0 p3=0 obj#=16540 tim=1272818183629049
WAIT #2: nam='ges message buffer allocation' ela= 3 pool=0 request=1 allocated=0 obj#=16540 tim=1272818183629100
WAIT #2: nam='lock deadlock retry' ela= 10563 p1=0 p2=0 p3=0 obj#=16540 tim=1272818183639989
WAIT #2: nam='ges message buffer allocation' ela= 3 pool=0 request=1 allocated=0 obj#=16540 tim=1272818183640043
------------
Wait event "lock deadlock retry" is set in kcbgcur (KCB: Get a block in
current mode), and the 10046 trace shows many retries, each waiting on 'ges
message buffer allocation'.
There seem to be 2 issues here:
1. The wait on GES message buffer allocation is set up even when there is
no need to wait in kjcsmba.
The root cause seems related to 'ges message buffer allocation';
bug 8630374 has the same issue, fixed in 12.1 (Distributed Lock Manager (DLM)
issue).
REDISCOVERY INFORMATION:
Check if there is a large number of "ges message buffer allocation" wait
events with very small elapsed times when Event 10046 level 8 is set.
2. WAIT #2: nam='lock deadlock retry' ela= 10600 p1=0 p2=0 p3=0 obj#=16540
tim=1272818183618017
This wait is more serious and has much higher impact.
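The rediscovery check above can also be approximated instance-wide from the standard fixed views, without enabling event 10046:

```sql
-- Look for a very large wait count with a near-zero average wait time on
-- 'ges message buffer allocation' across all instances.
select inst_id, event, total_waits,
       time_waited_micro / greatest(total_waits, 1) as avg_wait_us
  from gv$system_event
 where event = 'ges message buffer allocation';
```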
WORKAROUND:
-----------
Restart the instance
RELATED BUGS:
-------------
8630374
REPRODUCIBILITY:
----------------
2 times
TEST CASE:
----------
STACK TRACE:
------------
STG_T_RN_TRADE(RN_TRADE_ID,SRC_SUR_NK,HOSTTRAN_TP,HOSTTRAN_SERIAL_NBR,HOSTTRAN
_ACCT_TP,HOSTTRAN_LEDGER,HOSTTRAN_GRP_ACCT_CD,HOSTTRAN_HOST_LEDGER,HOSTTRAN_GR
P_REF_1,HOSTTRAN_GRP_REF_2,HOSTTRAN_GRP_REF_3,HOSTTRAN_GRP_REF_4,HOSTTRAN_GRP_
REF_5,HOSTTRAN_CLEARING_CD,HOSTTRAN_OTHER_SIDE_LEDGER,HOSTTRAN_FAS_CURR,H
short stack:
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()
<-se1<-kcbgcur()+13746<-kcbnew()+18953<-ktspfmdb()+403<-ktspfmtrng()+1051
<-ktspfsall()+1835<-ktspfsrch()+839<-ktspscan_bmb()+505<-ktspgsp_main()+875
<-kdtgsp()+604<-kdtgrs()+1692<-kdtInsRow()+1008
<-qerltcNoKdtBufferedInsRowCBK()+797<-qerltcLoadStateMachine()+811
<-qerltcFetch()+3372<-insexe()+1581<-opiexe()+22572<-kpoal8()+2288
<-opiodr()+1149<-ttcpip()+1251<-opitsk()
wait history:
* time between current wait and wait #1: 0.000009 sec
1. event: 'gc current request'
time waited: 0.000149 sec
wait id: 1837759 p1: 'file#'=0xc
p2: 'block#'=0x9b345
p3: 'id#'=0x2000001
* time between wait #1 and #2: 0.000014 sec
2. event: 'ges message buffer allocation'
time waited: 0.000002 sec
wait id: 1837758 p1: 'pool'=0x0
p2: 'request'=0x1
p3: 'allocated'=0x0
* time between wait #2 and #3: 0.000023 sec
3. event: 'lock deadlock retry'
time waited: 0.010681 sec
wait id: 1837757