Analyzing an ORA-600 [kjbmprlst:shadow] Failure on Exadata

 

This is a four-node 11.2.0.1 RAC system on Exadata. Since the beginning of the year its instances have repeatedly terminated unexpectedly because the LMS background process raised the internal error ORA-600 [kjbmprlst:shadow]. The four-node RAC preserved overall availability, but the frequent instance crashes still caused application transactions to fail.

I had actually analyzed this problem back in July; see <Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例>. The internal function kjbmprlst manages kjbm shadow lock information (kjbm.o in libserver10.a), and it contains a code defect in which a lock that has been closed is not messaged to the master node in time. At the time I had already offered the view that disabling DRM might avoid the ORA-600 [kjbmprlst:shadow], but none of the existing ORA-600 [kjbmprlst:shadow] cases on MOS mentioned disabling DRM as a workaround, and the SR we filed never got a positive confirmation of that view from Oracle GCS. Besides, running a core production system on 11.2.0.1, a notably buggy release, was clearly inappropriate, and upgrading to an 11.2.0.2 Bundle Patch that already fixes Bug 10121589 looked like a sound plan, so I did not pursue the analysis further.

After that, preparations began for upgrading the Exadata system to 11.2.0.2. In the end, though, the customer judged the Exadata upgrade procedure too complex, with many irreversible steps and few domestic upgrade references to draw on, and abandoned the upgrade plan. Hence the failure analysis that follows.

 

Failure Characteristics

 

As noted above, this is a four-node 11.2.0.1 RAC on Exadata, and the application is Oracle's own Billing and Revenue Management (BRM). Since the beginning of the year the ORA-600 [kjbmprlst:shadow] internal error has occurred repeatedly; it is raised by LMS (the Global Cache Service process), a critical RAC background process, and the result is unexpected instance termination.

 

The ORA-600 [kjbmprlst:shadow] occurred frequently in October (almost twice a week), and the failures showed two characteristics:

  1. They were concentrated within the twenty minutes after 03:00 in the early morning
  2. Nodes 1, 2 and 4 have all failed at one time or another; only node 3 never has (more on this below)

 

Initial Analysis

 

From the application team I obtained the hourly AWR reports for a day on which the failure occurred. That failure hit node 1 at 03:05, so I began by analyzing the performance load on instance 1 at the time.

One way to analyze AWR reports is to compare reports from multiple windows to see how performance changed, provided you know when the problem occurred. Here we compare two windows (a sketch for pulling the reports follows the list):

1. The minutes immediately before the failure; because the post-03:00 AWR report is missing on instance 1, the 02:00 ~ 03:00 report is used
2. The hour before that: the 01:00 ~ 02:00 report
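
For reference, a minimal sketch of pulling these two text reports with DBMS_WORKLOAD_REPOSITORY; the dbid below is a placeholder (take the real value from v$database), and the snapshot IDs are the ones shown in the summaries that follow:

select output
  from table(DBMS_WORKLOAD_REPOSITORY.awr_report_text(
               l_dbid     => 1234567890,   -- placeholder: v$database.dbid
               l_inst_num => 1,            -- instance 1, where the crash occurred
               l_bid      => 14225,        -- begin snapshot (02:00)
               l_eid      => 14226));      -- end snapshot (03:00)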

 

First, DB Time:

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14225 12-Oct-11 02:00:05 284 152.2
End Snap: 14226 12-Oct-11 02:59:59 288 150.1
Elapsed: 59.91 (mins)
DB Time: 809.98 (mins)

 

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14224 12-Oct-11 01:00:10 284 152.3
End Snap: 14225 12-Oct-11 02:00:05 284 152.2
Elapsed: 59.91 (mins)
DB Time: 257.62 (mins)

 

DB Time in AWR reflects, to a degree, how busy the instance is. For the 01:00 to 02:00 window, AAS = 257 / 60 = 4.28, while for the window in which the problem occurred, AAS = 809 / 60 ≈ 13.5. The instance workload clearly rose sharply.
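
AAS here is simply DB Time divided by elapsed wall-clock time. On a running instance the same figure is also exposed directly as a metric; a quick sanity check, for reference:

select begin_time, end_time, round(value, 2) aas
  from v$sysmetric
 where metric_name = 'Average Active Sessions';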

 

Next, the top wait events:

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: TX – row lock contention 10,072 16,079 1596 33.08 Application
log file sync 166,546 10,579 64 21.77 Commit
gc buffer busy acquire 55,828 8,017 144 16.50 Cluster
gc buffer busy release 21,203 5,904 278 12.15 Cluster
gc current block busy 29,196 3,362 115 6.92 Cluster

 

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
log file sync 258,415 4,875 19 31.54 Commit
gc buffer busy acquire 44,471 3,349 75 21.66 Cluster
DB CPU 1,880 12.16
gc current block busy 42,722 1,748 41 11.31 Cluster
gc buffer busy release 10,730 1,127 105 7.29 Cluster

 

The dominant wait event in the problem window is enq: TX – row lock contention, indicating heavy row lock contention. In addition, the global cache waits gc buffer busy acquire/release and gc current block busy are several times longer than in the preceding 01:00 - 02:00 window, showing that inter-node contention for global cache resources intensified further during the problem window. In the 01:00 - 02:00 window, enq: TX – row lock contention is not among the top 5 wait events, yet the transaction rate was actually higher, at 71.8 transactions per second (versus 46.3 per second in the problem window). This suggests the heavy row lock contention was driven not by a larger transaction volume but by the behavior of the application itself.

 

The SQL statement chiefly responsible for both the frequent enq: TX – row lock contention waits and the global cache contention is 1usg75g3cx4n6:

 

Cluster Wait Time (s) Executions %Total Elapsed Time(s) %Clu %CPU %IO SQL Id SQL Module SQL Text
4,047.30 20,470 22.75 17,790.11 21.01 0.04 0.00 1usg75g3cx4n6 update

 

SQL "1usg75g3cx4n6" is an UPDATE statement; its text is not reproduced here.

 

During the problem window this statement ran 20,470 times on instance 1, averaging 0.87s per execution, with cluster wait time accounting for 22.75% of its total elapsed time; during 01:00 - 02:00 it ran 32,845 times, averaging 0.02s per execution, with cluster wait time accounting for 67.44% of its total elapsed time.
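
These per-window figures come from the two AWR reports; they can also be cross-checked against the AWR base tables. A sketch against DBA_HIST_SQLSTAT:

select snap_id,
       executions_delta                    execs,
       round(elapsed_time_delta / 1e6, 1)  elapsed_s,
       round(clwait_delta / 1e6, 1)        cluster_wait_s
  from dba_hist_sqlstat
 where sql_id = '1usg75g3cx4n6'
   and instance_number = 1
 order by snap_id;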

All of this shows that during the problem window, statement 1usg75g3cx4n6 generated enormous load on instance 1 through the interplay of enq: TX – row lock contention (the primary factor) and gc global cache contention (the secondary factor). The row lock contention itself came mainly from concurrent DML updating the same rows.

 

Deeper Analysis

Besides the increased database load and the heavy row lock and global cache contention, the problem window also saw large-scale Dynamic Resource Management (DRM) activity.

DRM exists to reduce CPU and messaging costs in RAC as much as possible. Its re-mastering decisions are based on the global cache access history of the objects involved: the feature periodically identifies the instance that accesses a cached object most frequently and re-masters that object to that instance to improve RAC performance. DRM is carried out cooperatively by the LMON, LMD and LMS processes, with LMS doing the actual transfer of GCS resource and lock information. In 11g it is governed by the hidden parameter _gc_policy_time and by default can be triggered as often as every 10 minutes. Setting _gc_policy_time=0 disables both object affinity and the read-mostly policies introduced in 11g.
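
A small sketch of checking the current values of these hidden parameters through the x$ fixed tables (run as SYS; hidden parameters do not show up in v$parameter by default):

select i.ksppinm parameter, v.ksppstvl value, v.ksppstdf is_default
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm in ('_gc_policy_time', '_gc_read_mostly_locking');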

 

DRM statistics for instance 1 in the problem window, from the AWR report:

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 2 1.00
remastered objects 2 1.00
replayed locks received 215,517 107,758.50
replayed locks sent 339,349 169,674.50
resources cleaned 0 0.00
remaster time (s) 7.4 3.70
quiesce time (s) 4.3 2.13
freeze time (s) 0.1 0.04
cleanup time (s) 0.4 0.21
replay time (s) 1.0 0.48
fixwrite time (s) 0.9 0.47
sync time (s) 0.8 0.38
affinity objects 525 525

 

And the DRM statistics for the 01:00 - 02:00 window:

Name Total per Remaster Op Begin Snap End Snap
remaster ops 5 1.00
remastered objects 8 1.60
replayed locks received 110,902 22,180.40
replayed locks sent 68,890 13,778.00
resources cleaned 0 0.00
remaster time (s) 12.9 2.57
quiesce time (s) 1.4 0.28
freeze time (s) 0.2 0.03
cleanup time (s) 0.9 0.17
replay time (s) 0.5 0.10
fixwrite time (s) 1.9 0.38
sync time (s) 8.0 1.60
affinity objects 526 525

 

Notice that in the problem window DRM sent far more replayed locks; "replay" here means transferring local lock information to the new master instance (REPLAY – Transfer of the local lock information to the new master), and it is the LMS processes that send these replayed locks.
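
Incidentally, these DRM counters are also kept per snapshot in the AWR history; a sketch, assuming the 11.2 history view DBA_HIST_DYN_REMASTER_STATS carries the same columns as the report section:

select snap_id, instance_number,
       remaster_ops, remastered_objects,
       replayed_locks_sent, replayed_locks_received
  from dba_hist_dyn_remaster_stats
 order by snap_id, instance_number;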

As mentioned above, nodes 1, 2 and 4 have all suffered the ORA-600 [kjbmprlst:shadow], yet node 3 never has. Opening node 3's AWR report for the problem window (03:00-04:00) yielded a major finding:

 

Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
3,590.52 0 17.30 34.71 63.63 59v4zh1ac3v2a DBMS_SCHEDULER DECLARE job BINARY_INTEGER := …
3,589.83 0 17.29 51.69 37.98 b6usrg82hwsa3 DBMS_SCHEDULER call dbms_stats.gather_databas…
2,173.79 1 2,173.79 10.47 62.16 28.75 b20w3p880j2gs DBMS_SCHEDULER /* SQL Analyze(1) */ select /*…

 

Node 3 was running "call dbms_stats.gather_database_stats_job_proc ( )", the automatic statistics collection job in 11g, and the job ran for a long time (it had not finished within the hour). Since 10g there has been a default automatic statistics job, strengthened in 11g, which gathers statistics on objects whose data has changed by more than 10% since the last collection. Because this core production database is a large OLTP system, a great deal of data had already been modified before 03:00, so gather_database_stats_job_proc had to gather statistics on many large objects. That forced instance 3 to repeatedly request global cache resources that had previously been updated on instances 1 and 2; the frequent access influenced DRM's decisions, and large numbers of objects were re-mastered to node 3.

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 14 1.00
remastered objects 25 1.79
replayed locks received 1,088,009 77,714.93
replayed locks sent 202,112 14,436.57
resources cleaned 0 0.00
remaster time (s) 30.8 2.20
quiesce time (s) 8.5 0.60
freeze time (s) 0.4 0.03
cleanup time (s) 3.3 0.24
replay time (s) 6.4 0.46
fixwrite time (s) 7.7 0.55
sync time (s) 4.5 0.32
affinity objects 4,294,967,289 4,294,967,289

 

Instance 3 received a huge number of replayed locks (replayed locks received) while sending very few, and it was the LMS processes on instance 1 that actually sent those lock resources to instance 3.

Going through the alert.log and the relevant views, I found that every run of gather_database_stats_job_proc took place at 3 a.m. on instance 3. So whenever a failure occurred, instance 3 only had to receive the replayed locks, with its own LMS processes sending very few, which explains the curious fact noted above: instance 3 never hit the ORA-600 [kjbmprlst:shadow].
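
A sketch of that kind of check, assuming the 11g Scheduler history view (the auto-stats runs appear as ORA$AT_OS_OPT_SY_nnn jobs, and INSTANCE_ID records where each run executed):

select job_name,
       to_char(actual_start_date, 'YYYY-MM-DD HH24:MI') start_time,
       instance_id
  from dba_scheduler_job_run_details
 where job_name like 'ORA$AT_OS_OPT_SY%'
 order by actual_start_date desc;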

 

Let's now read part of the trace file from the LMS process that hit the ORA-600 [kjbmprlst:shadow]:

*** 2011-10-12 02:25:22.770
DRM(22812) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(1) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:29.717
 lms 1 finished fixing gcs write protocol
DRM(22813) win(2) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:30.312
DRM(22813) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(4) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(5) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:31.280
 lms 1 finished fixing gcs write protocol
DRM(22813) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(7) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:32.269
DRM(22813) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(1) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(2) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:33.479
 lms 1 finished fixing gcs write protocol
DRM(22814) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(4) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:34.333
 lms 1 finished fixing gcs write protocol
DRM(22814) win(5) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:35.315
DRM(22814) win(7) lms 1 finished replaying gcs resources
Incident 1035732 created, dump file:
/maclean1_lms1_20075_i1035732.trc
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []

2011-10-12 03:05:36.811503 : kjbmprlst: internal error on replaying 0x618cd.371 from 3
2011-10-12 03:05:36.811541 : kjbmbassert [0x618cd.371]
2011-10-12 03:05:36.811559 : kjbmsassert(0x618cd.371)(2)
2011-10-12 03:05:36.811636 : kjbmsassert(0x618cd.371)(3)
2011-10-12 03:05:36.811667 : kjbmsassert(0x618cd.371)(4)
kjmpbmsg fatal error on 39
MSG [39:KJX_B_REPLAY] .Yq inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
    flow ctrl: ver=60 flag=169 len=136 tkts=0 seq=0 wrp=0
               sndr=3 dest=1 rcvr=2
   FUSION MSG 0x2b178baa5120,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x618cd.371, 574166.0] c[0x59f904e08,-17597] [0x1,xc69]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99097 drmbkt# 99097
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 9 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99097, drmb 99097, apifrz 1
   FUSION MSG DUMP END
MSG [65521:PBATCH] inc=64 len=8376 sender=3 seq=279807301 fg=q
    flow ctrl: ver=1 flag=37 len=8376 tkts=0 seq=279807301 wrp=1
               sndr=3 dest=1 rcvr=2
MSG [39:KJX_B_REPLAY] .Y'z inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
   FUSION MSG 0x2b178baa31c8,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x527f5.27a, 574166.0] c[0xa2f7eddc0,2158] [0x1,xf33]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99582 drmbkt# 99582
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 33 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99582, drmb 99582, apifrz 1
   FUSION MSG DUMP END

.........................

    ------------process 0xcf22a6d40--------------------
    proc version      : 0
    Local node        : 0
    pid               : 20075
    lkp_node          : 0
    svr_mode          : 0
    proc state        : KJP_NORMAL
    Last drm hb acked : 114687
    flags             : x80
    current lock op    : (nil)
    Total accesses    : 11763805
    Imm.  accesses    : 11763774
    Locks on ASTQ     : 0
    Locks Pending AST : 0
    Granted locks     : 0
    AST_Q: 
    PENDING_Q: 
    GRANTED_Q: 
    KJM HIST LMS1: 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 1:0 
      13:65521:0 20:39:0 20:39:0 20:39:0 
    START DEFER MSG QUEUE 0 ON LMS1 flg x87:

MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262147
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca303f2f0 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=109225856
     fg=s stat=KJUSERSTAT_ATTACHFAILED spnum=11 flg=x0
MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262146
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca405e758 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=6400108
     fg=s stat= spnum=11 flg=x20
    END DEFER MSG QUEUE 0 DUMP ON LMS1
     SEQUENCES: 
    [sidx 0]  0:0.0  1:711171046.0  2:279807300.1  3:43212975.0
     DEFER MSG QUEUE 1 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 1]  0:0.0  1:711171047.0  2:279807292.1  3:43212974.0
     DEFER MSG QUEUE 2 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 2]  0:0.0  1:711171047.0  2:279807300.1  3:43212975.0


It is clear that just before the ORA-600 [kjbmprlst:shadow] was actually triggered, the LMS background process was indeed performing DRM operations at a high rate.

It is not hard to see why the ORA-600 [kjbmprlst:shadow] was triggered: high database load, heavy row lock contention driven by the SQL statement on instance 1, and the gather_database_stats_job_proc automatic statistics job on node 3, all present at once, made the LMS processes on instance 1 perform DRM operations at a high rate, until the bug in the code managing kjbm shadow locks was finally triggered under that heavy workload.

Analysis showed that the ORA-600 [kjbmprlst:shadow] failures on instances 2 and 4 were analogous to the one on node 1. Node 3 was the sole exception in the whole RAC: because it always ran gather_database_stats_job_proc, it was always the beneficiary of DRM, while the other three nodes were the victims.

You may well ask: isn't this simply a failure caused by the automatic statistics job? Why bother with all this analysis when you could just turn the job off or move it to a different window?

Answer: gather_database_stats_job_proc alone would not be enough to trigger this failure. On the one hand, all nodes run the same application, with no partitioning by data or by schema, which intensifies the contention for global cache resources between nodes. On the other hand, every buffer cache is large, holds a great many global cache resources, and the data is updated very frequently; once gather_database_stats_job_proc started, masses of global resources flowed toward instance 3, triggering DRM decisions, and the LMS processes on nodes 1, 2 and 4 came under extreme pressure within a short time, which finally triggered the bug. Merely disabling the gather_database_stats_job_proc job, or changing when it runs, is not enough to eliminate the ORA-600 [kjbmprlst:shadow] entirely, because the bug can still be triggered whenever the LMS background processes are under sufficiently heavy load.
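
For completeness, this is roughly what checking and disabling the 11g automatic statistics task looks like via DBMS_AUTO_TASK_ADMIN; as argued above, on its own it would not be a sufficient fix:

select client_name, status
  from dba_autotask_client
 where client_name = 'auto optimizer stats collection';

begin
  DBMS_AUTO_TASK_ADMIN.DISABLE(
    client_name => 'auto optimizer stats collection',
    operation   => NULL,
    window_name => NULL);
end;
/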

 

After all that analysis, my view of how to resolve the ORA-600 [kjbmprlst:shadow] had not changed at all; if anything it was firmer. Disabling the DRM feature fundamentally changes the behavior of the LMS processes and removes any possibility of this ORA-600 [kjbmprlst:shadow] occurring.

But I still needed corroboration!

 

Reference Material

 

Searching MOS, one Note caught my eye: <Bug 12834027 – ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking [ID 12834027.8]>. I had already read this Note during my earlier analysis; at that time it said the problem occurs on 11.2.0.1 and offered no workaround, the only remedy being the 11.2.0.2 + BP patches.

But I noticed the Note had recently been modified, on 21-OCT-2011, to read as follows:

Bug 12834027  ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking
 This note gives a brief overview of bug 12834027.
 The content was last updated on: 21-OCT-2011
 Click here for details of each of the sections below.
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.2 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release) 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow]
    ORA-600 [kjbrasr:pkey] 

    RAC (Real Application Clusters) / OPS
    _gc_read_mostly_locking 

Description

    ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] can occur with RAC
    read mostly locking.

    Workaround
      Disable DRM
     or
      Disable read-mostly object locking
      eg: Run with "_gc_read_mostly_locking"=FALSE

 

 

I came across this Note again on October 28, and it had been updated on October 21. The update could not have been more timely: it provides two workarounds, disabling DRM entirely, or disabling the _gc_read_mostly_locking feature mentioned earlier. The related bug note recounts another Exadata case:

 

Hdr: 12834027 N/A RDBMS 11.2.0.2.0 RAC PRODID-5 PORTID-226 12730844
Abstract: ORA-600 [KJBMPRLST:SHADOW] & [KJBRASR:PKEY] IN A READ MOSTLY & SKIP LOCK ENV

*** 08/04/11 04:22 am ***

PROBLEM:
--------
3 nodes got crashed and restarted in exadata cluster.

DIAGNOSTIC ANALYSIS:
--------------------
8 node Exadata RAC instance. Below instances got terminated and it rejoined
cluster around 11:46 Today (01-Aug).

Instance2 (gmfcdwp2) got ORA-600 [kjbmprlst:shadow] at 11:46 AM and
instance got terminated by LMS1 process.
LMS1 (ospid: 14922): terminating the instance due to error 484

instance got terminated by LMS1 process
LMS1 (ospid: 28176): terminating the instance due to error 484

instance got terminated by LMS1 process

which was seen in instance 4 and 8.

 

That case is strikingly similar to ours: there too, only certain instance numbers on Exadata kept hitting ORA-600 [kjbmprlst:shadow].

The other thing to evaluate is the overall performance impact of disabling DRM. I already covered this in <Important parameters For Oracle BRM Application in 11gR2>: Oracle officially recommends disabling the DRM feature for the BRM application (quite a mouthful), and since this system was running with the default instance parameters shipped with Exadata, well...

 

Solutions

1. Disable the DRM feature. This completely prevents the LMS processes from triggering the bug through DRM operations; concretely, set the hidden parameter _gc_policy_time to 0.

Impact assessment:

DRM can improve database performance to some extent, but it brings no benefit to an application that is not partitioned by data or by schema. Oracle officially recommends disabling DRM for the BRM application; see the Metalink document <Questions About BRM Support For Oracle DB 11GR2 And Exadata [ID 1319678.1]>:

"_gc_policy_time=0 to disable DRM." Changing the _gc_policy_time parameter requires an instance restart.

 

2. Partially disable DRM by setting _gc_read_mostly_locking=false, which avoids the mass transfer of replayed locks and greatly reduces the chance of LMS triggering the bug. This parameter change likewise requires an instance restart.
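
A minimal sketch of applying either workaround on all instances, using the parameter names given in the Note (hidden parameters must be quoted, and a restart of the instances is required either way):

-- workaround 1: disable DRM entirely
alter system set "_gc_policy_time"=0 scope=spfile sid='*';

-- workaround 2: disable only the read-mostly locking feature
alter system set "_gc_read_mostly_locking"=false scope=spfile sid='*';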

 

Case Summary

Having written this much I am a little tired, and perhaps rambling. I may write a separate post on the Read-mostly locking DRM feature new in 11g; I had meant to cover it here, but this post is long enough already.

Sometimes all we need to solve a problem is one suitable fix, which may come from Metalink, Google, or a friend. We may not know how or why that fix actually works, yet, more or less, it works. But that is not the same as truly understanding the fault. I spent about 3-4 hours analyzing this problem, and although the conclusion is almost the same one I reached back in July, I gained a great deal from the analysis itself, something only doing the work yourself can produce.

Enjoy the fun that the analysis brings, folks!

ksvcreate: Process creation failed: A Case

This morning I got a call: users of a 10.2.0.4 database were unable to log in. I connected over VPN from home to provide remote support; once on the host I tested both OS-authenticated and remote logins and found that logins were already working normally again. To find the cause I checked the alert.log and found a burst of ksvcreate: Process creation failed errors around 11:00. The log entries follow:

Wed Mar 23 10:00:17 2011
Process m000 died, see its trace file
Wed Mar 23 10:00:17 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:01:18 2011
Process m000 died, see its trace file
Wed Mar 23 10:01:18 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:02:19 2011
Process m000 died, see its trace file
Wed Mar 23 10:02:19 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:54:41 2011
Process P007 died, see its trace file
Wed Mar 23 10:56:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:57:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:58:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:59:18 2011
Process J000 died, see its trace file
Wed Mar 23 10:59:18 2011
kkjcre1p: unable to spawn jobq slave process
Wed Mar 23 10:59:18 2011
Errors in file /oracle/app/oracle/admin/sdh/bdump/sdh_cjq0_717010.trc:

Wed Mar 23 11:01:00 2011
Process m000 died, see its trace file
Wed Mar 23 11:01:00 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:02:01 2011
Process m000 died, see its trace file
Wed Mar 23 11:02:01 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:02:40 2011
Process P007 died, see its trace file
Wed Mar 23 11:03:02 2011
Process m000 died, see its trace file
Wed Mar 23 11:03:02 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:04:03 2011
Process m000 died, see its trace file
Wed Mar 23 11:04:03 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:04:40 2011
Process P007 died, see its trace file

The ksvcreate: Process creation failed message generally appears when the Oracle instance fails to spawn an auxiliary background process (such as MMON's m00x slaves or the parallel slaves p00x). There are several possible causes, including instance resource exhaustion and OS resource exhaustion. The most common is that the instance has hit its processes limit, and v$resource_limit shows whether the process count has ever maxed out during the life of the instance:

SQL> select * from v$resource_limit;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- --------------------
processes                                       79             800        800                  800
sessions                                        80             813        885                  885
enqueue_locks                                   18             303      10910                10910
enqueue_resources                               18             166       4112            UNLIMITED
ges_procs                                        0               0          0                    0
ges_ress                                         0               0          0            UNLIMITED
ges_locks                                        0               0          0            UNLIMITED
ges_cache_ress                                   0               0          0            UNLIMITED
ges_reg_msgs                                     0               0          0            UNLIMITED
ges_big_msgs                                     0               0          0            UNLIMITED
ges_rsv_msgs                                     0               0          0                    0
gcs_resources                                    0               0          0                    0
gcs_shadows                                      0               0          0                    0
dml_locks                                        0             153       3892            UNLIMITED
temporary_table_locks                            0               3  UNLIMITED            UNLIMITED
transactions                                  3125            3156        973            UNLIMITED
branches                                         0               3        973            UNLIMITED
cmtcallbk                                        0               2        973            UNLIMITED
sort_segment_locks                               0               8  UNLIMITED            UNLIMITED
max_rollback_segments                           19              43        973                65535
max_shared_servers                               0               0  UNLIMITED            UNLIMITED
parallel_max_servers                             6             122        120                 3600

/* MAX_UTILIZATION for processes has reached the LIMIT_VALUE cap of 800;
   sessions shows a similar pattern */

The v$resource_limit output above strongly suggests that the processes limit was reached, causing the new background helper processes to fail to start. This is easy to verify:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE	10.2.0.1.0	Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SQL> alter system set processes=20 scope=spfile;
System altered.

/* set the processes parameter to a deliberately small value */

SQL> startup force;
ORACLE instance started.

SQL> select resource_name,MAX_UTILIZATION,LIMIT_VALUE from v$resource_limit where resource_name in ('processes','sessions');

RESOURCE_NAME               MAX_UTILIZATION LIMIT_VALUE
------------------------------ --------------- ---------------
processes                    20           20
sessions                     19           27

 /* the output shows the process count has hit the limit of 20 */

/* the alert log shows the matching ksvcreate: Process(m000) creation failed entries */

[maclean@rh8 bdump]$ tail -20 alert_PROD1.log
ksvcreate: Process(m000) creation failed
Process m000 died, see its trace file
Wed Mar 23 22:01:52 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 22:01:55 2011
Process q000 died, see its trace file
Wed Mar 23 22:01:55 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:07 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:07 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:19 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:19 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:31 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:31 2011
ksvcreate: Process(q000) creation failed

Unlike 10g, in 11g when a similar error occurs the alert log states the specific reason the process creation failed. For instance, if the process count hit its limit, ORA-00020 is raised (maximum number of processes (%s) exceeded. All process state objects are in use. Increase the value of the PROCESSES initialization parameter), producing log entries like these:

ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Process m001 submission failed with error = 20
Process m002 submission failed with error = 20
Process m003 submission failed with error = 20
Process m003 submission failed with error = 20
Process m002 submission failed with error = 20
2011-03-23 22:10:07.037000 +08:00
Process q001 submission failed with error = 20

In fact, once its root cause is understood, ksvcreate: Process creation failed is not hard to resolve. In the case above we can see clearly that the database hit the configured processes limit at runtime, so the fix is to raise the processes initialization parameter appropriately, or to curb the abnormally frequent client logins.
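
A minimal sketch of the parameter fix; the value 1500 is purely illustrative (size it from MAX_UTILIZATION plus headroom), and processes is a static parameter, so a restart is required:

SQL> alter system set processes=1500 scope=spfile;
SQL> shutdown immediate;
SQL> startup;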

Oracle Internal Error ORA-00600 [15801], [1]: A Case

On an 11.1.0.7 system on SPARC Solaris, index creation kept failing with the internal error ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []. The log information follows:

Tue Aug 17 17:34:21 2010
WARNING: Oracle executable binary mismatch detected.
Binary of new process does not match binary which started instance
issue alter system set "_disable_image_check" = true to disable these messages
Tue Aug 17 17:34:21 2010
Errors in file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p023_22262.trc (incident=12505):
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/incident/incdir_12505/ORAHCMU_p023_22262_i12505.trc
Tue Aug 17 17:34:21 2010
Errors in file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p021_22258.trc (incident=12489):
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/incident/incdir_12489/ORAHCMU_p021_22258_i12489.trc

Errors in file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p015_9328.trc (incident=19909):
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
Errors in file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p043_9388.trc (incident=20133):
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
Mon Aug 23 14:43:42 2010
Errors in file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p087_9668.trc (incident=20485):
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
Mon Aug 23 14:43:42 2010
Errors in file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p012_9322.trc (incident=19885):
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/incident/incdir_19789/ORAHCMU_ora_8602_i19789.trc
Mon Aug 23 14:43:43 2010
WARNING: Oracle executable binary mismatch detected.
Binary of new process does not match binary which started instance
issue alter system set "_disable_image_check" = true to disable these messages

Dump continued from file: /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_ora_8602.trc
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []

*** 2010-08-23 14:43:42.974
----- Current SQL Statement for this session (sql_id=00abhfx460qm9) -----
CREATE UNIQUE iNDEX PS_HM_BEN_GP_STG ON PS_HM_BEN_GP_STG (CAL_ID, GP_PAYGROUP, 
EMPLID, EMPL_RCD, HM_INCURRED_BY, HM_SUM_ASSURED) TABLESPACE PSINDEX STORAGE 
(INITIAL 40000 NEXT 100000 MAXEXTENTS UNLIMITED PCTINCREASE 0) PCTFREE 10 PARALLEL NOLOGGING

----- Call Stack Trace -----
ksedst1 ksedst dbkedDefDump dbgexPhaseII dbgexProcessError dbgePostErrorKGE kgeade kgerem
kxfpProcessError kxfpqidqr kxfpqdqr kxfxgs kxfxcp qerpxSendParse kxfpValidateSlaveGroup kxfpgsg
kxfrAllocSlaves kxfrialo kxfralo qerpx_rowsrc_start qerpxStart kdicrws kdicdrv opiexe opiosq0
kpooprx kpoal8 opiodr ttcpip opitsk opiino opiodr opidrv sou2o main



SO: 0x3bf0bbf20, type: 4, owner: 0x3bf5452d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x3bf5452d0, name=session, file=ksu.h LINE:10719 ID:, pg=0
(session) sid: 217 ser: 767 trans: 0x3bc0660f8, creator: 0x3bf5452d0
flags: (0x8000041) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x44008) DDLT1/-
DID: , short-term DID:
txn branch: 0x0
oct: 9, prv: 0, sql: 0x3b5d14510, psql: 0x3b6d59820, user: 31/SYSADM
ksuxds FALSE at location: 0
service name: ORAHCMU
client details:
O/S info: user: Administrator, term: UJWALTPVM, ospid: 304:2892
machine: WORKGROUP\UJWALTPVM program: pside.exe
client info: ujwal,Administrator,UJWALTPVM,,pside.exe,
application name: pside.exe, hash value=2824484291
Current Wait Stack:
Not in wait; last wait ended 2.475286 sec ago
Wait State:
auto_close=0 flags=0x21 boundary=0x0/-1
Session Wait History:
0: waited for 'lient'
=c8, =1, =0
wait_id=10483 seq_num=10484 snap_id=1
wait times: snap=0.168502 sec, exc=0.168502 sec, total=0.168502 sec
wait times: max=2.000000 sec
wait counts: calls=1 os=1
occurred after 0.000903 sec of elapsed time
1: waited for ' waiting for ruleset'
=10010063, =1, =0
wait_id=10482 seq_num=10483 snap_id=1
wait times: snap=0.008580 sec, exc=0.008580 sec, total=0.008580 sec
wait times: max=2.000000 sec
wait counts: calls=1 os=1
occurred after 0.000731 sec of elapsed time
2: waited for ' waiting for ruleset'
=1001004f, =4, =0
wait_id=10481 seq_num=10482 snap_id=1
wait times: snap=0.000132 sec, exc=0.000132 sec, total=0.000132 sec
wait times: max=2.000000 sec
wait counts: calls=1 os=1
occurred after 0.000074 sec of elapsed time
3: waited for ' waiting for ruleset'
=1001004f, =3, =0
wait_id=10480 seq_num=10481 snap_id=1
wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec
wait times: max=2.000000 sec
wait counts: calls=1 os=1
occurred after 0.000065 sec of elapsed time

----- Session Cursor Dump -----
Current cursor: 1, pgadep=0

Open cursors(pls, sys, hwm, max): 3(0, 2, 64, 300)
NULL=1 SYNTAX=0 PARSE=0 BOUND=1 FETCH=0 ROW=1
Cached frame pages(total, free):
4k(14, 14), 8k(1, 1), 16k(1, 1), 32k(0, 0)

----- Current Cursor -----


----- Plan Table -----

============
Plan Table
============
----------------------------------------------------+-----------------------------------+-------------------------+
| Id | Operation | Name | Rows | Bytes | Cost | Time | TQ |IN-OUT|PQ Distrib |
----------------------------------------------------+-----------------------------------+-------------------------+
| 0 | CREATE INDEX STATEMENT | | | | 2 | | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (ORDER) | :TQ10001 | 82 | 4510 | | |:Q1001| P->S |QC (ORDER) |
| 3 | INDEX BUILD UNIQUE | PS_HM_BEN_GP_STG| | | | |:Q1001| PCWP | |
| 4 | SORT CREATE INDEX | | 82 | 4510 | | |:Q1001| PCWP | |
| 5 | PX RECEIVE | | 82 | 4510 | 2 | 00:00:01 |:Q1001| PCWP | |
| 6 | PX SEND RANGE | :TQ10000 | 82 | 4510 | 2 | 00:00:01 |:Q1000| P->P |RANGE |
| 7 | PX BLOCK ITERATOR | | 82 | 4510 | 2 | 00:00:01 |:Q1000| PCWC | |
| 8 | TABLE ACCESS FULL | PS_HM_BEN_GP_STG| 82 | 4510 | 2 | 00:00:01 |:Q1000| PCWP | |
----------------------------------------------------+-----------------------------------+-------------------------+


----------------------------------------
Cursor#1(0xffffffff7ce31928) state=BOUND curiob=0xffffffff7ce57d28
curflg=4c fl2=0 par=0x0 ses=0x3bf0bbf20
----- Dump Cursor sql_id=00abhfx460qm9 xsc=0xffffffff7ce57d28 cur=0xffffffff7ce31928 -----
Dump Parent Cursor sql_id=00abhfx460qm9 phd=0x3b5d14510 plk=0x3b0bb3318
sqltxt(0x3b5d14510)=CREATE UNIQUE iNDEX PS_HM_BEN_GP_STG ON PS_HM_BEN_GP_STG 
(CAL_ID, GP_PAYGROUP, EMPLID, EMPL_RCD, HM_INCURRED_BY, HM_SUM_ASSURED) 
TABLESPACE PSINDEX STORAGE (INITIAL 40000 NEXT 100000 MAXEXTENTS UNLIMITED PCTINCREASE 0) 
PCTFREE 10 PARALLEL NOLOGGING
hash=616eaa631fc21f4c0029707748605a69
parent=0x3ae539590 maxchild=01 plk=0x3b0bb3318 ppn=n
cursor instantiation=0xffffffff7ce57d28 used=1282545779 exec_id=16777216 exec=1
child#0(0x3b5d05e10) pcs=0x3b678c128
clk=0x3b7e200d0 ci=0x3b5b204c8 pn=0x39955d2b8 ctx=0x3b86ee988
kgsccflg=0 llk[0xffffffff7ce57d30,0xffffffff7ce57d30] idx=0
xscflg=c0102276 fl2=c000400 fl3=2202008 fl4=100
Frames pfr 0xffffffff7ce67098 siz=85976 efr 0xffffffff7ce66fb8 siz=85960
Cursor frame dump
enxt: 7.0x00000168 enxt: 6.0x00008000 enxt: 5.0x00008000 enxt: 4.0x00003978
enxt: 3.0x00000490 enxt: 2.0x000000b8 enxt: 1.0x00000fa0
pnxt: 1.0x00000010
kxscphp=0xffffffff7dd80a18 siz=984 inu=312 nps=312
kxscwhp=0xffffffff7ddd2cc8 siz=8136 inu=6264 nps=3968
kxscefhp=0xffffffff7ce51468 siz=88456 inu=86128 nps=86128


FileName
----------------
ORAHCMU_ora_8602.trc

FileComment
----------------------


Oracle Support - August 27, 2010 6:13:39 PM GMT+08:00 [ODM Data Collection]
Name
--------
=== ODM Data Collection ===

=== ODM Data Collection ===

Trace file /u04/app/oracle/diag/rdbms/orahcmu/ORAHCMU/trace/ORAHCMU_p012_9322.trc


*** 2010-08-23 14:43:00.472
WARNING: Oracle executable binary mismatch detected.
Binary of new process does not match binary which started instance
issue alter system set "_disable_image_check" = true to disable these messages
startup image information
iid info sz=245752512 inode=65458 ts=0x4c6df668
current process image information
iid info sz=245750720 inode=65427 ts=0x4c7204b0
set _disable_image_check = TRUE to disable this check
qksceLinearToCe error

*** 2010-08-23 14:43:42.974
*** SESSION ID:(220.111) 2010-08-23 14:43:42.974
*** CLIENT ID:(ujwal) 2010-08-23 14:43:42.974
*** SERVICE NAME:(ORAHCMU) 2010-08-23 14:43:42.974

DDE: Problem Key 'ORA 600 [15801]' was flood controlled (0x6) (incident: 19885)
ORA-00600: internal error code, arguments: [15801], [1], [], [], [], [], [], [], [], [], [], []
kxfxdss
KXFXSLAVESTATE dump [0, 0]
(pgakid: 0 oercnt: 0 oerrcd: -2224892588)
kxfxdss
no current cursor context.
kxfxdss
no cursors.

The binary mismatch warning is a known consequence of relinking the Oracle binaries while the instance is up. An SR was filed for this case, and Metalink's position is that ORA-600 [15801] generally reflects a communication problem between the QC (query coordinator) and its slave processes:

The ORA-600 15801 is reporting a communication problem between QC and slaves related with messages sent/received.
Alert log reports several of the following error on the ASM instance:
ORA-600: internal error code, arguments: [15801], [1], [], [], [], [], [], 
[]

last wait was for 'eq: Msg Fragment' 

DIAGNOSTIC ANALYSIS:
--------------------
There were also several of the following message in the alert log:
WARNING: Oracle executable binary mismatch detected.
 Binary of new process does not match binary which started instance
issue alter system set "_disable_image_check" = true to disable these 
messages

So, I asked the customer to set the "_disable_image_check" = true 
This had no impact on the ora-600 errors as expected.

ORA-600 [15801] is signalled when a message overflow occurs between  PQ 
processes.

WORKAROUND:
-----------
none 
RELATED BUGS:
-------------
none
REPRODUCIBILITY:
----------------
intermittent but frequently - occurs at all different times of the day.
STACK TRACE:
------------
*** ID:(29.2904) 2006-07-05 15:50:57.972
qksceLinearToCe error
*** 15:50:58.233
ksedmp: internal or fatal error
ORA-600: internal error code, arguments: [15801], [1], [], [], [], [], [], 
[]
----- Call Stack Trace -----

kxfxGeter qks3tttdefReceive kxfxsui kxfxsp kxfxmai kxfprdp 

    SO: 0x67977018, type: 4, owner: 0x6793f208, flag: INIT/-/-/0x00
    (session) sid: 29 trans: (nil), creator: 0x6793f208, flag: (c0000041) 
USR/- BSY/-/-/-/-/-
              DID: 0000-0012-0000FADB, short-term DID: 0000-0000-00000000
              txn branch: (nil)
              oct: 3, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
    O/S info: user: oracle, term: , ospid: 4558, machine: 
    last wait for 'eq: Msg Fragment' blocking sess=0x(nil) seq=2 
wait_time=4441 seconds since wait started=3
                ct path write=1002ffff, ct path write temp=2, Network=0
    Dumping Session Wait History
     for 'eq: Msg Fragment' count=1 wait_time=4441
                ct path write=1002ffff, ct path write temp=2, Network=0
     for 'eq: Msg Fragment' count=1 wait_time=31
                ct path write=1002ffff, ct path write temp=1, Network=0
    temporary object counter: 0

In the end, after events 10235 and 10501 were set, the error stopped occurring in this case:

event = "10235 trace name context forever, level 2"  

10235, 00000, "check memory manager internal structures" 

event = "10501 trace name context forever, level 1"
  
10501, 00000, "periodically check selected heap"
// *Cause:
// *Action:
//    Level:  0x01 PGA
//            0x02 SGA
//            0x04 UGA
//            0x08 current call
//            0x10 user call
//            0x20 large allocation pool
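
For reference, the same two events could also be set through the spfile instead of pfile entries; a sketch (multiple events are colon-separated, and an instance restart is required):

alter system set event='10235 trace name context forever, level 2:10501 trace name context forever, level 1' scope=spfile;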

Oracle Internal Error ORA-00600 [17175]: A Case

A 10.2.0.4 system on HP-UX hit the internal error ORA-00600 [17175]. The relevant log information follows:

Wed Dec 1 01:57:55 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_pmon_3250.trc:
ORA-00600: internal error code, arguments: [17175], [255], [], [], [], [], [], []
ORA-00601: cleanup lock conflict
Wed Dec 1 01:57:57 2010
Trace dumping is performing id=[cdmp_20101201015757]
Wed Dec 1 01:58:05 2010
LGWR: terminating instance due to error 472
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms1_3291.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms2_3293.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms3_3295.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms0_3289.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lmon_3283.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lmd0_3287.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Shutting down instance (abort)
License high water mark = 421

/u01/app/oracle/admin/xgp2/bdump/xgp21_pmon_3250.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name: HP-UX
Node name: XGP2_db1
Release: B.11.31
Version: U
Machine: ia64
Instance name: xgp21
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 3250, image: oracle@XGP2_db1 (PMON)

*** SERVICE NAME:(SYS$BACKGROUND) 2010-12-01 01:57:55.933
*** SESSION ID:(333.1) 2010-12-01 01:57:55.933
*** 2010-12-01 01:57:55.933
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [17175], [255], [], [], [], [], [], []
ORA-00601: cleanup lock conflict


ksedst <- ksedmp <- ksfdmp <- kgeriv <- kgesiv
<- kgesic1 <- kghcln <- kslilcr <- $cold_ksl_cleanup <- ksepop
<- kgepop <- kgesev <- ksesec0 <- $cold_kslges <- ksl_get_child_latch
<- kslgpl <- es <- ksfglt <- kghext_numa <- ksmasgn
<- kghnospc <- $cold_kghalo <- ksmdacnk <- ksmdget <- ksosp_alloc
<- ksoreq_submit <- ksbsrv <- kmmssv <- kmmlsa <- kmmlod
<- ksucln <- ksbrdp <- opirip <- $cold_opidrv <- sou2o
<- $cold_opimai_real <- main <- main_opd_entry

PROCESS STATE
-------------
Process global information:
process: c00000018d000078, call: c00000018d252238, xact: 0000000000000000, curses: c00000018d2508a8, usrses: c00000018d2508a8
----------------------------------------
SO: c00000018d000078, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: c00000018d252238/c00000018d252238, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 48
last post received-location: ksoreq_reply
last process to post me: c00000018d037978 1 64
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c00000018d001058 1 6
(latch info) wait_event=0 bits=90
holding (efd=5) c00000020001d500 Parent+children shared pool level=7
Location from where latch is held: kghfrunp: alloc: clatch nowait:
Context saved from call: 0
state=busy, wlstate=free
holding (efd=5) c00000020000b5f8 OS process allocation level=4
Location from where latch is held: ksoreq_submit:
Context saved from call: 13835058076152957304
state=busy, wlstate=free
Process Group: DEFAULT, pseudo proc: c0000004dd263230
O/S info: user: oracle, term: UNKNOWN, ospid: 3250
OSD pid info: Unix process pid: 3250, image: oracle@XGP2_db1 (PMON)


SO: c0000004df4d5f28, type: 19, owner: c00000018d000078, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=4
outq=0 sndq=0 opid=2 prmb=0x0
mbg[i]=(2 19) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(4 1) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=20 mop[q]=1 pendq=0 zmbq=0
nonksxp_recvs=0
------------process 0xc0000004df4d5f28--------------------
proc version : 0
Local node : 0
pid : 3250
lkp_node : 0
svr_mode : 0
proc state : KJP_NORMAL
Last drm hb acked : 0
Total accesses : 181
Imm. accesses : 180
Locks on ASTQ : 0
Locks Pending AST : 0
Granted locks : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
----------------------------------------
SO: c00000018d2f3610, type: 11, owner: c00000018d000078, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c00000018d000078,
event: 1, last message event: 1,
last message waited event: 1, messages read: 0
channel: (c0000004dd29fdb0) scumnt mount lock
scope: 1, event: 19, last mesage event: 0,
publishers/subscribers: 0/19,
messages published: 0
SO: c00000018d2508a8, type: 4, owner: c00000018d000078, flag: INIT/-/-/0x00
(session) sid: 333 trans: 0000000000000000, creator: c00000018d000078, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0002-00000003, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
last wait for 'latch: shared pool' blocking sess=0x0000000000000000 seq=342 wait_time=175677 seconds since wait started=0
address=c0000002000fff60, number=d6, tries=7
Dumping Session Wait History
for 'latch: shared pool' count=1 wait_time=175677
address=c0000002000fff60, number=d6, tries=7
for 'latch: shared pool' count=1 wait_time=97554
address=c0000002000fff60, number=d6, tries=6
for 'latch: shared pool' count=1 wait_time=78023
address=c0000002000fff60, number=d6, tries=5
for 'latch: shared pool' count=1 wait_time=38978
address=c0000002000fff60, number=d6, tries=4
for 'latch: shared pool' count=1 wait_time=38942
address=c0000002000fff60, number=d6, tries=3
for 'latch: shared pool' count=1 wait_time=19435
address=c0000002000fff60, number=d6, tries=2
for 'latch: shared pool' count=1 wait_time=12655
address=c0000002000fff60, number=d6, tries=1
for 'latch: shared pool' count=1 wait_time=8
address=c0000002000fff60, number=d6, tries=0
for 'os thread startup' count=1 wait_time=144253
=0, =0, =0
for 'os thread startup' count=1 wait_time=141360
=0, =0, =0


SO: c00000018d2f3500, type: 11, owner: c00000018d000078, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c00000018d000078,
event: 2, last message event: 40,
last message waited event: 40, messages read: 1
channel: (c0000004dd29bbd8) system events broadcast channel
scope: 2, event: 224634, last mesage event: 40,
publishers/subscribers: 0/161,
messages published: 1


SO: c00000018d252238, type: 3, owner: c00000018d000078, flag: INIT/-/-/0x00
(call) sess: cur c00000018d2508a8, rec 0, usr c00000018d2508a8; depth: 0
----------------------------------------
SO: c00000018d2594b0, type: 5, owner: c00000018d252238, flag: INIT/-/-/0x00
(enqueue) PR-00000000-00000000 DID: 0001-0002-00000003
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0xc0000004df401718, mode: X, lock_flag: 0x0
own: 0xc00000018d2508a8, sess: 0xc00000018d2508a8, proc: 0xc00000018d000078, prv: 0xc0000004df401728
----------------------------------------
SO: c00000018d30b710, type: 16, owner: c00000018d000078, flag: INIT/-/-/0x00
(osp req holder)
CHILD REQUESTS:
(osp req) type=2(BACKGROUND) flags=0x20001(STATIC/-) state=1(INITED) err=0
pg=0 arg1=0 arg2=(null) reply=(null) pname=S018
pid=0 parent=c00000018d30b710 fulfill=0000000000000000
----------------------------------------
SO: c0000004dbff09c0, type: 192, owner: c0000004dbff09c0, flag: -/-/-/0x00

Searching Metalink for documents related to the ORA-600 [17175] internal error turns up a wealth of material:

Keywords: ora-00600 [17175]

1. Bug 6250251: ORA-00600 17175 DURING KGI CLEANUP - DUMP - ORADEBUG
--ora-600 followed by ora-601 and instance crash with ORA-17175.
--Also, setting of heap check event triggers this problem. In this case
--it is event="10235 trace name context forever, level 27"

2. Bug 4216668 - Dump from INSERT / MERGE on internal columns (Doc ID 4216668.8)
--INSERT or MERGE commands might core dump if operating on object types and internal columns are involved.

3. Bug 7590297: ORA-600 [17175] [255] ORA-601: CLEANUP LOCK CONFLICT CRASHED THE DATABASE

4. SR 3-2296150050
--The error has occurred when Oracle was cleaning shared pool latch/heap information about the process 
which died in middle.
--There is no data corruption associated with this error.
--This is evident from the function kghcln in the trace stack at which it failed.
--This problem is usually the symptom of some earlier problem with the latch.
--Either after a process has died, or a process has signaled an error while holding a shared pool latch, 
and the index to the shared pool latch is invalid.

--There was a Bug 7590297 raised for this issue which could not be progressed due to unavailability of information.
--From few earlier known issues - This can be due to PMON may sometimes signal ORA-601 
while trying to start up additional shared servers or dispatchers.
--There the workaround suggested was to Start the instance with max # of shared servers.

--Can you reproduce the problem?If the instance has been restated the issue may not persist as it is related to memory.
--If the issue persists then we have to perform the following to monitoring the instance to investigate further:

--1. Set the following event in parameter file:
--event="10257 trace name context forever, level 10"
--event="601 trace name SYSTEMSTATE level 10"

--The first event will cause PMON to dump info about shared server startup.
--The second event will cause PMON to do a system state dump when the 601 occurs.

--2. You should also have the track of this in intervals and save the historical results from:

--SQL> select e.total_waits, e.total_timeouts, e.time_waited from v$session_event e, v$session s
, v$bgprocess b where b.name='PMON' and s.paddr=b.paddr and e.sid=s.sid and e.event='process startup';


5. SR 3-2123025401
--=== ODM Solution / Action Plan ===
--Disabled NUMA for resolution

6. SR 7314313.994

Analysis:

Bug 6250251 and bug 4216668 are not applicable to this case.
Bug 7590297 is applicable to this case, as the call stack, error message are the same with this case. 
But this patch is suspended as requested info is not available.

SR 3-2296150050: same error message, same DB version, similar call stack; closed without solution.
SR 3-2123025401: same error message, same DB version, similar call stack. 
The issue happened twice in that SR and solved by disabling NUMA
SR 7314313.994: same error message, same DB version, similar call stack; closed without solution.

ERROR:
ORA-600 [17175] [a]


VERSIONS:
versions 9.2 to 10.1


DESCRIPTION:


This error occurs when we are cleaning up a shared pool latch (either after a process has died, 
or a process has signaled an error while holding a shared pool latch), 
and the index to the shared pool latch is invalid.


ARGUMENTS:
Arg [a] index of the latch recovery structure - usually 255

FUNCTIONALITY:
Generic Heap Manager


IMPACT:
INSTANCE HANG
PROCESS FAILURE
INSTANCE FAILURE

Below is the action plan given by Oracle GCS. GCS holds that the vast majority of ORA-00600 [17xxx] errors are memory related and are typically resolved by an instance restart, and suggests setting shared_servers=max_shared_servers and monitoring further:

From the uploaded files it looks like you were reported with ORA-00600 [17175] errors 
and crashed the instance.What is the current status after the restart of the database. 
Are you still reported with the same errors and crashing the instance ?
Mostly the ORA-00600 [17xxx] errors are memory releated and might have got resolved after the database restart.


Further looking at the uploaded trace file the failing functions and the error closely 
matches Bug 6958493 and is closed as duplicate of Base Bug 6962340 which is closed
as could not able to reproduce the error.


Also a similar issue is reported in Bug 3104250 which is fixed in 10g, but that doesn't mean
you cannot get this error for a new reason and that the same workaround would fix it.
We need to implement the workaround and set: shared_servers=max_shared_servers 
if the error reproduces again. If this is still repeated issue then we can file a new bug with development for the same.


ACTION PLAN
===========
1. Monitor the alertlog for the ORA-00600 [17175] errors for the next few days and if the database still crashes then please
set shared_servers=max_shared_servers and see if the problem resolves or not.
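
A sketch of that suggestion, assuming max_shared_servers on this system were, say, 20 (check both current values first; shared_servers itself is dynamically modifiable):

SQL> show parameter shared_server
SQL> alter system set shared_servers=20 scope=both;  -- illustrative: match max_shared_servers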
