A Close Look at RAC DRM

DRM (Dynamic Resource Mastering) is a dynamic resource management operation unique to Oracle RAC. We have all heard plenty of theory about DRM, but have you ever seen it with your own eyes? Today, let's take a look:

 

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL>
SQL> select count(*) from gv$instance;

COUNT(*)
----------
2

SQL> create table drm_mac(t1 int) tablespace users;

Table created.

SQL> insert into drm_mac values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> alter system flush buffer_cache;

System altered.

SQL>select * from drm_mac;

T1
----------
1

SQL> select object_id,data_object_id from dba_objects where object_name='DRM_MAC';

OBJECT_ID DATA_OBJECT_ID
---------- --------------
64046 64046

--FIRST FIND THE OBJECT IDS
With OBJ_IDS As
(Select DATA_Object_Id OBJECT_ID From Dba_Objects
Where Object_Name = 'DRM_MAC'), --Customize the object name here!
Addr As (Select /*+ materialize */
Le_Addr, class, state From X$Bh, OBJ_IDS
Where Object_Id = Obj),
--NOW GET THE RESOURCE NAME IN HEX
Hexnames As (Select
Rtrim(B.Kjblname, ' '||chr(0)) Hexname
From X$Le A, X$Kjbl B, Addr
Where A.Le_Kjbl=B.Kjbllockp
and class = 1 and state <> 3
And A.Le_Addr = Addr.Le_Addr)
-- NOW FIND THE NODE MASTERS
Select A.Master_Node Mast, Count(*)
From Gv$Dlm_Ress A, Hexnames H
Where
Rtrim(A.Resource_Name, ' '||chr(0)) = H.Hexname
Group by A.Master_Node;

MAST COUNT(*)
---------- ----------
0 5

==> This shows that the master of DRM_MAC is node 0, with 5 blocks in total.
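The query above matches X$KJBL lock names against GV$DLM_RESS resource names. For data blocks these are BL resources whose name encodes the block number and file number in hex, e.g. name[0x1fe6a.5] in the traces later in this post. A hypothetical Python helper showing the assumed layout (the exact padding of KJBLNAME can vary by version):

```python
def bl_resource_name(file_no: int, block_no: int) -> str:
    """Format a GCS buffer-lock (BL) resource name the way it is commonly
    rendered in 11g, e.g. file 5, block 0x1fe6a -> '[0x1fe6a][0x5],[BL]'.
    This layout is an assumption based on typical X$KJBL/GV$DLM_RESS output."""
    return "[0x%x][0x%x],[BL]" % (block_no, file_no)

print(bl_resource_name(5, 0x1fe6a))  # [0x1fe6a][0x5],[BL]
```

This is why both sides of the join in the query are RTRIMmed of trailing spaces and NUL characters before being compared.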

Log on to node 2.

Enable a DRM trace on node 2's LMS process:
oradebug setospid 29295;
oradebug event trace[RAC_DRM] disk highest;

Then use lkdebug -m pkey data_object_id to manually trigger a remaster of DRM_MAC.

Note that even a manually triggered DRM merely places the object on the DRM request queue.

*** 2012-11-27 09:27:20.728
Processing Oradebug command 'lkdebug -m pkey 64045'
*****************************************************************
GLOBAL ENQUEUE SERVICE DEBUG INFORMATION
----------------------------------------
Queued DRM request for pkey 64045.0, tobe master 1, flags x1
***************** End of lkdebug output *************************
Processing Oradebug command 'lkdebug -m pkey 64045'
*****************************************************************
GLOBAL ENQUEUE SERVICE DEBUG INFORMATION
----------------------------------------
kjdrrmq: pkey 64045.0 is already on the affinity request queue, old tobe == new tobe 1
***************** End of lkdebug output *************************
SQL> oradebug setmypid
Statement processed.
SQL> oradebug lkdebug -m pkey 64046;

--FIRST FIND THE OBJECT IDS
With OBJ_IDS As
(Select DATA_Object_Id OBJECT_ID From Dba_Objects
Where Object_Name = 'DRM_MAC'), --Customize
Addr As (Select /*+ materialize */
Le_Addr, class, state From X$Bh, OBJ_IDS
Where Object_Id = Obj),
--NOW GET THE RESOURCE NAME IN HEX
Hexnames As (Select
Rtrim(B.Kjblname, ' '||chr(0)) Hexname
From X$Le A, X$Kjbl B, Addr
Where A.Le_Kjbl=B.Kjbllockp
and class = 1 and state <> 3
And A.Le_Addr = Addr.Le_Addr)
-- NOW FIND THE NODE MASTERS
Select A.Master_Node Mast, Count(*)
From Gv$Dlm_Ress A, Hexnames H
Where
Rtrim(A.Resource_Name, ' '||chr(0)) = H.Hexname
Group by A.Master_Node;

MAST COUNT(*)
---------- ----------
1 5

Using oradebug lkdebug -m pkey, DRM_MAC has been remastered to node 2 (master_node 1):
SQL> select * from v$gcspfmaster_info where DATA_OBJECT_ID=64046;
FILE_ID DATA_OBJECT_ID GC_MASTERIN CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- -------------- ----------- -------------- --------------- ------------
0 64046 Affinity 1 0 1

SQL> select * from gv$policy_history
where DATA_OBJECT_ID=64046
order by EVENT_DATE;

no rows selected
SQL> select * from x$object_policy_statistics;

no rows selected

*** 2013-11-15 04:08:35.191
2013-11-15 04:08:35.191065 : kjbldrmnextpkey: AFFINITY pkey 64046.0
pkey 64046.0 undo 0 stat 5 masters[1, 1->2] reminc 13 RM# 18
flg x7 type x0 afftime x8b0628ee, acquire time 0
nreplays by lms 0 = 0
benefit 0, total 0, remote 0, cr benefit 0, cr total 0, cr remote 0
kjblpkeydrmqscchk: Begin for pkey 64046/0 bkt[0->8191] read-mostly 0
kjblpkeydrmqscchk: End for pkey 64046.0
2013-11-15 04:08:35.191312 : DRM(20) quiesced basts [0-8191]
2013-11-15 04:08:35.197815 :
* lms acks DRMFRZ (8191)
* lms 0 finished parallel drm freeze in DRM(20) window 1, pcount 47
DRM(20) win(1) lms 0 finished drm freeze
2013-11-15 04:08:35.200436 :
* kjbrrcres: In this window: 278 GCS resources, 0 remastered

* In all windows: 278 GCS resources, 0 remastered
2013-11-15 04:08:35.201606 : kjbldrmnextpkey: AFFINITY pkey 64046.0
pkey 64046.0 undo 0 stat 5 masters[1, 1->2] reminc 13 RM# 18
flg x7 type x0 afftime x8b0628ee, acquire time 0
nreplays by lms 0 = 0
benefit 0, total 0, remote 0, cr benefit 0, cr total 0, cr remote 0
kjbldrmrpst: no more locks for pkey 64046.0, #lk 0, #rm 0, #replay 0, #dup 0 objscan 4
DRM(20) Completed replay (highb 8191)
DRM(20) win(1) lms 0 finished replaying gcs resources
kjbmprlst: DRMFRZ replay msg from 1 (clockp 0x92f96c18, 3)
0x1fe6a.5, 64046.0, x0, lockseq 0x13c5, reqid 0x3
state 0x100, 1 -> 0, role 0x0, wrt 0x0.0
GCS RESOURCE 0xb28963a0 hashq [0xbb4cdd38,0xbb4cdd38] name[0x1fe6a.5] pkey 64046.0
grant 0xb19a8370 cvt (nil) send (nil)@0,0 write (nil),0@65536
flag 0x0 mdrole 0x1 mode 1 scan 0.0 role LOCAL
disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
xid 0x0000.000.00000000 sid 0 pkwait 0s rmacks 0
refpcnt 0 weak: 0x0000.00000000
pkey 64046.0 undo 0 stat 5 masters[1, 1->2] reminc 13 RM# 18
flg x7 type x0 afftime x8b0628ee, acquire time 0
nreplays by lms 0 = 0
benefit 0, total 0, remote 0, cr benefit 0, cr total 0, cr remote 0
hv 3 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 3, dom 0]
kjga st 0x4, step 0.36.0, cinc 15, rmno 19, flags 0x20
lb 0, hb 8191, myb 8141, drmb 8141, apifrz 1
GCS SHADOW 0xb19a8370,4 resp[0xb28963a0,0x1fe6a.5] pkey 64046.0
grant 1 cvt 0 mdrole 0x1 st 0x100 lst 0x40 GRANTQ rl LOCAL
master 2 owner 1 sid 0 remote[0x92f96c18,3] hist 0x10c1430c0218619
history 0x19.0xc.0x6.0x1.0xc.0x6.0x5.0x6.0x1.0x0.
cflag 0x0 sender 0 flags 0x0 replay# 0 abast (nil).x0.1 dbmap (nil)
disk: 0x0000.00000000 write request: 0x0000.00000000
pi scn: 0x0000.00000000 sq[0xb28963d0,0xb28963d0]
msgseq 0x1 updseq 0x0 reqids[3,0,0] infop (nil) lockseq x13c5
GCS SHADOW END
GCS RESOURCE END
2013-11-15 04:08:35.204351 : 0 write requests issued in 72 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.204493 : 0 write requests issued in 99 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.204648 : 0 write requests issued in 54 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.204790 : 0 write requests issued in 35 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.204948 : 0 write requests issued in 18 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.205048 : 0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.205327 : 0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs
2013-11-15 04:08:35.205590 : 0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs

* kjbrrcfwst: resp x0xb28963a0, id1 x1fe6a, drm
2013-11-15 04:08:35.205799 : 0 write requests issued in 1 GCS resources
0 PIs marked suspect, 0 flush PI msgs
DRM(20) win(1) lms 0 finished fixing gcs write protocol
2013-11-15 04:08:35.207051 : kjbldrmnextpkey: AFFINITY pkey 64046.0
pkey 64046.0 undo 0 stat 5 masters[1, 1->2] reminc 13 RM# 18
flg x7 type x0 afftime x8b0628ee, acquire time 0
nreplays by lms 0 = 0
benefit 0, total 0, remote 0, cr benefit 0, cr total 0, cr remote 0
kjblpkeydrmqscchk: Begin for pkey 64046/0 bkt[8192->16383] read-mostly 0
kjblpkeydrmqscchk: End for pkey 64046.0
2013-11-15 04:08:35.207595 : DRM(20) quiesced basts [8192-16383]
2013-11-15 04:08:35.209146 :
* lms acks DRMFRZ (16383)
* lms 0 finished parallel drm freeze in DRM(20) window 2, pcount 47
DRM(20) win(2) lms 0 finished drm freeze
2013-11-15 04:08:35.211327 :
* kjbrrcres: In this window: 256 GCS resources, 0 remastered

* In all windows: 534 GCS resources, 0 remastered
2013-11-15 04:08:35.214797 : kjbldrmnextpkey: AFFINITY pkey 64046.0
pkey 64046.0 undo 0 stat 5 masters[1, 1->2] reminc 13 RM# 18
flg x7 type x0 afftime x8b0628ee, acquire time 0
nreplays by lms 0 = 0
benefit 0, total 0, remote 0, cr benefit 0, cr total 0, cr remote 0
kjbldrmrpst: no more locks for pkey 64046.0, #lk 0, #rm 0, #replay 0, #dup 0 objscan 4
DRM(20) Completed replay (highb 16383)
DRM(20) win(2) lms 0 finished replaying gcs resources
kjbmprlst: DRMFRZ replay msg from 1 (clockp 0x92f96d28, 3)
0x1fe6f.5, 64046.0, x0, lockseq 0x13ca, reqid 0x3
state 0x100, 1 -> 0, role 0x0, wrt 0x0.0
GCS RESOURCE 0xb2896520 hashq [0xbb4edd38,0xbb4edd38] name[0x1fe6f.5] pkey 64046.0
grant 0xb19a3750 cvt (nil) send (nil)@0,0 write (nil),0@65536
flag 0x0 mdrole 0x1 mode 1 scan 0.0 role LOCAL
disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
xid 0x0000.000.00000000 sid 0 pkwait 0s rmacks 0
refpcnt 0 weak: 0x0000.00000000
pkey 64046.0 undo 0 stat 5 masters[1, 1->2] reminc 13 RM# 18
flg x7 type x0 afftime x8b0628ee, acquire time 0
nreplays by lms 0 = 0
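Trace files like the one above can get long. A small, hypothetical helper to summarize which phases each DRM window completed, fed with a few lines from this trace:

```python
import re

# Matches lines such as:
#   DRM(20) win(1) lms 0 finished drm freeze
WIN_RE = re.compile(r"DRM\((\d+)\) win\((\d+)\) lms (\d+) finished (.+)")

def summarize_drm_windows(lines):
    """Group 'finished <phase>' events by (drm_id, window)."""
    summary = {}
    for line in lines:
        m = WIN_RE.search(line)
        if m:
            drm, win, lms, phase = m.groups()
            summary.setdefault((int(drm), int(win)), []).append(phase.strip())
    return summary

trace = [
    "DRM(20) win(1) lms 0 finished drm freeze",
    "DRM(20) win(1) lms 0 finished replaying gcs resources",
    "DRM(20) win(2) lms 0 finished drm freeze",
]
print(summarize_drm_windows(trace))
```

The phase sequence per window (freeze, replay, fixwrite) mirrors the timing buckets reported later by X$KJDRMAFNSTATS.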

SQL> alter system set "_serial_direct_read"=never;

System altered.

SQL> select count(*) from large_drms;

COUNT(*)
----------
999999

SQL> select blocks from dba_tables where table_name='LARGE_DRMS';

BLOCKS
----------
13233

SQL> select object_id,data_object_id from dba_objects where object_name='LARGE_DRMS';

OBJECT_ID DATA_OBJECT_ID
---------- --------------
64474 64474

With OBJ_IDS As
(Select DATA_Object_Id OBJECT_ID From Dba_Objects
Where Object_Name = 'LARGE_DRMS'), --Customize
Addr As (Select /*+ materialize */
Le_Addr, class, state From X$Bh, OBJ_IDS
Where Object_Id = Obj),
--NOW GET THE RESOURCE NAME IN HEX
Hexnames As (Select
Rtrim(B.Kjblname, ' '||chr(0)) Hexname
From X$Le A, X$Kjbl B, Addr
Where A.Le_Kjbl=B.Kjbllockp
and class = 1 and state <> 3
And A.Le_Addr = Addr.Le_Addr)
-- NOW FIND THE NODE MASTERS
Select A.Master_Node Mast, Count(*)
From Gv$Dlm_Ress A, Hexnames H
Where
Rtrim(A.Resource_Name, ' '||chr(0)) = H.Hexname
Group by A.Master_Node;

MAST COUNT(*)
---------- ----------
1 6605
0 6473

oradebug setmypid
oradebug lkdebug -m pkey 64474;

Re-running the master-node query above now shows that all of the table's cached blocks are mastered on node 1:

MAST COUNT(*)
---------- ----------
1 13078

 

For a DRM of a 103 MB table, AWR can show the interconnect bandwidth and other resources consumed by the DRM operation (AWR screenshots omitted here).
Of these, the Interconnect Device Statistics section of the RAC-wide AWR report (awrgrpt) shows the DRM network traffic correctly, while the Interconnect Device Statistics in a regular awrrpt report does not seem to truly reflect the traffic.

Script: Collecting RAC DRM diagnostic information

The following script can be used to collect diagnostic information about the RAC DRM (Dynamic Resource Mastering) feature in 11gR2:

 

REM for 11.2 only
REM written by Maclean.Liu

set linesize 140 pagesize 1400

select DRMS, AVG_DRM_TIME, QUIESCE_T,FRZ_T,CLEANUP_T,REPLAY_T,FIXWRITE_T,SYNC_T from X$KJDRMAFNSTATS
/

select * from GV$DYNAMIC_REMASTER_STATS
/

select object, node, sopens, xopens, xfers
from x$object_policy_statistics
-- where object=&object_id
/

select data_object_id, current_master, previous_master, remaster_cnt from gv$gcspfmaster_info
/

select * from gv$policy_history
-- where object=&object_id
order by EVENT_DATE
/

select name, value
from v$sysstat
where name in ('gc local grants',
'gc remote grants')
/

Analysis of an ORA-600 [kjbmprlst:shadow] Incident on Exadata

 

This is a four-node 11.2.0.1 RAC system on Exadata. Since the beginning of this year it has frequently suffered instance crashes caused by the LMS background process hitting internal error ORA-600 [kjbmprlst:shadow]. Although the four-node RAC preserved overall availability, the frequent unexpected instance terminations still caused application transactions to fail.

I had actually analyzed this problem back in July; see my earlier post <Oracle RAC internal error: ORA-00600[kjbmprlst:shadow], a case>. The kjbmprlst:shadow internal function manages kjbm shadow lock information (/libserver10.a/kjbm.o), and there is a code defect in which a closed lock is not messaged to the master node in time. At the time I had already formed the view that disabling DRM might avoid this ORA-600 [kjbmprlst:shadow], but none of the existing ORA-600 [kjbmprlst:shadow] cases on MOS mentioned disabling DRM as a workaround, and a Service Request did not get a definitive confirmation of that view from Oracle GCS. Moreover, running a core production environment on 11.2.0.1, a release with quite a few bugs, was clearly inappropriate anyway, and upgrading to an 11.2.0.2 bundle patch that already contained the fix for Bug 10121589 seemed a decent plan, so I did not dig any further.

After that, preparations were under way to upgrade the Exadata system to 11.2.0.2. In the end, however, the customer concluded that the Exadata upgrade procedure was too complex, that many of its steps were irreversible, and that there were still few domestic upgrade cases, and rejected the upgrade plan. Hence the failure analysis below.

 

Failure characteristics

 

As mentioned above, this is a four-node 11.2.0.1 RAC on Exadata, running Oracle's own Billing and Revenue Management (BRM) application. Since the beginning of the year it has repeatedly hit internal error ORA-600 [kjbmprlst:shadow]. The error is raised by the critical RAC background process LMS (global cache service process), and the result is an unexpected instance termination.

 

The ORA-600 [kjbmprlst:shadow] occurred frequently in October (almost twice a week), and the failures showed two further characteristics:

  1. They were concentrated in the 20 minutes after 03:00 in the early morning
  2. Nodes 1, 2 and 4 all failed at one time or another, yet node 3 never did (more on this below)

 

Initial analysis

 

Through the application team I obtained hourly AWR reports for the day of one of the failures. That failure occurred at 03:05 on node 1, so I started by analyzing the performance workload of instance 1 at that time.

One way to analyze AWR reports is to compare reports from several periods to see how performance changed, provided you know when the problem occurred. Here we compare two periods:

1. The minutes just before the problem: since the post-03:00 AWR report for instance 1 is missing, the 02:00 ~ 03:00 report is used instead
2. The hour before the problem: the 01:00 ~ 02:00 report

 

First, look at DB Time:

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14225 12-Oct-11 02:00:05 284 152.2
End Snap: 14226 12-Oct-11 02:59:59 288 150.1
Elapsed: 59.91 (mins)
DB Time: 809.98 (mins)

 

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14224 12-Oct-11 01:00:10 284 152.3
End Snap: 14225 12-Oct-11 02:00:05 284 152.2
Elapsed: 59.91 (mins)
DB Time: 257.62 (mins)

 

DB Time in an AWR report reflects, to a degree, how busy the instance is. The AAS (Average Active Sessions) for 01:00 ~ 02:00 is 257/60 ≈ 4.3, while the AAS for the problem window is 809/60 ≈ 13.5, so the instance workload rose markedly.
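The AAS figures above are simply DB Time divided by Elapsed time. A quick Python check of the two snapshots:

```python
def aas(db_time_min: float, elapsed_min: float) -> float:
    """Average Active Sessions = DB Time / Elapsed (same units)."""
    return db_time_min / elapsed_min

# The two AWR snapshot pairs shown above:
print(round(aas(257.62, 59.91), 1))  # 4.3  (01:00-02:00)
print(round(aas(809.98, 59.91), 1))  # 13.5 (problem window)
```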

 

Next, the top wait events:

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: TX – row lock contention 10,072 16,079 1596 33.08 Application
log file sync 166,546 10,579 64 21.77 Commit
gc buffer busy acquire 55,828 8,017 144 16.50 Cluster
gc buffer busy release 21,203 5,904 278 12.15 Cluster
gc current block busy 29,196 3,362 115 6.92 Cluster

 

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
log file sync 258,415 4,875 19 31.54 Commit
gc buffer busy acquire 44,471 3,349 75 21.66 Cluster
DB CPU 1,880 12.16
gc current block busy 42,722 1,748 41 11.31 Cluster
gc buffer busy release 10,730 1,127 105 7.29 Cluster

 

The top wait event in the problem window was enq: TX – row lock contention, showing heavy row-lock contention. In addition, the gc buffer busy acquire/release and gc current block busy global cache waits lasted several times longer than in the preceding 01:00 – 02:00 period, showing that inter-node contention for global cache resources intensified further during the problem window. In 01:00 – 02:00, enq: TX – row lock contention was not among the top 5 wait events, yet the transaction rate was actually higher, at 71.8 transactions per second (versus 46.3 in the problem window). This indicates that the heavy row-lock contention was caused by application behavior, not by an increase in transaction volume.
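The "Avg wait (ms)" column in these tables can be reproduced from the Waits and Time(s) columns. A minimal Python check against two rows from the problem-window report:

```python
def avg_wait_ms(waits: int, time_s: float) -> float:
    """Average wait in milliseconds = total wait seconds / waits * 1000."""
    return time_s / waits * 1000.0

# enq: TX - row lock contention and log file sync from the problem window:
print(round(avg_wait_ms(10072, 16079)))   # 1596 ms
print(round(avg_wait_ms(166546, 10579)))  # 64 ms
```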

 

The SQL statement chiefly responsible for both the frequent enq: TX – row lock contention waits and the global cache contention was 1usg75g3cx4n6:

 

Cluster Wait Time (s) Executions %Total Elapsed Time(s) %Clu %CPU %IO SQL Id SQL Module SQL Text
4,047.30 20,470 22.75 17,790.11 21.01 0.04 0.00 1usg75g3cx4n6 update

 

SQL "1usg75g3cx4n6" is an UPDATE statement; its text is omitted here.

 

In the problem window, this statement was executed 20,470 times on instance 1, at an average of 0.87 s per execution, with cluster wait time accounting for 22.75% of its total elapsed time. In the 01:00 – 02:00 period, the statement was executed 32,845 times, at an average of 0.02 s per execution, with cluster wait time accounting for 67.44% of its total elapsed time.

The above shows that in the problem window, SQL statement 1usg75g3cx4n6 placed an enormous load on instance 1 through the interaction of enq: TX – row lock contention (the primary factor) and gc global cache contention (the secondary factor). The row-lock contention itself was mainly caused by concurrent DML updating the same rows.

 

Deeper analysis

Besides the heavier database workload and the massive row-lock and global cache contention, the problem window also saw large-scale Dynamic Resource Management (DRM) operations.

DRM was introduced to minimize CPU and messaging overhead in RAC. Its mastering decisions are based on the global cache operation history of the objects involved: the feature periodically identifies cluster instances that access a given cached object frequently, and re-masters that object to such an instance to optimize RAC performance. DRM is carried out cooperatively by the LMON, LMD and LMS processes, with LMS transferring the actual gcs resource and lock information. In 11g it is governed by the hidden parameter _gc_policy_time and by default can be triggered at most every 10 minutes. Setting _gc_policy_time=0 disables both object affinity and the read-mostly policies introduced in 11g.
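The affinity idea described above can be illustrated with a toy model. This is only a sketch of the concept: the actual decision logic inside LMON/LMD/LMS and the 80% threshold used here are assumptions, not Oracle's algorithm.

```python
from collections import Counter

def choose_master(access_log, current_master, threshold=0.8):
    """Toy illustration of DRM affinity: if one node accounts for the
    overwhelming share of recent accesses to an object, remaster the
    object there; otherwise keep the current master.
    (Hypothetical model; Oracle's real policy is undocumented.)"""
    counts = Counter(access_log)
    node, hits = counts.most_common(1)[0]
    if node != current_master and hits / len(access_log) >= threshold:
        return node  # remaster to the dominant accessor
    return current_master

# Node 3 suddenly dominating accesses (e.g. a stats-gathering job):
log = [3] * 90 + [1] * 10
print(choose_master(log, current_master=1))  # 3
```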

 

DRM statistics for instance 1 during the problem window, from the AWR report:

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 2 1.00
remastered objects 2 1.00
replayed locks received 215,517 107,758.50
replayed locks sent 339,349 169,674.50
resources cleaned 0 0.00
remaster time (s) 7.4 3.70
quiesce time (s) 4.3 2.13
freeze time (s) 0.1 0.04
cleanup time (s) 0.4 0.21
replay time (s) 1.0 0.48
fixwrite time (s) 0.9 0.47
sync time (s) 0.8 0.38
affinity objects 525 525

 

Below are the DRM statistics for the 01:00 – 02:00 period:

Name Total per Remaster Op Begin Snap End Snap
remaster ops 5 1.00
remastered objects 8 1.60
replayed locks received 110,902 22,180.40
replayed locks sent 68,890 13,778.00
resources cleaned 0 0.00
remaster time (s) 12.9 2.57
quiesce time (s) 1.4 0.28
freeze time (s) 0.2 0.03
cleanup time (s) 0.9 0.17
replay time (s) 0.5 0.10
fixwrite time (s) 1.9 0.38
sync time (s) 8.0 1.60
affinity objects 526 525

 

We can see that in the problem window DRM sent far more replayed locks. "Replay" here means transferring the local lock information to the new master instance (REPLAY – Transfer of the local lock information to the new master), and the process responsible for sending these replayed locks is precisely LMS.

As noted above, nodes 1, 2 and 4 all hit the ORA-600 [kjbmprlst:shadow], but node 3 never did. Opening node 3's AWR report for the problem window (03:00-04:00) yields a major finding:

 

Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
3,590.52 0 17.30 34.71 63.63 59v4zh1ac3v2a DBMS_SCHEDULER DECLARE job BINARY_INTEGER := …
3,589.83 0 17.29 51.69 37.98 b6usrg82hwsa3 DBMS_SCHEDULER call dbms_stats.gather_databas…
2,173.79 1 2,173.79 10.47 62.16 28.75 b20w3p880j2gs DBMS_SCHEDULER /* SQL Analyze(1) */ select /*…

 

Node 3 was running "call dbms_stats.gather_database_stats_job_proc()", the 11g automatic statistics gathering job, and the job ran for a long time (still unfinished after an hour). A default automatic statistics job was introduced in 10g and strengthened in 11g; it gathers statistics on objects whose data has changed by more than 10% since the last collection. Because this core production database is a large OLTP system, a great deal of data had already been modified before 03:00, so gather_database_stats_job_proc had to collect statistics on many large objects. That caused instance 3 to request, at high frequency, global cache resources previously updated on instances 1 and 2; the frequent access influenced DRM's decisions and caused a large number of objects to be re-mastered onto node 3.

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 14 1.00
remastered objects 25 1.79
replayed locks received 1,088,009 77,714.93
replayed locks sent 202,112 14,436.57
resources cleaned 0 0.00
remaster time (s) 30.8 2.20
quiesce time (s) 8.5 0.60
freeze time (s) 0.4 0.03
cleanup time (s) 3.3 0.24
replay time (s) 6.4 0.46
fixwrite time (s) 7.7 0.55
sync time (s) 4.5 0.32
affinity objects 4,294,967,289 4,294,967,289
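The "affinity objects" figure of 4,294,967,289 in this table is almost certainly not a real object count: it looks like a small negative counter printed as an unsigned 32-bit integer (i.e. the statistic underflowed; this interpretation is an assumption). A quick check:

```python
def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - 2**32 if value >= 2**31 else value

print(as_signed_32(4294967289))  # -7
```

So the counter most likely wrapped around from a value near zero, consistent with the plausible 525/526 figures on the other instances.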

 

We can see that instance 3 received a huge number of replayed locks (replayed locks received) while sending very few, and the process actually sending those lock resources to instance 3 was the LMS process on instance 1.

Going through alert.log and the views, I found that every run of gather_database_stats_job_proc took place on instance 3 at around 3 a.m. So whenever a failure occurred, instance 3 only had to receive the replayed locks, and its LMS processes sent only a few. This explains the odd phenomenon noted above: instance 3 never hit the ORA-600 [kjbmprlst:shadow].

 

Now let's read part of the trace file of the LMS process that raised the ORA-600 [kjbmprlst:shadow]:

*** 2011-10-12 02:25:22.770
DRM(22812) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(1) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:29.717
 lms 1 finished fixing gcs write protocol
DRM(22813) win(2) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:30.312
DRM(22813) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(4) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(5) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:31.280
 lms 1 finished fixing gcs write protocol
DRM(22813) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(7) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:32.269
DRM(22813) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(1) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(2) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:33.479
 lms 1 finished fixing gcs write protocol
DRM(22814) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(4) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:34.333
 lms 1 finished fixing gcs write protocol
DRM(22814) win(5) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:35.315
DRM(22814) win(7) lms 1 finished replaying gcs resources
Incident 1035732 created, dump file:
/maclean1_lms1_20075_i1035732.trc
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []

2011-10-12 03:05:36.811503 : kjbmprlst: internal error on replaying 0x618cd.371 from 3
2011-10-12 03:05:36.811541 : kjbmbassert [0x618cd.371]
2011-10-12 03:05:36.811559 : kjbmsassert(0x618cd.371)(2)
2011-10-12 03:05:36.811636 : kjbmsassert(0x618cd.371)(3)
2011-10-12 03:05:36.811667 : kjbmsassert(0x618cd.371)(4)
kjmpbmsg fatal error on 39
MSG [39:KJX_B_REPLAY] .Yq inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
    flow ctrl: ver=60 flag=169 len=136 tkts=0 seq=0 wrp=0
               sndr=3 dest=1 rcvr=2
   FUSION MSG 0x2b178baa5120,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x618cd.371, 574166.0] c[0x59f904e08,-17597] [0x1,xc69]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99097 drmbkt# 99097
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 9 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99097, drmb 99097, apifrz 1
   FUSION MSG DUMP END
MSG [65521:PBATCH] inc=64 len=8376 sender=3 seq=279807301 fg=q
    flow ctrl: ver=1 flag=37 len=8376 tkts=0 seq=279807301 wrp=1
               sndr=3 dest=1 rcvr=2
MSG [39:KJX_B_REPLAY] .Y'z inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
   FUSION MSG 0x2b178baa31c8,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x527f5.27a, 574166.0] c[0xa2f7eddc0,2158] [0x1,xf33]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99582 drmbkt# 99582
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 33 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99582, drmb 99582, apifrz 1
   FUSION MSG DUMP END

.........................

    ------------process 0xcf22a6d40--------------------
    proc version      : 0
    Local node        : 0
    pid               : 20075
    lkp_node          : 0
    svr_mode          : 0
    proc state        : KJP_NORMAL
    Last drm hb acked : 114687
    flags             : x80
    current lock op    : (nil)
    Total accesses    : 11763805
    Imm.  accesses    : 11763774
    Locks on ASTQ     : 0
    Locks Pending AST : 0
    Granted locks     : 0
    AST_Q: 
    PENDING_Q: 
    GRANTED_Q: 
    KJM HIST LMS1: 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 1:0 
      13:65521:0 20:39:0 20:39:0 20:39:0 
    START DEFER MSG QUEUE 0 ON LMS1 flg x87:

MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262147
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca303f2f0 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=109225856
     fg=s stat=KJUSERSTAT_ATTACHFAILED spnum=11 flg=x0
MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262146
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca405e758 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=6400108
     fg=s stat= spnum=11 flg=x20
    END DEFER MSG QUEUE 0 DUMP ON LMS1
     SEQUENCES: 
    [sidx 0]  0:0.0  1:711171046.0  2:279807300.1  3:43212975.0
     DEFER MSG QUEUE 1 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 1]  0:0.0  1:711171047.0  2:279807292.1  3:43212974.0
     DEFER MSG QUEUE 2 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 2]  0:0.0  1:711171047.0  2:279807300.1  3:43212975.0


We can see that before the ORA-600 [kjbmprlst:shadow] was actually raised, the LMS background process was indeed performing frequent DRM operations.

It is not hard to see what triggered the ORA-600 [kjbmprlst:shadow]: a high database workload, with the heavy row-lock contention caused by the SQL statement on instance 1 coinciding with the gather_database_stats_job_proc automatic statistics job on node 3, drove the LMS process on instance 1 to perform DRM operations at high frequency, until, under the heavy workload, the bug in the code managing kjbm shadow locks was triggered.

Further analysis showed that the ORA-600 [kjbmprlst:shadow] failures on instances 2 and 4 resembled the scenario on node 1. Node 3 was the only exception in the whole RAC: because it always ran gather_database_stats_job_proc, it was always the beneficiary of DRM while the other three nodes were the victims.

At this point you may ask: isn't this then simply a failure caused by the automatic statistics job? Why analyze so much? Just turn the job off, or move it to another time slot.

Answer: gather_database_stats_job_proc alone would not have been enough to trigger this failure. On the one hand, all nodes run the same application, with no partitioning by range or by schema, which aggravates inter-node contention for global cache resources. On the other hand, each buffer cache is large, holds a vast number of global cache resources, and the data is updated very frequently, so once the gather_database_stats_job_proc job started, a flood of global resources moved toward instance 3 and triggered DRM decisions, leaving the LMS processes on nodes 1, 2 and 4 under extreme short-term pressure until the bug fired. Merely disabling the gather_database_stats_job_proc job, or changing when it runs, is not enough to fully avoid the ORA-600 [kjbmprlst:shadow], because the bug can still fire whenever the LMS background processes come under a high enough workload.

 

Despite this long analysis, my view on how to resolve the ORA-600 [kjbmprlst:shadow] had not changed at all; if anything it had become firmer. Disabling the DRM feature fundamentally changes the behavior of the LMS processes and removes any possibility of this ORA-600 [kjbmprlst:shadow] occurring.

But I still needed some corroborating evidence!

 

Supporting material

 

Going through MOS, one Note caught my eye: <Bug 12834027 – ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking [ID 12834027.8]>. I had already read this Note during my earlier analysis; at that time it stated that the problem occurs on 11.2.0.1, offered no workaround, and gave patching to 11.2.0.2 + BP as the only fix.

But I noticed that the Note had recently been revised, on 21-OCT-2011, as follows:

Bug 12834027  ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking
 This note gives a brief overview of bug 12834027.
 The content was last updated on: 21-OCT-2011
 Click here for details of each of the sections below.
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.2 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release) 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow]
    ORA-600 [kjbrasr:pkey] 

    RAC (Real Application Clusters) / OPS
    _gc_read_mostly_locking 

Description

    ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] can occur with RAC
    read mostly locking.

    Workaround
      Disable DRM
     or
      Disable read-mostly object locking
      eg: Run with "_gc_read_mostly_locking"=FALSE

 

 

I came back to this Note on October 28, and it had been updated on October 21; the update could not have been more timely. It now offers two workarounds: disabling DRM entirely, or disabling the _gc_read_mostly_locking feature mentioned earlier. The related bug note describes another Exadata case:

 

Hdr: 12834027 N/A RDBMS 11.2.0.2.0 RAC PRODID-5 PORTID-226 12730844
Abstract: ORA-600 [KJBMPRLST:SHADOW] & [KJBRASR:PKEY] IN A READ MOSTLY & SKIP LOCK ENV

*** 08/04/11 04:22 am ***

PROBLEM:
--------
3 nodes got crashed and restarted in exadata cluster.

DIAGNOSTIC ANALYSIS:
--------------------
8 node Exadata RAC instance. Below instances got terminated and it rejoined
cluster around 11:46 Today (01-Aug).

Instance2 (gmfcdwp2) got ORA-600 [kjbmprlst:shadow] at 11:46 AM and
instance got terminated by LMS1 process.
LMS1 (ospid: 14922): terminating the instance due to error 484

instance got terminated by LMS1 process
LMS1 (ospid: 28176): terminating the instance due to error 484

instance got terminated by LMS1 process

which was seen in instance 4 and 8.

 

That case is strikingly similar to ours: on Exadata, specific instance numbers repeatedly hit ORA-600 [kjbmprlst:shadow].

The other thing to evaluate is the overall performance impact of disabling DRM. I had already covered this in <Important parameters For Oracle BRM Application in 11gR2>: Oracle officially recommends disabling the DRM feature for the BRM application, yet this system was running with the default instance parameter settings that Exadata ships with, so...

 

Solutions

1. Disable the DRM feature, which completely prevents the LMS processes from hitting the bug through DRM operations. Concretely, set the hidden parameter _gc_policy_time to 0.

Impact assessment:

The DRM feature can improve database performance to some degree, but for an application that is not partitioned by range or by schema, DRM brings no benefit. Oracle officially recommends disabling the DRM feature for the BRM application; see the MOS note <Questions About BRM Support For Oracle DB 11GR2 And Exadata [ID 1319678.1]>:

"_gc_policy_time=0 to disable DRM." Changing the _gc_policy_time parameter requires an instance restart.

 

2. Partially disable the DRM feature by setting _gc_read_mostly_locking=FALSE. This avoids the mass transfer of replayed locks and greatly reduces the chance of LMS hitting the bug. Changing this parameter likewise requires an instance restart.

 

Case summary

Having written for this long I am a little tired, and perhaps rambling. I may write another post on the Read-mostly locking DRM feature new in 11g; I meant to cover it here, but this post has grown long enough.

Sometimes solving a problem only takes one suitable fix, which may come from MOS, Google or a friend. We may not know why that fix actually works, yet one way or another it does. But that is not the same as truly understanding the failure. For example, I spent roughly 3-4 hours analyzing this problem, and although the conclusion was almost the same as in July, the analysis process itself taught me a great deal, something only doing the analysis yourself can produce.

Enjoy the fun that the analysis process brings!

An Introduction to RAC Dynamic Resource Mastering (DRM)

The following text is excerpted from MetaLink Doc 390483.1:

Subject:  DRM – Dynamic Resource management
Doc ID:  390483.1  Type:  BULLETIN
Modified Date :  13-JAN-2009  Status:  PUBLISHED

In this Document
Purpose
Scope and Application
DRM – Dynamic Resource management
DRM – Dynamic Resource Mastering
References

Applies to:
Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 11.1.0
Oracle Server – Standard Edition – Version: 10.1.0.2 to 11.1.0
Information in this document applies to any platform.
Oracle Real application Clusters
Purpose

To describe the concept of DRM (Dynamic Resource Mastering)
Scope and Application

This note is intended for experienced Real Application Clusters DBAs
DRM – Dynamic Resource management
DRM – Dynamic Resource Mastering

When using Real Application Clusters (RAC), each instance has its own SGA and buffer cache. RAC will ensure that these block changes are co-ordinated to maximize performance and to ensure data integrity. Each copy of the buffer, also called a cache resource, has a master, which is one of the nodes of the cluster.

In database releases before 10g (10.1.0.2) once a cache resource is mastered on an instance, a re-mastering or a change in the master would take place only during a reconfiguration that would happen automatically during both normal operations like instance startup or instance shutdown or abnormal events like Node eviction by Cluster Manager. So if Node B is the master of a cache resource, this resource will remain mastered on Node B until reconfiguration.

10g  introduces a concept of resource remastering via DRM. With DRM a resource can be re-mastered on another node say from Node B to Node A if it is found that the cache resource is accessed more frequently from Node A. A reconfiguration is no longer the only reason for a resource to be re-mastered.

In 10gR1 DRM is driven by affinity of files and in 10gR2 it is based on objects.

Sample LMD trace file during a DRM operation

Begin DRM(202) - transfer pkey 4294951314 to 0 oscan 1.1
*** 2006-08-01 17:34:54.645
Begin DRM(202) - transfer pkey 4294951315 to 0 oscan 1.1
*** 2006-08-01 17:34:54.646
Begin DRM(202) - transfer pkey 4294951316 to 0 oscan 1.1
*** 2006-08-01 17:34:54.646
Begin DRM(202) - transfer pkey 4294951317 to 0 oscan 1.1

DRM attributes are intentionally undocumented since they may change depending on the version. These attributes should not be changed without discussing with Support.
@DRM is driven by the following
@ 1.) _gc_affinity_time = Time in minutes at which statistics will be evaluated (default = 10 mins)
@ 2.) _gc_affinity_limit = # of times a node accesses a file/object (default = 50)
@ 3.) _gc_affinity_minimum = minimum # of times per minute a file/object is accessed before affinity kicks in
@ (default = 600 per minute per cpu )

It is important to note that

  1. Two instances will not start a DRM operation at the same time; however, the lmd, lms and lmon processes from all instances collectively take part in the DRM operation.
  2. Normal activity on the database is not affected due to DRM. This means users continue insert/update/delete operations without any interruptions. Also DRM operations complete very quickly.

@

Disable DRM
Generally DRM should not be disabled unless Oracle Support/Development has suggested turning it off due to some known issues.
@To disable DRM, set
@_gc_affinity_time=0                                 # Only if DB version is 10.1 or 10.2
@_gc_undo_affinity=FALSE                       # Only if Db version is 10.2
@_gc_policy_time=0                             # Only if DB version is 11.1 or higher
@_gc_affinity_time has been renamed to _gc_policy_time in 11g

沪公网安备 31010802001379号