The following text is excerpted from Metalink Doc 390483.1.
Subject: DRM – Dynamic Resource management
Doc ID: 390483.1 Type: BULLETIN
Modified Date : 13-JAN-2009 Status: PUBLISHED
In this Document
Purpose
Scope and Application
DRM – Dynamic Resource management
DRM – Dynamic Resource Mastering
References
Applies to:
Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 11.1.0
Oracle Server – Standard Edition – Version: 10.1.0.2 to 11.1.0
Information in this document applies to any platform.
Oracle Real Application Clusters
Purpose
To describe the concept of DRM (Dynamic Resource Mastering)
Scope and Application
This note is intended for experienced Real Application Clusters DBAs.
DRM – Dynamic Resource management
DRM – Dynamic Resource Mastering
When using Real Application Clusters (RAC), each instance has its own SGA and buffer cache. RAC ensures that block changes are coordinated to maximize performance and to ensure data integrity. Each copy of a buffer, also called a cache resource, has a master, which is one of the nodes of the cluster.
In database releases before 10g (10.1.0.2), once a cache resource was mastered on an instance, re-mastering (a change of master) would take place only during a reconfiguration, which happens automatically both during normal operations such as instance startup or shutdown and during abnormal events such as node eviction by the Cluster Manager. So if node B is the master of a cache resource, this resource will remain mastered on node B until a reconfiguration.
10g introduces the concept of resource remastering via DRM. With DRM, a resource can be re-mastered on another node, say from node B to node A, if the cache resource is found to be accessed more frequently from node A. A reconfiguration is no longer the only event that causes a resource to be re-mastered.
In 10gR1, DRM is driven by the affinity of files; in 10gR2, it is based on objects.
Sample LMD trace file during a DRM operation
Begin DRM(202) - transfer pkey 4294951314 to 0 oscan 1.1
*** 2006-08-01 17:34:54.645
Begin DRM(202) - transfer pkey 4294951315 to 0 oscan 1.1
*** 2006-08-01 17:34:54.646
Begin DRM(202) - transfer pkey 4294951316 to 0 oscan 1.1
*** 2006-08-01 17:34:54.646
Begin DRM(202) - transfer pkey 4294951317 to 0 oscan 1.1
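The "Begin DRM" lines above follow a regular format, so they can be pulled out of an LMD trace programmatically. A minimal sketch (the regex assumes exactly the line format shown in the sample; it is not an official trace-file specification):

```python
import re

# Matches lines like: "Begin DRM(202) - transfer pkey 4294951314 to 0 oscan 1.1"
# Captures the DRM operation number, the pkey being transferred, and the
# target node. The format is assumed from the sample trace above.
DRM_LINE = re.compile(r"Begin DRM\((\d+)\) - transfer pkey (\d+) to (\d+)")

def parse_drm_lines(trace_text):
    """Return a list of (drm_id, pkey, target_node) tuples found in the trace."""
    return [(int(d), int(p), int(n)) for d, p, n in DRM_LINE.findall(trace_text)]

sample = (
    "Begin DRM(202) - transfer pkey 4294951314 to 0 oscan 1.1\n"
    "Begin DRM(202) - transfer pkey 4294951315 to 0 oscan 1.1\n"
)
print(parse_drm_lines(sample))
```

Counting how many distinct DRM operation numbers appear over a time window gives a quick feel for how often remastering is firing on a given instance.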
DRM attributes are intentionally undocumented, since they may change from version to version. These attributes should not be changed without first discussing the change with Oracle Support.
DRM is driven by the following:
1.) _gc_affinity_time = time in minutes at which statistics are evaluated (default = 10 mins)
2.) _gc_affinity_limit = number of times a node must access a file/object (default = 50)
3.) _gc_affinity_minimum = minimum number of times per minute a file/object must be accessed before affinity kicks in (default = 600 per minute per CPU)
It is important to note that:
- Two instances will not start a DRM operation at the same time; however, the lmd, lms, and lmon processes from all instances collectively take part in the DRM operation.
- Normal activity on the database is not affected by DRM: users can continue insert/update/delete operations without interruption. DRM operations also complete very quickly.
Disable DRM
Generally DRM should not be disabled unless Oracle Support/Development has suggested turning it off due to some known issues.
To disable DRM, set:
_gc_affinity_time=0 # only if DB version is 10.1 or 10.2
_gc_undo_affinity=FALSE # only if DB version is 10.2
_gc_policy_time=0 # only if DB version is 11.1 or higher
@_gc_affinity_time has been renamed to _gc_policy_time in 11g
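As a syntax sketch only (set these only if Oracle Support has advised it; underscore parameters must be double-quoted, and a restart of all instances is required for the change to take effect):

```sql
-- 10.1 / 10.2:
ALTER SYSTEM SET "_gc_affinity_time"=0 SCOPE=SPFILE SID='*';
-- 10.2 only:
ALTER SYSTEM SET "_gc_undo_affinity"=FALSE SCOPE=SPFILE SID='*';
-- 11.1 and higher (renamed parameter):
-- ALTER SYSTEM SET "_gc_policy_time"=0 SCOPE=SPFILE SID='*';
```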
quote:
“10g Real Application Clusters introduced a concept of resource remastering via Dynamic Resource Mastering (DRM). With DRM a resource can be re-mastered on another node in the cluster if it is found that the cache resource is accessed more frequently from that node.
In 10gR1 this was file based, whereas in 10gR2 it is object based.
In 10gR1, due to a few bugs, many related to high CPU usage during the DRM freeze window, most customers disabled DRM by setting the following parameters:
_gc_affinity_time=0
_gc_undo_affinity=FALSE
_gc_affinity_time defines the frequency in minutes to check if remastering is needed.
_gc_affinity_limit defines the number of times a node must access an object for it to be a DRM candidate.
_gc_affinity_minimum defines the minimum number of times an object is accessed per minute before affinity kicks in.
The performance problems may manifest themselves in terms of a DRM related wait event like ‘gcs drm freeze in enter server mode’
In 10G R2 this feature appears to be more stable.
You can also manually remaster an object onto a node other than the one on which it is currently mastered, as shown below:
SQL> select object_id,current_master, previous_master ,remaster_cnt from V$GCSPFMASTER_INFO where object_id = 144615
OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- -------------- --------------- ------------
144615 0 2 0
The object 144615 is currently mastered on node 0.
To remaster the object onto node 2, connect to instance 2 as SYSDBA:
NODE2> oradebug setmypid
Statement processed.
NODE2> oradebug lkdebug -m pkey 144615
Statement processed.
NODE2> select object_id,current_master, previous_master ,remaster_cnt from V$GCSPFMASTER_INFO where object_id = 144615
OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- -------------- --------------- ------------
144615 2 0 0
Note: In V$GCSPFMASTER_INFO you will also see resources with object ids in the 4Gb range (e.g. 4294950913)
These are for undo segments.
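A tiny sketch of separating the two kinds of rows when post-processing query output. The cutoff value is an illustrative assumption chosen to isolate ids "in the 4Gb range" such as 4294950913; it is not a documented Oracle constant:

```python
# Illustrative heuristic only: the note observes that undo-segment resources
# show up in V$GCSPFMASTER_INFO with object ids near the 4Gb boundary
# (e.g. 4294950913), while ordinary objects have small ids (e.g. 144615).
# The cutoff is an assumption for the sketch, not a documented constant.
UNDO_ID_CUTOFF = 4_000_000_000

def resource_kind(object_id):
    """Classify a V$GCSPFMASTER_INFO object_id as 'undo' or 'object'."""
    return "undo" if object_id >= UNDO_ID_CUTOFF else "object"

print(resource_kind(4294950913))  # undo example from the note
print(resource_kind(144615))      # table example used earlier in the note
```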
To dissolve the remastering of this object on this instance:
SQL> oradebug lkdebug -m dpkey 144615
Statement processed.
SQL> select object_id,current_master, previous_master ,remaster_cnt from V$GCSPFMASTER_INFO where object_id = 144615;
no rows selected
The remaster_cnt appears to be 0 for all objects. I have had Oracle log bug 5649377 on this issue.
SQL> select distinct remaster_cnt from V$GCSPFMASTER_INFO ;
REMASTER_CNT
------------
0
DRM statistics are available in X$KJDRMAFNSTATS
SQL> select * from X$KJDRMAFNSTATS
2 /
ADDR INDX INST_ID DRMS AVG_DRM_TIME OBJECTS_PER_DRM QUISCE_T FRZ_T CLEANUP_T REPLAY_T FIXWRITE_T SYNC_T
-------- ---------- ---------- ---------- ------------ --------------- ---------- ---------- ---------- ---------- ---------- ----------
RES_CLEANED REPLAY_S REPLAY_R MY_OBJECTS
----------- ---------- ---------- ----------
200089CC 0 1 32 214 1 3 14 0 0 0 99
0 2441 6952 30
The column MY_OBJECTS denotes the number of objects mastered on that node.
This should match the following:
SQL> select count(*) from V$GCSPFMASTER_INFO where current_master=0
2 /
COUNT(*)
----------
30 “
Hdr: 5649377 10.2.0.1.0 RDBMS 10.2.0.1.0 DLM PRODID-5 PORTID-46
Abstract: REMASTER_CNT IN V$GCSPFMASTER_INFO IS NOT GETTING UPDATED
PROBLEM:
——–
When an object is remastered, the REMASTER_CNT column in V$GCSPFMASTER_INFO is not updated.
DIAGNOSTIC ANALYSIS:
——————–
SQL> create table t1 ( a number);
Table created.
SQL> begin
2 for i in 1..100
3 loop
4 insert into t1 values(i);
5 end loop;
6 commit;
7 end;
8 /
PL/SQL procedure successfully completed.
SQL> select count(*) from t1;
COUNT(*)
----------
100
SQL> select * from t1;
A
Run select * from t1 multiple times to get the object mastered on a node. In our case the object is currently mastered on the node elephant.
elephant> select * from V$GCSPFMASTER_INFO where object_id = 290525;
FILE_ID OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- ---------- -------------- --------------- ------------
0 290525 0 32767 0
Force a remastering onto 2nd node hippo
hippo>>oradebug lkdebug -m pkey 290525
Statement processed.
hippo>> select * from V$GCSPFMASTER_INFO where object_id = 290525;
FILE_ID OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- ---------- -------------- --------------- ------------
0 290525 2 0 0
The remaster_cnt has not increased.
Force a remastering onto 3rd node rhino
rhino> oradebug lkdebug -m pkey 290525
Statement processed.
rhino> select * from V$GCSPFMASTER_INFO where object_id = 290525;
FILE_ID OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
---------- ---------- -------------- --------------- ------------
0 290525 1 0 0
Again the remaster count has not increased.
WORKAROUND:
———–
NA
RELATED BUGS:
————-
NA
REPRODUCIBILITY:
—————-
Reproducible
TEST CASE:
———-
STACK TRACE:
————
SUPPORTING INFORMATION:
———————–
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————
REDISCOVERY INFORMATION:
Remaster count (REMASTER_CNT) in V$GCSPFMASTER_INFO is always 0, even after
remastering has occurred.
WORKAROUND:
None
RELEASE NOTES:
]]When dynamic remastering occurred, the remaster_cnt column in the dynamic
]]view v$gcspfmaster_info was not being updated. This has now been fixed.
lmon Process Terminated With Error Ora-481
Applies to:
Oracle Server – Enterprise Edition – Version: 10.1.0.4 to 10.2.0.2
This problem can occur on any platform.
RAC 10.1.0.3 (or 10.1.0.4)
Symptoms
Instance terminated due to ORA-481 in LMON.
Cause
This can be caused by internal bug 3659289, "LMON Took Offline During Remastering Sync".
The bug matches:
1. lmon produces the ORA-481
2. instance crashed
3. lmon failed to get an answer from the lmd during a dynamic reconfiguration, in the same step:
* kjfcdrmrfg: SYNC TIMEOUT while waiting for lmd in sync step 31
..
sync() timed out – lmon exiting
4. a quiesce step was executed before the error.
Solution
Before 10.2.0.3, DRM should not be used. The main reason is that several issues are fixed in 10.2.0.3 that cannot be backported to earlier versions.
To disable DRM the following parameters must be set on the instances:
– 10gR1
_gc_affinity_time=0
– 10gR2
_gc_undo_affinity=FALSE
_gc_affinity_time=0
Hdr: 5181297 10.2.0.1.0 RDBMS 10.2.0.1.0 RAC PRODID-5 PORTID-197 ORA-481
Abstract: LMON TERMINATES INSTANCE DUE TO SYNC TIMEOUT (STEP 31) – ORA-481
PROBLEM:
——–
o 4 node HP-UX Itanium RAC cluster running 10.2.0.1.0
o each node has 48 processors
o customer running migration testing, replacing PA RISC with Itanium
processors
o on Apr 17 at around 23:15 instance on node 1 terminated due to LMON dying
o LMON apparently terminated due to a timeout recorded in the LMON trace file
kjfcdrmrfg: SYNC TIMEOUT (48358, 47457, 900), step 31
o prior to the instance termination all instances were experiencing a hang
o after instance 1 died other instance “unfroze”
DIAGNOSTIC ANALYSIS:
——————–
There is no obvious reason for the instance termination. Resource starvation appears unlikely, as the CPU load during the time frame in question is rather low. The highest load on any processor is 20%.
WORKAROUND:
———–
none
RELATED BUGS:
————-
REPRODUCIBILITY:
—————-
intermittent
TEST CASE:
———-
n/a
STACK TRACE:
————
LMON trace shows the following call stack:
ksedst ksedmp $cold_ksddoa ksdpcg ksdpec ksfpec
kgesev ksesec0 kjfcdrmrfg kjfcln ksbrdp opirip
opidrv sou2o opimai_real main main_opd_entry
SUPPORTING INFORMATION:
———————–
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————
on ngmp1.
=========
From ngmp1_lmon_15085.trc
————————-
*** 23:15:44.052
kjfcdrmrfg: SYNC TIMEOUT (48358, 47457, 900), step 31
Submitting asynchronized dump request [28]
…
… <>
error 481 detected in background process
ORA-481: LMON process terminated with error
*** 23:16:37.822
ksuitm: waiting for [5] seconds before killing DIAG
From alert.log
————–
Mon Apr 17 23:16:06 2006
Errors in file /oracle/app/oracle/admin/NGMP/bdump/ngmp1_lmon_15085.trc:
ORA-481: LMON process terminated with error
…
… <>
Mon Apr 17 23:16:42 2006
Instance terminated by LMON, pid = 15085
*** 04/23/06 10:55 pm ***
on ngmp2
========
From alert.log
————–
Mon Apr 17 23:15:23 2006
GES: Potential blocker (pid=23648) on resource SS-0x3-0x2;
enqueue info in file /oracle/app/oracle/admin/NGMP/udump/ngmp2_ora_3359.trc
and DIAG trace file
Mon Apr 17 23:16:08 2006
Trace dumping is performing id=[cdmp_20060417231607]
Mon Apr 17 23:16:47 2006
Reconfiguration started (old inc 16, new inc 18)
List of nodes:
1 2 3
on ngmp3
========
From alert.log
————–
Mon Apr 17 23:16:08 2006
Trace dumping is performing id=[cdmp_20060417231607]
Mon Apr 17 23:16:47 2006
Reconfiguration started (old inc 16, new inc 18)
List of nodes:
1 2 3
on ngmp4
========
From alert.log
————–
Mon Apr 17 23:15:23 2006
GES: Potential blocker (pid=23803) on resource CI-0xe-0x1;
enqueue info in file
/oracle/app/oracle/admin/NGMP/bdump/ngmp4_lmd0_23655.trc and DIAG trace file
Mon Apr 17 23:16:08 2006
Trace dumping is performing id=[cdmp_20060417231607]
Mon Apr 17 23:16:47 2006
Reconfiguration started (old inc 16, new inc 18)
List of nodes:
1 2 3
Hdr: 5960580 10.2.0.2 RDBMS 10.2.0.2 RAC PRODID-5 PORTID-23 5181297
Abstract: LMON CRASHING WITH ORA-481 ERROR
PROBLEM:
——–
LMON crashed with ORA-481 on Mon Mar 26 18:53:47 2007. Ct said that a particular job which completed in 3.5 hours on Friday March 23 was taking 5 hours on Saturday March 24 and, after the crash (March 26), is taking 12 to 15 hours.
I am not sure whether the crash and the performance are related. The crash
did NOT happen when the job was running.
Let us consider this bug for only the LMON crash.
DIAGNOSTIC ANALYSIS:
——————–
alert_PFNR1011.log:
==========================
Mon Mar 26 18:53:36 2007
WARNING: inbound connection timed out (ORA-3136)
Mon Mar 26 18:53:36 2007
WARNING: inbound connection timed out (ORA-3136)
Mon Mar 26 18:53:36 2007
WARNING: inbound connection timed out (ORA-3136)
Mon Mar 26 18:53:36 2007
WARNING: inbound connection timed out (ORA-3136)
Mon Mar 26 18:53:36 2007
???????
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lmon_1349.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
LMON: terminating instance due to error 481
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lck0_1767.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_dbw2_1485.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms0_1354.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lmd0_1352.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_pmon_1341.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms2_1363.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms4_1410.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms6_1425.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
System state dump is made for local instance
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms1_1359.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms3_1397.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms5_1419.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms7_1432.trc:
ORA-481: LMON process terminated with error
System State dumped to trace file
/oracle/g01/admin/PFNR1011/bdump/pfnr1011_diag_1344.trc
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_ckpt_1513.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_j002_5622.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_dbw3_1491.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:47 2007
Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_j003_5637.trc:
ORA-481: LMON process terminated with error
Mon Mar 26 18:53:52 2007
Instance terminated by LMON, pid = 1349
Mon Mar 26 18:54:50 2007
Starting ORACLE instance (normal)
Mon Mar 26 18:55:29 2007
Reviewing pfnr1011_lmon_1349.trc:
===================================
*** 18:53:27.122
kjfcdrmrfg: SYNC TIMEOUT (275372, 274471, 900), step 31
Submitting asynchronized dump request [28]
kjctseventdump-end tail 205 heads 0 @ 0 205 @ 1047039906
sync() timed out – lmon exiting
kjfsprn: sync status inst 0 tmout 900 (sec)
kjfsprn: sync propose inc 8 level 396
kjfsprn: sync inc 8 level 396
waiting for ‘ges remote message’ blocking sess=0x0 seq=4510 wait_time=0
seconds since wait started=442
waittime=40, loop=0, p3=0
Dumping Session Wait History
for ‘ges remote message’ count=1 wait_time=746343
waittime=40, loop=0, p3=0
WORKAROUND:
———–
no
RELATED BUGS:
————-
similar to bug 5399702 – base bug 5181297
could be bug 4947571 – base bug 4940890
REPRODUCIBILITY:
—————-
no
TEST CASE:
———-
no
STACK TRACE:
————
*** 18:53:27.159
Dumping diagnostic information for ospid 1352:
OS pid = 1352
loadavg : 1.44 1.39 1.39
swap info: free_mem = 34099.71M rsv = 27064.04M
alloc = 26321.46M avail = 95169.25 swap_free = 95911.83M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY
TIME CMD
0 S oracle 1352 1 0 40 20 ? 2646619 ? Mar 23
console 22:22 ora_lmd0_PFNR1011
1352: ora_lmd0_PFNR1011
—————– lwp# 1 / thread# 1 ——————–
ffffffff7a8cde8c ioctl (9, 373f, 105e6ba10)
ffffffff7fffc910) + 15f4
710
0000000100e662e0 ksliwat (105c00, 2, 8, 91a000160, 0, 0) + b60
0000000100e66820 kslwaitns (8, 1, 32, 0, 40, 0) + 20
00000001010d61f4 kskthbwt (8, 1, 32, 0, 40, 0) + d4
0000000100e667dc kslwait (8, 32, 0, 15e, 190, 0) + 5c
00000001010e8344 ksxprcv (104fb4, 105d68558, 104fb4128, 1468, 105d68,
104fb4000) + 364
0000000101591254 kjctr_rksxp (40, 403fe88f8, 0, ffffffff7fffd978, 14,
ffffffff7fffd974) + 1f4
0000000101592e24 kjctrcv (ffffffff79626208, 403fe88f8, 105e992c0,
ffffffff7fffe1bc, 40, 32) + 164
000000010157f6a0 kjcsrmg (ffffffff796261f0, 0, 40, 32, 0, 105d71) + 60
00000001015dc098 kjmdm (a, 44, 4097ab030, 8, 4097ab030, 0) + 2ff8
0000000101002a60 ksbrdp (105d6b, 380007774, 380000, 38000e, 105c00,
1015d90a0) + 380
00000001024219f8 opirip (105d75000, 105c00, 105d7d, 380007000, 105d75,
105df1ae0) + 338
00000001002fe790 opidrv (105d77d18, 1, 32, 0, 32, 105c00) + 4b0
00000001002f8e30 sou2o (ffffffff7ffff468, 32, 4, ffffffff7ffff490,
1056ac000, 1056ac) + 50
00000001002bc2ec opimai_real (3, ffffffff7ffff568, 0, 0, 1e42ee4, 14400) +
10c
00000001002bc118 main (1, ffffffff7ffff678, 0, ffffffff7ffff570,
ffffffff7ffff680, ffffffff7aa00140) + 98
00000001002bc03c _start (0, 0, 0, 0, 0, 0) + 17c
SUPPORTING INFORMATION:
———————–
will be uploaded
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
Viral Shah – customer – 267-467-6950. This is a sev2 p1 SR.
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————
Ct is on pre-production. Planning to go production next week. Ct said he cannot upgrade to 10.2.0.3.
Hdr: 9405510 10.2.0.4 RDBMS 10.2.0.4 RAC PRODID-5 PORTID-197 ORA-481 6960699
Abstract: LMON TERMINATES INSTANCE DUE TO SYNC TIMEOUT, STEP 31
PROBLEM:
——–
Two-node cluster.
Both nodes always shut down (abort) because of ORA-481.
LMON apparently terminated due to a timeout recorded in the LMON trace file
kjfcdrmrfg: SYNC TIMEOUT (289446, 288545, 900), step 31
DIAGNOSTIC ANALYSIS:
——————–
alert_CSP1.log.20100203
————
Thu Jan 28 17:57:09 2010
Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lmon_18295.trc:
ORA-481: LMON process terminated with error
Thu Jan 28 17:57:09 2010
LMON: terminating instance due to error 481
Thu Jan 28 17:57:09 2010
Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lms0_18299.trc:
ORA-481: LMON process terminated with error
Thu Jan 28 17:57:09 2010
Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lms1_18307.trc:
ORA-481: LMON process terminated with error
Thu Jan 28 17:57:09 2010
Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lmd0_18297.trc:
ORA-481: LMON process terminated with error
Thu Jan 28 17:57:09 2010
System state dump is made for local instance
System State dumped to trace file
/cfs/oradata/CSP/admin/CSP/bdump/csp1_diag_18291.trc
Thu Jan 28 17:57:09 2010
Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_pmon_18289.trc:
ORA-481: LMON process terminated with error
Thu Jan 28 17:57:11 2010
Shutting down instance (abort)
alert_CSP2.log.20100203
——————
Thu Jan 28 17:57:10 2010
Trace dumping is performing id=[cdmp_20100128175709]
Thu Jan 28 17:57:18 2010
Reconfiguration started (old inc 64, new inc 66)
List of nodes:
1
From csp1_lmon_18295.trc
=========
*** 17:41:47.285
Begin DRM(1328)
sent syncr inc 64 lvl 8809 to 0 (64,0/31/0)
sent synca inc 64 lvl 8809 (64,0/31/0)
…
sent syncr inc 64 lvl 8840 to 0 (64,0/38/0)
sent synca inc 64 lvl 8840 (64,0/38/0)
End DRM(1328)
Begin DRM(1329)
*** 17:56:49.084
kjfcdrmrfg: SYNC TIMEOUT (289446, 288545, 900), step 31
Submitting asynchronized dump request [28]
==
WORKAROUND:
———–
Disable DRM as below
_gc_undo_affinity=FALSE
_gc_affinity_time=0
RELATED BUGS:
————-
5181297
REPRODUCIBILITY:
—————-
TEST CASE:
———-
STACK TRACE:
————
SUPPORTING INFORMATION:
———————–
This seems to be very similar to bug 5181297, which is fixed in 10.2.0.3. But ct is on 10.2.0.4.
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————
Hdr: 5407742 10.2.0.2 RDBMS 10.2.0.2 RAC PRODID-5 PORTID-46 ORA-481
Abstract: LMON CRASHES WITH ORA-481 KJFCDRMRFG: SYNC TIMEOUT STEP 31
PROBLEM:
——–
12 nodes with 2 instances on each node.
They ran into bug 4631662 and upgraded to 10.2.0.2 on 7/9/06.
Instance 1 of the NMC database crashed with ORA-481.
Thu Jul 20 09:25:19 2006
Errors in file
/opt/app/oracle/product/10.2.0/RAC/admin/NMC/bdump/nmc1_lmon_9754.trc:
ORA-481: LMON process terminated with error
Thu Jul 20 09:25:19 2006
LMON: terminating instance due to error 481
This is their production system and they cannot afford to have instances crash.
DIAGNOSTIC ANALYSIS:
——————–
From the lmon trace:
nmc1_lmon_9754.trc
—————————
*** 09:10:01.138
sent syncr inc 718 lvl 17953 to 0 (718,0/31/0)
sent synca inc 718 lvl 17953 (718,0/31/0)
sent syncr inc 718 lvl 17954 to 0 (718,0/34/0)
sent synca inc 718 lvl 17954 (718,0/34/0)
sent syncr inc 718 lvl 17955 to 0 (718,0/36/0)
sent synca inc 718 lvl 17955 (718,0/36/0)
sent syncr inc 718 lvl 17956 to 0 (718,0/38/0)
sent synca inc 718 lvl 17956 (718,0/38/0)
*** 09:25:04.835
kjfcdrmrfg: SYNC TIMEOUT (924581, 923680, 900), step 31
Submitting asynchronized dump request [28]
KJC Communication Dump:
state 0x5 flags 0x0 mode 0x0 inst 0 inc 718
nrcv 3 nsp 3 nrcvbuf 1000
reg_msg: sz 420 cur 216 (s:0 i:216) max 925 ini 2350
big_msg: sz 4128 cur 38 (s:0 i:38) max 205 ini 2350
rsv_msg: sz 4128 cur 0 (s:0 i:0) max 0 tot 1000
rcvr: id 2 orapid 8 ospid 9788
rcvr: id 1 orapid 7 ospid 9784
rcvr: id 0 orapid 6 ospid 9756
…………….
kjctseventdump-end tail 247 heads 0 @ 0 247 @ 931725070 247 @ 931725070 247 @
931725070 247 @ 931725070 247 @ 931725070 247 @ 931725070 247 @ 931725070 247
931725070 247 @ 931725070 247 @ 931725070 247 @ 931725070
sync() timed out – lmon exiting
kjfsprn: sync status inst 0 tmout 900 (sec)
kjfsprn: sync propose inc 718 level 17956
kjfsprn: sync inc 718 level 17956
kjfsprn: sync bitmap 0 1 2 3 4 5 6 7 8 9 10 11
kjfsprn: dmap ver 718 (step 0)
kjfsprn: ftdone bitmap 0 1 2 3 5 6 7 8 9 10 11
This seems to match Bug 5237240, and that bug is at status 30.
WORKAROUND:
———–
none
RELATED BUGS:
————-
Bug 5181297 – LMON TERMINATES INSTANCE DUE TO SYNC TIMEOUT (STEP 31) –
ORA-481
– fixed in 10.2.0.3
Bug 5131042 – ORA-481: LMON PROCESS TERMINATED WITH ERROR SYNC TIMEOUT
– dup of 4903532
– fix in 10.2.0.3
REPRODUCIBILITY:
—————-
TEST CASE:
———-
STACK TRACE:
————
there is no lmon stack trace
SUPPORTING INFORMATION:
———————–
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————