Examine 11g automatic block Corruption recovery

2011/03/26 by admin 5 Comments

11g的高可用框架中DataGuard为我们带来了大量有用的特性,最引入注目的显然是Active Data Guard，这一特性几乎彻底改观了Oracle HA的原有格局。除了Active Data Guard外Automatic Block Media Repair即自动的块介质恢复也是11g中数据卫士一个不容忽视的特色。该特性通过后台进程ABMR自动将物理备库(physical standby)上的健康数据块传输到主库(primary database)上以替换在主库发现的已损坏的数据块。同样的若物理备库上发现数据块损坏那么也可以利用到以上特性来修复。注意使用该特性无需额外设置db_lost_write_protect参数为非默认的NONE值，ABMR的自动修复不依赖于该参数。

深入研究的话可以发现实际控制Automatic Block Media Repair特性的是一系列隐藏参数，它们包括:

_auto_bmr	enabled	enable/disable Auto BMR
_auto_bmr_req_timeout	60	Auto BMR Requester Timeout
_auto_bmr_sess_threshold	30	Auto BMR Request Session Threshold
_auto_bmr_pub_timeout	10	Auto BMR Publish Timeout
_auto_bmr_fc_time	60	Auto BMR Flood Control Time
_auto_bmr_bg_time	3600	Auto BMR Process Run Time
_auto_bmr_sys_threshold	100	Auto BMR Request System Threshold

显然这里面_auto_bmr隐藏参数是ABMR特性的开关，其默认值为enabled，而其他参数则定义了abmr的超时和限定阀值，这里不做展开。

这里我们来实地体验一下这种高可用的block repair特性:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE	11.2.0.2.0	Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

/* 演示所用数据库版本为较新的11.2.0.2 */

SQL> conn maclean/maclean
Connected.

SQL> create tablespace abmr datafile size 10M;
Tablespace created.

SQL> select file_name from dba_data_files where tablespace_name='ABMR';

FILE_NAME
--------------------------------------------------------------------------------
/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf

SQL>  create table need_repair tablespace abmr  as select 1 t1 from dual;
Table created.

SQL> exec dbms_stats.gather_table_stats('MACLEAN','NEED_REPAIR');
PL/SQL procedure successfully completed.

SQL> select * from need_repair;
	T1
----------
	 1

SQL>  select dbms_rowid.rowid_block_number(rowid) from need_repair;

DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
------------------------------------
				 131

SQL> alter system flush buffer_cache;
System altered.

SQL> alter system flush buffer_cache;
System altered.

[maclean@rh6 ~]$ bbed FILENAME=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf mode=edit
Password: 

BBED: Release 2.0.0.0.0 - Limited Production on Sat Mar 26 20:42:25 2011

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

************* !!! For Oracle Internal Use only !!! ***************

BBED> set block 131
	BLOCK#         	131

BBED> corrupt block 131
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
Block marked media corrupt.

/* 以上我们使用bbed工具将need_repair表上唯一一行数据所在的数据块损坏 
   这样如果没有11g中automatic block Corruption recovery特性的话，
   应当报ORA-01578:ORACLE data block corrupted错误
*/

SQL> conn maclean/maclean
Connected.
SQL> select * from need_repair;

	T1
----------
	 1

/* 以上查询并未出错，显然已经通过后台调用ABMR进程修复了该数据块 */

/* 告警日志中记录了ABMR的修复过程 */

Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24289.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x00000083
 last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x000006ff
 check value in block header: 0x154d
 computed block checksum: 0x0
Reading datafile '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf' for corruption at rdba: 0x01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)

/* 以上dedicated server process 2次从disk读取该块都发现损坏后，启动了后台进程ABMR,
   在实例启动时abmr进程并不随instance启动，仅当需要时被服务进程启动 
 */

Starting background process ABMR
ABMR started with pid=33, OS id=24293 
Automatic block media recovery service is active.
Automatic block media recovery requested for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
WARNING: AutoBMR fixed mismatched on-disk single block 83 with in-mem rdba 1400083.

/* 同时abmr不是fatal的后台进程，杀死该进程不会导致不良影响 */

[maclean@rh6 ~]$ ps -ef|grep abmr
maclean  24293     1  0 20:43 ?        00:00:00 ora_abmr_PROD
maclean  24390 22254  0 20:49 pts/0    00:00:00 grep abmr
[maclean@rh6 ~]$ kill -9 24293

/* 如果不想使用Automatic Block Media Repair特性，抑或者因为该特性出现了一些问题的话，
   设置_auto_bmr为diabled即禁用该特性一般可以帮助我们绕过问题 */


SQL> alter system set "_auto_bmr"=disabled;
System altered.

SQL> select * from need_repair;
select * from need_repair
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5:
'/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf'

/* 如预期的出现了ORA-01578错误 */

相关的告警日志内容 :
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24742.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x00000083
 last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x000006ff
 check value in block header: 0x154d
 computed block checksum: 0x0
Reading datafile '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf' for corruption at rdba: 0x01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)
Corrupt Block Found
         TSN = 5, TSNAME = ABMR
         RFN = 5, BLK = 131, RDBA = 20971651
         OBJN = 13773, OBJD = 13773, OBJECT = NEED_REPAIR, SUBOBJECT = 
         SEGMENT OWNER = MACLEAN, SEGMENT TYPE = Table Segment
Errors in file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24742.trc  (incident=5081):
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5: '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf'
Incident details in: /s01/diag/rdbms/prod/PROD/incident/incdir_5081/PROD_ora_24742_i5081.trc
2011-03-26 21:08:18.718000 +08:00
Sweep [inc][5081]: completed
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/incident/incdir_5081/PROD_m000_24753_i5081_a.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during validation
Data in bad block:
 type: 6 format: 2 rdba: 0x00000083
 last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x000006ff
 check value in block header: 0x154d
 computed block checksum: 0x0
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Checker run found 1 new persistent data failures
Dumping diagnostic data in directory=[cdmp_20110326210819], requested by (instance=1, osid=24742), summary=[incident=5081].
2011-03-26 21:08:21.458000 +08:00
Sweep [inc2][5081]: completed

to be continued …………

Filed Under: Oracle Tagged With: Automatic Block Media Repair, data guard, Oracle 11g新特性, Oracle Corruption

Comments

honcho says

2011/10/11 at 15:19

为什么我没有成功！
ALTER SYSTEM: Flushing buffer cache
Hex dump of (file 5, block 11) in trace file /u01/app/oracle/diag/rdbms/prim/prim/trace/prim_ora_7496.trc
Corrupt block relative dba: 0x0140000b (file 5, block 11)
Fractured block found during multiblock buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x0140000b
last change scn: 0x0000.0013e1d5 seq: 0x3 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x00000000
check value in block header: 0xa4ae
computed block checksum: 0xe2fb
Reading datafile ‘/u01/app/oracle/oradata/prim/honcho01.dbf’ for corruption at rdba: 0x0140000b (file 5, block 11)
Reread (file 5, block 11) found same corrupt data
Tue Oct 11 14:57:20 2011
Corrupt Block Found
TSN = 8, TSNAME = HONCHO
RFN = 5, BLK = 11, RDBA = 20971531
OBJN = 73559, OBJD = 73559, OBJECT = HONCHO, SUBOBJECT =
SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Hex dump of (file 5, block 12) in trace file /u01/app/oracle/diag/rdbms/prim/prim/trace/prim_ora_7496.trc
Corrupt block relative dba: 0x0140000c (file 5, block 12)
Completely zero block found during multiblock buffer read
Reading datafile ‘/u01/app/oracle/oradata/prim/honcho01.dbf’ for corruption at rdba: 0x0140000c (file 5, block 12)
Reread (file 5, block 12) found same corrupt data
Corrupt Block Found
TSN = 8, TSNAME = HONCHO
RFN = 5, BLK = 12, RDBA = 20971532
OBJN = 73559, OBJD = 73559, OBJECT = HONCHO, SUBOBJECT =
SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Hex dump of (file 5, block 13) in trace file /u01/app/oracle/diag/rdbms/prim/prim/trace/prim_ora_7496.trc
Corrupt block relative dba: 0x0140000d (file 5, block 13)
Fractured block found during multiblock buffer read
Data in bad block:
type: 0 format: 0 rdba: 0x00000000
last change scn: 0x0000.00000000 seq: 0x0 flg: 0x00
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe1d50603
check value in block header: 0x0
block checksum disabled
Reading datafile ‘/u01/app/oracle/oradata/prim/honcho01.dbf’ for corruption at rdba: 0x0140000d (file 5, block 13)
Reread (file 5, block 13) found same corrupt data
Corrupt Block Found
TSN = 8, TSNAME = HONCHO
RFN = 5, BLK = 13, RDBA = 20971533
OBJN = 73559, OBJD = 73559, OBJECT = HONCHO, SUBOBJECT =
SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Errors in file /u01/app/oracle/diag/rdbms/prim/prim/trace/prim_ora_7496.trc (incident=33749):
ORA-01578: ORACLE data block corrupted (file # 5, block # 11)
ORA-01110: data file 5: ‘/u01/app/oracle/oradata/prim/honcho01.dbf’
Incident details in: /u01/app/oracle/diag/rdbms/prim/prim/incident/incdir_33749/prim_ora_7496_i33749.trc
Tue Oct 11 14:57:23 2011
Identified End-Of-Redo for thread 1 sequence 93
Identified End-Of-Redo for thread 1 sequence 93
Tue Oct 11 14:57:24 2011
Trace dumping is performing id=[cdmp_20111011145724]
Tue Oct 11 14:57:25 2011
Sweep [inc][33749]: completed
Hex dump of (file 5, block 11) in trace file /u01/app/oracle/diag/rdbms/prim/prim/incident/incdir_33749/prim_m000_7521_i33749_a.trc
Corrupt block relative dba: 0x0140000b (file 5, block 11)
Fractured block found during validation
Data in bad block:
type: 6 format: 2 rdba: 0x0140000b
last change scn: 0x0000.0013e1d5 seq: 0x3 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x00000000
check value in block header: 0xa4ae
computed block checksum: 0xe2fb
Reread of blocknum=11, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=11, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=11, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=11, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=11, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Hex dump of (file 5, block 12) in trace file /u01/app/oracle/diag/rdbms/prim/prim/incident/incdir_33749/prim_m000_7521_i33749_a.trc
Corrupt block relative dba: 0x0140000c (file 5, block 12)
Completely zero block found during validation
Reread of blocknum=12, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=12, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=12, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=12, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=12, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Hex dump of (file 5, block 13) in trace file /u01/app/oracle/diag/rdbms/prim/prim/incident/incdir_33749/prim_m000_7521_i33749_a.trc
Corrupt block relative dba: 0x0140000d (file 5, block 13)
Fractured block found during validation
Data in bad block:
type: 0 format: 0 rdba: 0x00000000
last change scn: 0x0000.00000000 seq: 0x0 flg: 0x00
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe1d50603
check value in block header: 0x0
block checksum disabled
Reread of blocknum=13, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=13, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=13, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=13, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Reread of blocknum=13, file=/u01/app/oracle/oradata/prim/honcho01.dbf. found same corrupt data
Checker run found 2 new persistent data failures
Tue Oct 11 14:58:52 2011
Hex dump of (file 5, block 11) in trace file /u01/app/oracle/diag/rdbms/prim/prim/trace/prim_ora_7496.trc
Corrupt block relative dba: 0x0140000b (file 5, block 11)
Fractured block found during multiblock buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x0140000b
last change scn: 0x0000.0013e1d5 seq: 0x3 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x00000000
check value in block header: 0xa4ae
computed block checksum: 0xe2fb
Reading datafile ‘/u01/app/oracle/oradata/prim/honcho01.dbf’ for corruption at rdba: 0x0140000b (file 5, block 11)
Reread (file 5, block 11) found same corrupt data
Tue Oct 11 14:58:52 2011
Corrupt Block Found
TSN = 8, TSNAME = HONCHO
RFN = 5, BLK = 11, RDBA = 20971531
OBJN = 73559, OBJD = 73559, OBJECT = HONCHO, SUBOBJECT =
SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Errors in file /u01/app/oracle/diag/rdbms/prim/prim/trace/prim_ora_7496.trc (incident=33750):
ORA-01578: ORACLE data block corrupted (file # 5, block # 11)
ORA-01110: data file 5: ‘/u01/app/oracle/oradata/prim/honcho01.dbf’
Incident details in: /u01/app/oracle/diag/rdbms/prim/prim/incident/incdir_33750/prim_ora_7496_i33750.trc
Tue Oct 11 14:58:53 2011
Trace dumping is performing id=[cdmp_20111011145853]
Tue Oct 11 14:58:54 2011
Sweep [inc][33750]: completed
Sweep [inc2][33750]: completed
Sweep [inc2][33749]: completed

回复
honcho says

2011/10/11 at 15:37

后台进程BMR并没有启动
A physical standby database operating in real-time query mode can also be used to
repair corrupt data blocks in a primary database. If possible, during automatic block
repair any corrupt data block encountered when a primary database is accessed is
automatically replaced with an uncorrupted copy of that block from a physical
standby database operating in real-time query mode. Note that for this to work, the
standby database must be synchronized with the primary database.

我是按照上述要求做的

回复
- admin says
  
  2011/10/11 at 18:27
  
  请把你的详细步骤贴出来
  
  回复
Honcho says

2011/10/11 at 22:45

可以了，谢谢！出了点小差错！嘿嘿！

回复
admin says

2011/10/11 at 23:03

Hdr: 11809377 11.2.0.1 RDBMS 11.2.0.1 DATAGUARD_PSBY PRODID-5 PORTID-226
Abstract: AUTO BLOCK MEDIA RECOVERY IS CAUSING PASSWORD ERRORS ON PHYSICAL STANDBY

PROBLEM:
——–
Errors in file
/u01/oracle/db/diag/rdbms/p411/p411i8/trace/p411i8_pr0e_9448.trc:
ORA-17629: Cannot connect to the remote database server
ORA-17629: Cannot connect to the remote database server
ORA-17629: Cannot connect to the remote database server

DIAGNOSTIC ANALYSIS:
——————–
alert_p411i8.log
=================
Sun Dec 05 04:43:06 2010 <-- start of log Adjusting the default value of parameter parallel_max_servers ... Mon Feb 07 18:58:13 2011 <-- first ORA-1017 Block version read by Primary database is higher than the block version on Standby database => possible lost write on the Standby
Errors in file
/u01/oracle/db/diag/rdbms/p411/p411i8/trace/p411i8_pr05_9422.trc:
ORA-17629: Cannot connect to the remote database server
…
Wed Feb 09 15:46:04 2011 <-- last ORA-1017 Errors in file /u01/oracle/db/diag/rdbms/p411/p411i8/trace/p411i8_pr0e_9448.trc: ORA-17629: Cannot connect to the remote database server ... Wed Feb 09 22:37:51 2011 <-- end of log Media Recovery Waiting for thread 6 sequence 17867 p411i8_pr0e_9448.trc ==================== Warning: recovery process PR0E cannot use async I/O Block version read by Primary database is higher than the *** 18:59:42.193 block version on Standby database => possible lost write on the Standby
*** 18:59:42.365 4539 krsu.c
Logged on to standby successfully
krsd_get_primary_connect_string: found pcs ‘P411DG_EX4’ by FAL_SERVER lookup
‘
ORA-17629: Cannot connect to the remote database server

p411i8_pr05_9422.trc
====================
Warning: recovery async reads limited to 128 (req 1024)
Block version read by Primary database is higher than the

*** 18:58:13.032
block version on Standby database => possible lost write on the Standby
*** 18:58:13.143 4539 krsu.c
Logged on to standby successfully
krsd_get_primary_connect_string: found pcs ‘P411DG_EX4’ by FAL_SERVER lookup
‘
ORA-17629: Cannot connect to the remote database server

WORKAROUND:
———–
None

RELATED BUGS:
————-

REPRODUCIBILITY:
—————-
Error is occurring intermittently.

TEST CASE:
———-

STACK TRACE:
————

SUPPORTING INFORMATION:
———————–

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-

DIAL-IN INFORMATION:
——————–

IMPACT DATE:
————

Sorry about the delay. I did received feedback from base dev:

“When a primary (or another standby) that encounters corrupt blocks attempts
to fetch the corrupt blocks from a standby database, the target standby
database (which is the source of the good version of the block) has to be
ADG. We cannot fetch a good version of a block from a mounted standby. We can
only fetch a good version of the block from an open database (open read-write
or open read-only). Also, it has to be fairly upto date (but there is no
requirement for it to be driven by LGWR SYNC – just RTA is recommended. I
know there is some confusion here, since some book or paper (MAA or RMAN?)
talks about requiring a synchronized ADG)

If the corruption is noticed at a standby (say due to a lost-write), that
standby does not have to be ADG for autobmr to work – The standby that
encounters corruption can be ADG or simply mounted. However, for it to
receive the good version of the block, the target database, if a standby, has
to be ADG. Of course, the target database for the good version of the block
can also be Primary.”

Basically the ADG requirement only matters for the primary side being able to
use ABMR.

attched from 8 nodes on primary
[oracle@dm04db08 ~]$ dcli -l oracle -g /usr/oracle/dbs_group ” ls -l

dm04db01: -rw-r—– 1 oracle oinstall 1536 Feb 14 10:55
dm04db02: -rw-r—– 1 oracle oinstall 1536 Feb 14 10:51
dm04db03: -rw-r—– 1 oracle oinstall 1536 Feb 14 09:17
dm04db04: -rw-r—– 1 oracle oinstall 1536 Feb 14 10:41
dm04db04: -rw-r—– 1 oracle oinstall 1536 Nov 5 13:54
dm04db05: -rw-r—– 1 oracle oinstall 1536 Nov 19 13:31
dm04db06: -rw-r—– 1 oracle oinstall 1536 Feb 10 07:18
dm04db07: -rw-r—– 1 oracle oinstall 1536 Nov 19 13:31
dm04db08: -rw-r—– 1 oracle oinstall 1536 Feb 14 08:08

then likely the ORA-1017 happens when we’re trying to connect to one of them.

Note that this error isn’t persistent. It does seem to go away so likely a

To confirm, you can have them setup instance-specific Oracle Net aliases in
tnsnames.ora on one of the nodes or use EZConnect string. Then try to
connect remotely to each primary instance to see for sure if they’ll

% sqlplus sys@p411i1 as sysdba
% sqlplus sys@p411i2 as sysdba
% sqlplus sys@p411i3 as sysdba
etc ..

only updated on that instance, not on all instances. Ct will have to re-copy
to the other nodes.

If ct doesn’t want to use ABMR, they can stay with the workaround of
_auto_bmr=disabled. Otherwise I would encourage them to use it and make sure
standby can connect to any primary instance.

I would also recommend they monitor how often the ‘lost write’ is reported.
They may need to investigate further into it if it’s happening regularly.

回复

Examine 11g automatic block Corruption recovery

Comments

Comment 取消回复