Examine 11g automatic block Corruption recovery

11g的高可用框架中DataGuard为我们带来了大量有用的特性,最引入注目的显然是Active Data Guard,这一特性几乎彻底改观了Oracle HA的原有格局。除了Active Data Guard外Automatic Block Media Repair即自动的块介质恢复也是11g中数据卫士一个不容忽视的特色。该特性通过后台进程ABMR自动将物理备库(physical standby)上的健康数据块传输到主库(primary database)上以替换在主库发现的已损坏的数据块。同样的若物理备库上发现数据块损坏那么也可以利用到以上特性来修复。注意使用该特性无需额外设置db_lost_write_protect参数为非默认的NONE值,ABMR的自动修复不依赖于该参数。

深入研究的话可以发现实际控制Automatic Block Media Repair特性的是一系列隐藏参数,它们包括:

_auto_bmr enabled enable/disable Auto BMR
_auto_bmr_req_timeout 60 Auto BMR Requester Timeout
_auto_bmr_sess_threshold 30 Auto BMR Request Session Threshold
_auto_bmr_pub_timeout 10 Auto BMR Publish Timeout
_auto_bmr_fc_time 60 Auto BMR Flood Control Time
_auto_bmr_bg_time 3600 Auto BMR Process Run Time
_auto_bmr_sys_threshold 100 Auto BMR Request System Threshold

显然这里面_auto_bmr隐藏参数是ABMR特性的开关,其默认值为enabled,而其他参数则定义了abmr的超时和限定阀值,这里不做展开。

这里我们来实地体验一下这种高可用的block repair特性:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE	11.2.0.2.0	Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

/* 演示所用数据库版本为较新的11.2.0.2 */

SQL> conn maclean/maclean
Connected.

SQL> create tablespace abmr datafile size 10M;
Tablespace created.

SQL> select file_name from dba_data_files where tablespace_name='ABMR';

FILE_NAME
--------------------------------------------------------------------------------
/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf

SQL>  create table need_repair tablespace abmr  as select 1 t1 from dual;
Table created.

SQL> exec dbms_stats.gather_table_stats('MACLEAN','NEED_REPAIR');
PL/SQL procedure successfully completed.

SQL> select * from need_repair;
	T1
----------
	 1

SQL>  select dbms_rowid.rowid_block_number(rowid) from need_repair;

DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
------------------------------------
				 131

SQL> alter system flush buffer_cache;
System altered.

SQL> alter system flush buffer_cache;
System altered.

[maclean@rh6 ~]$ bbed FILENAME=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf mode=edit
Password: 

BBED: Release 2.0.0.0.0 - Limited Production on Sat Mar 26 20:42:25 2011

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

************* !!! For Oracle Internal Use only !!! ***************

BBED> set block 131
	BLOCK#         	131

BBED> corrupt block 131
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
Block marked media corrupt.

/* 以上我们使用bbed工具将need_repair表上唯一一行数据所在的数据块损坏 
   这样如果没有11g中automatic block Corruption recovery特性的话,
   应当报ORA-01578:ORACLE data block corrupted错误
*/

SQL> conn maclean/maclean
Connected.
SQL> select * from need_repair;

	T1
----------
	 1

/* 以上查询并未出错,显然已经通过后台调用ABMR进程修复了该数据块 */

/* 告警日志中记录了ABMR的修复过程 */

Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24289.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x00000083
 last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x000006ff
 check value in block header: 0x154d
 computed block checksum: 0x0
Reading datafile '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf' for corruption at rdba: 0x01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)

/* 以上dedicated server process 2次从disk读取该块都发现损坏后,启动了后台进程ABMR,
   在实例启动时abmr进程并不随instance启动,仅当需要时被服务进程启动 
 */

Starting background process ABMR
ABMR started with pid=33, OS id=24293 
Automatic block media recovery service is active.
Automatic block media recovery requested for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
WARNING: AutoBMR fixed mismatched on-disk single block 83 with in-mem rdba 1400083.

/* 同时abmr不是fatal的后台进程,杀死该进程不会导致不良影响 */

[maclean@rh6 ~]$ ps -ef|grep abmr
maclean  24293     1  0 20:43 ?        00:00:00 ora_abmr_PROD
maclean  24390 22254  0 20:49 pts/0    00:00:00 grep abmr
[maclean@rh6 ~]$ kill -9 24293

/* 如果不想使用Automatic Block Media Repair特性,抑或者因为该特性出现了一些问题的话,
   设置_auto_bmr为diabled即禁用该特性一般可以帮助我们绕过问题 */


SQL> alter system set "_auto_bmr"=disabled;
System altered.

SQL> select * from need_repair;
select * from need_repair
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5:
'/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf'

/* 如预期的出现了ORA-01578错误 */

相关的告警日志内容 :
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24742.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x00000083
 last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x000006ff
 check value in block header: 0x154d
 computed block checksum: 0x0
Reading datafile '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf' for corruption at rdba: 0x01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)
Corrupt Block Found
         TSN = 5, TSNAME = ABMR
         RFN = 5, BLK = 131, RDBA = 20971651
         OBJN = 13773, OBJD = 13773, OBJECT = NEED_REPAIR, SUBOBJECT = 
         SEGMENT OWNER = MACLEAN, SEGMENT TYPE = Table Segment
Errors in file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24742.trc  (incident=5081):
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5: '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf'
Incident details in: /s01/diag/rdbms/prod/PROD/incident/incdir_5081/PROD_ora_24742_i5081.trc
2011-03-26 21:08:18.718000 +08:00
Sweep [inc][5081]: completed
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/incident/incdir_5081/PROD_m000_24753_i5081_a.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during validation
Data in bad block:
 type: 6 format: 2 rdba: 0x00000083
 last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x000006ff
 check value in block header: 0x154d
 computed block checksum: 0x0
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Checker run found 1 new persistent data failures
Dumping diagnostic data in directory=[cdmp_20110326210819], requested by (instance=1, osid=24742), summary=[incident=5081].
2011-03-26 21:08:21.458000 +08:00
Sweep [inc2][5081]: completed

to be continued …………

沪ICP备14014813号-2

沪公网安备 31010802001379号