11g新特性:Rolling Upgrade With Physical Standby

从Oracle 10.1.0.3开始引入了利用逻辑备库(logical standby)实施滚动升级(rolling upgrade)的特性;在滚动升级期间(rolling upgrade),允许主库(primary database)与逻辑备库(logical standby)间运行不同的数据库版本,以最小化宕机时间。

到了11g中增加了可以将物理备库(physical standby)临时性转换成逻辑备库(logical standby)以完成滚动升级,之后将该临时逻辑备库反转为物理备库的功能。使用以上临时转换功能只需要在转换语句”ALTER DATABASE RECOVER TO LOGICAL STANDBY”后加上”KEEP IDENTITY”选项。除去转换细节的区别外,使用物理备库进行滚动升级的过程与10g中的逻辑备库滚动升级没有太大的区别。

这里我们通过实验来亲身体验一下这一新特性,我们假设环境中存在一套11.1.0.7的Data Guard系统:

Database Role DB_UNIQUE_NAME Version
Primary Database PROD 11.1.0.7
Physical Standby SBDB2 11.1.0.7

当前的需求是将生产数据库升级到11.2.0.2,但要求最小化应用downtime;这里我们就可以充分利用Data Guard的环境借势升级,同时也不会破坏原有的HA可用性:

1.滚动升级前需要为Primary Database做必要的准备工作,这里如果数据库之前没有启用闪回数据库的话需要enable

SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE FLASHBACK ON;

SQL> select flashback_on from v$database;

FLASHBACK_ON
------------------
YES

SQL> ALTER DATABASE OPEN;

/* 同时创建保证还原点以保证升级失败时可以回退 */

SQL> create restore point pre_rolling_upgrd guarantee flashback database;

2.将物理备库临时转换为逻辑备库,不同于普通的转换这里要用KEEP IDENTITY

SQL> shutdown immediate;
SQL> startup mount;

之后需要在Primary Database中创建Logminer dictionary日志挖掘字典:
SQL> EXECUTE DBMS_LOGSTDBY.BUILD;

以上步骤完成后可以开始将physical standby转换为logical standby了:
SQL> ALTER DATABASE RECOVER TO LOGICAL STANDBY KEEP IDENTITY;

SQL> alter database open;

以下为转换期间的日志:
ALTER DATABASE RECOVER TO LOGICAL STANDBY KEEP IDENTITY
Media Recovery Start: Managed Standby Recovery (SBDB2)
Fast Parallel Media Recovery enabled
Managed Standby Recovery not using Real Time Apply
 parallel recovery started with 2 processes
Media Recovery Waiting for thread 1 sequence 184
2011-03-25 19:36:51.012000 +08:00
Primary database is in MAXIMUM AVAILABILITY mode
Standby controlfile consistent with primary
kcrrvslf: active RFS archival for log 8 thread 1 sequence 188
RFS[3]: Successfully opened standby log 7: '/standby/oradata/SBDB2/onlinelog/o1_mf_7_6qf9b6cw_.log'
2011-03-25 19:40:43.074000 +08:00
Primary database is in MAXIMUM AVAILABILITY mode
Standby controlfile consistent with primary
RFS[3]: Successfully opened standby log 8: '/standby/oradata/SBDB2/onlinelog/o1_mf_8_6qf9b8j4_.log'
Media Recovery Log /standby/flash_recovery_area/SBDB2/archivelog/2011_03_25/o1_mf_1_188_6rrzsqvm_.arc
2011-03-25 19:40:45.459000 +08:00
Media Recovery Log /standby/flash_recovery_area/SBDB2/archivelog/2011_03_25/o1_mf_1_189_6rrzsv10_.arc
Incomplete Recovery applied until change 11881985 time 03/25/2011 19:40:41
Media Recovery Complete (SBDB2)
tkcrrxms: Killing 140733193388036 processes (all RFS)
2011-03-25 19:40:48.011000 +08:00
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
RESETLOGS after incomplete recovery UNTIL CHANGE 11881985
Resetting resetlogs activation ID 157033137 (0x95c22b1)
Online log /standby/oradata/SBDB2/onlinelog/o1_mf_1_6qf99xt6_.log: Thread 1 Group 1 was previously cleared
Online log /standby/flash_recovery_area/SBDB2/onlinelog/o1_mf_1_6qf99y1p_.log: Thread 1 Group 1 was previously cleared
Online log /standby/oradata/SBDB2/onlinelog/o1_mf_2_6qf9b01v_.log: Thread 1 Group 2 was previously cleared
Online log /standby/flash_recovery_area/SBDB2/onlinelog/o1_mf_2_6qf9b06y_.log: Thread 1 Group 2 was previously cleared
Online log /standby/oradata/SBDB2/onlinelog/o1_mf_3_6qf9b21o_.log: Thread 1 Group 3 was previously cleared
Online log /standby/flash_recovery_area/SBDB2/onlinelog/o1_mf_3_6qf9b268_.log: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 11881983
Setting recovery target incarnation to 4
Converting standby mount to primary mount.
ACTIVATE STANDBY: Complete - Database mounted as primary (SBDB2)
Completed: ALTER DATABASE RECOVER TO LOGICAL STANDBY KEEP IDENTITY

我们还需要完成一系列逻辑备库的配置:

a.禁止在逻辑备库端删除外籍日志 

SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('LOG_AUTO_DELETE', 'FALSE');
BEGIN DBMS_LOGSTDBY.APPLY_SET('LOG_AUTO_DELETE', 'FALSE'); END;

ERROR at line 1:
ORA-06550: line 1, column 7:
PLS-00201: identifier 'DBMS_LOGSTDBY.APPLY_SET' must be declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored

/* 缺少DBMS_LOGSTDBY包,可以从以下脚本创建 */

SQL> @?/rdbms/admin/dbmslsby.sql

SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('LOG_AUTO_DELETE', 'FALSE');
PL/SQL procedure successfully completed.

b.执行以下语句可以将逻辑备库不支持的而又在主库上运行过的事务记录到DBA_LOGSTDBY_EVENTS中

SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('MAX_EVENTS_RECORDED',DBMS_LOGSTDBY.MAX_EVENTS);
PL/SQL procedure successfully completed.

c.启动在逻辑备库上的SQL APPLY:

SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
Database altered.

3.在步骤3中我们实际实施对逻辑备库的升级操作

1).在逻辑备库上停止SQL APPLY 

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;

2).升级逻辑备库(upgrade logical standby)

SQL> shutdown immediate;

[maclean@rh6 admin]$ cp $ORACLE_HOME/dbs/spfileSBDB2.ora /s01/product/11.2.0/dbhome_2/dbs
[maclean@rh6 dbs]$ orapwd file=orapwSBDB2 password=XXXXX entries=10 

/* 将原SBDB2实例的spfile复制到新的11.2.0.2版本的ORACLE_HOME下,并在新的ORACLE_HOME/dbs目录下创建
   与主库一致的密码文件 */

[maclean@rh6 ~]$ source 11gr2env.sh

[maclean@rh6 ~]$ export ORACLE_SID=SBDB2

SQL> startup upgrade;

SQL> @?/rdbms/admin/catupgrd
..................
/* 确认升级数据字典成功,值得一提的是11g的数据字典升级工作极度复杂,因此该阶段耗费大量时间 */

3)在逻辑备库上启用SQL APPLY

SQL>  ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

SQL> ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YY HH24:MI:SS';
Session altered.

SQL> SELECT SYSDATE, APPLIED_TIME FROM V$LOGSTDBY_PROGRESS;
SYSDATE 	   APPLIED_TIME
------------------ ------------------
25-MAR-11 21:26:26 25-MAR-11 21:26:21

/* 监控以上SQL APPLY进度,是否与当前时间接近 */

4)审查升级过程中的监控事件,可以直接查询DBA_LOGSTDBY_EVENTS视图,
该视图记录了在逻辑备库上没有applied的DDL与DML操作

SQL> SET LONG 1000
SQL> SET PAGESIZE 180
SQL> SET LINESIZE 79
SQL> SELECT EVENT_TIMESTAMP, EVENT, STATUS FROM DBA_LOGSTDBY_EVENTS ORDER BY EVENT_TIMESTAMP;

EVENT_TIMESTAMP
---------------------------------------------------------------------------
EVENT
-------------------------------------------------------------------------------
STATUS
-------------------------------------------------------------------------------
25-MAR-11 07.53.39.593831 PM

ORA-16111: log mining and apply setting up

25-MAR-11 07.53.39.594478 PM

Apply LWM 11881984, HWM 11881984, SCN 11881984

25-MAR-11 07.57.54.467786 PM

Shutdown acknowledged

25-MAR-11 07.57.57.516431 PM

ORA-16128: User initiated stop apply successfully completed

25-MAR-11 09.06.42.150573 PM

ORA-16111: log mining and apply setting up

25-MAR-11 09.06.42.345581 PM

Apply LWM 11883610, HWM 11883615, SCN 11883614

25-MAR-11 09.12.56.429392 PM

ORA-16128: User initiated stop apply successfully completed

25-MAR-11 09.26.16.124581 PM

ORA-16111: log mining and apply setting up

25-MAR-11 09.26.16.145925 PM

Apply LWM 11883610, HWM 11883615, SCN 11883614

25-MAR-11 09.26.20.723913 PM
ALTER DATABASE OPEN
ORA-16226: DDL skipped due to lack of support

ORA-16226错误表示相关DDL操作在逻辑备库上不受支持,常见的原因是该类操作涉及内部对象
ORA-16129错误表示相关DML操作在逻辑备库上不受支持

5)开始主备库之间的切换(switchover)

ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY;

SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE;

SWITCHOVER_STATUS
--------------------
TO PRIMARY

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

Database altered.

以上切换完成后当前主库(Primary Database)的日志将无法再传输到逻辑备库(PROD)中,当前主库SBDB的版本为11.2.0.2,
而逻辑备库RPOD(原主库)仍为11.1.0.7;因为当前主库的日志都无法传送到低版本的logical standby中,这意味着当前主库是不受
任何保护的状态,这是滚动升级中无法避免的一个真空期。
现在我们可以在原逻辑备库上启动用户应用程序了。

4.将原主库(PROD)闪回到之前创建的还原点上
SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;

SQL> select name from v$restore_point;

NAME
--------------------------------------------------------------------------------
PRE_ROLLING_UPGRD

SQL> flashback database to restore point pre_rolling_upgrd;

5.以新的ORACLE_HOME binary加载原主库(PROD)实例
SQL> shutdown immediate;
/* 先将原主库(PROD)实例关闭  */

[maclean@rh6 dbs]$ cp /standby/product/11.1.0/db_3/dbs/spfilePROD.ora $ORACLE_HOME/dbs
[maclean@rh6 dbs]$ orapwd file=orapwPROD password=XXXXX entries=10 

/* 在新版本的ORACLE_HOME下复制spfile并重建密码文件 */

[maclean@rh6 dbs]$ export ORACLE_SID=PROD

SQL> startup mount;

/* 这里我们不需要运行升级脚本,后面的redo apply会自动升级PROD上的数据字典 */

6.将原主库(PROD)切换会物理备库(physical standby)
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
SQL> SHUTDOWN IMMEDIATE;

7.启动原主库(PROD)当前物理备库上的介质恢复进程
SQL> STARTUP MOUNT;
SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
USING CURRENT LOGFILE DISCONNECT FROM SESSION;

8.在主库与物理备库间实施切换

SBDB2 switchover回物理备库状态
SQL>  alter database commit to switchover to physical standby;

PROD切换为Primary Database
SQL> alter database commit to switchover to primary;

因为数据字典升级期间产生了大量的重做日志,切换之前需要完成对这些redo的应用,
所以该阶段将耗费大量时间,可以从PROD的告警日志中看到类似下面的内容:

Media Recovery Log /standby/arch/1_51_746739648.dbf
2011-03-26 00:32:09.625000 +08:00
Media Recovery Log /standby/arch/1_52_746739648.dbf
2011-03-26 00:32:16.931000 +08:00
Media Recovery Log /standby/arch/1_53_746739648.dbf
2011-03-26 00:32:23.149000 +08:00
Media Recovery Log /standby/arch/1_54_746739648.dbf
2011-03-26 00:32:31.732000 +08:00
Media Recovery Log /standby/arch/1_55_746739648.dbf
2011-03-26 00:32:43.807000 +08:00
Media Recovery Log /standby/arch/1_56_746739648.dbf

9.出于节约磁盘空间的考虑将PROD上最初建立的还原点删除:
SQL> DROP RESTORE POINT PRE_ROLLING_UPGRD

Startup Upgrade为我们做了什么?

从Oracle 10g以后数据库升级前我们都需要将实例关闭,并以startup upgrade的升级模式启动数据库。那么startup upgrade到底对instance做了什么事?在这样的模式下升级可以绕过那些错误呢?

就常规判断可以猜测startup upgrade升级模式至少会限制以下几点:

  1. 关闭自动作业队列
  2. 关闭回收站功能
  3. 禁用系统触发器
  4. 以限制模式启动,仅允许以SYSDBA身份登录

为了了解更多的细节,我们不妨直观地来观察以下升级模式对比普通模式有哪些不同之处:

SQL> startup upgrade;

以升级模式启动后观察告警日志内容:
Starting ORACLE instance (normal)

以上信息表明虽然是upgrde模式,但其实仍是以NORMAL启动
...........
ALTER SYSTEM enable restricted session;

实例mount之后启用了限制会话特性

ALTER SYSTEM SET _system_trig_enabled=FALSE SCOPE=MEMORY;

_system_trig_enabled参数控制一些系统触发器是否被启用,例如某些登录审计触发器
(system trigger facilities are used as an additional audit trail or as a mechanism to prevent
certain actions taking place or as part of a virtual private database solution)

ALTER SYSTEM SET aq_tm_processes=0 SCOPE=MEMORY;

设置aq_tm_processes参数为0,相当于禁用了对信息队列的监控和控制

ALTER SYSTEM SET resource_manager_plan='' SCOPE=MEMORY;

升级期间显然不需要任何resource plan来凑热闹

ALTER SYSTEM SET _undo_autotune=FALSE SCOPE=MEMORY;

设置隐藏参数_undo_autotune为false,禁用undo自动调优特性

ALTER SYSTEM SET undo_retention=900 SCOPE=MEMORY;

900似乎一直是undo保存的理想值

ALTER SYSTEM SET enable_ddl_logging=FALSE SCOPE=MEMORY;

在11g中增加了新的DDL日志功能,这里需要通过设置enable_ddl_logging为false来禁用该特性

ALTER SYSTEM SET recyclebin=‘OFF’ DEFERRED SCOPE=MEMORY;

在10g下似乎不会关闭回收站,但其实这一点很有必要;

Completed: ALTER DATABASE OPEN MIGRATE

这里可以看到startup upgrade只是migrate模式的发展

除去以上部分初始化参数的修正外,upgrade升级模式下还会“隐瞒“不报一部分常见的非致命错误,这些常见错误并不影响升级进程;
这样做的用意是为了让在升级过程中只有真正致命(fatal)的错误才被反映和记录下来。

这部分升级常见错误包括:

ORA-00942:"table or view does not exist"       --alter table/view 或 drop table/view时可能出现

以上述ORA-00942错误为例:

SQL> drop table not_exist;

Table dropped.

/* 任意drop或alter不存在的表或索引都不会报错而会通过,以下错误也是类似的 */

ORA-01418: "specified index does not exist"    --DROP索引时可能出现
ORA-00955:"name is already used by an existing object"  --当创建某些对象时可能出现
ORA-01430:"column being added already exists in table"    --ALTER TABLE 时可能出现
ORA-04043:"object %s does not exist"                             --修改对象时可能出现

// *Cause: An object name was specified that was not recognized by the system.
// There are several possible causes:
// – An invalid name for a table, view, sequence, procedure, function,
// package, or package body was entered. Since the system could not
// recognize the invalid name, it responded with the message that the
// named object does not exist.
// – An attempt was made to rename an index or a cluster, or some
// other object that cannot be renamed.
// *Action: Check the spelling of the named object and rerun the code. (Valid
// names of tables, views, functions, etc. can be listed by querying
// the data dictionary.)

Utilize Sql Tuning Advisor from Script

Sql Tuning Advisor是10g以后出现的一个十分有用的调优工具,大多数情况下我们可以通过dbconsole或者Grid Control的web界面调用SQL Advisor;但如果系统中没有配置dbconsole或者Grid Control的话,我们则需要通过手动调用DBMS_SQLTUNE PL/SQL程序包来使用该特性。这里我列出一个针对单个SQL语句Autotune的脚本,具体脚本:


begin
  DBMS_SQLTUNE.drop_tuning_task('&task_name');
end;
/



DECLARE
  my_task_name VARCHAR2(30);
  my_sqltext CLOB;
  my_sqlid varchar2(30);
BEGIN
  my_sqlid := '&sqlid';
  my_task_name := dbms_sqltune.create_tuning_task(sql_id => my_sqlid,
                                                  scope => 'COMPREHENSIVE',
                                                  time_limit => 300,
                                                  task_name => '&task_name',
                                                  description => 'comment'                               
                                                  );
END;
/


BEGIN
  dbms_sqltune.execute_tuning_task(task_name => '&task_name');
END;
/

SELECT status FROM USER_ADVISOR_TASKS WHERE task_name = '&task_name';

SET LONG 10000
SET LONGCHUNKSIZE 10000
SET LINESIZE 100
set pages 60

SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('&task_name') FROM DUAL;

具体使用示例:

SQL> begin
  DBMS_SQLTUNE.drop_tuning_task('&task_name');
end;
/
Enter value for task_name: newtask
old   3:   DBMS_SQLTUNE.drop_tuning_task('&task_name');
new   3:   DBMS_SQLTUNE.drop_tuning_task('newtask');
begin
*
ERROR at line 1:
ORA-13605: The specified task or object newtask does not exist for the current
user.
ORA-06512: at "SYS.PRVT_ADVISOR", line 2307
ORA-06512: at "SYS.DBMS_ADVISOR", line 172
ORA-06512: at "SYS.DBMS_SQLTUNE", line 751
ORA-06512: at line 3

SQL> DECLARE
  my_task_name VARCHAR2(30);
  my_sqltext CLOB;
  my_sqlid varchar2(30);
BEGIN
  my_sqlid := '&sqlid';
  my_task_name := dbms_sqltune.create_tuning_task(sql_id => my_sqlid,
                                                  scope => 'COMPREHENSIVE',
                                                  time_limit => 300,
                                                  task_name => '&task_name',
                                                  description => 'comment'                               
                                                  );
END;
/ 
Enter value for sqlid: 17usubxchdf2w
old  11:   my_sqlid := '&sqlid';
new  11:   my_sqlid := '17usubxchdf2w';
Enter value for task_name: new_task 
old  19:						   task_name => '&task_name',
new  19:						   task_name => 'new_task',

SQL> BEGIN
  dbms_sqltune.execute_tuning_task(task_name => '&task_name');
END;
/  
Enter value for task_name: new_task
old   3:   dbms_sqltune.execute_tuning_task(task_name => '&task_name');
new   3:   dbms_sqltune.execute_tuning_task(task_name => 'new_task');

PL/SQL procedure successfully completed.

SQL> SELECT status FROM USER_ADVISOR_TASKS WHERE task_name = '&task_name';
Enter value for task_name: new_task
old   1: SELECT status FROM USER_ADVISOR_TASKS WHERE task_name = '&task_name'
new   1: SELECT status FROM USER_ADVISOR_TASKS WHERE task_name = 'new_task'

STATUS
-----------
COMPLETED

SQL> SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('&task_name') FROM DUAL;
Enter value for task_name: new_task
old   1: SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('&task_name') FROM DUAL
new   1: SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('new_task') FROM DUAL

DBMS_SQLTUNE.REPORT_TUNING_TASK('NEW_TASK')
----------------------------------------------------------------------------------------------------
GENERAL INFORMATION SECTION
-------------------------------------------------------------------------------
Tuning Task Name   : new_task
Tuning Task Owner  : SYS
Workload Type	   : Single SQL Statement
Scope		   : COMPREHENSIVE
Time Limit(seconds): 300
Completion Status  : COMPLETED
Started at	   : 03/25/2011 00:14:41
Completed at	   : 03/25/2011 00:14:45

-------------------------------------------------------------------------------
Schema Name: SYS
SQL ID	   : 17usubxchdf2w
SQL Text   : select count(t1) from hashtab

-------------------------------------------------------------------------------
FINDINGS SECTION (1 finding)
-------------------------------------------------------------------------------

1- Statistics Finding
---------------------
  Table "SYS"."HASHTAB" was not analyzed.

  Recommendation
  --------------
  - Consider collecting optimizer statistics for this table.
    execute dbms_stats.gather_table_stats(ownname => 'SYS', tabname =>
	    'HASHTAB', estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
	    method_opt => 'FOR ALL COLUMNS SIZE AUTO');

  Rationale
  ---------
    The optimizer requires up-to-date statistics for the table in order to
    select a good execution plan.

-------------------------------------------------------------------------------
EXPLAIN PLANS SECTION
-------------------------------------------------------------------------------

1- Original
-----------
Plan hash value: 485915006

------------------------------------------------------------------------------
| Id  | Operation	   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	     |	   1 |	  13 |	   2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |	     |	   1 |	  13 |		  |	     |
|   2 |   TABLE ACCESS FULL| HASHTAB |	 102 |	1326 |	   2   (0)| 00:00:01 |
------------------------------------------------------------------------------

Oracle内部错误ORA-07445:[_memcmp()+88] [SIGSEGV]一例

一套Sparc Solaris上的10.2.0.1数据库,告警日志中出现ORA-07445:[_memcmp()+88] [SIGSEGV]内部错误日志,具体日志如下:

Errors in file /global/oracle1/centDB/admin/centDB/udump/centdb_ora_8749.trc:
ORA-07445: exception encountered: core dump [_memcmp()+88] [SIGSEGV] [Address not mapped to object] [0x000000010] []

/global/oracle1/centDB/admin/centDB/udump/centdb_ora_8749.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /global/oracle1/ORAHOME1/product/10.2/db_1
System name: SunOS
Node name: ora03ud-us
Release: 5.10
Version: Generic_142900-13
Machine: sun4u
Instance name: centDB
Redo thread mounted by this instance: 1
Oracle process number: 41
Unix process pid: 8749, image: oraclecentDB@ora03ud-us

*** SERVICE NAME:(SYS$USERS) 2011-03-08 04:54:18.528
*** SESSION ID:(1226.58882) 2011-03-08 04:54:18.528
Exception signal: 11 (SIGSEGV), code: 1 (Address not mapped to object), addr: 0x10, PC: [0xffffffff7d600ca4, _memcmp()+88]
*** 2011-03-08 04:54:18.533
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [_memcmp()+88] [SIGSEGV] [Address not mapped to object] [0x000000010] [] []
----- Call Stack Trace -----
ksedmp <- ssexhd <- sighndlr <- call_user_handler <- sigacthandler
  <- memcmp <- kpzgkvl <- kziaia <- kpolnb <- kpolon
   <- opiodr <- ttcpip <- opitsk <- opiino <- opiodr
   <- opidrv <- sou2o <- opimai_real <- main <- start

(session) sid: 1226 trans: 0, creator: 3bd522c00, flag: (41) USR/- BSY/-/-/-/-/-
     DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
     txn branch: 0
     oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
O/S info: user: root, term: unknown, ospid: , machine: 7cta-031-eqism
     program: JDBC Thin Client
application name: JDBC Thin Client, hash value=0
last wait for 'SQL*Net message from client' blocking sess=0x0 seq=2 wait_time=5705 seconds since wait started=0
      driver id=74637000, #bytes=1, =0
Dumping Session Wait History
 for 'SQL*Net message from client' count=1 wait_time=5705
      driver id=74637000, #bytes=1, =0
 for 'SQL*Net message to client' count=1 wait_time=5
      driver id=74637000, #bytes=1, =0
temporary object counter: 0
 ----------------------------------------
 Virtual Thread:
 kgskvt: 3a6bc1be8, sess: 3bcdc71e8, vc: 0, proc: 0
 consumer group cur: (upd? 0), mapped: , orig:
 vt_state: 0x0, vt_flags: 0x20, blkrun: 0
 is_assigned: 0, in_sched: 0 (0)
 vt_active: 0 (pending: 0)
 used quanta: 0 (cg: 0)
 cpu start time: 0, quantum status: 0x0
 quantum checks to skip: 0, check thresh: 0
 idle time: 0, active time: 0 (cg: 0)
 cpu yields: 0 (cg: 0), waits: 0 (cg: 0), wait time: 0 (cg: 0)
 queued time outs: 0, time: 0 (cur 0, cg 0)
 calls aborted: 0, num est exec limit hit: 0
 undo current: 0k max: 0k

以上7445内部错误并未导致实例意外终止crash,可以看到其最近的stack call为:memcmp kpzgkvl kziaia kpolnb kpolon opiodr ttcpip opitsk opiino opiodr opidrv sou2o opimai_real main start;通过Metalink搜索可以同Bug 5292883的调用堆栈匹配,Bug Note如下:

Bug 5292883  Dump from OCI client using OCI7 olog() call
Affects:

Product (Component)	 Oracle Server (Rdbms)
Range of versions believed to be affected	 Versions < 11
Versions confirmed as being affected	
10.2.0.3
Platforms affected	 Generic (all / most platforms affected)
Fixed:

This issue is fixed in	
10.2.0.2 Patch 10 on Windows Platforms
10.2.0.3 Patch 7 on Windows Platforms
10.2.0.4 (Server Patch Set)
11.1.0.6 (Base Release)
Symptoms:

Related To:

Process May Dump (ORA-7445) / Abend / Abort
Dump in or under kpzgkvl / kziaia
OCI
Description

On a 64 bit machines a dump can occur with the following stack
if the client uses the olog() OCI call to connect.
  memcmp()<-kpzgkvl()<-kziaia()<--kpolnb()<-kpolon() 

Workaround
 Use OCIServerAttach and OCISessionBegin instead of olog()

Hdr: 5292883 10.2.0.2.0 RDBMS 10.2.0.2.0 SECURITY PRODID-5 PORTID-197 ORA-7445
Abstract: ORA-7445: EXCEPTION ENCOUNTERED: CORE DUMP [_MEMCMP()+160] WHEN CONNECTING TO I

PROBLEM:
--------
 1. Clear description of the problem encountered

    OCI client connection fails with ORA-7445 [_memcmp()+160]. At the 
    time of error occurence no connection to database could be established 
    by OCI clients. Only sqlplus sessions were able to connect. The alertlog
    confirms a lot of ORA-7445 which were logged. Client version is 9.2.0.6. 

    ALERT LOG
    ---------
    Tue Jun  6 18:59:14 2006
     Submitted all GCS remote-cache requests
     Post SMON to start 1st pass IR
     Fix write in gcs resources
    Reconfiguration complete
    Tue Jun  6 19:02:32 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_16756.trc:
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2000000000] [] []
    Tue Jun  6 19:02:32 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_16756.trc:
    ORA-81: address range [0x60000000000A7D70, 0x60000000000A7D74) is not 
readable
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2000000000] [] []
    Tue Jun  6 19:04:42 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_20524.trc:
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2A00000000] [] []
    Tue Jun  6 19:04:42 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_20524.trc:
    ORA-81: address range [0x60000000000A7D70, 0x60000000000A7D74) is not 
readable
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2A00000000] [] []
    ...    

 2. Pertinent configuration information (MTS/OPS/distributed/etc)  
 
    3 instance RAC database.
    Errors were logged for 2 of these instances.
    
 3. Indication of the frequency and predictability of the problem  
 
    Intermittend occurence.
    Not reproducible at will.
    
 4. Technical impact on the customer. Include persistent after effects.

    No connections possible from ERP application which fails with ORA-7445.
    More than 500 ERP users affected.

STACK TRACE:
------------
_memcmp()+160        call                  
kpzgkvl()+192        call     _memcmp()            2000000002 ?
                                                   4000000001229FF0 ?
                                                   000000011 ?
kziaia()+480         call     kpzgkvl()            9FFFFFFFFFFFADB0 ?
                                                   9FFFFFFFFFFFAE18 ?
                                                   4000000001229FF0 ?
                                                   000000011 ? 000000000 ?
                                                   9FFFFFFFFFFF6ED8 ?
                                                   9FFFFFFFFFFF6EE0 ?
                                                   9FFFFFFFFFFF6ED0 ?
kpolnb()+1344        call     kziaia()             9FFFFFFFFFFF8040 ?
                                                   9FFFFFFFFFFF6EE0 ?
                                                   9FFFFFFFFFFF6ED8 ?
                                                   9FFFFFFFFFFF81E0 ?
                                                   9FFFFFFFFFFF81D8 ?
                                                   9FFFFFFFFFFF81E8 ?
                                                   000000000 ?
                                                   400000000233B440 ?
kpolon()+336         call     kpolnb()             9FFFFFFFFFFF8030 ?
                                                   4000000003F47E10 ?
                                                   9FFFFFFFFFFF6F80 ?
                                                   600000000009DB00 ?
                                                   00000820D ?
opiodr()+2064        call     kpolon()             000000051 ?
                                                   60000000000219A8 ?
                                                   9FFFFFFFFFFF81F0 ?
                                                   9FFFFFFFFFFF81B0 ?
                                                   60000000000AAA50 ?
                                                   40000000030BE570 ?
ttcpip()+1824        call     opiodr()             60000000000AA3B0 ?
                                                   6000000000015DD0 ?
                                                   9FFFFFFFFFFFA9A0 ?
                                                   6000000000015DD0 ?
                                                   9FFFFFFFFFFF82D0 ?
                                                   600000000009DB00 ?
                                                   00000001A ?
                                                   6000000000021838 ?

该Bug 5292883在10.2.0.1上没有相应的one-off patch补丁,而在11g和10.2.0.4补丁集中得到修复(fix)。如果无法实施补丁的话,那么一般可以通过以下2种途径绕过该问题:
1)限制用户名和密码的长度在9个字符以内
2)若使用OCI,登录使用OCIServerAttach和OCISessionBegin函数

How to find error message from OMS repository

OEM或者OMS管理工具为我们提供了方便的访问数据库度量告警和错误信息的界面,那么我们是否可以通过手动查询的方式来直接查看这些信息呢?答案是肯定的,这需要我们对OMS repository有一定的了解:

SQL> desc sysman.MGMT_CURRENT_METRIC_ERRORS;
 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 TARGET_GUID				   NOT NULL RAW(16)
 METRIC_GUID				   NOT NULL RAW(16)
 COLL_NAME				   NOT NULL VARCHAR2(64)
 AGENT_GUID				   NOT NULL RAW(16)
 COLLECTION_TIMESTAMP			   NOT NULL DATE
 METRIC_ERROR_MESSAGE				    VARCHAR2(4000)
 METRIC_ERROR_TYPE				    NUMBER(1)

/* MGMT_CURRENT_METRIC_ERRORS记录了当前出现在OMS console中的度量错误记录 */

SQL> desc sysman.MGMT_METRIC_ERRORS;
 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 TARGET_GUID				   NOT NULL RAW(16)
 METRIC_GUID				   NOT NULL RAW(16)
 COLL_NAME				   NOT NULL VARCHAR2(64)
 AGENT_GUID				   NOT NULL RAW(16)
 COLLECTION_TIMESTAMP			   NOT NULL DATE
 METRIC_ERROR_MESSAGE				    VARCHAR2(4000)
 METRIC_ERROR_TYPE				    NUMBER(1)

/* MGMT_METRIC_ERRORS为度量错误历史记录 */

SQL> SELECT METRIC_ERROR_MESSAGE,COLLECTION_TIMESTAMP,METRIC_ERROR_TYPE FROM MGMT_CURRENT_METRIC_ERRORS WHERE COLLECTION_TIMESTAMP<'20-Mar-11';

METRIC_ERROR_MESSAGE				   COLLECTIO METRIC_ERROR_TYPE
-------------------------------------------------- --------- -----------------
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1,SQLINPARAM4]	   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Missing Properties : [SQLINPARAM1]		   03-MAR-11		     0
Result has repeating key value : OCM_APPLY,apply   05-MAR-11		     0

/* 以上查询显示了3月20前出现的仍在console中的信息记录 */

SQL> desc sysman.mgmt$alert_current;
 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 TARGET_NAME				   NOT NULL VARCHAR2(256)
 TARGET_TYPE				   NOT NULL VARCHAR2(64)
 TARGET_GUID				   NOT NULL RAW(16)
 VIOLATION_GUID 				    RAW(16)
 METRIC_NAME				   NOT NULL VARCHAR2(64)
 METRIC_COLUMN				   NOT NULL VARCHAR2(64)
 METRIC_GUID				   NOT NULL RAW(16)
 METRIC_LABEL					    VARCHAR2(64)
 COLUMN_LABEL					    VARCHAR2(256)
 KEY_VALUE					    VARCHAR2(256)
 KEY_VALUE2					    VARCHAR2(256)
 KEY_VALUE3					    VARCHAR2(256)
 KEY_VALUE4					    VARCHAR2(256)
 KEY_VALUE5					    VARCHAR2(256)
 COLLECTION_TIMESTAMP				    DATE
 ALERT_STATE					    VARCHAR2(18)
 VIOLATION_LEVEL			   NOT NULL NUMBER
 VIOLATION_TYPE 				    VARCHAR2(19)
 MESSAGE					    VARCHAR2(4000)
 MESSAGE_NLSID					    VARCHAR2(64)
 MESSAGE_PARAMS 				    VARCHAR2(4000)
 ACTION_MESSAGE 				    VARCHAR2(4000)
 ACTION_MESSAGE_NLSID				    VARCHAR2(64)
 ACTION_MESSAGE_PARAMS				    VARCHAR2(4000)
 TYPE_DISPLAY_NAME				    VARCHAR2(128)
 CURRENT_VALUE					    VARCHAR2(1024)

/* mgmt$alert_current记录当前Grid中活跃的警告信息,类似的mgmt$alert_history为其历史记录表 */

SQL> select target_name,message from SYSMAN.mgmt$alert_current;

TARGET_NAME	     MESSAGE
-------------------- ----------------------------------------------------------------------------------------------------
nas:3872	     Difference between OMS system time and Agent system time is 7680 mins and has exceeded the critical
		     threshold 120 mins

rh3:3872	     Agent Virtual Memory Growth is 1.53%
rh3:3872	     Count of targets not uploading exceeded the critical threshold (0). Current value: 2
EMREP_rh3	     User SYS logged on from rh3.oracle.com.
LSN1_rh3	     The listener is down: TNS-12541: TNS:no listener .
rh6:3872	     Difference between OMS system time and Agent system time is 7199 mins and has exceeded the critical
		     threshold 120 mins

Management Services  Management Service Status for nas:4889_Management_Service exceeded the critical threshold (DOWN). Cu
and Repository	     rrent value: DOWN


Oracle宣布终止所有Intel Itanium平台上的软件开发

Oracle最近宣布将停止其所有在Intel Itanium平台上的软件开发,将更加专注于X86和Sparc平台。换而言之我们肯定不会看到ia64版本的Oracle Database 12g了,但是否会就现有版本的Itanium平台软件继续发布补丁集则仍是一个未知之数。

这是继Microsoft和Redhat后又一重量级的厂商为安腾平台敲响喪鐘,很显然这是墙倒众人推,Intel Itanium气数已尽。

ksvcreate: Process creation failed故障一例

今天上午接到电话,用户反映一套10.2.0.4的数据库出现用户无法登录的症状,随即在家里连上VPN远程支持,登上主机以后尝试测试操作系统认证登录和远程登录,发现已经能够正常登录了;为了了解起因,去查看了告警日志alert.log,发现在11:00左右出现大量的ksvcreate:process creation failed错误,具体错误日志如下:

Wed Mar 23 10:00:17 2011
Process m000 died, see its trace file
Wed Mar 23 10:00:17 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:01:18 2011
Process m000 died, see its trace file
Wed Mar 23 10:01:18 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:02:19 2011
Process m000 died, see its trace file
Wed Mar 23 10:02:19 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:54:41 2011
Process P007 died, see its trace file
Wed Mar 23 10:56:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:57:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:58:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:59:18 2011
Process J000 died, see its trace file
Wed Mar 23 10:59:18 2011
kkjcre1p: unable to spawn jobq slave process
Wed Mar 23 10:59:18 2011
Errors in file /oracle/app/oracle/admin/sdh/bdump/sdh_cjq0_717010.trc:

Wed Mar 23 11:01:00 2011
Process m000 died, see its trace file
Wed Mar 23 11:01:00 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:02:01 2011
Process m000 died, see its trace file
Wed Mar 23 11:02:01 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:02:40 2011
Process P007 died, see its trace file
Wed Mar 23 11:03:02 2011
Process m000 died, see its trace file
Wed Mar 23 11:03:02 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:04:03 2011
Process m000 died, see its trace file
Wed Mar 23 11:04:03 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:04:40 2011
Process P007 died, see its trace file

ksvcreate: Process creation failed错误信息一般在Oracle实例在创建一些辅助后台进程(如mmon的子进程m00x或者并行子进程p00x等)时出现进程启动失败时出现,而造成该错误的可能性有多种,包括Oracle实例资源不足、操作系统资源不足等等。其中较为常见的是实例instance的process使用达到上限,可以通过查询v$resource_limit视图来了解实例生命周期内是否发生过process总数暴满的情况:

SQL> select * from v$resource_limit;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- --------------------
processes                                       79             800        800                  800
sessions                                        80             813        885                  885
enqueue_locks                                   18             303      10910                10910
enqueue_resources                               18             166       4112            UNLIMITED
ges_procs                                        0               0          0                    0
ges_ress                                         0               0          0            UNLIMITED
ges_locks                                        0               0          0            UNLIMITED
ges_cache_ress                                   0               0          0            UNLIMITED
ges_reg_msgs                                     0               0          0            UNLIMITED
ges_big_msgs                                     0               0          0            UNLIMITED
ges_rsv_msgs                                     0               0          0                    0
gcs_resources                                    0               0          0                    0
gcs_shadows                                      0               0          0                    0
dml_locks                                        0             153       3892            UNLIMITED
temporary_table_locks                            0               3  UNLIMITED            UNLIMITED
transactions                                  3125            3156        973            UNLIMITED
branches                                         0               3        973            UNLIMITED
cmtcallbk                                        0               2        973            UNLIMITED
sort_segment_locks                               0               8  UNLIMITED            UNLIMITED
max_rollback_segments                           19              43        973                65535
max_shared_servers                               0               0  UNLIMITED            UNLIMITED
parallel_max_servers                             6             122        120                 3600

/* 可以看到processes的MAX_UTILIZATION最大使用数目曾到过LIMIT_VALUE限定的800,
   sessions也有类似的情况 */

从以上V$resource_limit视图的输出来看,极有可能是processes总数达到上限导致了新的后台辅助进程创建失败,其实我们可以很方便地验证这一点:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE	10.2.0.1.0	Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SQL> alter system set processes=20 scope=spfile;
System altered.

/* 将processes参数修改到一个较小值 */

SQL> startup force;
ORACLE instance started.

SQL> select resource_name,MAX_UTILIZATION,LIMIT_VALUE from v$resource_limit where resource_name in ('processes','sessions');

RESOURCE_NAME               MAX_UTILIZATION LIMIT_VALUE
------------------------------ --------------- ---------------
processes                    20           20
sessions                     19           27

 /* 从以上输出可以看到process总数曾经到达20的上限 */

/* 观察告警日志可以看到相关ksvcreate: Process(m000) creation failed的记录 */

[maclean@rh8 bdump]$ tail -20 alert_PROD1.log
ksvcreate: Process(m000) creation failed
Process m000 died, see its trace file
Wed Mar 23 22:01:52 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 22:01:55 2011
Process q000 died, see its trace file
Wed Mar 23 22:01:55 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:07 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:07 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:19 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:19 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:31 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:31 2011
ksvcreate: Process(q000) creation failed

不同于10g,在11g中类似的错误出现后会在告警日志中说明process creation即进程创建失败的具体原因,如进程总数达到上限,那么就会出现ORA-00020错误(maximum number of processes (%s) exceeded. All process state objects are in use.Increase the value of the PROCESSES initialization parameter),类似以下日志:

ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Process m001 submission failed with error = 20
Process m002 submission failed with error = 20
Process m003 submission failed with error = 20
Process m003 submission failed with error = 20
Process m002 submission failed with error = 20
2011-03-23 22:10:07.037000 +08:00
Process q001 submission failed with error = 20

实际上ksvcreate: Process creation failed错误在能够了解其发生的root cause的情况下,并不难解决。在上例中我们可以清楚地了解到是因为数据库在实际运行中出现了processes进程总数达到参数设定上限从而导致问题出现,那么可以合理增加初始化参数processes或者通过修正异常频繁的程序客户端登录来解决该问题。

fast incremental backup failed on standby database

一套Linux上的11.1.0.7的physical standby物理备库在使用fast incremental backup进行高于0级的增量备份时会出现ORA-19648错误,其出错记录如下:

RMAN>  backup incremental level 1 database;

Starting backup at 22-MAR-11
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting incremental level 1 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/standby/oradata/SBDB2/datafile/o1_mf_sysaux_22m6ov92_.dbf
input datafile file number=00003 name=/standby/oradata/SBDB2/datafile/o1_mf_undotbs1_23m6ovap_.dbf
input datafile file number=00006 name=/standby/oradata/SBDB2/datafile/o1_mf_enc_25m6ovba_.dbf
channel ORA_DISK_1: starting piece 1 at 22-MAR-11
channel ORA_DISK_2: starting incremental level 1 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00001 name=/standby/oradata/SBDB2/datafile/o1_mf_system_21m6ov92_.dbf
input datafile file number=00005 name=/standby/oradata/SBDB2/datafile/o1_mf_example_24m6ovar_.dbf
input datafile file number=00004 name=/standby/oradata/SBDB2/datafile/o1_mf_users_26m6ovba_.dbf
channel ORA_DISK_2: starting piece 1 at 22-MAR-11
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 03/22/2011 20:20:28
ORA-19648: datafile 2: incremental-start SCN equals checkpoint SCN
ORA-19640: datafile checkpoint is SCN 1249353 time 03/09/2011 05:50:27
continuing other job steps, job failed will not be re-run
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_DISK_2 channel at 03/22/2011 20:22:33
ORA-19648: datafile 1: incremental-start SCN equals checkpoint SCN
ORA-19640: datafile checkpoint is SCN 1249352 time 03/09/2011 05:50:26

经过分析该ORA-19640->ORA-19648是由11.1.0.7上的Bug引起的,MOS已经确认其为Bug 9288598:

Affects:
Product (Component)	 Oracle Server (Rdbms)
Range of versions believed to be affected	 Versions BELOW 12.1
Versions confirmed as being affected	
11.1.0.7
Platforms affected	 Generic (all / most platforms affected)
Fixed:

This issue is fixed in	
12.1 (Future Release)
11.2.0.2 (Server Patch Set)

An RMAN Backup incremental level 1 can fail with ORA-19648 / ORA-19640 if the standby database is open read only rather than just skipping the datafile.
PROBLEM:
--------
Following error occurring during incremental backup:

Starting backup at 02-DEC-2009 20:37:44
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=154 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: Veritas NetBackup for Oracle - Release 6.5 
(2008052301)
channel ORA_SBT_TAPE_1: starting incremental level 1 datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
input datafile file number=00002 
name=/data/ora_data01/KAHCB4P/KAHCB4P_sysaux_01.dbf
input datafile file number=00001 
name=/data/ora_data01/KAHCB4P/KAHCB4P_system_01.dbf
input datafile file number=00003 
name=/data/ora_data01/KAHCB4P/KAHCB4P_undo_01.dbf
input datafile file number=00004 
name=/data/ora_data01/KAHCB4P/KAHCB4P_users_01.dbf
input datafile file number=00005 
name=/data/ora_data01/KAHCB4P/KAHCB4P_tools_01.dbf
input datafile file number=00006 
name=/data/ora_data01/KAHCB4P/KAHCB4P_backup_test_01.dbf
channel ORA_SBT_TAPE_1: starting piece 1 at 02-DEC-2009 20:38:01
RMAN-3009: failure of backup command on ORA_SBT_TAPE_1 channel at 12/02/2009 
20:38:02
ORA-19648: datafile 2: incremental-start SCN equals checkpoint SCN
ORA-19640: datafile checkpoint is SCN 3911695 time 11/27/2009 12:36:48
continuing other job steps, job failed will not be re-run
channel ORA_SBT_TAPE_1: starting incremental level 1 datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
including current control file in backup set
including current SPFILE in backup set

DIAGNOSTIC ANALYSIS:
--------------------
Same problem as Bug 6903819
Requested backport, but this hasn't fixed the problem.
Confirmed that the patch was installed correctly and relinked successfully.

RELEASE NOTES:
]]Backup incremental leve 1 fails with ORA-19648 and ORA-19640 when standby
]]database opened for read only.
*** 02/14/10 02:43 pm *** (ADD: Impact/Symptom->FEATURE UNUSABLE )

REDISCOVERY INFORMATION:
Backup incremental level 1 fails with ORA-19648 and the standby database is
open read-only.
WORKAROUND:
None

可以从上述Note中看到该Bug需要到11.2.0.2中才得到fix,那么在11.1.0.7上我们有什么办法能解决或者绕过该问题吗?
经过测试,我得到2种workaround的方法:
该ORA-19648错误首先可以通过在物理备库(physical standby)上停止介质恢复(MRP)进程的方式来workaround:

SQL>  alter database recover managed standby database cancel;
Database altered.

RMAN> backup incremental level 1 database;

Starting backup at 22-MAR-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=141 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=140 device type=DISK
channel ORA_DISK_1: starting incremental level 1 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/standby/oradata/SBDB2/datafile/o1_mf_sysaux_22m6ov92_.dbf
input datafile file number=00004 name=/standby/oradata/SBDB2/datafile/o1_mf_users_26m6ovba_.dbf
input datafile file number=00006 name=/standby/oradata/SBDB2/datafile/o1_mf_enc_25m6ovba_.dbf
channel ORA_DISK_1: starting piece 1 at 22-MAR-11
channel ORA_DISK_2: starting incremental level 1 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00001 name=/standby/oradata/SBDB2/datafile/o1_mf_system_21m6ov92_.dbf
input datafile file number=00003 name=/standby/oradata/SBDB2/datafile/o1_mf_undotbs1_23m6ovap_.dbf
input datafile file number=00005 name=/standby/oradata/SBDB2/datafile/o1_mf_example_24m6ovar_.dbf
channel ORA_DISK_2: starting piece 1 at 22-MAR-11
channel ORA_DISK_1: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2km7suui_1_1.bak tag=TAG20110322T212538 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:33
channel ORA_DISK_2: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2lm7suv1_1_1.bak tag=TAG20110322T212538 comment=NONE
channel ORA_DISK_2: backup set complete, elapsed time: 00:00:27
Finished backup at 22-MAR-11

Starting Control File and SPFILE Autobackup at 22-MAR-11
piece handle=/standby/backup/c-157018592-20110322-01.bak comment=NONE
Finished Control File and SPFILE Autobackup at 22-MAR-11

该ORA-19648错误也可以通过禁用fast incremental backup的块跟踪(block change tracking)特性来绕过问题,当然在可能的情况下我们更推荐使用上面那种方法,因为毕竟还可以利用到fast incremental backup特性:

SQL> alter database recover managed standby database using current logfile disconnect;
Database altered.

SQL> select process,status from v$managed_standby;

PROCESS   STATUS
--------- ------------
ARCH	  CONNECTED
ARCH	  CONNECTED
ARCH	  CONNECTED
ARCH	  CONNECTED
MRP0	  APPLYING_LOG

SQL> select status from v$block_change_tracking;

STATUS
----------
ENABLED


SQL> alter database disable block change tracking;
Database altered.


RMAN>  backup incremental level 2 database;

Starting backup at 22-MAR-11
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting incremental level 2 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/standby/oradata/SBDB2/datafile/o1_mf_sysaux_22m6ov92_.dbf
input datafile file number=00004 name=/standby/oradata/SBDB2/datafile/o1_mf_users_26m6ovba_.dbf
input datafile file number=00006 name=/standby/oradata/SBDB2/datafile/o1_mf_enc_25m6ovba_.dbf
channel ORA_DISK_1: starting piece 1 at 22-MAR-11
channel ORA_DISK_2: starting incremental level 2 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00001 name=/standby/oradata/SBDB2/datafile/o1_mf_system_21m6ov92_.dbf
input datafile file number=00003 name=/standby/oradata/SBDB2/datafile/o1_mf_undotbs1_23m6ovap_.dbf
input datafile file number=00005 name=/standby/oradata/SBDB2/datafile/o1_mf_example_24m6ovar_.dbf
channel ORA_DISK_2: starting piece 1 at 22-MAR-11
channel ORA_DISK_1: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2nm7sv8d_1_1.bak tag=TAG20110322T213052 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
channel ORA_DISK_2: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2om7sv8s_1_1.bak tag=TAG20110322T213052 comment=NONE
channel ORA_DISK_2: backup set complete, elapsed time: 00:00:15
Finished backup at 22-MAR-11

Starting Control File and SPFILE Autobackup at 22-MAR-11
piece handle=/standby/backup/c-157018592-20110322-02.bak comment=NONE
Finished Control File and SPFILE Autobackup at 22-MAR-11

Gather more plan statistics by gather_plan_statistics hint

在10g以后我们可以通过利用gather_plan_statistics提示来了解更多的SQL执行统计信息,具体使用方法如下:

SQL> set linesize 150
SQL> set pagesize 2000
SQL> set autotrace traceonly exp

SQL> select avg(SALARY),DEPARTMENT_NAME from 
employees e,departments d where e.DEPARTMENT_ID=d.DEPARTMENT_ID group by DEPARTMENT_NAME;

Execution Plan
----------------------------------------------------------
Plan hash value: 3294250112

---------------------------------------------------------------------------------------------
| Id  | Operation		      | Name	    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	      | 	    |	 27 |	621 |	  5  (20)| 00:00:01 |
|   1 |  HASH GROUP BY		      | 	    |	 27 |	621 |	  5  (20)| 00:00:01 |
|   2 |   NESTED LOOPS		      | 	    |	106 |  2438 |	  4   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL	      | EMPLOYEES   |	107 |	749 |	  3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS BY INDEX ROWID| DEPARTMENTS |	  1 |	 16 |	  1   (0)| 00:00:01 |
|*  5 |     INDEX UNIQUE SCAN	      | DEPT_ID_PK  |	  1 |	    |	  0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")

SQL> show parameter cursor_sharing

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
cursor_sharing			     string	 EXACT

SQL>  show parameter statistics_level

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
statistics_level		     string	 TYPICAL

SQL> set autotrace off;

SQL> select /*+ gather_plan_statistics */   avg(SALARY),DEPARTMENT_NAME from 
employees e,departments d where e.DEPARTMENT_ID=d.DEPARTMENT_ID group by DEPARTMENT_NAME;

SQL> select * from TABLE(dbms_xplan.display_cursor(NULL,NULL,'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	bctzu9xuxay18, child number 0
-------------------------------------
select /*+ gather_plan_statistics */  avg(SALARY),DEPARTMENT_NAME from employees e,departments d
where e.DEPARTMENT_ID=d.DEPARTMENT_ID group by DEPARTMENT_NAME

Plan hash value: 3294250112
-------------------------------------------------------------------------------------------------------
| Id  | Operation		      | Name	    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   1 |  HASH GROUP BY		      | 	    |	   1 |	   27 |     11 |00:00:00.01 |	  219 |
|   2 |   NESTED LOOPS		      | 	    |	   1 |	  106 |    106 |00:00:00.01 |	  219 |
|   3 |    TABLE ACCESS FULL	      | EMPLOYEES   |	   1 |	  107 |    107 |00:00:00.01 |	    7 |
|   4 |    TABLE ACCESS BY INDEX ROWID| DEPARTMENTS |	 107 |	    1 |    106 |00:00:00.01 |	  212 |
|*  5 |     INDEX UNIQUE SCAN	      | DEPT_ID_PK  |	 107 |	    1 |    106 |00:00:00.01 |	  106 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")

/* 可以从starts列看到某种操作执行了多少次,例如这里的INDEX UNIQUE SCAN为107次 */

/*也可以通过SQL_ID来定位计划信息 */


SQL> select t.* 
from v$sql s 
   , table(dbms_xplan.display_cursor(s.sql_id,s.child_number,'ALL IOSTATS LAST')) t where s.sql_id = '&SQL_ID' ;
  

Enter value for sql_id: bctzu9xuxay18 

old   3:    , table(dbms_xplan.display_cursor(s.sql_id,s.child_number,'ALL IOSTATS LAST')) t where s.sql_id = '&SQL_ID'
new   3:    , table(dbms_xplan.display_cursor(s.sql_id,s.child_number,'ALL IOSTATS LAST')) t where s.sql_id = 'bctzu9xuxay18'

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	bctzu9xuxay18, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ avg(SALARY),DEPARTMENT_NAME from 
employees e,departments d where
e.DEPARTMENT_ID=d.DEPARTMENT_ID group by DEPARTMENT_NAME

Plan hash value: 3294250112

---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation		      | Name	    | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------------
|   1 |  HASH GROUP BY		      | 	    |	   1 |	   27 |   621 |     5  (20)| 00:00:01 |     11 |00:00:00.01 |	  219 |
|   2 |   NESTED LOOPS		      | 	    |	   1 |	  106 |  2438 |     4	(0)| 00:00:01 |    106 |00:00:00.01 |	  219 |
|   3 |    TABLE ACCESS FULL	      | EMPLOYEES   |	   1 |	  107 |   749 |     3	(0)| 00:00:01 |    107 |00:00:00.01 |	    7 |
|   4 |    TABLE ACCESS BY INDEX ROWID| DEPARTMENTS |	 107 |	    1 |    16 |     1	(0)| 00:00:01 |    106 |00:00:00.01 |	  212 |
|*  5 |     INDEX UNIQUE SCAN	      | DEPT_ID_PK  |	 107 |	    1 |       |     0	(0)|	      |    106 |00:00:00.01 |	  106 |
---------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1
   3 - SEL$1 / E@SEL$1
   4 - SEL$1 / D@SEL$1
   5 - SEL$1 / D@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - "DEPARTMENT_NAME"[VARCHAR2,30], AVG("SALARY")[22]
   2 - "SALARY"[NUMBER,22], "DEPARTMENT_NAME"[VARCHAR2,30]
   3 - "SALARY"[NUMBER,22], "E"."DEPARTMENT_ID"[NUMBER,22]
   4 - "DEPARTMENT_NAME"[VARCHAR2,30]
   5 - "D".ROWID[ROWID,10]

SQL> alter session set statistics_level=ALL;
Session altered.


/* 在session级别设置statistics_level为ALL,可以为我们提供更为详尽的执行统计信息 */

深入理解Oracle Universal Installer (OUI)

沪ICP备14014813号-2

沪公网安备 31010802001379号