【Oracle数据恢复】ORA-600[4194]错误一例

ORA-600[4194]内部错误一般由重做记录与回滚记录不匹配引发。Oracle在验证Undo record number时,会对比redo change 和回滚段中的undo record number,若发现2者存在差异则报该4194错误。其错误argument[a][b],a代表回滚块中的最大undo record number,b代表重做日志中记录的undo record number。这个错误可能由回滚段或者redo log日志文件讹误引起。

ORA-00600[4194]错误的根本原因是 redo记录与回滚段(rollback/undo)记录之间的不一致。当ORACLE在验证undo记录时相对应的变化需要应用到undo数据块的最大undo记录上,此时若检验出错则会报ORA-00600[4194]

 

 

 

此错误不像ORA-600[2662]或ORA-600[4000]错误那样必然导致数据库无法打开,因为它很少出现在前滚阶段;当数据库被打开,smon开始执行事务恢复或一些回滚段的管理工作时则很有可能触发该错误。

 

ORA-600[4194]的2个的含义:

Arg [a] Maximum Undo record number in Undo block
Arg [b] Undo record number from Redo block

 

这个ORA-600[4194] 报错属于ORACLE内核从cache层到事务undo处理,可能的影响是进程失败或者可能的回滚段坏块。

 

可能的bug 包括:

 

8240762  10.2.0.5,
11.1.0.7.10,
11.2.0.1
Undo corruptions with ORA-600 [4193]/ORA-600 [4194] or ORA-600 [4137] /
SMON may spin to recover transaction

 

3210520 9.2.0.5, 10.1.0.2 OERI[kjccqmg:esm] / OERI[4194] / corruption possible in RAC

792610 8.0.6.0, 8.1.6.0 Rollback segment corruption

 

对于非自举对象non-bootstrap对象对应的undo记录可以通过如下方法搞定,如果涉及到的对象是bootstrap系统对象则可能需要手动 bbed来修复, 如果自己搞不定可以找ASKMACLEAN专业数据库修复团队成员帮您恢复

 

 

 

来具体看一下错误记录:

 

 

 

Thu Aug 26 18:58:50 2010
Errors in file /s01/10gdb/admin/YOUYUS/bdump/youyus_smon_6587.trc:
ORA-01595: error freeing extent (3) of rollback segment (4))
ORA-00600: internal error code, arguments: [4194], [53], [41], [], [], [], [], []
Thu Aug 26 18:58:50 2010
..............
Errors in file /s01/10gdb/admin/YOUYUS/bdump/youyus_j000_6630.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2 change 1617922 time 08/26/2010 18:35:39
ORA-00334: archived log: '/s01/10gdb/flash_recovery_area/YOUYUS/onlinelog/o1_mf_3_65psr4on_.log'
Thu Aug 26 19:00:31 2010
Errors in file /s01/10gdb/admin/YOUYUS/bdump/youyus_j000_6630.trc:
ORA-00600: internal error code, arguments: [4194], [53], [41], [], [], [], [], []
Thu Aug 26 19:00:34 2010
Errors in file /s01/10gdb/admin/YOUYUS/bdump/youyus_j000_6630.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2 change 1617922 time 08/26/2010 18:35:39
ORA-00334: archived log: '/s01/10gdb/flash_recovery_area/YOUYUS/onlinelog/o1_mf_3_65psr4on_.log'
ORA-00600: internal error code, arguments: [4194], [53], [41], [], [], [], [], []
Thu Aug 26 19:00:35 2010
Errors in file /s01/10gdb/admin/YOUYUS/bdump/youyus_j000_6630.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2 change 1617922 time 08/26/2010 18:35:39
ORA-00334: archived log: '/s01/10gdb/flash_recovery_area/YOUYUS/onlinelog/o1_mf_3_65psr4on_.log'

ORA-00600: internal error code, arguments: [4194], [53], [41], [], [], [], [], []

 

 

 

如果你因为ORA-600[4194]错误导致数据库无法打开,那么可以尝试设置以下事件:

 

 

SQL> alter system set event='10513 trace name context forever,level 2 : 10512 trace name context forever,level 1: 10511 trace name context forever,level 2: 10510 trace name context forever,level 1' scope=spfile;
System altered.

/* 10513事件用以阻止SMON在启动数据库后执行事务恢复(transaction recovery) */
/* 10512事件用以阻止SMON shrink rollback segment */
/* 10511事件用以阻止SMON check to cleanup undo dictionary */
/* 10500事件用以阻止SMON check to offline pending offline rollback segment */

SQL> alter system set undo_management=MANUAL scope=spfile;
System altered.

SQL> shutdown immediate;
ORA-03113: end-of-file on communication channel

SQL> startup mount;
ORACLE instance started.

Total System Global Area 2634022912 bytes
Fixed Size                  2086288 bytes
Variable Size            2382367344 bytes
Database Buffers          234881024 bytes
Redo Buffers               14688256 bytes
Database mounted.
SQL> alter database open;

Database altered.

SQL>  create undo tablespace undoc datafile size 300M;

SQL> alter system set undo_management=AUTO scope=spfile;
System altered.

SQL>  alter system set undo_tablespace=undoc scope=spfile;
System altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 2634022912 bytes
Fixed Size                  2086288 bytes
Variable Size            2382367344 bytes
Database Buffers          234881024 bytes
Redo Buffers               14688256 bytes
Database mounted.

SQL> alter database open;
Database altered.

/* 通过重建undo表空间可以避免一些4194错误,但不是全部 */

/* 这个库目前处于随时会crash的不可控状态,我们必须要导出数据并导入到新库中 * /

/* 这种情况下direct方式 可能可以规避一些意外错误 */

[maclean@rh2 dump]$ exp maclean/maclean file=full_maclean.dmp owner=maclean  direct=y statistics=none
Export: Release 10.2.0.4.0 - Production on Thu Aug 26 21:18:40 2010
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and UTF8 NCHAR character set

About to export specified users ...
. exporting pre-schema procedural objects and actions
. exporting foreign function library names for user MACLEAN
. exporting PUBLIC type synonyms
. exporting private type synonyms
. exporting object type definitions for user MACLEAN
About to export MACLEAN's objects ...
. exporting database links
. exporting sequence numbers
. exporting cluster definitions
. about to export MACLEAN's tables via Direct Path ...
Table SYS_EXPORT_TABLE_01 will be exported in conventional path.
. . exporting table            SYS_EXPORT_TABLE_01        256 rows exported
Table SYS_EXPORT_TABLE_02 will be exported in conventional path.
. . exporting table            SYS_EXPORT_TABLE_02        257 rows exported
Table SYS_EXPORT_TABLE_03 will be exported in conventional path.
..............
exporting refresh groups and children
. exporting dimensions
. exporting post-schema procedural objects and actions
. exporting statistics
Export terminated successfully with warnings.

/* we are lucky! */

如何修复重编译Datapump工具expdp/impdp

数据泵工具expdp/impdp是10g中引发的服务器端导入导出外部工具,虽然是外部的binary,但是实际expdp/impdp都依赖于内部的PL/SQL package主要是(dbms_datapump),在很多情况下我们需要修复或重新加载Datapump工具,方法如下:

 

对于版本10.1:

 

1. Catdp.sql orders the installation of all its components including 
   the Metadata API which was previously installed separately.
   By default catproc.sql invoke this script.

SQL >@ $ORACLE_HOME/rdbms/admin/catdp.sql

2. dbmspump.sql will create DBMS procedures for dataPUMP

SQL >@ $ORACLE_HOME/rdbms/admin/dbmspump.sql

 

 

对于版本10.2:

 

1. Catdph.sql will Re-Install DataPump types and views

SQL >@ $ORACLE_HOME/rdbms/admin/catdph.sql 

Note: If XDB is installed, then it is required to run "catmetx.sql" script also.

Use this code to verify if XDB is installed:

SQL> select substr(comp_name,1,30) comp_name, substr(comp_id,1,10)
     comp_id,substr(version,1,12) version,status from dba_registry;

Sample output if XDB installed,

Oracle XML Database            XDB        -version-   VALID

2. prvtdtde.plb will Re-Install tde_library packages

SQL >@ $ORACLE_HOME/rdbms/admin/prvtdtde.plb 

3. Catdpb.sql will Re-Install DataPump packages

SQL >@ $ORACLE_HOME/rdbms/admin/catdpb.sql 

4.Dbmspump.sql will Re-Install DBMS DataPump objects

SQL >@ $ORACLE_HOME/rdbms/admin/dbmspump.sql 

5. To recompile  invalid objects, if any

SQL >@ $ORACLE_HOME/rdbms/admin/utlrp.sql

 

对于版本11g :

 

1. Catproc.sql 

SQL >@ $ORACLE_HOME/rdbms/admin/catproc.sql 

2. To recompile invalid objects, if any

SQL >@ $ORACLE_HOME/rdbms/admin/utlrp.sql

 

 

示例使用方法FOR 10.2.0.5:

 

 

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bi
PL/SQL Release 10.2.0.5.0 - Production
CORE    10.2.0.5.0      Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production

select substr(comp_name,1,30) comp_name, substr(comp_id,1,10)
     comp_id,substr(version,1,12) version,status from dba_registry;

确认XDB未安装

@$ORACLE_HOME/rdbms/admin/catdph.sql 

....

Package created.

Grant succeeded.

SQL> @ $ORACLE_HOME/rdbms/admin/prvtdtde.plb 

Library created.

No errors.

Package created.

Synonym created.

Package created.

@ $ORACLE_HOME/rdbms/admin/catdpb.sql 

@ $ORACLE_HOME/rdbms/admin/dbmspump.sql 

@ $ORACLE_HOME/rdbms/admin/utlrp.sql

同时可以把expdp、impdp的binary重新编译一遍

cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk iexpdp iimpdp

- Linking Datapump Export utility (expdp)
rm -f /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/expdp
gcc -o /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/expdp -L/s01/oracle/product/10.2.0.5/db_1/rdbms/lib/ -L/s01/oracle/product/10.2.0.5/db_1/lib/ -L/s01/oracle/product/10.2.0.5/db_1/lib/stubs/   /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/s0udexp.o  /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/defopt.o -ldbtools10 -lclntsh  `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10 -lmm -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10   -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10 -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10   `cat /s01/oracle/product/10.2.0.5/db_1/lib/sysliblist` -Wl,-rpath,/s01/oracle/product/10.2.0.5/db_1/lib -lm    `cat /s01/oracle/product/10.2.0.5/db_1/lib/sysliblist` -ldl -lm   -L/s01/oracle/product/10.2.0.5/db_1/lib
mv -f /s01/oracle/product/10.2.0.5/db_1/bin/expdp /s01/oracle/product/10.2.0.5/db_1/bin/expdpO
mv /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/expdp /s01/oracle/product/10.2.0.5/db_1/bin/expdp
chmod 751 /s01/oracle/product/10.2.0.5/db_1/bin/expdp

 - Linking Datapump Import utility (impdp)
rm -f /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/impdp
gcc -o /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/impdp -L/s01/oracle/product/10.2.0.5/db_1/rdbms/lib/ -L/s01/oracle/product/10.2.0.5/db_1/lib/ -L/s01/oracle/product/10.2.0.5/db_1/lib/stubs/   /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/s0udimp.o  /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/defopt.o -ldbtools10 -lclntsh  `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10 -lmm -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /s01/oracle/product/10.2.0.5/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10   -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10 -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10   `cat /s01/oracle/product/10.2.0.5/db_1/lib/sysliblist` -Wl,-rpath,/s01/oracle/product/10.2.0.5/db_1/lib -lm    `cat /s01/oracle/product/10.2.0.5/db_1/lib/sysliblist` -ldl -lm   -L/s01/oracle/product/10.2.0.5/db_1/lib
mv -f /s01/oracle/product/10.2.0.5/db_1/bin/impdp  /s01/oracle/product/10.2.0.5/db_1/bin/impdpO
mv /s01/oracle/product/10.2.0.5/db_1/rdbms/lib/impdp /s01/oracle/product/10.2.0.5/db_1/bin/impdp
chmod 751 /s01/oracle/product/10.2.0.5/db_1/bin/impdp

[oracle@vrh8 lib]$ ls -l $ORACLE_HOME/bin/*pdp
-rwxr-x--x 1 oracle oinstall 228377 Aug 26 09:15 /s01/oracle/product/10.2.0.5/db_1/bin/expdp
-rwxr-x--x 1 oracle oinstall 233704 Aug 26 09:15 /s01/oracle/product/10.2.0.5/db_1/bin/impdp

SGA_MAX_SIZE,SGA_TARGET以及PRE_PAGE_SGA参数

10g引入ASMM后SGA_TARGET取代shared_pool_size,db_cache_size等参数,成为DBA关注的主要SGA内存管理参数;有不少情况下SGA_TARGET参数会设置为一个小于SGA_MAX_SIZE的值(这样做在多实例情况下更具灵活性)。但不少人会问,这样岂不是要浪费一部分物理内存吗?Oracle会为实例分配SGA_MAX_SIZE大小的内存的,SGA_TARGET要设得和SGA_MAX_SIZE一样大才合理啊!

让我们来看看实际的情况:

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

/*  linux上的10.2.0.4  */

SQL> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 3600M
sga_target                           big integer 368M

SQL> col component for a25;
SQL> select component, current_size / 1024 / 1024, last_oper_type
  2    from v$sga_dynamic_components
  3   where current_size != 0;

COMPONENT                 CURRENT_SIZE/1024/1024 LAST_OPER_TYP
------------------------- ---------------------- -------------
shared pool                                   96 STATIC
large pool                                    16 STATIC
java pool                                     16 STATIC
DEFAULT buffer cache                         224 INITIALIZING

/* 此时的SGA_TARGET为368M,而SGA_MAX_SIZE为3600M */

/* 我们来看一下实际的物理内存使用情况  */

/* 以root用户登录,因为我们需要用到清理文件系统缓存的命令 */

[maclean@rh2 ~]$ su - root
Password:

[root@rh2 ~]# sync
[root@rh2 ~]# sync

/* sync 命令用以写出文件系统脏缓存,类似于Oracle的checkpoint手动检查点 */

[root@rh2 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@rh2 ~]# echo 3 > /proc/sys/vm/drop_caches

/*  将proc文件系统下的drop_caches文件置为3,清理文件系统缓存,以免其干扰我们的实验 */

[root@rh2 ~]# free
             total       used       free     shared    buffers     cached
Mem:       4046352     429328    3617024          0        296     176100
-/+ buffers/cache:     252932    3793420
Swap:     10241428          0   10241428

/*  4g内存中仅有419M处于使用状态,其他均为free。可见Oracle没有为实例分配大小等于SGA_MAX_SIZE的内存空间,而接近于SGA_TARGET_SIZE的大小 */
[root@rh2 ~]# ps -ef|grep pmon|grep -v grep
maclean   6361     1  0 18:35 ?        00:00:00 ora_pmon_YOUYUS

[root@rh2 ~]# pmap -x 6361
6361:   ora_pmon_YOUYUS
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000  100412    7300       0 r-x--  oracle
000000000680f000     544     180      64 rwx--  oracle
0000000006897000     148     104     104 rwx--    [ anon ]
000000001e9d0000     672     532     532 rwx--    [ anon ]
0000000060000000 3688448    1044     388 rwxs-    [ shmid=0x390005 ]

/* 利用pmap工具探测Oracle后台进程的内存地址空间,可以看到这里虚拟共享内存段(也就是SGA)的大小为3602M */

[root@rh2 ~]# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x74018366 3112960    root      600        4          0
0x00000000 3473409    root      644        80         2
0x74018365 3080194    root      600        4          0
0x00000000 3506179    root      644        16384      2
0x00000000 3538948    root      644        280        2
0x1cdd16e8 3735557    maclean   640        3776970752 15

/* 使用ipcs 命令也可以观察到这个大小为3602M的共享内存段 */

/* 可以猜测Oracle在这里使用malloc函数或类似的方法实现虚拟内存的分配,没有像calloc函数那样对空间中的每一位都初始为0,保证了不浪费物理内存  */

/*  我们动态修改SGA_TARGET参数来看看*  /

SQL> alter system set sga_target=3000M;

System altered.

SQL> select component, current_size / 1024 / 1024, last_oper_type
  2    from v$sga_dynamic_components
  3   where current_size != 0;

COMPONENT                 CURRENT_SIZE/1024/1024 LAST_OPER_TYP
------------------------- ---------------------- -------------
shared pool                                   96 STATIC
large pool                                    16 STATIC
java pool                                     16 STATIC
DEFAULT buffer cache                        2864 INITIALIZING

/*   BUFFER_CACHE 的空间大幅增加 * /

SQL> !free
             total       used       free     shared    buffers     cached
Mem:       4046352    3187784     858568          0        536    2889920
-/+ buffers/cache:     297328    3749024
Swap:     10241428          0   10241428

/* used mem部分上升到3113M,随着SGA_TARGET的动态增长而增长了 */

/* 如果我们改小SGA_TARGET呢? */

SQL> alter system set sga_target=368M;

System altered.

SQL> select component, current_size / 1024 / 1024, last_oper_type
  2    from v$sga_dynamic_components
  3   where current_size != 0;

COMPONENT                 CURRENT_SIZE/1024/1024 LAST_OPER_TYP
------------------------- ---------------------- -------------
shared pool                                   96 STATIC
large pool                                    16 STATIC
java pool                                     16 STATIC
DEFAULT buffer cache                         224 SHRINK

/* 可以看到BUFFER CACHE最近执行了SHRINK收缩操作,SIZE下降到224M */

QL> !free
             total       used       free     shared    buffers     cached
Mem:       4046352    3204608     841744          0       1596    2912888
-/+ buffers/cache:     290124    3756228
Swap:     10241428          0   10241428

/* 此时OS层仍认为used memory部分为3130M;但可以放心,它们是可被其他进程复用的  * /

官方对pre_page_sga参数的定义是”PRE_PAGE_SGA determines whether Oracle reads the entire SGA into memory at instance startup. Operating system page table entries are then prebuilt for each page of the SGA. This setting can increase the amount of time necessary for instance startup, but it is likely to decrease the amount of time necessary for Oracle to reach its full performance capacity after startup.” 它决定Oracle是否在启动阶段将整个SGA读入内存,实际上在启动阶段SGA的内容是十分稀疏的,当pre_page_sga为TRUE时Oracle所要做的是向OS所要SGA_MAX_SIZE大小的实际物理页。

我们来看一下这个参数的实际效果:

SQL> alter system set sga_max_size=2500M scope=spfile;
System altered.

SQL> alter system set pre_page_sga=true scope=spfile;
System altered.

SQL> startup force ;
ORACLE instance started.

Total System Global Area 2634022912 bytes
Fixed Size                  2086288 bytes
Variable Size            2382367344 bytes
Database Buffers          234881024 bytes
Redo Buffers               14688256 bytes
Database mounted.
Database opened.

[root@rh2 ~]# echo 3 > /proc/sys/vm/drop_caches

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     TRUE
sga_max_size                         big integer 2512M
sga_target                           big integer 368M

SQL> !free
             total       used       free     shared    buffers     cached
Mem:       4046352    2895256    1151096          0        648    2623692
-/+ buffers/cache:     270916    3775436
Swap:     10241428     104936   10136492

/*  设置pre_page_sga参数为true后,Oracle实际分配的物理内存不再局限于SGA_TARGET,而是在实例生命周期内几乎恒等于SGA_MAX_SIZE指定的值了 * /

/*  当SGA_TARGET配合pre_page_sga使用时,这的确可能造成一种浪费 ! * /

10g中Oracle已经强烈不建议使用pre_page_sga参数了,在ASMM下它带来的问题可不小!

undo backup optimization does not work on 11.2.0.1?

Backup Undo Optimization是11g的新特性之一,RMAN将避免备份撤销表空间上那些已提交事务的撤销数据。且该特性无法被禁用(You can enable and disable backup optimization, but backup undo optimization is built-in behavior.)。

我们在11.2.0.1版本上具体测试一下这个新特性:

SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

/*  为了避免undo自动调优干扰我们的测试,修改_undo_autotune参数禁用该特性 */

SQL> alter system set "_undo_autotune"=false;
System altered.

/* 创建一个新的undo表空间,清理现场 */
SQL> create undo tablespace UNDOTEST datafile size 500M autoextend on next 50M maxsize unlimited;
Tablespace created.

SQL> alter system set undo_tablespace=UNDOTEST;
System altered.

/* 列出相关的undo参数,可以看到这里undo_retention参数设为极短的10s */

SQL> show parameter undo;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_undo_autotune                       boolean     FALSE
undo_management                      string      AUTO
undo_retention                       integer     10
undo_tablespace                      string      UNDOTEST

RMAN> list backup;

specification does not match any backup in the repository

RMAN> backup tablespace UNDOTEST;

RMAN> list backup of tablespace UNDOTEST;

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
4       Full    1.90M      DISK        00:00:00     25-AUG-10

/* undo表空间初始的备份大小为1.9M  */

SQL> conn maclean/maclean
Connected.

SQL> select count(*) from YOUYUS;
  COUNT(*)
----------
    579808

/* YOUYUS表有大约60万条数据,我们批量删除这些数据,将产生大量的undo*/

SQL> delete YOUYUS;
579808 rows deleted.

/*  此时再次执行备份undo表空间操作 */

RMAN> backup tablespace UNDOTEST;

RMAN>  list backup of tablespace UNDOTEST;

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
5       Full    134.43M    DISK        00:00:03     25-AUG-10

/*  在存在大量active undo数据的情况下,备份文件增大到134M */

SQL> commit;
Commit complete.

SQL> exec dbms_lock.sleep(20);
PL/SQL procedure successfully completed.

SQL> select status,sum(blocks) from dba_undo_extents group by status;
STATUS    SUM(BLOCKS)
--------- -----------
UNEXPIRED        2696
EXPIRED         32936

/* commit后等待20s,确定没有active的撤销段 */

RMAN> backup tablespace UNDOTEST;

RMAN> list backup of tablespace UNDOTEST;
BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
6       Full    134.49M    DISK        00:00:02     25-AUG-10

/* 备份文件还要大于commit前,undo backup optimization居然没有起作用? */

/* 这个会是BUG吗? */

根据以上情况我提交了SR,ORACLE GCS给出的回复:

Bug 6399468: UNDO OPTIMIZATION
====> Undo optimization was changed from ‘all undo not needed’ to ‘undo not in use older than 1 hour’
In unpublished bug 6399468 DEV has confirmed the Undo optimization was changed from ‘all undo not needed’ to ‘undo not in use older than 1 hour’. For your last test case please wait for 1 hour and try backup again.

居然又是一个unpublished的BUG,Oracle DEV部门确认了backup undo optimization所避免备份的是1个小时以上未被尝试使用的undo,而非所有不再需要的undo。这是典型的开发部门和文档撰写部门间没有充分交流造成的问题!

/* 尝试等待3600s */

SQL> exec dbms_lock.sleep(3600);
PL/SQL procedure successfully completed.

/* 3600s还真漫长....... */

RMAN> backup tablespace UNDOTEST;

RMAN> list backup of tablespace UNDOTEST;
BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
10      Full    15.01M     DISK        00:00:00     25-AUG-10

/* 备份集缩小到15m,undo backup optimization起到了作用!*/

That's great!

关于DataPump的external_table模式

在pre10g的很长时间内,Oracle仅提供exp/imp导入导出工具,虽然这2个实用程序十分有效(现在也是如此),但因为它们受限于client/server模式工具自身的限制,以普通用户程序的身份来运转数以TB计的数据,其才不堪大用!DataPump是10g以后主推的数据抽取/导入工具,不同于exp/imp工具,DataPump是一个服务器端的实用程序,因为运行在服务器上故而DataPump进程可以直接访问数据文件与SGA(无需借shadow进程之手),与exp/imp工具相比使用DataPump可以获得显著的性能改善。DataPump可以通过直接路径或外部表路径这两种方法导出数据;其中直接路径避开了数据库高速缓存。当使用直接路径模式抽取数据时,DataPump从磁盘直接读取数据文件,抽取和格式化文件内容,最后将内容写出到转储文件。该种模式和SGA交互等待少,其导入导出速度直接取决于数据库所在磁盘速度和cpu;因此,直接路径极为快速。

外部表路径模式将使用到数据库的高速缓存buffer cache,通过外部表路径方法导出数据时,DataPump使用普通的SELECT操作将数据块从数据文件中读入buffer cache,为了写出转储文件,数据会在缓存中被格式化。通过外部表路径导入数据时,DataPump根据转储文件的内容构造标准的插入语句,并且通过将数据块从数据文件读至缓存来执行这些语句,插入操作按照标准的样式在缓存中完成;如同任何普通DML操作一样,外部表路径也会同时产生撤销和重做。

DataPump自身会根据对象的复杂性作出使用直接路径还是外部表路径的选择;对于较复杂的对象(后文将列出)而言,为了分解复杂性而必须同SGA进行交互,此情况下Data Pump只能采用外部表模式。我们还可以通过使用access_method参数来控制其行为,当然这仅在我们确认Data Pump作出了错误选择时才有必要。

若满足右列条件EXPDP将采用direct_path即直接路径模式 表结构允许使用直接路径模式,举例而言:

表上没有启用针对SELECT操作的fine-grained access control

非队列表(queue table)

表上没有BFILE和opaque类型的列,或包含有opaque列的对象类型

表上没有加密列

表上没有被废弃的旧类型列

若表上存在LONG或LONG RAW类型列,则此列只能是最后一列
使用Expdp执行导出任务时没有为相关表指定QUERY, SAMPLE, or REMAP_DATA等参数

需要导出的表或分区相对较少(多达250M),亦或者表或分区其实很大,但导出任务无法工作在并行模式(未指定parallel参数,或parallel参数设置为1)

若满足右列条件EXPDP将采用external_table即外部表模式 数据结构不满足在直接路径模式下抽取的条件,举例而言:

表上启用了针对SELECT操作的精细粒度控制

队列表

表上包含了BFILE或opaque类型列,或者包含有opaque列的对象类型

表上存在加密列

表上存在被废弃的旧类型列

表上存在LONG或LONG RAW类型列,且不是最后列

数据结构满足使用直接路径模式的条件,但执行导出任务时相关表上指定了QUERY, SAMPLE, or REMAP_DATA等参数

数据结构满足使用直接路径模式的条件,但相关的表或分区相对较大(大于250M),此时并行SQL可以用来加速数据抽取

若满足右列条件IMPDP将采用direct_path即直接路径模式 数据结构满足使用直接路径模式的条件,举例而言:

当导入某单一表分区时该分区表上没有建立全局索引,这一点也包括分区的对象表

没有基于LOB列建立的域索引(domain index)

非cluster表

表上没有BFILE列或opaque类型列

表上没有嵌入了opaque类型的VARRAY列

表上没有加密列

没有启用补全日志(Supplemental logging)且表上没有LOB类型列

若导入表已预先建立了表建构,则需满足以下条件:

表上没有激活的触发器 并且 若是分区表则应有索引 并且 表上上没有启用针对INSERT操作的精细粒度控制 并且 表上除去check类型约束不存在其他类型约束 并且 表上没有unique的索引

执行导入任务时没有为相关表指定QUERY, REMAP_DATA等参数

需要导入的表或分区相对较小(少于250M),或者表或分区其实很大,但无法以并行模式导入(未指定parallel参数或指定其为1)

若满足右列条件IMPDP将采用external_table即外部表路径 当数据结构不满足在直接路径模式下导入的条件,举例而言:

当导入某单一表分区时该分区表上建有全局索引,这一点也包括分区的对象表

表上有基于LOB列建立的域索引(domain index)

cluster表

表上有BFILE列或opaque类型列

表上有嵌入了opaque类型的VARRAY列

表上有加密列

启用了补全日志且表上有至少一个LOB列

若导入表已预先建立了表建构,且满足以下条件:

表上有激活的触发器 或者

是分区表且没有任何索引 或者

表上启用了针对INSERT操作的精细粒度控制 或者

表上除去check类型约束还还有其他类型约束 或者

表上有unique的索引

执行导入任务时有为相关表指定QUERY, REMAP_DATA等参数

数据结构满足使用直接路径模式的条件,但相关的表或分区相对较大(大于250M),此时并行SQL可以用来加速数据导入

我们有必要深入了解一下外部表路径究竟是如何工作的:

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> exec dbms_workload_repository.create_snapshot;
PL/SQL procedure successfully completed.

/*  执行expdp导出任务,TCACHE表大小为465M,这里显示指定了外部表路径 */

[maclean@rh2 ~]$ expdp maclean/maclean directory=dumps tables=TCACHE dumpfile=tcache1.dmp access_method=external_table
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 24 August, 2010 20:31:48
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "MACLEAN"."SYS_EXPORT_TABLE_07":  maclean/******** directory=dumps tables=TCACHE dumpfile=tcache1.dmp access_method=external_table
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 544 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "MACLEAN"."TCACHE"                          465.6 MB 4999999 rows
Master table "MACLEAN"."SYS_EXPORT_TABLE_07" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_07 is:
  /s01/dump/tcache1.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_07" successfully completed at 20:32:18

/*  这次我们指定了并行度为4  */

[maclean@rh2 ~]$ expdp maclean/maclean directory=dumps tables=TCACHE dumpfile=tcache2.dmp,tcache3.dmp,tcache4.dmp,tcache5.dmp access_method=external_table parallel=4
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 24 August, 2010 20:32:49
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "MACLEAN"."SYS_EXPORT_TABLE_07":  maclean/******** directory=dumps tables=TCACHE dumpfile=tcache2.dmp,tcache3.dmp,tcache4.dmp,tcache5.dmp access_method=external_table parallel=4
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 544 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "MACLEAN"."TCACHE"                          465.6 MB 4999999 rows
Master table "MACLEAN"."SYS_EXPORT_TABLE_07" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_07 is:
  /s01/dump/tcache2.dmp
  /s01/dump/tcache3.dmp
  /s01/dump/tcache4.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_07" successfully completed at 20:33:04

SQL> exec dbms_workload_repository.create_snapshot;
PL/SQL procedure successfully completed.

SQL> @?/rdbms/admin/awrrpt

接着我们分析该时段内的awr报告!

从AWR报告中,我们不难找到以下语句:
SQL ID 3qwsywpvtdyjp:
CREATE TABLE "ET$088200010001" ( "OWNER", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID", "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPORARY", "GENERATED", "SECONDARY" ) ORGANIZATION EXTERNAL ( TYPE ORACLE_DATAPUMP DEFAULT DIRECTORY DUMPS ACCESS PARAMETERS (DEBUG =0 DATAPUMP INTERNAL TABLE "MACLEAN"."TCACHE" JOB ( "MACLEAN", "SYS_EXPORT_TABLE_07", 1) WORKERID 1 PARALLEL 1 VERSION COMPATIBLE ENCRYPTPASSWORDISNULL ) LOCATION ('bogus.dat') ) PARALLEL 1 REJECT LIMIT UNLIMITED AS SELECT /*+ PARALLEL(KU$, 1) */ "OWNER", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID", "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPORARY", "GENERATED", "SECONDARY" FROM RELATIONAL("MACLEAN"."TCACHE" ) KU$

SQL ID 1rxax27p7anr3:
CREATE TABLE "ET$088000020001" ( "OWNER", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID", "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPORARY", "GENERATED", "SECONDARY" ) ORGANIZATION EXTERNAL ( TYPE ORACLE_DATAPUMP DEFAULT DIRECTORY DUMPS ACCESS PARAMETERS (DEBUG =0 DATAPUMP INTERNAL TABLE "MACLEAN"."TCACHE" JOB ( "MACLEAN", "SYS_EXPORT_TABLE_07", 1) WORKERID 2 PARALLEL 2 VERSION COMPATIBLE ENCRYPTPASSWORDISNULL ) LOCATION ('bogus.dat') ) PARALLEL 2 REJECT LIMIT UNLIMITED AS SELECT /*+ PARALLEL(KU$, 2) */ "OWNER", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID", "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPORARY", "GENERATED", "SECONDARY" FROM RELATIONAL("MACLEAN"."TCACHE" ) KU$

以上就是Expdp在导出数据时使用的创建组织类型为ORACLE_DATAPUMP的外部表的SQL语句;后者语句中指定的并行度为2,而非我所指定的4,因该是Oracle考虑到所需导出表的具体大小,设置过高的并行度反而会适得其反,所以对设定值做了修正。

启用了并行的SQL 1rxax27p7anr3执行时间为22.63s,反而要比不使用并行慢;可见在TCACHE表这个数量级,完全没有必要使用并行导出。

接着我们来探究一下外部表路径的导入:

SQL> truncate table tcache;
Table truncated.
SQL> set pagesize 1400;
SQL> set linesize 140;

/* 执行外部表路径的数据导入,并行度1  */

[maclean@rh2 admin]$ impdp maclean/maclean directory=dumps tables=TCACHE content=data_only  dumpfile=tcache1.dmp access_method=external_table

/* 在以上SQL执行期间查询V$SQL动态视图的相关语句*/

select sql_text,
       sql_id,
       cpu_time,
       elapsed_time,
       disk_reads,
       buffer_gets,
       rows_processed
  from v$sql
 where sql_text like '%TCACHE%'
   and sql_text not like '%like%';

SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID          CPU_TIME ELAPSED_TIME DISK_READS BUFFER_GETS ROWS_PROCESSED
------------- ---------- ------------ ---------- ----------- --------------
INSERT /*+ SYS_DL_CURSOR */ INTO RELATIONAL("MACLEAN"."TCACHE") ("OWNER","OBJECT_NAME","SUBOBJECT_NAME","OBJECT_ID","DATA_OBJECT_ID","OBJECT
_TYPE","CREATED","LAST_DDL_TIME","TIMESTAMP","STATUS","TEMPORARY","GENERATED","SECONDARY") VALUES (NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
ULL,NULL,NULL,NULL,NULL)
6tn47a220d34q        408          408          0           0              0

INSERT /*+APPEND+*/  /*+PARALLEL("TCACHE",1)+*/ INTO RELATIONAL("MACLEAN"."TCACHE" ) ("OWNER", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID",
 "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPORARY", "GENERATED", "SECONDARY")     SELECT "OWNE
R", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID", "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPO
RARY", "GENERATED", "SECONDARY"     FROM "ET$087B00060001" KU$
66q4w8c4ak0wf      31576        31576          0         608              0

/* 当导入job完成后再次查询 */
select sql_text,
       sql_id,
       cpu_time,
       elapsed_time,
       disk_reads,
       buffer_gets,
       rows_processed
  from v$sql
 where sql_text like '%TCACHE%'
   and sql_text not like '%like%';

SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID          CPU_TIME ELAPSED_TIME DISK_READS BUFFER_GETS ROWS_PROCESSED
------------- ---------- ------------ ---------- ----------- --------------
INSERT /*+ SYS_DL_CURSOR */ INTO RELATIONAL("MACLEAN"."TCACHE") ("OWNER","OBJECT_NAME","SUBOBJECT_NAME","OBJECT_ID","DATA_OBJECT_ID","OBJECT
_TYPE","CREATED","LAST_DDL_TIME","TIMESTAMP","STATUS","TEMPORARY","GENERATED","SECONDARY") VALUES (NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
ULL,NULL,NULL,NULL,NULL)
6tn47a220d34q        408          408          0           0              0

/* 执行完成后一句SQL立马消失了!? Oracle好像很不情愿我们看到外部表路径使用的INSERT语句 :) */

INSERT /*+APPEND+*/ /*+PARALLEL("TCACHE",1)+*/
INTO RELATIONAL
  ("MACLEAN"."TCACHE")
  ("OWNER",
   "OBJECT_NAME",
   "SUBOBJECT_NAME",
   "OBJECT_ID",
   "DATA_OBJECT_ID",
   "OBJECT_TYPE",
   "CREATED",
   "LAST_DDL_TIME",
   "TIMESTAMP",
   "STATUS",
   "TEMPORARY",
   "GENERATED",
   "SECONDARY")
  SELECT "OWNER",
         "OBJECT_NAME",
         "SUBOBJECT_NAME",
         "OBJECT_ID",
         "DATA_OBJECT_ID",
         "OBJECT_TYPE",
         "CREATED",
         "LAST_DDL_TIME",
         "TIMESTAMP",
         "STATUS",
         "TEMPORARY",
         "GENERATED",
         "SECONDARY"
    FROM "ET$087F00230001" KU$

/* 这里的KU$代表的就是以ORACLE_DATAPUMP转储文件形式存放在数据库外的外部表 */

/* 可以看到这里尝试使用了APPEND插入模式,以减少undo和redo的产生*/

SQL> truncate table tcache;
Table truncated.

/* 若启用并行导入呢? */

[maclean@rh2 admin]$ impdp maclean/maclean directory=dumps tables=TCACHE content=data_only  dumpfile=tcache2.dmp,tcache3.dmp,tcache4.dmp access_method=external_table parallel=3
........

select sql_text,
       sql_id,
       cpu_time,
       elapsed_time,
       disk_reads,
       buffer_gets,
       rows_processed
  from v$sql
 where sql_text like '%TCACHE%'
   and sql_text not like '%like%';
SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID          CPU_TIME ELAPSED_TIME DISK_READS BUFFER_GETS ROWS_PROCESSED
------------- ---------- ------------ ---------- ----------- --------------
INSERT /*+APPEND+*/  /*+PARALLEL("TCACHE",1)+*/ INTO RELATIONAL("MACLEAN"."TCACHE" ) ("OWNER", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID",
 "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPORARY", "GENERATED", "SECONDARY")     SELECT "OWNE
R", "OBJECT_NAME", "SUBOBJECT_NAME", "OBJECT_ID", "DATA_OBJECT_ID", "OBJECT_TYPE", "CREATED", "LAST_DDL_TIME", "TIMESTAMP", "STATUS", "TEMPO
RARY", "GENERATED", "SECONDARY"     FROM "ET$087A00040001" KU$
2mvs15623ssvy    5910289      5910289          0       73578              0

INSERT /*+ SYS_DL_CURSOR */ INTO RELATIONAL("MACLEAN"."TCACHE") ("OWNER","OBJECT_NAME","SUBOBJECT_NAME","OBJECT_ID","DATA_OBJECT_ID","OBJECT
_TYPE","CREATED","LAST_DDL_TIME","TIMESTAMP","STATUS","TEMPORARY","GENERATED","SECONDARY") VALUES (NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
ULL,NULL,NULL,NULL,NULL)
6tn47a220d34q       1167         1167          0           0              0

/* 似乎因为沿用了APPEND插入模式,插入语句的并行度并未上升 */

/* 因为Oracle不想同我们分享它的小秘密,所以awr和ash报告中都不记录外部表路径插入SQL的相关信息 */

/* 好吧,它有权保持沉默!  */

That's great!

Pending Problem

这个问题发生在今年的1月,用户以操作系统认证形式登陆RAC中的主用实例时发现登陆挂起,但不出现错误。之后应用人员陆续手动杀死服务进程,杀死进程后发现实例可以登录了,应用人员在没有做任何信息转储的情况下重启了数据库,这就造成了我们后期诊断时缺乏必要的信息,也是这个case变成悬案的主要原因。

在实例hang住的一个半小时中告警日志没有任何信息;仅有的有用信息是该实例中diag,pmon,lmd后台进程的trace文件。以下为trace文件:

lmd0 trace:
*** 2010-01-16 11:02:58.106
DUMP LOCAL BLOCKER/HOLDER: block level 5 res [0x10b0005][0xeb5],[TX]
----------resource 0x70000044a76e1a0----------------------
resname       : [0x10b0005][0xeb5],[TX]
Local node    : 1
dir_node      : 1
master_node   : 1
hv idx        : 25
hv last r.inc : 0
current inc   : 20
hv status     : 0
hv master     : 0
open options  : dd
grant_bits    : KJUSERNL KJUSEREX
grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
count         : 1         0         0         0         0         1
val_state     : KJUSERVS_NOVALUE
valblk        : 0x00000000000000000000000000000000 .
access_node   : 1
vbreq_state   : 0
state         : x0
resp          : 70000044a76e1a0
On Scan_q?    : N
Total accesses: 15061
Imm.  accesses: 12545
Granted_locks : 1
Cvting_locks  : 1
value_block:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
GRANTED_Q :
lp 700000488546990 gl KJUSEREX rp 70000044a76e1a0 [0x10b0005][0xeb5],[TX]
  master 1 gl owner 70000048dd08220 possible pid 2928930 xid 2033-0333-0026D22C bast 0 rseq 215 mseq 0 history 0x14951495
  open opt KJUSERDEADLOCK
CONVERT_Q:
lp 700000488546ae0 gl KJUSERNL rl KJUSEREX rp 70000044a76e1a0 [0x10b0005][0xeb5],[TX]
  master 1 gl owner 70000048ee0ed48 possible pid 2040272 xid 2006-0066-0006F888 bast 0 rseq 215 mseq 0 history 0x1495149a
  convert opt KJUSERGETVALUE KJUSERTRCCANCEL
----------enqueue 0x700000488546990------------------------
lock version     : 2242641
Owner node       : 1
grant_level      : KJUSEREX
req_level        : KJUSEREX
bast_level       : KJUSERNL
notify_func      : 0
resp             : 70000044a76e1a0
procp            : 70000048928b8f0
pid              : 2040272
proc version     : 40135
oprocp           : 0
opid             : 0
group lock owner : 70000048dd08220
possible pid     : 2928930
xid              : 2033-0333-0026D22C
dd_time          : 0.0 secs
dd_count         : 0
timeout          : 0.0 secs
On_timer_q?      : N
On_dd_q?         : N
lock_state       : GRANTED
Open Options     : KJUSERDEADLOCK
Convert options  : KJUSERNOQUEUE
History          : 0x14951495
Msg_Seq          : 0x0
res_seq          : 215
valblk           : 0x00000000000000000000000000000000 .
DUMP LOCAL BLOCKER: initiate state dump for TIMEOUT
  possible owner[819.2928930] on resource TX-010B0005-00000EB5
Submitting asynchronized dump request [28]

pmon trace:
*** SESSION ID:(3384.1) 2010-01-16 11:00:11.349
[claim lock for dead process][lp 0x7000004870f7078][p 0x7000004891a79e0.2928930][hist x49514951]

[claim lock for dead process][lp 0x700000488546990][p 0x70000048928b8f0.0][hist x49514951]   <<<< note:this message@2010-01-16 12:35

pmon trace:
*** 2010-01-16 11:02:58.244
Dump requested by process [orapid=6]
REQUEST:custom dump [2] with parameters [6][819][2][2]
. process info of pid[819.2928930] requested by pid[6.3621092]
Dumping process 819.2928930 info:
*** 2010-01-16 11:02:58.244
Dumping diagnostic information for ospid 2928930:
OS pid = 2928930
loadavg : 10.83 10.96 10.91
swap info: free_mem = 21646.73M rsv = 128.00M
           alloc = 235.52M avail = 32768.00M swap_free = 32532.48M
       F S      UID     PID    PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A  orauser 2928930       1 120 120 20 6da7bd510 174068          10:35:28      -  1:50 oracleCRMDB22 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/2928930/ctl): The requested resource is busy.
2928930: oracleCRMDB22 (LOCAL=NO)

lmd0进程的trace文件显示该进程发现了本地存在阻塞超时的问题,[0x10b0005][0xeb5],[TX]资源的拥有者是2928930进程,而pmon进程认为2928930进程可能出于僵死状态,故希望RAC中的diag进程能对该进程做进一步的诊断,diag进程受到pmon的邀请”Dump requested by process [orapid=6]”,尝试”Dumping process 819.2928930 info:”,并利用AIX上的procstack调试工具分析此进程当时的调用栈,但出现了”procstack: write(/proc/2928930/ctl): The requested resource is busy”的错误,在网上搜索该记录的相关信息发现可能是AIX上procstack工具的一个Bug,而跟实例挂起问题关系不大:

Problem summary
****************************************************************
* USERS AFFECTED:
* Users of the procstac command on the 5300-09 Technology Level
* with the bos.perf.proctools fileset at the 5.3.9.0 and 5.3.9.1
* levels.
****************************************************************
* PROBLEM DESCRIPTION:
* The procstac command fails with an error similar to:
*
* open: Permission denied
* procstack: write(/proc/2068724/ctl): Device busy
****************************************************************
* RECOMMENDATION:
* Install APAR IZ47903.
****************************************************************
Problem conclusion
*Modified the code to fix this problelm.

diag进程的跟踪文件还显示2928930进程持有redo copy latch;这个case发生后的一个礼拜恰好有Oracle原厂工程师到该客户单位做巡检,原厂工程师查看了实例挂起期间的ASH报告,发现该时段内实例中redo相关的等待频繁发生,基本可以认定是僵死进程持有”redo copy latch”造成了RAC中单实例hang住。

原厂工程师的结论令客户比较满意,这个case基本可以结束了。但我对这个结论并不完全认同,2928930进程持有的redo copy latch是一个子栓,该类子栓在实例上的总数取决于CPU总数,一般来说等于CPU * 2;该实例所在服务器为IBM的p595配有32个cpu,redo copy latch共有64个,单单一个子栓被hold住会造成整个实例都hang住,且登录挂起吗?

SQL> show parameter cpu_count
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     32

SQL> select count(*) from v$latch_children where name='redo copy';
  COUNT(*)
----------
        64

/*我们在非生产环境测试一下*/

select addr "十六进制地址",
       to_number(addr, 'xxxxxxxxxxxx') "十进制地址",
       gets,
       misses,
       immediate_gets,
       immediate_misses
  from v$latch_children
 where name = 'redo copy';

十六进制 十进制地址       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
-------- ---------- ---------- ---------- -------------- ----------------
2DAB5898  766204056          7          0              0                0
2DAB5818  766203928          5          0              0                0
2DAB5798  766203800          5          0              0                0
2DAB5718  766203672          5          0              0                0
2DAB5698  766203544          5          0              0                0
2DAB5618  766203416          5          0              0                0
2DAB5598  766203288          5          0            511                0
2DAB5518  766203160          5          0         297024              347
8 rows selected.

/*正式测试前的redo copy child latch统计信息,我们还要用到这里的十进制latch地址*/

/*我们尝试手动hold住一个redo copy child latch,并在其他会话中执行一些产生redo的操作*/

session A:
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug call kslgetl 766203160 1;
Function returned 1

/* kslgetl是Oracle内部用get latch的函数,oradebug 可以调用该函数*/

session B:

SQL> select * From v$latchholder;
       PID        SID LADDR    NAME                                                                   GETS
---------- ---------- -------- ---------------------------------------------------------------- ----------
        22        136 2DAB5518 redo copy                                                            297443

另外开三个session,执行一下批量插入数据的匿名块,测试redo是否能够正常产生:
begin
  for i in 1 .. 90000 loop
    insert into tv values (i, i, i);
    commit;
  end loop;
end;
/
PL/SQL procedure successfully completed.

session A:
SQL> oradebug call kslfre 766203160 1;
Function returned 2FD548E0
/*手动释放766203160对应的redo copy 子闩*/

select addr "十六进制地址",
       to_number(addr, 'xxxxxxxxxxxx') "十进制地址",
       gets,
       misses,
       immediate_gets,
       immediate_misses
  from v$latch_children
 where name = 'redo copy';

十六进制 十进制地址       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
-------- ---------- ---------- ---------- -------------- ----------------
2DAB5898  766204056          7          0              0                0
2DAB5818  766203928          5          0              0                0
2DAB5798  766203800          5          0              0                0
2DAB5718  766203672          5          0            745                0
2DAB5698  766203544          5          0         122370                5
2DAB5618  766203416          5          0         176712                5
2DAB5598  766203288          5          0         144441                6
2DAB5518  766203160          6          0         297443              373
/*可以看到虽然8个子闩中一个被手动hold住,但对实例的影响远不足以到达hang所需要的条件*/

虽然能证明单个redo copy latch被僵死进程长期持有和实例hang住没有必然的因果联系,但因为缺少相关的systemstate和hanganalyze转储信息,我们也无法进一步做出推断。
这个case发生后的几个月,我无意中在MOS上发现了以下Bug note:

Diagnostic collection may hang or crash the instance
Description

Sometimes while the DIAG process is collecting diagnostics
the database may hang or crash.
This is more likely to occur in a RAC or ASM environment .

You may be seeing this problem if the database crashes or hangs just
after you see the following text in a trace file:
“Dumping diagnostic information for X”
where X is the process that hung or terminated.

This fix replaces invokations of a port specific OS debugger
(like gdb) with Oracle ORADEBUG calls to gather the required
diagnostics.

Note:
For HP Itanium fixes for this issue in 10.2.0.3/10.2.0.4 use bug 8767023
instead of this bug#.

For AIX fixes for this issue in 10.2.0.3/10.2.0.4 use bug 8929701
instead of this bug#.

Workaround
>=10.2.0.3 only: Set init.ora parameter “_ksb_disable_diagpid” = TRUE
but this will disable automatic diagnostics collection.

当diag诊断进程调用系统debugger工具调试问题进程时可能出现实例hang或crash的,且该问题多发于RAC或ASM环境中,workaroud的方法是动态设置_ksb_disable_diagpid为TRUE,可以避免diag诊断进程对问题进程的调试,但这些调试信息往往对我们又是有用的,这倒有些《种树郭橐驼说》中”既然已,勿动勿虑,去不复顾”的味道。

好了这个问题,仍旧pending着,是个悬案。

Brain Split?

真正出现脑裂的几率并不高,但确实让我们碰上了。2个月前为一套AIX6.1上的10.2.0.4双节点RAC系统做故障测试,主要内容是拔除RAC interconnect网线,测试CRS能否正确处理私有网络挂掉的情况。

 

正式测试时发现2台主机都没有重启,而两端的CSS都认为对方节点已经down了。这就造成2个节点都以为自身是幸存者,也就是我们说的脑裂(brain split),此时实例一般会因为LMON进程的缘故而hang住。

 

我们来比对当时2个节点上的日志进一步分析:

 

STEP 1 :20:41:19物理拔出网线后,节点间无法正常通信,进入misscount倒计时600s
节点1:
[    CSSD]2010-06-22 20:41:21.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:22.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(3) checkin(s)
.........
[    CSSD]2010-06-22 20:51:17.956 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:18.963 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:19.963 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis2 (2), flags 0x0001, state 3, wt4c 0

/* 节点1上完成倒计时后开始驱逐节点2*/

节点2:
[    CSSD]2010-06-22 20:41:19.598 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:20.599 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(3) checkin(s)
......................
[    CSSD]2010-06-22 20:51:15.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:16.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:17.878 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis1 (1), flags 0x0001, state 3, wt4c 0

/*同样的节点2也是在10分钟后的51分开始驱逐节点1*/

STEP 2: 2个节点读取磁盘心跳信息(clssnmReadDskHeartbeat),且认为对方节点已经down了

节点1:
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:20.965 [2057] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:21.714 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4185) LATS(1628594178) Disk lastSeqNo(4185)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2010-06-22 20:51:22.718 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4186) LATS(1628595183) Disk lastSeqNo(4186)
[    CSSD]2010-06-22 20:51:22.964 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:51:23.722 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4187) LATS(1628596186) Disk lastSeqNo(4187)
[ CSSD]2010-06-22 20:51:24.724 [1543] >TRACE: clssnmReadDskHeartbeat: node(2) is down.
rcfg(3) wrtcnt(4188) LATS(1628597189) Disk lastSeqNo(4188)
.............................
[    CSSD]2010-06-22 20:59:49.953 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4692) LATS(1629102418) Disk lastSeqNo(4692)
[    CSSD]2010-06-22 20:59:50.057 [3085] >TRACE:   clssgmPeerDeactivate: node 2 (gis2), death 0, state 0x80000001 connstate 0xf
[    CSSD]2010-06-22 20:59:50.104 [1029] >TRACE:   clssnm_skgxncheck: CSS daemon failed on node 2
[    CSSD]2010-06-22 20:59:50.382 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112a6c5b0) proc(112a5a190) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Start
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Evicting node 2, birth 1, death 3, killme 1
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(1629103696)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(2) down, LATS(1629102418),timeout(1278)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: Ack message type (15)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSendUpdate: syncSeqNo(3)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2010-06-2F1.232 [2057] >TRACE:   clssnmDeactivateNode: node 0 () left cluster

[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 1, state (3/3) unique (1277207505/1277207505) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (1277206874/1277206874) prevConuni(1277206874) birth (1/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 2 (gis2) left cluster

[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: SYNC(3) from node(1) completed
[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: NODE 1 (gis1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmReconfigThread:  started for reconfig (3)
[    CSSD]2010-06-22 20:59:51.310 [4114] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_gisdb) birth(1/0)
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DBGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DGGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DAALL_DB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGGISDBALL type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRCSN_SRV_gisdb1 type 3
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 3
[    CSSD]2010-06-22 20:59:51.311 [3085] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishMasterNode: MASTER for 3 is node(1) birth(2)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmChangeMasterNode: requeued 1 RPCs
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2010-06-22 20:59:51.312 [4114] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes

[    CSSD]CLSS-3001: local node number 1, master node number 1

/* 节点1在hearbeat 8分钟左右后认为CSS daemon failed on node 2,正式认为Node 2离开了集群,并完成了reconfiguration*/

节点2:
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:18.892 [2057] >TRACE:   clssnmSendVoteInfo: node(2) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:19.287 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3548) LATS(351788040) Disk lastSeqNo(3548)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[ CSSD]2010-06-22 20:51:20.288 [1543] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3549) LATS(351789041) Disk lastSeqNo(3549)
[    CSSD]2010-06-22 20:51:21.308 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3550) LATS(351790062) Disk lastSeqNo(3550)
...........................
[    CSSD]2010-06-22 20:59:46.122 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4051) LATS(352294875) Disk lastSeqNo(4051)
[    CSSD]2010-06-22 20:59:46.341 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112947c70) proc(112946f90) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:46.355 [2314] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50210) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50cd0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a536f0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a4eb90) proc (112a4eef0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a69250) proc (112a67e10)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112946050) proc (112945e50)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: waited 0 seconds on 6 IO capable clients
[    CSSD]2010-06-22 20:59:46.494 [2314] >WARNING: clssgmClientShutdown: graceful shutdown completed.
[    CSSD]2010-06-22 20:59:47.130 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4052) LATS(352295883) Disk lastSeqNo(4052)
[    CSSD]2010-06-22 21:34:40.167 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.

/* node2 也正确进行了heartbeat,并认为node(1) is down,最后被手动关闭;之后还原了网络故障,在21:34时CSS重新启动*/

 

如果你仔细看以上日志的话,你大概会找出”Oracle Database 10g CSS Release 10.2.0.1.0″的记录;这套RAC不是10.2.0.4的吗,为什么CSS还是10.2.0.1版本的呢,事后调查才发觉是安装该套系统的施工方国内某X码工程师在给CRS打补丁的时候忘记运行最后的root102.sh脚本了,该脚本将更新OCR/Voting disk及实际的CRS binary文件等,如果补丁安装结束后没有运行该脚本则升级不会有任何效果,而只会更新oraInventory中的信息。

 

刚开始时哪位X码的工程师抵死不肯承认忘记了运行脚本,而实际上在AIX 6.1上打10.2.0.4 CRS的patch是需要为oracle用户赋特有的权限的,这一点不同于AIX 5.3上,即:

 

chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH oracle
/*进一步检查*/
lsuser -f oracle | grep capabilities
        capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH

 

如果未对oracle用户赋以上权限则运行root102.sh脚本时将报错。另一个判断的标志是pre10204/pre10205目录,如果运行过root102.sh脚本的话$ORA_CRS_HOME/install目录下会多出一个形如pre$VERSION的目录,没有的话一般就是没有运行过脚本,当然也有可能是时候删除了(不建议删除)。

 

了解到以上信息后对此次脑裂的追根溯源就要简单的多了,版本10201上的CRS可以说Bug众多的,从10201-10204期间CRS加入了不少新的参数和机制,在MOS上搜索关键词”brain split CSS”可以找到以下案例:

Hdr: 8293652 10.2.0.3 PCW 10.2.0.3 OSD PRODID-5 PORTID-46
Abstract: CSS CANNOT HANDLE SPLIT-BRAIN AND DB INSTANCE RECEIVES ORA-29740
PROBLEM:
——–
config:
2-node RAC: Node1 (pdb01) and Node2 (pdb02)
There’s no time difference between two nodes.

pdb02 got ORA-29740 and terminated at “Tue Feb 17 12:13:06 2009”
ORA-29740 occured with reason 1.
After ORA-29740 happened, the instance won’t be able to start
until rebooting OS.
After rebooting OS, everything was fine and instances were up.

DIAGNOSTIC ANALYSIS:
——————–
clssnmReadDskHeartbeat: node(2) is down. rcfg(8) wrtcnt(2494425)
LATS(1205488794) Disk lastSeqNo(2494425)
nodes
clssgmMasterSendDBDone: group/lock status synchronization complete
nodes

WORKAROUND:
———–
none

RELATED BUGS:
————-

REPRODUCIBILITY:
—————-
once at ct’s env.

TEST CASE:
———-

STACK TRACE:
————

SUPPORTING INFORMATION:
———————–

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-

DIAL-IN INFORMATION:
——————–

IMPACT DATE:
————

Does ct apply any CRS(bundle) patch ?

When problem happen, cssd can’t connect each other via interconnect,
but both cssd can do heartbeat to voting disk.
However, both cssd consider that “I’m survivor”.
Looking into node 1 cssd.
* 12:02:30.856 – Initiated sync
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms

* Checking voting disk, and find node2 is still voting and living.
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo:
Checking disk info…
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo:
node(2) timeout(20) state_network(5) state_disk(3) misstime(0)
[ CSSD]2009-02-17 12:02:31.878 [1262557536] >TRACE: clssnmCheckDskInfo:
node(2) disk HB found, network state 5, disk state(3) misstime(1010)

* Compared cluster size and confirmed it can survive.
[ CSSD]2009-02-17 12:02:34.885 [1262557536] >TRACE: clssnmCheckDskInfo:
node 2, iz-pdb02, state 5 with leader 2 has smaller cluster size 1;
my cluster size 1 with leader 1

* Then finished
[ CSSD]2009-02-17 12:02:34.886 [1262557536] >TRACE: clssnmDoSyncUpdate:
Sync Complete!
*** 03/08/09 11:23 pm ***
Looking into node 2 cssd log.

* 12:02:20.647 – initiated sync protocol
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms

* Checking disk and find node1 does not do disk heartbeart for 59690 ms.
it would have waited for misscount and considered node 1 is dead
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo:
Checking disk info…
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo:
node(1) timeout(59690) state_network(5) state_disk(3) misstime(61000)

* node2 is the only active member of cluster, finished.
[ CSSD]2009-02-17 12:02:22.723 [1262557536] >TRACE: clssnmDoSyncUpdate:
Sync Complete!
*** 03/08/09 11:45 pm ***
So, strange point is, node 2 cssd says node 1 cssd didn’t do
disk heartbeat for 60 seconds.

Looking into node1 cssd log just before initiating sync. We see 87sec gap.
———————————–
[ CSSD]2009-02-17 12:00:19.354 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x784d80) proc(0x7749b0) pid(14746) proto(10:2:1:1)
[ CSSD]2009-02-17 12:00:33.338 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x7d8620) proc(0x75cfb0) pid() proto(10:2:1:1)
[ CSSD]2009-02-17 12:01:03.688 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x76a390) proc(0x75cfb0) pid(13634) proto(10:2:1:1)
[ CSSD]2009-02-17 12:02:30.855 [1168148832] >WARNING: clssnmDiskPMT:
sltscvtimewait timeout (69200)
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >WARNING: clssnmeventhndlr:
Receive failure with node 2 (iz-pdb02), state 3, con(0x72b980),
probe((nil)), rc=10
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >TRACE: clssnmDiscHelper:
iz-pdb02, node(2) connection failed, con (0x72b980), probe((nil))
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7

As an interesting point, clssgmClientConnectMsg does not show message,
but nm polling thread/disk ping thread does not warn timeout.
(Usually it should write message first at 50% of timeout = 30 sec)

And “sltscvtimewait timeout (69200)” message means, DiskPingMonitor thread
does not run for a 69200 ms whereas it just wants to sleep 1 second.

These suggest, cssd does not scheduled about 70 seconds on node 1.
I don’t see any log from DiskPingThread, but I assume it is suspended
at some point also, and back to work after 70 seconds.
Please check OS message file to see any interesting error is recorded.
To prevent this issue, keep watching OS performance to see if any
extreme high load does not happen.

Recommended solution it to go to 10.2.0.4 and use oprocd so that
we can expect oprocd kill node1 in such case.

 

 

上述案例同样是在”cssd can’t connect each other via interconnect”的状况下出现了”I’m survivor”的脑裂问题,MOS的建议是升级到10204后oprocd进程可以阻止这样的惨剧发生。

 

该问题最后通过升级到10.2.0.5解决了,这个case告诉我们在中国的it大环境内,有时候我们不得不亲力亲为地关心每一个细节,就拿这次来说我一开始也没发现升级没完成的情况,后来还是同事提醒了我;因为这是一个非常低级错误,如果施工方的X码工程师仔仔细细地按照他们下发的文档按部就班亦或者能留意一下升级时的图形窗口中的说明的话,这个问题都不会发生!而实际上不仅仅是此套系统,连带着其他2套系统也是这位X码工程师安装升级的,这几套系统在之后的故障测试时都发现了同样的问题。

事实告诉我们,细节决定成败!

ORA-00600: internal error code, arguments: [kdsgrp1] example

一套Linux x86-64上的11.2.0.1系统,alert日志中出现ORA-00600: internal error code, arguments: [kdsgrp1]错误,相关trace的部分内容如下:

Dump file /u01/app/oracle/diag/rdbms/utdw016/utdw016b/incident/incdir_276035/utdw016b_ora_5756_i276035.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/112utdw016
System name:	Linux
Node name:	x42k601
Release:	2.6.18-164.el5
Version:	#1 SMP Tue Aug 18 15:51:48 EDT 2009
Machine:	x86_64
Instance name: utdw016b
Redo thread mounted by this instance: 2
Oracle process number: 52
Unix process pid: 5756, image: oracle@x42k601

*** 2010-07-28 11:08:35.394
*** SESSION ID:(577.11818) 2010-07-28 11:08:35.394
*** CLIENT ID:() 2010-07-28 11:08:35.394
*** SERVICE NAME:(utdw016-edw) 2010-07-28 11:08:35.394
*** MODULE NAME:(pmdtm@x42k604-zone1 (TNS V1-V3)) 2010-07-28 11:08:35.394
*** ACTION NAME:() 2010-07-28 11:08:35.394

Dump continued from file: /u01/app/oracle/diag/rdbms/utdw016/utdw016b/trace/utdw016b_ora_5756.trc
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 276035 (ORA 600 [kdsgrp1]) ========

*** 2010-07-28 11:08:35.395
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=9hfdcgzzgmgvp) -----
DELETE FROM EIMABS.FACT_OPEN_SALES_ORDER WHERE DIM_SNAPSHOT_TYPE_ID = (SELECT DIM_SNAPSHOT_TYPE_ID FROM EIMABS.DIM_SNAPSHOT_TYPE WHERE SNAPSHOT_TYPE_CODE='FUTURE')
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----
object line object
handle number name
0x5578e0528 6 anonymous block

----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36 call kgdsdst() 000000000 ? 000000000 ?
7FFFA85D7418 ? 000000001 ?
7FFFA85DB918 ? 000000000 ?
ksedst1()+98 call skdstdst() 000000000 ? 000000000 ?
7FFFA85D7418 ? 000000001 ?
000000000 ? 000000000 ?
ksedst()+34 call ksedst1() 000000000 ? 000000001 ?
7FFFA85D7418 ? 000000001 ?
000000000 ? 000000000 ?
dbkedDefDump()+2736 call ksedst() 000000000 ? 000000001 ?
7FFFA85D7418 ? 000000001 ?
000000000 ? 000000000 ?
ksedmp()+36 call dbkedDefDump() 000000003 ? 000000002 ?
7FFFA85D7418 ? 000000001 ?
000000000 ? 000000000 ?
ksfdmp()+64 call ksedmp() 000000003 ? 000000002 ?
7FFFA85D7418 ? 000000001 ?
000000000 ? 000000000 ?
dbgexPhaseII()+1764 call ksfdmp() 000000003 ? 000000002 ?
7FFFA85D7418 ? 000000001 ?
000000000 ? 000000000 ?
dbgexProcessError() call dbgexPhaseII() 2B30E1B74718 ? 2B30E1B795F8 ?
+2279 7FFFA85E3708 ? 000000001 ?
000000000 ? 000000000 ?
dbgeExecuteForError call dbgexProcessError() 2B30E1B74718 ? 2B30E1B795F8 ?
()+83 000000001 ? 000000000 ?
7FFF00000000 ? 000000000 ?
dbgePostErrorKGE()+ call dbgeExecuteForError 2B30E1B74718 ? 2B30E1B795F8 ?
1615 () 000000001 ? 000000001 ?
000000000 ? 000000000 ?
dbkePostKGE_kgsf()+ call dbgePostErrorKGE() 000000000 ? 2B30E1B87F38 ?
63 000000258 ? 2B30E1B795F8 ?
000000000 ? 000000000 ?
kgeadse()+383 call dbkePostKGE_kgsf() 00A9B2EE0 ? 2B30E1B87F38 ?
000000258 ? 2B30E1B795F8 ?
000000000 ? 000000000 ?
kgerinv_internal()+ call kgeadse() 00A9B2EE0 ? 2B30E1B87F38 ?
45 000000258 ? 000000000 ?
000000000 ? 000000000 ?
kgerinv()+33 call kgerinv_internal() 00A9B2EE0 ? 2B30E1B87F38 ?
A85DBCA000000000 ?
000000258 ? 000000000 ?
000000000 ?
kgeasnmierr()+143 call kgerinv() 00A9B2EE0 ? 2B30E1B87F38 ?
A85DBCA000000000 ?
000000000 ? 000000000 ?
000000000 ?
kdsgrp1_dump()+832 call kgeasnmierr() 00A9B2EE0 ? 2B30E1B87F38 ?
A85DBCA000000000 ?
000000000 ? 000000003 ?
000000003 ?
kdsgrp1()+19 call kdsgrp1_dump() 00A9B2EE0 ? 2B30E1B87F38 ?
7FFFA85E4360 ? 000000000 ?
000000003 ? 000000003 ?
kdsgrp()+6346 call kdsgrp1() 00A9B2EE0 ? 2B30E1B87F38 ?
7FFFA85E4360 ? 000000000 ?
000000003 ? 000000003 ?
qetlbr()+211 call kdsgrp() 2B30E2018620 ? 000000000 ?
2B30E2018620 ? 000000000 ?
000000003 ? 000000003 ?
qertbFetchByRowID() call qetlbr() 2B30E2018620 ? 00000001A ?
+850 000000000 ? 0000000C8 ?
00400F521 ? 000000000 ?
qergiFetch()+276 call qertbFetchByRowID() 000000000 ? 2B30E2018518 ?
000000000 ? 000000000 ?
000000001 ? 000000000 ?
qerdlFetch()+678 call qergiFetch() 511B78238 ? 2B30E20187F8 ?
000000000 ? 000000000 ?
000000001 ? 000000000 ?
delexe()+952 call qerdlFetch() 511B78090 ? 2B30E2018900 ?
000000000 ? 000000000 ?
000007FFF ? 000000000 ?
opiexe()+14449 call delexe() 000000000 ? 000000000 ?
000000000 ? 2B30E27E1E08 ?
2B30E1FCA000 ? 000000000 ?
opipls()+3098 call opiexe() 000000004 ? 000000005 ?
7FFFA85E6708 ? 2B30E27E1E08 ?
2B30E1FCA000 ? 000000000 ?
opiodr()+1149 call opipls() 000000066 ? 000000007 ?
7FFFA85E7F90 ? 000000000 ?
2B30E2701C28 ? 300000000 ?
__PGOSF141_rpidrus( call opiodr() 000000066 ? 000000007 ?
)+206 7FFFA85E7F90 ? 000000003 ?
008C71EF0 ? 300000000 ?
skgmstack()+148 call __PGOSF141_rpidrus( 7FFFA85E79A0 ? 000000007 ?
) 7FFFA85E7F90 ? 000000003 ?
008C71EF0 ? 300000000 ?
rpiswu2()+612 call skgmstack() 7FFFA85E7978 ? 00A9B2B20 ?
00000F618 ? 008468D40 ?
7FFFA85E79A0 ? 300000000 ?
rpidrv()+1342 call rpiswu2() 55CE5B2F8 ? 000000078 ?
55CE5B37C ? 000000009 ?
2B30E1FEEF98 ? 000000078 ?
psddr0()+459 call rpidrv() 000000003 ? 000000066 ?
7FFFA85E7F90 ? 000000039 ?
2B30E1FEEF98 ? 000000078 ?
psdnal()+457 call psddr0() 000000003 ? 000000066 ?
7FFFA85E7F90 ? 000000031 ?
2B30E1FEEF98 ? 000000078 ?
pevm_EXECC()+376 call psdnal() 7FFFA85E9370 ? 7FFFA85E9580 ?
7FFFA85E7F90 ? 2B30E2742370 ?
2B30E1FEEF98 ? 000000078 ?
pfrinstr_EXECC()+75 call pevm_EXECC() 2B30E273F0A0 ? 7FFFA85E9580 ?
000000020 ? 2B30E2742370 ?
2B30E1FEEF98 ? 000000078 ?
pfrrun_no_tool()+63 call pfrinstr_EXECC() 2B30E273F0A0 ? 51148B482 ?
2B30E273F110 ? 2B30E2742370 ?
2B30E1FEEF98 ? 2B3000000020 ?
pfrrun()+1025 call pfrrun_no_tool() 2B30E273F0A0 ? 51148B482 ?
2B30E273F110 ? 2B30E2742370 ?
2B30E1FEEF98 ? 2B3000000020 ?
plsql_run()+769 call pfrrun() 2B30E273F0A0 ? 000000000 ?
2B30E273F110 ? 7FFFA85E9370 ?
2B30E1FEEF98 ? 503AC23E2 ?
peicnt()+296 call plsql_run() 2B30E273F0A0 ? 000000001 ?
000000000 ? 7FFFA85E9370 ?
2B30E1FEEF98 ? 000000000 ?
kkxexe()+520 call peicnt() 7FFFA85E9370 ? 2B30E273F0A0 ?
2B30E2418960 ? 7FFFA85E9370 ?
2B30E2486008 ? 000000000 ?
opiexe()+14796 call kkxexe() 2B30E2013538 ? 2B30E273F0A0 ?
000000000 ? 7FFFA85E9370 ?
2B30E2486008 ? 000000000 ?
kpoal8()+2283 call opiexe() 000000049 ? 000000003 ?
7FFFA85EA8F8 ? 7FFFA85E9370 ?
2B30E2486008 ? 000000000 ?
opiodr()+1149 call kpoal8() 00000005E ? 00000001C ?
7FFFA85ED9D0 ? 7FFFA85E9370 ?
2B30E2486008 ? 5E00000001 ?
ttcpip()+1251 call opiodr() 00000005E ? 00000001C ?
7FFFA85ED9D0 ? 000000000 ?
008C71DB0 ? 5E00000001 ?
opitsk()+1628 call ttcpip() 00A9C6450 ? 0086CCB08 ?
7FFFA85ED9D0 ? 000000000 ?
7FFFA85ED430 ? 7FFFA85EDBD4 ?
opiino()+953 call opitsk() 00A9C6450 ? 000000000 ?
7FFFA85ED9D0 ? 000000000 ?
7FFFA85ED430 ? 7FFFA85EDBD4 ?
opiodr()+1149 call opiino() 00000003C ? 000000004 ?
7FFFA85EF0C8 ? 000000000 ?
7FFFA85ED430 ? 7FFFA85EDBD4 ?
opidrv()+565 call opiodr() 00000003C ? 000000004 ?
7FFFA85EF0C8 ? 000000000 ?
008C71860 ? 7FFFA85EDBD4 ?
sou2o()+98 call opidrv() 00000003C ? 000000004 ?
7FFFA85EF0C8 ? 000000000 ?
008C71860 ? 7FFFA85EDBD4 ?
opimai_real()+128 call sou2o() 7FFFA85EF0A0 ? 00000003C ?
000000004 ? 7FFFA85EF0C8 ?
008C71860 ? 7FFFA85EDBD4 ?
ssthrdmain()+209 call opimai_real() 000000002 ? 7FFFA85EF290 ?
000000004 ? 7FFFA85EF0C8 ?
008C71860 ? 7FFFA85EDBD4 ?
main()+196 call ssthrdmain() 000000002 ? 7FFFA85EF290 ?
000000001 ? 000000000 ?
008C71860 ? 7FFFA85EDBD4 ?
__libc_start_main() call main() 000000002 ? 7FFFA85EF438 ?
+244 000000001 ? 000000000 ?
008C71860 ? 7FFFA85EDBD4 ?
_start()+36 call __libc_start_main() 0009D3E88 ? 000000002 ?
7FFFA85EF428 ? 000000000 ?
008C71860 ? 000000002 ?

错误由此删除语句引起:

DELETE FROM EIMABS.FACT_OPEN_SALES_ORDER WHERE 
DIM_SNAPSHOT_TYPE_ID = 
(SELECT DIM_SNAPSHOT_TYPE_ID FROM EIMABS.DIM_SNAPSHOT_TYPE WHERE SNAPSHOT_TYPE_CODE='FUTURE')

其中EIMABS.DIM_SNAPSHOT_TYPE是一个普通堆表,而EIMABS.FACT_OPEN_SALES_ORDER是一个分区表,针对该分区表的各个分区进行了validate structure分析:

SQL> @?/rdbms/admin/utlvalid
Table created.

SQL> set time on;

23:32:51 SQL> analyze table EIMABS.DIM_SNAPSHOT_TYPE validate structure cascade;
Table analyzed.

00:30:46 SQL> analyze table EIMABS.FACT_OPEN_SALES_ORDER partition (P_2455405) validate structure cascade into invalid_rows;
analyze table EIMABS.FACT_OPEN_SALES_ORDER partition (P_2455405) validate structure cascade into invalid_rows
*
ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file

推测可能是P_2455405分区相关的索引可能出现了讹误,尝试重建该表上的15个索引,再次测试相关应用发现问题仍未解决。此次产生的trace文件:

Dump file /u01/app/oracle/diag/rdbms/utdw016/utdw016a/incident/incdir_310187/utdw016a_ora_26473_i310187.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/112utdw016
System name:	Linux
Node name:	x42k600
Release:	2.6.18-164.el5
Version:	#1 SMP Tue Aug 18 15:51:48 EDT 2009
Machine:	x86_64
Instance name: utdw016a
Redo thread mounted by this instance: 1
Oracle process number: 62
Unix process pid: 26473, image: oracle@x42k600

*** 2010-07-30 04:10:19.931
*** SESSION ID:(1993.31898) 2010-07-30 04:10:19.931
*** CLIENT ID:() 2010-07-30 04:10:19.931
*** SERVICE NAME:(utdw016-edw) 2010-07-30 04:10:19.931
*** MODULE NAME:(pmdtm@x42k604-zone2 (TNS V1-V3)) 2010-07-30 04:10:19.931
*** ACTION NAME:() 2010-07-30 04:10:19.931

Dump continued from file: /u01/app/oracle/diag/rdbms/utdw016/utdw016a/trace/utdw016a_ora_26473.trc
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 310187 (ORA 600 [kdsgrp1]) ========

*** 2010-07-30 04:10:19.932
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=9hfdcgzzgmgvp) -----
DELETE FROM EIMABS.FACT_OPEN_SALES_ORDER WHERE DIM_SNAPSHOT_TYPE_ID = (SELECT DIM_SNAPSHOT_TYPE_ID FROM EIMABS.DIM_SNAPSHOT_TYPE WHERE SNAPSHOT_TYPE_CODE='FUTURE')
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----

SO: 0x5334781d0, type: 52, owner: 0x55d2646b8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
      proc=0x55caf0778, name=transaction, file=ktccts.h LINE:388, pg=0
     (trans) flg = 0x00001e03, flg2 = 0x000c0000, flg3 = 0x00000000, prx = (nil), ros = 2147483647
     flg  = 0x00001e03: ALC TRN VUS VID CHG USN
     flg2 = 0x000c0000: PGA NIP
     flg3 = 0x00000000:
     bsn = 0x8419 bndsn = 0x841b spn = 0x85c7
     efd = 15 DID:
     file:kta.c lineno:1665
     parent xid: 0x0000.000.00000000
     env: (scn: 0x0a19.a0b7c9d4  xid: 0x0021.017.00028415  uba: 0x00e00545.0438.12  statement num=0  parent xid: xid: 0x0000.000.00000000  scn: 0x0a19.a0b7c9d9 0sch: scn: 0x0000.00000000  mascn: (scn: 0x0000.00000000)
     cev: (spc = 11546  arsp = 0x5335450a8  ubkds (ubk:tsn: 2 rdba: 0x00e00545 flag:0x8 hdl:(nil) addr:0x35c50c014)  useg tsn: 2 rdba: 0x00e2c990
           hwm uba: 0x00e00545.0438.12  col uba: 0x00000000.0000.00
           num bl: 3 bk list: 0x53db19518)
           cr opc: 0x0 spc: 11546 uba: 0x00e00545.0438.12
     ccbstg: 0x00000000
     (enqueue) TX-00210017-00028415	DID: 0001-003E-00000C70
     lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x47
     mode: X, lock_flag: 0x0, lock: 0x533478248, res: 0x5414c2da0
     own: 0x55d2646b8, sess: 0x8b37718, proc: 0x55caf0778, prv: 0x5414c2db0
     slk: (nil)
      xga: (nil), heap: UGA
     Trans IMU st: 0 Pool index 65535, Redo pool 0x533478980, Undo pool 0x533478a68
     Redo pool range [0x2b13aacccb90 0x2b13aacccb90 0x2b13aacd1390]
     Undo pool range [0x2b13aacc8390 0x2b13aacc8390 0x2b13aacccb90]
      chnf control flags 0x0         CHNF hwm uba uba: 0x00000000.0000.00        ----------------------------------------
       SO: 0x53db19518, type: 51, owner: 0x5334781d0, flag: -/-/-/0x00 if: 0x3 c: 0x3

SO: 0x55d2646b8, type: 4, owner: 0x55caf0778, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
    proc=0x55caf0778, name=session, file=ksu.h LINE:11467, pg=0
   (session) sid: 1993 ser: 31898 trans: 0x5334781d0, creator: 0x55caf0778
             flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
             flags2: (0x40008) -/-
             DID: , short-term DID:
             txn branch: (nil)
             oct: 7, prv: 0, sql: 0x5534eae70, psql: 0x55a4ff230, user: 120/EIMABS
   ksuxds FALSE at location: 0
   service name: utdw016-edw
   client details:
     O/S info: user: infom, term: , ospid: 21955
     machine: x42k604-zone2 program: pmdtm@x42k604-zone2 (TNS V1-V3)
     application name: pmdtm@x42k604-zone2 (TNS V1-V3), hash value=4200765728
   Current Wait Stack:
     Not in wait; last wait ended 1.028093 sec ago
   Wait State:
     fixed_waits=0 flags=0x21 boundary=(nil)/-1
   Session Wait History:
       elapsed time of 1.028173 sec since last wait
    0: waited for 'Disk file operations I/O'
       FileOperation=0x5, fileno=0x1, filetype=0x3
       wait_id=4666 seq_num=4860 snap_id=2
       wait times: snap=0.000039 sec, exc=0.000181 sec, total=0.000687 sec
       wait times: max=infinite
       wait counts: calls=0 os=0
       occurred after 0.000000 sec of elapsed time
    1: waited for 'ASM file metadata operation'
       msgop=0x12, locn=0x0, =0x0
       wait_id=4667 seq_num=4859 snap_id=2
       wait times: snap=0.000002 sec, exc=0.000005 sec, total=0.000506 sec
       wait times: max=infinite
       wait counts: calls=0 os=0
       occurred after 0.000000 sec of elapsed time
    2: waited for 'KSV master wait'
       =0x0, =0x0, =0x0
       wait_id=4668 seq_num=4858 snap_id=1
       wait times: snap=0.000501 sec, exc=0.000501 sec, total=0.000501 sec
       wait times: max=infinite
       wait counts: calls=1 os=1
       occurred after 0.000000 sec of elapsed time
    3: waited for 'ASM file metadata operation'
       msgop=0x12, locn=0x0, =0x0
       wait_id=4667 seq_num=4857 snap_id=1
       wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec
       wait times: max=infinite
       wait counts: calls=0 os=0
       occurred after 0.000000 sec of elapsed time
    4: waited for 'Disk file operations I/O'
       FileOperation=0x5, fileno=0x1, filetype=0x3
       wait_id=4666 seq_num=4856 snap_id=1
       wait times: snap=0.000142 sec, exc=0.000142 sec, total=0.000142 sec
       wait times: max=infinite
       wait counts: calls=0 os=0
......

Block header dump:  0x0400f521 Data

Block header dump:  0x2317a579 Index

Block header dump:  0x2317a577 " "

Block header dump:  0x048c8c3c " "

Block header dump:  0x2317a1bb " "

Block header dump:  0x048c78ad " "

Block header dump:  0x0400f51f Data

Block header dump:  0x2317a33b Index

Block header dump:  0x048c78f3 " "

Block header dump:  0x07541a44 " "

Block header dump:  0x07577b92 " "

Block header dump:  0x078a4eb7 " "

Block header dump:  0x2317a4b7 " "

Block header dump:  0x040fe621 " "

Block header dump:  0x2317a47b " "

同时应用人员发现此系统中另一SQL语句也会导致ORA-600[kdsgrp1]错误的出现,如下:

SELECT
F.DIM_BIZ_SECTOR_ID AS DIM_BIZ_SECTOR_ID,
F.DW_REVENUE_ID ,
F.DIM_ENT_ITEM_ID AS DIM_ENT_ITEM_ID,
F.DIM_REV_PCH_KEY_ACCT_ID AS DIM_PCH_KEY_ACCOUNT_ID,
CT.DIM_PGH_COUNTRY_GROUP_ID AS DIM_PGH_COUNTRY_GROUP_ID,
F.DIM_REVENUE_DATE_ID AS DIM_FORECAST_DATE_ID,
NEXT_DAY(SYSDATE-4.000000000000000,'FRI') AS SNAPSHOT_DATE,
F.REVENUE_DT AS FORECAST_DATE,
F.REVENUE_QTY AS REVENUE_QTY
FROM
EIMABS.FACT_REVENUE F,
EIMABS.DIM_BIZ_SECTOR B,
EIMABS.DIM_ENT_ITEM E,
EIMABS.DIM_PCH_KEY_ACCOUNT PCH,
EIMABS.DIM_COUNTRY CT,
EIMABS.DIM_FISCAL_CAL_DATE CAL
WHERE
F.DIM_BIZ_SECTOR_ID = B.DIM_BIZ_SECTOR_ID
AND F.DIM_ENT_ITEM_ID = E.DIM_ENT_ITEM_ID
AND F.DIM_REV_PCH_KEY_ACCT_ID = PCH.DIM_PCH_KEY_ACCOUNT_ID
AND F.DIM_ULT_DEST_COUNTRY_ID = CT.DIM_COUNTRY_ID
AND UPPER(B.BIZ_SECTOR_CODE) = 'MDB'
AND F.DIM_REVENUE_DATE_ID = CAL.DIM_CALENDAR_DATE_ID
AND CAL.FISCAL_MONTH_NUM IN
(SELECT FISCAL_MONTH_NUM FROM EIMABS.DIM_FISCAL_CAL_MONTH
WHERE DIM_CALENDAR_MONTH_ID IN (
(SELECT B.DIM_CALENDAR_MONTH_ID -1 FROM EIMABS.DIM_FISCAL_CAL_DATE A , EIMABS.DIM_FISCAL_CAL_MONTH B
WHERE CALENDAR_DT = TRUNC(SYSDATE) AND A.DIM_CALENDAR_MONTH_ID = B.DIM_CALENDAR_MONTH_ID),
(SELECT B.DIM_CALENDAR_MONTH_ID -2 FROM EIMABS.DIM_FISCAL_CAL_DATE A , EIMABS.DIM_FISCAL_CAL_MONTH B
WHERE CALENDAR_DT = TRUNC(SYSDATE) AND A.DIM_CALENDAR_MONTH_ID = B.DIM_CALENDAR_MONTH_ID)));

ERROR at line 12:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [],
[], [], [], [], []

/*相关trace文件如下*/

Trace file /u01/app/oracle/diag/rdbms/utdw016/utdw016a/trace/utdw016a_ora_26473.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/112utdw016
System name:	Linux
Node name:	x42k600
Release:	2.6.18-164.el5
Version:	#1 SMP Tue Aug 18 15:51:48 EDT 2009
Machine:	x86_64
Instance name: utdw016a
Redo thread mounted by this instance: 1
Oracle process number: 62
Unix process pid: 26473, image: oracle@x42k600


*** 2010-07-30 04:10:18.249
*** SESSION ID:(1993.31898) 2010-07-30 04:10:18.249
*** CLIENT ID:() 2010-07-30 04:10:18.249
*** SERVICE NAME:(utdw016-edw) 2010-07-30 04:10:18.249
*** MODULE NAME:(pmdtm@x42k604-zone2 (TNS V1-V3)) 2010-07-30 04:10:18.249
*** ACTION NAME:() 2010-07-30 04:10:18.249

* kdsgrp1-1: *************************************************
           row 0x0400f521.1a continuation at
           0x0400f521.1a file# 16 block# 62753 slot 26 not found
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 26 ..... nrows: 28
kdsgrp - dump CR block dba=0x0400f521
Block header dump:  0x0400f521
Object id on Block? Y
seg/obj: 0x11527d  csc: 0xa19.61e20b3c  itc: 2  flg: E  typ: 1 - DATA
    brn: 0  bdba: 0x400f500 ver: 0x01 opc: 0
    inc: 0  exflg: 0

Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0040.02f.0001bfcd  0x2822b8ea.0224.34  C-U-    0  scn 0x0a19.61e1aa84
0x02   0x003f.01c.00028920  0x2822f9de.01e0.6b  --U-   26  fsc 0x0000.61e21527
bdba: 0x0400f521
data_block_dump,data header at 0x2d63ec064

这个case同时提交了SR,Oracle GCS支持分析日志后认为告警日志中曾经出现kewastUnPackStats记录,说明可能由11.2.0.1上的Bug 8967729 引起的:

I researched the issue further based upon the error that followed the log switch command,

Tue Aug 03 10:00:04 2010
ALTER SYSTEM ARCHIVE LOG
Tue Aug 03 10:00:05 2010
Thread 1 advanced to log sequence 8652 (LGWR switch)
Current log# 4 seq# 8652 mem# 0: +UAT_DATA/utdw016/redo041a.log
Current log# 4 seq# 8652 mem# 1: +UAT_ARCH/utdw016/redo042a.log
Tue Aug 03 10:00:14 2010
Archived Log entry 16346 added for thread 1 sequence 8651 ID 0x8d616825 dest 1:
Tue Aug 03 10:00:21 2010
ALTER SYSTEM ARCHIVE LOG
Tue Aug 03 10:00:23 2010
Thread 1 advanced to log sequence 8653 (LGWR switch)
Current log# 5 seq# 8653 mem# 0: +UAT_DATA/utdw016/redo051a.log
Current log# 5 seq# 8653 mem# 1: +UAT_ARCH/utdw016/redo052a.log
Tue Aug 03 10:00:23 2010
Archived Log entry 16348 added for thread 1 sequence 8652 ID 0x8d616825 dest 1:
Tue Aug 03 10:00:45 2010
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> kewastUnPackStats(): bad magic 1 (0x2b29e289bf13, 0)

We need to resolve this issue and then determine whether these indexes corrupt again within the database.

I found Bug 8967729 11.2.0.1 Abstract: KEWASTUNPACKSTATS(): BAD MAGIC MESSAGE IN ALERT.LOG.

Basically, there is a archivelog switch issue that is causes string corruption.

This issue was introduced by the fix of bug #8715387 and has been fully fixed as part of Bug 8730312 11.2 Abstract: FWD MERGE FOR BASE Bug 8715387 FOR 12G.

Please download and install patch 8730312 for your Oracle and OS versions.
However, if you have installed patches in this 11.2 database, then upload the results of the Opatch inventory.
We will check the installed database patches against patch 8730312 for any contradictions.

Hdr: 8967729 11.2.0.1 RDBMS 11.2.0.1 SVRMAN AWR PRODID-5 PORTID-226 8730312
Abstract: KEWASTUNPACKSTATS(): BAD MAGIC MESSAGE IN ALERT.LOG
PROBLEM:
--------
In a two-node RAC cluster (+ Data Guard), the rac instances write many
strange messages in alert log, such:
kewastUnPackStats(): bad magic 1 (0x2ac893121390, 0)
The hex address changes.

DIAGNOSTIC ANALYSIS:
--------------------
I don't see any error associated with this message.
It looks like Bug 8730312 description, but the base bug 8715387 should be
already fixed in 11.2.0.1 version.

WORKAROUND:
-----------
n/a

RELATED BUGS:
-------------
8715387, 8730312

REPRODUCIBILITY:
----------------

TEST CASE:
----------

STACK TRACE:
------------
Tue Sep 22 09:15:34 2009
ALTER SYSTEM ARCHIVE LOG
Tue Sep 22 09:15:34 2009
LGWR: Standby redo logfile selected to archive thread 1 sequence 1244
LGWR: Standby redo logfile selected for thread 1 sequence 1244 for
destination LOG_ARCHIVE_DEST_2
Thread 1 advanced to log sequence 1244 (LGWR switch)
  Current log# 2 seq# 1244 mem# 0:
+DGCTRL/saneons/onlinelog/group_2.264.697133919
Archived Log entry 3903 added for thread 1 sequence 1243 ID 0x5186bbfd dest
1:
Tue Sep 22 10:00:02 2009
kewastUnPackStats(): bad magic 1 (0x2b1d760d3038, 0)
kewastUnPackStats(): bad magic 1 (0x2b1d760d3038, 0)

MOS认为应用补丁8730312可以解决我们的问题,但实际apply patch后EIMABS.FACT_OPEN_SALES_ORDER上的DELETE语句可以正常执行了,而之后发现的查询语句仍会报ORA-00600:[kdsgrp1]错误:

We have applied the patch. And we're in process of analyze all the partitions. So far there's no error reported. We analyzed all the partitions in table EIMABS.FACT_REVENUE, there's no error either. However, when we run the query, we still get the same error. Please note the other tables in the query are not partitioned.

SELECT
F.DIM_BIZ_SECTOR_ID AS DIM_BIZ_SECTOR_ID,
F.DW_REVENUE_ID ,
F.DIM_ENT_ITEM_ID AS DIM_ENT_ITEM_ID,
F.DIM_REV_PCH_KEY_ACCT_ID AS DIM_PCH_KEY_ACCOUNT_ID,
CT.DIM_PGH_COUNTRY_GROUP_ID AS DIM_PGH_COUNTRY_GROUP_ID,
F.DIM_REVENUE_DATE_ID AS DIM_FORECAST_DATE_ID,
NEXT_DAY(SYSDATE-4.000000000000000,'FRI') AS SNAPSHOT_DATE,
F.REVENUE_DT AS FORECAST_DATE,
F.REVENUE_QTY AS REVENUE_QTY
FROM
EIMABS.FACT_REVENUE F,
EIMABS.DIM_BIZ_SECTOR B,
EIMABS.DIM_ENT_ITEM E,
EIMABS.DIM_PCH_KEY_ACCOUNT PCH,
EIMABS.DIM_COUNTRY CT,
EIMABS.DIM_FISCAL_CAL_DATE CAL
WHERE
F.DIM_BIZ_SECTOR_ID = B.DIM_BIZ_SECTOR_ID
AND F.DIM_ENT_ITEM_ID = E.DIM_ENT_ITEM_ID
AND F.DIM_REV_PCH_KEY_ACCT_ID = PCH.DIM_PCH_KEY_ACCOUNT_ID
AND F.DIM_ULT_DEST_COUNTRY_ID = CT.DIM_COUNTRY_ID
AND UPPER(B.BIZ_SECTOR_CODE) = 'MDB'
AND F.DIM_REVENUE_DATE_ID = CAL.DIM_CALENDAR_DATE_ID
AND CAL.FISCAL_MONTH_NUM IN
(SELECT FISCAL_MONTH_NUM FROM EIMABS.DIM_FISCAL_CAL_MONTH
WHERE DIM_CALENDAR_MONTH_ID IN (
(SELECT B.DIM_CALENDAR_MONTH_ID -1 FROM EIMABS.DIM_FISCAL_CAL_DATE A , EIMABS.DIM_FISCAL_CAL_MONTH B
WHERE CALENDAR_DT = TRUNC(SYSDATE) AND A.DIM_CALENDAR_MONTH_ID = B.DIM_CALENDAR_MONTH_ID),
(SELECT B.DIM_CALENDAR_MONTH_ID -2 FROM EIMABS.DIM_FISCAL_CAL_DATE A , EIMABS.DIM_FISCAL_CAL_MONTH B
WHERE CALENDAR_DT = TRUNC(SYSDATE) AND A.DIM_CALENDAR_MONTH_ID = B.DIM_CALENDAR_MONTH_ID)));

SELECT
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [],
[], [], [], [], []

/*以下为trace文件*/
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/112utdw016
System name:	Linux
Node name:	x42k600
Release:	2.6.18-164.el5
Version:	#1 SMP Tue Aug 18 15:51:48 EDT 2009
Machine:	x86_64
Instance name: utdw016a
Redo thread mounted by this instance: 1
Oracle process number: 64
Unix process pid: 7170, image: oracle@x42k600 (TNS V1-V3)


*** 2010-08-05 19:10:19.776
*** SESSION ID:(10.4634) 2010-08-05 19:10:19.776
*** CLIENT ID:() 2010-08-05 19:10:19.776
*** SERVICE NAME:(SYS$USERS) 2010-08-05 19:10:19.776
*** MODULE NAME:(sqlplus@x42k600 (TNS V1-V3)) 2010-08-05 19:10:19.776
*** ACTION NAME:() 2010-08-05 19:10:19.776

* kdsgrp1-1:
*************************************************
           row 0x04608b75.40 continuation at
           0x04608b75.40 file# 17 block# 2132853 slot 64 not found
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 64 ..... nrows: 130
kdsgrp - dump CR block dba=0x04608b75
Block header dump:  0x04608b75
Object id on Block? Y
seg/obj: 0x1c8071  csc: 0xa19.622a6023  itc: 2  flg: E  typ: 1 - DATA
    brn: 0  bdba: 0x4608b40 ver: 0x01 opc: 0
    inc: 0  exflg: 0

Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000f.040.0001267d  0x01067067.022e.1e  --U-   64  fsc 0x0000.622a621d
0x02   0x0029.026.0001be36  0x282115c8.0223.0a  C---    0  scn 0x0a19.62297970  <---------- this TX shows in the index too
bdba: 0x04608b75
data_block_dump,data header at 0x47c16c064
===============
tsiz: 0x3f98
hsiz: 0x116
pbl: 0x47c16c064
    76543210
flag=--------
ntab=1
nrow=130
frre=64
..
..
tab 0, row 63, @0x3a65
tl: 110 fb: --H-FL-- lb: 0x1  cc: 22
col  0: [ 5]  c4 0e 31 32 04
col  1: [ 2]  c1 02
col  2: [ 5]  c4 02 44 33 44
col  3: [ 5]  c4 03 14 10 23
col  4: [ 5]  c4 09 5f 35 38
col  5: [ 5]  c4 05 32 42 0b
col  6: [ 2]  c2 03
col  7: [ 5]  c4 03 2e 36 49
col  8: [ 1]  80
col  9: [ 1]  80
col 10: [ 7]  32 30 34 31 39 32 35
col 11: [ 1]  4e
col 12: [ 7]  78 6e 06 18 01 01 01
col 13: [ 2]  c1 06
col 14: [ 3]  c2 4d 33
col 15: [ 7]  78 6e 07 1b 0a 0d 04
col 16: [ 4]  c3 05 2f 25
col 17: [ 2]  c1 09
col 18: [ 3]  c2 02 50
col 19: [ 3]  c2 02 50
col 20: [ 2]  c1 0e
col 21: [ 8]  38 30 30 31 31 38 35 38
end_of_block_dump


Block dump from cache:
Dump of buffer cache at level 4 for tsn=13, rdba=73436021
BH (0x47fd84178) file#: 17 rdba: 0x04608b75 (17/2132853) class: 1 ba: 0x47c16c000
 set: 47 pool 3 bsz: 16384 bsi: 0 sflg: 1 pwc: 0,25
 dbwrid: 0 obj: 1867889 objn: 1130693 tsn: 13 afn: 17 hint: f
 hash: [0x54be43860,0x54be43860] lru: [0x49beca1a0,0x3b7e094c0]
 ckptq: [NULL] fileq: [NULL] objq: [0x52a0e8580,0x52a0e8580]
 use: [0x54271be40,0x54271be40] wait: [NULL]
 st: SCURRENT md: SHR tch: 0 le: 0x3effd22c8
 flags:
 LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
 cr pin refcnt: 0 sh pin refcnt: 0
Block dump from disk:
buffer tsn: 13 rdba: 0x04608b75 (17/2132853)
scn: 0x0a19.622a621d seq: 0x01 flg: 0x06 tail: 0x621d0601
frmt: 0x02 chkval: 0xdf67 type: 0x06=trans data

Block header dump:  0x04608b75
Object id on Block? Y
seg/obj: 0x1c8071  csc: 0xa19.622a6023  itc: 2  flg: E  typ: 1 - DATA
    brn: 0  bdba: 0x4608b40 ver: 0x01 opc: 0
    inc: 0  exflg: 0

Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000f.040.0001267d  0x01067067.022e.1e  --U-   64  fsc 0x0000.622a621d
0x02   0x0029.026.0001be36  0x282115c8.0223.0a  C---    0  scn 0x0a19.62297970
bdba: 0x04608b75

Incident 322216 created, dump file: /u01/app/oracle/diag/rdbms/utdw016/utdw016a/incident/incdir_322216/utdw016a_ora_7170_i322216.trc
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

utdw016a_ora_7170_i322218.trc
-------------------------------
----- Current SQL Statement for this session (sql_id=88kca6rwpm9pc) -----
SELECT
F.DIM_BIZ_SECTOR_ID AS DIM_BIZ_SECTOR_ID,
F.DW_REVENUE_ID ,
F.DIM_ENT_ITEM_ID AS DIM_ENT_ITEM_ID,
 F.DIM_REV_PCH_KEY_ACCT_ID AS DIM_PCH_KEY_ACCOUNT_ID,
 CT.DIM_PGH_COUNTRY_GROUP_ID AS DIM_PGH_COUNTRY_GROUP_ID,
  F.DIM_REVENUE_DATE_ID AS DIM_FORECAST_DATE_ID,
 NEXT_DAY(SYSDATE-4.000000000000000,'FRI') AS SNAPSHOT_DATE,
 F.REVENUE_DT AS FORECAST_DATE,
 F.REVENUE_QTY AS REVENUE_QTY
FROM
 EIMABS.FACT_REVENUE F,
 EIMABS.DIM_BIZ_SECTOR B,
 EIMABS.DIM_ENT_ITEM E,
 EIMABS.DIM_PCH_KEY_ACCOUNT PCH,
 EIMABS.DIM_COUNTRY CT,
 EIMABS.DIM_FISCAL_CAL_DATE CAL
WHERE
   F.DIM_BIZ_SECTOR_ID = B.DIM_BIZ_SECTOR_ID
AND F.DIM_ENT_ITEM_ID = E.DIM_ENT_ITEM_ID
AND F.DIM_REV_PCH_KEY_ACCT_ID = PCH.DIM_PCH_KEY_ACCOUNT_ID
AND F.DIM_ULT_DEST_COUNTRY_ID = CT.DIM_COUNTRY_ID
AND UPPER(B.BIZ_SECTOR_CODE) = 'MDB'
AND F.DIM_REVENUE_DATE_ID = CAL.DIM_CALENDAR_DATE_ID
AND CAL.FISCAL_MONTH_NUM IN
(SELECT FISCAL_MONTH_NUM FROM EIMABS.DIM_FISCAL_CAL_MONTH
 WHERE DIM_CALENDAR_MONTH_ID IN (
      (SELECT B.DIM_CALENDAR_MONTH_ID -1 FROM EIMABS.DIM_FISCAL_CAL_DATE A , EIMABS.DIM_FISCAL_CAL_MONTH B
          WHERE CALENDAR_DT = TRUNC(SYSDATE) AND A.DIM_CALENDAR_MONTH_ID = B.DIM_CALENDAR_MONTH_ID),
          (SELECT B.DIM_CALENDAR_MONTH_ID -2 FROM EIMABS.DIM_FISCAL_CAL_DATE A , EIMABS.DIM_FISCAL_CAL_MONTH B
          WHERE CALENDAR_DT = TRUNC(SYSDATE) AND A.DIM_CALENDAR_MONTH_ID = B.DIM_CALENDAR_MONTH_ID)))

----- Call Stack Trace -----
 kdsgrp1  kdsgrp  qetlbr  qertbFetchByRowID  qerjotRowProc  qerbtFetch  wProc  qergiFetch

  BH (0x47fd84178) file#: 17 rdba: 0x04608b75 (17/2132853) class: 1 ba: 0x47c16c000
       set: 47 pool 3 bsz: 16384 bsi: 0 sflg: 1 pwc: 0,25
       dbwrid: 0 obj: 1867889 objn: 1130693 tsn: 13 afn: 17 hint: f

addr: 0x47fd84178 obj: 1867889 cls: DATA bscn: 0xa19.622a621d
       buffer tsn: 13 rdba: 0x04608b75 (17/2132853)
       scn: 0x0a19.622a621d seq: 0x01 flg: 0x06 tail: 0x621d0601
       frmt: 0x02 chkval: 0xdf67 type: 0x06=trans data

Block header dump:  0x04608b75
Object id on Block? Y
seg/obj: 0x1c8071  csc: 0xa19.622a6023  itc: 2  flg: E  typ: 1 - DATA
    brn: 0  bdba: 0x4608b40 ver: 0x01 opc: 0
    inc: 0  exflg: 0

Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000f.040.0001267d  0x01067067.022e.1e  --U-   64  fsc 0x0000.622a621d
0x02   0x0029.026.0001be36  0x282115c8.0223.0a  C---    0  scn 0x0a19.62297970
bdba: 0x04608b75
data_block_dump,data header at 0x47c16c064
===============
tsiz: 0x3f98
hsiz: 0x116
pbl: 0x47c16c064
    76543210
flag=--------
ntab=1
nrow=130
frre=64


     BH (0x49fe84af8) file#: 29 rdba: 0x076e5058 (29/3035224) class: 1 ba: 0x49ee0c000
       set: 44 pool 3 bsz: 16384 bsi: 0 sflg: 1 pwc: 0,25
       dbwrid: 1 obj: 1893758 objn: 1893758 tsn: 17 afn: 29 hint: f
       addr: 0x49fe84af8 obj: 1893758 cls: DATA bscn: 0xa19.62297970
       buffer tsn: 17 rdba: 0x076e5058 (29/3035224)
       scn: 0x0a19.62297970 seq: 0x01 flg: 0x06 tail: 0x79700601
       frmt: 0x02 chkval: 0xe878 type: 0x06=trans data

Block header dump:  0x076e5058
Object id on Block? Y
seg/obj: 0x1ce57e  csc: 0xa19.6229787e  itc: 2  flg: E  typ: 2 - INDEX
    brn: 0  bdba: 0x76e5054 ver: 0x01 opc: 0
    inc: 0  exflg: 0

Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0029.026.0001be36  0x282115de.0223.2a  --U-   14  fsc 0x0000.62297970  <--------------- same TX as in the table


之后MOS又有回复,认为设置隐式参数_row_cr=FALSE可以解决问题:

There are few bugs that could raise the ora-600[KDSGRP1]  in 11.2

1) based on the stack --> Bug:8771916 ORA-600 [KDSGRP1] WHEN DOING AN UPDATE


Abstract:            OERI [kdsgrp1] during CR read
Fixed-Releases:     C100
Tags:               OERI
Details:
 ORA-600 [kdsgrp1] can occur intermittently when using
 a query access path to a row in a table via an index
 due to a consistent read problem.
 Workaround:
   Disabling rowCR (which is an optimization to reduce consistent-read
   rollbacks during queries) by setting "_row_cr"=FALSE in the
   initialization files in one workaround. However, this could cause
   performance degradation of queries - the statistics "RowCR hits" /
   "RowCR attempts" can help show if this workaround may be detrimental
   to performance.


2) based on the fact it is hit on partitioned tables

Bug 8546356  ORA-8102/ORA-1499/OERI[kdsgrp1] Composite Partitioned Index corruption after rebuild ONLINE
Details:
  When a COMPOSITE PARTITIONED index is rebuild ONLINE and concurrent DML on
  the same table is run on separate RAC instances, index corruption might
  happen where the index is missing some keys. This is only pertaining to
  composite partitioned indexes.
  Wrong results or the next errors can be produced by SQL statements:
  ORA-8102 by a delete/update
  ORA-1499 by "analyze table validate structure cascade"
  ORA-600 [kdsgrp1]
  ORA-600 [qertbFetchByRowID]
  Workaround:
  Do not run concurrent DML from any instance other than the one where index
  online rebuild is happening or rebuild the index without the ONLINE clause.
  To fix it: Rebuild the index.

Please set in init.ora this parameter _row_cr=FALSE and bounce database,
 see after that if OERI[kdsgrp1] still reproduces.

There is one-off patch available for Bug 8771916 for 11.2.0.1 on LInux x86_64
                                              and for Bug 8546356 the one-off patch is done for 11.2.0.1.2

There is also Bug:8951812 hit a lot in 11.2.0.1, one-off patch available
An index can become corrupt during index rebuild online producing
 ORA-600 errors or other external like ORA-8102 by an update/delete,
 ORA-1499 by "analyze table validate structure cascade".

这个case仍在继续,to be continued .....

rman hang on SQL*Net message from client

有这样一个问题,平台为HP-UX(B.11.31 U ia64),Oracle版本为10.2.0.4 single instance,RMAN自动备份autobackup controlfile时出现hang症状,等待事件为SQL*Net message from client, 这还仅仅是使用最简单的NOCATALOG+ Disk Device的情况,没有MML层面的活动。

针对该问题对RMAN服务进程做了后台TRACE,发现stack call总是hang在_read_sys=>KERNEL内核态函数上,感觉与ORACLE的关系不大,应当是HP-UX C函数调用造成的问题,例如:

10046 trace:

PARSING IN CURSOR #3 len=278 dep=0 uid=0 oct=3 lid=0 tim=9513124502191 hv=3071086789 ad=’bd7447d0′
select sofar, context, start_time from v$session_longops where (start_time > nvl(:1, sysdate-100) or start_time = nvl(:2, sysdate+100)) and sid = :3 and serial# = :4 and opname like ‘RMAN:%’ order by start_time desc, context desc
END OF STMT
PARSE #3:c=0,e=3005,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=9513124502186
WAIT #3: nam=’SQL*Net message to client’ ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124502275
WAIT #3: nam=’SQL*Net message from client’ ela= 234 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124502527
BINDS #3:
kkscoacd
Bind#0
oacdty=12 mxl=07(07) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=64 off=0
kxsbbbfp=9fffffffbf3d73b0 bln=07 avl=00 flg=05
Bind#1
oacdty=12 mxl=07(07) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=8
kxsbbbfp=9fffffffbf3d73b8 bln=07 avl=00 flg=01
Bind#2
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=16
kxsbbbfp=9fffffffbf3d73c0 bln=22 avl=03 flg=01
value=1875
Bind#3
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=40
kxsbbbfp=9fffffffbf3d73d8 bln=22 avl=04 flg=01
value=50006
EXEC #3:c=10000,e=2859,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=9513124505442
FETCH #3:c=0,e=6030,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=9513124511493
WAIT #3: nam=’SQL*Net message to client’ ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124511560
WAIT #3: nam=’SQL*Net message from client’ ela= 56 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124511690
BINDS #3:
kkscoacd
Bind#0
oacdty=12 mxl=07(07) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=64 off=0
kxsbbbfp=9fffffffbf3d6360 bln=07 avl=00 flg=05
Bind#1
oacdty=12 mxl=07(07) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=8
kxsbbbfp=9fffffffbf3d6368 bln=07 avl=00 flg=01
Bind#2
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=16
kxsbbbfp=9fffffffbf3d6370 bln=22 avl=03 flg=01
value=1875
Bind#3
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=40
kxsbbbfp=9fffffffbf3d6388 bln=22 avl=04 flg=01
value=50006
EXEC #3:c=0,e=275,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=9513124512027
FETCH #3:c=10000,e=5688,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=9513124517733
WAIT #3: nam=’SQL*Net message to client’ ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124517765
WAIT #3: nam=’SQL*Net message from client’ ela= 35 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124517853
STAT #3 id=1 cnt=0 pid=0 pos=1 obj=0 op=’SORT ORDER BY (cr=0 pr=0 pw=0 time=11729 us)’
STAT #3 id=2 cnt=0 pid=1 pos=1 obj=0 op=’FIXED TABLE FULL X$KSULOP (cr=0 pr=0 pw=0 time=11697 us)’
WAIT #0: nam=’SQL*Net message to client’ ela= 0 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513124517949
WAIT #0: nam=’SQL*Net message from client’ ela= 1629803 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=9513126147781
BINDS #2:
kkscoacd
Bind#0
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=72 off=0
kxsbbbfp=9fffffffbf3d7ef8 bln=22 avl=03 flg=05
value=1782
Bind#1
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=24
kxsbbbfp=9fffffffbf3d7f10 bln=22 avl=04 flg=01
value=62996
Bind#2
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=0 off=48
kxsbbbfp=9fffffffbf3d7f28 bln=22 avl=02 flg=01
value=2

………….

RMAN DEBUG

DBGRCVMAN: resetAll
DBGRCVMAN: setRAflags kindMask=255 containerMask=15 actionMask=63
DBGRCVMAN: ENTERING setComputeRecoveryActionMasks
DBGRCVMAN: EXITING setComputeRecoveryActionMasks
DBGRCVMAN: allIncarnations is set to FALSE
DBGRCVMAN: reset transclause
DBGRCVMAN: *****BsRecCache Statistics*****
DBGRCVMAN: Cache size=0 hit=0
DBGRCVMAN: rcvRecBackupAge= 0
DBGRCVMAN: thisBackupAge= 0
DBGRCVMAN: ENTERING setRedoLogDeletionPolicy
DBGRCVMAN: setRedoLogDeletionPolicy with policy = TO NONE
DBGRCVMAN: setRedoLogDeletionPolicy with alldest = 0
DBGRCVMAN: ENTERING validateStandbyConfig1
DBGRCVMAN: policy = TO NONE
DBGRCVMAN: alldest = 0
DBGRCVMAN: EXITING validateStandbyConfig1 with target = NULLwith status = 1
DBGRCVMAN: EXITING setRedoLogDeletionPolicy with policy = TO NONE with target = NULL with bind = MANDATORY
DBGMISC: krmknmtr: the parse tree after name translation is: [13:13:53.118]
DBGMISC: EXITED krmknmtr with status Control File and SPFILE Autobackup [13:13:53.118] elapsed time [00:00:00:00.021]
DBGMISC: krmkdps: this_reset_scn=56498483682 [13:13:53.118]
DBGMISC: krmkdps: this_reset_time=02-SEP-09 [13:13:53.119]
DBGMISC: krmkdps: untilSCN= [13:13:53.119]
DBGMISC: krmkdps: untilTime= [13:13:53.119]
DBGMISC: krmkdps: getRA_completedAfter= [13:13:53.119]
DBGMISC: krmkdps: getRA_completedBefore= [13:13:53.119]
DBGMISC: krmkdps: getRA_likePattern= [13:13:53.120]
DBGMISC: krmkdps: getRA_containerMask=15 [13:13:53.120]
DBGMISC: krmkdps: getRA_actionMask=63 [13:13:53.120]
DBGMISC: krmkdps: computeRA_allRecords0 [13:13:53.120]
DBGMISC: krmkdps: computeRA_fullBackups=1 [13:13:53.120]
DBGMISC: krmkdps: allIncarnations=0 [13:13:53.120]
DBGMISC: ENTERED krmkbaut [13:13:53.121]
DBGMISC: Entering krmkgconf [13:13:53.121]

DBGSQL: EXEC SQL AT TARGET select controlfile_type ,controlfile_change# ,controlfile_sequence# ,controlfile_created ,decode(offr.records_used,0,0,((offr.last_recid-offr.records_used)+1)) into :b1,:b2,:b3,:b4,:b5 from v$database ,v$controlfile_record_section offr where offr.type=’OFFLINE RANGE’ [13:13:53.121]
DBGSQL: sqlcode=0 [13:13:53.125]
DBGSQL: :b1 = “CURRENT”
DBGSQL: :b2 = 152792315080
DBGSQL: :b3 = 7538564
DBGSQL: :b4 = “09-AUG-09″
DBGSQL: :b5 = 0
DBGMISC: krmkgconf: Configuration is ENCRYPTION FOR DATABASE [13:13:53.126]

DBGSQL: EXEC SQL AT TARGET declare first boolean ; conf# number ; begin if ( :first > 0 ) then first := TRUE ; else first := FALSE ; end if ; dbms_rcvman . getConfig ( conf# , :name:name_i , :value:value_i , first ) ; end ; [13:13:53.126]
DBGSQL: sqlcode=1403 [13:13:53.127]
DBGMISC: Function kmkgconf created the following tree: [13:13:53.128]
DBGMISC: EXIT
DBGMISC: Exiting krmkgconf [13:13:53.128]
DBGMISC: Configured NoEncrypt at db level [13:13:53.128]
DBGMISC: No Encrypt backup init(60000000) for cfile [13:13:53.128]
DBGMISC: EXITED krmkbaut [13:13:53.128] elapsed time [00:00:00:00.007]
DBGMISC: EXITED krmkomp [13:13:53.128] elapsed time [00:00:00:00.032]
DBGPLSQL: the compiled command tree is: [13:13:53.129] (krmicomp)
DBGPLSQL: 1 CMD type=Control File and SPFILE Autobackup id=1 status=NOT STARTED
DBGPLSQL: 1 STEP id=1 status=NOT STARTED
DBGPLSQL: 1 TEXTNOD = declare
DBGPLSQL: 2 TEXTNOD = ncopies number;
DBGPLSQL: 3 TEXTNOD = copyno number;
DBGPLSQL: 4 TEXTNOD = handle varchar2(512);
DBGPLSQL: 5 TEXTNOD = comment varchar2(80);
DBGPLSQL: 6 TEXTNOD = media varchar2(80);
DBGPLSQL: 7 TEXTNOD = lcfaudate date;
DBGPLSQL: 8 TEXTNOD = lsequence binary_integer;
DBGPLSQL: 9 TEXTNOD = lbautfmt varchar2(512);
DBGPLSQL: 10 TEXTNOD = rsid number;
DBGPLSQL: 11 TEXTNOD = rsts number;
DBGPLSQL: 12 TEXTNOD = p1 binary_integer := 0;
DBGPLSQL: 13 TEXTNOD = p2 binary_integer;
DBGPLSQL: 14 TEXTNOD = p3 binary_integer;
DBGPLSQL: 15 TEXTNOD = p4 binary_integer;
DBGPLSQL: 16 TEXTNOD = p5 binary_integer;
DBGPLSQL: 17 TEXTNOD = t1 varchar2(1025);
DBGPLSQL: 18 TEXTNOD = t2 varchar2(1);
DBGPLSQL: 19 TEXTNOD = t3 varchar2(1);
DBGPLSQL: 20 TEXTNOD = begin
DBGPLSQL: 21 TEXTNOD =
DBGPLSQL: 22 PRMVAL = lsequence := null;
DBGPLSQL: 23 TEXTNOD =
DBGPLSQL: 24 PRMVAL = lcfaudate := null;
DBGPLSQL: 25 TEXTNOD =
DBGPLSQL: 26 PRMVAL = lbautfmt := null;
DBGPLSQL: 27 TEXTNOD =
DBGPLSQL: 28 PRMVAL = rsid := 5708; rsts := 791817232;
DBGPLSQL: 29 TEXTNOD = setBackupParams(FALSE);
DBGPLSQL: 30 TEXTNOD = if (krmicd.getParams(1, p2, p3, p4, p5, t1, t2, t3)) then
DBGPLSQL: 31 TEXTNOD = p1 := 1;
DBGPLSQL: 32 TEXTNOD = end if;
DBGPLSQL: 33 TEXTNOD = sys.dbms_backup_restore.setRmanStatusRowId(rsid=>rsid, rsts=>rsts);
DBGPLSQL: 34 TEXTNOD = sys.dbms_backup_restore.DoAutobackup(ncopies => ncopies,
DBGPLSQL: 35 TEXTNOD = cfaudate => lcfaudate,
DBGPLSQL: 36 TEXTNOD = seq => lsequence,
DBGPLSQL: 37 TEXTNOD = format => lbautfmt,
DBGPLSQL: 38 TEXTNOD = p1 => p1,
DBGPLSQL: 39 TEXTNOD = p2 => p2,
DBGPLSQL: 40 TEXTNOD = p3 => p3,
DBGPLSQL: 41 TEXTNOD = p4 => t1);
DBGPLSQL: 42 TEXTNOD = copyno := 0;
DBGPLSQL: 43 TEXTNOD = loop
DBGPLSQL: 44 TEXTNOD = exit when copyno=ncopies;
DBGPLSQL: 45 TEXTNOD = sys.dbms_backup_restore.backupPieceCrtDupGet(copyno,
DBGPLSQL: 46 TEXTNOD = handle,
DBGPLSQL: 47 TEXTNOD = comment,
DBGPLSQL: 48 TEXTNOD = media);
DBGPLSQL: 49 TEXTNOD = if comment is null then comment := ‘NONE’; end if;
DBGPLSQL: 50 TEXTNOD = krmicd.writeMsg(8503, handle, comment);
DBGPLSQL: 51 TEXTNOD = copyno := copyno + 1;
DBGPLSQL: 52 TEXTNOD = end loop;
DBGPLSQL: 53 TEXTNOD = sys.dbms_backup_restore.setRmanStatusRowId(rsid=>0, rsts=>0);
DBGPLSQL: 54 TEXTNOD = end;
DBGMISC: executing command Control File and SPFILE Autobackup [13:13:53.138]
DBGRPC: krmxpoq: xc=6917529027645961080, action=”0000003 FINISHED66″, col_l=18, ind=0, sid=1475
DBGRPC: krmxr: xc=6917529027645961080 chid=default rpc count=3
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000021 FINISHED129″, col_l=19, ind=0, sid=1870
DBGRPC: krmxr: xc=6917529027645974080 chid=ORA_DISK_1 rpc count=21
DBGRPC: ENTERED krmqgns
DBGRPC: krmqgns: looking for work for channel default (krmqgns)
DBGRPC: krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC: CMD type=Control File and SPFILE Autobackup id=1 status=NOT STARTED
DBGRPC: 1 STEP id=1 status=NOT STARTED
DBGRPC: krmqgns: no work found for channel default (krmqgns)
DBGRPC: (krmqgns)
DBGRPC: EXITED krmqgns with status 1
DBGRPC: ENTERED krmqgns
DBGRPC: krmqgns: looking for work for channel ORA_DISK_1 (krmqgns)
DBGRPC: krmqgns: commands remaining to be executed: (krmqgns)
DBGRPC: CMD type=Control File and SPFILE Autobackup id=1 status=NOT STARTED
DBGRPC: 1 STEP id=1 status=NOT STARTED
DBGRPC: krmqgns: channel ORA_DISK_1 assigned step 1 (krmqgns)
DBGRPC: (krmqgns)
DBGRPC: EXITED krmqgns with status 0
DBGRPC: krmxcic: xc=6917529027645974080 chid=ORA_DISK_1 calling peidcs
DBGRPC: krmxcis: xc=6917529027645974080 chid=ORA_DISK_1 calling pcicmp
DBGRPC: krmxr: xc=6917529027645974080 chid=ORA_DISK_1 calling peicnt
DBGMISC: ENTERED krmzgparms [13:13:53.152]
DBGMISC: Step id = 1; Code = 1 [13:13:53.152] (krmzgparms)
DBGMISC: EXITED krmzgparms with status 0 (FALSE) [13:13:53.152] elapsed time [00:00:00:00.000]
DBGMISC: ENTERED krmzgparms [13:13:53.152]
DBGMISC: Step id = 1; Code = 1 [13:13:53.152] (krmzgparms)
DBGMISC: EXITED krmzgparms with status 0 (FALSE) [13:13:53.152] elapsed time [00:00:00:00.000]
DBGRPC: krmxrpc: xc=6917529027645974080 kpurpc2 rc=0 db=target proc=DBMS_BACKUP_RESTORE.SETRMANSTATUSROWID
DBGRPC: krmxrpc: xc=6917529027645974080 chid=ORA_DISK_1 rpc count(non-blocking)=22
DBGRPC: krmxrpc: xc=6917529027645974080 RPC #22 completed immediately
DBGRPC: krmxrpc: xc=6917529027645974080 kpurpc2 rc=3123 db=target proc=DBMS_BACKUP_RESTORE.DOAUTOBACKUP
DBGRPC: krmxrpc: xc=6917529027645974080 chid=ORA_DISK_1 rpc count=23
DBGRPC: krmxrpc: xc=6917529027645974080 starting longrunning RPC #23 to target: DBMS_BACKUP_RESTORE.DOAUTOBACKUP
DBGRPC: krmxr: xc=6917529027645974080 started long running rpc
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000022 FINISHED129″, col_l=19, ind=0, sid=1870
DBGRPC: krmxr: callback returned TRUE, skipping sleep
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 1 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 2 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 4 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 8 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111″, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds
DBGRPC: krmxpoq: xc=6917529027645974080, action=”0000023 STARTED111”, col_l=18, ind=0, sid=1870
DBGRPC: krmxr: sleeping for 10 seconds

对RMAN相关进程做stack call的追踪

select action from gv$session where sid=1782 and serial#=62996 and inst_id=2

rman.log

-rw-r—– 1 oracle oinstall 56475 Aug 20 13:13 mac_ora_29364.trc
-rw-r—– 1 oracle oinstall 439863 Aug 20 13:13 mac_ora_28147.trc

mac> select spid from v$process where addr in ( select paddr from v$session where module like ‘rman%’);

SPID
————
27957
26914

mac>
mac>
mac> oradebug setospid 27957;
Oracle pid: 275, Unix process pid: 27957, image: oracle@mfs138 (TNS V1-V3)
mac> oradebug short_stack;
ksdxfstk()+48<-ksdxcb()+5776<-sspuser()+640<-<-_read_sys()+48<-_read()+224<-sntpread()+80<-ntprd()+192<-nsprecv()+736<-nsrdr()+272<-nsdo()+12512<-nsbrecv()+80<-nioqrc()+704<-opikndf2()+1216<-opitsk()+1280<-opiino()+1840<-opiodr()+2128<-opidrv()+1088<-sou2o()+336<-opimai_real()+224<-main()+368 oradebug short_stack;
ksdxfstk()+48<-ksdxcb()+5776<-sspuser()+640<-<-_read_sys()+48<-_read()+224<-sntpread()+80<-ntprd()+192<-nsprecv()+736<-nsrdr()+272<-nsdo()+12512<-nsbrecv()+80<-nioqrc()+704<-opikndf2()+1216<-opitsk()+1280<-opiino()+1840<-opiodr()+2128<-opidrv()+1088<-sou2o()+336<-opimai_real()+224<-main()+368 oradebug short_stack;
ksdxfstk()+48<-ksdxcb()+5776<-sspuser()+640<-<-_read_sys()+48<-_read()+224<-sntpread()+80<-ntprd()+192<-nsprecv()+736<-nsrdr()+272<-nsdo()+12512<-nsbrecv()+80<-nioqrc()+704<-opikndf2()+1216<-opitsk()+1280<-opiino()+1840<-opiodr()+2128<-opidrv()+1088<-sou2o()+336<-opimai_real()+224<-main()+368
mac> oradebug setospid 26914
Oracle pid: 463, Unix process pid: 26914, image: oracle@mfs138 (TNS V1-V3)
mac> oradebug short_stack
ksdxfstk()+48<-ksdxcb()+5776<-sspuser()+640<-<-_read_sys()+48<-_read()+224<-sntpread()+80<-ntprd()+192<-nsprecv()+736<-nsrdr()+272<-nsdo()+12512<-nsbrecv()+80<-nioqrc()+704<-opikndf2()+1216<-opitsk()+1280<-opiino()+1840<-opiodr()+2128<-opidrv()+1088<-sou2o()+336<-opimai_real()+224<-main()+368 oradebug short_stack
ksdxfstk()+48<-ksdxcb()+5776<-sspuser()+640<-<-_read_sys()+48<-_read()+224<-sntpread()+80<-ntprd()+192<-nsprecv()+736<-nsrdr()+272<-nsdo()+12512<-nsbrecv()+80<-nioqrc()+704<-opikndf2()+1216<-opitsk()+1280<-opiino()+1840<-opiodr()+2128<-opidrv()+1088<-sou2o()+336<-opimai_real()+224<-main()+368 oradebug short_stack
ksdxfstk()+48<-ksdxcb()+5776<-sspuser()+640<-<-_read_sys()+48<-_read()+224<-sntpread()+80<-ntprd()+192<-nsprecv()+736<-nsrdr()+272<-nsdo()+12512<-nsbrecv()+80<-nioqrc()+704<-opikndf2()+1216<-opitsk()+1280<-opiino()+1840<-opiodr()+2128<-opidrv()+1088<-sou2o()+336<-opimai_real()+224<-main()+368<-main_opd_entry()+80

oradebug dump errorstack 3;
Statement processed.

oradebug tracefile_name
/apps/oracle/admin/mac/udump/mac_ora_26914.trc

pstack 27957
27957: /apps/oracle/product/10.2.0/bin/oracle

——————————– lwpid : 5899967 ——————————-

0: c000000000405690 : _read_sys() + 0x30 (/usr/lib/hpux64/libc.so.1)
1: c000000000419f00 : _read() + 0xe0 (/usr/lib/hpux64/libc.so.1)
2: 40000000020d7ad0 : sntpread() + 0x50 (/apps/oracle/product/10.2.0/bin/oracle)
3: 40000000020d7a20 : ntprd() + 0xc0 (/apps/oracle/product/10.2.0/bin/oracle)
4: 400000000aa3b7c0 : nsprecv() + 0x2e0 (/apps/oracle/product/10.2.0/bin/oracle)
5: 4000000002307bb0 : nsrdr() + 0x110 (/apps/oracle/product/10.2.0/bin/oracle)
6: 4000000002300180 : nsdo() + 0x30e0 (/apps/oracle/product/10.2.0/bin/oracle)
7: 40000000022fcfa0 : nsbrecv() + 0x50 (/apps/oracle/product/10.2.0/bin/oracle)
8: 4000000002aff570 : nioqrc() + 0x2c0 (/apps/oracle/product/10.2.0/bin/oracle)
9: 4000000002bd3c80 : opikndf2() + 0x4c0 (/apps/oracle/product/10.2.0/bin/oracle)
10: 4000000002d9f3f0 : opitsk() + 0x500 (/apps/oracle/product/10.2.0/bin/oracle)
11: 400000000268c3f0 : opiino() + 0x730 (/apps/oracle/product/10.2.0/bin/oracle)
12: 4000000002cd6ec0 : opiodr() + 0x850 (/apps/oracle/product/10.2.0/bin/oracle)
13: 400000000260fd80 : opidrv() + 0x440 (/apps/oracle/product/10.2.0/bin/oracle)
14: 400000000260f390 : sou2o() + 0x150 (/apps/oracle/product/10.2.0/bin/oracle)
15: 40000000025fbfc0 : opimai_real() + 0xe0 (/apps/oracle/product/10.2.0/bin/oracle)
16: 400000000250c880 : main() + 0x170 (/apps/oracle/product/10.2.0/bin/oracle)
17: c000000000032f90 : main_opd_entry() + 0x50 (/usr/lib/hpux64/dld.so)

pstack 26914
26914: /apps/oracle/product/10.2.0/bin/oracle

——————————– lwpid : 5898924 ——————————-

0: c000000000405690 : _read_sys() + 0x30 (/usr/lib/hpux64/libc.so.1)
1: c000000000419f00 : _read() + 0xe0 (/usr/lib/hpux64/libc.so.1)
2: 40000000020d7ad0 : sntpread() + 0x50 (/apps/oracle/product/10.2.0/bin/oracle)
3: 40000000020d7a20 : ntprd() + 0xc0 (/apps/oracle/product/10.2.0/bin/oracle)
4: 400000000aa3b7c0 : nsprecv() + 0x2e0 (/apps/oracle/product/10.2.0/bin/oracle)
5: 4000000002307bb0 : nsrdr() + 0x110 (/apps/oracle/product/10.2.0/bin/oracle)
6: 4000000002300180 : nsdo() + 0x30e0 (/apps/oracle/product/10.2.0/bin/oracle)
7: 40000000022fcfa0 : nsbrecv() + 0x50 (/apps/oracle/product/10.2.0/bin/oracle)
8: 4000000002aff570 : nioqrc() + 0x2c0 (/apps/oracle/product/10.2.0/bin/oracle)
9: 4000000002bd3c80 : opikndf2() + 0x4c0 (/apps/oracle/product/10.2.0/bin/oracle)
10: 4000000002d9f3f0 : opitsk() + 0x500 (/apps/oracle/product/10.2.0/bin/oracle)
11: 400000000268c3f0 : opiino() + 0x730 (/apps/oracle/product/10.2.0/bin/oracle)
12: 4000000002cd6ec0 : opiodr() + 0x850 (/apps/oracle/product/10.2.0/bin/oracle)
13: 400000000260fd80 : opidrv() + 0x440 (/apps/oracle/product/10.2.0/bin/oracle)
14: 400000000260f390 : sou2o() + 0x150 (/apps/oracle/product/10.2.0/bin/oracle)
15: 40000000025fbfc0 : opimai_real() + 0xe0 (/apps/oracle/product/10.2.0/bin/oracle)
16: 400000000250c880 : main() + 0x170 (/apps/oracle/product/10.2.0/bin/oracle)
17: c000000000032f90 : main_opd_entry() + 0x50 (/usr/lib/hpux64/dld.so)

uname -a
HP-UX B.11.31 U ia64 unlimited-user license

Workaround:

没有找到直接解决该问题的方法, (从HP-UX上迁移走似乎是一种不可行的方案),可以通过configure autobackup off绕过该问题,在RMAN备份脚本中加入backup current controlfile可以起来类似的作用。

如何跟踪Oracle动态服务注册

如何trace Oracle PMON进程动态注册过程?这个问题我想到2个答案,对PMON做event trace或者采用Oracle Network Server因该都可以达到目的。
让我们来实践一下!

Oracle Network Server Trace模式

1. 启用Oracle SqlNet服务器端trace,这需要我们修改sqlnet.ora配置文件

[maclean@rh2 ~]$ echo "TRACE_LEVEL_SERVER = 16
> TRACE_FILE_SERVER = SERVER
> TRACE_DIRECTORY_SERVER= /home/maclean/ntrc"  > $ORACLE_HOME/network/admin/sqlnet.ora
[maclean@rh2 ~]$ cat $ORACLE_HOME/network/admin/sqlnet.ora
TRACE_LEVEL_SERVER = 16
TRACE_FILE_SERVER = SERVER
TRACE_DIRECTORY_SERVER= /home/maclean/ntrc

2. 触发trace

SQL> conn / as sysdba
Connected.

SQL>  select spid from V$process, V$session where audsid=userenv('SESSIONID') and paddr=addr;
SPID
------------
4264

SQL> alter system register;
System altered.

/*在之前指定的TRACE_DIRECTORY_SERVER目录下将出现形如server_$spid.trc的trace文件*/

[maclean@rh2 ntrc]$ cat server_4290.trc|grep nsprecv|grep -A 2 -B 2 reg
[20-AUG-2010 10:42:53:896] nsprecv: 00 00 00 00 00 15 61 6C  |......al|
[20-AUG-2010 10:42:53:896] nsprecv: 74 65 72 20 73 79 73 74  |ter.syst|
[20-AUG-2010 10:42:53:896] nsprecv: 65 6D 20 72 65 67 69 73  |em.regis|
[20-AUG-2010 10:42:53:896] nsprecv: 74 65 72 01 00 00 00 01  |ter.....|
[20-AUG-2010 10:42:53:896] nsprecv: 00 00 00 00 00 00 00 00  |........|

/*分析该trace文件后可以匹配到以上动态注册语句*/

/*记得将server端sqlnet trace设置disable掉*/

[maclean@rh2 ntrc]$ echo "" > $ORACLE_HOME/network/admin/sqlnet.ora

PMON TRACE模式
1.找出PMON的进程号

SQL> select spid ,program from v$process where program like '%PMON%';

SPID         PROGRAM
------------ ------------------------------------------------
4050         oracle@rh2 (PMON)

2.执行10257事件trace

SQL> oradebug setospid 4050;
Oracle pid: 2, Unix process pid: 4050, image: oracle@rh2 (PMON)

SQL> oradebug event 10257 trace name context forever,level 16;
Statement processed.

SQL> alter system register;
System altered.

SQL> oradebug event 10257 trace name context off;
Statement processed.

SQL> oradebug tracefile_name;
/s01/10gdb/admin/YOUYUS/bdump/youyus_pmon_4050.trc

cat /s01/10gdb/admin/YOUYUS/bdump/youyus_pmon_4050.trc
.......................
kmmlrl: register now
kmmgdnu: cXDB
         goodness=0, delta=1,
         flags=0x5:unblocked/not overloaded, update=0x6:G/D/-
kmmgdnu: YOUYUS
         goodness=0, delta=1,
         flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: nsgr update returned 0
kmmlrl: nsgr register returned 0

/*可以看到注册了YOUYUS和cXDB2个service的过程*/

若无法正常注册可以按照以下步骤检查:
1. 检查Oracle Net命名方式是否正确设置,该参数由sqlnet.ora配置文件中的NAMES.DIRECTORY_PATH指定。
2. 确认SQL NET配置文件如sqlnet.ora,tnsnames.ora等的位置正确配载。
3. 检查是否设置了TNS_ADMIN环境变量,该变量将影响配置文件的正确路径。
4. 确认LOCAL_LISTENER或REMOTE_LISTENER中指定的service可以被正常tnsping通,若以上参数未指定值则尝试步骤6。
5. 确认主机名可以被正确解析为ip地址,如以下示例:

C:\Users\weisly>nslookup www.oracle.com
服务器:  dir-605
Address:  192.168.0.1

非权威应答:
名称:    a398.g.akamai.net.0.1.cn.akamaitech.net
Addresses:  117.104.136.98
          117.104.136.128
Aliases:  www.oracle.com
          www.oracle.com.edgesuite.net
          a398.g.akamai.net

6. 修改LOCAL_LISTENER为非服务别名解析的形式,这样可以排除配置文件可能造成的影响,从而确认问题是否由配置文件设置不当而引起:

SQL> conn / as sysdba
Connected.

SQL> alter system set local_listener='(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rh2)(PORT=1521)))';
System altered.

SQL> alter system set REMOTE_LISTENER=' (ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)(HOST = rh1)(PORT = 1521))(ADDRESS = (PROTOCOL = TCP)(HOST = rh2)(PORT = 1521)))';
System altered.

7. 将主机名替换为ip地址并测试是否解决问题。
8. 以IPC协议替代常见的TCP协议测试,绕过TCP和主机名等因素可能造成的问题,以确认问题是否由这些因素引起:

SQL> alter system set local_listener='(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=KEY1)))';
System altered.

/*其中KEY值需要和listener.ora中指定的一致。*/

SQL> host
[maclean@rh2 admin]$ lsnrctl status KEY1

LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 20-AUG-2010 11:30:06
Copyright (c) 1991, 2007, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=KEY1)))
STATUS of the LISTENER
------------------------
Alias                     KEY1
Version                   TNSLSNR for Linux: Version 10.2.0.4.0 - Production
Start Date                20-AUG-2010 11:29:15
Uptime                    0 days 0 hr. 0 min. 50 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /s01/10gdb/network/admin/listener.ora
Listener Log File         /s01/10gdb/network/log/key1.log
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=KEY1)))
Services Summary...
Service "YOUYUS" has 1 instance(s).
  Instance "YOUYUS", status READY, has 1 handler(s) for this service...
Service "YOUYUS_XPT" has 1 instance(s).
  Instance "YOUYUS", status READY, has 1 handler(s) for this service...
Service "cXDB" has 1 instance(s).
  Instance "YOUYUS", status READY, has 10 handler(s) for this service...
The command completed successfully

实在不行的话,我们还是提SR吧!

沪ICP备14014813号-2

沪公网安备 31010802001379号