EM dbconsole显示数据文件使用量为负数?

有网友在itpub上反应11g中查看EM dbconsole显示数据文件的使用量为负数,如图:

 

 

实际上导致(EM ( EM Home page -> Administration Tab -> Storage -> Datafiles ))显示负的使用量的Bug在11g和10g中均存在,这里将这些bug列出来以供参考:

 

10g Enterprise Manager Shows Negative Utilisation For An Oracle Datafile.

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3
This problem can occur on any platform.
Symptoms

EM ( EM Home page -> Administration Tab -> Storage -> Datafiles ) shows negative 
utilization for some files. Value in Used (MB) column is negative.

Cause
This is because of Bug:5083393.

If the object is dropped and not purged, the value of FILE_ID / REL_FNO may be wrong in 
DBA_FREE_SPACE and this cause EM to display wrong value in Used (MB) column.

Solution
If available, download Patch:5083393 to resolve this issue.
Note that this patch is available on 10.2.0.2/10.2.0.3 for some platforms at the 
time of this writing this article (JULY 2007).

Workaround :
-------------
Purge all objects from the recycle bin.

$sqlplus "/as sysdba"
sql> purge dba_recyclebin 

Hdr: 5083393 10.1.0.4 RDBMS 10.1.0.4 SPACE PRODID-5 PORTID-912
Abstract: VALUE OF FILE_ID AND RELATIVE_FNO OF DBA_FREE_SPACE IS DIFFERENT

PROBLEM:
--------
When the table is dropped, the value of the FILE_ID column and
the RELATIVE_FNO column of dba_free_space might be different.

SQL> select * from dba_free_space where tablespace_name = 'TBS_TEST1';

TABLESPACE_NAME FILE_ID BLOCK_ID    BYTES BLOCKS RELATIVE_FNO
--------------- ------- -------- -------- ------ ------------
E
E
E
TBS_TEST1             6      105    65536      8            6
TBS_TEST1             6      113    65536      8            6
TBS_TEST1             6      121    65536      8            6
TBS_TEST1             6        9    65536      8            7   purge table tbs_test01;

Table purged.

SQL> select * from dba_free_space where tablespace_name = 'TBS_TEST1';

TABLESPACE_NAME FILE_ID BLOCK_ID    BYTES BLOCKS RELATIVE_FNO
--------------- ------- -------- -------- ------ ------------
TBS_TEST1             6        9   983040    120            6
TBS_TEST1             7        9   983040    120            7
TBS_TEST1             8        9 20905984   2552            8

In a word, there seems to be a problem in the display of an free space
concerning the segment that exists in RECYCLE BIN.

WORKAROUND:
-----------
None

RELATED BUGS:
-------------
None

REPRODUCIBILITY:
----------------
  100% (both In-house and Ct' site) 
  Rep?     Platform               RDBMS Ver. 
  -------  ---------------------  ---------- 
  Y(100%)  912 Intel Windows NT   10.1.0.4
  Y(100%)  912 Intel Windows NT   10.1.0.5
  Y(100%)  912 Intel Windows NT   10.2.0.2
  Y(100%)   46 Linux x86          10.1.0.4

 

11gR2 EMCONSOLE SHOWS NEGATIVE VALUES WHEN VIEWING DATAFILE USAGE SPACE IN 11.2.0.1

 

Hdr: 9868375 11.2.0.1 STORAGE 11.2.0.1 PRODID-1366 PORTID-226 9548105
Abstract: EMCONSOLE SHOWS NEGATIVE VALUES WHEN VIEWING DATAFILE USAGE SPACE IN 11.2.0.1

PROBLEM:
--------
The display of the usage of the data files (Tablespaces -> Datafiles) shows 
values in the range negative. The fill rate is not shown correctly.

This is the same exact issue as explained in note 559414.1 but on top of 
11.2.0.1 and also the issue does not seem fixed when trying the workaround by 
clearing the recycle bin or by turning it off.

DIAGNOSTIC ANALYSIS:
--------------------
Checked running the statements from sqlplus they provided the correct value 
and checked this Bug 5549540 seems it should have been fixed in 11G but still 
having the same issue.

WORKAROUND:
-----------
To run the statements and check the data file usage from sqlplus.

 

11gR1 Grid Control Shows Duplicate Datafile Names Multiple times (Some With Negative Values) When Drilling Onto Tablespace Link

Applies to:
Enterprise Manager Grid Control - Version: 10.2.0.5 to 11.1.0.1 - Release: 10.2 to 11.1
Information in this document applies to any platform.
Symptoms

The list of datafiles displayed for a tablespace in EM Grid Control is wrong, so that some 
datafiles are listed repeatedly multiple times, and some of the sizes are wrong and showing a negative value.

This is seen from database home page => server tab => tablespaces.
Click on tablespace name, a list of datafiles associated with that tablespace should be returned.

Cause
If the SQL statement which Grid Control uses is run manually at Sqlplus, the same output of the 
same datafile listed multiple times can also be seen. 

Here you have the SQL statement launched by the Grid Control to collect the list of datafiles :

SELECT d.file_name, TO_CHAR((d.bytes / 1024 / 1024), '99999990.000'), 
NVL(TO_CHAR(((d.bytes - s.bytes) / 1024 / 1024), '99999990.000'), TO_CHAR((d.bytes / 1024 / 1024), '99999990.000')), 
d.file_id, d.autoextensible, d.increment_by, d.maxblocks FROM sys.dba_data_files d, 
(SELECT file_id, SUM(bytes) bytes FROM sys.dba_free_space WHERE tablespace_name = '&tablespaceName' 
GROUP BY file_id) s WHERE (s.file_id (+)= d.file_id) AND (d.tablespace_name = '&tablespaceName')

This is the issue described in following bug :

Bug 8499373 GRID CONTROL SHOWS SAME DATAFILE NAME REPEATEDLY WHEN DRILL INTO A TABLESPACE

This bug is actually closed as a duplicate of a database bug Bug 8570838 WRONG RESULTS FOR QUERY AGAINST 
DBA_DATA_FILES IN 10.2.0.4
The root cause of the database bug is :
Bug 7155655 Intermittent Wrong Results/Plan Instability from Join Predicate Push Down (JPPD)

Solution

Apply the Patch 7155655. 

Note :
Depending on platform it may be necessary to apply a merge patch for this - eg. in some cases it was 
necessary to apply Patch 7278117.

Hdr: 9748363 11.2.0.1 STORAGE 11.1.0.1 PRODID-1366 PORTID-226 9548105
Abstract: NEGATIVE VALUES REPORTED BY GRIDCONTROLFOR TABLESPACE WITH AT LEAST TWO DATAFILE

*** 05/25/10 09:46 am ***
TAR
---
3-1772640611

Problem Description
-------------------
The Used (MB) column on the datafiles summary webpage in the 11.2 grid 
control 
shows negative values for a tablespace containing at least two datafiles

Environment Information
-----------------------

Test Case Step-by-Step Instructions
-----------------------------------

Test Case Location
------------------

Diagnostic Analysis
-------------------
This is the same as bug 9548105 except for this is for GridControl vs 
dbconsole.
SELECT d.file_name, TO_CHAR((d.bytes / 1024 / 1024), '99999990.000'),
NVL(TO_CHAR(((d.bytes - s.bytes) / 1024 / 1024), '99999990.000'),
TO_CHAR((d.bytes / 1024/ 1024), '99999990.000')), d.file_id, 
d.autoextensible,
d.increment_by, d.maxblocks FROM sys.dba_data_files d,  
(SELECT ts.name tablespace_name, SUM(e.length * ts.blocksize) bytes  
FROM sys.fet$ e, sys.ts$ ts  WHERE ts.ts# = e.ts# GROUP BY ts.name  
UNION ALL  SELECT ts.name tablespace_name, SUM(e.blocks * ts.blocksize) bytes 

FROM sys.dba_lmt_free_space e, sys.ts$ ts  WHERE ts.ts# = e.tablespace_id 
GROUP BY ts.name) s
WHERE (s.tablespace_name = d.tablespace_name) AND (d.tablespace_name = :1)

Performance
-----------

NLS Information
---------------

-------

Log Files Location
------------------

Reproducibility
---------------
yes

URL
---

Did you test with the latest version?
-------------------------------------

Available Workarounds
---------------------
none

Related Bugs
------------
9548105

Oracle内部错误:ORA-00600[13013][5001]故障诊断一例

周五被叫到客户现场解决一套10.2.0.4 RAC数据库上的ORA-00600[13013]内部错误问题,这个问题同事已经在上午通过远程拨号了解过情况,
初步判断是索引存在讹误corruption引起的600。

前期诊断

同事在我抵达现场之前已经做了初步的诊断,该10.2.0.4上的RAC系统主用节点的告警日志中多次出现ORA-00600:[13013], [5001]、ORA-00600:[qertbFetchByRowID]及ORA-00600: [25027] 等内部错误,具体的日志如下:

Fri Sep 16 01:16:54 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Fri Sep 16 01:16:55 2011
Trace dumping is performing id=[cdmp_20110916011655]
Fri Sep 16 01:17:06 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []

以下为13013 trace

ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Current SQL statement for this session:
UPDATE CUST_SUBSCRB_PERSONAL A SET a.cust_mobile='A04204441' WHERE a.subscrbid=71524739
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               1000D8FE4 ? 0785F7165 ?
10501ED28 ? FFFFFFFFFFF6748 ?
06C626E72 ?
ksesic6+0060         bl       kgesiv               110072D08 ? 7000000AC52B8F8 ?
7000000AC52B3A8 ? 07FFFFFFF ?
000000000 ?
updThreePhaseExe+0c  bl       01F9D7A8
4c
updexe+041c          bl       updThreePhaseExe     7000000AC52B8F8 ? 000000000 ?
110540128 ? FFFFFFFFFFF7C80 ?
opiexe+27d8          bl       updexe               7000000AC52B8F8 ?
FFFFFFFFFFF82B8 ?
kpoal8+0edc          bl       opiexe               FFFFFFFFFFFB434 ?
FFFFFFFFFFFB198 ?
FFFFFFFFFFF9608 ?
opiodr+0ae0          bl       _ptrgl
ttcpip+1020          bl       _ptrgl
opitsk+1124          bl       01F9F2A0
opiino+0990          bl       opitsk               000000000 ? 000000000 ?

============
Plan Table
============
------------------------------------------------------+-----------------------------------+
| Id  | Operation           | Name                    | Rows  | Bytes | Cost  | Time      |
------------------------------------------------------+-----------------------------------+
| 0   | UPDATE STATEMENT    |                         |       |       |     1 |           |
| 1   |  UPDATE             | CUST_SUBSCRB_PERSONAL   |       |       |       |           |
| 2   |   INDEX UNIQUE SCAN | CUST_SUBSCRB_PERSONAL_PK|     1 |    33 |     1 |  00:00:01 |
------------------------------------------------------+-----------------------------------+

===============================================================================================================

同时还伴随有ORA-00600:[qertbFetchByRowID]出现

Fri Sep 16 01:08:57 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT * from CUST_SUBSCRB_PERSONAL A WHERE a.subscrbid=307557025
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              578318D500000003 ?
FFFFFFFFFFF80D0 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgerinv+00dc         bl       _ptrgl
kgeasnmierr+004c     bl       kgerinv              11048AE68 ? 000000000 ?
000000001 ? 104BDEC18 ?
110489398 ?
qertbFetchByRowID+0  bl       03F2EF38
d34
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?

============
Plan Table
============
---------------------------------------------------------------+-----------------------------------+
| Id  | Operation                    | Name                    | Rows  | Bytes | Cost  | Time      |
---------------------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT             |                         |       |       |     1 |           |
| 1   |  TABLE ACCESS BY INDEX ROWID | CUST_SUBSCRB_PERSONAL   |     1 |   287 |     1 |  00:00:01 |
| 2   |   INDEX UNIQUE SCAN          | CUST_SUBSCRB_PERSONAL_PK|     1 |       |     1 |  00:00:01 |
---------------------------------------------------------------+-----------------------------------+

===============================================================================================================

此外还会出现ORA-00600: [25027], [6], [1443670797], [], [], [], [], []

Fri Sep 16 15:06:00 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_1872436.trc:
ORA-00600: internal error code, arguments: [25027], [6], [1443670797], [], [], [], [], []

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [25027], [6], [4278267905], [], [], [], [], []
Current SQL statement for this session:
select q.subscrbid,q.serv_lvl,q.cust_lvl
from cust_subscrb_personal q
where q.serv_lvl is null
or q.cust_lvl is null
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               000000000 ? 000000000 ?
000000000 ? 105673724 ?
00000000C ?
ksesic2+0060         bl       kgesiv               000000245 ? 0FFFFFFFF ?
700000010013BD0 ?
700000010018078 ? 110000998 ?
krtd2abh+040c        bl       ksesic2              61C3000061C3 ? 000000000 ?
000000006 ? 000000000 ?
0FF013001 ? 000000FA0 ?
105673FB8 ? 105673FA8 ?
kcbgtcr+24a0         bl       krtd2abh             700000471D95738 ?
7000001A9F0F050 ?
FFFFFFFFFFF78E0 ?
700000471D93500 ? 000000003 ?
ktrget+04ac          bl       kcbgtcr              1105A75A8 ? 000002000 ?
000000000 ? 000001940 ?
kdsgrp+00bc          bl       ktrget               000000064 ? 110000998 ?
700000010008000 ?
kdsgnp+0444          bl       kdsgrp               000000000 ? 000000000 ?
FFFFFFFFFFF80D0 ?
kafger+08fc          bl       kdsgnp               000000000 ? 000000000 ?
111119E38 ?
kdstf1100101km+0f58  bl       kafger               1105A74D8 ? 111119E38 ?
FFFFFFFFFFF8190 ? 000000000 ?
70000046DCA8488 ? 000000000 ?
0FFFF9608 ? 110471948 ?
kdsttgr+1a04         bl       kdstf1100101km       111119E38 ? 0000102A1 ?
70000042DF6DC50 ?
FFFFFFFFFFF8418 ? 0FFFFFFFF ?
000001FE8 ? 000000000 ?
000000000 ?
qertbFetch+09b8      bl       kdsttgr              111119E38 ? 000000418 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000418 ? BE4610195B2C ?
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?

============
Plan Table
============
--------------------------------------------------+-----------------------------------+
| Id  | Operation          | Name                 | Rows  | Bytes | Cost  | Time      |
--------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |                      |       |       |  4898 |           |
| 1   |  TABLE ACCESS FULL | CUST_SUBSCRB_PERSONAL|    15 |   165 |  4898 |  00:00:59 |
--------------------------------------------------+-----------------------------------+

我们先从ORA-00600:[13013], [5001]内部错误入手,Mos Note <How to resolve ORA-00600 [13013], [5001] [ID 816784.1]> 比较翔实地介绍了该600错误,其argument的具体含义如下:

Arg [a] Passcount
Arg [b] Data Object number
Arg [c] Tablespace Relative DBA of block containing the row to be updated
Arg [d] Row Slot number
Arg [e] Relative DBA of block being updated (should be same as [c])
Arg [f] Code

我们这里的[13013], [5001], [66209], [50730514], [23], [50730514], [3], [] 就是

data_object_id = 66209 = 0x102A1
RDBA = 50730514 即  3061612 fild id=12  block id =398866
Row Slot number = 23
Code = 3

翻译过来就是对data_object_id为66209 的对象在12号数据文件上的398866块的第23行做代码为3的检测时发现了异常,通过data_object_id可以定位到该对象就是正在执行的SQL语句中的CUST_SUBSCRB_PERSONAL表。

ORA-00600:[13013], [5001]可能由表上的数据行或索引中的记录逻辑讹误(logical Corruption)所造成,这里要强调一下是逻辑讹误,而非物理损坏。逻辑讹误一般是由于Oracle Bug或者memory Corruption 所引起的,该ORA-00600:[13013], [5001]已知可能由5085288和4549673等多个Bug 引发,在版本10.2.0.4 上这些bug 触发概率较高。

伴随ORA-00600:[13013], [5001]发生的ORA-600 [qertbFetchByRowID]常由索引损坏引起,介绍了该问题:

ORA-600 [qertbFetchByRowID] Select Queries on 10.2.0.4 Database [ID 755592.1]

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.4 and later   [Release: 10.2 and later ]

Symptoms

The following error was received during SELECT operation:

ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

The call stack contains:

qertbFetchByRowID <- qergiFetch <- rwsfcd <- qerhjFetch
< - qerjotFetch <- rwsfcd <- qertqoFetch <- qerpxSlaveFetch <- qerpxFetch

Cause
Index corruption.

Solution

1.
Analyze the affected object to look for logical corruption.

SQL> analyze table TABLENAME validate structure cascade;

This will place an exclusive lock on the table.

If an error is reported try dropping and recreating the index(es) first.

2.
If this problem is occurring on a version less than 10.2.0.4, ensure the patch for
Bug 4883635 has been applied if Materialized Views are used.

BUG:5866783 - ORA-00600: [QERTBFETCHBYROWID] ON SELECT FROM ONE TABLE

而ORA-00600: [25027]也是类似的问题:

ORA-600 [25027] [ID 284433.1]

PURPOSE:
  This article represents a partially published OERI note.

  It has been published because the ORA-600 error has been
  reported in at least one confirmed bug.

  Therefore, the SUGGESTIONS section of this article may help
  in terms of identifying the cause of the error.

  This specific ORA-600 error may be considered for full publication
  at a later date. If/when fully published, additional information
  will be available here on the nature of this error.

ERROR:

  Format: ORA-600 [25027] [a] [b]

VERSIONS:
  versions 9.2 and above

ARGUMENTS:
  Arg [a]  Tablespace Number (TSN)
  Arg [b]  Decimal Relative Data Block Address (RDBA)

SUGGESTIONS:

 1. If the Arg [b] (the RDBA) is 0 (zero), then this could be due to fake indexes.

  The following query will list fake indexes:

     select do.owner,do.object_name, do.object_type,sysind.flags
     from dba_objects do, sys.ind$ sysind
     where do.object_id = sysind.obj#
     and bitand(sysind.flags,4096)=4096;

If the above query returns any rows, check the objects involved and consider
dropping them as they can cause this error. 

2. Run analyze table validate structure on the table referenced in the Current SQL statement in
    the related trace file.

  If the Known Issues section below does not help in terms of identifying
  a solution, please submit the trace files and alert.log to Oracle
  Support Services for further analysis.

但是请注意在该故障示例中ORA-00600: [25027]给出的Arg [b] Decimal Relative Data Block Address (RDBA)是一个完全不相干的RDBA,具体原因尚不明确。

小插曲

这当中有一个小的插曲,原来客户这里的应用人员需要尽快在该CUST_SUBSCRB_PERSONAL表上执行一段查询语句,但是该语句一旦执行就会遇到ORA-600 [qertbFetchByRowID]错误,因为是周五小周末,所以如果今天无法运行的话,就要拖到下个礼拜了。

所以被要求优先解决该语句执行的问题,首先看了一下该语句的执行计划:

SQL> explain plan for select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------
Plan hash value: 3616548176

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |     1 |    31 |   231   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE               |                           |     1 |    31 |            |          |
|*  2 |   HASH JOIN                   |                           |   425 | 13175 |   231   (1)| 00:00:03 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |   425 |  8075 |   208   (0)| 00:00:03 |
|*  4 |     INDEX RANGE SCAN          | CUST_SUBSCRB_PERSONAL_2IX | 42462 |       |    20   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL          | WZY_BS20110916            | 39792 |   466K|    22   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("A"."SVCNUM"="B"."SVCNUM")
3 - filter(TO_NUMBER("A"."SERV_LVL")=0)
4 - access("A"."COUNTYID"='A00')

Note
-----
- dynamic sampling used for this statement

23 rows selected.

SQL> select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
  2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
                              *
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

以上执行计划中对表上的索引CUST_SUBSCRB_PERSONAL_2IX做了range scan后通过获得的rowid到表上去fetch记录(qertbFetchByRowID),在实际fetch by rowid 的时候引发了ORA-600错误。

因为需求较为紧急,所以我还是考虑能否使执行计划绕过该索引,投石问路给语句加上了RULE提示,再次执行:

SQL> select /*+ rule */
2  count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
3  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;

COUNT(*)
----------
11559

Execution Plan
----------------------------------------------------------
Plan hash value: 2851452146

-------------------------------------------------------------------
| Id  | Operation                     | Name                      |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |
|   1 |  SORT AGGREGATE               |                           |
|   2 |   NESTED LOOPS                |                           |
|   3 |    TABLE ACCESS FULL          | WZY_BS20110916            |
|*  4 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |
|   5 |     AND-EQUAL                 |                           |
|*  6 |      INDEX RANGE SCAN         | CUST_SUBSCRB_SVCNUM_2006  |
|*  7 |      INDEX RANGE SCAN         | CUST_SUBSCRB_PERSONAL_2IX |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - filter(TO_NUMBER("A"."SERV_LVL")=0)
6 - access("A"."SVCNUM"="B"."SVCNUM")
7 - access("A"."COUNTYID"='A00')

Note
-----
- rule based optimizer used (consider using cbo)

想不到居然执行成功了,但是执行计划当中仍有CUST_SUBSCRB_PERSONAL_2IX这个索引,这让我潜意识中认识到很可能是表而非索引存在逻辑讹误。

不管怎么说至少解了燃眉之急,先把这个方法告诉应用人员,然后回过头来继续诊断。

初步验证索引

为了确定到底是表还是索引存在逻辑讹误,一般需要使用analyze table validate structure cascade命令以验证表和索引的结构,但是该命令会要求以共享方式锁住表(TM mode=4),对于更新频繁的生产系统中的关键应用表,这是不可接受的。同事在接手这个故障时已经考虑过该问题了,所以他推荐用查询的方式来检验到底是那些索引存在corruption,具体方法如下:

通过ROWID_CREATE 构造出ROWID

SQL> select dbms_rowid.ROWID_CREATE(1,66209,12,398866,23) from dual;

DBMS_ROWID.ROWID_C
------------------
AAAQKhAAMAABhYSAAX

SQL> select * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> SELECT INDEX_NAME,COLUMN_NAME FROM DBA_IND_COLUMNS WHERE TABLE_NAME='CUST_SUBSCRB_PERSONAL' ORDER BY 1;

INDEX_NAME                     COLUMN_NAME
------------------------------ ----------------------------------------
CUST_SUBSCRB_CARD_NUM          VIP_CARD_NUM
CUST_SUBSCRB_MANAGERID_2006    MANAGERID
CUST_SUBSCRB_PERSONAL_2IX      COUNTYID
CUST_SUBSCRB_PERSONAL_3IX      CUST_LVL
CUST_SUBSCRB_PERSONAL_PK       SUBSCRBID
CUST_SUBSCRB_SERV_COUNTY       SERV_COUNTYID
CUST_SUBSCRB_SVCNUM_2006       SVCNUM
IDX_CUST_SUBSCRB_PERSONAL_01   SERV_LVL

之后强制使用index提示使用不同的索引

SQL> select /*+ INDEX(a CUST_SUBSCRB_CARD_NUM ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_MANAGERID_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_3IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

SQL> select /*+ INDEX(a CUST_SUBSCRB_SERV_COUNTY) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

SQL> select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

可以看到使用以上方法在其中4个索引上遇到了ORA-00600:[qertbFetchByRowID]。 但是使用该方法还是无法彻底搞清楚逻辑到底存在于table还是index上?

 

必要的工具

 

这里我们要介绍一下validate structure还存在online在线使用的选项,在online模式下整个验证过程不会在表上加任何TM锁,这一点具体可以使用10704 事件来证明:

 

10704, 00000, "Print out information about what enqueues are being obtained"
// *Cause:  When enabled, prints out arguments to calls to ksqcmi and
//          ksqlrl and the return values.
// *Action: Level indicates details:
//   Level: 1-4: print out basic info for ksqlrl, ksqcmi
//          5-9: also print out stuff in callbacks:  ksqlac, ksqlop
//          10+: also print out time for each line

SQL> create table validate_me (t1 int);

Table created.

SQL> create index ind_validate_me on validate_me(t1);

Index created.

SQL> insert into validate_me select rownum  from dba_tables where rownum<201;

200 rows created.

SQL> commit;

Commit complete.

SQL> oradebug setmypid;
Statement processed.

SQL> oradebug event 10704 trace name context forever,level 10;
Statement processed.

SQL> analyze table validate_me validate structure cascade online;

Table analyzed.

SQL> oradebug tracefile_name;
c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_3020.trc

g11r2_ora_3020.trc=========================================================================

*** 2011-09-18 20:55:25.373
Oradebug command 'event 10704 trace name context forever,level 10' console output: <none>

*** 2011-09-18 20:55:49.765
ksqgtl *** TX-00060005-000006a9 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x31434194, ktcdix=2147483647, topxcb=0x31434194
ktcipt(topxcb)=0x0

*** 2011-09-18 20:55:49.766
ksucti: init txn DID from session DID
ksqgtl:
ksqlkdid: 0001-001B-00000006

*** 2011-09-18 20:55:49.766
*** ksudidTrace: ksqgtl
ktcmydid(): 0001-001B-00000006
ksusesdi:   0000-0000-00000000
ksusetxn:   0001-001B-00000006
ksqgtl: RETURNS 0

*** 2011-09-18 20:55:49.766
ksqrcl: TX,60005,6a9
ksqrcl: returns 0

但是validate strucutre online也有它的缺点,那就是在线模式下结构验证命令将不填充索引的状态信息到index_stats视图,如:

SQL> analyze index  ind_validate_me validate structure  ;

Index analyzed.

SQL> select count(*) from index_stats;

COUNT(*)
----------
1

SQL> conn / as sysdba
Connected.
SQL> analyze index  ind_validate_me validate structure  online;

Index analyzed.

SQL> select count(*) from index_stats;

COUNT(*)
----------
0

但是因为我们这里只要用到validate structure的结构验证功能,而对索引的详细状态没有兴趣,所以我们可以充分利用该online模式。

 

具体验证

 

使用validate structure online具体验证该问题表和表上的索引:

先仅对表进行验证,以区分到底是表还是索引存在逻辑讹误 

SQL> analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online;
analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online
*
ERROR at line 1:
ORA-01498: block check failure - see trace file

若验证发现问题会出现ORA-01498错误,并产生trace 文件

kdrchk:  row is marked as both a Key and Clustered
prow=0x7000001f241c45c flag=0xff
Block Checking: DBA = 50730514, Block Type = KTB-managed data block
data header at 0x7000001f241c07c
kdbchk: bad row tab 0, slot 23
Block header dump:  0x03061612
Object id on Block? Y
seg/obj: 0x102a1  csc: 0xb43.ecde68ca  itc: 3  flg: E  typ: 1 - DATA
brn: 0  bdba: 0x3061609 ver: 0x01 opc: 0
inc: 0  exflg: 0

............

可以看到问题发生在  23 号槽位上

tab 0, row 23, @0x3e0
tl: 4 fb: KCHDFLPN lb: 0x1  cc: 0 cki: 1
tab 0, row 24, @0x3de
tl: 2 fb: --HD---N lb: 0x30
tab 0, row 25, @0x3dc
tl: 2 fb: --HD---- lb: 0x30
tab 0, row 26, @0x3d8
tl: 4 fb: KCHDFLPN lb: 0xff  cc: 0 cki: 255

这里居然第23个row piece的 flag 是KCHDFLPN 即实际fb = 0xff,也就是该row piece同时被标记为key和clustered(row is marked as both a Key and Clustered),因此不管当服务进程尝试update该问题行记录或者通过ROWID访问该row时都出现了ORA-00600错误,虽然其错误代码不同,但都是由于该数据块中第23行记录的flag存在讹误引起的。

关于该ORA-00600:[13013], [5001]问题的成因和解决方法,更多内容可以参考<手工模拟Oracle数据块逻辑讹误引发ORA-00600:[13013], [5001]一例>一文。

 

待修订!

MMON Slave日志[KDL_TRIM]: NEWLEN: MESSAGES

有网友反映windows 2003 上的10.2.0.4数据库,MMON的Slave后台进程报带有NEWLEN: MESSAGES的相关日志,如:


*** ACTION NAME:(Auto ADDM Slave Action) 2011-09-09 18:00:12.053
*** MODULE NAME:(MMON_SLAVE) 2011-09-09 18:00:12.053
*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-09 18:00:12.053
*** SESSION ID:(797.606) 2011-09-09 18:00:12.053
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0

这一般是由于10.2.0.4上已经confirmed的bug:”Bug 6972843 – Unnecessary trace files with “[kdl_trim]: newlen” text in them [ID 6972843.8]”所引起的,在bug目前确认仅在10.2.0.4上发生,而到10.2.0.4.1中已经被修复了。

主要是MMON的辅助slave进程会不必要地输出如”[kdl_trim]: newlen: 0″的日志信息:


Unnecessary trace files may be generated with entries that read:'[kdl_trim]: newlen: '.
 
Hdr: 9600827 10.2.0.4 RDBMS 10.2.0.4 RAM LOBS PRODID-5 PORTID-212
Abstract: [KDL_TRIM]: NEWLEN: MESSAGES IN MMON SLAVES TRACES

PROBLEM:
--------
MMON slave processes traces contains the following entries despite patch 
6972843 applied

*** NAME:(Auto ADDM Slave Action) 2010-04-14 06:00:42.899
*** NAME:(MMON_SLAVE) 2010-04-14 06:00:42.899
*** NAME:(SYS$BACKGROUND) 2010-04-14 06:00:42.899
*** ID:(2472.45569) 2010-04-14 06:00:42.899
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0

DIAGNOSTIC ANALYSIS:
--------------------
- Opatch contains patch 6972843 applied

WORKAROUND:
-----------
none

RELATED BUGS:
-------------
Bug 6972843 - HOURLY TRACE FILE CREATED IN BDUMP FOR AUTO ADDM SLAVE ACTION

该bug可以被忽略,但是如果日志产生过于频繁的话可以考虑升级10.2.0.4的PSU或者直接升级到10.2.0.5。

CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM

客户的一套11.2.0.1 RAC系统采用ASM diskgroup 存放ocr和votedisk,该REG diskgroup中的某个LUN disk由于硬件的原因损坏了,导致冗余的votedisk表决磁盘有一个处于OFFLINE状态,客户希望能删除该OFFLINE的votedisk并新增一个可用的。

在删除该votedisk文件时出现了CRS-4258的错误,错误如下:

crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. OFFLINE  5b3380d6367e4f94bf19e9db5f2f684e ()  []
 2. ONLINE   6802e6d139354fb3bf95725dd01a02fd (/dev/ocr2) [REG]
 3. ONLINE   a433d51ebd2d4facbfc8e95b017f5393 (/dev/asm-disk1) [REG]
 4. ONLINE   3784d344bffa4f6ebff21c4dd3c873bd (/dev/asm-disk2) [REG]
Located 4 voting disk(s).

crsctl delete css votedisk 5b3380d6367e4f94bf19e9db5f2f684e
CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM

居然无法移除ASM存储上的voting files,太搞笑了。

客户在MOS上找到了CRS-4258相关问题的Note:

CRS-4258: Addition and Deletion of Voting Files are not Allowed Because the Voting Files are on ASM in 11gR2 [ID 1060146.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.1 to 11.2.0.1 - Release: 11.2 to 11.2
Information in this document applies to any platform.
Symptoms

CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM in 11gR2.


Changes
Stale voting files are seen after accidently dropping one of ASM disks belonging to the ASM diskgroup where voting files are stored.
And CRS-4258 occurs when trying to delete the stale voting files using crsctl delete css votedisk FUID.

[root@grid]# crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 5b91aad0a2184f3dbfa8f970e8ae4d49 (/dev/oracleasm/disks/ASM10) [PLAY]
2. ONLINE 53b1b40b73164f9ebf3f498f6d460187 (/dev/oracleasm/disks/ASM9) [PLAY]
3. OFFLINE 82dfd04b96f14f6dbf36f5a62b118f61 () []

[root@grid]# crsctl delete css votedisk 82dfd04b96f14f6dbf36f5a62b118f61
CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM
Cause

1. Seeing stale voting files is due to bug 9024611.

2. "delete" command is not available , only "replace" command is available when voting files are stored on  ASM diskgroup.    

    Please see Oracle Clusterware Administration and Deployment Guide11g Release 2 (11.2)

Solution


1. This issue is permanently fixed in 11.2.0.2.0.

2. Apply patch 9024611. Please contact Oracle support if this patch is not available on your platform.

3. If CSS has stale voting files even after applying patch 9024611, do the following workaround -

WORKAROUND:
Do something to trigger ASM to try to relocate the voting file.

e.g)  $ crsctl replace votedisk  +asm_disk_group   --- Put available ASM diskgroup

        $ crsctl query css votedisk         --- Check if voting files are all online on the new ASM diskgroup
        $ crsctl replace votedisk +PLAY    -- Put the original ASM diskgroup where voting files were 

4. If the workaround above cannot be followed for any reason then you can request the patch for unpublished bug 9409327 for your platform.

References
BUG:9294664 - NOT ABLE TO REMOVE THE VOTEDISK WHICH IS OFFILNE

Hdr: 9294664 11.2.0.1 PCW 11.2.0.1 ADMUTL PRODID-5 PORTID-226 9024611
Abstract: NOT ABLE TO REMOVE THE VOTEDISK WHICH IS OFFILNE

PROBLEM:
--------
crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   9f7f4f7f798d4f69bfe31653894421a2 (ORCL:GRID1) [GRID]
2. OFFLINE  a9b785a59c3c4f67bf15babc67ffb79a () []
3. OFFLINE  29988f37fa794f12bfea3f3672c99609 () []
4. ONLINE   a8b3a040195c4f54bfce8ef21bd4fa07 (ORCL:GRID3) [GRID]
5. ONLINE   a1e4fbd9df6f4f67bf8fc12fe9780721 (ORCL:GRID2) [GRID]
Located 5 voting disk(s).


[root@sdc-drrac01 grid]# crsctl delete css votedisk 
a9b785a59c3c4f67bf15babc67ffb79a
CRS-4258: Addition and deletion of voting files are not allowed because the 
voting files are on ASM

DIAGNOSTIC ANALYSIS:
--------------------
Ct is performing some voting disk failover scenarios in which he has removed 
the 2 votedisk which were on ASM buy drop disk using asmlib and after that 
recreating the disk again and start the cluster in exclusive mode and start 
the ASM and mount the diskgourp So that rebalancing has been done but after 
that

 crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   9f7f4f7f798d4f69bfe31653894421a2 (ORCL:GRID1) [GRID]
2. OFFLINE  a9b785a59c3c4f67bf15babc67ffb79a () []
3. OFFLINE  29988f37fa794f12bfea3f3672c99609 () []
4. ONLINE   a8b3a040195c4f54bfce8ef21bd4fa07 (ORCL:GRID3) [GRID]
5. ONLINE   a1e4fbd9df6f4f67bf8fc12fe9780721 (ORCL:GRID2) [GRID]
Located 5 voting disk(s).

and not able to drop the vote disk which is offiline 

WORKAROUND:
-----------
n/a

RELATED BUGS:
-------------
as per bug 9024611 tried the workaround:

but while running 

crsctl css votedisk delete 

we got syntax error and found that there is no command with crsctl css ...

这个Note说明不能移除ASM存储内voting files的问题在11.2.0.2.0上已经解决了,也可以通过安装one-off patch 9024611来修复。

但是实际在11.2.0.2上测试可以发现仍旧无法删除ASM上的voting files:

root@rh2 ~]# crsctl query crs  releaseversion
Oracle High Availability Services release version on the local node is [11.2.0.2.0]

[root@rh2 ~]# crsctl query crs  activeversion
Oracle Clusterware active version on the cluster is [11.2.0.2.0]


[grid@rh2 ~]$ /s01/grid/OPatch/opatch lsinventory
Invoking OPatch 11.2.0.1.1

Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation.  All rights reserved.


Oracle Home       : /s01/grid
Central Inventory : /s01/app/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 11.2.0.1.1
OUI version       : 11.2.0.2.0
OUI location      : /s01/grid/oui
Log file location : /s01/grid/cfgtoollogs/opatch/opatch2011-08-04_18-50-34PM.log

Patch history file: /s01/grid/cfgtoollogs/opatch/opatch_history.txt

Lsinventory Output file location : /s01/grid/cfgtoollogs/opatch/lsinv/lsinventory2011-08-04_18-50-34PM.txt

--------------------------------------------------------------------------------
Installed Top-level Products (1): 

Oracle Grid Infrastructure                                           11.2.0.2.0
There are 1 products installed in this Oracle Home.


There are no Interim patches installed in this Oracle Home.


Rac system comprising of multiple nodes
  Local node = rh2
  Remote node = rh3

--------------------------------------------------------------------------------

OPatch succeeded.


[root@rh2 ~]# crsctl delete css votedisk a433d51ebd2d4facbfc8e95b017f5393

CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM

又是一个伪修复的Bug….!!

无法,寄希望与replace能解决问题,结果发现:

crsctl replace votedisk +DATA
Failed to create voting files on disk group DATA.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.

方法四中指出的unpublished bug 9409327(Patch 9409327: OFFLINE VF ENTRY REMAINS AFTER PATCH FOR BUG 9024611),目前仅在IBM AIX on POWER Systems (64-bit)的11.2.0.1上有对应的补丁。

Oracle内部错误:ORA-07445[kcflfi()+466] [INT_DIVIDE_BY_ZERO]一例

一套Windows上的11.2.0.1单实例数据库在database open阶段出现了ORA-07445:core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []内部错误,具体的出错日志如下:

LOG CONTENT

=======================ALERT.LOG============================

Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
ARCH: Warning; less destinations available than specified
by LOG_ARCHIVE_MIN_SUCCEED_DEST init.ora parameter
Autotune of undo retention is turned on. 
IMODE=BR
ILAT =84
2011-08-01 13:13:47.068000 +08:00
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
Using parameter settings in server-side spfile C:\APP\PRODUCT\11.2.0\DBHOME_1\DATABASE\SPFILEG11R2.ORA
System parameters with non-default values:
  _spin_count              = 2000
  processes                = 500
  event                    = "10500 trace name context forever,level 8:10013 trace name context forever,level 10:
10015 trace name context forever,level 10"
  sga_max_size             = 600M
  shared_pool_size         = 152M
  large_pool_size          = 32M
  java_pool_size           = 4M
  streams_pool_size        = 0
  _db_file_direct_io_count = 12
  sga_target               = 0
  memory_target            = 0
  control_files            = "C:\APP\ORADATA\G11R2\CONTROLFILE\O1_MF_6VWCSH9J_.CTL"
  control_files            = "C:\APP\FLASH_RECOVERY_AREA\G11R2\CONTROLFILE\O1_MF_6VWCSHNF_.CTL"
  db_block_checksum        = "TRUE"
  db_block_size            = 8192
  db_cache_size            = 196M
  _shared_io_pool_size     = 0
  compatible               = "11.2.0.0.0"
  log_archive_dest_2       = "service=stdby optional lgwr sync affirm valid_for=(online_logfiles,all_roles)"
  log_buffer               = 10485760
  db_create_file_dest      = "C:\app\oradata"
  db_recovery_file_dest    = "C:\app\flash_recovery_area"
  db_recovery_file_dest_size= 500000M
  undo_tablespace          = "UNDOTBS1"
  _kgl_bucket_count        = 2
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain                = ""
  session_cached_cursors   = 300
  audit_file_dest          = "C:\APP\ADMIN\G11R2\ADUMP"
  optimizer_features_enable= "10.2.0.4"
  audit_trail              = "DB"
  cell_offload_plan_display= "ALWAYS"
  db_name                  = "G11R2"
  open_cursors             = 3000
  _optimizer_extended_cursor_sharing_rel= "NONE"
  pga_aggregate_target     = 300M
  diagnostic_dest          = "C:\APP"
2011-08-01 13:13:48.164000 +08:00
PMON started with pid=2, OS id=984 
VKTM started with pid=3, OS id=3656 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
GEN0 started with pid=4, OS id=5824 
DIAG started with pid=5, OS id=5832 
DBRM started with pid=6, OS id=2784 
PSP0 started with pid=7, OS id=2500 
DIA0 started with pid=8, OS id=5320 
MMAN started with pid=9, OS id=4128 
DBW0 started with pid=10, OS id=5852 
LGWR started with pid=11, OS id=3960 
CKPT started with pid=12, OS id=4472 
SMON started with pid=13, OS id=5788 
RECO started with pid=14, OS id=6036 
MMON started with pid=15, OS id=5740 
MMNL started with pid=16, OS id=2112 
ORACLE_BASE from environment = C:\app
alter database mount exclusive
2011-08-01 13:13:52.390000 +08:00
Sweep [inc][135908]: completed
NSS2 started with pid=19, OS id=2728 
Sweep [inc][135901]: completed
Successful mount of redo thread 1, with mount id 2704081164
Database mounted in Exclusive Mode
2011-08-01 13:13:53.413000 +08:00
Lost write protection disabled
2011-08-01 13:13:54.578000 +08:00
Sweep [inc][135897]: completed
Sweep [inc2][135908]: completed
Sweep [inc2][135901]: completed
Sweep [inc2][135897]: completed
2011-08-01 13:13:55.788000 +08:00
Completed: alter database mount exclusive
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
2011-08-01 13:13:56.959000 +08:00
Started redo scan
Completed redo scan
 read 0 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 867, block 88140, scn 9122496
Recovery of Online Redo Log: Thread 1 Group 3 Seq 867 Reading mem 0
  Mem# 0: C:\APP\ORADATA\G11R2\ONLINELOG\O1_MF_3_6VWCSMPO_.LOG
  Mem# 1: C:\APP\FLASH_RECOVERY_AREA\G11R2\ONLINELOG\O1_MF_3_6VWCSNGX_.LOG
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 867, block 88140, scn 9142497
 0 data blocks read, 0 data blocks written, 0 redo k-bytes read
2011-08-01 13:13:58.738000 +08:00
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=22, OS id=4784 
2011-08-01 13:13:59.765000 +08:00
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
ARC1 started with pid=24, OS id=2780 
ARC2 started with pid=25, OS id=1288 
ARC1: Archival started
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
ARC3 started with pid=26, OS id=3876 
2011-08-01 13:14:00.828000 +08:00
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
NSS2 started with pid=19, OS id=5156 
2011-08-01 13:14:29.008000 +08:00
ORA-16198: LGWR received timedout error from KSR
2011-08-01 13:14:35.980000 +08:00
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_lgwr_3960.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Error 16198 verifying archivelog destination LOG_ARCHIVE_DEST_2
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Continuing...
ARCH: LGWR is scheduled to archive destination LOG_ARCHIVE_DEST_2 after log switch
2011-08-01 13:14:38.629000 +08:00
Trying to expand controlfile section 11 for Oracle Managed Files
Exception [type: INT_DIVIDE_BY_ZERO, ] [] [PC:0x500282E, __VInfreq__kcflfi()+466]
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_arc0_4784.trc  (incident=136091):
ORA-07445: exception encountered: core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []
Incident details in: c:\app\diag\rdbms\g11r2\g11r2\incident\incdir_136091\g11r2_arc0_4784_i136091.trc
2011-08-01 13:14:40.283000 +08:00
Trace dumping is performing id=[cdmp_20110801131440]
2011-08-01 13:14:52.417000 +08:00
Sweep [inc][136091]: completed
Sweep [inc2][136091]: completed
2011-08-01 13:14:59.805000 +08:00
ARC2: Detected ARCH process failure
ARC2: STARTING ARCH PROCESSES
ARC0 started with pid=19, OS id=5016 
2011-08-01 13:15:00.836000 +08:00
ARC0: Archival started
ARC2: STARTING ARCH PROCESSES COMPLETE
2011-08-01 13:15:36.689000 +08:00
Deleted Oracle managed file C:\APP\FLASH_RECOVERY_AREA\G11R2\ARCHIVELOG\2011_08_01\O1_MF_1_866_73DFKWRK_.ARC
2011-08-01 13:15:38.013000 +08:00
Error 12154 received logging on to the standby
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_4852.trc:
ORA-12154: TNS:could not resolve the connect identifier specified
ARCH: Error 12154 Creating archive log file to 'stdby'
Trying to expand controlfile section 11 for Oracle Managed Files
Exception [type: INT_DIVIDE_BY_ZERO, ] [] [PC:0x500282E, __VInfreq__kcflfi()+466]
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_4852.trc  (incident=136051):
ORA-07445: exception encountered: core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []
Incident details in: c:\app\diag\rdbms\g11r2\g11r2\incident\incdir_136051\g11r2_ora_4852_i136051.trc
2011-08-01 13:15:39.680000 +08:00
Trace dumping is performing id=[cdmp_20110801131539]
2011-08-01 13:15:42.782000 +08:00
PMON (ospid: 984): terminating the instance due to error 397
2011-08-01 13:15:50.520000 +08:00
Instance terminated by PMON, pid = 984

=============================g11r2_ora_4852_i136051.trc=============================

Dump file c:\app\diag\rdbms\g11r2\g11r2\incident\incdir_136051\g11r2_ora_4852_i136051.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Windows NT Version V6.1 Service Pack 1 
CPU                 : 4 - type 586, 2 Physical Cores
Process Affinity    : 0x0x00000000
Memory (Avail/Total): Ph:2122M/3566M, Ph+PgF:5413M/7130M, VA:1084M/2047M 
Instance name: g11r2
Redo thread mounted by this instance: 1
Oracle process number: 17
Windows thread id: 4852, image: ORACLE.EXE (SHAD)

*** 2011-08-01 13:15:38.527
*** SESSION ID:(197.1) 2011-08-01 13:15:38.527
*** CLIENT ID:() 2011-08-01 13:15:38.527
*** SERVICE NAME:() 2011-08-01 13:15:38.527
*** MODULE NAME:(oradim.exe) 2011-08-01 13:15:38.527
*** ACTION NAME:() 2011-08-01 13:15:38.527

Dump continued from file: c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_4852.trc
ORA-07445: exception encountered: core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []

========= Dump for incident 136051 (ORA 7445 [kcflfi()+466]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: INT_DIVIDE_BY_ZERO, ] [] [PC:0x500282E, __VInfreq__kcflfi()+466]

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
Process Id: 0x000010bc  Thread Id : 0x000012f4    Time : Mon Aug 01 13:15:38 
Excp. Code: 0xc0000094  Excp. Type: INT_DIVIDE    Flags: 0x00000000

------------------- Registers ----------------------------
eip = 0500282e esp = 0d9f525c ebp = 0d9f577c edi = 37eefe00 esi = 00000265
eax = 00000265 ebx = 00000000 ecx = 089ee234 edx = 00000000
ecs = 0000001b eds = 00000023 ees = 00000023 ess = 00000023
egs = 00000000 efs = 0000003b
eflags = 00010246
------------------- End of Registers ---------------------

*** 2011-08-01 13:15:38.536
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=a01hp0psv0rrh) -----
alter database open
----------- messages from pre-loading .sym files:
Symbol file C:\app\product\11.2.0\dbhome_1\RDBMS\ADMIN\oracommon11.SYM does not match binary.
 Symbol TimeStamp=4bb5eaac, Module TimeStamp=0 are different
Symbol file C:\app\product\11.2.0\dbhome_1\RDBMS\ADMIN\oraclsra11.SYM does not match binary.
 Symbol TimeStamp=4bb4cf99, Module TimeStamp=0 are different
----------- end of messages from pre-loading .sym files
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
Symbol file C:\app\product\11.2.0\dbhome_1\BIN\oracommon11.SYM does not match binary.
 Symbol TimeStamp=4bb5eaac, Module TimeStamp=0 are different
Symbol file C:\app\product\11.2.0\dbhome_1\BIN\oraclsra11.SYM does not match binary.
 Symbol TimeStamp=4bb4cf99, Module TimeStamp=0 are different
EnumerateLoadedModules64 failed with error -1073741819
Symbol file oraclsra11.SYM does not match binary.
 Symbol TimeStamp=4bb4cf99, Module TimeStamp=0 are different
Symbol file oracommon11.SYM does not match binary.
 Symbol TimeStamp=4bb5eaac, Module TimeStamp=0 are different
__VInfreq__kcflfi()           00000000             
+466                                               
_kccrszf()+287       CALLrel  _kcflfi()            0 318345B8 34 31C0DD40 4000
                                                   265 4 7FFFFFFF 1 0 0
_kccrsd_expd()+1418  CALLrel  _kccrszf()           D9F7CEC 268 264
_kccwnc_reuse_expan  CALLrel  _kccrsd_expd()       D9F7CEC B 38
d()+640                                            
__VInfreq__kccwnc()  CALLrel  _kccwnc_reuse_expan  D9F7CEC B 26
+235                          d()                  
_krse_arc_complete(  CALLrel  _kccwnc()            D9F7CEC D9F6D38 B
)+1615                                             
_krse_arc_driver_co  CALLrel  _krse_arc_complete(  D9F78AC
re()+1307                     )                    
_krse_arc_driver()+  CALLrel  _krse_arc_driver_co  D9F7CEC 1 D9F7C6C 0 0 D9F7CC8
274                           re()                 0 0 0 0 0 0 0
_krsq_arch_to_force  CALLrel  _krse_arc_driver()   D9F7CEC 1 D9F7C6C 0 0 D9F7CC8
_switch()+196                                      0 0 0 0 0 0 0
__VInfreq__kcttsc()  CALLrel  _krsq_arch_to_force  D9F7CEC 1
+129                          _switch()            
_kcfopd()+1504       CALLrel  _kcttsc()            2
_adbdrv()+16700      CALLrel  _kcfopd()            0 0 0 0 D9FBBF8
_opiexe()+13594      CALLrel  _adbdrv()            4A C0000094 33644518 D9FBD38
                                                   6D60697 2F3FC5F0
_opiosq0()+6248      CALLrel  _opiexe()            4 0 D9FC704
_kpooprx()+277       CALLrel  _opiosq0()           3 E D9FC970 A4 0
_kpoal8()+632        CALLrel  _kpooprx()           D9FF074 D9FD3F8 13 1 0 A4
_opiodr()+1248       CALLreg  00000000             5E 1C D9FF070
___dyn_tls_init_cal  CALLreg  00000000             5E 1C D9FF070 1
lback()+2935122                                    
_opitsk()+1404       CALL???  00000000             C9A10E8 5E D9FF070 0 D9FED00
                                                   D9FF19C 53E52E 0 D9FF1C8
_opiino()+980        CALLrel  _opitsk()            0 0
_opiodr()+1248       CALLreg  00000000             3C 4 D9FFBC4
_opidrv()+1201       CALLrel  _opiodr()            3C 4 D9FFBC4 0
_sou2o()+55          CALLrel  _opidrv()            3C 4 D9FFBC4
_opimai_real()+124   CALLrel  _sou2o()             D9FFBD4 3C 4 D9FFBC4
_opimai()+125        CALLrel  _opimai_real()       2 D9FFBFC
_OracleThreadStart@  CALLrel  _opimai()            2 D9FFF3C 0 70 FFFFFFFF
4()+830                                            FFFFFFFF
___dyn_tls_init_cal  CALLptr  00000000             901FF6C D9FFFD4 776437F5
lback()+366382316                                  901FF6C 765D34CB 0
___dyn_tls_init_cal  CALLreg  00000000             901FF6C 765D34CB 0 0 901FF6C
lback()+367384440                                  0
___dyn_tls_init_cal  CALLrel  ___dyn_tls_init_cal  401326 901FF6C 0 0 0 0
lback()+367384392             lback()+367384403    
00000000             CALL???  00000000             

--------------------- Binary Stack Dump ---------------------
..................

从以上日志中可以看到在”Trying to expand controlfile section 11 for Oracle Managed Files“扩扎控制文件过程中出现了
_kccwnc_reuse_expan->_kccrsd_expd->_kccrszf->_kcflfi->_VInfreq__kcflfi()
函数的7445错误,kcf意为(manages and coordinates operations on the control file(s),kcf.c),是在处理日志文件中引发了INT_DIVIDE_BY_ZERO除数为零的代码bug。

通过7445和kcflfi关键词在MOS上搜索没有太大的发现,说明该Bug的处罚几率非常低,正好让我碰到说明是某些特殊参数的设置引起了该问题。

目标锁定启动日志中的非默认隐藏参数”_db_file_direct_io_count”,该参数决定了直接路径读写的IO大小,从9i开始该参数的单位调整为bytes而非原先的blocks,之前因为对该参数进行一些测试所以设置了一个较小值。

Parameter: DB_FILE_DIRECT_IO_COUNT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Versions:	8.0 - 8.1
                This parameter is hidden in 9.0 onwards.

 Parameter type:        integer
 Parameter class:       dynamic, scope = ALTER SYSTEM DEFERRED
 Default value:         64
 Range of values:       operating system-dependent

Description:
~~~~~~~~~~~~
DB_FILE_DIRECT_IO_COUNT is used to specify the number of blocks to be used
for IO operations done by backup, restore or direct path read and write
functions. The IO buffer size is a product of DB_FILE_DIRECT_IO_COUNT and
DB_BLOCK_SIZE. The IO buffer size cannot exceed max_IO_size for your
platform.

Assigning a high value to this parameter results in greater use of PGA or
SGA memory.

o In Oracle8i, minimize the number of I/O requests by setting the
  DB_FILE_DIRECT_IO_COUNT instance parameter so that

  DB_BLOCK_SIZE x DB_FILE_DIRECT_IO_COUNT = max_io_size of system

  In Oracle8i the default for this is 64 blocks.

  (In Oracle9i, it is replaced by _DB_FILE_DIRECT_IO_COUNT which governs
   the size of direct I/Os in BYTES (not blocks). The default is 1Mb but
   will be sized down if the max_io_size of the system is smaller.)

ORA-19863 during RMAN duplicate

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.3
This problem can occur on any platform.
Symptoms
-- Problem Statement:
Duplicate failed during the datafile restore stage:

Starting restore at 2008-Apr-09 09:28:24
using channel ORA_AUX_DISK_1

channel ORA_AUX_DISK_1: starting datafile backupset restore
channel ORA_AUX_DISK_1: specifying datafile(s) to restore from backup set
restoring datafile 00003 to /u06/oradata/hcmprdc/sysaux01.dbf
...
restoring datafile 00121 to /u06/oradata/hcmprdc/waapp.dbf
channel ORA_AUX_DISK_1: reading from backup piece
/u04/oradata/flash_recovery_area/HCMPRD/mdjd8s5v_1_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 04/09/2008 09:28:28
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error reading backup piece /u04/oradata/flash_recovery_area/HCMPRD/mdjd8s5v_1_1
ORA-19863: device block size 1040384 is larger than max allowed: 262144

Cause
The database parameter _db_file_direct_io_count in the target and auxiliary instance does not match.
Solution

-- To implement the solution:

Ensure that parameter _db_file_direct_io_count on the target and auxiliary database the same

_DB_FILE_DIRECT_IO_COUNT need to be set to the same value between the source database 
where the backup was taken and the target database where the backup is being restored.

2.0 Size of Input/Output Buffers
================================

a. input buffers
----------------

NOTE : DB_FILE_DIRECT_IO_COUNT is not available in Oracle9i onwards.
       In Oracle9i, it is replaced by a hidden _DB_FILE_DIRECT_IO_COUNT which 
       governs the size of direct I/Os in BYTES (not blocks). The default is 
       1Mb butwill be sized down if the max_io_size of the system is smaller.

The input buffer size is:
  buffersize = db_block_size * db_file_direct_io_count

As there are 4 input buffers, the total input buffer memory use per channel is:
 memory(input) = #buffers * #files * buffersize
               = 4 * #files * buffersize

For example, if 2 channels are used, and each of these channels backs up 3 
files, then for each channel

 memory(input) = 4 * 3 * db_block_size * db_file_direct_io_count

b. output buffers
-----------------

For disk channels, the output buffer size is:
  buffersize = db_block_size * db_file_direct_io_count

For SBT_TAPE channels, the output buffer size in Oracle8/8i is o/s dependant. (On Solaris,
this defaults to 64k) On 9i/10g it defaults to 256k for all platforms. The BLKSIZE argument to 'allocate channel...' can be
used to override the default value.

As there are 4 output buffers,
  memory(output) = #buffers * buffersize
                 = 4 * buffersize

一般来说使用该隐藏参数的默认值即可,通过重置该参数后修复启动问题:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for 32-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> alter system reset "_db_file_direct_io_count" scope=spfile;

System altered.

SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
 FROM SYS.x$ksppi x, SYS.x$ksppcv y
 WHERE x.inst_id = USERENV ('Instance')
 AND y.inst_id = USERENV ('Instance')
 AND x.indx = y.indx
AND x.ksppinm LIKE '%db_file_direct_io_count%'
/

NAME                           VALUE                DESCRIB
------------------------------ -------------------- ------------------------------
_db_file_direct_io_count       1048576              Sequential I/O buf size

windows上的11gr2默认该参数为1MB

Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例

一套Linux x86-64上的11.2.0.1 4节点RAC系统中LMS GCS服务进程遭遇到内部错误ORA-00600[kjbmprlst:shadow],导致节点实例意外终止,具体日志如下:

Fri Jul 08 02:04:43 2011
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc  (incident=1011732):
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/PROD/PROD1/incident/incdir_1011732/PROD1_lms1_536_i1011732.trc
Fri Jul 08 02:04:44 2011
Trace dumping is performing id=[cdmp_20110708020444]
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 536): terminating the instance due to error 484
Fri Jul 08 02:04:45 2011
opiodr aborting process unknown ospid (27387) as a result of ORA-1092
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_diag_513.trc
Fri Jul 08 02:04:54 2011
Termination issued to instance processes. Waiting for the processes to exit
Fri Jul 08 02:04:58 2011
ORA-1092 : opitsk aborting process

该ORA-00600[kjbmprlst:shadow]错误定位为11.2.0.1上的Bug 10121589或Bug 9458781:

Bug 10121589  ORA-600 [kjbmprlst:shadow] can occur in RAC
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release)
        11.2.0.2 Bundle Patch 2 for Exadata Database
        11.2.0.1 Bundle Patch 7 for Exadata Database 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow] 

    RAC (Real Application Clusters) / OPS 

Description

    An ORA-600 [kjbmprlst:shadow] can occur if the fix for bug 9979039
    is present.

    Note:
     One off patches for 10200390 should also include this fix.

Bug 9458781  Missing close message to master leaves closed lock dangling crashing the instance with assorted Internal error

Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions >= 11.2.0.1 but BELOW 11.2.0.2
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        11.2.0.2 (Server Patch Set)
        11.2.0.1 Bundle Patch 4 for Exadata Database 

Symptoms:

Related To:

    Instance May Crash
    Internal Error May Occur (ORA-600)
    ORA-600 [KJBMPRLST:SHADOW]
    ORA-600 [KJBMOCVT:RID]
    ORA-600 [KJBRREF:PKEY]
    ORA-600 [KJBRASR:PKEY] 

    RAC (Real Application Clusters) / OPS 

Description

    A lock is closed without sending a message to the master.
    This causes closed lock dangling at the master crashing the instance with different internal errors.

    Reported internal errors so far are :
    - KJBMPRLST:SHADOW
    - KJBMOCVT:RID
    - KJBRREF:PKEY
    - KJBRASR:PKEY

该kjbmprlst:shadow内部函数用以管理kjbm shadow锁(/libserver10.a/kjbm.o )信息,存在某个已关闭的lock没有及时message给master node的代码漏洞,目前除了安装补丁外没有已验证的workaround办法(disable drm似乎是无效的):

oradebug lkdebug (track resources, take dumps)
KCL history
KJBL history
KJL history

PCM (GCS) and non-PCM (GES) resources are kept separate and use separate code paths.
GES:
Resource table: kjr and kjrt
Lock table: kjlt
Processes: kjpt
GCS:
Resource table: kjbr
Lock table: kjbl

DLM Structures (continued)
/* PCM resource structure */
typedef struct kjbr {                                /* 68 bytes on sun4u */
  kjsolk       hash_q_kjbr;                             /* hash list : hp */
  ub4          resname_kjbr[2];	                     /* the resource name */
  kjsolk       scan_q_kjbr; /* chain to lmd scan q of grantable resources */
  kjsolk       grant_q_kjbr;                 /* list of granted resources */
  kjsolk       convert_q_kjbr;       /* list of resources being converted */
  ub4          diskscn_bas_kjbr;         /* scn(base) known to be on disk */
  ub2          diskscn_wrap_kjbr;        /* scn(wrap) known to be on disk */
  ub2          writereqscn_wrap_kjbr;    /* scn(wrap) requested for write */
  ub4          writereqscn_bas_kjbr;     /* scn(base) requested for write */
  struct kjbl *sender_kjbr;                 /* lock elected to send block */
  ub2          senderver_kjbr;                  /* version# of above lock */
  ub2          writerver_kjbr;                  /* version# of lock below */
  struct kjbl *writer_kjbr;                /* lock elected to write block */
  ub1          mode_role_kjbr; /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1          flags_kjbr;                        /* ignorewip, free etc. */
  ub1          rfpcount_kjbr;                      /* refuse ping counter */
  ub1          history_kjbr;                /* resource operation history */
  kxid         xid_kjbr;                          /* split transaction ID */
} kjbr ;

/* kjbl - PCM lock structure
** Clients and most of the DLM will use the KJUSER* or KJ_* modes and kscns  */

typedef struct kjbl {                                /* 52 bytes on sun4u */
  union {                     /* discriminate lock@master and lock@client */
    struct {                                           /* for lock@master */
      kgglk        state_q_kjbl;             /* link to chain to resource */
      kjbopqi     *rqinfo_kjbl;                             /* target bid */
      struct kjbr *resp_kjbl;                   /* pointer to my resource */
    } kjbllam;                                 /* KJB Lock Lock At Master */
    struct {                                           /* for lock@client */
      ub4         disk_base_kjbl;        /* disk version(base) for replay */
      ub2         disk_wrap_kjbl;        /* disk version(wrap) for replay */
      ub1         master_node_kjbl;                   /* master instance# */
      ub1         client_flag_kjbl;     /* flags specific to client locks */
      ub2         update_seq_kjbl;               /* last update to master */
    } kjbllac;                                 /* KJB Lock Lock At Client */
  } kjblmcd;                        /* KJB Lock Master Client Discrimnant */
  void  *remote_lockp_kjbl;           /* pointer to client lock or shadow */
  ub2    remote_ver_kjbl;                         /* remote lock version# */
  ub2        ver_kjbl;                                     /* my version# */
  ub2        msg_seq_kjbl;                         /* client->master seq# */
  ub2        reqid_kjbl;                         /* requestid for convert */
  ub2        creqid_kjbl; /* requestid for convert that has been cancelled */
  ub2        pi_wrap_kjbl;                     /* scn(wrap) of highest pi */
  ub4        pi_base_kjbl;                     /* scn(base) of highest pi */
  ub1        mode_role_kjbl; /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1        state_kjbl;       /* _L|_R|_W|_S, notify, which q, lock type */
  ub1        node_kjbl;                       /* instance lock belongs to */
  ub1        flags_kjbl;                                /* lock flag bits */
  ub2        rreqid_kjbl;                               /* save the reqid */
  ub2         write_wrap_kjbl;        /* last write request version(wrap) */
  ub4         write_base_kjbl;        /* last write request version(base) */
  ub4         history_kjbl;                     /* lock operation history */
} kjbl;

PCM DLM locks that are owned by the local instance are allocated and embedded in an LE structure.
PCM DLM locks that are owned by remote instances and mastered by the local instance are allocated in SHARED_POOL.

PCM Locks and Resources
Fields of interest in the kclle structure: kcllerls or releasing; kcllelnm or name(id1,id2);
kcllemode or held-mode; kclleacq or acquiring; kcllelck or DLM lock.

Fields of interest in the kjbr structure: resname_kjbr[2] or resource name; grant_q_kjbr or grant queue;
convert_q_kjbr or convert queue; mode_role_kjbr, which is a bitwise merge of grant mode and
role-interpreted NULL(0x00), S(0x01), X(0x02), L0 Local (0x00), G0 Global without PI (0x08), G1 Global with PI (0x018).

The field mode_role_kjbl in kjbl is a bitwise merge of grant, request, and lock mode: 0x00 if grant NULL;
0x01 if grant S; 0x02 if grant X; 0x04 lock has been opened at master; 0x08 if global role (otherwise local);
0x10 has one or more PI; 0x20 if request CR; 0x40 if request S; 0x80 if request X.

Someone has to keep a list of all buffers and where they are mastered
This is called Global Resource Directory (GRD)
GRD is present on all the instances of the cluster
To find out the master:
select  b.dbablk, r.kjblmaster master_node
from x$le l, x$kjbl r, x$bh b
where b.obj =
and b.le_addr = l.le_addr
and l.le_kjbl = r.kjbllockp

Oracle Support宣称可以通过11.2.0.2 (Server Patch Set)11.2.0.1 Bundle Patch 4 for Exadata Database修复该bug,但是有迹象表明在11.2.0.2上仍可能发生该ORA-00600[kjbmprlst:shadow]内部错误,同时该bug更多地发生在超过2个节点的RAC系统中。

 

Oracle等待事件KJC:Wait for msg sends to complete

KJC: Wait for msg sends to complete是一种只会发生在RAC环境中的等待事件,目前对于该等待事件已知的信息不多。
KJ意为Kernel Lock Management Layer,Messaging Enhancements (KJC)。
猜测该等待事件为global enqueue message的网络传输等待,前台进程和后台进程如LMS、LMD均可能发生该等待事件。

 

select name,parameter1,parameter2,parameter3,wait_class from v$event_name where name like '%KJC%';

NAME                                                             PARAMETER1
---------------------------------------------------------------- ----------------------------------------------------------------
PARAMETER2                                                       PARAMETER3
---------------------------------------------------------------- ----------------------------------------------------------------
WAIT_CLASS
----------------------------------------------------------------
KJC: Wait for msg sends to complete                              msg
dest|rcvr                                                        mtype
Other

SQL> select * from gV$system_Event where event like '%KJC%';

   INST_ID EVENT                                                            TOTAL_WAITS
---------- ---------------------------------------------------------------- -----------
TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT TIME_WAITED_MICRO TOTAL_WAITS_FG TOTAL_TIMEOUTS_FG
-------------- ----------- ------------ ----------------- -------------- -----------------
TIME_WAITED_FG AVERAGE_WAIT_FG TIME_WAITED_MICRO_FG   EVENT_ID WAIT_CLASS_ID WAIT_CLASS#
-------------- --------------- -------------------- ---------- ------------- -----------
WAIT_CLASS
----------------------------------------------------------------
         2 KJC: Wait for msg sends to complete                                      415
             0           1            0              7638            163                 0
             0               0                 2494 4079369678    1893977003           0
Other

         1 KJC: Wait for msg sends to complete                                       93
             0           0            0              2094             37                 0
             0               0                  615 4079369678    1893977003           0
Other

注意到该等待事件的类别是other而非concurrency,若发生该等待事件我们应当关心的是网络性能
netstat -s or ifconfig?

以下为具体的等待发生场景:

PARSING IN CURSOR #140438030304224 len=169 dep=2 uid=0 oct=3 lid=0
tim=1310310630713104 hv=2411583655 ad='b5a49898' sqlid='637txtu7vvq57'
select service_name, service_name_hash,wait_class_id, wait_class#,  wait_class,total_waits, time_waited  from
gv$service_wait_class where inst_id = USERENV('Instance')

END OF STMT
PARSE #140438030304224:c=6999,e=7685,p=0,cr=5,cu=0,mis=1,r=0,dep=2,og=4,plh=809657567,tim=1310310630713101
CLOSE #140438030304224:c=0,e=20,dep=2,type=0,tim=1310310630713388
WAIT #140438029567384: nam='row cache lock' ela= 530 cache id=8 mode=0 request=3 obj#=6623 tim=1310310630716555
EXEC #140438032451672:c=0,e=61,p=0,cr=0,cu=0,mis=0,r=0,dep=2,og=4,plh=2853959010,tim=1310310630716708
FETCH #140438032451672:c=0,e=49,p=0,cr=4,cu=0,mis=0,r=1,dep=2,og=4,plh=2853959010,tim=1310310630716809
CLOSE #140438032451672:c=0,e=10,dep=2,type=3,tim=1310310630716877

WAIT #140438029567384: nam='KJC: Wait for msg sends to complete' ela= 18
msg=3235372280 dest|rcvr=65536 mtype=12 obj#=6623 tim=1310310630719146

EXEC #140438032420328:c=0,e=190,p=0,cr=1,cu=5,mis=0,r=1,dep=1,og=4,plh=0,tim=1310310630781989

WAIT #140438032420328: nam='KJC: Wait for msg sends to complete' ela= 55
msg=3235374760 dest|rcvr=65537 mtype=32 obj#=74418 tim=1310310630782830

这里的msg猜测可能是递增的message id
mytpe为该msg的类型
obj#为对象号

KJC: Wait for msg sends to complete可能的相关stack call:

LMS's stack:
ksedms ksdxfdmp ksdxcb sspuser sigacthandler libc_poll
kslwait ksxprcv kjctr_rksxp kjctsrw kjccwcrm kjccqmg
kjbmsclose kjbclose kclpdc kclpdcl kclpbi kjmsm
ksbrdp opirip opidrv sou2o main start 

 LMS4 was waiting for  "KJC: Wait for msg sends to complete"

  'KJC: Wait for msg sends to complete', at one of the stack trace
  below

  o kjccqmg => kjctcwcrm
  o kjccsmg => kjctcwcrm

o  an Oracle process is spinning in kjctcwcrm() on 9.0, 9.2 or 10.1

o  an Oracle process got ORA-600[kjccgmb:1] on 9.0, 9.2 or 10.1

o Got ORA-7445[] at an unknown function called from kjbmmsg2res()
o A LMS process hung in kjctcwcrm() on shutdown

LMS may hang waiting for 'kjc: wait for msg sends to complete'
during shutdown of an instance with another session
reporting ORA-600 [kjccgmb:1]

IAGNOSTIC ANALYSIS:
--------------------
Let the customer tracing the lmon, lmd when doing the shutdown immediate,
find out:
    In sur11_lmon_5948.trc
    WAIT #0: nam='ges LMD to shutdown' ela= 496057 p1=0 p2=0 p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49213 p1=0 p2=0
p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49811 p1=0 p2=0
p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49772 p1=0 p2=0
p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49813 p1=0 p2=0
p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49779 p1=0 p2=0
p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49787 p1=0 p2=0
p3=0
    WAIT #0: nam='KJC: Wait for msg sends to complete' ela= 49779 p1=0 p2=0 

	LMS4's stack:
ksedms ksdxfdmp ksdxcb sspuser sigacthandler libc_poll
kslwait ksxprcv kjctr_rksxp kjctsrw kjccwcrm kjccqmg
kjbmsclose kjbclose kclpdc kclpdcl kclpbi kjmsm
ksbrdp opirip opidrv sou2o main start 

LMS4 was encountered ORA-4031 during shutting down,

#0  0x0000000004f5bdbe in kjccqmg ()
#1  0x0000000005172378 in kjbmflushsc ()
#2  0x0000000005175351 in kjbmfto ()
#3  0x0000000004ff425f in kjmsm ()
#4  0x00000000022677a7 in ksbrdp ()
#5  0x000000000248c8a3 in opirip ()
#6  0x000000000171eefd in opidrv ()
#7  0x0000000001d95f5f in sou2o ()
#8  0x0000000000a079ba in opimai_real ()
#9  0x0000000001d9b408 in ssthrdmain ()
#10 0x0000000000a078a1 in main ()

(gdb) bt
#0  0x00000000053907aa in kclpbi ()
#1  0x0000000004ff46dc in kjmsm ()
#2  0x00000000022677a7 in ksbrdp ()
#3  0x000000000248c8a3 in opirip ()
#4  0x000000000171eefd in opidrv ()
#5  0x0000000001d95f5f in sou2o ()
#6  0x0000000000a079ba in opimai_real ()
#7  0x0000000001d9b408 in ssthrdmain ()
#8  0x0000000000a078a1 in main ()

(gdb) bt
#0  0x0000000004f5bdbe in kjccqmg ()
#1  0x00000000051a232a in kjbmswrite ()
#2  0x00000000051014f1 in kjbrswrite ()
#3  0x00000000050fec8d in kjbrwreq ()
#4  0x0000000005189fcc in kjbmpwrite ()
#5  0x0000000004fe98fe in kjmxmpm ()
#6  0x0000000004feceee in kjmpmsgi ()
#7  0x0000000004ff2d8e in kjmsm ()
#8  0x00000000022677a7 in ksbrdp ()
#9  0x000000000248c8a3 in opirip ()
#10 0x000000000171eefd in opidrv ()
#11 0x0000000001d95f5f in sou2o ()
#12 0x0000000000a079ba in opimai_real ()
#13 0x0000000001d9b408 in ssthrdmain ()
#14 0x0000000000a078a1 in main ()

以下列出了与该’KJC: Wait for msg sends to complete’相关的Bug信息

Bug Affected Version Description
Bug 5075434 Small performance overhead in RAC (waits for “KJC: Wait for msg sends to complete”) 10.2.0.2 RAC session can suffer a small performance overhead due to waiting for a timeout on “KJC: Wait for msg sends to complete”.
Bug 11687519: PZ99 PROCESS SPIN WITH “KJC: WAIT FOR MSG SENDS TO COMPLETE” 10.2.0.3 Abstract: PZ99 PROCESS SPIN WITH “KJC: WAIT FOR MSG SENDS TO COMPLETE”
Bug 4672266: LMS WAITING FOR ‘KJC: WAIT FOR MSG SENDS TO COMPLETE’ WHILE SHUTDOWN 9.2.0.4 Abstract: LMS WAITING FOR ‘KJC: WAIT FOR MSG SENDS TO COMPLETE’ WHILE SHUTDOWN LMS2 has been waiting for the event ‘KJC: Wait for msg sends to complete’
while SHUTDOWN IMMEDIATE.
My customer decided to issue SHUTDOWN ABORT to stop the instance in one hour.
Bug 4672266 LMS hang waiting for ‘kjc: wait for msg sends to complete’ during shutdown 9.2.0.4 9.2.0.7 10.1.0.5 10.2.0.2 LMS may hang waiting for ‘kjc: wait for msg sends to complete’ during shutdown of an instance with another session reporting ORA-600 [kjccgmb:1]
Bug 2447621 – Unneccessary waits for “KJC: WAIT FOR MSG SEND COMPLETION” <9.2 Unneccessary waits for “KJC: Wait for message completion” can occur in a RAC environment.
Bug 5347566: SHUTDOWN IMMEDIATE VERY SLOW WITH ‘KJC: WAIT FOR MSG SENDS TO COMPLETE’ 9.2.0.7 Exactly similar as Bug 5289823 and Bug 4672266 LMS hang waiting for ‘kjc: wait for msg sends to complete’ during shutdown
Bug 5084699: RACGIMON WAITING FOR ‘KJC: WAIT FOR MSG SENDS TO COMPLETE’ 10.2.0.1 racgimon on one of the 4 nodes cluster has been exhibiting process spin.
It appears to wait for 15 minutes every 20 minutes on “KJC: Wait for msg
sends to complete” in the following PL/SQL execution:
‘begin dbms_lock.allocate_unique(:a1, :a2); end;’
Bug 3561040: SHUTDOWN HANG DUE TO LMS PROCESS WAITING FOR KJC: WAIT FOR MSG SENDS TO COMPLETE 9.2.0.4 LMS process is waiting for  “KJC: Wait for msg sends to complete” while shutdown and shutdown does not complete.

Oracle内部错误:ORA-00600[kgskdecrstat1]一例

一套Itanium HP-UX上的9.2.0.5系统最近出现了ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [],内部错误,其错误日志如下:

Mon Apr 18 12:32:20 2011
ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [], [], []
ARC0: Completed archiving  log 6 thread 1 sequence 831803
Mon Apr 18 12:32:22 2011
ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [], [], []
Mon Apr 18 12:32:23 2011
ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [], [], []

Trace file
----------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0.5
System name:	HP-UX
Release:	B.11.23
Version:	U
Machine:	ia64
Redo thread mounted by this instance: 1
Oracle process number: 28
Unix process pid: 14827, image: oracle@(TNS V1-V3)

*** 2011-04-18 02:39:15.043
*** SESSION ID:(276.16183) 2011-04-18 02:39:15.042
DEADLOCK DETECTED
Current SQL statement for this session:
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
                      ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TX-00030022-0070d56f        28     276     X            344     259           X
TX-000a0012-0058e4fc       344     259     X             28     276           X
session 276: DID 0001-001C-0000002C	session 259: DID 0001-0158-000020E6
session 259: DID 0001-0158-000020E6	session 276: DID 0001-001C-0000002C
Rows waited on:
Session 259: obj - rowid = 00001F0D - AABNCDAAEAAAAXYAAL
 (dictionary objn - 7949, file - 4, block - 1496, slot - 11)
Session 276: obj - rowid = 00001F0B - AABNCAAAEAAAAXCAAY
 (dictionary objn - 7947, file - 4, block - 1474, slot - 24)
Information on the OTHER waiting sessions:
Session 259:
 pid=344 serial=17280 audsid=106170227 user: 41
           program: JDBC Thin Client
 application name: JDBC Thin Client, hash value=0
 Current SQL Statement:
End of information on OTHER waiting sessions.
*** 2011-04-18 12:32:20.392
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kgskdecrstat1], [], [], [], [], [], [], []
----- Call Stack Trace -----
... kgskdecrstat  kskdecrstat  ktudecrustat  ktcdso  ktcrcm  ktdcmt  k2lcom ...
----- End of Call Stack Trace -----

PROCESS STATE
-------------
   program: JDBC Thin Client
   application name: JDBC Thin Client, hash value=0
   last wait for 'SQL*Net message from client' blocking sess=0x0 seq=39332 wait_time=21242
               driver id=74637000, #bytes=1, =0

可以看到以上引发ORA-00600[kgskdecrstat1]内部错误的进程同时发现了死锁(DEADLOCK),在MOS上搜索可以发现”Bug 2894072: ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [KGSKDECRSTAT1]”,但是该Bug最早发生在9.2.0.3上,已经确定影响9.2.0.x所有版本,并且在9.2.0.5上没有backport的bug fix。

提交sr后,Oracle Gcs给出了解决的solution:
1.升级数据库到10.2.0.5,11.1.0.7,11.2.0.2等目前支持的版本
2.解决引发该bug的ORA-00060 dead lock问题
3.如果在解决ORA-00060后仍出现以上ORA-00600[kgskdecrstat1]内部错误,且启用了9i早期版本中的resource manager的话,可以尝试禁用该特性:
ALTER SYSTEM SET resource_manager_plan=” SCOPE=BOTH;
以便绕过bug 2494790。

ORA-07445:[SIGFPE] [Integer divide by zero]内部错误一例

一套SUNOS 5.10上的单节点10.2.0.3系统出现了ORA-07445: exception encountered: core dump [SIGFPE] [Integer divide by zero] [42788866] [] [] []内部错误,具体trace日志如下:

mon_ora_17633.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0
System name: SunOS
Node name: monitor-a
Release: 5.10
Version: Generic_139556-08
Machine: i86pc

ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [SIGFPE] [Integer divide by zero] [42788866] [] [] []
Current SQL statement for this session:
select req_time into :b0 from t_FIX_TranSerial where
((((tran_bank=:b1 and tran_type=:b2) and term_no=:b3)
and trace_no=:b4) and local_time='0')
----- Call Stack Trace -----

sigsetjmp <- call_user_handler
<- sigacthandler <- kpopfr <- kposdi <- kpopsdi <- opiefn0
<- kpoal8 <- opiodr <- ttcpip <- opitsk <- opiino
<- opiodr <- opidrv <- sou2o <- opimai_real <- main
<- 0000000000E54FE7

PROCESS STATE
-------------
O/S info: user: monitor, term: pts/5, ospid: 17621, machine: monitor-a
program: CTPDATA@monitor-a (TNS V1-V3)
application name: CTPDATA@monitor-a (TNS V1-V3), hash value=0
last wait for 'SQL*Net message to client' blocking sess=0x0 seq=6213 wait_time=1 seconds since wait started=0
driver id=62657100, #bytes=1, =0

通过在MOS上查询以上ORA-07445错误的arguement可以发现Note <ORA-7445 [KPOPFR] [SIGFPE] [INTEGER DIVIDE BY ZERO] When Repeatedly Executing a Query (Doc ID 421203.1)> :

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3
This problem can occur on any platform.
Symptoms

1. Repeatedly executing a query can lead to the following error: 

ORA-7445 [kpopfr] [SIGFPE] [INTEGER DIVIDE BY ZERO]

2. The call stack from the ORA-07445 trace file should contain the following functions:
kposdi  kpopsdi

The error is caused by BUG 5753629.
Abstract: QUERY FAILS BY ORA-7445 [KPOPFR]
Repeatedly executing a query can lead to an ORA-7445[kpopfr] error.

Solution
To implement the solution, do one of the following: 

1. Upgrade to 11.1 or 10.2.0.4, when available.
At the time of writing the article these version were not yet available. (July 2007).
2. Apply one-off Patch 5753629 from MetaLink, if available for your platform and version.

There is no known workaround available for this bug.

References

BUG:5753629 - QUERY FAILS BY ORA-7445 [KPOPFR].

Hdr: 5753629 10.2.0.2 RDBMS 10.2.0.2 PRG INTERFACE PRODID-5 PORTID-23 ORA-7445
Abstract: QUERY FAILS BY ORA-7445 [KPOPFR].

*** 01/09/07 06:12 pm ***
TAR:
----

PROBLEM:
--------
When executing query again and again from one session, query fails
by ORA-7445[kpopfr].

  ====================================================================
  Sat Dec 30 00:22:39 2006
  Errors in file /var/log/oracle/trace/felica2_ora_6156.trc:
  ORA-7445: exception encountered: core dump [kpopfr()+536] [SIGFPE]
  [Integer divide by zero] [0x1023BCE18] [] []
  ====================================================================

DIAGNOSTIC ANALYSIS:
--------------------
From disassemble, %o3 is devided by %o0 and %o0 seems to be 0x0.

  0x1023f43d0 :       umul  %g5, %g4, %o0
  0x1023f43d4 :       mov  %g0, %y
  0x1023f43d8 :       udiv  %o3, %o0, %o3

%o0 is calcurated by %g5 X %g4 at kpopfr+528.

From our trace file, this value(%g5) is 0x100000.

  ub4 kponc_p [FFFFFFFF7B22AAAC, FFFFFFFF7B22AAB0) = 00100000

And %g4 seems to be 0x1000 from below trace file output.
 ========== FRAME [6] (kpopsdi()+148 -> kposdi()) ==========
  %l0 0000000100302B00 %l1 0000000000000002 %l2 FFFFFFFF7B263CA8
  %l3 00000003C36AFB80 %l4 0000000105D7CE20 %l5 0000000000000001
  %l6 0000000000000007 %l7 0000000105D7A920 %i0 0000000000000000
  %i1 FFFFFFFF7FFFB9EC %i2 0000000000001000 %i3 0000000105E03220
                       ~~~~~~~~~~~~~~~~~~~~ <--(*) here
  %i4 0000000000800000 %i5 0000000000105C00 %fp FFFFFFFF7FFFB241 

If %g4=0x1000 and %g5=0x100000, %g4 X %g5 = 0x100000000.
0x100000000 is 0x0 as ub4, and this may bring 0 divide and ORA-7445.

I can reproduce the similar problem in my house, so I'll upload testcase.
Problem reproduce at the following case.

 * sum of all column size is 1048576(0x100000)
 * run query again and again from one session (about 4096(0x1000) times)

From this results, above guess seems to be correct.

WORKAROUND:
-----------
n/a

RELATED BUGS:
-------------
n/a

REPRODUCIBILITY:
----------------
I have confirmed that this problem reproduces at the below env.

 * Linux x86 32bit, 10.2.0.3 : ORA-7445[kpopfr()+300]
 * Linux x86 64bit, 10.2.0.2 : ORA-7445[kpopfr()+339]
 * Solaris 64bit, 10.2.0.2   : ORA-7445[kpopfr()+536]
 * HP-UX Itanium, 10.2.0.2   : ORA-7445[_div32U()+34]

TEST CASE:
----------
At first, creating table like follows.

conn scott/tiger
drop table test;
create table test
( c000 char(2000),
  c001 char(2000),
    ... 
  c523 char(2000),
  c524 char(576));

  --> sum of all column size is 1048576(0x100000).

Run next shell script.

  while [ 1 ]
  do
  echo "set feedback off"
  echo "select * from test where c001 = 'A';"
  done | sqlplus -s scott/tiger

It takes 3-10 minutes to reproduce the problem.
Required time for reproducing depends on hardware spec.

STACK TRACE:
------------
 ksedmp ssexhd sighndlr call_user_handler kposdi kpopsdi kpoal8
 opiodr ttcpip opitsk opiino opiodr opidrv sou2o opimai_real
 main start

具体向Oracle GCS提交SR以后确认为 BUG 5753629. Oracle GCS给出了2种解决方案:
1.升级到10.2.0.4或更高版本
2.应用Apply one-off Patch 5753629

Oracle内部错误ORA-07445:[_memcmp()+88] [SIGSEGV]一例

一套Sparc Solaris上的10.2.0.1数据库,告警日志中出现ORA-07445:[_memcmp()+88] [SIGSEGV]内部错误日志,具体日志如下:

Errors in file /global/oracle1/centDB/admin/centDB/udump/centdb_ora_8749.trc:
ORA-07445: exception encountered: core dump [_memcmp()+88] [SIGSEGV] [Address not mapped to object] [0x000000010] []

/global/oracle1/centDB/admin/centDB/udump/centdb_ora_8749.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /global/oracle1/ORAHOME1/product/10.2/db_1
System name: SunOS
Node name: ora03ud-us
Release: 5.10
Version: Generic_142900-13
Machine: sun4u
Instance name: centDB
Redo thread mounted by this instance: 1
Oracle process number: 41
Unix process pid: 8749, image: oraclecentDB@ora03ud-us

*** SERVICE NAME:(SYS$USERS) 2011-03-08 04:54:18.528
*** SESSION ID:(1226.58882) 2011-03-08 04:54:18.528
Exception signal: 11 (SIGSEGV), code: 1 (Address not mapped to object), addr: 0x10, PC: [0xffffffff7d600ca4, _memcmp()+88]
*** 2011-03-08 04:54:18.533
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [_memcmp()+88] [SIGSEGV] [Address not mapped to object] [0x000000010] [] []
----- Call Stack Trace -----
ksedmp <- ssexhd <- sighndlr <- call_user_handler <- sigacthandler
  <- memcmp <- kpzgkvl <- kziaia <- kpolnb <- kpolon
   <- opiodr <- ttcpip <- opitsk <- opiino <- opiodr
   <- opidrv <- sou2o <- opimai_real <- main <- start

(session) sid: 1226 trans: 0, creator: 3bd522c00, flag: (41) USR/- BSY/-/-/-/-/-
     DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
     txn branch: 0
     oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
O/S info: user: root, term: unknown, ospid: , machine: 7cta-031-eqism
     program: JDBC Thin Client
application name: JDBC Thin Client, hash value=0
last wait for 'SQL*Net message from client' blocking sess=0x0 seq=2 wait_time=5705 seconds since wait started=0
      driver id=74637000, #bytes=1, =0
Dumping Session Wait History
 for 'SQL*Net message from client' count=1 wait_time=5705
      driver id=74637000, #bytes=1, =0
 for 'SQL*Net message to client' count=1 wait_time=5
      driver id=74637000, #bytes=1, =0
temporary object counter: 0
 ----------------------------------------
 Virtual Thread:
 kgskvt: 3a6bc1be8, sess: 3bcdc71e8, vc: 0, proc: 0
 consumer group cur: (upd? 0), mapped: , orig:
 vt_state: 0x0, vt_flags: 0x20, blkrun: 0
 is_assigned: 0, in_sched: 0 (0)
 vt_active: 0 (pending: 0)
 used quanta: 0 (cg: 0)
 cpu start time: 0, quantum status: 0x0
 quantum checks to skip: 0, check thresh: 0
 idle time: 0, active time: 0 (cg: 0)
 cpu yields: 0 (cg: 0), waits: 0 (cg: 0), wait time: 0 (cg: 0)
 queued time outs: 0, time: 0 (cur 0, cg 0)
 calls aborted: 0, num est exec limit hit: 0
 undo current: 0k max: 0k

以上7445内部错误并未导致实例意外终止crash,可以看到其最近的stack call为:memcmp kpzgkvl kziaia kpolnb kpolon opiodr ttcpip opitsk opiino opiodr opidrv sou2o opimai_real main start;通过Metalink搜索可以同Bug 5292883的调用堆栈匹配,Bug Note如下:

Bug 5292883  Dump from OCI client using OCI7 olog() call
Affects:

Product (Component)	 Oracle Server (Rdbms)
Range of versions believed to be affected	 Versions < 11
Versions confirmed as being affected	
10.2.0.3
Platforms affected	 Generic (all / most platforms affected)
Fixed:

This issue is fixed in	
10.2.0.2 Patch 10 on Windows Platforms
10.2.0.3 Patch 7 on Windows Platforms
10.2.0.4 (Server Patch Set)
11.1.0.6 (Base Release)
Symptoms:

Related To:

Process May Dump (ORA-7445) / Abend / Abort
Dump in or under kpzgkvl / kziaia
OCI
Description

On a 64 bit machines a dump can occur with the following stack
if the client uses the olog() OCI call to connect.
  memcmp()<-kpzgkvl()<-kziaia()<--kpolnb()<-kpolon() 

Workaround
 Use OCIServerAttach and OCISessionBegin instead of olog()

Hdr: 5292883 10.2.0.2.0 RDBMS 10.2.0.2.0 SECURITY PRODID-5 PORTID-197 ORA-7445
Abstract: ORA-7445: EXCEPTION ENCOUNTERED: CORE DUMP [_MEMCMP()+160] WHEN CONNECTING TO I

PROBLEM:
--------
 1. Clear description of the problem encountered

    OCI client connection fails with ORA-7445 [_memcmp()+160]. At the 
    time of error occurence no connection to database could be established 
    by OCI clients. Only sqlplus sessions were able to connect. The alertlog
    confirms a lot of ORA-7445 which were logged. Client version is 9.2.0.6. 

    ALERT LOG
    ---------
    Tue Jun  6 18:59:14 2006
     Submitted all GCS remote-cache requests
     Post SMON to start 1st pass IR
     Fix write in gcs resources
    Reconfiguration complete
    Tue Jun  6 19:02:32 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_16756.trc:
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2000000000] [] []
    Tue Jun  6 19:02:32 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_16756.trc:
    ORA-81: address range [0x60000000000A7D70, 0x60000000000A7D74) is not 
readable
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2000000000] [] []
    Tue Jun  6 19:04:42 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_20524.trc:
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2A00000000] [] []
    Tue Jun  6 19:04:42 2006
    Errors in file /u01/app/oracle/admin/XAL/udump/xal1_ora_20524.trc:
    ORA-81: address range [0x60000000000A7D70, 0x60000000000A7D74) is not 
readable
    ORA-7445: exception encountered: core dump [_memcmp()+160] [SIGSEGV] 
[Address not mapped to object] [0x2A00000000] [] []
    ...    

 2. Pertinent configuration information (MTS/OPS/distributed/etc)  
 
    3 instance RAC database.
    Errors were logged for 2 of these instances.
    
 3. Indication of the frequency and predictability of the problem  
 
    Intermittend occurence.
    Not reproducible at will.
    
 4. Technical impact on the customer. Include persistent after effects.

    No connections possible from ERP application which fails with ORA-7445.
    More than 500 ERP users affected.

STACK TRACE:
------------
_memcmp()+160        call                  
kpzgkvl()+192        call     _memcmp()            2000000002 ?
                                                   4000000001229FF0 ?
                                                   000000011 ?
kziaia()+480         call     kpzgkvl()            9FFFFFFFFFFFADB0 ?
                                                   9FFFFFFFFFFFAE18 ?
                                                   4000000001229FF0 ?
                                                   000000011 ? 000000000 ?
                                                   9FFFFFFFFFFF6ED8 ?
                                                   9FFFFFFFFFFF6EE0 ?
                                                   9FFFFFFFFFFF6ED0 ?
kpolnb()+1344        call     kziaia()             9FFFFFFFFFFF8040 ?
                                                   9FFFFFFFFFFF6EE0 ?
                                                   9FFFFFFFFFFF6ED8 ?
                                                   9FFFFFFFFFFF81E0 ?
                                                   9FFFFFFFFFFF81D8 ?
                                                   9FFFFFFFFFFF81E8 ?
                                                   000000000 ?
                                                   400000000233B440 ?
kpolon()+336         call     kpolnb()             9FFFFFFFFFFF8030 ?
                                                   4000000003F47E10 ?
                                                   9FFFFFFFFFFF6F80 ?
                                                   600000000009DB00 ?
                                                   00000820D ?
opiodr()+2064        call     kpolon()             000000051 ?
                                                   60000000000219A8 ?
                                                   9FFFFFFFFFFF81F0 ?
                                                   9FFFFFFFFFFF81B0 ?
                                                   60000000000AAA50 ?
                                                   40000000030BE570 ?
ttcpip()+1824        call     opiodr()             60000000000AA3B0 ?
                                                   6000000000015DD0 ?
                                                   9FFFFFFFFFFFA9A0 ?
                                                   6000000000015DD0 ?
                                                   9FFFFFFFFFFF82D0 ?
                                                   600000000009DB00 ?
                                                   00000001A ?
                                                   6000000000021838 ?

该Bug 5292883在10.2.0.1上没有相应的one-off patch补丁,而在11g和10.2.0.4补丁集中得到修复(fix)。如果无法实施补丁的话,那么一般可以通过以下2种途径绕过该问题:
1)限制用户名和密码的长度在9个字符以内
2)若使用OCI,登录使用OCIServerAttach和OCISessionBegin函数

沪ICP备14014813号-2

沪公网安备 31010802001379号