ORA-600:[ksnpost:ksnigb]错误一例

一套HP-UX Itanium上的10.2.0.3系统出现了ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [],错误,相关的日志如下:

 

ORA-00600: internal error code, arguments: [kolaslGetLength-1], [], [], [], [], [], [], []

ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [], [], []
ORA-00609: could not attach to incoming connection
ORA-12157: TNS:internal network communication error

ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3

ORA-00600: internal error code, arguments: [kolaslGetLength-1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT * FROM gv$sql

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [], [], []
ORA-00609: could not attach to incoming connection
ORA-12157: TNS:internal network communication error
Current SQL information unavailable - no session.

 

Mos上相关的Bug 信息:

 

Bug # 4634662 – Ora-600 [kolaslGetLength-1]
Bug # 7479468 – Ora-600 [ksnpost:ksnigb]

 

Issue 1)
This issue is described in the note 357016.1.Please apply the patch mentioned in the note 357016.1 to avoid this issue.

 

Issue 2)
This is due to OS resource limitation. You need to check with HP for this.We are running out of OS resources and hence
this issue is coming.
Also you can set the ulimit -a values to unlimited for oracle and root user to avoid this issue.

 

Bug 4634662 OERI:kolaslGetLength-1 from V$SQL in RAC
ORA-600[kolaslGetLength-1] selecting from V$SQL (of GV$SQL)
Range of versions believed to be affected Versions < 11
Versions confirmed as being affected

10.1.0.5
10.2.0.3

his issue is fixed in

10.2.0.3 Patch 5 on Windows Platforms
10.2.0.4 (Server Patch Set)
11.1.0.6 (Base Release)

on a RAC database with a varying width character set.

Applies to:
Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 10.2.0.3 – Release: 10.1 to 10.2
Information in this document applies to any platform.
Symptoms

select * from gv$sql on a RAC enviornment

errors with:
ORA-00600: internal error code, arguments: [kolaslGetLength-1], [], [], [], [], [], [], [].

with a multibye characterset (ie:al32utf8, utf8 )
Note: This error has also been reported when select from SQLT$_GV$SQL_PLAN while running Sqlt Execute diagnostic utility – SQLTXPLAIN.SQLT$*

Cause
This is Bug 4634662

Details: ora-600[kolaslGetLength-1] received selecting from v$sql over a RAC db which has a varying width character set

Solution

Bug 4634662 is fixed in 10.2.0.4 and Rel 11.

To check if a patch exists for your platform, go to Metalink:

a) Click on Patches & Updates Folder.
b) Click on Simple Search.
c) Enter patch#: Patch 4634662
d) Select your O/S
e) Click Go.

ORA-600 [Ksnpost:Ksnigb] ORA-12157

Applies to:
Oracle Server – Enterprise Edition – Version: 10.2.0.4 and later [Release: 10.2 and later ]
Information in this document applies to any platform.
Symptoms

In the alert log is reported the following:

ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [], [], []
ORA-00609: could not attach to incoming connection
ORA-12157: TNS:internal network communication error

Cause

ORA-600 [ksnpost:ksnigb] along with ORA-600 [729] errors are creating problem while creating new sessions.
All these point to insufficient memory or OS limits.

Bug 7479468 was raised for similar issue that was closed as there was not enough swap swap space hence was the problem.
Solution

Check the swap space in the system and increase if required.
References
BUG:7479468 – ORA-600 [KSNPOST:KSNIGB] AND ORA-12157

 

分析发现在10.2.0.3等版本上存在RAC中查询gv$sql视图会引发ORA-600[kolaslGetLength-1]错误的bug,该bug 可以通过实施补丁4634662来解决。

 

此外当有效内存不足或OS参数上限过小时可能导致Oracle无法成功创建新的session ,引发ORA-600[ksnpost:ksnigb]错误。

建议检查ulimit -a 的输出, 在HP-UX上不妨将stack size设置为unlimited(set the stack size to unlimited.)

$ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 1048576
stack(kbytes) 131072
memory(kbytes) unlimited
coredump(blocks) 4194303
nofiles(descriptors) 2048

ORA-00600[kjpsod1]&ORA-44203错误一例

一套HPUX-Itanium平台上版本为10.2.0.2 的系统出现ORA-00600: internal error code, arguments: [kjpsod1], [], [], [], [], [], [], [],并伴随有”ORA-44203: timeout waiting for lock on cursor”.错误,详细的日志如下:

 

Database get error
Errors in file /s01/admin/udump/prod_ora_14084.trc:
ORA-00600: internal error code, arguments: [kjpsod1], [], [], [], [], [], [], []
ORA-44203: timeout waiting for lock on cursor
ORA-44203: timeout waiting for lock on cursor

44203, 0000, "timeout waiting for lock on cursor"
// *Document : Yes
// *Cause    : A timeout occured while waiting for a cursor to be compiled.  
//             This is usually caused by the SQL parse requiring access to
//             system resources which are locked by concurrently executing
//             sessions.
// *Action   : Investigate possible causes of resource contention. If
//             neccessary, contact support for additional information
//             on how to diagnose this problem.
/////////////////////////////////////////////////////////////////////////////
//    Reserving 44301 - 44400 for DBMS_SERVICES errors
/////////////////////////////////////////////////////////////////////////////

 

 

与该Internal error相关的知识如下:

 

CAUSE DETERMINATION
===================
A fix for “ORA-00600: internal error code, arguments: [kjpsod1],” issue is in a patch 5169475.

CAUSE JUSTIFICATION
===================
A fix for “ORA-00600: internal error code, arguments: [kjpsod1],” issue is in a patch 5169475.

PROPOSED SOLUTION(S)
====================
Apply patch 5169475

PROPOSED SOLUTION JUSTIFICATION(S)
==================================
A fix for “ORA-00600: internal error code, arguments: [kjpsod1],” issue is in a patch 5169475.

SOLUTION / ACTION PLAN
======================
— To implement the solution, please execute the following steps::
Apply patch 5169475

KNOWLEDGE CONTENT
=================
Errors in file /s01/admin/udump/prod_ora_14084.trc:
ORA-00600: internal error code, arguments: [kjpsod1], [], [], [], [], [], [], []
ORA-44203: timeout waiting for lock on cursor
ORA-44203: timeout waiting for lock on cursor

Solution: Apply patch 5169475

 Bug 5169475  HPUX-Itanium: trace files may not dump all memory
 This note gives a brief overview of bug 5169475.
 The content was last updated on: 30-APR-2008
 Click here for details of each of the sections below.
Affects:

    Product (Component)    Oracle Server (Rdbms)
    Range of versions believed to be affected    Versions >= 10.2.0.2 but < 11
    Versions confirmed as being affected    

        10.2.0.2

    Platforms affected    

        HPUX Itanium 64bit

     It is believed to be a regression in default behaviour thus:
       Regression introduced in 10.2.0.2

Fixed:

    This issue is fixed in    

        10.2.0.3 (Server Patch Set)
        11.1.0.6 (Base Release)

Symptoms:

Related To:

    Diagnostic Output Problem / Improvement

    Miscellaneous

Description

    This problem is introduced in 10.2.0.2 on HPUX Itanium.

    Trace files produced on HPUX Itanium may not dump all
    memory , may show memory as inaccessible (“**********”)
    when it is accessible and may abort early due with
    false errors such as
      ERROR, BAD EXTENT ADDRESS IN DS(800000010021e098)
    due to incorrect response from the internal slrac()
    function.

    This can make it hard to progress a problem as the trace
    may omit essential information.

    Sub-Note:
      See bug 5918964 for a similar issue on HPUX PA-Risc.

 

解决方案是在10.2.0.2 的基础上打5169475补丁(Patch 5169475: SLRAC() IS UNRELIABLE ON HPUX – FALSE KGHU ERRORS AND KGH MESSAGES),或者升级到Patchset 10.2.0.5+ psu 10.2.0.5.5。

 

 

ORA-00600[kglhdunp2_2]错误一例

一套 AIX上的10.2.0.3 数据库出现了ORA-00600: internal error code, arguments: [kglhdunp2_2]错误,详细日志如下:

 

ORA-00600: internal error code, arguments: [kglhdunp2_2], [0x7000007A061F8A0], [
3], [0x7000007F4AEC160], [0x7000007A061F990], [0x7000007A06639A8], [1000], [18]
Sat Aug 20 05:11:26 2011
Fatal internal error happened while SMON was doing Unpin KGL handles with depend
ency.
Sat Aug 20 05:11:26 2011
Errors in file /u01/app/oracle/product/10.2.0/admin/bdump/prod_smon_4915426.trc:
ORA-00600: internal error code, arguments: [kglhdunp2_2], [0x7000007A061F8A0], [
3], [0x7000007F4AEC160], [0x7000007A061F990], [0x7000007A06639A8], [1000], [18]
SMON: terminating instance due to error 474
Instance terminated by SMON, pid = 4915426
Sat Aug 20 05:26:15 2011

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kglhdunp2_2], [0x7000007A061F8A0], [3],
[0x7000007F4AEC160], [0x7000007A061F990], [0x7000007A06639A8], [1000], [18]
----- Call Stack Trace -----

ksedst <- ksedmp <- ksfdmp <- kgerinv <- kgesinv
<- kgesin <- kglhdunp2 <- kglScanDependencyHa <- ndles4Unpin <- ktmmon
<- ktmSmonMain <- ksbrdp <- opirip <- opidrv <- sou2o
<- opimai_real <- main <- start

Stack trace matches Bug 7254367----
 kgesinv <- kgesin <- kglhdunp2 <- kglScanDependencyHa <- ndles4Unpin
 <- ktmmon <- ndles4Unpin <- ksbrdp <- opirip <- opidrv
 <- sou2o <- opimai_real <- main <- start

 

对比stack call 确认为Bug 7254367 – SMON crashes instance with OERI[kglhdunp2_2] (Doc ID 7254367.8)

 

Bug 7254367  SMON crashes instance with OERI[kglhdunp2_2]

Affects:

    Product (Component)    Oracle Server (Rdbms)
    Range of versions believed to be affected    Versions < 11
    Versions confirmed as being affected    

        10.2.0.4

    Platforms affected    Generic (all / most platforms affected)

Fixed:

    This issue is fixed in    

        10.2.0.3 Patch 30 on Windows Platforms
        10.2.0.4 Patch 14 on Windows Platforms
        10.2.0.5 (Server Patch Set)

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    Instance May Crash
    ORA-600 [kglhdunp2_2]

    (None Specified)

Description

    The SMON process may fail due to an ORA-600 [kglhdunp2_2]
    leading to an instance crash.

ORA-600 [kglhdunp2_2]

PURPOSE:
  This article represents a partially published OERI note.

  It has been published because the ORA-600 error has been
  reported in at least one confirmed bug.

  Therefore, the SUGGESTIONS section of this article may help
  in terms of identifying the cause of the error.

  This specific ORA-600 error may be considered for full publication
  at a later date. If/when fully published, additional information
  will be available here on the nature of this error.

SUGGESTIONS:

  If the Known Issues section below does not help in terms of identifying
  a solution, please submit the trace files and alert.log to Oracle
  Support Services for further analysis.

  Known Issues:

  Bug# 7254367   See Note:7254367.8
      SMON crashes instance with OERI[kglhdunp2_2]
      Fixed: 10.2.0.5, 10.2.0.4.P14, 10.2.0.3.P30

  Bug# 3749490   See Note:3749490.8
      KGL linked lists may become corrupt if process killed at specific time
      Fixed: 9.2.0.6

  Bug# 3724548   See Note:3724548.8
      OERI[kglhdunp2_2] / OERI[1100] under high load
      Fixed: 9.2.0.6, 10.1.0.4, 10.2.0.1:

 

解决方案是升级到10.2.0.5 或者 在原AIX 10.2.0.3.0的基础上安装7254367补丁:

 

We are hitting the bug 7254367

It is fixed in 10.2.0.5

There is a patch available for 10.2.0.3.0 on AIX

The patch 7254367 is available via My Oracle Support

To implement the solution, please execute the following steps:
1) Please download and review the read me for Patch 7254367
2) Please apply Patch. 7254367 in a test environment.
3) Please retest the issue.
4) If the issue is resolved, please migrate the solution as appropriate to other environments.

If you are going to apply the latest patchset in the same release ( 10.2.0.5 patchset or any patch on 10.2.0.3 ) , the other things like SGA recommendations / settings do not change.
So please go ahead and apply the patch.

Applying latest patchset is strongly recommended .

Why latest Patch set?

a)Latest Patch set has resolved many Bugs
b)Interim patches are provided on latest Patch set on Supported releases
c)Interim Patches are “not” rigorously tested but Patch Sets are tested rigorously.
d)We always recommend latest Patch set.
e) Refer following note.

Article-ID: Note 742060.1
Title: Release Schedule of Current Database Patch Sets

To obtain latest patchset from metalink
1.1 Go to metalink
1.2 Click on patches and updates
1.3 Click “Quick Links to the Latest Patchsets, Mini Packs, and Maintenance Packs”
1.4 Choose Oracle Database
1.5 Choose Your operating System
1.6 Choose & click 10.2.0.5

We also recommend you to apply latest PSU ( patchset update ) on top of latest patchset to make the DB stable .

ORA-00600:[1112]内部错误&ROW CACHE ENQUEUE LOCK一例

一套AIX 上的9.2.0.6 2节点RAC系统出现了ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []内部错误伴随有ROW CACHE ENQUEUE LOCK并引发clusterware split-brain resolution,详细的日志及ass.awk输出如下:

 

ALERT LOG
=============
Sun Jun 19 09:06:24 2011
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=24
Sun Jun 19 09:06:29 2011
Errors in file /s01/admin/prod/udump/prod2_ora_1061088.trc:
ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Sun Jun 19 09:06:29 2011
Errors in file /s01/admin/prod/udump/prod2_ora_1061088.trc:
ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Sun Jun 19 09:06:30 2011
Errors in file /s01/admin/prod/udump/prod2_ora_1061088.trc:
ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Sun Jun 19 09:06:30 2011
Errors in file /s01/admin/prod/udump/prod2_ora_1061088.trc:
ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Sun Jun 19 09:06:31 2011
Errors in file /s01/admin/prod/udump/prod2_ora_1061088.trc:
ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Sun Jun 19 09:06:31 2011
Errors in file /s01/admin/prod/udump/prod2_ora_1061088.trc:
ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Sun Jun 19 09:08:06 2011
Waiting for clusterware split-brain resolution
Sun Jun 19 09:13:17 2011
ALTER SYSTEM SET event='10511 trace name context forever, level 1' SCOPE=SPFILE SID='*';
Sun Jun 19 09:14:44 2011
Trace dumping is performing id=[cdmp_20110619091444]
Sun Jun 19 09:18:05 2011
Errors in file /s01/admin/prod/bdump/prod2_lmon_422072.trc:
ORA-29740: evicted by member 1, group incarnation 9
Sun Jun 19 09:18:05 2011
LMON: terminating instance due to error 29740
Sun Jun 19 09:18:05 2011
Errors in file /s01/admin/prod/bdump/prod2_lms2_725312.trc:
ORA-29740: evicted by member , group incarnation
Sun Jun 19 09:18:05 2011
Errors in file /s01/admin/prod/bdump/prod2_lms7_1008288.trc:
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 422072
Sun Jun 19 09:21:16 2011
Starting ORACLE instance (normal)

TRACE FILE
==============
prod2_ora_1061088.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2
System name: AIX
Node name: tprod2
Release: 3
Version: 5
Machine: 00CE5E834C00
Instance name: prod2

*** 2011-06-19 09:06:28.931
================================
PROCESS DUMP FROM HANG ANALYZER:
================================
Current SQL statement for this session:
SELECT formatid, globalid, branchid FROM SYS.DBA_PENDING_TRANSACTIONS ORDER BY formatid, globalid, branchid
*** 2011-06-19 09:06:28.931
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedms+00dc bl ksedst 102905E64 ?
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl nsdo 1102A8098 ? 5500000055 ?
1102DFD20 ? 1102A8200 ?
FFFFFFFFFFFC4E0 ? 000000000 ?
300000003 ?
opikndf2+06a8 bl _ptrgl
opitsk+05fc bl _ptrgl
opiino+0798 bl opitsk 000000000 ? 000000000 ?
opiodr+08e8 bl _ptrgl
opidrv+032c bl opiodr 3C00000018 ? 4101F62A0 ?
FFFFFFFFFFFF8C0 ? 0A057DC60 ?
sou2o+0028 bl opidrv 3C0C000000 ? 4A0644B50 ?
FFFFFFFFFFFF8C0 ?
main+0138 bl 01FD7B5C
__start+0098 bl main 000000000 ? 000000000 ?

Repeat 2 times
----- End of Call Stack Trace -----
*** 2011-06-19 09:06:29.111
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedms+00dc bl ksedst 102905E64 ?
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl nsdo 1102A8098 ? 5500000055 ?
1102DFD20 ? 1102A8200 ?
FFFFFFFFFFFC4E0 ? 000000000 ?
300000003 ?
opikndf2+06a8 bl _ptrgl
opitsk+05fc bl _ptrgl
opiino+0798 bl opitsk 000000000 ? 000000000 ?
opiodr+08e8 bl _ptrgl
opidrv+032c bl opiodr 3C00000018 ? 4101F62A0 ?
FFFFFFFFFFFF8C0 ? 0A057DC60 ?
sou2o+0028 bl opidrv 3C0C000000 ? 4A0644B50 ?
FFFFFFFFFFFF8C0 ?
main+0138 bl 01FD7B5C
__start+0098 bl main 000000000 ? 000000000 ?
----- End of Call Stack Trace -----
*** 2011-06-19 09:06:29.133
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedms+00dc bl ksedst 102905E64 ?
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl nsdo 1102A8098 ? 5500000055 ?
1102DFD20 ? 1102A8200 ?
FFFFFFFFFFFC4E0 ? 000000000 ?
300000003 ?
opikndf2+06a8 bl _ptrgl
opitsk+05fc bl _ptrgl
opiino+0798 bl opitsk 000000000 ? 000000000 ?
opiodr+08e8 bl _ptrgl
opidrv+032c bl opiodr 3C00000018 ? 4101F62A0 ?
FFFFFFFFFFFF8C0 ? 0A057DC60 ?
sou2o+0028 bl opidrv 3C0C000000 ? 4A0644B50 ?
FFFFFFFFFFFF8C0 ?
main+0138 bl 01FD7B5C
__start+0098 bl main 000000000 ? 000000000 ?
----- End of Call Stack Trace -----
*** 2011-06-19 09:06:29.162
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedms+00dc bl ksedst 102905E64 ?
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl nsdo 1102A8098 ? 5500000055 ?
1102DFD20 ? 1102A8200 ?
FFFFFFFFFFFC4E0 ? 000000000 ?
300000003 ?
opikndf2+06a8 bl _ptrgl
opitsk+05fc bl _ptrgl
opiino+0798 bl opitsk 000000000 ? 000000000 ?
opiodr+08e8 bl _ptrgl
opidrv+032c bl opiodr 3C00000018 ? 4101F62A0 ?
FFFFFFFFFFFF8C0 ? 0A057DC60 ?
sou2o+0028 bl opidrv 3C0C000000 ? 4A0644B50 ?
FFFFFFFFFFFF8C0 ?
main+0138 bl 01FD7B5C
__start+0098 bl main 000000000 ? 000000000 ?
----- End of Call Stack Trace -----
*** 2011-06-19 09:06:29.175
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedms+00dc bl ksedst 102905E64 ?
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl nsdo 1102A8098 ? 5500000055 ?
1102DFD20 ? 1102A8200 ?
FFFFFFFFFFFC4E0 ? 000000000 ?
300000003 ?
opikndf2+06a8 bl _ptrgl
opitsk+05fc bl _ptrgl
opiino+0798 bl opitsk 000000000 ? 000000000 ?
opiodr+08e8 bl _ptrgl
opidrv+032c bl opiodr 3C00000018 ? 4101F62A0 ?
FFFFFFFFFFFF8C0 ? 0A057DC60 ?
sou2o+0028 bl opidrv 3C0C000000 ? 4A0644B50 ?
FFFFFFFFFFFF8C0 ?
main+0138 bl 01FD7B5C
__start+0098 bl main 000000000 ? 000000000 ?
----- End of Call Stack Trace -----
*** 2011-06-19 09:06:29.192
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedms+00dc bl ksedst 102905E64 ?
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl nsdo 1102A8098 ? 5500000055 ?
1102DFD20 ? 1102A8200 ?
FFFFFFFFFFFC4E0 ? 000000000 ?
300000003 ?
opikndf2+06a8 bl _ptrgl
opitsk+05fc bl _ptrgl
opiino+0798 bl opitsk 000000000 ? 000000000 ?
opiodr+08e8 bl _ptrgl
opidrv+032c bl opiodr 3C00000018 ? 4101F62A0 ?
FFFFFFFFFFFF8C0 ? 0A057DC60 ?
sou2o+0028 bl opidrv 3C0C000000 ? 4A0644B50 ?
FFFFFFFFFFFF8C0 ?
main+0138 bl 01FD7B5C
__start+0098 bl main 000000000 ? 000000000 ?
----- End of Call Stack Trace -----
Files currently opened by this process:
===================================================
PROCESS STATE
-------------
Process global information:
process: 700000676099520, call: 0, xact: 0, curses: 0, usrses: 700000673decd98
----------------------------------------
SO: 700000676099520, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=224, calls cur/top: 0/7000006c2ca3df8, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 50
last post received-location: kcbzww
last process to post me: 700000676119f00 7 0
last post sent: 0 0 21
last post sent-location: ksqrcl
last process posted by me: 700000676428258 1 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 700000676cc19b0
O/S info: user: oracle, term: UNKNOWN, ospid: 1061088
OSD pid info: Unix process pid: 1061088, image: oracle@tprod2 (TNS V1-V3)
----------------------------------------

END OF PROCESS STATE
******************** Cursor Dump ************************
Current cursor: 2, pgadep: 0
pgactx: 7000006f8bc2d40 ctxcbk: 0 ctxqbc: 0 ctxrws: 700000716aecfd0
Explain plan:
Plan Table
--------
-------------------------------------------------------------------------------------------------------------------------
| Operation | Name | Rows | Bytes | Cost | TQ |IN-OUT| PQ Distrib |Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------------
| SELECT STATEMENT | | 0 | 0 | 0 | | | | | |
| SORT ORDER BY | | 0 | 0 | 0 | | | | | |
| VIEW | | 0 | 0 | 0 | | | | | |
| SORT UNIQUE | | 0 | 0 | 0 | | | | | |
| UNION-ALL | | 0 | 0 | 0 | | | | | |
| MINUS | | 0 | 0 | 0 | | | | | |
| SORT UNIQUE | | 0 | 0 | 0 | | | | | |
| VIEW | | 0 | 0 | 0 | | | | | |
| FIXED TABLE FULL | X$K2GTE2 | 0 | 0 | 0 | | | | | |
| SORT UNIQUE | | 0 | 0 | 0 | | | | | |
| NESTED LOOPS | | 0 | 0 | 0 | | | | | |
*** 2011-06-19 09:06:29.376
ksedmp: internal or fatal error

ORA-00600: internal error code, arguments: [1112], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT formatid, globalid, branchid FROM SYS.DBA_PENDING_TRANSACTIONS ORDER BY formatid, globalid, branchid
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp+0148 bl ksedst 102905C84 ?
ksfdmp+0018 bl 01FD8148
kgeriv+0118 bl _ptrgl
kgesiv+0080 bl kgeriv 07FFFFFFC ? 800000000000000 ?
1000000000000000 ?
1800000000000000 ?
028828228 ?
ksesic0+005c bl kgesiv 7000006BE3BB328 ? 000010550 ?
7000006BE3AADD8 ? 10297D7E8 ?
FFFFFFFFFFF3A20 ?
kssadf_stage+0084 bl ksesic0 45800000458 ? 11007A2F8 ?
000000000 ? 000000000 ?
000000000 ? 70000000001DB80 ?
000000000 ? 700000703BBF040 ?
kqreqa+008c bl kssadf_stage 7000006BE3AADD8 ? 10297D7E8 ?
068A31055 ? 000006BB0 ?
000000001 ?
kqrpre1+06e4 bl kqreqa 000000001 ?
kqrpre+001c bl kqrpre1 BAC3F8E66 ? 000000001 ?
FFFFFFFFFFF4008 ? 1101F9A14 ?
1101F9A14 ? FFFFFFFFFFF4000 ?
07FFFFFFF ? 000000000 ?
kkdlobni+0058 bl kqrpre 100F29A04 ?
4222442400000000 ?
14DFD4B95 ?
166CCD19101F62A0 ?
000000002 ? 000000000 ?
FFFFFFFFFFF40C0 ?
xplObjnToName+0150 bl kkdlobni 9A0000009A ?
FFFFFFFFFFF4444 ? 000000000 ?
000000000 ?
xplPatchName+00a4 bl xplObjnToName 9AFFFF46F0 ?
FFFFFFFFFFF4444 ?
xplMakeRow+0190 bl xplPatchName 000000000 ? 000000000 ?
000000000 ?
xplFetchRow+00b4 bl _ptrgl
xplDumpRws+0604 bl xplFetchRow 1029CFB48 ? FFFFFFFFFFF4770 ?
1101F9A14 ?
curdmp+0164 bl xplDumpRws 102AE2A20 ?
ksedms+012c bl curdmp
ksdxfdmp+0200 bl _ptrgl
ksdxcb+02d8 bl _ptrgl
sspuser+0084 bl 01FD7CA8
000044C0 ? 00000000
snttread+0028 bl 00009CFC
nttrd+0118 bl snttread FFFFFFFFFFFBBB3 ?
FFFFFFFFFFFBBA8 ?
FFFFFFFFFFFB2C0 ?
nsprecv+0984 bl _ptrgl
nsrdr+01d0 bl nsprecv 000000000 ? 110299C00 ?
000000000 ?
nsdo+1818 bl nsrdr 000000000 ? 000000000 ?
nioqrc+05c4 bl

Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate. (The holder may have released the resource before we
dumped the state object tree of the blocking process).
o Lines with 'Enqueue conversion' below can be ignored *unless*
other sessions are waiting on that resource too. For more, see
http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html#enqcnv)

Resource Holder State
Latch 70000000000a4b8 115: Blocker
Latch 70000000000a4b8 210: Blocker
Latch 70000000000a4b8 270: Blocker
Latch 70000000000a4b8 406: Blocker
Latch 70000000000a4b8 614: Blocker
Latch 70000000000a4b8 626: Blocker
Latch 70000000000a4b8 882: Blocker
Latch 70000000000a4b8 1489: Blocker
Latch 70000000000a4b8 1617: Blocker
Latch 70000000000a4b8 1878: Blocker
Latch 70000000000a4b8 1916: Blocker
Latch 70000000000a4b8 1947: Blocker
Latch 70000000000a4b8 1963: Blocker
Latch 70000000000a4b8 2121: 2121: is waiting for Latch 700000675dae330
Latch 70000000000a4b8 2245: Blocker
Latch 70000000000a4b8 2351: Blocker
Latch 70000000000a4b8 2566: Blocker
Latch 70000000000a4b8 2585: Blocker
Latch 70000000000a4b8 2643: Blocker
Latch 70000000000a4b8 2773: 2773: is waiting for Latch 700000675daf3a8
Latch 70000000000a4b8 2791: Blocker
Latch 70000000000a4b8 2795: Blocker
Latch 70000000000a4b8 2966: Blocker
Latch 70000000000a4b8 2969: Blocker
Latch 700000675dadf50 ??? Blocker
Latch 700000675dadc68 ??? Blocker
Latch 700000675dadb70 ??? Blocker
Latch 7000006be3a6530 ??? Blocker
Latch 700000675dae808 ??? Blocker
Latch 700000675db0040 ??? Blocker
Latch 7000006d1d71138 ??? Blocker
Latch 700000675dad3b0 ??? Blocker
Latch 700000675dae330 ??? Blocker
Latch 7000006b2d4fd28 2211: Blocker
Latch 7000006b2d4fd28 2220: Blocker
Latch 7000006b2e5df68 2660: Blocker
Latch 7000006b2e5e3e8 2752: Blocker
Latch 7000006b2e5e3e8 2876: Blocker
Latch 7000006b2d06b28 ??? Blocker
Latch 7000006b2f9f928 ??? Blocker
Latch 7000006b2d4db68 ??? Blocker
Latch 7000006b2e5e868 ??? Blocker
Latch 7000006b2d4e6a8 ??? Blocker
Latch 7000006b2d4eb28 2434: Blocker
Latch 7000006b2d4eb28 2437: 2437: is waiting for 2434: 2437:
Latch 7000006b2d4f428 2925: Blocker
Latch 7000006b2d4f428 2948: Blocker
Latch 7000006b2d07428 ??? Blocker
Latch 7000006b2d4e588 ??? Blocker
Latch 7000006b2e5ece8 ??? Blocker
Latch 7000006b2d4efa8 ??? Blocker
Latch 7000006b2d07c08 ??? Blocker
Latch 7000006b2f9e968 ??? Blocker
Latch 700000675daf3a8 ??? Blocker
Latch 7000006b2a49f68 3198: Blocker
Latch 70000000001a968 ??? Blocker

Some of the above latches may be child latches. Please check the section
named 'Child Latch Report' below for further notes.

Blockers According to Tracefile Wait Info:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. This may not work for 64bit platforms. See bug 2902997 for details.
2. If the blocking process is shown as 0 then that session may no longer be
present.
3. If resources are held across code layers then sometimes the tracefile wait
info will not recognise the problem.

No blockers seen.

Object Names
~~~~~~~~~~~~
Latch 70000000000a4b8 enqueues
Latch 700000675dadf50 Child enqueue hash chains
Latch 700000675dadc68 Child enqueue hash chains
Latch 700000675dadb70 Child enqueue hash chains
Latch 7000006be3a6530 Child row cache objects
Latch 700000675dae808 Child enqueue hash chains
Latch 700000675db0040 Child enqueue hash chains
Latch 7000006d1d71138 Child library cache pin
Latch 700000675dad3b0 Child enqueue hash chains
Latch 700000675dae330 Child enqueue hash chains
Latch 7000006b2d4fd28 Child cache buffers chains
Latch 7000006b2e5df68 Child cache buffers chains
Latch 7000006b2e5e3e8 Child cache buffers chains
Latch 7000006b2d06b28 Child cache buffers chains
Latch 7000006b2f9f928 Child cache buffers chains
Latch 7000006b2d4db68 Child cache buffers chains
Latch 7000006b2e5e868 Child cache buffers chains
Latch 7000006b2d4e6a8 Child cache buffers chains
Latch 7000006b2d4eb28 Child cache buffers chains
Latch 7000006b2d4f428 Child cache buffers chains
Latch 7000006b2d07428 Child cache buffers chains
Latch 7000006b2d4e588 Child cache buffers chains
Latch 7000006b2e5ece8 Child cache buffers chains
Latch 7000006b2d4efa8 Child cache buffers chains
Latch 7000006b2d07c08 Child cache buffers chains
Latch 7000006b2f9e968 Child cache buffers chains
Latch 700000675daf3a8 Child enqueue hash chains
Latch 7000006b2a49f68 Child cache buffers chains
Latch 70000000001a968 Parent transaction allocation

Child Latch Report
~~~~~~~~~~~~~~~~~~
Some processes are being blocked waiting for child latches.

At the moment this script does not detect the blocker because the
child latch address differs to the parent latch address. To manually
detect the blocker please take the following steps :
1. Determine the TYPE of latch (Eg library cache) that is involved.
2. Search the source trace file for a target of :
holding.*Parent.*library cache
(Assuming we have a child library cache and have vi-like regular expressions)

If this shows nothing then the blocker may have released the resource
before we got to dump the state object tree of the blocked process.

A list of processes that hold parent latches is given below :

No processes found.

Summary of Wait Events Seen (count>10)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
No wait events seen more than 10 times

 

 

ORA-00600:[1112]内部错误的相关知识如下:

 

ERROR:
ORA-600 [1112] [a] [b] [c] [d] [e]

VERSIONS:
versions 7.3 to 9.2

DESCRIPTION:

ORA-600 [1112] is getting raised while trying to add a
row cache enqueue to a transaction state object during
lookup of the default tablespace number during table
creation.

FUNCTIONALITY:
STATE OBJECT MANAGEMENT

IMPACT:
PROCESS FAILURE
NON CORRUPTIVE – No underlying data corruption.

Bug 2489130 – OERI:1112 can occur while dumping PROCESSSTATE informatio (Doc ID 2489130.8)
Bug 4126973: ORA-600[504] AND ORA-600[1112] OCCURED WHEN GETTING “ERRORSTACK”
Base Bug 2489130
Bug 3954753: ORA-600 [1112] AND SESSION CRASH

 

经过诊断发现该ORA-00600:[1112]内部错误是由Bug 2489130所引起的,而触发该Bug的直接原因是WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!:

 

The cause for the ORA-00600 [1112] appears due to Bug 2489130
This error can occur on dumping of process state which is what occurred here.
The primary issue is the WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!
This then triggers a system state and process state to be dumped due to nature of the problem.
The ORA-00600 [1112] gets dumped out when process state is done.

Stack for trace very similar to Bug 2489130 and this is only known bug on 9.2 like this with a fix.

A fix for bug 2489130 is included in the 9.2.0.7 patchset.
Recommend applying 9.2.0.8 patchset to have this and other bug fixes.
This would only prevent the ORA-00600 [1112] from occurring on state dumps.

 

解决方案是 优化SQL性能以避免出现WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!, 或者至少升级数据库版本到9.2.0.8 这个推荐的patchset。

ORA-00600

ORA-00600 Internal Error 是我们在学习使用Oracle的过程中,必然会经历的一个站点。

很多同学一遇到ORA-00600 错误信息,就认为自己碰到了Oracle Database软件的Bug,实际上这一观点是不准确的。

ORA-00600可能由多种原因造成,包括软件漏洞、Bug、程序运行异常、内存讹误和数据讹误造成。

举例来说在数据异常恢复过程中常遇到的ORA-00600[2662](Block SCN is ahead of Current SCN) 和ORA-00600[4000](回滚段rollback数据块时发现rollback segment存在讹误)错误 均是数据讹误引起的而非bug 。

我们在分析ORA-00600 Internal Error, 定位具体故障的时候,从600 trace中能够找到的最为有用的信息就是600所附带的Argument信息:

 

实际600 Internal Error 的Argument 可以分成 2种:

 

a.   第一位是数字类型的Argument , 例如之前说的2662 和 4000 , 不同的数字代表不同的错误含义。 数字类型的argument 所代表的内部错误相对更为普遍、常见。  实际这些数字Argument 也是来源于 不同的Oracle Kernel Function内核函数,如kddummy_blkchk、kclchkinteg_2 等; 但是因为这些错误较为常见, 一方面为了照顾用户的使用体验( 用户对RDBMS软件的内核函数是不感兴趣的,当然可能我们感兴趣), 另一方面这些函数涉及到很多Oracle的内部原理,为了不让这些内核函数暴露在外, 所以Oracle开发部门对这些常见的Internal Error状态进行了编码,转换成数字代码的形式, 实际上这些数字代码形式的Argument 都有其与OERR类似的注释,这些注释没有被包含在oraus.msg中,但是在该msg文件中说明了这些注释仅仅是不公开, Oracle公司的员工是可以看到的:

 

 Programmer's Comments
 ---------------------
 If you wish to add comments regarding a message that should not be seen by
 the public, use "// *Comment: " as follows:

   e.g.
       32769, 00000, "incompatible SQL*Net version"
        *Cause: An attempt was made to use an older version of SQL*Net that
         is incompatible with current version of ORACLE.

 

数字编码Argument 的Internal error 如果不只打印出一位的Argument的话,那么后续几位的Argument 一般都是有其实际意义的,如ORA-00600[2662]的后续Argument 的含义为:

ARGUMENTS:
Arg [a] Current SCN WRAP
Arg [b] Current SCN BASE
Arg [c] dependent SCN WRAP
Arg [d] dependent SCN BASE
Arg [e] Where present this is the DBA where the dependent SCN came from.

这就便于Oracle Support 来诊断和解决这些Internal Error。对于数字类型的Argument ,Metalink上一般会公开其后续Argument的含义,且因为这些问题较为常见,所以一般都已经提供专门的Resolved Solution 或者 Workaround 方法来提供。

总而言之数字编码的ORA-00600 argument 一般我们可以通过 在Metalink 上搜索 ORA-00600 + 第一位 Argument ,或者使用<ORA-600/ORA-7445 Error Look-up Tool [ID 153788.1]>诊断工具页面来找到相关的有用Note。

 

b. 函数名形式的Argument 。 这类Argument 代表的Internal Error 相对于前一种要出现的频率低一些, Oracle开发部门尚来没有在相关版本中将这些Internal Error 编码。 这样我们就可以看到出现问题的完整Kernel Function Name , 可以使用ORA-600 + 第一位 Argument 在Metalink 上搜索来找到一些相关的Note , 但是函数名形式的Argument  往往不能精确定位到问题 ,因为 不同的错误原因 可能在同一个内核函数中引发不同的异常 , 而这个时候我们只能看到 函数名的Argument 信息。 更精确定位的 方式是找出 在调用这个函数时的 详细stack call , 我们来看一个ORA-600[KCBZ_CHECK_OBJD_TYP_1]的stack call:

 

ksedst()+40
ksedmp()+168
ksfdmp()+32
kgerinv()+152
kgeasnmierr()+88
kcbassertbd3()+204
kcbz_check_objd_typ
kcbzib()+
kcbgtcr()+
ktecgsc()+168
ktecgetsh()+196
ktecgshx()+40
kteinicnt1()+648
ktssdrbm_segment()+
ktssdro_segment()+3
ktssdt_segs()+1128
ktmmon()+3500
ktmSmonMain()+64
ksbrdp()+1276
opirip()+
opidrv()+1088
sou2o()+120
opimai_real()+496
main()+240
$START$()+

 

注意以上stack call中 只有 ktmSmonMain -> kcbassertbd3 这部分是有意义的, 开始部分的main()-> ksbrdp() 是很普通的入口函数 , 而从kgeasnmierr (Kernel generic Error ) 开始的代码是Oracle 报错层使用的函数 , 都是对定位问题没有帮助的。 将这部分有用的stack call 填入Metalink <ORA-600/ORA-7445 Error Look-up Tool [ID 153788.1]> 600问题诊断页面的 stack call 栏 会以较严格的筛选条件找出问题相关的Note:

 

 

针对ORA-00600 的解决 一般 Oracle Support 会给出  补丁修复 和 Workaround 绕过该问题的 2 类解决方案 , 当然也还是存在Oracle 研发部门无法在他们的环境中重现你所遇到的ORA-00600的可能性,这意味着部分600错误可能是官方无解的,也可能是Oracle Support 已经掌握某种Workaround 的方法, 但是没有在现有的Note 文档中提交的情况 , 当然这都是少数现象。

 

如果实在找不到可用的解决方案, 或者您的产品数据库有极高的服务等级要求,那么提交Service Request (SR) 有些老人还是习惯于称其为Tar的服务请求 , 可能是一种终极手段。 但是我不得不说一句 并非所有的问题 都是有解的 , 您使用的TV 电视机的制造商可以解决 所有其在使用环节中遇到的问题吗? 理论上是可以的 , 但是当解决一个问题的成本非常高时 , 制造商可能更情愿给你换一台电视 ,但是您的产品数据库 可以轻易更换吗?  这是一个值得深思的问题 , 也是RDBMS市场的 一条悖论。

 

 

 

来读读 由Maclean Liu 所编写的ORA-00600 Oracle Internal Error 的相关文章:

Oracle内部错误:ORA-00600:[4097]一例
Oracle内部错误:ORA-00600[15801], [1]一例
Oracle内部错误:ORA-00600:[6033]一例

Oracle内部错误:ORA-00600[OSDEP_INTERNAL]一例

Oracle内部错误:ORA-00600[kgskdecrstat1]一例

Oracle内部错误:ORA-00600[kfioTranslateIO03]一例

Oracle内部错误ORA-00600:[pfri.c: pfri8: plio mismatch ]一例

Oracle内部错误:ORA-00600[2608]一例

Oracle内部错误:ORA-00600[13013][5001]故障诊断一例

Oracle内部错误:ORA-00600[17175]一例

Oracle内部错误ORA-00600:[2667]一例

Oracle RAC内部错误:ORA-00600[keltnfy-ldmInit]一例

ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [729], [10992], [SPACE LEAK] Example

手工模拟Oracle数据块逻辑讹误引发,ORA-00600:[13013] [5001]一例

ORA-00600
[4400][48]错误一例

ORA-00600 [KCBZPB_1], [59033077], [4], [1], [] example

ORA-00600:[qctcte1]内部错误一例

ORA-00600
: internal error code, arguments: [15160]

ORA-00600
: internal error code, arguments: [kdsgrp1] example

Oracle内部错误:ORA-00600[25012]一例

ora-00600
:[17281], [1001]一例

ORA-00600
:[kclchkinteg_2]及[kjmsm_epc]内部错误一例

Oracle内部错误:ORA-00600[kccchb_3]一例

ORA-00600
: [qksrcBuildRwo]内部错误一例

ORA-00600
:[32695], [hash aggregation can’t be done]错误一例

ORA-00600
[6711]错误一例

ora-00600
[kkocxj:pjpCtx]内部错误一例

ORA-00600
[kcbz_check_objd_typ_3]错误一例

ORA-00600
:[15570]内部错误一例

ORA-00600
[3756]内部错误一例

ORA-00600
[kddummy_blkchk]错误一例

How to trigger ORA-00600,ORA-7445 by manual

ora-600 [17182]错误一例

Database Force open example

ora-600[qesmmCValStat4]一例

ORA-600 [kddummy_blkchk] [18038] 一例

Oracle内部错误:ORA-00600[kgskdecrstat1]一例

famous summary stack trace from Oracle Version 8.1.7.4.0 Bug Note

Oracle内部错误ORA-600:[1112]

一次Exadata上的ORA-600[kjbmprlst:shadow]故障分析
ORA-600 quick reference guide
ORA-00600[kglhdunp2_2]错误一例
ORA-00600:[kclchkinteg_2]及[kjmsm_epc]内部错误一例
ORA-00600: [7005], [192]内部错误一例
ORA-600 internal error[kqrfrpo]一例
ORA-600[4194]错误一例
How to trigger ORA-00600,ORA-7445 by manual
ORA-00600[kjpsod1]&ORA-44203错误一例


一次Exadata上的ORA-600[kjbmprlst:shadow]故障分析

 

这是一套Exadata 上的11.2.0.1   四节点RAC 系统,从今年初开始频繁地因为LMS后台进程出现内部错误ORA-600[kjbmprlst:shadow]而导致实例意外终止, 虽然是4节点RAC 保证了其高可用性, 但是仍因为 实例经常意外终止 导致应用程序交易失败。

实际我在7月份已经分析过该问题了, 详见<Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例>一文 , kjbmprlst:shadow内部函数用以管理kjbm shadow锁(/libserver10.a/kjbm.o )信息,存在某个已关闭的lock没有及时message给master node的代码漏洞。 实际上当时我已经给出了禁用DRM 可能可以避免该ORA-600[kjbmprlst:shadow] 的看法 ,但是 翻阅MOS 上所有现有的ORA-600[kjbmprlst:shadow] case均没有提及disable DRM可以是一种workaround的途径, 提交SR 也没有得到Oracle GCS关于该看法的 肯定回答, 此外还考虑到 核心的产品环境 使用11.2.0.1 这个bug 较多的版本确实不合适, 升级到 11.2.0.2 已修复该Bug 10121589 的bundle Patch 也算是一种不错的方案 ,所以也就没有深究下去。

之后就一直在做升级Exadata到11.2.0.2 的一些准备工作, 但是用户最后考虑到升级Exadata的步骤过于复杂, 很多步骤是不可回退的, 而且在国内升级的案例目前也不多, 推翻了 升级的方案, 于是有了下面的这一段故障分析

 

故障特征

 

如上面所说的这是一套Exadata上的4节点 11.2.0.1 RAC , 应用程序是 Oracle 自己的Billing and Revenue Management (BRM) 。从今年初开始多次出现ORA-600[kjbmprlst:shadow]内部错误, 引发该内部错误的是RAC 关键后台进程 LMS (  global cache server process) ,结果是导致实例意外终止。

 

该ORA-600[kjbmprlst:shadow]在10月份频发的发生(几乎一个礼拜2次),故障发生还存在以下2个特征:

  1. 主要集中在凌晨03:00 以后的20分钟内发生
  2. 1、2、4节点都陆续发生过故障 , 唯有 3号节点从未发生过 ( 这个后面会介绍)

 

初步分析

 

通过应用方拿到了某次故障发生当天每一个小时的AWR报告, 这次故障发生在凌晨 03:05 的 1号节点上, 于是开始分析当时的 1号实例的性能负载。

分析AWR 报告的一种方法就是 通过对比多个时间段的报告 来了解性能变化 ,当然前提是知道问题发生的时段, 这里我们来对比2个时段的信息,这三个时段分别是:

1. 问题发生前的几分钟, 因为1号实例上 3点以后的AWR 报告缺失,所以这里使用的是 02:00 ~ 03:00 的AWR
2. 问题发生前的1小时 , 01:00~02:00 的AWR

 

首先观测DB TIME:

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14225 12-Oct-11 02:00:05 284 152.2
End Snap: 14226 12-Oct-11 02:59:59 288 150.1
Elapsed: 59.91 (mins)
DB Time: 809.98 (mins)

 

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14224 12-Oct-11 01:00:10 284 152.3
End Snap: 14225 12-Oct-11 02:00:05 284 152.2
Elapsed: 59.91 (mins)
DB Time: 257.62 (mins)

 

AWR中的DB TIME可以一定程度上反映数据库实例的繁忙程度。这里 01:00 到 02:00 时间段的AAS= 257/60 = 4.28,而问题发生的时间段的 AAS = 809 / 60 = 13 , 说明实例的工作负载明显上升了。

 

再来看主要等待事件:

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: TX – row lock contention 10,072 16,079 1596 33.08 Application
log file sync 166,546 10,579 64 21.77 Commit
gc buffer busy acquire 55,828 8,017 144 16.50 Cluster
gc buffer busy release 21,203 5,904 278 12.15 Cluster
gc current block busy 29,196 3,362 115 6.92 Cluster

 

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
log file sync 258,415 4,875 19 31.54 Commit
gc buffer busy acquire 44,471 3,349 75 21.66 Cluster
DB CPU 1,880 12.16
gc current block busy 42,722 1,748 41 11.31 Cluster
gc buffer busy release 10,730 1,127 105 7.29 Cluster

 

问题发生时段的主要等待事件是 enq: TX – row lock contention 这说明存在大量的行锁争用,此外gc buffer busy acquire/release 和 gc current block busy 全局缓存争用等待时长也明显较 之前的 01:00- 02:00 多了数倍,这说明问题发生时段 节点间对全局缓存资源的争用进一步被加剧了。01:00 – 02:00 时段的 enq: TX – row lock contention 不是前5的等待事件,但实际每秒处理的事务数更多为71.8个(问题发生时段为每秒46.3个事务),这说明造成大量行所争用的的原因不是事务变多, 而是应用程序特性造成的。

 

引起频繁 enq: TX – row lock contention等待事件 和 全局缓存资源争用 的最主要SQL语句均是1usg75g3cx4n6:

 

Cluster Wait Time (s) Executions %Total Elapsed Time(s) %Clu %CPU %IO SQL Id SQL Module SQL Text
4,047.30 20,470 22.75 17,790.11 21.01 0.04 0.00 1usg75g3cx4n6 update

 

SQL “1usg75g3cx4n6” 是一条Update更新语句,在此不列出。

 

问题发生时段该语句 1 号实例上运行了20470次,平均每次运行耗时 0.87s , cluster wait time 占该SQL总执行时间的22.75% ,而在01:00 – 02:00 期间该语句共运行了32845 次,平均每次运行耗时 0.02s , cluster wait time 占该SQL总执行时间的 67.44 %。

以上信息说明1号实例在问题发生时段 由SQL语句1usg75g3cx4n6 因为enq: TX – row lock contention(主要因素) 和 gc 全局缓存争用 (次要因素)相互作用引起了巨大的负载。引起enq: TX – row lock contention行锁争用的主要原因是 并发的DML更新相关的行数据。

 

深入分析

除了在问题发生时段数据库负载加剧,出现大量的行锁和全局缓存争用外,问题时段还出现了大规模的动态资源管理操作DRM( Dynamic Resource Management)。

引入DRM的目的是尽可能减少RAC 中CPU和 消息通信的消耗,DRM的资源管理决定依据相关对象的全局缓存操作历史, 该特性会定期自动找出集群中频繁访问某个缓存对象的节点实例,并将该缓存对象re-master到该实例上以优化RAC性能。  DRM 由 LMON、LMD 和 LMS 进程协同完成, LMS负责传输实际的 gcs resouce和 lock 锁信息 , 11g中受到_gc_policy_time隐藏参数的影响,默认最短每10分钟可能被触发。当设置_gc_policy_time=0 ,则object affinity 和 11g 中引入的read mostly policies新特性均将被禁用。

 

AWR报告中1号实例问题时段的 DRM 统计:

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 2 1.00
remastered objects 2 1.00
replayed locks received 215,517 107,758.50
replayed locks sent 339,349 169,674.50
resources cleaned 0 0.00
remaster time (s) 7.4 3.70
quiesce time (s) 4.3 2.13
freeze time (s) 0.1 0.04
cleanup time (s) 0.4 0.21
replay time (s) 1.0 0.48
fixwrite time (s) 0.9 0.47
sync time (s) 0.8 0.38
affinity objects 525 525

 

以下为01:00- 02:00 时段的DRM 统计:

Name Total per Remaster Op Begin Snap End Snap
remaster ops 5 1.00
remastered objects 8 1.60
replayed locks received 110,902 22,180.40
replayed locks sent 68,890 13,778.00
resources cleaned 0 0.00
remaster time (s) 12.9 2.57
quiesce time (s) 1.4 0.28
freeze time (s) 0.2 0.03
cleanup time (s) 0.9 0.17
replay time (s) 0.5 0.10
fixwrite time (s) 1.9 0.38
sync time (s) 8.0 1.60
affinity objects 526 525

 

可以看到在问题发生时段 DRM 发送(sent)了更多的replayed locks锁资源 ,这里的replay是指将本地锁信息传送到新的master 实例(REPLAY – Transfer of the local lock information to the new master.) , 而负责发送这些replayed locks锁资源的 正是LMS 进程。

如上文所述1、2、4节点均出现过该ORA-600[kjbmprlst:shadow]故障,但是唯有3号节点从没有出现过该问题, 打开3号节点在问题发生时段(03:00-04:00) 的AWR报告后就有重大的发现:

 

Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
3,590.52 0 17.30 34.71 63.63 59v4zh1ac3v2a DBMS_SCHEDULER DECLARE job BINARY_INTEGER := …
3,589.83 0 17.29 51.69 37.98 b6usrg82hwsa3 DBMS_SCHEDULER call dbms_stats.gather_databas…
2,173.79 1 2,173.79 10.47 62.16 28.75 b20w3p880j2gs DBMS_SCHEDULER /* SQL Analyze(1) */ select /*…

 

可以发现3号节点当时正在运行”call dbms_stats.gather_database_stats_job_proc ( )”  11g中自动收集统计信息的作业,该作业运行了较长时间(一个小时内尚未完成), 我们知道10g以后引入了默认的自动收集统计信息的作业,在11g中得到了加强, 这一作业会自动收集自上一次收集以来更新超过总数据量10%的 对象的统计信息, 因为该核心产品数据库是大型的OLTP应用,在03:00之前已经发生了大量的数据更改,所以这导致gather_database_stats_job_proc 要收集许多大对象上的统计信息,而这会引发该3号实例频繁Request之前在 1、2号实例中被更新的全局缓存资源,频繁地访问影响了DRM的decision, 导致大量的object 被 re-master 到 3号节点上。

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 14 1.00
remastered objects 25 1.79
replayed locks received 1,088,009 77,714.93
replayed locks sent 202,112 14,436.57
resources cleaned 0 0.00
remaster time (s) 30.8 2.20
quiesce time (s) 8.5 0.60
freeze time (s) 0.4 0.03
cleanup time (s) 3.3 0.24
replay time (s) 6.4 0.46
fixwrite time (s) 7.7 0.55
sync time (s) 4.5 0.32
affinity objects 4,294,967,289 4,294,967,289

 

可以看到3号实例接收了大量的replayed lock (replayed locks received) ,而发送的lock则很少。而实际发送(sent) 给 3号实例这些锁信息 资源的 正是 1号实例上的LMS进程。

翻阅alert.log 和视图 我发现每一次的gather_database_stats_job_proc均在凌晨3点的 3号实例发生, 因此每一次故障发生时3号实例只要接收这些replayed locks   资源即可,其LMS进程 仅发送少量的 replayed locks , 所以造成了上面所说的3号实例从不发生该ORA-600[kjbmprlst:shadow]故障 , 这一古怪的现象。

 

我们再来阅读上发生ORA-600[kjbmprlst:shadow]故障的LMS进程的trace文件中的部分内容:

*** 2011-10-12 02:25:22.770
DRM(22812) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(1) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:29.717
 lms 1 finished fixing gcs write protocol
DRM(22813) win(2) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:30.312
DRM(22813) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(4) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(5) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:31.280
 lms 1 finished fixing gcs write protocol
DRM(22813) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(7) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:32.269
DRM(22813) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(1) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(2) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:33.479
 lms 1 finished fixing gcs write protocol
DRM(22814) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(4) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:34.333
 lms 1 finished fixing gcs write protocol
DRM(22814) win(5) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:35.315
DRM(22814) win(7) lms 1 finished replaying gcs resources
Incident 1035732 created, dump file:
/maclean1_lms1_20075_i1035732.trc
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []

2011-10-12 03:05:36.811503 : kjbmprlst: internal error on replaying 0x618cd.371 from 3
2011-10-12 03:05:36.811541 : kjbmbassert [0x618cd.371]
2011-10-12 03:05:36.811559 : kjbmsassert(0x618cd.371)(2)
2011-10-12 03:05:36.811636 : kjbmsassert(0x618cd.371)(3)
2011-10-12 03:05:36.811667 : kjbmsassert(0x618cd.371)(4)
kjmpbmsg fatal error on 39
MSG [39:KJX_B_REPLAY] .Yq inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
    flow ctrl: ver=60 flag=169 len=136 tkts=0 seq=0 wrp=0
               sndr=3 dest=1 rcvr=2
   FUSION MSG 0x2b178baa5120,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x618cd.371, 574166.0] c[0x59f904e08,-17597] [0x1,xc69]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99097 drmbkt# 99097
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 9 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99097, drmb 99097, apifrz 1
   FUSION MSG DUMP END
MSG [65521:PBATCH] inc=64 len=8376 sender=3 seq=279807301 fg=q
    flow ctrl: ver=1 flag=37 len=8376 tkts=0 seq=279807301 wrp=1
               sndr=3 dest=1 rcvr=2
MSG [39:KJX_B_REPLAY] .Y'z inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
   FUSION MSG 0x2b178baa31c8,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x527f5.27a, 574166.0] c[0xa2f7eddc0,2158] [0x1,xf33]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99582 drmbkt# 99582
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 33 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99582, drmb 99582, apifrz 1
   FUSION MSG DUMP END

.........................

    ------------process 0xcf22a6d40--------------------
    proc version      : 0
    Local node        : 0
    pid               : 20075
    lkp_node          : 0
    svr_mode          : 0
    proc state        : KJP_NORMAL
    Last drm hb acked : 114687
    flags             : x80
    current lock op    : (nil)
    Total accesses    : 11763805
    Imm.  accesses    : 11763774
    Locks on ASTQ     : 0
    Locks Pending AST : 0
    Granted locks     : 0
    AST_Q: 
    PENDING_Q: 
    GRANTED_Q: 
    KJM HIST LMS1: 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 1:0 
      13:65521:0 20:39:0 20:39:0 20:39:0 
    START DEFER MSG QUEUE 0 ON LMS1 flg x87:

MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262147
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca303f2f0 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=109225856
     fg=s stat=KJUSERSTAT_ATTACHFAILED spnum=11 flg=x0
MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262146
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca405e758 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=6400108
     fg=s stat= spnum=11 flg=x20
    END DEFER MSG QUEUE 0 DUMP ON LMS1
     SEQUENCES: 
    [sidx 0]  0:0.0  1:711171046.0  2:279807300.1  3:43212975.0
     DEFER MSG QUEUE 1 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 1]  0:0.0  1:711171047.0  2:279807292.1  3:43212974.0
     DEFER MSG QUEUE 2 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 2]  0:0.0  1:711171047.0  2:279807300.1  3:43212975.0


可以看到在实际出发ORA-600[kjbmprlst:shadow]之前,LMS后台进程确实在进行频繁地DRM操作。

我们不难看出出发ORA-600[kjbmprlst:shadow]的原因是:数据库较高的性能负载, 1号实例上SQL 语句引发的大量行锁争用 和 3号节点上的 gather_database_stats_job_proc 自动统计信息作业 同时存在 , 导致1号实例上的LMS进程频发地执行DRM操作 , 最终在较高的工作负载情况下触发了管理kjbm shadow锁的代码bug 。

经过分析发现2、4号实例发生该ORA-600[kjbmprlst:shadow]故障与1号节点的情形相似,整个RAC中唯有3号节点例外,因为总是由它来执行gather_database_stats_job_proc, 导致它总是成为DRM的受益者 , 而其他3个节点成为受害者。

但是注意 可能你要问 那不就是简单的由自动统计作业引发的故障吗? 用得着分析这么多吗, 直接关掉该作业 或者换个时段就可以了。

回答:实际上如果仅仅是gather_database_stats_job_proc的话是不足以引发该故障的, 一方面由于节点之间的所跑得应用程序都是一样的,也没有采取分区或者分schema , 这导致节点之间的全局缓存资源争用加剧 , 另一方面 因为每个Buffer Cache都较大, 均存放了大量的全局缓存资源, 且数据的更新十分频繁 , 这导致gather_database_stats_job_proc作业开始执行后 大量的全局资源涌向3号实例 ,同时也触发了DRM的决定, 以至于1、2、4号节点上的LMS 在短期内有极大的工作压力 , 最终导致了 bug 被触发。仅仅禁用该gather_database_stats_job_proc 自动信息搜集作业 ,或者修改该作业的运行时间 都不足以彻底规避该ORA-600[kjbmprlst:shadow]故障, 因为只要是LMS后台进程处于较高的工作负载情况下都仍有可能触发bug。

 

虽然以上分析了一大堆但是我对解决该ORA-600[kjbmprlst:shadow]故障的看法没有丝毫改变, 而且更坚定了。Disable DRM 禁用DRM特性可以从根本上改变LMS进程的行为方式,彻底杜绝该ORA-600[kjbmprlst:shadow]故障发生的可能性。

但我仍需要一些佐证!

 

资料信息

 

翻阅MOS 一个Note 映入我的眼帘,<Bug 12834027 – ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking [ID 12834027.8]> 这个Note 我在上一次分析这个问题的时候已经阅读过了, 当时该Note的描述该问题会在11.2.0.1 上发生,不提供任何workaround 方法 ,唯一的解决途径是打上11.2.0.2 + Bp 的补丁。

但是我发现这个Note 在最近的 21-OCT-2011 被修改了,内容如下:

Bug 12834027  ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking
 This note gives a brief overview of bug 12834027.
 The content was last updated on: 21-OCT-2011
 Click here for details of each of the sections below.
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.2 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release) 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow]
    ORA-600 [kjbrasr:pkey] 

    RAC (Real Application Clusters) / OPS
    _gc_read_mostly_locking 

Description

    ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] can occur with RAC
    read mostly locking.

    Workaround
      Disable DRM
     or
      Disable read-mostly object locking
      eg: Run with "_gc_read_mostly_locking"=FALSE

 

 

这一次翻到这个Note 是在10月28日 ,而这个Note是在10月21日更新的, 不得不说这个更新来的很及时, 提供了2种 workaround 的方式 , 包括彻底禁用DRM ,和之前所说的禁用_gc_read_mostly_locking 特性,相关的bug note 说明了 另一个Exadata case的经过:

 

Hdr: 12834027 N/A RDBMS 11.2.0.2.0 RAC PRODID-5 PORTID-226 12730844
Abstract: ORA-600 [KJBMPRLST:SHADOW] & [KJBRASR:PKEY] IN A READ MOSTLY & SKIP LOCK ENV

*** 08/04/11 04:22 am ***

PROBLEM:
--------
3 nodes got crashed and restarted in exadata cluster.

DIAGNOSTIC ANALYSIS:
--------------------
8 node Exadata RAC instance. Below instances got terminated and it rejoined
cluster around 11:46 Today (01-Aug).

Instance2 (gmfcdwp2) got ORA-600 [kjbmprlst:shadow] at 11:46 AM and
instance got terminated by LMS1 process.
LMS1 (ospid: 14922): terminating the instance due to error 484

instance got terminated by LMS1 process
LMS1 (ospid: 28176): terminating the instance due to error 484

instance got terminated by LMS1 process

which was seen in instance 4 and 8.

 

这个case和我们的惊人的相似, 也是几个Exadata上特定的instance number 会出现ORA-600 [kjbmprlst:shadow]。

另一方面需要评估禁用DRM 对于整体性能的影响,其实这一点我已经在<Important parameters For Oracle BRM Application in 11gR2>中介绍了, Oracle 官方推荐为 BRM 应用程序 禁用 DRM 特性 (真绕口) , 而因为这套系统使用的是默认的Exadata 出场的实例参数设置,所以。。。

 

解决方案

1. 禁用DRM 特性,可以彻底避免LMS进程因为DRM操作而引发bug , 具体做法是设置隐藏参数_gc_policy_time为0。

影响评估:

DRM 特性可以一定程度上优化数据库性能, 但是对于没有分区或分schema 的应用程序而言 DRM并不会带来好处。  Oracle 官方推荐在BRM应用程序 上 要禁用DRM特性, 见Metalink 文档<Questions About BRM Support For Oracle DB 11GR2 And Exadata [ID 1319678.1]>:

“_gc_policy_time=0 to disable DRM.” 修改该_gc_policy_time参数 需要重启实例

 

2. 部分禁用DRM 特性 , 设置 _gc_read_mostly_locking= False。 设置 _gc_read_mostly_locking= False可以避免replayed lock 大量传输 ,极大程度上减少LMS触发bug的可能性。  修改该参数同样需要重启实例

 

案例总结

写了这么久有点累了, 也有点不知所云。 可能会再写一篇 介绍11g 中新加入的Read-mostly locking DRM 新特性,本来想在这篇里一并介绍但是发现篇幅有点长了。

有时候我们解决一个问题只需要某个恰当的方案就足够了, 这个方案可以来源于Metalink  、Google 或者朋友, 我们并不知道这种做法、解决方式实际这么生效 , 但是总归或多或少 它生效了。 但是这不代表 真正意义上理解这个故障或者问题了, 譬如 我花了大约 3-4个小时的时间来分析这个问题, 虽然得到的 结论几乎和我在7月份的那一次是一样的 ,但是透过 分析的过程 我还是获益良多 , 这是唯有自己分析方能产生的结果。

同学们, 享受这分析过程带来的 乐趣吧!

ORA-600 [17003]错误一例

一套AIX 上的10.2.0.4系统运行catupgrd.sql脚本时出现ORA-600 [17003]错误, 详细的日志如下:

 

ALTER TYPE lcr$_row_record ADD MEMBER FUNCTION
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [17003], [0x70000008E6DA8C8], [1],
[1], [], [], [], []

1. Provide a list of invalid objects taken before upgrade was initially attempted.

spool invalid_pre.lst
select substr(owner,1,12) owner,
substr(object_name,1,30) object,
substr(object_type,1,30) type, status from
dba_objects where status <> ‘VALID’;
spool off

2. Provide output of select obj# ,name from obj$ where name in (‘LCR$_ROW_LIST’,’LCR$_ROW_RECORD’);

3. Provide an output of :
select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status
psta, P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID
and do.object_ID in (select object_id from dba_objects where
OBJECT_NAME=’LCR$_ROW_LIST’)
/

4. Run same SQL in step 3 in another database of the same version – 10.1.0.4

select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status
psta, P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID
and do.object_ID in (select object_id from dba_objects where
OBJECT_NAME=’AQ$_REG_INFO’)
/

Compare and verify that this object has as many parent entries in dependency$ as they
are in a fresh Database of the same version. If they are not, alter
compile this object,and using same query above to verify that parent rows are created.

5. To verify timestamp discrepancy, run SQL

spool &spoolfilename
set pagesize 10000
column d_name format a20
column p_name format a20
select do.obj# d_obj,do.name d_name, do.type# d_type,
po.obj# p_obj,po.name p_name,
to_char(p_timestamp,’DD-MON-YYYY HH24:MI:SS’) “P_Timestamp”,
to_char(po.stime ,’DD-MON-YYYY HH24:MI:SS’) “STIME”,
decode(sign(po.stime-p_timestamp),0,’SAME’,’*DIFFER*’) X
from sys.obj$ do, sys.dependency$ d, sys.obj$ po
where P_OBJ#=po.obj#(+)
and D_OBJ#=do.obj#
and do.status=1 /*dependent is valid*/
and po.status=1 /*parent is valid*/
and po.stime!=p_timestamp /*parent timestamp not match*/
order by 2,1;

SQL> select obj# ,name from obj$ where name in (‘LCR$_ROW_LIST’,’LCR$_ROW_RECORD’);

OBJ# NAME
———- ——————————
5738 LCR$_ROW_RECORD
5776 LCR$_ROW_RECORD
256054 LCR$_ROW_LIST

SQL> select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status
2 3 psta, P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
4 where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID
5 and do.object_ID in (select object_id from dba_objects where
6 OBJECT_NAME=’LCR$_ROW_LIST’)
7 /

no rows selected

SQL> spool timestamp
SQL> set pagesize 10000
column d_name format a20
column p_name format a20
select do.obj# d_obj,do.name d_name, do.type# d_type,
po.obj# p_obj,po.name p_name,
SQL> to_char(p_timestamp,’DD-MON-YYYY HH24:MI:SS’) “P_Timestamp”,
to_char(po.stime ,’DD-MON-YYYY HH24:MI:SS’) “STIME”,
decode(sign(po.stime-p_timestamp),0,’SAME’,’*DIFFER*’) X
from sys.obj$ do, sys.dependency$ d, sys.obj$ po
where P_OBJ#=po.obj#(+)
and D_OBJ#=do.obj#
SQL> and do.status=1 /*dependent is valid*/
2 3 4 5 6 7 8 and po.status=1 /*parent is valid*/
9 10 11 and po.stime!=p_timestamp /*parent timestamp not match*/
12 order by 2,1;

no rows selected

SQL> spool off

Errors in file ora_594098.trc:
ORA-07445: exception encountered: core dump [kgghstfel+0074] [SIGSEGV]
[Address not mapped to object] [0x204000000208] [] []

1. Install 10.2.0.1, install 10.2.0.4
2. run utlu102i.sql, check output.
3. change Oracle Home 10.1 to 10.2
4. startup upgrade and run catupgrd.sql

SQL> set lines 200
SQL> col comp_name format a50
SQL> select comp_name,version,status from dba_registry;

COMP_NAME VERSION STATUS
————————————————– —————————— ———————————
Oracle XML Database 10.2.0.4.0 VALID
Oracle Enterprise Manager 10.2.0.4.0 VALID
Oracle Data Mining 10.2.0.4.0 VALID
OLAP Catalog 10.2.0.4.0 VALID
Oracle Text 10.2.0.4.0 VALID
Spatial 10.2.0.4.0 VALID
Oracle interMedia 10.2.0.4.0 VALID
Oracle Database Catalog Views 10.2.0.4.0 VALID
Oracle Database Packages and Types 10.2.0.4.0 INVALID
Oracle Real Application Clusters 10.2.0.4.0 VALID
JServer JAVA Virtual Machine 10.2.0.4.0 VALID
Oracle XDK 10.2.0.4.0 VALID
Oracle Database Java Packages 10.2.0.4.0 VALID
OLAP Analytic Workspace 10.2.0.4.0 VALID
Oracle OLAP API 10.2.0.4.0 VALID

If the above query shows that everything is valid and is in proper version,
it means, that we would need to correct the problem with lcr$_row_record before upgrade.

I see that one component is invalid ‘Database packages and types’. To resolve this, please perform the following:

SQL> shutdown immediate
SQL> startup upgrade
SQL> @?/rdbms/admin/catproc.sql
SQL> @?/rdbms/admin/utlrp.sql
SQL> shutdown immediate
SQL> startup
SQL> set lines 200
SQL> col comp_name format a50
SQL> select comp_name,version,status from dba_registry;

Recompiled also get ORA-00600: internal error code, arguments: [17003]

SQL> alter type LCR$_ROW_RECORD compile;
alter type LCR$_ROW_RECORD compile
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [17003], [0x70000007C3A0378], [1],
[1], [], [], [], []

select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status psta, P_TIMESTAMP
from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID =
select object_id from dba_objects where OBJECT_NAME=’LCR$_ROW_RECORD’);

SQL> select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
2 D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status psta, P_TIMESTAMP
3 from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
4 where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID =
5 (select object_id from dba_objects where OBJECT_NAME=’LCR$_ROW_RECORD’);

(select object_id from dba_objects where OBJECT_NAME=’LCR$_ROW_RECORD’)
*
ERROR at line 5:
ORA-01427: single-row subquery returns more than one row

SQL> set lines 200
SQL> col comp_name format a50
SQL> select comp_name,version,status from dba_registry;

SQL> col object_name format a20
SQL> select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status psta, P_TIMESTAMP
from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID in
(select object_id from dba_objects where OBJECT_NAME=’LCR$_ROW_RECORD’);

 

MOS Bug Info:

 

Hdr: 6611530 10.2.0.2.0 RDBMS 10.2.0.2.0 AQ PRODID-5 PORTID-197 ORA-600
Abstract: ORA-600 17003

PROBLEM:
——–
Customer was trying to apply the 10.2.0.3 patchset and kept getting ora-600
17003 errors running catproc.  They tested the running of catalog, catproc
and utlrp in the 10.2.0.2 database version before upgrade and got the same
error.  The errors are reproducing trying to compile at least these 2
objects:
ALTER PACKAGE “SYS”.”DBMS_AQADM_SYS” COMPILE BODY REUSE SETTINGS
ALTER TYPE “SYS”.”LCR$_ROW_RECORD” COMPILE SPECIFICATION REUSE SETTINGS

This issue started in SR 6535356.994 and since the issue reproduces before
upgrade it was determined that this isn’t an upgrade issue.

DIAGNOSTIC ANALYSIS:
——————–
Have reviewed alert.log and trace file.
Had customer run hcheck – output will be uploaded.

In the trace files, I couldn’t find the handle for the second argument of the
ora-600 17003 error, so can not determine the object.

WORKAROUND:
———–
none known

RELATED BUGS:
————-
Looks exactly like Bug 5857558.  Was going to try the generic fix in this bug
but cannot determine the object because cannot find the handle in the trace
files.

REPRODUCIBILITY:
—————-
It is reproducible everytime they run catproc.  catalog runs fine.

TEST CASE:
———-
none

STACK TRACE:
————
ksedst
ksedmp
ksfdmp
kgeriv
kgeasi
kglget
kglgob
kgldpo0
kgldpo
kgldon
pl_don
ptgxtn
ptg_nd
phdbte
phncrr_check_remote_refs
phncee_check_extra_errors
phnr_resolve
ph2exp
ph2ext
ph2osa
ph2of1
ph2exp
ph2ext
ph2osa
ph2of1

SUPPORTING INFORMATION:
———————–

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-

DIAL-IN INFORMATION:
——————–

IMPACT DATE:
————

The query returned no rows:
SQL> select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status
psta, P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po where
D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID = (select
object_id from dba_objects where OBJECT_NAME=’AQ$_SRVNTFN_MESSAGE’);
  2    3    4    5  
/

no rows selected

Therefore, I have asked him to send up the files needed to set-up the system
tablespace in-house.  Do you need any other tablespaces?

Customer did the following:

1.    Recovered the database back to 10.2.0.2 before we ran catalog and catproc.
2. Performed the steps for dictionary fix as following ( he had tested this
out ? please see dev.log with the test he did before this ):
(1) update obj$ set status=5 where obj#=(select object_id from dba_objects
where OBJECT_NAME=’AQ$_SRVNTFN_MESSAGE’);
     commit;
     select obj#, name, status from obj$ where obj# = (select object_id from
dba_objects where OBJECT_NAME=’AQ$_SRVNTFN_MESSAGE’);
.
(2) FLUSH the shared pool(or bounce the DB), to reflect this changed status
in cache as well.
.
(3) alter type AQ$_SRVNTFN_MESSAGE compile;
.
(4) Check required dependency$ rows are recreated.
     select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp, po.status
psta, P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po where
D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID = (select
object_id from dba_objects where OBJECT_NAME=’AQ$_SRVNTFN_MESSAGE’);
.
(5) ALTER PACKAGE “SYS”.”DBMS_AQADM_SYS” COMPILE BODY REUSE SETTINGS;
     select obj#, name, status from obj$ where obj# = (select object_id from
dba_objects where OBJECT_NAME=’DBMS_AQADM_SYS’ and OBJECT_TYPE in (‘PACKAGE
BODY’));
3. Executed catalog
4. Executed catproc and encountered Ora-600 errors:
    First error:
       UPDATE SYS.AQ_SRVNTFN_TABLE tab
           *
ERROR at line 1:
ORA-81: address range [0x60000000000DBDB0, 0x60000000000DBDB4) is not
readable
ORA-600: internal error code, arguments: [kksfbc-reparse-infinite-loop],
[0x9FFFFFFFBE9CED88], [], [], [], [], [], []

     Second error:
Warning: Type created with compilation errors.

Errors for TYPE LCR$_ROW_RECORD:

LINE/COL ERROR
——– —————————————————————–
0/0      ORA-81: address range [0x60000000000DBD80, 0x60000000000DBD84)
         is not readable
         ORA-600: internal error code, arguments: [17003],
         [0xC0000004F1DD89D0], [1], [1], [], [], [], []

Invalids after utlrp:

SYS                STREAMS$_EVALUATION_CONTEXT      EVALUATION CONTEXT
INVALID
                   DBMS_STREAMS                     PACKAGE            
INVALID
                   DBMS_STREAMS_DATAPUMP            PACKAGE BODY       
INVALID
                   DBMS_STREAMS                     PACKAGE BODY       
INVALID
                   DBMS_LOGREP_IMP                  PACKAGE BODY       
INVALID
                   LCR$_ROW_RECORD                  TYPE               
INVALID
                   LCR$_ROW_RECORD                  TYPE BODY          
INVALID
                   AQ$_AQ_SRVNTFN_TABLE_F           VIEW               
INVALID
                   AQ$AQ_SRVNTFN_TABLE              VIEW               
INVALID

I then asked him to up the following parameters which did no good (didn?t
think they would but wanted to try):
shared_pool_size greater then 800M
large_pool_size greater then 250M
java_pool_size greater then 250M
db_cache_size greater then 500M
compatible set to 10.2.0.0
pga_aggregate_target greater then 1000M

Then had him do the following:
SQL> select distinct o.name, o.stime, d.p_timestamp
from obj$ o, dependency$ d
where o.stime != d.p_timestamp and o.type#=13 and o.obj#=d.p_obj# ;

  2    3  
NAME                           STIME     P_TIMESTA
—————————— ——— ———
AQ$_SRVNTFN_MESSAGE            08-NOV-07 11-NOV-06

SQL> startup restrict
SQL> conn / as sysdba

SQL> update dependency$ d set p_timestamp = (select stime from obj$ p where
d.p_obj#=p.obj#)
           where d.p_obj# in (select obj# from obj$ where type#=13) and
d.p_timestamp != (select stime
           from obj$ o where d.p_obj#=o.obj#);

ENSURE BY THE PREVIOUS STATEMENT that only 1 rows are updated.

——— THIS RETURNED 3 rows therefore he rolled back and this is where we
are now.

/upload/bug6611530 directory:
Dev.log ? gives the outputs after the dictionary patch was done
alert_oasc1[1].log.11082007.txt ? the alert.log after the dictionary patch
was done and the migration was then attempted
oasc1_ora_3475.trc – The first trace file in the series

You can see in the trace file the following 2 ora errors:
ORA-600: internal error code, arguments: [kksfbc-reparse-infinite-loop],
[0x9FFFFFFFBE9CED88], [], [], [], [], [], []
AND
ORA-600: internal error code, arguments: [17003], [0xC0000004F1DD89D0],
[1], [1], [], [], [], []
—- where I still cannot find the handle in the trace file.

The ora-600 kksfbc-reparse-infinite-loop occured after the data dictionary
patching so I included it here for clarity.  Will work on that one after this
is fixed (if it can be).

Here is the results of the requested query.  Since this object has both an
OBJECT_TYPE of TYPE and TYPE BODY there are 2 queries that were run:

SQL> select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
  2  D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp,
po.status psta,
  3   P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
  4  where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID =
  5  (select object_id from dba_objects where OBJECT_NAME=’LCR$_ROW_RECORD’
and OBJECT_TYPE = ‘TYPE’);

    D_OBJ#
———-
OBJECT_NAME
——————————————————————————

DTYP                DSTA    D_TIMESTA     ORDER#     P_OBJ#
——————- ——- ——— ———- ———-
OBJECT_NAME
——————————————————————————

PTYP                PSTA    P_TIMESTA
——————- ——- ———
   1462820
LCR$_ROW_RECORD
TYPE                INVALID 09-NOV-07          2    3320201
LCR$_ROW_LIST
TYPE                VALID   11-NOV-06

   1462820
LCR$_ROW_RECORD
TYPE                INVALID 09-NOV-07          1        309
STANDARD
PACKAGE             VALID   18-APR-03

   1462820
LCR$_ROW_RECORD
TYPE                INVALID 09-NOV-07          0    1462817
LCR_ROW_LIB
LIBRARY             VALID   15-NOV-03

SQL>
SQL> select D_OBJ#, do.object_name, do.object_type dtyp, do.status dsta,
  2  D_TIMESTAMP, ORDER#, P_OBJ#, po.object_name, po.object_type ptyp,
po.status psta,
  3   P_TIMESTAMP from dependency$ d, DBA_OBJECTS do, DBA_OBJECTS po
  4  where D_OBJ# = do.object_ID and P_OBJ#= po.object_ID and do.object_ID =
  5  (select object_id from dba_objects where OBJECT_NAME=’LCR$_ROW_RECORD’
and OBJECT_TYPE = ‘TYPE BODY’);

    D_OBJ#
———-
OBJECT_NAME
——————————————————————————

DTYP                DSTA    D_TIMESTA     ORDER#     P_OBJ#
——————- ——- ——— ———- ———-
OBJECT_NAME
——————————————————————————

PTYP                PSTA    P_TIMESTA
——————- ——- ———
   1462858
LCR$_ROW_RECORD
TYPE BODY           INVALID 09-NOV-07          0    1462820
LCR$_ROW_RECORD
TYPE                INVALID 09-NOV-07

 

 

可以尝试在startup upgrade模式下重新运行catproc.sql脚本来解决组件失效的问题(catproc.sql is to try validate the registry and that should resolve the issue)。

沪ICP备14014813号-2

沪公网安备 31010802001379号