PMON: TERMINATING INSTANCE DUE TO ERROR 600 on 8i

Alert logfile reported as below:

*********************
Wed May 27 13:11:47 2009
Errors in file /u01/app/oracle/admin/proa021/udump/proa021_ora_9533.trc:
ORA-07445: exception encountered: core dump [memset()+116] [SIGSEGV] [Address not mapped to object] [0] [] []

From Trace file
********************
Dump file /u01/app/oracle/admin/proa021/udump/proa021_ora_9533.trc

Oracle8i Enterprise Edition Release 8.1.7.4.0 - Production
With the Partitioning option
JServer Release 8.1.7.4.0 - Production
ORACLE_HOME = /u01/app/oracle/product/817proa021
System name:	SunOS
Node name:	v08k01
Release:	5.8
Version:	Generic_117350-38
Machine:	sun4u
Instance name: proa021
Redo thread mounted by this instance: 1

Process Info
******************
Oracle process number: 117
Unix process pid: 9533, image: oracle@v08k01 (TNS V1-V3)

Error
*********
2009-05-27 13:11:47.847
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [memset()+116] [SIGSEGV] [Address not mapped to object] [0] [] []

Current SQL (Current SQL statement for this session)
***********************************************************************
SELECT COUNT(PO_LINE_ID) FROM PO_LINES_INTERFACE WHERE PO_HEADER_ID = :b1

Call Stack functions
*************************
ksedmp <- ssexhd <- sigacthandler <- memset


#####################################################################################
From Alert logfile
*********************
Wed May 27 13:18:39 2009
Errors in file /u01/app/oracle/admin/proa021/bdump/proa021_pmon_9584.trc:
ORA-00600: internal error code, arguments: [1115], [], [], [], [], [], [], []
Wed May 27 13:18:56 2009
Errors in file /u01/app/oracle/admin/proa021/bdump/proa021_pmon_9584.trc:
ORA-00600: internal error code, arguments: [1115], [], [], [], [], [], [], []


From Tracefile
*******************
Dump file /u01/app/oracle/admin/proa021/bdump/proa021_pmon_9584.trc

Oracle8i Enterprise Edition Release 8.1.7.4.0 - Production
With the Partitioning option
JServer Release 8.1.7.4.0 - Production
ORACLE_HOME = /u01/app/oracle/product/817proa021
System name:	SunOS
Node name:	v08k01
Release:	5.8
Version:	Generic_117350-38
Machine:	sun4u
Instance name: proa021
Redo thread mounted by this instance: 1

Process Info
****************
Oracle process number: 2
Unix process pid: 9584, image: oracle@v08k01 (PMON)


Error
********
2009-05-27 13:18:39.766
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [1115], [], [], [], [], [], [], []

Call Stack Functions:
****************************
ksedmp <- kgeriv <- kgesiv <- ksesic0 <- kssdch
<- ksuxds <- kssxdl <- kssdch <- ksudlp <- kssxdl
<- ksuxdl <- ksuxda <- ksucln <- ksbrdp <- opirip
<- opidrv <- sou2o <- main <- start

CURRENT SESSION'S INSTANTIATION STATE
*********************************************************
current session=8c8fdfbc
---- Cursor Dump ------
Current cursor: 0, pgadep: 0
Cursor Dump:
End of cursor dump
END OF PROCESS STATE
******************** Cursor Dump ************************
Current cursor: 0, pgadep: 0
Cursor Dump:
End of cursor dump
ksedmp: no current context area

ERROR: ORA-600 [1115]

VERSIONS: versions 6.0 to 10.1

DESCRIPTION: We are encountering a problem while cleaning up a state object.

The State Object is already on free list or has the wrong parent State Object.

FUNCTIONALITY: Kernel Service State object manager

IMPACT:
POSSIBLE INSTANCE FAILURE
PROCESS FAILURE
NON CORRUPTIVE - No underlying data corruption.

SUGGESTIONS: This error may be reported as a direct result of another earlier problem.

A number of related bugs have been reported:

Bug 3837965 : Abstract: ORA-7445'S AND 600'S LEADING UP TO DB CRASH
Comp Version: 8.1.7.4.0
Fixed In Version: 9.2.0.
-------------------------------------------------------------

Bug 3134843 : Abstract: ORACLE PROCESSES CRASHING WITH ORA-7445 SEGVIO ON A NUMBER OF DATABASES
Comp Version: 8.1.7.4
Status: Closed, could not be reproduced
----------------------------------------------------------------

Bug 2760836: Abstract: PMON cleanup of dead shared servers/dispatchers can crash instance(OERI:26599 / OERI 1115)

--------------------------------------------------------------
Note 2760836.8 PMON cleanup of dead shared servers/dispatchers can crash instance (OERI 26599 / OERI 1115)
----------------------------------------------------------------

PROPOSED SOLUTION JUSTIFICATION(S)
==================================
1. The one-off patch for Bug 2760836 fixes this issue; once the customer applies the one-off patch, the problem is resolved.

OR

2. The fix is also included in 9.2.0.4 and later; once the customer upgrades to at least 9.2.0.4, the problem is resolved.

The solution is justified by the following:

Note 2760836.8 PMON cleanup of dead shared servers/dispatchers can crash instance (OERI 26599 / OERI 1115)
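
For reference, a quick way to confirm what the instance is actually running after the one-off patch or upgrade is applied (a minimal SQL*Plus sketch; DBA_REGISTRY only exists from 9.2 onwards, and an 8.1.7 one-off patch is normally verified from its README/inventory rather than from the dictionary):

REM Confirm the release the instance reports after patching/upgrading.
SELECT banner FROM v$version;

REM On 9.2 and later, the component registry can also be consulted.
SELECT comp_name, version, status FROM dba_registry;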

Comments

  1. admin says

    Hdr: 3837965 8.1.7.4.0 RDBMS 8.1.7.4.0 VOS HEAP MGMT PRODID-5 PORTID-23 ORA-7445 4284478
    Abstract: ORA-7445’S AND 600’S LEADING UP TO DB CRASH

    Completed: alter tablespace PRECISE_I3_OR_TMP end backup
    Tue Aug 17 08:43:23 2004

    Tue Aug 17 08:43:23 2004
    Beginning log switch checkpoint up to RBA [0x41.2.10], SCN: 0x0708.39ff37e8
    Thread 1 advanced to log sequence 65
    Current log# 16 seq# 65 mem# 0: /starsdb1/d07/oradata/cacst/redo16_1.log
    Current log# 16 seq# 65 mem# 1: /starsdb1/d08/oradata/cacst/redo16_2.log
    Tue Aug 17 08:43:23 2004
    ARCH: Beginning to archive log# 15 seq# 64
    Tue Aug 17 08:43:23 2004
    ARC0: Beginning to archive log# 15 seq# 64
    ARC0: Failed to archive log# 15 seq# 64
    Tue Aug 17 08:45:29 2004
    ARC1: Beginning to archive log# 15 seq# 64
    ARC1: Failed to archive log# 15 seq# 64
    Tue Aug 17 08:47:16 2004
    ARC0: Beginning to archive log# 15 seq# 64
    ARC0: Failed to archive log# 15 seq# 64
    Tue Aug 17 08:48:37 2004
    ARCH: Completed archiving log# 15 seq# 64
    Tue Aug 17 08:48:57 2004
    Errors in file /apps/oracle/admin/cacst/bdump/cacst_s011_28609.trc:
    ORA-7445: exception encountered: core dump [00000001002F3384] [SIGSEGV] [Address not mapped to object] [0] [] []
    Tue Aug 17 08:48:57 2004

    DIAGNOSTIC ANALYSIS:
    --------------------
    The ct. did an end backup at 8:43; at Tue Aug 17 08:48:57 2004
    he received his first ORA-7445 error, which then cascaded into more
    ORA-600/ORA-7445 errors and a db crash.

    …..

    Errors in file /apps/oracle/admin/cacst/bdump/cacst_lgwr_28571.trc:
    ORA-600: internal error code, arguments: [2667], [1], [65], [39986], [0], [0], [0], [0]
    Tue Aug 17 08:49:37 2004
    LGWR: terminating instance due to error 600
    Tue Aug 17 08:49:37 2004
    Errors in file /apps/oracle/admin/cacst/bdump/cacst_s093_28778.trc:
    ORA-600: internal error code, arguments: [], [], [], [], [], [], [], []
    Instance terminated by LGWR, pid = 28571
    Tue Aug 17 08:53:59 2004

    The first error
    cacst_s011_28609.trc
    ORA-7445: exception encountered: core dump [00000001002F3384] [SIGSEGV] [Address not mapped to object] [0] [] []
    Current SQL statement for this session:
    SELECT CO_USERS_MSTR.USER_ID, CO_USERS_MSTR.USER_NAME,
    CO_USERS_MSTR.USER_NAME || ' ' || '-' || ' ' || CO_USERS_MSTR.USER_ID computed_name
    FROM CO_USERS_MSTR WHERE CO_USERS_MSTR.CAP_LVL_CODE >

    cacst_s024_28635.trc >> ORA-7445 kghufreeuds_01
    cacst_s002_6489.trc >> ORA-7445 kghssgfr2
    cacst_s042_28673.trc >> ORA-600 17182
    cacst_s002_6489.trc >> divide by 0
    cacst_s006_28599.trc >>
    ORA-600: internal error code, arguments: [kghssgfr2], [0], [], [], [], [], [], []
    ORA-1555: snapshot too old: rollback segment number 95 with name "RBS044" too small
    Tue Aug 17 08:49:02 2004

    cacst_s006_28599.trc:
    ORA-7445: exception encountered: core dump [0000000101138C74] [SIGSEGV] [Address not mapped to object] [200] [] []
    ORA-7445: exception encountered: core dump [00000001010CEB80] [SIGSEGV] [Address not mapped to object] [4294967280] [] []
    cacst_s001_26644.trc:
    ORA-7445: exception encountered: core dump [00000001003D74A0] [SIGSEGV] [Address not mapped to object] [4104] [] []
    ORA-7445: exception encountered: core dump [00000001003D74A0] [SIGSEGV] [Address not mapped to object] [4104] [] []
    ORA-7445: exception encountered: core dump [FFFFFFFF7D101A40] [SIGSEGV] [Address not mapped to object] [196870144] [] []
    Tue Aug 17 08:49:02 2004

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————
    immediate

    Uploaded evidence:
    ~~~~~~~~~~~~~~~~~~~
    cacst_s001_26644
    SEGV accessing 0x60bbc0000 No trace

    cacst_s002_6489
    OERI:kghssgfr2 0
    ORA-1555
    Private frame segmented array entry looks zeroed out.

    cacst_s006_28599
    SEGV accessing 0xfffffffffffffff0
    Instantiation object looks all zeroed out
    SEGV looks like ptr-0x10 from the zeroed out data

    cacst_s011_28609
    SEGV accessing 0x0
    private frame segmented array looks zeroed again

    cacst_s024_28635
    OERI:kghufreeuds_01 [0x5d60d8e38]
    Memory around this heap all zeroed out

    cacst_s042_28673
    OERI:17182 addr=0x5e56ffd10
    Memory around this zeroed out

    DB was terminated by LGWR seeing a corruption in memory.
    Trace cacst_lgwr_28571.trc is not uploaded.

    From the above sample of traces it looks like zeroes were seen across
    lots of different chunks of memory. An Oradebug ipc dump would help
    confirm these were in SGA space or not – as this is MTS it looks
    like these may have been in SGA space.

    So we need:
    Some more traces to cross check the address ranges with problems.
    Oradebug IPC when instance is up to get mapped address range
    of the SGA (see the sketch below).
    cacst_lgwr_28571 to check why LGWR died.

    Setting 10501 is unlikely to be of any help for this scenario.
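
    A minimal SQL*Plus sketch of gathering the Oradebug IPC dump mentioned
    above (assumes SYSDBA access on the affected instance; the output is
    written to a trace file under user_dump_dest):

    CONNECT / AS SYSDBA
    ORADEBUG SETMYPID
    ORADEBUG IPC
    REM Reports the trace file that received the dump; on older releases,
    REM simply take the newest trace file in user_dump_dest instead.
    ORADEBUG TRACEFILE_NAME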

    From the IPC dump:
    SGA variable range is 05217ca000 – 05ED4CA000-1
    Java is at 05ed4ca000 +
    Redo is after this at 060b1a2000 +
    SGA ends at 0x60bbc0000-1

    cacst_s001_26644 dumped with SEGV accessing 0x60bbc0000 but
    has no trace content. This is just past the end of the SGA.

    It is highly likely that this pid 26644 (s001) was the culprit and
    that it swept across memory writing zeroes over various chunks of
    the SGA. As this is MTS, those zeroes mostly hit other sessions'
    private structures, resulting in various OERI/core dumps. Finally
    it looks like this pid went to write past the end of the SGA and
    so seg faulted itself.

    LGWR was correct to crash the instance.

    The trace file cacst_s001_26644 contains no information whatsoever.
    Is there a core dump from this process ?
    If so please try to get the core file and a stack from it on the system
    (see note 1812.1 for how to extract a stack from a core file. This is
    best done on the customer machine and has to use the oracle executable
    as it was at the time of the error – ie: If oracle has been relinked
    or patched since then this may not give a good stack).

    Stack from adb only shows the dump routines themselves which are
    invoked when a process aborts so there is no clue as to what
    this process was actually doing in the first place.

    It is highly likely this process went berserk for some reason and
    trashed lots of the SGA with 0x00's in lots of locations, but there
    is nothing in the stack to give a clue why it did so nor what code
    it was in when it did so.
    There is only evidence above from one failure.

    I have outlined what appears to have happened for that failure.
    The problem is knowing what OS PID 26644 was doing and for that
    I have no visible evidence. The following may help:
    a. Check all trace files from the time of the errors to see if
    any contain a SYSTEMSTATE dump. If so that may give a clue
    what SQL-26644 was running.
    b. Get the raw corefile from pid 26644 – There is a very small
    chance we can get an idea of any SQL from it.

    For future:
    Try to get a SYSTEMSTATE when things start going wrong.
    The simplest way may be event="7445 trace name systemstate level 10" (see the sketch below).
    .
    As soon as a dump occurs the offending process (for the scenario
    in this bug) is probably in a reasonably tight CPU loop so try
    to locate any high CPU sessions and get stacks from them (using
    pstack or oradebug with errorstacks).
    .
    Keep an audit (any form you like) of DB connections and what they are
    doing. Even frequent selects from v$session for active sessions may help
    get some direction to help isolate what the culprit is up to.
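
    A minimal sketch of the two suggestions above, assuming the event is
    acceptable on this production system: the init.ora event that dumps a
    SYSTEMSTATE on the next ORA-7445, and a simple poll of V$SESSION to
    keep a rough audit of what the active sessions are doing.

    REM init.ora entry (takes effect on the next instance restart):
    REM   event = "7445 trace name systemstate level 10"

    REM Rough audit of active user sessions; run periodically from a
    REM sqlplus job and spool the output so there is history to review.
    SELECT s.sid, s.serial#, s.username, s.status, s.program,
           s.sql_address, s.sql_hash_value
      FROM v$session s
     WHERE s.status = 'ACTIVE'
       AND s.type = 'USER';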

    I got the following from ess30 under bug3837965/aug10:
    alert_xcacst.log test_RDA.rda.tar.Z xcacst_ora_22724.trc
    alertandtrc.tar traceAndCore.tar xcacst_ora_22728.trc
    core_26644.txt usertraces.tar xcacst_ora_22736.trc
    cores.tar xcacst_ora_22714.trc
    test_cores.tar xcacst_ora_22720.trc

    I have extracted all the files from those tar archives and have nothing
    from July (other than some core files which may be July but I cannot
    do anything with a core file without the matching alert log and
    trace files):

    aug10/bdump All files are from cacst instance around 17 Aug 8:49
    aug10/udump Most files from cacst around 16-Aug 12:08 + 13:30
    but these are deadlock traces (ORA-60). Other 2 files
    are just an http message.

    aug10/xcacst*.trc All from xcacst instance Around 10-Aug 9:49 – 10:15
    No sign of any crash but alert ends with shutdown
    at 10:11

    I will proceed seeing if I can extract anything at all from core_26644
    but this will take time and may get nowhere.

    Please see about getting the alert + trace + core from the July 22 crash
    and also please clarify what the “xcacst” instance has to do with
    the issue in this bug report as that seems to be running on a totally
    different machine.

    I've spent time looking at what information we can get from the
    core image from PID 26644, but it isn't very much as the process
    was a shared server so most of information was stored in the
    SGA. As you have SHADOW_CORE_DUMP=PARTIAL the SGA has not been
    dumped to the core file so all the information about current
    cursors etc.. is missing. All I can say is that it was executing
    cursor 52 which was a SELECT statement.

    From the core file itself it looks like the stack may be intact
    but just not readable by "adb". It would be very worthwhile
    seeing if the stack can be unwound further by "dbx" rather than
    "adb". This is best done on the customer machine, but if that is not
    possible then please get the oracle executable itself and all files
    from the output of "ldd $ORACLE_HOME/bin/oracle".
    eg: ldd $ORACLE_HOME/bin/oracle
    Note all filenames
    COPY all of those files into a directory somewhere
    TAR them up, compress it and upload it.
    IMPORTANT: If Oracle has been relinked since Aug 17 then the stack
    unwind may give false information so please get the timestamp from
    ls -l $ORACLE_HOME/bin/oracle also.

    At present I have no more useful trace to analyze available on ess30.
    I will try to put some notes together on the 17-Aug corruption
    and on steps which might be useful to help should this occur again.

    Looked at Jul 22 traces and they show same problem occurred.
    Offending process is calling memset() and trashing the SGA.
    Diagnosis and suggestions mailed out.

    Main thing I need from this bug are a stack from either/both of
    the core files for PID 26644 (aug 17) and PID 16193 (Jul 22)
    so we can see which function calls memset() to see if a diagnostic
    can be produced (to check the bounds on the memset before it is
    called).

    Jul 22 traces do show Java is being used , which is what the above
    is related to freeing up.

    I have logged bug 3902349 to request the above code segment
    be protected and have diagnostics added so that should the problem
    recur in dedicated mode then we have a chance of getting a complete
    trace rather than a zero filled UGA.

    Putting this bug to 10 so you see the new bug# and also to get any
    trace should you see a dump from one of the dedicated connections
    which looks like it was due to a kokmeoc->memset call in case it
    already has sufficient information to help progress the cause.
    *** 09/28/04 01:19 am ***
    We have a diagnostic patch pending under bug 3902349 .

    The diagnostic would do the following:
    From the core dump uploaded the trashing of memory occurred when
    freeing up data for a JAVA call from process private information.
    This was in the large pool so the trash trampled the SGA.
    Now that the sessions are dedicated it is expected that if the
    problem condition arose in the SAME location of code as before
    then memory would be trashed but it would be private memory and
    then the process would be likely to core dump, but may not have
    sufficient information to get closer to the true root cause.

    The diagnostic adds a simple check on the relevant free of memory
    to see if the amount of memory to be cleared looks sensible.
    If not the process will die with an ORA-600 instead BEFORE
    it trashes lots of memory.
    The overhead of the extra check is minimal (a few extra assembler
    instructions which are not executed often).

    However, the module which would have the debug patch added is also
    included in the security patch for alert #68.

    As this would be a diagnostic patch then in order to cut it we need
    to know:
    a. What patches customer currently has applied to the system, if any.
    b. If they do intend to apply the security patch for alert #68 to
    this system or not.

    The failing cursor is:
    select CO_DRM_PKG.GET_SCRIPT
    ( :"SYS_B_0" , :"SYS_B_1" , :"SYS_B_2" , :"SYS_B_3" ) from dual

    This has undergone literal replacement.

    It looks like the system changes the value of CURSOR_SHARING
    daily between EXACT and FORCE as there are numerous sessions
    using the above SQL with pure literals as well as some using
    the replaced form.

    It looks like this call probably invokes some Java code underneath it
    as the error is for cleaning up ILMS data and KGMECID_JAVA is set.

    Please see if we can get / do the following:
    1. Definition of CO_DRM_PKG.GET_SCRIPT
    2. Set an event="7445 trace name library_cache level 10"
    to get the library cache detail on the next error.
    (Can be set from alter system when convenient)
    3. DBMS_SHARED_POOL.KEEP the cursor for the above select
    and the PLSQL package, any Java object it directly calls,
    and any TYPES it references (see the sketch after this list).
    4. It is not clear why the system repeatedly changes CURSOR_SHARING.
    Is it possible to stick with one setting (FORCE for online)
    and have other sessions ALTER SESSION to change it if they need
    EXACT? If possible this would help with '3', as when it is EXACT
    you have separate cursors for the CO_DRM_PKG.GET_SCRIPT
    SQL which makes it hard to KEEP them (as they are literals).
    If this is not possible just KEEP the literal-replaced cursor.
    5. If you have the core dump from PID 24411 I can probably use that
    to get further, but '3' may provide a workaround.
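
    A hedged sketch of items 2 and 3 above, assuming DBMS_SHARED_POOL is
    installed (dbmspool.sql) and that the cursor's ADDRESS/HASH_VALUE are
    taken from V$SQLAREA at the time the KEEP is issued; the schema prefix
    and the sample address,hash_value are placeholders:

    REM Item 2: capture library cache detail on the next ORA-7445.
    ALTER SYSTEM SET EVENTS '7445 trace name library_cache level 10';

    REM Item 3: pin the package so it is not flushed from the shared pool
    REM (qualify with the owning schema in practice).
    EXECUTE DBMS_SHARED_POOL.KEEP('CO_DRM_PKG', 'P');

    REM For the cursor, KEEP takes 'address,hash_value' from V$SQLAREA:
    SELECT address || ',' || hash_value
      FROM v$sqlarea
     WHERE sql_text LIKE 'select CO_DRM_PKG.GET_SCRIPT%';
    REM then, substituting the value returned above (placeholder shown):
    REM EXECUTE DBMS_SHARED_POOL.KEEP('C000000012345678,1234567890', 'C');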

    I have reproduced this problem and built a stand alone testcase
    which is logged in bug 4284478.

    The underlying problem is fixed in 9i due to a rather large code
    change in the ILMS area.

    I need to know how you want to proceed now as
    8.1.7.4 fixes are only possible for customers with EMS.

    The options are:
    If you have EMS then I can put bug 4284478 to DDR for an 8i code fix.
    (Note that any 8i fix will clash with the CPU bundle and be subject to
    CPU clash MLR release procedures when fixed).

    If you do not have EMS or are moving off of 8i then I will mark
    bug 4284478 as fixed in 9.0 fixed by transaction “varora_ilms_inh_1”
    as there is no code change needed in 9i or higher releases to
    address the dump.

  2. admin says

    Hdr: 3134843 8.1.7.4 RDBMS 8.1.7.4 VOS KG SUPP PRODID-5 PORTID-453 ORA-7445
    Abstract: ORACLE PROCESSES CRASHING WITH ORA-7445 SEGVIO ON A NUMBER OF DATABASES
    PROBLEM:
    ——–
    Oracle foreground processes supporting 20 different databases
    on the same E15k Sun machine are crashing sporadically in a large number
    of functions within the Oracle code. Oracle Home is based on Veritas VXFS file
    system on a Hitachi SAN based LUN. This file system is accessed exclusively
    from the active node of their HA cluster. Oracle data, log and control files
    are based on raw volumes that can be accessed from each nodes, but are
    accessed only from the active node.

    The problem started happening on August 1 while the Secondary node of their HA
    cluster was in use. They failed over on the 3rd of August to the Primary node
    and have been running on it ever since. The following is a spreadsheet
    showing date, time, Oracle SID, function on top of stack, num of failing
    processes:

    Date Time SID Error # of Errors Notes

    8/1/2003 13:09:36 clp13 ORA-7445: exception encountered: core dump [kcbzgm()+320] [SIGBUS] 92 Crashed/Restored
    8/1/2003 9:39:48 clp11 ORA-7445: exception encountered: core dump [snttread()+92] 1
    8/1/2003 8:25:54 clp07 ORA-7445: exception encountered: core dump [opiodr()+5756] 1
    8/1/2003 13:09:33 csp05 ORA-7445: exception encountered: core dump [ssdinit()+2580] 1376
    8/1/2003 16:09:59 csp02 ORA-7445: exception encountered: core dump [kcbzib()+7344] 1
    8/1/2003 16:09:59 csp07 ORA-7445: exception encountered: core dump [ssdinit()+3208] 3
    8/4/2003 10:53:49 clp07 ORA-7445: exception encountered: core dump [opiosq0()+2468] 1
    8/4/2003 12:45:38 clp07 ORA-7445: exception encountered: core dump [snttread()+56] 6
    8/4/2003 12:23:28 csp05 ORA-7445: exception encountered: core dump
    8/5/2003 17:55:18 csp04 ORA-7445: exception encountered: core dump [snttread()+60] 1
    8/6/2003 18:25:20 csp02 ORA-7445: exception encountered: core dump [01A636AC] [SIGILL] [Illegal opcode] 6
    8/6/2003 18:25:22 clp08 ORA-7445: exception encountered: core dump [ssdinit()+3208] 1
    8/7/2003 13:59:57 clp13 ORA-7445: exception encountered: core dump [sdcreate()+224] 1
    8/7/2003 12:05:28 clp07 ORA-7445: exception encountered: core dump [kghalf()+464] 6
    8/7/2003 16:25:23 csp05 ORA-7445: exception encountered: core dump [skgfqio()+2664] 6
    8/7/2003 12:07:01 csp02 ORA-7445: exception encountered: core dump [memset()+116] 40
    8/7/2003 14:01:40 csp02 ORA-7445: exception encountered: core dump [snttread()+92] 3
    8/7/2003 12:07:02 csp10 ORA-7445: exception encountered: core dump [snttread()+92] 1
    8/12/2003 9:48:25 csp05 ORA-7445: exception encountered: core dump [skgfqio()+2860] 1
    8/13/2003 18:22:12 clp13 ORA-7445: exception encountered: core dump [ssdinit()+2904] 1
    8/13/2003 18:22:10 clp11 ORA-7445: exception encountered: core dump [01A63F14] [SIGILL] [Illegal opcode] 3
    8/18/2003 8:04:00 clp11 ORA-7445: exception encountered: core dump [ksliwat()+1028] [SIGSEGV] 1
    8/18/2003 11:11:42 clp11 ORA-7445: exception encountered: core dump [01A63F14] [SIGILL] [Illegal opcode] 6
    8/18/2003 8:45:05 clp07 ORA-7445: exception encountered: core dump [snttread()+92] 1
    8/18/2003 17:54:45 clp07 ORA-7445: exception encountered: core dump [strlen()+128] 1
    8/18/2003 16:52:04 csp08 ORA-7445: exception encountered: core dump [nsprecv()+1428] 1
    8/18/2003 19:49:02 clp08 ORA-7445: exception encountered: core dump [is_so_loaded()+1812] 3
    8/19/2003 11:26:20 clp13 ORA-7445: exception encountered: core dump [sdcreate()+348] [SIGSEGV] 59 Crashed/Restored
    8/19/2003 12:19:39 clp15 ORA-7445: exception encountered: core dump [ldxdts()+168] 1

    DIAGNOSTIC ANALYSIS:
    ——————–
    Until it was pointed out as a problem, some (unknown number) of the customer's
    instances and listeners had been started by a user who had logged in
    through secure shell, which automatically sets the process ulimit for
    core file dump size to 0.

    Out of the 9 core file dumps, 8 showed that the seg vio happened while in the
    signal handler. This is true for a large majority of the trace files dumped.

    According to the trace files, Oracle is experiencing either a seg vio, a sig bus,
    or a sigill - illegal instruction. The referenced address is either not
    mapped to an object and completely out of the process' memory map, or not
    mapped and misaligned, or an illegal instruction was attempted.

    We have not been able to determine if there is a pattern to addresses being
    referenced. There has been ora-600 reported, no database blocks or heap
    corruptions.

    WORKAROUND:
    ———–
    The customer moved Oracle Home from the Veritas file system based on a Hitachi SAN
    provided LUN to a unix file system bundled with Solaris, based on a different
    LUN from the same storage subsystem. The problem has not reproduced in 20
    days.

    RELATED BUGS:
    ————-

    REPRODUCIBILITY:
    —————-

    TEST CASE:
    ———-
    n/a

    STACK TRACE:
    ————

    SUPPORTING INFORMATION:
    ———————–
    The files are on ess30.

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————

    Discussed bug with DDR. Bug assigned and Setting status 11 for continued
    investigation.
    PLEASE INSTRUCT CUSTOMER TO SAVE ORIGINAL VERSION OF SP.O IN CASE PATCH
    NEEDS TO BE BACKED OUT !!!
    1. How is the diagnostic information activated and de-activated:
    Exception handling code is de-activated by default; Oracle processes should
    dump core upon receipt of appropriate signals (SIGSEGV et al).
    2. What new information is being traced and when is it triggered?
    Exception information should be obtained from core file instead of Oracle
    trace file.
    3. What is the performance impact of the new diagnostic tracing?
    None.
    4. What is expected from the generated diagnostic information and where will
    this lead to in terms of bug resolution:
    Hopefully, core file should shed more light on reason(s) for exception(s); if
    not, then at least this might point to a hardware problem, as I strongly
    suspect.

  3. Hdr: 2760836 8.1.7.4 RDBMS 8.1.7.4 SHARED SERVER PRODID-5 PORTID-453 ORA-600
    Abstract: PMON INSTANCE CRASH FOLLOWING ORA-600 [26599]/ORA-600 [1115] ERRORS
    PROBLEM:
    1. Clear description of the problem encountered:

    Intermittently the customer database receives an ORA-600 [26599] [1] [4] error
    followed by an ORA-600 [1115] from a shared server process. PMON then crashes
    the instance due to the ORA-600 [1115].

    2. Pertinent configuration information (MTS/OPS/distributed/etc)
    MTS, JavaVM

    3. Indication of the frequency and predictability of the problem
    Problem is intermittent, having occurred twice so far within a few days.

    4. Sequence of events leading to the problem
    Normal use of ct’s database applications.

    5. Technical impact on the customer. Include persistent after effects.
    Loss of service of the database due to instance crash.

    =========================
    DIAGNOSTIC ANALYSIS:
    In both cases the ORA-600 [26599] trace files show the current execution being
    within the 'oracle/aurora/net/Presentation' class, and both occurrences
    followed the tablespaces being coalesced. I'm not sure if these are related,
    but it would certainly suggest that some form of kgl lock corruption is
    occurring when executing this JVM class. This looks very similar to
    bug 2278777; however, the bad type encountered here is type=157, max=48.

    =========================
    WORKAROUND:
    None.
    =========================
    RELATED BUGS:
    Bug 2278777
    =========================
    REPRODUCIBILITY:
    1. State if the problem is reproducible; indicate where and predictability
    Problem is not immediately reproducible, but has occurred twice so far within a
    few days of each other.

    2. List the versions in which the problem has reproduced
    8.1.7.4

    3. List any versions in which the problem has not reproduced
    Unknown
    =========================
    TESTCASE:
    n/a
    ========================
    STACK TRACE:

    ORA-600: internal error code, arguments: [26599], [1], [4]

    ksedmp kgeriv kgeasi joxcre_ ioc_lookup_name jox_invoke_java_ jox_invoke_java_
    jox_handle_java_pre opitsk opiino opiodr opirip opidrv sou2o main start

    ORA-600: internal error code, arguments: [1115]

    ksedmp kgeriv kgesiv ksesic0 kssdch ksuxds kssdch kmcdlc kmcddsc opitsk opiino
    opiodr opirip opidrv sou2o main start

    =========================
    SUPPORTING INFORMATION:
    alert log and trace files from both occurrences.
    =========================
    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    n/a
    =========================
    DIAL-IN INFORMATION:
    n/a
    =========================
    IMPACT DATE:
    n/a
    WinZIP file uploaded containing alertlog and trace files.
    B2086813_cenweb_pmon_19595.trc
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    *** 12:01:36.757
    *** ID:(1.1) 2003-01-15 12:01:36.740
    KSSDCH: parent 9bea0fb8 cur 9d5ec85c
    prev 9bea0fc8 next 9d5ec864 own 0 link [9d5ec864,9bea0fc8]
    —————————————-
    SO: 9d5ec85c, type: 157, owner: 0, pt: 49612, flag: -/FLST/CLN/0x58
    *** 12:01:36.757
    ksedmp: internal or fatal error
    ORA-600: internal error code, arguments: [1115]
    —– Call Stack Trace —–
    ksedmp kgeriv kgesiv ksesic0 kssdch ksuxds kssxdl kssdch kmcdlc
    kssxdl kssdch ksudlp kssxdl ksuxdl ksuxda ksucln ksbrdp opirip
    opidrv sou2o main _start

    ksudlp  deleting "process"

    kmcdlc  deleting a VC
    kssdch  deleting children of VC
    kssxdl  delete SO
    ksuxds  deleting a session (short function ksudel())
    kssdch  /* chuck children */
    ksesic0 [1115]
    /* make sure not on free list already and owned by this parent */
    if (bit(cur->kssobflg, KSSOFLST) || (cur->kssobown != so))
    {
      ksdwrf("KSSDCH: parent %lx cur %lx prev %lx next %lx own %lx link ",
             (long)so, (long)cur, (long)prev, (long)next,
             (long)cur->kssobown);
      ksgdml(&cur->kssoblnk, TRUE);
      ksdwrf("\n");
      kssdmp1(cur, 1);   /* dump the state object */
      /* ksudss(9); */   /* dump system */
      ksesic0(OERI(1115));
    }

    KSSDCH: parent 9bea0fb8 cur 9d5ec85c prev 9bea0fc8
    next 9d5ec864 own 0 link [9d5ec864,9bea0fc8]
    —————————————-
    SO: 9d5ec85c, type: 157, owner: 0, pt: 49612, flag: -/FLST/CLN/0x58

    problem is (cur->kssobown != so) , in fact it is 0
    ie: cur->kssobown = 0

    Ct had not had a recurrence of the ORA-600 here, so was not willing to take
    the downtime to apply any patches. However, they have now had the error occur
    again, and so are willing to apply diagnostic patches providing they don't
    impact the system performance, as it's a production system.
    Ct has confirmed they are willing to apply the diagnostic patch here if the
    tracing can easily be disabled.
    assigning to frank as requested (his term playing up) –
    Ct had the database hang an hour after startup with the events set here. I
    have uploaded the alertlog and trace file in bug2760836_1.zip for you to check
    in case they are of use, although no ORA-600 error occurred.
    *** 02/28/03 03:09 am ***
    Ct has had the db hang three times so far this morning with the events set,
    and so will turn off the events for now to allow production usage, as running
    with the events set here appears to be causing excessive contention on the
    child library cache latch.

    Would it be possible to reduce the level on the events to reduce the amount of
    diagnostics generated here?
    As onsite team leader I have updated this bug to P1 as my customer British
    Telecom have experienced a complete loss of service and have requested this
    escalation. See also base bug 2278777 and tar 2632656.1
    This is a critical customer application to our customer and needs urgent
    attention.
    Are there any events we could set here to try to catch the corruption at an
    earlier stage? Still checking on possibility of a testcase here.

    Uploaded all the trace files generated from the first hang with the diagnostic
    patch enabled on Friday, in bug2760836_2.zip. Currently working on
    reproducing the ORA-600 [17034]/[1114]/[1115] errors on shutdown.

    I have setup the ct’s testcase and reproduced the ORA-600
    [1115]/[1114]/[17034] errors here. I have uploaded the alertlog and trace
    files produced for your investigation (bug2760836_3.zip). Will upload details
    on how to setup the testcase later if this helps progress this bug.

    Testcase uploaded in a compressed unix tar file ‘testcase.tar.Z’. This will
    create an itweb subdirectory containing all the required files and a
    README.txt file with instructions.

    The ct is making changes to prevent application connections during startup
    which prevent the ORA-600 [15000] errors. The ORA-600 [1115]/[1114]/[17034]
    errors are different though, and happen on a cleanly started database which
    tries to shutdown (immediate) whilst application connections are active, and
    so this is a suspected cause of the ORA-600 and corruption issue, rather than
    connections being made on startup. As yet we have no workaround for this, so
    I don’t agree that we can downgrade this just yet.

    Do you want me to setup the testcase on your development machine, if so please
    supply the necessary information. Otherwise, I will need to arrange access to
    a Global T&D instance?

    Testcase now setup on Solaris system as requested.

    No, you don’t need to execute any of the steps in the README once the testcase
    is setup. It should be a simple case of:
    1. Start the database
    2. Perform a Shutdown immediate.

    Executing an "alter system set mts_servers = 0" prior to shutdown and then
    waiting for a while (about a minute or so in my case) seems to avoid the
    problem. Can you check to see whether this workaround resolves the customer
    case?
    *** 03/12/03 02:03 am *** (CHG: Sta->11)
    *** 03/12/03 02:03 am ***
    I have tested this here, and even after setting MTS_SERVERS to 0, new
    connections can still be made and the PMON crash still occurs. However, as
    the ct is moving away from using dispatchers listening directly on a port to
    use a listener, this should give a workaround to the shutdown problem.
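
    A minimal SQL*Plus sketch of the shutdown workaround discussed above
    (assumes a SYSDBA session; the pause is the "about a minute" mentioned
    above, not a hard rule):

    CONNECT / AS SYSDBA
    ALTER SYSTEM SET mts_servers = 0;
    REM wait roughly a minute for the shared servers to clean up, then:
    SHUTDOWN IMMEDIATE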

    The main issue of concern here though was that this testcase showed the same
    ORA-600 [1115] and a PMON crash as being investigated here, and also showed
    the state object tree corruption preventing analysis of the hang that they are
    regularly getting. It is this latter issue that is critical for the ct, and
    was deemed to be due to the same issue as being investigated here. I doubt
    they will be willing to lower the priority here until this corruption is
    resolved. Can we not identify why this corruption occurs? e.g.:

    KSSDCH: parent 7c0414d08 cur 7c00ef450 prev 7c0414d28 next 7c00ef460 own 7c0414d08 link [7c0414d28,7c0414d28]
    —————————————-
    SO: 7c00ef450, type: 3, owner: 7c0414d08, pt: 0, flag: -/FLST/CLN/0x00
    ksedmp: internal or fatal error
    ORA-600: internal error code, arguments: [1115], [], [], [], [], [], [], []
    —– Call Stack Trace —–
    ksedmp ksfini kgeriv kgesiv ksesic0 kssdch kmclcl kmcdlp
    kmmdlp ksudlp kssxdl ksustc ksumcl2 ksucln ksbrdp opirip
    opidrv sou2o main start

    PMON: fatal error while deleting s.o. 7c00c1348 in this tree:

    SO: 7c0414d08, type: 41, owner: 7c00c1348, pt: 0, flag: INIT/-/-/0x00
    (circuit) dispatcher process id = (7c00c1348, 1)
    parent process id = (10, 1)
    user session id = (8, 5)
    connection context = 11e8c4c8
    user session = (7c00ef450), flag = (70a), queue = (8)
    dispatcher buffer = (1), status = (0, 0)
    server buffer = (3), status = (0, 0)
    —————————————-
    SO: 7c00ef450, type: 3, owner: 7c0414d08, pt: 0, flag: -/FLST/CLN/0x00
    Aborting this subtree dump because of state inconsistency

    You should be able to relink on the server, but will probably need to set the
    environment variable TMP as follows:

    export TMP=/bugmnt2/em/rmtdcsol4/tar2638867.1/tmp
    *** 03/18/03 08:03 am ***
    In testing on the customer's system, we have confirmed that ORA-4031 errors are
    being encountered due to exhaustion of the Java Pool. Therefore, could this
    be related to bug 2351854?

    Rediscovery Information: PMON cleanup of dead shared server or dispatcher
    with non-outbound child VC’s could lead to ORA-600[1115] with this stack:
    ksudlp->kssdch->kssxdl->kmcdlc->kssdch->kssxdl->ksuxds->kssdch

    Workaround: none

    Release Notes:
    ]] PMON cleanup of dead shared servers or dispatchers that have java
    ]] connections could result in ORA-600[1115], bringing the instance down.

  4. Hdr: 2028564 8.1.7.1.0 RDBMS 8.1.7.1.0 VOS PRODID-5 PORTID-59 ORA-600
    Abstract: DATABASE CRASHED WITH ORA-600 [1115] BECAUSE PMON FAILED TO CLEANUP SESSION

    Problem:
    ~~~~~~~~
    1. Clear description of the problem encountered
    -> Customer database crashed with ORA-600 [1115] when PMON was cleaning up another
    failed process

    2. Indication of the frequency and predictability of the problem
    -> One-time occurrence, so far

    3. Sequence of events leading to the problem
    -> Unknown, see below for some more info.

    4. Technical impact on the customer. Include persistent after effects.
    -> Production system crashed, but no persistent after effects

    Diagnostic Analysis:
    ~~~~~~~~~~~~~~~~~~~~
    Problem started at 9:41, when process id 9715 crashed with ORA-600 [729]
    followed by a core dump.

    Wed Sep 19 09:41:37 2001
    Errors in file /opt/oracle8/admin/D9/udump/ora_9715_d9.trc:
    ORA-600: internal error code, arguments: [729], [168], [space leak], [], []
    Wed Sep 19 09:41:52 2001
    Errors in file /opt/oracle8/admin/D9/udump/ora_9715_d9.trc:
    ORA-7445: exception encountered: core dump [11] [4026491400] [240] [0] [] []
    ORA-600: internal error code, arguments: [729], [168], [space leak], [], []

    This is bug 1712645
    Now PMON starts cleaning up this process, and fails with ORA-600 [1115],
    deleting a state object (in this case c0000000010974d0), which was a user
    session
    Workaround:
    ~~~~~~~~~~~
    None

    Related Bugs:
    ~~~~~~~~~~~~~
    There are a couple of bugs that might be related
    All are closed as not reproducible or open pending more info.

    Reproducibility:
    ~~~~~~~~~~~~~~~~
    1. State if the problem is reproducible; indicate where and predictability
    -> Not reproducible

    2. List the versions in which the problem has reproduced
    -> 8.1.7, only once

    3. List any versions in which the problem has not reproduced
    -> N/A

    Testcase:
    ~~~~~~~~~
    Don’t have one
    .Stack Trace:
    ~~~~~~~~~~~~
    Available from Oracle trace file:

    Supporting Information:
    ~~~~~~~~~~~~~~~~~~~~~~~
    Uploading these files into alert.log:

    alert.log – alert.log for the instance
    ora_9715_d9.trc – Oracle trace file with ORA-600 [729]
    parameters.log – output from “show parameters”
    pmon_9203_d9.trc – PMON trace file

    24 Hour Contact Information for P1 Bugs:
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    N/A

    Dial-in Information:
    ~~~~~~~~~~~~~~~~~~~~
    N/A

    Impact Date:
    ~~~~~~~~~~~~
    N/A
    downloading the trace files, will try to work out a testcase ..
    some questions:
    -1 When you said "This is bug 1712645", is it because you know the
    . customer interrupted a 'drop column' operation?
    -2 If the answer to the previous question is "yes",
    . does/did the table have any LOB/CLOB/NLOB/FLOB column?
    -3 alert.log shows resource_limit=true,
    . which limits are defined for db user "PBIGRP"?
    Summary
    ~~~~~~~~~
    1- Customer problem :
    "alter table truncate partition ..." interrupted by cntrl-c
    ORA-600 [729], [168], [space leak]
    leaked memory is:
    Chunk 80000001000c4050 sz= 64 freeable "Extent Starting"
    Chunk 80000001000b11b8 sz= 64 freeable "Extent Sizes   "
    Chunk 80000001000b6ee8 sz= 40 freeable "Skip Extents   "
    ORA-7445 core dump [11] [4026491400] [240] [0]
    PMON comes in and hits ORA-600 [1115]
    .
    PMON: fatal error while deleting s.o. c0000000010974d0

    2- test case :

    the most similar error I'm able to reproduce is this:
    run the testcase "tc2.sql" and follow the instructions documented in it,

    Using Oracle8i Enterprise Edition Release 8.1.7.0.0 :
    sqlplus sees an ORA-3113 and the dead Oracle shadow shows:
    ORA-600 [729], [132], [space leak]
    Chunk 972ba60 sz= 52 freeable "Extent Sizes   "
    Chunk 971c974 sz= 28 freeable "Skip Extents   "
    Chunk 971b17c sz= 52 freeable "Extent Starting"

    Using Oracle9i Enterprise Edition Release 9.0.1.0.0 :
    sqlplus sees an ORA-600 [729], [156], [space leak]
    Chunk a229924 sz= 36 freeable "Skip Extents   "
    Chunk a2308b8 sz= 60 freeable "Extent Sizes   "
    Chunk a225654 sz= 60 freeable "Extent Starting"
    Trace file is ora_12326_s817.trc
    Rediscovery information: PMON dies cleaning up a process which had gotten an
    ORA-600 [729] error or an ORA-600 [4400].
