Alert logfile reported as below:
*********************
Wed May 27 13:11:47 2009
Errors in file /u01/app/oracle/admin/proa021/udump/proa021_ora_9533.trc:
ORA-07445: exception encountered: core dump [memset()+116] [SIGSEGV] [Address not mapped to object] [0] [] []

From Trace file
********************
Dump file /u01/app/oracle/admin/proa021/udump/proa021_ora_9533.trc
Oracle8i Enterprise Edition Release 8.1.7.4.0 - Production
With the Partitioning option
JServer Release 8.1.7.4.0 - Production
ORACLE_HOME = /u01/app/oracle/product/817proa021
System name: SunOS
Node name: v08k01
Release: 5.8
Version: Generic_117350-38
Machine: sun4u
Instance name: proa021
Redo thread mounted by this instance: 1

Process Info
******************
Oracle process number: 117
Unix process pid: 9533, image: oracle@v08k01 (TNS V1-V3)

Error
*********
2009-05-27 13:11:47.847
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [memset()+116] [SIGSEGV] [Address not mapped to object] [0] [] []

Current SQL (Current SQL statement for this session)
***********************************************************************
SELECT COUNT(PO_LINE_ID) FROM PO_LINES_INTERFACE WHERE PO_HEADER_ID = :b1

Call Stack functions
*************************
ksedmp <- ssexhd <- sigacthandler <- memset

#####################################################################################

From Alert logfile
*********************
Wed May 27 13:18:39 2009
Errors in file /u01/app/oracle/admin/proa021/bdump/proa021_pmon_9584.trc:
ORA-00600: internal error code, arguments: [1115], [], [], [], [], [], [], []
Wed May 27 13:18:56 2009
Errors in file /u01/app/oracle/admin/proa021/bdump/proa021_pmon_9584.trc:
ORA-00600: internal error code, arguments: [1115], [], [], [], [], [], [], []

From Tracefile
*******************
Dump file /u01/app/oracle/admin/proa021/bdump/proa021_pmon_9584.trc
Oracle8i Enterprise Edition Release 8.1.7.4.0 - Production
With the Partitioning option
JServer Release 8.1.7.4.0 - Production
ORACLE_HOME = /u01/app/oracle/product/817proa021
System name: SunOS
Node name: v08k01
Release: 5.8
Version: Generic_117350-38
Machine: sun4u
Instance name: proa021
Redo thread mounted by this instance: 1

Process Info
****************
Oracle process number: 2
Unix process pid: 9584, image: oracle@v08k01 (PMON)

Error
********
2009-05-27 13:18:39.766
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [1115], [], [], [], [], [], [], []

Call Stack Functions
****************************
ksedmp <- kgeriv <- kgesiv <- ksesic0 <- kssdch <- ksuxds <- kssxdl <- kssdch <- ksudlp <- kssxdl <- ksuxdl <- ksuxda <- ksucln <- ksbrdp <- opirip <- opidrv <- sou2o <- main <- start

CURRENT SESSION'S INSTANTIATION STATE
*********************************************************
current session=8c8fdfbc
---- Cursor Dump ------
Current cursor: 0, pgadep: 0
Cursor Dump: End of cursor dump
END OF PROCESS STATE
********************
Cursor Dump
************************
Current cursor: 0, pgadep: 0
Cursor Dump: End of cursor dump
ksedmp: no current context area
ERROR: ORA-600 [1115]
VERSIONS: 6.0 to 10.1
DESCRIPTION: A problem was encountered while cleaning up a state object:
the state object is already on the free list or has the wrong parent state object.
FUNCTIONALITY: Kernel Service State object manager
IMPACT:
POSSIBLE INSTANCE FAILURE
PROCESS FAILURE
NON CORRUPTIVE - No underlying data corruption.
SUGGESTIONS: This error may be reported as a direct result of another earlier problem.
A number of related bugs have been reported:
Bug 3837965 : Abstract: ORA-7445'S AND 600'S LEADING UP TO DB CRASH
Comp Version: 8.1.7.4.0
Fixed In Version: 9.2.0.
-------------------------------------------------------------
Bug 3134843 : Abstract: ORACLE PROCESSES CRASHING WITH ORA-7445 SEGVIO ON A NUMBER OF DATABASES
Comp Version: 8.1.7.4
Status: Closed, could not be reproduced
----------------------------------------------------------------
Bug 2760836: Abstract: PMON cleanup of dead shared servers/dispatchers can crash instance(OERI:26599 / OERI 1115)
--------------------------------------------------------------
Note 2760836.8 PMON cleanup of dead shared servers/dispatchers can crash instance (OERI 26599 / OERI 1115)
----------------------------------------------------------------
PROPOSED SOLUTION JUSTIFICATION(S)
==================================
1. The one-off patch for Bug 2760836 fixes this issue, so once the customer applies the one-off patch the problem is resolved.
OR
2. The fix is included in 9.2.0.4 and later releases, so upgrading to at least 9.2.0.4 also resolves the issue.
The solution is justified by the following note:
Note 2760836.8 PMON cleanup of dead shared servers/dispatchers can crash instance (OERI 26599 / OERI 1115)
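Before and after applying either fix it is worth confirming exactly which release the instance is running. A minimal sketch, assuming SQL*Plus access as a DBA (V$VERSION is a standard view in these releases):

-- Minimal sketch: confirm the running release before/after the patch or upgrade
SQL> SELECT banner FROM v$version;
-- Before the fix this reports the 8.1.7.4.0 banners seen in the trace header above;
-- after the upgrade it should report 9.2.0.4 or later.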
Hdr: 3837965 8.1.7.4.0 RDBMS 8.1.7.4.0 VOS HEAP MGMT PRODID-5 PORTID-23 ORA-7445 4284478
Abstract: ORA-7445'S AND 600'S LEADING UP TO DB CRASH
Completed: alter tablespace PRECISE_I3_OR_TMP end backup
Tue Aug 17 08:43:23 2004
Tue Aug 17 08:43:23 2004
Beginning log switch checkpoint up to RBA [0x41.2.10], SCN: 0x0708.39ff37e8
Thread 1 advanced to log sequence 65
Current log# 16 seq# 65 mem# 0: /starsdb1/d07/oradata/cacst/redo16_1.log
Current log# 16 seq# 65 mem# 1: /starsdb1/d08/oradata/cacst/redo16_2.log
Tue Aug 17 08:43:23 2004
ARCH: Beginning to archive log# 15 seq# 64
Tue Aug 17 08:43:23 2004
ARC0: Beginning to archive log# 15 seq# 64
ARC0: Failed to archive log# 15 seq# 64
Tue Aug 17 08:45:29 2004
ARC1: Beginning to archive log# 15 seq# 64
ARC1: Failed to archive log# 15 seq# 64
Tue Aug 17 08:47:16 2004
ARC0: Beginning to archive log# 15 seq# 64
ARC0: Failed to archive log# 15 seq# 64
Tue Aug 17 08:48:37 2004
ARCH: Completed archiving log# 15 seq# 64
Tue Aug 17 08:48:57 2004
Errors in file /apps/oracle/admin/cacst/bdump/cacst_s011_28609.trc:
ORA-7445: exception encountered: core dump [00000001002F3384] [SIGSEGV]
[Address not mapped to object] [0] [] []
Tue Aug 17 08:48:57 2004
DIAGNOSTIC ANALYSIS:
--------------------
The ct. did an end backup at 8:43. At Tue Aug 17 08:48:57 2004
he received his first ORA-7445 error, which then cascaded into more
ORA-600 / ORA-7445 errors and a db crash.
.....
Errors in file /apps/oracle/admin/cacst/bdump/cacst_lgwr_28571.trc:
ORA-600: internal error code, arguments: [2667], [1], [65], [39986], [0], [0], [0], [0]
Tue Aug 17 08:49:37 2004
LGWR: terminating instance due to error 600
Tue Aug 17 08:49:37 2004
Errors in file /apps/oracle/admin/cacst/bdump/cacst_s093_28778.trc:
ORA-600: internal error code, arguments: [], [], [], [], [], [], [], []
Instance terminated by LGWR, pid = 28571
Tue Aug 17 08:53:59 2004
The first error
cacst_s011_28609.trc
ORA-7445: exception encountered: core dump [00000001002F3384] [SIGSEGV]
[Addres
s not mapped to object] [0] [] []
Current SQL statement for this session:
SELECT CO_USERS_MSTR.USER_ID, CO_USERS_MSTR.USER_NAME,
CO_USERS_MSTR.USER_NAME || ' ' || '-' || ' ' || CO_USERS_MSTR.USER_ID computed_name
FROM CO_USERS_MSTR WHERE CO_USERS_MSTR.CAP_LVL_CODE >
>> ora-7445 kghufreeuds_01
cacst_s002_6489.trc >> ORA-7445 kghssgfr2
cacst_s042_28673.trc >> ora-600 17182
cacst_s002_6489.trc >> divide by 0
cacst_s006_28599.trc >>
ORA-600: internal error code, arguments: [kghssgfr2], [0], [], [], [], [], [], []
ORA-1555: snapshot too old: rollback segment number 95 with name "RBS044" too small
Tue Aug 17 08:49:02 2004
cacst_s006_28599.trc:
ORA-7445: exception encountered: core dump [0000000101138C74] [SIGSEGV]
[Address not mapped to object] [200] [] []
ORA-7445: exception encountered: core dump [00000001010CEB80] [SIGSEGV]
[Address not mapped to object] [4294967280] [] []
cacst_s001_26644.trc:
ORA-7445: exception encountered: core dump [00000001003D74A0] [SIGSEGV]
[Address not mapped to object] [4104] [] []
ORA-7445: exception encountered: core dump [00000001003D74A0] [SIGSEGV]
[Address not mapped to object] [4104] [] []
ORA-7445: exception encountered: core dump [FFFFFFFF7D101A40] [SIGSEGV]
[Address not mapped to object] [196870144] [] []
Tue Aug 17 08:49:02 2004
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————
immediate
Uploaded evidence:
~~~~~~~~~~~~~~~~~~~
cacst_s001_26644
SEGV accessing 0x60bbc0000 No trace
cacst_s002_6489
OERI:kghssgfr2 0
ORA-1555
Private frame segmented array entry looks zeroed out.
cacst_s006_28599
SEGV accessing 0xfffffffffffffff0
Instantiation object looks all zeroed out
SEGV looks like ptr-0x10 from the zeroed out data
cacst_s011_28609
SEGV accessing 0x0
private frame segmented array looks zeroed again
cacst_s024_28635
OERI:kghufreeuds_01 [0x5d60d8e38]
Memory around this heap all zeroed out
cacst_s042_28673
OERI:17182 addr=0x5e56ffd10
Memory around this zeroed out
DB was terminated by LGWR seeing a corruption in memory.
Trace cacst_lgwr_28571.trc is not uploaded.
From the above sample of traces it looks like zeroes were seen across
lots of different chunks of memory. An Oradebug ipc dump would help
confirm these were in SGA space or not – as this is MTS it looks
like these may have been in SGA space.
So we need:
Some more traces to cross check the address ranges with problems.
Oradebug IPC when instance is up to get mapped address range
of the SGA.
cacst_lgwr_28571 to check why LGWR died.
Setting 10501 is unlikely to be of any help for this scenario.
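For reference, the requested IPC dump can be produced from SQL*Plus. A minimal sketch, assuming a SYSDBA connection on the database server (ORADEBUG SETMYPID, IPC and TRACEFILE_NAME are standard ORADEBUG commands; the dump is written to a trace file under user_dump_dest):

SQL> CONNECT / AS SYSDBA
SQL> ORADEBUG SETMYPID           -- attach ORADEBUG to this session's own server process
SQL> ORADEBUG IPC                -- dump the shared memory / SGA address ranges
SQL> ORADEBUG TRACEFILE_NAME     -- show which trace file received the IPC dump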
From the IPC dump:
SGA variable range is 05217ca000 - 05ED4CA000-1
Java is at 05ed4ca000 +
Redo is after this at 060b1a2000 +
SGA ends at 0x60bbc0000-1
cacst_s001_26644 dumped with SEGV accessing 0x60bbc0000 but it
has no trace content. This is just past the end of the SGA.
It is highly likely that this pid 26644 (s001) was the culprit and
that it swept across memory writing zeroes over various chunks of
the SGA. As this is MTS those zeroes mostly hit other sessions'
private structures, resulting in various OERI/core dumps. Finally
it looks like this pid went to write past the end of the SGA and
so seg faulted itself.
LGWR was correct to crash the instance.
The trace file cacst_s001_26644 contains no information whatsoever.
Is there a core dump from this process ?
If so please try to get the core file and a stack from it on the system
(see note 1812.1 for how to extract a stack from a core file. This is
best done on the customer machine and has to use the oracle executable
as it was at the time of the error – ie: If oracle has been relinked
or patched since then this may not give a good stack).
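A rough outline of that extraction, as a sketch only (it assumes Solaris adb/dbx are available, the core file path shown is illustrative, and the oracle binary is the one in use when the dump occurred):

# Check whether oracle has been relinked/patched since the dump
ls -l $ORACLE_HOME/bin/oracle
# adb: print the C stack backtrace ($c) from the core file
echo '$c' | adb $ORACLE_HOME/bin/oracle /path/to/core
# dbx (if installed) can often unwind frames that adb cannot
echo 'where' | dbx $ORACLE_HOME/bin/oracle /path/to/core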
Stack from adb only shows the dump routines themselves which are
invoked when a process aborts so there is no clue as to what
this process was actually doing in the first place.
It is highly likely this process went berserk for some reason and
trashed lots of the SGA with 0x00's in lots of locations, but there
is nothing in the stack to give a clue why it did so nor what code
it was in when it did so.
There is only evidence above from one failure.
I have outlined what appears to have happened for that failure.
The problem is knowing what OS PID 26644 was doing and for that
I have no visible evidence. The following may help:
a. Check all trace files from the time of the errors to see if
any contain a SYSTEMSTATE dump. If so, that may give a clue
what SQL pid 26644 was running.
b. Get the raw corefile from pid 26644 – There is a very small
chance we can get an idea of any SQL from it.
For future:
Try to get a SYSTEMSTATE when things start going wrong.
The simplest way may be event="7445 trace name systemstate level 10"
.
As soon as a dump occurs the offending process (for the scenario
in this bug) is probably in a reasonably tight CPU loop so try
to locate any high CPU sessions and get stacks from them (using
pstack or oradebug with errorstacks).
.
Keep an audit (any form you like) of DB connections and what they are
doing. Even frequent selects from v$session for active sessions may help
get some direction to help isolate what the culprit is up to.
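Putting those suggestions together, a sketch of what could be set up ahead of the next occurrence (the event syntax is the one quoted above; the V$SESSION query and the pstack call are illustrative, and the PID passed to pstack is a placeholder):

# init.ora (instance restart required): systemstate dump on the next ORA-7445
event = "7445 trace name systemstate level 10"

-- Periodic audit of active sessions, run from SQL*Plus every few minutes:
SQL> SELECT sid, serial#, username, status, program
  2  FROM   v$session
  3  WHERE  status = 'ACTIVE';

# When a dump occurs, stack any high-CPU oracle process (PID is a placeholder)
pstack 12345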
I got the following from ess30 under bug3837965/aug10:
alert_xcacst.log test_RDA.rda.tar.Z xcacst_ora_22724.trc
alertandtrc.tar traceAndCore.tar xcacst_ora_22728.trc
core_26644.txt usertraces.tar xcacst_ora_22736.trc
cores.tar xcacst_ora_22714.trc
test_cores.tar xcacst_ora_22720.trc
I have extracted all the files from those tar archives and have nothing
from July (other than some core files which may be July but I cannot
do anything with a core file without the matching alert log and
trace files):
aug10/bdump All files are from cacst instance around 17 Aug 8:49
aug10/udump Most files from cacst around 16-Aug 12:08 + 13:30
but these are deadlock traces (ORA-60). Other 2 files
are just an http message.
aug10/xcacst*.trc All from xcacst instance Around 10-Aug 9:49 – 10:15
No sign of any crash but alert ends with shutdown
at 10:11
I will proceed seeing if I can extract anything at all from core_26644
but this will take time and may get nowhere.
Please see about getting the alert + trace + core from the July 22 crash
and also please clarify what the “xcacst” instance has to do with
the issue in this bug report as that seems to be running on a totally
different machine.
I've spent time looking at what information we can get from the
core image from PID 26644, but it isn't very much as the process
was a shared server so most of the information was stored in the
SGA. As you have SHADOW_CORE_DUMP=PARTIAL the SGA has not been
dumped to the core file, so all the information about current
cursors etc. is missing. All I can say is that it was executing
cursor 52, which was a SELECT statement.
From the core file itself it looks like the stack may be intact
but just not readable by "adb". It would be very worthwhile
seeing if the stack can be unwound further by "dbx" rather than
"adb". This is best done on the customer machine, but if that is not
possible then please get the oracle executable itself and all files
from the output of "ldd $ORACLE_HOME/bin/oracle".
eg: ldd $ORACLE_HOME/bin/oracle
Note all filenames
COPY all of those files into a directory somewhere
TAR them up , compress it and upload it.
IMPORTANT: If Oracle has been relinked since Aug 17 then the stack
unwind may give false information so please get the timestamp from
ls -l $ORACLE_HOME/bin/oracle also.
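As a sketch of that collection step (the staging directory and archive names below are illustrative, not from the bug):

ls -l $ORACLE_HOME/bin/oracle            # note the timestamp (any relink since Aug 17?)
ldd $ORACLE_HOME/bin/oracle              # note every shared library it depends on
mkdir /tmp/ora_core_libs                 # illustrative staging directory
cp $ORACLE_HOME/bin/oracle /tmp/ora_core_libs
# copy each library listed by ldd into /tmp/ora_core_libs, then:
cd /tmp
tar cvf ora_core_libs.tar ora_core_libs
compress ora_core_libs.tar               # upload the resulting ora_core_libs.tar.Z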
At present I have no more useful trace to analyze available on ess30.
I will try to put some notes together on the 17-Aug corruption
and on steps which might be useful to help should this occur again.
Looked at Jul 22 traces and they show same problem occurred.
Offending process is calling memset() and trashing the SGA.
Diagnosis and suggestions mailed out.
The main thing I need from this bug is a stack from either/both of
the core files for PID 26644 (Aug 17) and PID 16193 (Jul 22)
so we can see which function calls memset(), to see if a diagnostic
can be produced (to check the bounds on the memset before it is
called).
Jul 22 traces do show Java is being used, which is what the freeing
described above relates to.
I have logged bug 3902349 to request the above code segment
be protected and have diagnostics added so that should the problem
recur in dedicated mode then we have a chance of getting a complete
trace rather than a zero filled UGA.
Putting this bug to 10 so you see the new bug# and also to get any
trace should you see a dump from one of the dedicated connections
which looks like it was due to a kokmeoc->memset call in case it
already has sufficient information to help progress the cause.
*** 09/28/04 01:19 am ***
We have a diagnostic patch pending under bug 3902349 .
The diagnostic would do the following:
From the core dump uploaded the trashing of memory occurred when
freeing up data for a JAVA call from process private information.
This was in the large pool so the trash trampled the SGA.
Now that the sessions are dedicated it is expected that if the
problem condition arose in the SAME location of code as before
then memory would be trashed but it would be private memory and
then the process would be likely to core dump, but may not have
sufficient information to get closer to the true root cause.
The diagnostic adds a simple check on the relevant free of memory
to see if the amount of memory to be cleared looks sensible.
If not the process will die with an ORA-600 instead BEFORE
it trashes lots of memory.
The overhead of the extra check is minimal (a few extra assembler
instructions which are not executed often).
However, the module which would have the debug patch added is also
included in the security patch for alert #68.
As this would be a diagnostic patch then in order to cut it we need
to know:
a. What patches customer currently has applied to the system, if any.
b. If they do intend to apply the security patch for alert #68 to
this system or not.
The failing cursor is:
select CO_DRM_PKG.GET_SCRIPT
( :"SYS_B_0" , :"SYS_B_1" , :"SYS_B_2" , :"SYS_B_3" ) from dual
This has undergone literal replacement.
It looks like the system changes the value of CURSOR_SHARING
daily between EXACT and FORCE as there are numerous sessions
using the above SQL with pure literals as well as some using
the replaced form.
It looks like this call probably invokes some Java code underneath it
as the error is for cleaning up ILMS data and KGMECID_JAVA is set.
Please see if we can get / do the following:
1. Definition of CO_DRM_PKG.GET_SCRIPT
2. Set an event="7445 trace name library_cache level 10"
to get the library cache detail on the next error.
(Can be set from alter system when convenient)
3. DBMS_SHARED_POOL.KEEP the cursor for the above select
and the PL/SQL package and any Java object it directly calls
and any TYPES it references (see the sketch after this list)
4. It is not clear why the system repeatedly changes CURSOR_SHARING.
Is it possible to stick with one setting (FORCE for online)
and have other sessions ALTER SESSION to change it if they need
EXACT? If possible this would help with '3' as when it is EXACT
you have separate cursors for the CO_DRM_PKG.GET_SCRIPT
SQL which makes it hard to KEEP them (as they are literals).
If this is not possible just KEEP the literal replaced cursor.
5. If you have the core dump from PID 24411 I can probably use that
to get further, but '3' may provide a workaround.
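For item '3', the pinning might look like the following sketch (it assumes DBMS_SHARED_POOL has been installed with dbmspool.sql; the address and hash values are placeholders to be taken from V$SQLAREA on the customer system):

-- Locate the literal-replaced cursor for the failing select
SQL> SELECT address, hash_value, sql_text
  2  FROM   v$sqlarea
  3  WHERE  sql_text LIKE 'select CO_DRM_PKG.GET_SCRIPT%';

-- Pin the cursor ('address,hash_value' from the query above; values below are placeholders)
SQL> EXEC DBMS_SHARED_POOL.KEEP('00000300A7D3F478,1234567890', 'C');

-- Pin the PL/SQL package (and likewise any Java object / TYPE it references)
SQL> EXEC DBMS_SHARED_POOL.KEEP('CO_DRM_PKG', 'P');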
I have reproduced this problem and built a stand alone testcase
which is logged in bug 4284478.
The underlying problem is fixed in 9i due to a rather large code
change in the ILMS area.
I need to know how you want to proceed now as
8.1.7.4 fixes are only possible for customers with EMS.
The options are:
If you have EMS then I can put bug 4284478 to DDR for an 8i code fix.
(Note that any 8i fix will clash with the CPU bundle and be subject to
CPU clash MLR release procedures when fixed).
If you do not have EMS or are moving off of 8i then I will mark
bug 4284478 as fixed in 9.0, fixed by transaction "varora_ilms_inh_1",
as there is no code change needed in 9i or higher releases to
address the dump.
Hdr: 3134843 8.1.7.4 RDBMS 8.1.7.4 VOS KG SUPP PRODID-5 PORTID-453 ORA-7445
Abstract: ORACLE PROCESSES CRASHING WITH ORA-7445 SEGVIO ON A NUMBER OF DATABASES
PROBLEM:
——–
Oracle foreground processes supporting 20 different databases
on the same E15k Sun machine are crashing sporadically in a large number
of functions within the Oracle code. Oracle Home is based on Veritas VXFS file
system on a Hitachi SAN based LUN. This file system is accessed exclusively
from the active node of their HA cluster. Oracle data, log and control files
are based on raw volumes that can be accessed from each nodes, but are
accessed only from the active node.
The problem started happening on August 1 while the Secondary node of their HA
cluster was in use. They failed over on the 3rd of August to the Primary node
and have been running on it ever since. The following is a spreadsheet
showing date, time, Oracle SID, function on top of stack, num of failing
processes:
Date       Time      SID    Error                                                                              # of Errors  Notes
8/1/2003   13:09:36  clp13  ORA-7445: exception encountered: core dump [kcbzgm()+320] [SIGBUS]                 92           Crashed/Restored
8/1/2003   9:39:48   clp11  ORA-7445: exception encountered: core dump [snttread()+92]                         1
8/1/2003   8:25:54   clp07  ORA-7445: exception encountered: core dump [opiodr()+5756]                         1
8/1/2003   13:09:33  csp05  ORA-7445: exception encountered: core dump [ssdinit()+2580]                        1376
8/1/2003   16:09:59  csp02  ORA-7445: exception encountered: core dump [kcbzib()+7344]                         1
8/1/2003   16:09:59  csp07  ORA-7445: exception encountered: core dump [ssdinit()+3208]                        3
8/4/2003   10:53:49  clp07  ORA-7445: exception encountered: core dump [opiosq0()+2468]                        1
8/4/2003   12:45:38  clp07  ORA-7445: exception encountered: core dump [snttread()+56]                         6
8/4/2003   12:23:28  csp05  ORA-7445: exception encountered: core dump
8/5/2003   17:55:18  csp04  ORA-7445: exception encountered: core dump [snttread()+60]                         1
8/6/2003   18:25:20  csp02  ORA-7445: exception encountered: core dump [01A636AC] [SIGILL] [Illegal opcode]    6
8/6/2003   18:25:22  clp08  ORA-7445: exception encountered: core dump [ssdinit()+3208]                        1
8/7/2003   13:59:57  clp13  ORA-7445: exception encountered: core dump [sdcreate()+224]                        1
8/7/2003   12:05:28  clp07  ORA-7445: exception encountered: core dump [kghalf()+464]                          6
8/7/2003   16:25:23  csp05  ORA-7445: exception encountered: core dump [skgfqio()+2664]                        6
8/7/2003   12:07:01  csp02  ORA-7445: exception encountered: core dump [memset()+116]                          40
8/7/2003   14:01:40  csp02  ORA-7445: exception encountered: core dump [snttread()+92]                         3
8/7/2003   12:07:02  csp10  ORA-7445: exception encountered: core dump [snttread()+92]                         1
8/12/2003  9:48:25   csp05  ORA-7445: exception encountered: core dump [skgfqio()+2860]                        1
8/13/2003  18:22:12  clp13  ORA-7445: exception encountered: core dump [ssdinit()+2904]                        1
8/13/2003  18:22:10  clp11  ORA-7445: exception encountered: core dump [01A63F14] [SIGILL] [Illegal opcode]    3
8/18/2003  8:04:00   clp11  ORA-7445: exception encountered: core dump [ksliwat()+1028] [SIGSEGV]              1
8/18/2003  11:11:42  clp11  ORA-7445: exception encountered: core dump [01A63F14] [SIGILL] [Illegal opcode]    6
8/18/2003  8:45:05   clp07  ORA-7445: exception encountered: core dump [snttread()+92]                         1
8/18/2003  17:54:45  clp07  ORA-7445: exception encountered: core dump [strlen()+128]                          1
8/18/2003  16:52:04  csp08  ORA-7445: exception encountered: core dump [nsprecv()+1428]                        1
8/18/2003  19:49:02  clp08  ORA-7445: exception encountered: core dump [is_so_loaded()+1812]                   3
8/19/2003  11:26:20  clp13  ORA-7445: exception encountered: core dump [sdcreate()+348] [SIGSEGV]              59           Crashed/Restored
8/19/2003  12:19:39  clp15  ORA-7445: exception encountered: core dump [ldxdts()+168]                          1
DIAGNOSTIC ANALYSIS:
--------------------
Until it was pointed out as a problem, some (unknown number) of the customer's
instances and listeners had been started by a user logged in
through secure shell, which automatically sets the ulimit for the process's
core file dump size to 0.
Out of the 9 core file dumps, 8 showed that the seg vio happened while in the
signal handler. This is true for a large majority of trace files dumped.
According to the trace files, Oracle is experiencing either a seg vio, a sig bus,
or a sigill (illegal instruction). The referenced address is either not
mapped to an object and completely out of the process's memory map, or not
mapped and misaligned, or an attempt was made to execute an illegal instruction.
We have not been able to determine if there is a pattern to the addresses being
referenced. There has been no ora-600 reported, and no database block or heap
corruptions.
WORKAROUND:
-----------
The customer moved Oracle Home from a Veritas file system based on a Hitachi SAN
provided LUN to a unix file system bundled with Solaris based on a different
LUN from the same storage subsystem. The problem has not reproduced in 20
days.
RELATED BUGS:
————-
REPRODUCIBILITY:
—————-
TEST CASE:
———-
n/a
STACK TRACE:
————
SUPPORTING INFORMATION:
———————–
The files are on ess30.
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
DIAL-IN INFORMATION:
——————–
IMPACT DATE:
————
Discussed bug with DDR. Bug assigned and Setting status 11 for continued
investigation.
PLEASE INSTRUCT CUSTOMER TO SAVE ORIGINAL VERSION OF SP.O IN CASE PATCH
NEEDS TO BE BACKED OUT !!!
1. How is the diagnostic information activated and de-activated:
Exception handling code is de-activated by default; Oracle processes should
dump core upon receipt of appropriate signals (SIGSEGV et al).
2. What new information is being traced and when is it triggered?
Exception information should be obtained from core file instead of Oracle
trace file.
3. What is the performance impact of the new diagnostic tracing?
None.
4. What is expected from the generated diagnostic information and where will
this lead to in terms of bug resolution:
Hopefully, core file should shed more light on reason(s) for exception(s); if
not, then at least this might point to a hardware problem, as I strongly
suspect.
Hdr: 2760836 8.1.7.4 RDBMS 8.1.7.4 SHARED SERVER PRODID-5 PORTID-453 ORA-600
Abstract: PMON INSTANCE CRASH FOLLOWING ORA-600 [26599]/ORA-600 [1115] ERRORS
PROBLEM:
1. Clear description of the problem encountered:
Intermittently the customer database receives an ORA-600 [26599] [1] [4] error
followed by an ORA-600 [1115] from a shared server process. PMON then crashes
the instance due to the ORA-600 [1115].
2. Pertinent configuration information (MTS/OPS/distributed/etc)
MTS, JavaVM
3. Indication of the frequency and predictability of the problem
Problem is intermittent, having occurred twice so far within a few days.
4. Sequence of events leading to the problem
Normal use of ct's database applications.
5. Technical impact on the customer. Include persistent after effects.
Loss of service of the database due to instance crash.
=========================
DIAGNOSTIC ANALYSIS:
In both cases the ORA-600 [26599] trace files show the current execution being
within the 'oracle/aurora/net/Presentation' class, and both occurrences
followed the tablespaces being coalesced. I'm not sure if these are related,
but it would certainly suggest that some form of kgl lock corruption is
occurring when executing this JVM class. This looks very similar to
bug 2278777; however, the bad type encountered here is type=157, max=48
=========================
WORKAROUND:
None.
=========================
RELATED BUGS:
Bug 2278777
=========================
REPRODUCIBILITY:
1. State if the problem is reproducible; indicate where and predictability
Problem is not immediately reproducible, but has occurred twice so far within a
few days of each other.
2. List the versions in which the problem has reproduced
8.1.7.4
3. List any versions in which the problem has not reproduced
Unknown
=========================
TESTCASE:
n/a
========================
STACK TRACE:
ORA-600: internal error code, arguments: [26599], [1], [4]
ksedmp kgeriv kgeasi joxcre_ ioc_lookup_name jox_invoke_java_ jox_invoke_java_
jox_handle_java_pre opitsk opiino opiodr opirip opidrv sou2o main start
ORA-600: internal error code, arguments: [1115]
ksedmp kgeriv kgesiv ksesic0 kssdch ksuxds kssdch kmcdlc kmcddsc opitsk opiino
opiodr opirip opidrv sou2o main start
=========================
SUPPORTING INFORMATION:
alert log and trace files from both occurrences.
=========================
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
n/a
=========================
DIAL-IN INFORMATION:
n/a
=========================
IMPACT DATE:
n/a
WinZIP file uploaded containing alertlog and trace files.
B2086813_cenweb_pmon_19595.trc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*** 12:01:36.757
*** ID:(1.1) 2003-01-15 12:01:36.740
KSSDCH: parent 9bea0fb8 cur 9d5ec85c prev 9bea0fc8 next 9d5ec864 own 0 link [9d5ec864,9bea0fc8]
—————————————-
SO: 9d5ec85c, type: 157, owner: 0, pt: 49612, flag: -/FLST/CLN/0x58
*** 12:01:36.757
ksedmp: internal or fatal error
ORA-600: internal error code, arguments: [1115]
----- Call Stack Trace -----
ksedmp kgeriv kgesiv ksesic0 kssdch ksuxds kssxdl kssdch kmcdlc
kssxdl kssdch ksudlp kssxdl ksuxdl ksuxda ksucln ksbrdp opirip
opidrv sou2o main _start
ksudlp   deleting "process",
...
kmcdlc   deleting a VC,
kssdch   deleting children of VC
kssxdl   delete SO
ksuxds   deleting a session (short function ksudel())
kssdch   /* chuck children */
ksesic0  [1115]
/* make sure not on free list already and owned by this parent */
if (bit(cur->kssobflg, KSSOFLST) || (cur->kssobown != so))
{
  ksdwrf("KSSDCH: parent %lx cur %lx prev %lx next %lx own %lx link ",
         (long)so, (long)cur, (long)prev, (long)next,
         (long)cur->kssobown);
  ksgdml(&cur->kssoblnk, TRUE);
  ksdwrf("\n");
  kssdmp1(cur, 1);   /* dump the state object */
  /* ksudss(9); */   /* dump system */
  ksesic0(OERI(1115));
}
…
KSSDCH: parent 9bea0fb8 cur 9d5ec85c prev 9bea0fc8 next 9d5ec864 own 0 link [9d5ec864,9bea0fc8]
—————————————-
SO: 9d5ec85c, type: 157, owner: 0, pt: 49612, flag: -/FLST/CLN/0x58
…
problem is (cur->kssobown != so) , in fact it is 0
ie: cur->kssobown = 0
Ct had not had a re-occurrence of the ORA-600 here so was not willing to take
the downtime to apply any patches. However, they have now had the error occur
again, and so are willing to apply diagnostic patches providing they don't
impact the system performance as it's a production system.
Ct has confirmed they are willing to apply the diagnostic patch here if the
tracing can easily be disabled.
assigning to frank as requested (his term playing up) –
Ct had the database hang an hour after startup with the events set here. I
have uploaded the alertlog and trace file in bug2760836_1.zip for you to check
in case they are of use, although no ORA-600 error occurred.
*** 02/28/03 03:09 am ***
Ct has had the db hang three times so far this morning with the events set,
and so will turn off the events for now to allow production usage, as running
with the events set here appears to be causing excessive contention on the
child library cache latch.
Would it be possible to reduce the level on the events to reduce the amount of
diagnostics generated here?
As onsite team leader I have updated this bug to P1 as my customer British
Telecom have experienced a complete loss of service and have requested this
escalation. See also base bug 2278777 and tar 2632656.1
This is a critical customer application to our customer and needs urgent
attention.
Are there any events we could set here to try to catch the corruption at an
earlier stage? Still checking on possibility of a testcase here.
Uploaded all the trace files generated from the first hang with the diagnostic
patch enabled on Friday, in bug2760836_2.zip. Currently working on
reproducing the ORA-600 [17034]/[1114]/[1115] errors on shutdown.
I have setup the ct’s testcase and reproduced the ORA-600
[1115]/[1114]/[17034] errors here. I have uploaded the alertlog and trace
files produced for your investigation (bug2760836_3.zip). Will upload details
on how to setup the testcase later if this helps progress this bug.
Testcase uploaded in a compressed unix tar file ‘testcase.tar.Z’. This will
create an itweb subdirectory containing all the required files and a
README.txt file with instructions.
The ct is making changes to prevent application connections during startup,
which prevents the ORA-600 [15000] errors. The ORA-600 [1115]/[1114]/[17034]
errors are different though, and happen on a cleanly started database which
tries to shut down (immediate) whilst application connections are active, and
so this is a suspected cause of the ORA-600 and corruption issue, rather than
connections being made on startup. As yet we have no workaround for this, so
I don't agree that we can downgrade this just yet.
Do you want me to set up the testcase on your development machine? If so, please
supply the necessary information. Otherwise, I will need to arrange access to
a Global T&D instance.
Testcase now set up on the Solaris system as requested.
No, you don't need to execute any of the steps in the README once the testcase
is set up. It should be a simple case of:
1. Start the database.
2. Perform a shutdown immediate.
Executing an "alter system set mts_servers = 0" prior to shutdown and then
waiting for a while (about a minute or so in my case) seems to avoid the
problem. Can you check to see whether this workaround resolves the customer
case?
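In SQL*Plus the workaround being tested looks roughly like this (a sketch only; the one-minute wait is the figure mentioned above, and MTS_SERVERS is dynamically alterable in this release):

SQL> ALTER SYSTEM SET mts_servers = 0;   -- stop retaining shared servers before shutdown
-- wait a minute or so for shared-server activity to drain, then:
SQL> SHUTDOWN IMMEDIATE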
*** 03/12/03 02:03 am *** (CHG: Sta->11)
*** 03/12/03 02:03 am ***
I have tested this here, and even after setting MTS_SERVERS to 0, new
connections can still be made and the PMON crash still occurs. However, as
the ct is moving away from using dispatchers listening directly on a port to
use a listener, this should give a workaround to the shutdown problem.
The main issue of concern here though was that this testcase showed the same
ORA-600 [1115] and a PMON crash as being investigated here, and also showed
the state object tree corruption preventing analysis of the hang that they are
regularly getting. It is this latter issue that is critical for the ct, and
was deemed to be due to the same issue as being investigated here. I doubt
they will be willing to lower the priority here until this corruption is
resolved. Can we not identify why this corruption occurs? e.g.:
KSSDCH: parent 7c0414d08 cur 7c00ef450 prev 7c0414d28 next 7c00ef460 own 7c0414d08 link [7c0414d28,7c0414d28]
—————————————-
SO: 7c00ef450, type: 3, owner: 7c0414d08, pt: 0, flag: -/FLST/CLN/0x00
ksedmp: internal or fatal error
ORA-600: internal error code, arguments: [1115], [], [], [], [], [], [], []
----- Call Stack Trace -----
ksedmp ksfini kgeriv kgesiv ksesic0 kssdch kmclcl kmcdlp
kmmdlp ksudlp kssxdl ksustc ksumcl2 ksucln ksbrdp opirip
opidrv sou2o main start
…
PMON: fatal error while deleting s.o. 7c00c1348 in this tree:
…
SO: 7c0414d08, type: 41, owner: 7c00c1348, pt: 0, flag: INIT/-/-/0x00
(circuit) dispatcher process id = (7c00c1348, 1)
parent process id = (10, 1)
user session id = (8, 5)
connection context = 11e8c4c8
user session = (7c00ef450), flag = (70a), queue = (8)
dispatcher buffer = (1), status = (0, 0)
server buffer = (3), status = (0, 0)
—————————————-
SO: 7c00ef450, type: 3, owner: 7c0414d08, pt: 0, flag: -/FLST/CLN/0x00
Aborting this subtree dump because of state inconsistency
You should be able to relink on the server, but will probably need to set the
environment variable TMP as follows:
export TMP=/bugmnt2/em/rmtdcsol4/tar2638867.1/tmp
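The relink itself would then be along these lines (a sketch using the standard 8i make target; only the TMP path comes from the update above):

export TMP=/bugmnt2/em/rmtdcsol4/tar2638867.1/tmp
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ioracle    # relink and install a new oracle executable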
*** 03/18/03 08:03 am ***
In testing on the customer's system, we have confirmed that ORA-4031 errors are
being encountered due to exhaustion of the Java Pool. Therefore, could this
be related to bug 2351854?
Rediscovery Information: PMON cleanup of dead shared server or dispatcher
with non-outbound child VC's could lead to ORA-600[1115] with this stack:
ksudlp->kssdch->kssxdl->kmcdlc->kssdch->kssxdl->ksuxds->kssdch
Workaround: none
Release Notes:
]] PMON cleanup of dead shared servers or dispatchers that have java
]] connections could result in ORA-600[1115], bringing the instance down.
Hdr: 2028564 8.1.7.1.0 RDBMS 8.1.7.1.0 VOS PRODID-5 PORTID-59 ORA-600
Abstract: DATABASE CRASHED WITH ORA-600 [1115] BECAUSE PMON FAILED TO CLEANUP SESSION
Problem:
~~~~~~~~
1. Clear description of the problem encountered
-> Cst database crashed with ORA-600 [1115], when PMON was cleaning up another
failed process
2. Indication of the frequency and predictability of the problem
-> One time occurrence, so far
3. Sequence of events leading to the problem
-> Unknown, see below for some more info.
4. Technical impact on the customer. Include persistent after effects.
-> Production system crashed, but no persistent after effects
Diagnostic Analysis:
~~~~~~~~~~~~~~~~~~~~
Problem started at 9:41, when process id 9715 crashed with ORA-600 [729]
followed by a core dump.
—
Wed Sep 19 09:41:37 2001
Errors in file /opt/oracle8/admin/D9/udump/ora_9715_d9.trc:
ORA-600: internal error code, arguments: [729], [168], [space leak], [], []
Wed Sep 19 09:41:52 2001
Errors in file /opt/oracle8/admin/D9/udump/ora_9715_d9.trc:
ORA-7445: exception encountered: core dump [11] [4026491400] [240] [0] [] []
ORA-600: internal error code, arguments: [729], [168], [space leak], [], []
—
This is bug 1712645
Now PMON starts cleaning up this process, and fails with ORA-600 [1115],
deleting a state object (in this case c0000000010974d0), which was a user
session
Workaround:
~~~~~~~~~~~
None
Related Bugs:
~~~~~~~~~~~~~
There are a couple of bugs that might be related
All are closed as not reproducible or open pending more info.
Reproducibility:
~~~~~~~~~~~~~~~~
1. State if the problem is reproducible; indicate where and predictability
-> Not reproducible
2. List the versions in which the problem has reproduced
-> 8.1.7, only once
3. List any versions in which the problem has not reproduced
-> N/A
Testcase:
~~~~~~~~~
Don't have one
.
Stack Trace:
~~~~~~~~~~~~
Available from Oracle trace file:
Supporting Information:
~~~~~~~~~~~~~~~~~~~~~~~
Uploading these files into alert.log:
—
alert.log – alert.log for the instance
ora_9715_d9.trc – Oracle trace file with ORA-600 [729]
parameters.log – output from “show parameters”
pmon_9203_d9.trc – PMON trace file
—
24 Hour Contact Information for P1 Bugs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
N/A
Dial-in Information:
~~~~~~~~~~~~~~~~~~~~
N/A
Impact Date:
~~~~~~~~~~~~
N/A
downloading the trace files, will try to work out a testcase ..
some questions :
-1 When you said "This is bug 1712645", is it because you know the
.  customer interrupted a 'drop column' operation ?
-2 If the answer to the previous question is "yes",
.  does the table have (or did it have) any LOB / CLOB / NLOB / FLOB column ?
-3 alert.log shows resource_limit=true;
.  which limits are defined for db user "PBIGRP" ?
Summary
~~~~~~~~~
1- Customer problem :
"alter table truncate partition ..." interrupted by ctrl-c
ORA-600 [729], [168], [space leak]
leaked memory is :
Chunk 80000001000c4050 sz= 64 freeable "Extent Starting"
Chunk 80000001000b11b8 sz= 64 freeable "Extent Sizes "
Chunk 80000001000b6ee8 sz= 40 freeable "Skip Extents "
ORA-7445 core dump [11] [4026491400] [240] [0]
PMON comes in and hits ORA-600 [1115]
.
PMON: fatal error while deleting s.o. c0000000010974d0
2- test case :
the most similar error I'm able to reproduce is this :
run the testcase "tc2.sql" and follow the instructions documented in it.
Using Oracle8i Enterprise Edition Release 8.1.7.0.0 :
sqlplus sees an ORA-3113 and the dead Oracle shadow shows:
ORA-600 [729], [132], [space leak]
Chunk 972ba60 sz= 52 freeable "Extent Sizes "
Chunk 971c974 sz= 28 freeable "Skip Extents "
Chunk 971b17c sz= 52 freeable "Extent Starting"
Using Oracle9i Enterprise Edition Release 9.0.1.0.0 :
sqlplus sees an ORA-600 [729], [156], [space leak]
Chunk a229924 sz= 36 freeable "Skip Extents "
Chunk a2308b8 sz= 60 freeable "Extent Sizes "
Chunk a225654 sz= 60 freeable "Extent Starting"
Trace file is ora_12326_s817.trc
Rediscovery information: PMON dies cleaning up a process which had gotten an
ORA-600 [729] error or an ORA-600 [4400].