今天没有外出(似乎人不到现场就特别容易出问题),早上10点左右接到电话被告知crm11实例上出现了7445错误,准备用web vpn拨上去查看一下,赫然发觉windows 7 不支持这种vpn(准确说ie8和firefox都不支持);无奈无奈只好用拨号。
发现alert log中出现大量 7445错误记录:
Fri Mar 26 09:24:53 2010 Errors in file /oravl01/oracle/admin/CRMDB1/udump/crmdb11_ora_6754320.trc: ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] [] Fri Mar 26 09:24:55 2010 Trace dumping is performing id=[cdmp_20100326092455] Fri Mar 26 09:31:16 2010 Errors in file /oravl01/oracle/admin/CRMDB1/udump/crmdb11_ora_2994552.trc: ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []
看到kghalp函数第一印象 ,是Oracle中堆管理使用的函数;
让我们猜猜字面意思? k -> kernel g -> generic h-> heap a-> allocation p-> point
再让我们来看一下当时的call stack:
Exception signal: 11 (SIGSEGV), code: 51 (Invalid permissions for mapped object), addr: 0x3b, PC: [0x1000973e0, kghalp+0500] Registers: iar: 00000001000973e0, msr: a00000000000d0b2 lr: 00000001013a6df8, cr: 0000000022292484 r00: 0000000000000010, r01: 0ffffffffffcb160, r02: 000000011022a9c0, r03: 0000000000000002, r04: 0000000000000000, r05: 0000000000000100, r06: 0000000000000001, r07: 0000000000000000, r08: 0000000000000000, r09: 0000000000000000, r10: 00000000101b60d8, r11: 0000000000000004, r12: 0000000024592484, r13: 000000011026bfe0, r14: 0000000000000000, r15: 0000000000009000, r16: 0000000110195b2c, r17: 0000000000000000, r18: 0000000000000001, r19: 0000000000000000, r20: 0000000000001000, r21: 0000000000000000, r22: 0000000000000100, r23: 0000000000000001, r24: 0000000000000000, r25: 0000000000000000, r26: 0000000000000001, r27: 0000000104c7fd44, r28: 0000000000000000, r29: 0000000000000100, r30: 0000000000000000, r31: 0000000110195a58, *** 2010-03-26 09:57:28.679 ksedmp: internal or fatal error ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] [] Current SQL statement for this session: INSERT INTO AUDIT_DDL_LOG (DDL_TIME, SESSION_ID, OS_USER, IP_ADDRESS, TERMINAL, HOST, USER_NAME, DDL_TYPE, OBJECT_TYPE, OWNER, OBJECT_NAME, SQL_TEXT) VALUES (SYSDATE, SYS_CONTEXT('USERENV','SESSIONID'), SYS_CONTEXT('USERENV','OS_USER'), SYS_CONTEXT('USERENV','IP_ADDRESS'), SYS_CONTEXT('USERENV','TERMINAL'), SYS_CONTEXT('USERENV','HOST'), ORA_LOGIN_USER, ORA_SYSEVENT, ORA_DICT_OBJ_TYPE, ORA_DICT_OBJ_OWNER, ORA_DICT_OBJ_NAME, :B1 ) ----- PL/SQL Call Stack ----- object line object handle number name 70000043da500d0 10 anonymous block ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedst+001c bl ksedst1 000000000 ? 104A54EED ? ksedmp+0290 bl ksedst 104A54870 ? ssexhd+03e0 bl ksedmp 300001D15 ? 000044C0 ? 00000000 parchk+01f4 bl kghalp 000000000 ? 2842288200000001 ? 000000000 ? 000000000 ? 000001040 ? 110195B2C ? ptmak+0168 bl parchk FFFFFFFFFFCB560 ? FFFFFFFFFFCB430 ? FFFFFFFFFFCB430 ? pdybF00_Init+0244 bl ptmak 10008049C ? 000000000 ? FFFFFFFFFFCB4F0 ? 07FFFFFFF ? pdy1F79_Init+00c8 bl pdybF00_Init 110BEB1D0 ? pdy1F01_Driver+0048 bl pdy1F79_Init FFFFFFFFFFCBC40 ? pdli_new_cog+00f0 bl pdy1F01_Driver FFFFFFFFFFCBCE0 ? 000000000 ? pdlifu+0264 bl pdli_new_cog 1013885F4 ? FFFFFFFFFFCCB00 ? 7000004383E7680 ? phpcog+0010 bl pdlifu FFFFFFFFFFCD958 ? 7000004383E7680 ? 104C95048 ? phpcmp+0f80 bl phpcog FFFFFFFFFFCC4F0 ? 000000000 ? pcicms2+02d4 bl phpcmp FFFFFFFFFFCD958 ?
发生错误的最上层 kghalp 函数由 parchk 调用, 这似乎是一个package check函数(猜测,呵呵). 我们来整理一下思路, parchk 函数调用了 kghalp函数以帮其分配内存,但却得到了一个非法的低地址[[0x00000003B],正常情况下正文段使用的空间; 这看起来显然是一个bug。
让我们来查查support.oracle.com , 键入7445 kghalp 和sigsegv 关键字 (很多时候不需要使用ora 600/7445 lookup tools).
bug 8244533 赫然显目:
Bug 8244533: ORA-07445 [KGHALP] ERRORS COMPILING PACKAGE WITH DEBUG STACK TRACE: ------------ ksedst <- ksedmp <- ssexhd <- 000044BC <- parchk <- ptmak <- pdybF00_Init <- pdy1F79_Init <- pdy1F01_Driver <- pdli_new_cog <- pdlifu <- phpcog <- phpcmp <- pcicms2 <- pcicms <- kkxcms <- kkxswcm <- kkxmpbms <- kkxmesu <- xtypls <- qctopls <- qctcopn <- qctcopn Exception signal: 11 (SIGSEGV), code: 51 (Invalid permissions for mapped object), addr: 0x3b, PC: [0x1000973e0, kghalp+0500] Registers: iar: 00000001000973e0, msr: a00000000000d0b2 lr: 000000010139ffb8, cr: 00000000222a2484 r00: 0000000000000010, r01: 0ffffffffffe2980, r02: 00000001101e5ab8, r03: 0000000000000002, r04: 0000000000000000, r05: 0000000000000100, r06: 0000000000000001, r07: 0000000000000000, r08: 0000000000000000, r09: 0000000000000000, r10: 0000000010171200, r11: 0000000000000004, r12: 00000000245a2484, r13: 000000011021fbc0, r14: 0000000000000000, r15: 0000000000009000, r16: 0000000110150c54, r17: 0000000000000000, r18: 0000000000000001, r19: 0000000000000000, r20: 0000000000001000, r21: 0000000000000000, r22: 0000000000000100, r23: 0000000000000001, r24: 0000000000000000, r25: 0000000000000000, r26: 0000000000000001, r27: 0000000104c5983c, r28: 0000000000000000, r29: 0000000000000100, r30: 0000000000000000, r31: 0000000110150b80, *** 16:37:14.603 ksedmp: internal or fatal error ORA-7445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] [] Current SQL statement for this session: select dummy from dual where ora_dict_obj_type = 'TABLE' ----- Call Stack Trace -----ptmak pdybF00_Init pdy1F79_Init pdy1F01_Driver pdli_new_cog pdlifuphpcog phpcmp pcicms2 pcicms kkxcms kkxswcm kkxmpbms kkxmesu xtyplsTo Filer.Based on this call stack this would appear a likely match forbug 6951953 Abstract: ORA-7445 [PTMAK] IMPORTING PACKAGE COMPILED DEBUG.This bug is fixed on 10.2.0.5 and there is a 10.2.0.4 patch available for IBM AIX Based Systems (64-bit).It maybe worth while to have the customer apply the patch to seeif it resolves the issue.Also the uploaded files included test.sql is this a reproducable testcase?
这个bug 似乎仅在 IBM AIX on POWER Systems (64-bit) 发生,当以DEBUG 模式编译包时有一定几率出现。
好了,既然已经了解了可能发生的诱因,我们可以进一步分析了,接下来看看 errorstack trace信息中 的SO 记录。
SO: 70000043d217668, type: 53, owner: 70000048cee2238, flag: INIT/-/-/0x00 LIBRARY OBJECT LOCK: lock=70000043d217668 handle=700000446261588 mode=N call pin=0 session pin=0 hpc=0000 hlc=0000 htl=70000043d2176e8[70000042b52b368,70000042bb9a808] htb=70000044929b460 ssga=70000044929ad68 user=70000048cee2238 session=70000048eb33010 count=1 flags=[0000] savepoint=0x4bac1488 LIBRARY OBJECT HANDLE: handle=700000446261588 mtx=7000004462616b8(1) cdp=1 name=ALTER TRIGGER "SHUCRM3O"."TRI_PRODUCT_INSTANCE_RELATED" COMPILE DEBUG REUSE SETTINGS hash=164e6a8942406cee159f8943a1a3c85e timestamp=03-26-2010 09:52:12 namespace=CRSR flags=RON/KGHP/TIM/PN0/SML/KST/DBN/MTX/[120100d0] kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=16 hpc=0002 hlc=0002 lwt=700000446261630[700000446261630,700000446261630] ltm=700000446261640[700000446261640,700000446261640] pwt=7000004462615f8[7000004462615f8,7000004462615f8] ptm=700000446261608[700000446261608,700000446261608] ref=700000446261660[700000446261660,700000446261660] lnd=700000446261678[700000446261678,700000446261678] LIBRARY OBJECT: object=70000045adbc1e8 type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0 CHILDREN: size=16 child# table reference handle 5 70000041776f5c0 70000045ae44720 70000042bfa3a20 DATA BLOCKS: data# heap pointer status pins change whr 0 70000043d9fed20 70000045adbc300 I/P/A/-/- 0 NONE 00
的确有以debug 模式编译对象的语句,不过对象不是包而是trigger ; 看起来只要是可以以debug 模式compile 的对象都有可能引发该问题。
好了,问题到这里已经比较明确了: 应用端以DEBUG模式重新编译包引发了 Oracle bug 8244533,从而导致了对应服务进程的崩溃;总算是虚惊一场,之后通过trace内的machine和user信息找到了实施变更的应用方人员并教育之。