下午尝试用Toad For Data Analysts生成查询语句和ER模型图,感觉还不错;同时配有图形化的执行计划示意图,配合Toad的SQL optimizer可以算是一个很不错的数据库开发组合。
生成查询的Query Builder界面:
下午尝试用Toad For Data Analysts生成查询语句和ER模型图,感觉还不错;同时配有图形化的执行计划示意图,配合Toad的SQL optimizer可以算是一个很不错的数据库开发组合。
生成查询的Query Builder界面:
Oracle中联机日志文件(online redo log)在大多平台上以512 字节为一个标准块。
(HPUX,Tru64 Unix上是1024bytes,SCO UNIX,Reliant UNIX上是2048bytes,而MVS,MPE/ix上是4096bytes,虽然以上许多UNIX已经不再流行,实际情况可以通过
select max(l.lebsz) log_block_size_kccle
from sys.x$kccle l
where l.inst_id = userenv(‘Instance’) 语句查询到)
LGWR后台进程写出REDO时未必能填满最后的当前日志块。举例而言,假设redo buffer中有1025字节的内容需要写出,则1025=512+512+1 共占用三个重做日志标准块,前2个标准块被填满而第三个标准块只使用了1个字节。在LGWR完成写出前,需要释放”redo allocation”闩,在此之前SGA中索引”redo buffer”信息的变量将指向未被填满块后面的一个重做块,换而言之有511字节的空间被LGWR跳过了,这就是我们说的redo wastage;我们可以通过分析v$sysstat动态视图中的redo wastage统计信息了解实例生命周期中的重做浪费量。
SQL> col name for a25 SQL> select name,value from v$sysstat where name like '%wastage%'; NAME VALUE ------------------------- ---------- redo wastage 132032
redo wastage的一个图示:
为什么要浪费这部分空间呢?实际上,这种做法十分有益于LGWR的串行I/O模式。redo wastage并不是问题或者Bug,而是Oracle故意为之的。当然过量的redo wastage也不是什么好事,一般是LGWR写出过于频繁的症状表现。9i以后很少有因为隐式参数_log_io_size过小而导致的LGWR过载了,如果在您的系统中发现redo wastage的问题不小,那么无限制地滥用commit操作往往是引起问题的罪魁祸首,减少不必要的commit语句,把commit从循环中移除都将利于减少redo wastage。
我们来看一下关于redo wastage的演示:
SQL> select distinct bytes/1024/1024 from v$log; BYTES/1024/1024 --------------- 50 /*确认联机日志文件大小为50MB*/ SQL> archive log list; /*确认数据库处于归档状态*/ Database log mode Archive Mode Automatic archival Enabled Archive destination /s01/arch SQL> set time on; 19:49:45 SQL> alter system switch logfile; /*切换日志,清理现场*/ System altered. 19:51:07 SQL> col name for a25 19:51:16 SQL> select name,value from v$sysstat where name in ('redo size','redo wastage'); NAME VALUE ------------------------- ---------- redo size 1418793324 redo wastage 88286544 /*演示开始时的基础统计值*/ 19:51:19 SQL> begin 19:52:10 2 for i in 1..550000 loop 19:52:10 3 insert into tv values(1,'a'); 19:52:10 4 commit; 19:52:10 5 end loop; 19:52:10 6 end; 19:52:11 7 / /*匿名块中commit操作位于loop循环内,将导致大量redo wastage*/ PL/SQL procedure successfully completed. 19:53:07 SQL> select name,value from v$sysstat where name in ('redo size','redo wastage'); NAME VALUE ------------------------- ---------- redo size 1689225404 redo wastage 112011352 /*频繁提交的匿名块产生了 1689225404-1418793324=257MB的redo,其中存在112011352-88286544=22MB的redo wastage*/ 19:53:14 SQL> begin 19:53:33 2 for i in 1..550000 loop 19:53:33 3 insert into tv values(1,'a'); 19:53:33 4 end loop; 19:53:33 5 commit; 19:53:33 6 end; 19:53:34 7 / /* 此匿名块中commit操作被移除loop循环中,批量修改数据后仅在最后提交一次*/ PL/SQL procedure successfully completed. 19:53:59 SQL> select name,value from v$sysstat where name in ('redo size','redo wastage'); NAME VALUE ------------------------- ---------- redo size 1828546240 redo wastage 112061296 /*稀疏提交的匿名块最后产生了1828546240-1689225404=132MB的重做,而redo wastage为112061296-112011352=48k*/
可能您会很奇怪前者不是只比后者多出22MB的redo浪费吗,为什么总的redo量差了那么多?
我们需要注意到commit本身也是要产生redo的,而且其所产生的还不少!就以上演示来看频繁提交的过程中,commit所占用的redo空间几乎接近一半(257-132-22)/257=40%,而每次commit的平均redo量为(257-132-22)*1024*1024/550000=196 bytes。
commit操作是事务ACID的基础之一,合理运用commit可以帮我们构建健壮可靠的应用,而滥用它必将是另一场灾难!
数月之前,一位新疆的同事反映客户的某套单节点数据库存在性能问题,某个SQL长时间运行没有结束,在该SQL运行期间大量回话处于latch free等待中。因为相隔遥遥千里,同事之间是通过Email交流的;他首先给我发来了该问题SQL的相关explain_log。其中该语句的执行计划如下:
----------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ----------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 1956 | 13 | |* 1 | FILTER | | | | | | 2 | NESTED LOOPS OUTER | | 1 | 1956 | 13 | | 3 | NESTED LOOPS OUTER | | 1 | 1942 | 12 | | 4 | NESTED LOOPS OUTER | | 1 | 1924 | 11 | | 5 | NESTED LOOPS | | 1 | 1906 | 10 | | 6 | NESTED LOOPS | | 1 | 1856 | 9 | | 7 | NESTED LOOPS | | 5 | 9055 | 8 | | 8 | NESTED LOOPS OUTER | | 5 | 8575 | 7 | | 9 | NESTED LOOPS OUTER | | 5 | 8280 | 6 | | 10 | NESTED LOOPS OUTER | | 5 | 7785 | 5 | | 11 | NESTED LOOPS OUTER | | 5 | 7290 | 4 | | 12 | NESTED LOOPS OUTER | | 5 | 4130 | 3 | |* 13 | TABLE ACCESS BY INDEX ROWID| S_PROD_INT | 5 | 2960 | 2 | |* 14 | INDEX RANGE SCAN | S_PROD_INT_M53 | 326 | | 5 | | 15 | TABLE ACCESS BY INDEX ROWID| S_PROD_LN | 1 | 234 | 1 | |* 16 | INDEX UNIQUE SCAN | S_PROD_LN_P1 | 1 | | | | 17 | TABLE ACCESS BY INDEX ROWID | S_PROD_INT_CRSE | 1 | 632 | 1 | |* 18 | INDEX RANGE SCAN | S_PROD_INT_CRSE_U1 | 1 | | 1 | | 19 | TABLE ACCESS BY INDEX ROWID | S_LIT | 1 | 99 | 1 | |* 20 | INDEX UNIQUE SCAN | S_LIT_P1 | 1 | | | | 21 | TABLE ACCESS BY INDEX ROWID | S_LIT | 1 | 99 | 1 | |* 22 | INDEX UNIQUE SCAN | S_LIT_P1 | 1 | | | | 23 | TABLE ACCESS BY INDEX ROWID | S_PROD_INT_TNTX | 1 | 59 | 1 | |* 24 | INDEX RANGE SCAN | S_PROD_INT_TNTX_U1 | 1 | | 1 | | 25 | TABLE ACCESS BY INDEX ROWID | S_VOD | 1 | 96 | 1 | |* 26 | INDEX RANGE SCAN | S_VOD_U2 | 1 | | 2 | |* 27 | TABLE ACCESS BY INDEX ROWID | S_VOD_VER | 1 | 45 | 1 | |* 28 | INDEX RANGE SCAN | S_VOD_VER_U1 | 2 | | 4 | | 29 | TABLE ACCESS BY INDEX ROWID | S_ISS_OBJ_DEF | 1 | 50 | 1 | |* 30 | INDEX RANGE SCAN | S_ISS_OBJ_DEF_M3 | 1 | | 2 | | 31 | TABLE ACCESS BY INDEX ROWID | S_CTLG_CAT | 1 | 18 | 1 | |* 32 | INDEX UNIQUE SCAN | S_CTLG_CAT_P1 | 1 | | | | 33 | TABLE ACCESS BY INDEX ROWID | S_CTLG_CAT | 1 | 18 | 1 | |* 34 | INDEX UNIQUE SCAN | S_CTLG_CAT_P1 | 1 | | | | 35 | TABLE ACCESS BY INDEX ROWID | S_CTLG_CAT | 1 | 14 | 1 | |* 36 | INDEX UNIQUE SCAN | S_CTLG_CAT_P1 | 1 | | | | 37 | NESTED LOOPS | | 1 | 137 | 7 | | 38 | NESTED LOOPS | | 4 | 456 | 6 | | 39 | NESTED LOOPS | | 4 | 364 | 5 | | 40 | NESTED LOOPS | | 2 | 142 | 4 | | 41 | NESTED LOOPS | | 1 | 53 | 3 | | 42 | NESTED LOOPS | | 3 | 120 | 2 | |* 43 | INDEX RANGE SCAN | S_CTLG_CAT_PROD_U1 | 3 | 57 | 2 | |* 44 | TABLE ACCESS BY INDEX ROWID | S_CTLG_CAT | 1 | 21 | 1 | |* 45 | INDEX UNIQUE SCAN | S_CTLG_CAT_P1 | 1 | | | |* 46 | TABLE ACCESS BY INDEX ROWID | S_CTLG | 1 | 13 | 1 | |* 47 | INDEX UNIQUE SCAN | S_CTLG_P1 | 1 | | | | 48 | TABLE ACCESS BY INDEX ROWID | S_PARTY_CTGRY | 2 | 36 | 1 | |* 49 | INDEX RANGE SCAN | S_PARTY_CTGRY_F1 | 2 | | 1 | | 50 | TABLE ACCESS BY INDEX ROWID | S_PARTY_RPT_REL | 2 | 40 | 1 | |* 51 | INDEX RANGE SCAN | S_PARTY_RPTREL_F50 | 2 | | 1 | | 52 | TABLE ACCESS BY INDEX ROWID | S_PARTY_PER | 1 | 23 | 1 | |* 53 | INDEX RANGE SCAN | S_PARTY_PER_F1 | 1 | | 2 | |* 54 | INDEX RANGE SCAN | S_PARTY_PER_U1 | 1 | 23 | 3 | | 55 | NESTED LOOPS | | 1 | 157 | 8 | | 56 | NESTED LOOPS | | 1 | 144 | 7 | | 57 | NESTED LOOPS | | 1 | 125 | 6 | | 58 | NESTED LOOPS | | 1 | 104 | 5 | | 59 | NESTED LOOPS | | 1 | 86 | 4 | | 60 | NESTED LOOPS | | 4 | 264 | 3 | | 61 | NESTED LOOPS | | 2 | 86 | 2 | |* 62 | INDEX RANGE SCAN | S_PARTY_PER_U1 | 2 | 46 | 4 | | 63 | TABLE ACCESS BY INDEX ROWID | S_POSTN | 1 | 20 | 1 | |* 64 | INDEX UNIQUE SCAN | S_POSTN_P1 | 1 | | | |* 65 | INDEX RANGE SCAN | S_PARTY_PER_U1 | 2 | 46 | 3 | |* 66 | INDEX RANGE SCAN | S_PARTY_RPT_REL_U2 | 1 | 20 | 1 | |* 67 | INDEX RANGE SCAN | S_PARTY_CTGRY_U1 | 1 | 18 | 1 | |* 68 | TABLE ACCESS BY INDEX ROWID | S_CTLG_CAT | 1 | 21 | 1 | |* 69 | INDEX UNIQUE SCAN | S_CTLG_CAT_P1 | 1 | | | |* 70 | INDEX RANGE SCAN | S_CTLG_CAT_PROD_U1 | 1 | 19 | 1 | |* 71 | TABLE ACCESS BY INDEX ROWID | S_CTLG | 1 | 13 | 1 | |* 72 | INDEX UNIQUE SCAN | S_CTLG_P1 | 1 | | | | 73 | NESTED LOOPS | | 1 | 58 | 3 | | 74 | NESTED LOOPS | | 3 | 135 | 2 | |* 75 | INDEX RANGE SCAN | S_CTLG_CAT_PROD_U1 | 2 | 38 | 2 | |* 76 | TABLE ACCESS BY INDEX ROWID | S_CTLG_CAT | 1 | 26 | 1 | |* 77 | INDEX UNIQUE SCAN | S_CTLG_CAT_P1 | 1 | | | |* 78 | TABLE ACCESS BY INDEX ROWID | S_CTLG | 1 | 13 | 1 | |* 79 | INDEX UNIQUE SCAN | S_CTLG_P1 | 1 | | | -----------------------------------------------------------------------------------------------
该SQL涉及近四十个表和索引的链接操作,而且全部采用嵌套循环链接模式,是典型的siebel应用。这个SQL的执行计划十分复杂,我甚至不愿意去看它;既然从执行计划上进一步分析过于违背“惰性”,我首先想到的还是从wait interface入手,毕竟OWI是”useless”的。因为无法远程操作,反复收集信息会让整个过程十分缓慢,一口气问对方要了10046,errorstack,latches的dump信息,以及该数据库的完整RDA报告。
既然是latch free等待,那么总要先定位到相关的latch才好进一步分析,我们首先看10046 trace:
cat siebdb1_ora_15465.trc |grep "nam='latch free'"|awk '{print $7}' |sort|uniq -c 1 p1=-4611685984823284696 169 p1=-4611685984823317640 1 p1=-4611685988591416672 1 p1=-4611685988592487696 2 p1=-4611685988692060472 1 p1=-4611685988694684144 1 p1=-4611685988705526816 1 p1=-4611685988733017136 1 p1=-4611685988779484112 1 p1=-4611685988784059264 1 p1=-4611685988787329592 1 p1=-4611685988788323152 1 p1=-4611685988800312448 2 p1=-4611685988805347840 24 p1=-4611685988806714976 2 p1=-4611685988837854328 7 p1=-4611685988841028520 1 p1=-4611685988869432648 1 p1=-4611685988871320920
通过以上脚本处理,我们可以看到在执行过程中出现了各种latch free,出现最频繁的是p1=-4611685984823317640的栓。latch free等待事件中p1所代表的是会话所等待栓的地址。-4611685984823317640换算到16进制为C0000007D2F56B78。我们看看这个栓到底是何方神圣:
c0000007d2f56b78 cache buffer handles level=3 Location from where latch is held: kcbzfs: Context saved from call: 0 state=busy gotten 2734921346 times wait, failed first 656272521 sleeps 44808579 gotten 0 times nowait, failed: 0
当会话需要pin住buffer header时它首先要获去buffer handle,得到buffer handle的过程中首先要抢占cache buffer handles栓,为了避免对于cache buffer handles栓的过度争用,每个会话被允许cache一小撮buffer handles,也叫保留集(reserved set)。该保留集的上限由隐式参数_db_handles_cached(默认为5)所控制,在此基础上会话在执行不是十分复杂的SQL时不必反复申请栓。
同时存在一些十分复杂的语句,需要同时pin住大量的缓存,此时若存在空闲的handle,会话被允许索要手头5个cached buffer handles以外的handle。也为了限制单个会话的资源占用量,Oracle通过内部算法总是尽量让每个进程能够pin住”合理份额“的buffer,由此单个会话同一时间所能持有的总handle数等于db_block_buffers/processes,隐式参数_cursor_db_buffers_pinned指出了该值。另cache buffer handles并没有子栓,仅有父栓本身,所以如果系统中有大量复杂SQL,会话所能cache的buffer handles远不足以满足需求时,该栓就成为整个系统性能的瓶颈,甚至可能出现潜在的死锁/饿死问题。
结合我们从trace中获取的实际信息,cache buffer handles的命中率为1-656272521/2734921346=76%;一般来说如果系统中某个栓的命中率低于95%时,我们就可以认为该栓的争用较为激烈了;而此处一个父栓的命中率到了80%以下,我们很难想象其性能恶化状态。
在同事发给我的RDA中探索,从resource limit栏目中可以看到该实例的会话高水位从未超过2500,而实例设定的processes值则为6000,这是一个很多初级DBA或施工方会犯的错误,认为设定高一些的processes值可以避免出现ora-00020错误从而高枕无忧,殊不知Oracle中许多隐式参数的分配是针对process数来计算的,资源总数是有限的,process数多则单个会话所能操控的理论资源数相应减少,这不仅在Oracle中,几乎是放之四海皆准的原理。
极有可能是过高的processes参数影响了_cursor_db_buffers_pinned,而该问题SQL除了十分复杂外还有全部采用nested loop链接方式的问题,这就要求同时时刻需pin住大量不同表和索引的buffer,于是在SQL执行伊始会话尝试pin住了一部分的buffer,但因为最终达到了_cursor_db_buffers_pinned所指定的上限,该会话在持有cache buffer handles栓的情况下处于hang状态,进一步导致了大量其他会话因latch free等待而阻塞。
当时给同事的最主要建议就是降低processes参数,从该系统的V$resource_limit动态视图来看,将processes设到3000就绰绰有余了;同时我们可以增大隐式参数_db_handles_cached以避免系统中对该栓的争用。
几天后,同事回信告诉我开发人员通过其他途径绕过了该问题,但我还是建议他在未来可能的offline计划中加入修改processes和_db_handles_cached参数的操作。
这个case发生几个月后,我无意中从MOS中发现以下note:
Hdr: 9328032 10.2.0.3 RDBMS 10.2.0.3 BUFFER CACHE PRODID-5 PORTID-23
Abstract: INCREASE IN CACHE BUFFER CHAINS/HANDLES WAITS
*** 01/29/10 05:41 am ***
TAR:
----
3-1375394871
PROBLEM:
--------
Today, and despite no significant change in workload we have increased waits
on these two latches:
29Jan2010_1030_1045 : Top 5 Timed Events
Avg %Total
29Jan2010_1030_1045 : ~~~~~~~~~~~~~~~~~~
wait Call
29Jan2010_1030_1045 : Event Waits Time (s)
(ms) Time Wait Class
29Jan2010_1030_1045 : ------------------------------ ------------ -----------
------ ------ ----------
29Jan2010_1030_1045 : db file sequential read 4,912,142 50,169
10 40.7 User I/O
29Jan2010_1030_1045 : CPU time 45,456
36.9
29Jan2010_1030_1045 : latch: cache buffers chains 1,347,911 12,255
9 10.0 Concurrenc
29Jan2010_1030_1045 : latch: cache buffer handles 1,364,489 10,662
8 8.7 Other
29Jan2010_1030_1045 : log file sync 482,281 7,101
15 5.8 Commit
29Jan2010_1030_1045 :
-------------------------------------------------------------
DIAGNOSTIC ANALYSIS:
--------------------
See below
WORKAROUND:
-----------
None
RELATED BUGS:
-------------
REPRODUCIBILITY:
----------------
TEST CASE:
----------
STACK TRACE:
------------
SUPPORTING INFORMATION:
-----------------------
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------
DIAL-IN INFORMATION:
--------------------
IMPACT DATE:
------------
Notes so far:
There seems to be 5 - 12% increase across a number of statistics
between the OK sample and the problem sample.
eg: Total per second per TX
26th> execute count 6,322,503 7,025.2 14.6
29th> execute count 6,828,911 7,552.3 14.7
8% increase in execute count
26th> data blocks consistent reads - u 16,567,225 18,408.4 38.2
29th> data blocks consistent reads - u 18,700,073 20,681.1 40.3
12.6% increase in consistent read undo records applied
cache buffer handles
~~~~~~~~~~~~~~~~~~~~
There is about an 8% increase in gets of this latch between
the two reports:
Gets miss% wait time
26th> cache buffer handles 80,320,092 21.0 56
29th> cache buffer handles 87,158,750 55.9 10661
And the latch was already showing 21% miss rate before so was
showing signs of being close to becoming a problem.
The latch misses stats show:
Location Sleeps Waiter Sleeps
cache buffer handles kcbzgs 726,520 832,967
cache buffer handles kcbzfs 638,106 531,875
This latch is used just to get / release buffer state objects.
Each process caches up to "_db_handles_cached" of these against
its own process so once we are warmed up and have processes
then we should only use this latch if we are after pinning more
than 5 buffers in the process state.
As the misses seem balanced across kcbzgs and kcbzfs then the
gets do not appear to be due to warming up so this points at
SQLs on the system needing to pin more than 5 buffers at a
time. If we increase "_db_handles_cached" then we should be able
to alleviate the load on this latch.
eg: See PID 4174 from yesterday we see it has 9 buffers pinned
and is running:
SELECT /*+ LEADING (PROD, ITM, ORDX)
INDEX (ITM S_ORDER_ITEM_BT_W1_X )*/ ...
So we can see there are SQLs needing many more than 5 buffers
pinned at a time.
*** 01/29/10 10:06 am ***
cache buffers chains
~~~~~~~~~~~~~~~~~~~~ GETS Miss%
26th> cache buffers chains 866,844,547 1.0 0.0 83
29th> cache buffers chains 900,769,954 1.4 0.1 12258
4% increase in gets , large increase in wait time.
Previous miss rate was 1%, now 1.4%.
James already checked and mostly the spread looks reasonable
across child latches.
per second
26th> Transactions: 481.56
29th> Transactions: 513.79
6.5% more transactions processed.
This is looking like increased load is starting to see the
latch contention. It may be sensible here to look at the
higher "buffer gets" SQLs to see if they really are well
tuned. eg: 8y9msp8p3gv23 in the 1130-1145 reports are
showing increased executions but also increased buffers
per execution. It would seem sensible to check out the
plans and use of the top buffer gets SQLs.
It would also be wise to confirm we are not hitting bug 6455161
as that affects this release and I cannot see that patch in place.
ie: Verify:
select owner||'.'||Object_name||'.'||subobject_name
from dba_objects
where object_id<>data_object_id
and object_type like 'INDEX%'
order by owner,object_name,subobject_name;
That would not be a difference from 26th but if this is hit
you gets may be far higher than needed for certains types of
query.
*** 01/29/10 02:14 pm *** (CHG: Sta->11)
*** 01/29/10 02:14 pm ***
*** 01/29/10 02:40 pm *** (CHG: Sta->16)
*** 01/29/10 03:03 pm ***
*** 01/29/10 11:33 pm ***
Many of the indexes are on SIEBEL objects so
bug 6455161 looks like it could occur here.
*** 01/30/10 12:02 am ***
*** 01/30/10 12:41 am ***
I checked a few top buffer gets SQLs from older AWR that we have
awrrpt_1_67673_67677 . It looks like the heaviest SQLs are
probably not using indexes which are exposed to 6455161 but a couple
may be:
av5skvp0mkxm1
SELECT /*+ ALL_ROWS */ T3.CONFLICT_ID, T3.LAST_UPD ...
FROM SIEBEL.S_LST_OF_VAL T1, SIEBEL.S_SRM_ACTION T2, SIEBEL.S_SRM_ACTION T3
...
Indexes on the q1.log list which might be used by that query in NL:
SIEBEL.S_SRM_ACTION_M1 SIEBEL.S_SRM_ACTION_P1
SIEBEL.S_SRM_ACTION_U1
SIEBEL.S_SRM_ACT_PARAM_P1 SIEBEL.S_SRM_ACT_PARAM_U1
7khg9mx4cv5c5
Queries from SIEBEL.S_ORG_EXT with lots of other tables
Might use SIEBEL.S_ORG_EXT_M1 from the list
*** 01/30/10 12:59 am ***
=====
>>> Side issue unrelated to the issue here:
From patch list patches_28012010.lis
customer has 6868080 installed. This was superceded by 8244734
=====
*** 01/30/10 07:06 am *** (CHG: Sta->10)
*** 01/30/10 07:06 am ***
Suggestions for this:
a> Set "_db_handles_cached" >= 10
b> Get and install a patch for bug 6455161
This seems like this is minimal impact at present
as problem indexes are not used in the top buffer
gets SQLs BUT one index rebuild could lead to mass
problems, and it may give some small reductions
in buffer gets on certain SQLs, just not the top
ones.
It is a small focused fix with no clashes on current
list.
c> Double check some of the top buffer gets SQL plans.
They may be the best they could be but are worth
checking as there is little in the AWRs to show
any special reason for cache buffers chains increase
other than the slightly higher workload.
*** 02/01/10 07:32 am ***
>> SQL "av5skvp0mkxm1" may be susceptible to 6455161.
*** 02/01/10 08:03 am ***
Reviewing level of "_db_handles_cached" as some of the top
SQLs seem very deep NL outer.
Testing this here.
*** 02/01/10 11:05 am ***
For repeatedly executed deep NL SQLs higher handle caching should
improve loading on the "cache buffer handles" latch, but it
has no impact on the chains latch as NLs and NL outers keep buffers
pinned anyway.
So suggestions are still>>
Set "_db_handles_cached" >= 40 (given nested depth of some SQL)
See if extra memory should be added to shared pool to accomodate that
Install 6455161
Drill more into SQL to see if any can be improved esp wrt buffer
gets needed .
Also
Recheck V$latch_children spread for cache buffers chains
and recheck X$BH blocks on warmest children
See if there is any session skew on the latch waits
(any sessions more prone to waits than others)
该note所述case与我们的情况相似,经过排查发现最有”嫌疑“的SQL语句中也存在大量的Nested Loop外连接。这似乎是siebel应用一个普遍存在的问题,note中的客户最初发现的症结也是cache buffer handles栓等待问题,Oracle MOS最后给出的建议是:
1.增大db_handles_cache 到10以上,最后又建议增加到40以上
2. 这个note中客户的数据库版本是sparc(64 bit) 10.2.0.3, MOS建议在此基础上应用Bug 6455161的one-off patch
3.检查有大量buffer gets的SQL语句的执行计划
当前控制文件的内容可以通过CONTROLF dump以文本形式转储到实例参数指定的目录下。
各级别CONTROLF转储的区别:
转储级别 | 转储内容 |
---|---|
1 | 仅包含文件头信息 |
2 | 包括文件头,数据库信息记录,检查点进程记录 |
3 | 所有记录类型,针对循环重用的记录类型仅保留最早及最近的记录 |
4 | 如上,包含4条最新的可重用类型的记录 |
5+ | 如上,包含的可重用类型记录为2*level条 |
SQL> oradebug setmypid; |
Statement processed. |
SQL> oradebug dump controlf 3; |
Statement processed. |
SQL> oradebug tracefile_name; |
/s01/rac10g/admin/PROD/udump/prod_ora_3710.trc |
以下例而言,使用oradebug命令启动转储,将会把当前控制文件信息以文本形式转储到该服务进程对应的用户转储文件中,文本中记录了level 3所对应的内容。
SQL> oradebug setmypid; Statement processed. SQL> oradebug dump controlf 3; Statement processed. SQL> oradebug tracefile_name; /s01/rac10g/admin/PROD/udump/prod_ora_3710.trc
oradebug必须以SYSDBA形式登陆后方能使用,你也可以以非SYSDBA的身份通过alter session形式启动转储:
SQL> alter session set events 'immediate trace name controlf level 3'; Session altered.
[gview file=”http://askmac.cn/wp-content/uploads/resource/Controlfile.doc”]
可能很多朋友都遇到过这样的情况,在UNIX/Linux上定期清理Oracle日志文件夹时可能删除到仍被后台进程open着的trace文件,即某些后台进程一直持有着这些”被已经误删了的“打开文件的描述符(fd),这种情况下文件系统上该文件实际占用的空间是不会被释放的,这就造成使用df命令查看文件系统剩余空间和用du命令查看文件夹空间使用量时数值不一致的问题。此外因为是后台进程持有这些打开文件描述符,所以我们无法像kill服务进程一样来解决该问题(部分后台进程是可以kill的,不建议这样做)。oradebug是sqlplus中威力强大的debug命令,我们可以通过该命令发起多种trace/dump,其中也包括了close_trace事件;close_trace事件可以让指定进程关闭其正持有的trace文件。
下面我们就来演示下相关操作:
[maclean@rh2 ~]$ ps -ef|grep ora_|grep -v grep maclean 7281 1 0 16:35 ? 00:00:00 ora_pmon_PROD maclean 7283 1 0 16:35 ? 00:00:00 ora_psp0_PROD maclean 7285 1 0 16:35 ? 00:00:00 ora_mman_PROD maclean 7287 1 0 16:35 ? 00:00:00 ora_dbw0_PROD maclean 7289 1 0 16:35 ? 00:00:00 ora_lgwr_PROD maclean 7291 1 0 16:35 ? 00:00:00 ora_ckpt_PROD maclean 7293 1 0 16:35 ? 00:00:00 ora_smon_PROD maclean 7295 1 0 16:35 ? 00:00:00 ora_reco_PROD maclean 7297 1 0 16:35 ? 00:00:00 ora_cjq0_PROD maclean 7299 1 0 16:35 ? 00:00:00 ora_mmon_PROD maclean 7301 1 0 16:35 ? 00:00:00 ora_mmnl_PROD maclean 7303 1 0 16:35 ? 00:00:00 ora_d000_PROD maclean 7305 1 0 16:35 ? 00:00:00 ora_s000_PROD maclean 7313 1 0 16:35 ? 00:00:00 ora_qmnc_PROD maclean 7430 1 0 16:35 ? 00:00:00 ora_q000_PROD maclean 7438 1 0 16:36 ? 00:00:00 ora_q001_PROD /* lgwr是著名的Oracle后台进程,在这个启动的实例中其系统进程号为7289*/ [maclean@rh2 ~]$ ls -l /proc/7289/fd /* linux上的proc文件系统可以很方便我们探测进程信息*/ total 0 lr-x------ 1 maclean oinstall 64 Jul 26 16:38 0 -> /dev/null lr-x------ 1 maclean oinstall 64 Jul 26 16:38 1 -> /dev/null lr-x------ 1 maclean oinstall 64 Jul 26 16:38 10 -> /dev/zero lr-x------ 1 maclean oinstall 64 Jul 26 16:38 11 -> /dev/zero lr-x------ 1 maclean oinstall 64 Jul 26 16:38 12 -> /s01/rac10g/rdbms/mesg/oraus.msb lrwx------ 1 maclean oinstall 64 Jul 26 16:38 13 -> /s01/rac10g/dbs/hc_PROD.dat lrwx------ 1 maclean oinstall 64 Jul 26 16:38 14 -> /s01/rac10g/dbs/lkPROD lrwx------ 1 maclean oinstall 64 Jul 26 16:38 15 -> /s01/rac10g/oradata/PROD/controlfile/o1_mf_64q6xphj_.ctl lrwx------ 1 maclean oinstall 64 Jul 26 16:38 16 -> /s01/rac10g/flash_recovery_area/PROD/controlfile/o1_mf_64q6xpms_.ctl lrwx------ 1 maclean oinstall 64 Jul 26 16:38 17 -> /s01/rac10g/oradata/PROD/onlinelog/o1_mf_1_64q6xrsr_.log lrwx------ 1 maclean oinstall 64 Jul 26 16:38 18 -> /s01/rac10g/flash_recovery_area/PROD/onlinelog/o1_mf_1_64q6xsoy_.log lrwx------ 1 maclean oinstall 64 Jul 26 16:38 19 -> /s01/rac10g/oradata/PROD/onlinelog/o1_mf_2_64q6xths_.log l-wx------ 1 maclean oinstall 64 Jul 26 16:38 2 -> /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc lrwx------ 1 maclean oinstall 64 Jul 26 16:38 20 -> /s01/rac10g/flash_recovery_area/PROD/onlinelog/o1_mf_2_64q6xv9o_.log lrwx------ 1 maclean oinstall 64 Jul 26 16:38 21 -> /s01/rac10g/oradata/PROD/onlinelog/o1_mf_3_64q6xw1b_.log lrwx------ 1 maclean oinstall 64 Jul 26 16:38 22 -> /s01/rac10g/flash_recovery_area/PROD/onlinelog/o1_mf_3_64q6xwv0_.log lrwx------ 1 maclean oinstall 64 Jul 26 16:38 23 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_system_64q6wd5j_.dbf lrwx------ 1 maclean oinstall 64 Jul 26 16:38 24 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_undotbs1_64q6wd7f_.dbf lrwx------ 1 maclean oinstall 64 Jul 26 16:38 25 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_sysaux_64q6wd5m_.dbf lrwx------ 1 maclean oinstall 64 Jul 26 16:38 26 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_users_64q6wd89_.dbf lrwx------ 1 maclean oinstall 64 Jul 26 16:38 27 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_temp_64q6xyox_.tmp lr-x------ 1 maclean oinstall 64 Jul 26 16:38 28 -> /s01/rac10g/rdbms/mesg/oraus.msb lr-x------ 1 maclean oinstall 64 Jul 26 16:38 3 -> /dev/null lr-x------ 1 maclean oinstall 64 Jul 26 16:38 4 -> /dev/null l-wx------ 1 maclean oinstall 64 Jul 26 16:38 5 -> /s01/rac10g/admin/PROD/udump/prod_ora_7279.trc l-wx------ 1 maclean oinstall 64 Jul 26 16:38 6 -> /s01/rac10g/admin/PROD/bdump/alert_PROD.log lrwx------ 1 maclean oinstall 64 Jul 26 16:38 7 -> /s01/rac10g/dbs/lkinstPROD (deleted) lrwx------ 1 maclean oinstall 64 Jul 26 16:38 8 -> /s01/rac10g/dbs/hc_PROD.dat l-wx------ 1 maclean oinstall 64 Jul 26 16:38 9 -> /s01/rac10g/admin/PROD/bdump/alert_PROD.log /*可以看到lgwr进程相关trace文件为/s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc,对应打开文件描述符为2*/ [maclean@rh2 ~]$ ls -lh /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc -rw-r----- 1 maclean oinstall 1.7M Jul 26 16:37 /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc [maclean@rh2 ~]$ rm -f /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc /*尝试删除该trace文件*/ [maclean@rh2 ~]$ ls -l /proc/7289/fd|grep lgwr l-wx------ 1 maclean oinstall 64 Jul 26 16:38 2 -> /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc (deleted) /*文件已处在deleted状态,但lgwr进程仍持有该文件相关的文件描述符,这个时候该文件占有的空间并不会被释放*/ [maclean@rh2 ~]$ lsof|grep lgwr oracle 7289 maclean 2w REG 8,2 1702391 3867134 /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc (deleted) [maclean@rh2 ~]$ sqlplus / as sysdba SQL*Plus: Release 10.2.0.5.0 - Production on Mon Jul 26 17:03:04 2010 Copyright (c) 1982, 2010, Oracle. All Rights Reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> oradebug setospid 7289; Oracle pid: 6, Unix process pid: 7289, image: oracle@rh2 (LGWR) SQL> oradebug flush; /*写出trace buffer内容到trace文件*/ Statement processed. SQL> oradebug close_trace; Statement processed. /*close_trace能够释放指定Oracle进程正打开着的文件,To close the current trace file use*/ SQL> host [maclean@rh2 ~]$ lsof|grep lgwr [maclean@rh2 ~]$ ls -l /proc/7289/fd/|grep lgwr [maclean@rh2 ~]$ /* 从进程相关的fd文件夹中查找不到原来的trace文件;close_trace命令成功释放了该文件,并回收了磁盘空间。*/
直到今天仍是经典,无可替代,绝不落伍。想了解Oracle internal 必读的一本书,每次翻开影印本都会有所收获,感谢作者的无私贡献!
[gview file=”http://askmac.cn/wp-content/uploads/resource/oracle8i.internal.services.for.waits.latches.locks.pdf”]
共享池是Oracle著名的SGA的一个重要组成部分,当我们尝试从共享池中分配较大的连续区域时(默认来说是4400bytes),我们可能会用到共享池中的保留区域(也叫保留池);注意Oracle总是会先尝试扫描普通共享池的空闲列表,之后才尝试扫描保留池的空闲列表,无论所需分配的内存块是否超过隐式参数_shared_pool_reserved_min_alloc所指定的值。
什么?你看到过的4031描述文档是用以下伪代码描述分配流程的:
large, scan reserved list if (chunk found) check chunk size and perhaps truncate if (chunk is not found) scan regular free list if (chunk found) check chunk size and perhaps truncate all done if (chunk is not found) do LRU operations and repeat small, scan regular free list if (chunk found) check chunk size and perhaps truncate all done if (chunk is not found) do LRU operations and repeat
那么来看看以下测试:
SQL> alter system set "_shared_pool_reserved_pct"=5 scope=spfile; System altered. SQL> startup frce; SP2-0714: invalid combination of STARTUP options SQL> startup force; ORACLE instance started. Total System Global Area 3154116608 bytes Fixed Size 2099616 bytes Variable Size 2197816928 bytes Database Buffers 939524096 bytes Redo Buffers 14675968 bytes Database mounted. Database opened. SQL> select free_space from v$shared_pool_reserved; FREE_SPACE ---------- 3525368 SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ 2 FROM SYS.x$ksppi x, SYS.x$ksppcv y 3 WHERE x.inst_id = USERENV ('Instance') 4 AND y.inst_id = USERENV ('Instance') 5 AND x.indx = y.indx 6 AND x.ksppinm LIKE '%_shared_pool_reserved_min_alloc%'; NAME VALU DESCRIB ------------------------------- ---- --------------------------------------------------------------------- _shared_pool_reserved_min_alloc 4400 minimum allocation size in bytes for reserved area of shared pool SQL> select count(*) from x$ksmsp where ksmchsiz>4400 and ksmchcom!='free memory'; COUNT(*) ---------- 64 SQL> exec dbms_workload_repository.create_snapshot; PL/SQL procedure successfully completed. SQL> select count(*) from x$ksmsp where ksmchsiz>4400 and ksmchcom!='free memory'; COUNT(*) ---------- 67 /* 方才调用的存储过程成功在共享池中分配到3个大于4400 byte的Chunk,接下来看保留池大小变化) SQL> select free_space from v$shared_pool_reserved; FREE_SPACE ---------- 3525368 /* 保留池大小没有发生变化,很显然3个大于4400 byte的Chunk是从regular free list上获取的,而非reserved free list/
以上实验中我们通过调用awr快照存储过程,模拟了从共享池中分配大于4400字节Chunk的操作,实验结果是在保留池有足够空闲空间的情况下,Oracle仍尝试在普通共享池区域中分配了这些连续内存,故而通过查询内部视图x$ksmsp虽然发现了多出了三个大于4400 byte的Chunk,然而保留池的空闲量并未减少。由此可证即便是要分配大于4400字节的内存块,Oracle也会先尝试搜索普通空闲列表,在普通空闲列表上无适应尺寸的连续空间时,才会尝试扫描保留池的空闲列表。
言归正题,我们可以通过2个参数控制保留池的大小:shared_pool_reserved_size和_shared_pool_reserved_pct。这2个参数的区别在于普通参数shared_pool_reserved_size以数值形式制定保留池的大小,这个数值采用在10g的ASMM(自动管理的SGA内存管理)特性的环境中是不会随共享池的大小变化而浮动的;不同于此,隐式参数_shared_pool_reserved_pct作为一个比例值,可以协同ASMM中共享池的变化而适应变化。在讨论经典4031错误的数个文档中,都有介绍到如果在ASMM环境中,设置_shared_pool_reserved_pct往往要好过shared_pool_reserved_size,它使你的共享池更具可收缩性!
纸上得来终觉浅,我们来看看_shared_pool_reserved_pct的实际效果:
SQL> alter system set sga_max_size=3000M scope=spfile; System altered. SQL> alter system set sga_target=3000M scope=spfile; System altered. SQL> alter system set shared_pool_size=500M; System altered. SQL> alter system set "_shared_pool_reserved_pct"=50 scope=spfile; System altered. SQL> startup force ; ORACLE instance started. Total System Global Area 3154116608 bytes Fixed Size 2099616 bytes Variable Size 570426976 bytes Database Buffers 2566914048 bytes Redo Buffers 14675968 bytes Database mounted. Database opened. SQL> select free_space from v$shared_pool_reserved; FREE_SPACE ---------- 21158280 SQL> alter system set shared_pool_size=2000M ; /*ASMM下手动修改shared_pool_size,模拟共享池自动扩展的情况*/ System altered. SQL> select free_space from v$shared_pool_reserved; FREE_SPACE ---------- 21158280 /* ohhh!好像跟我们预期的差别挺大,保留池大小没变*/ 让我们跑下这段产生反复硬解析的SQL: begin for i in 1..200000 loop execute immediate 'select 2 from dual where 1='||i; end loop; end; / SQL> select free_space from v$shared_pool_reserved; FREE_SPACE ---------- 296215920 /* 这样好了,我们如愿了,SGA真"动态" / SQL> alter system set shared_pool_size=300M; /*尝试收缩ASMM下的共享池*/ System altered. SQL> alter system flush shared_pool; System altered. SQL> select free_space from v$shared_pool_reserved; FREE_SPACE ---------- 296215920 /* 我们甚至无法flush 掉这些内存,这挺要命的 / SQL> select name ,value/1024/1024 "SIZE MB" from v$system_parameter where name in ('sga_target','sga_max_size','shared_pool_size','db_cache_size','java_pool_size','large_pool_size','db_keep_cache_size'); NAME SIZE MB -------------------- ---------- sga_max_size 3008 shared_pool_size 304 large_pool_size 16 java_pool_size 16 sga_target 3008 db_cache_size 512 db_keep_cache_size 0 可以看到我们还有很多“没有分配”的SGA内存,我们来加大高速缓存看看: SQL> alter system set db_cache_size=1000M; alter system set db_cache_size=1000M * ERROR at line 1: ORA-32017: failure in updating SPFILE ORA-00384: Insufficient memory to grow cache /* ohh 因为无法回收保留池的大量内存,导致了SGA其他组件无法扩展/
_shared_pool_reserved_pct的默认值5%可以满足绝大多数情况,通过上述实验证明设置该percent参数可以使保留池大小随SGA动态调整而扩大;但通过再次调整shared_pool_size和flush shared_pool手段都无法回收过度分配的保留池空间,这会导致其他组件无法正常扩展;因而我们在10gASMM的背景下,通过设置_shared_pool_reserved_pct可以获得更好的效果,但因为存在回收空间的问题,该参数也不宜设置过大,如果默认值在您的场景中显得过小,那么您可以尝试使用5-20这个区间内的值,超过20%的话往往就会造成负面的影响了。
ORA-08103问题的诊断最好是能生成8103错误的ERROR STACK TRACE, 在TRACE中会记录具体引发8103的对象的OBJ和OBJD,这便于我们定位可能存在corruption的对象。
问题在于往往前台进程遇到ORA-08103错误不会在后台生成TRACE文件,这需要我们手动设置8103 触发ERRORSTACK的EVENTS:
ALTER SYSTEM SET EVENTS ‘ 8103 TRACE NAME ERRORSTACK LEVEL 3’;
解决思路包括:
1. 通过OBJD和DBA定位到具体的表名和块号
2. 有条件的情况下对该表做一个analyze .. validate structure
3. 有条件的情况下对该表所在tablespace做一个 dbms_space_admin.ASSM_TABLESPACE_VERIFY
4. 有条件的情况下move这张表或者相关的分区,尝试绕过该问题
5. 有条件的情况下降该表或分区移动到MSSM表空间上,绕过该问题
execute dbms_space_admin.tablespace_verify(‘&tablespace_name’)
oradebug setmypid
oradebug tracefile_name
execute dbms_space_admin.assm_tablespace_verify(‘&tablespace_name’,dbms_space_admin.TS_VERIFY_BITMAPS)
oradebug setmypid
oradebug tracefile_name
[oracle@nas ~]$ oerr ora 8103 08103, 00000, "object no longer exists" // *Cause: The object has been deleted by another user since the operation // began, or a prior incomplete recovery restored the database to // a point in time during the deletion of the object. // *Action: Delete the object if this is the result of an incomplete // recovery. @ Using the call stack trace arguments to identify the block producing the ORA-8103. @ struct kcbds @ { @ ktid kcbdstid; /* full relative DBA plus object number */ @ .....@ struct ktid /* relative dba + objd */ @ { @ kdbafr dbr_ktid; /* a relative dba */ @ kobjd objd_ktid; /* data object number */ @ kobjn objn_ktid; /* dictionary object number */@ struct kdbafr /* full relative dba */ @ { @ ktsn tsn_kdbafr; 4bytes /* a tablespace number */ @ krdba dba_kdbafr; 4bytes /* a relative dba */ @ }; @ alter session set db_file_multiblock_read_count=1; @ alter session set events '8103 trace name errorstack level 3'; @ kcbgtcr(kcbds *ds,... @ ktecgshx(sdes, ...) @ kcbds *sdes; @ ktecgetsh(cdes, ...) @ kcbds *cdes; @ Example from a trace file with function ktecgshx being called by kteinicnt1: @ kteinicnt1()+796 CALL ktecgshx() FFFFFFFF7FFF8F78 ? @ 000000003 ? 000000004 ? @ 0000001BC ? 000000000 ? @ 1007AA000 ? @ Argument/Register addr=0xFFFFFFFF7FFF8F78. @ Dump of memory from 0xFFFFFFFF7FFF8F38 to 0xFFFFFFFF7FFF9078 @ FFFFFFFF7FFF8F30 00000000 00000000 [........] @ FFFFFFFF7FFF8F40 00000000 00000000 FFFFFFFF 00000001 [................] @ FFFFFFFF7FFF8F50 00000000 00000000 00000000 00000000 [................] @ Repeat 1 times @ FFFFFFFF7FFF8F70 00000000 00000000 0000000C 01006402 [..............d.] After increase in load, queries against ASSM table intermittently fail with ORA-8103 when executed in parallel if there are concurrent updates performed on the table. This appears to only manifest when access is in parallel. Cause This is caused by Bug 5637976 ORA-8103 EVEN WITH THE WORKAROUND FROM Bug 3569503 fixed in 11.1g. Concurrent inserts and direct path exports on an ASSM table causes ORA-8103/ORA-1410. This is due to the fact that newly formatted blocks between low and high water mark do not get flushed to disk and query sees old copies from disk. Rediscovery Information: 1. Concurrent inserts and exports on ASSM tables 2. ORA-8103/ORA-1410 3. redo dump shows 'ktspbfredo - Format Pagetable Datablock' for that rdba REDO RECORD - Thread:2 RBA: 0x00045b.001887a1.0028 LEN: 0x008c VLD: 0x01 SCN: 0x0578.6eddf7be SUBSCN: 1 07/19/2012 12:11:00 CHANGE #1 TYP:1 CLS: 4 AFN:370 DBA:0x5ca5f32e OBJ:1638047 SCN:0x0578.6eddf7bd SEQ: 1 OP:13.17 ktsphfredo - Format Pagetable Segment Header StartDBA 0x5ca5f32b nblks: 32 ForceL3 :1 Tsn: 15 objd: 1638047 REDO RECORD - Thread:2 RBA: 0x00045b.001887a5.0198 LEN: 0x008c VLD: 0x01 SCN: 0x0578.6eddf7c7 SUBSCN: 1 07/19/2012 12:11:00 CHANGE #1 TYP:1 CLS: 4 AFN:284 DBA:0x4718cbee OBJ:1638047 SCN:0x0578.6eddf7c2 SEQ: 1 OP:13.17 ktsphfredo - Format Pagetable Segment Header BH (70000039ffb5108) file#: 370 rdba: 0x5ca5f32e (370/2487086) class: 7 ba: 70000039f230000 set: 94 blksize: 32768 bsi: 0 set-flg: 0 pwbcnt: 0 dbwrid: 5 obj: 1638047 objn: 148393 tsn: 15 afn: 370 hash: [700000fde5e6380,700000fde5e6380] lru: [7000005e7fcbdc0,700000b91fb4ce8] lru-flags: hot_buffer ckptq: [NULL] fileq: [NULL] objq: [700000f7c3f8288,70000063cfbac28] st: SCURRENT md: NULL tch: 2 le: 70000069bff76a0 flags: remote_transfer LRBA: [0x0.0.0] HSCN: [0xffff.ffffffff] HSUB: [65535] buffer tsn: 15 rdba: 0x5ca5f32e (370/2487086) scn: 0x0578.6eded558 seq: 0x01 flg: 0x00 tail: 0xd5582401 frmt: 0x02 chkval: 0x0000 type: 0x24=PAGETABLE EXTENT MAP BLOCK Hex dump of block: st=0, typ_found=1 EMB Dump: Map Header:: next 0x4718cbee #extents: 1112 obj#: 1638047 flag: 0x10000000 Inc # 0 Extent Map ----------------------------------------------------------------- 0x5ca5f32b length: 32 0x5ceff1eb length: 32 0x5d15360b length: 32 0x5d5ddbcb length: 32 0x5d9d106b length: 32 0x5dc000ab length: 32 0x5e09e1ab length: 32 0x5e4a8c0b length: 32 0x5e80d24b length: 32 0x5ec9a10b length: 32 0x5f009feb length: 32 0x5f40b74b length: 32 0x5f895f2b length: 32 0x5fd254cb length: 32 BH (700000dbcfc0ea8) file#: 284 rdba: 0x4718cbee (284/1625070) class: 7 ba: 700000dbc750000 set: 67 blksize: 32768 bsi: 0 set-flg: 0 pwbcnt: 0 dbwrid: 2 obj: 1638047 objn: 148393 tsn: 15 afn: 284 hash: [700000fdc387588,700000fdc387588] lru: [7000002f1fbcf90,700000a77fcfc30] lru-flags: hot_buffer ckptq: [NULL] fileq: [NULL] objq: [700000fc67dd420,700000453fb1828] st: SCURRENT md: NULL tch: 143 le: 700000665fd8200 flags: remote_transfer LRBA: [0x0.0.0] HSCN: [0xffff.ffffffff] HSUB: [65535] buffer tsn: 15 rdba: 0x4718cbee (284/1625070) scn: 0x0578.6ee3867a seq: 0x01 flg: 0x00 tail: 0x867a2401 frmt: 0x02 chkval: 0x0000 type: 0x24=PAGETABLE EXTENT MAP BLOCK Hex dump of block: st=0, typ_found=1 EMB Dump: Map Header:: next 0x00000000 #extents: 1983 obj#: 1638047 flag: 0x10000000 Inc # 0 Extent Map ----------------------------------------------------------------- 0x4718cbeb length: 32 0x475b598b length: 32 0x47989f6b length: 32 0x47d84f2b length: 32 ORA-8103 - objd: 1638108 objn: 1338416 tsn: 15 rdba: 0x4b8bf059 ksedmp: internal or fatal error ORA-08103: object no longer exists Current SQL statement for this session: ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedst+001c bl ksedst1 000000001 ? 000000000 ? ksedmp+0290 bl ksedst 104C23090 ? ksddoa+0308 bl _ptrgl ksdpcg+0104 bl ksddoa 110490160 ? 11048ACB8 ? ksdpec+00e8 bl ksdpcg FFFFFFFFFFEEF20 ? 700000010007FE0 ? FFFFFFFFFFEEFF0 ? ksfpec+00a4 bl 03F37234 kgesev+007c bl _ptrgl ksesec0+0048 bl kgesev 000007FE8 ? 104FD1FE0 ? 000000000 ? 000000000 ? FFFFFFFFFFEF410 ? kteinicnt1+0384 bl 01FC3F98 qertbFetch+0288 bl 03F386EC qertqoFetch+0298 bl 01FC3FD8 qerpx_resume+0370 bl 01FC3FD8 qerpxFetch+0e08 bl qerpx_resume 000000000 ? 11055A520 ? rwsfcd+0054 bl _ptrgl insfch+00b4 bl _ptrgl insdrv+042c bl insfch 104C2BAE8 ? 000000000 ? inscovexe+02d8 bl insdrv 1104A81B0 ? insExecStmtExecIniE bl _ptrgl ngine+005c insexe+0318 bl insExecStmtExecIniE 000000000 ? 000000400 ? ngine 11048A818 ? opiexe+2840 bl insexe 1104BF320 ? FFFFFFFFFFF1678 ? opipls+1888 bl opiexe FFFFFFFFFFF29C8 ? FFFFFFFFFFF2AB0 ? FFFFFFFFFFF2968 ? opiodr+0b2c bl _ptrgl rpidrus+01dc bl opiodr 66FFFF47D0 ? 6FFFF4800 ? FFFFFFFFFFF5900 ? A00000000 ? skgmstack+00c8 bl _ptrgl rpidru+0088 bl skgmstack 000000003 ? 000000003 ? 000000002 ? 000000000 ? FFFFFFFFFFF50B0 ? rpiswu2+0368 bl _ptrgl rpidrv+097c bl rpiswu2 70000100553C598 ? 000000000 ? 700000010003520 ? 110566428 ? 110566464 ? 96FFFF5B30 ? 1104C6010 ? 000000000 ? Argument/Register addr=0x0FFFFFFFFFFEF410. Dump of memory from 0x0FFFFFFFFFFEF3D0 to 0x0FFFFFFFFFFEF510 FFFFFFFFFFEF3D0 00000000 00000000 00000001 1048A818 [.............H..] FFFFFFFFFFEF3E0 00000000 00002000 00000001 1019C060 [...... ........`] FFFFFFFFFFEF3F0 0FFFFFFF FFFEF5E0 48220080 00000B9D [........H"......] FFFFFFFFFFEF400 00000000 00000000 00000000 00000000 [................] FFFFFFFFFFEF410 0000000F 4B8BF059 0018FEDC 00146C30 [....K..Y......l0] FFFFFFFFFFEF420 00080003 00007FE8 00000000 100733A8 [..............3.] 00146C30=> 1338416=> ORA-8103 - objd: 1638108 objn: 1338416 kjbhistory[0xbf059.12e0000,(pkey 4294967295.0)(where 1)] *** 2012-07-19 15:05:23.818 GLOBAL CACHE ELEMENT DUMP (address: 70000018cfe95a0): id1: 0xbf059 id2: 0x12e0000 pkey: INVALID block: (302/782425) lock: NC rls: 0x0000 acq: 0x0003 latch: 20 flags: 0xc1 fair: 0 recovery: 0 fpin: 'ktewh25: kteinicnt' bscn: 0x578.6ee51801 bctx: 0 write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0 lcp: 700000fd843f070 lnk: [700000fd843f090,700000fd843f090] lch: [700000bdbfbb338,700000bdbfbb338] seq: 25664 hist: 7 352 477 329 144:6 384 7 352 477 329 LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT: flg: 0x00080000 state: READING mode: EXCL pin: 'ktewh25: kteinicnt' addr: 700000bdbfbb228 obj: 1638108 cls: SEG HEAD bscn: 0x577.a4f2674f Note= OERR: ORA-8103 "object no longer exists" / Troubleshooting, Diagnostic and Solution (Doc ID 8103.1) ==> Cause ORA-8103 is caused by an invalid block type. The block header has an invalid block type or the block type inside the block is not expected; e.g. a data block (Type=6) was expected but the actual block information is not a data block (Type!=6). ORA-8103 is also caused by an unexpected data_object_id where it is changing for the involved objects while the affected SQL statement is executed. $sqlplus / as sysdba Note: please replace literal '<owner>' with actual owner --------------------< set lines 500 set long 9999 set pages 999 set serveroutput on size 1000000 set feedback off SET MARKUP HTML ON SPOOL ON HEAD "<TITLE>SQL*Plus Report</title><STYLE TYPE='TEXT/CSS'><!--BODY {background: ffffc6} --></STYLE>" spool query_result.html set echo off alter session set nls_date_format = 'yyyy/mm/dd hh24:mi:ss'; SELECT * FROM DBA_TAB_MODIFICATIONS where table_owner = '<owner>' and table_name in ('RAW_BORM','MG_34_FEE_DTL','RAW_BOIS','MG_34_CA_AMT_BK','RAW_BLDVNI'); spool off SET MARKUP HTML OFF set echo on --------------------> 1. run the hcheck script against the database "using note hcheck.sql" script to check for known problems in Oracle8i, Oracle9i, Oracle10g and Oracle 11g (Doc ID 136697.1) and provide the output to SR. Please do not provide a print screen, but the spool file obtained 2. set event for ORA 8103 to capture the errorstack alter system set events='8103 trace name errorstack, level 3'; 3. wait for the error to reproduce and upload the trace file created for the error
检查告警日志发现出现ora-600:[17281],[1001]记录,该数据库版本为10.2.0.4:
ORA-00600: internal error code, arguments: [17281], [1001], [0x70000042F5E54F8], [], [], [], [], []
ORA-01001: invalid cursor
分析该600错误产生的trace文件,发现当时运行的语句是一段匿名块:
Current SQL statement for this session:
declare
t_owner varchar2(30);
t_name varchar2(30);
procedure check_mview is
dummy integer;
begin
if :object_type = ‘TABLE’ then
select 1 into dummy
from sys.all_objects
where owner = :object_owner
and object_name = :object_name
and object_type = ‘MATERIALIZED VIEW’
and rownum = 1;
:object_type := ‘MATERIALIZED VIEW’;
end if;
exception
when others then null;
end;
begin
:sub_object := null;
if :deep != 0 then
begin
if :part2 is null then
select constraint_type, owner, constraint_name
into :object_type, :object_owner, :object_name
from sys.all_constraints c
where c.constraint_name = :part1 and c.owner = user
and rownum = 1;
else
select constraint_type, owner, constraint_name, :part3
into :object_type, :object_owner, :object_name, :sub_object
from sys.all_constraints c
where c.constraint_name = :part2 and c.owner = :part1
and rownum = 1;
end if;
if :object_type = ‘P’ then :object_type := ‘PRIMARY KEY’; end if;
if :object_type = ‘U’ then :object_type := ‘UNIQUE KEY’; end if;
if :object_type = ‘R’ then :object_type := ‘FOREIGN KEY’; end if;
if :object_type = ‘C’ then :object_type := ‘CHECK CONSTRAINT’; end if;
return;
exception
when no_data_found then null;
end;
end if;
:sub_object := :part2;
if (:part2 is null) or (:part1 != user) then
begin
select object_type, user, :part1
into :object_type, :object_owner, :object_name
from sys.all_objects
where owner = user
and object_name = :part1
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
if :object_type = ‘SYNONYM’ then
select s.table_owner, s.table_name
into t_owner, t_name
from sys.all_synonyms s
where s.synonym_name = :part1
and s.owner = user
and rownum = 1;
select o.object_type, o.owner, o.object_name
into :object_type, :object_owner, :object_name
from sys.all_objects o
where o.owner = t_owner
and o.object_name = t_name
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
end if;
:sub_object := :part2;
if :part3 is not null then
:sub_object := :sub_object || ‘.’ || :part3;
end if;
check_mview;
return;
exception
when no_data_found then null;
end;
end if;
begin
select s.table_owner, s.table_name
into t_owner, t_name
from sys.all_synonyms s
where s.synonym_name = :part1
and s.owner = ‘PUBLIC’
and rownum = 1;
select o.object_type, o.owner, o.object_name
into :object_type, :object_owner, :object_name
from sys.all_objects o
where o.owner = t_owner
and o.object_name = t_name
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
check_mview;
return;
exception
when no_data_found then null;
end;
:sub_object := :part3;
begin
select o.object_type, o.owner, o.object_name
into :object_type, :object_owner, :object_name
from sys.all_objects o
where o.owner = :part1
and o.object_name = :part2
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
check_mview;
return;
exception
when no_data_found then null;
end;
begin
if :part2 is null and :part3 is null
then
select ‘USER’, null, :part1
into :object_type, :object_owner, :object_name
from sys.all_users u
where u.username = :part1
and rownum = 1;
return;
end if;
exception
when no_data_found then null;
end;
begin
if :part2 is null and :part3 is null and :deep != 0
then
select ‘ROLE’, null, :part1
into :object_type, :object_owner, :object_name
from sys.session_roles r
where r.role = :part1
and rownum = 1;
return;
end if;
exception
when no_data_found then null;
end;
:object_owner := null;
:object_type := null;
:object_name := null;
:sub_object := null;
end;
—– Call Stack Trace —–
calling call entry argument values in hex
location type point (? means dubious value)
——————– ——– ——————– —————————-
ksedst+001c bl ksedst1 088424844 ? 041124844 ?
ksedmp+0290 bl ksedst 104A54870 ?
ksfdmp+0018 bl 03F30204
kgeriv+0108 bl _ptrgl
kgeasi+0118 bl kgeriv 1104722C8 ? 1101B87C0 ?
104AFA0FC ? 7000000100067F8 ?
000000000 ?
kgicli+0188 bl kgeasi 110195A58 ? 110447310 ?
438100004381 ? 200000002 ?
200000002 ? 000000000 ?
0000003E9 ? 000000002 ?
kgidlt+0398 bl kgicli 110450A70 ? 104C734E0 ?
kgidel+0018 bl kgidlt FFFFFFFFFFF90B8 ? 000000000 ?
000000003 ? 000000000 ?
FFFFFFFFFFF9468 ?
perabo+00ac bl kgidel FFFFFFFFFFF8D20 ? 0000000FF ?
perdcs+0050 bl perabo 000000000 ? 000000820 ?
000000000 ?
peidcs+01dc bl perdcs 110477618 ? 000000000 ?
kkxcls+00a4 bl peidcs FFFFFFFFFFF9468 ? 110477618 ?
kxsClean+0044 bl kkxcls 1100DD338 ?
kxsCloseXsc+0444 bl kxsClean FFFFFFFFFFF9760 ?
kksCloseCursor+031c bl kxsCloseXsc 110478688 ? 110281FB0 ?
opicca+00c4 bl kksCloseCursor 104BD9640 ?
opiclo+00a0 bl opicca 10013D940 ?
kpoclsa+0050 bl 03F32B00
opiodr+0ae0 bl _ptrgl
ttcpip+1020 bl _ptrgl
opitsk+1124 bl 01F9F2A0
opiino+0990 bl opitsk 000000000 ? 000000000 ?
opiodr+0ae0 bl _ptrgl
opidrv+0484 bl 01F9E0E8
sou2o+0090 bl opidrv 3C02DC1BBC ? 44065F000 ?
FFFFFFFFFFFF3A0 ?
opimai_real+01bc bl 01F9B9F4
main+0098 bl opimai_real 000000000 ? 000000000 ?
__start+0098 bl main 000000000 ? 000000000 ?
first argument为17281,该代码对应为在关闭游标时发生错误事件。
发生错误时的调用栈为:kkxcls->peidcs->perdcs->perabo->kgidel->kgidlt->kgicli,通过以上调用栈与argument信息在600 lookup工具中查询,可以发现bug:[6051353]:
Hdr: 6051353 10.1.0.45 THIN 10.1.0.5 PRODID-972 PORTID-212 ORA-600
Abstract: ORA-600[17281] ORA-[1001]
*** 05/14/07 07:09 am ***
TAR:
----
17450130.600
PROBLEM:
--------
Oracle 10.1.0.5 64-bit
AIX5L 64-bit server
Following the application of CPUJAN2007 patch, the database is giving
internal errors.
Tha alert log sjows:
ORA-600: internal error code, arguments: [17281], [1001],
[0x70000001E792DC8], [], [], [], [], []
ORA-1001: invalid cursor
Trace file shows the failing statement is an insert.
INSERT INTO V_RPMORGRESOURCEPOSITION
(ORGID, RESOURCEID, EFFECTIVEDAY, TERMINATIONDAY, OWNEDMW,
COMMITTEDMW, AVAILABLEMW, UNOFFEREDMW, FRRCOMMITTEDMW )
SELECT :B12 , :B11 , EFFECTIVEDAY, LEAST(:B10 , :B9 ),
NVL(OWNEDMW,0) + NVL(:B8 ,0), NVL(COMMITTEDMW,0) + NVL(:B7 ,0),
NVL(AVAILABLEMW,0) + NVL(:B6 ,0), NVL(UNOFFEREDMW,0) + NVL(:B5 ,0),
NVL(FRRCOMMITTEDMW,0) + NVL(:B4 ,0)
FROM RPMORGRESOURCEPOSITION
WHERE ORGID = :B3 AND RESOURCEID = :B2 AND EFFECTIVEDAY = :B1
Other information:
O/S info: user: , term: , ospid: 1234, machine: esu03als
client info: smartino@PJM
application name: eRpm
action name: QueryZonalLoadObligationDetail
last wait for 'SQL*Net message from client'
DIAGNOSTIC ANALYSIS:
--------------------
the function stack exactly matches 4359111
kgicli kgidlt kgidel perabo
Bug 4359111 - STRESS ORA-600[17281] WHEN RUNNING GMT APPLICATION13
But this is already fixed in 10.1.0.5
The 'client' (esu03als) is a Linux server and there is no oracle running
there.
The AIX SysAdmin. who manages the esu03als server says this about the
application:
"The way RPM connects to our database is by using DBCP (Apache Jakarta
Project -- see
As far as I can tell, the release (jar) contained within the RPM
delivery is version 1.1 of DBCP which was released on 2003-10-20."
This couls also be Bug 5392685/5413487 but tis doesn't seem to have a
resolution.
Also could be Bug 5366763
possible workaround - set session_cached_cursors to 0
But this is already set to 0.
Originally the Customer thought this error started after applying CPUJAN2007,
but it is appearing on another database where the CPU patch is not applied.
WORKAROUND:
-----------
none
RELATED BUGS:
-------------
4359111
5413487
5366763
REPRODUCIBILITY:
----------------
intermittent
TEST CASE:
----------
none
STACK TRACE:
------------
ksedmp ksfdmp kgeriv kgeasi
kgicli kgidlt kgidel perabo perdcs peidcs kkxcls2 kxtcln
kxsClean kksCloseCursor opicca opiclo kpoclsa opiodr
ttcpip opitsk opiino opiodr opidrv sou2o
main
其调用栈完全一致,可以基本确定2者的关联性。但该文档叙述bug发生在10.1.0.5的AIX版本上,且据称该Bug之前已在“Bug 4359111 – STRESS ORA-600[17281] WHEN RUNNING GMT APPLICATION13”中声明并在10.1.0.5上修复,看起来又是一个伪修复的漏洞。另外一个文档叙述了同样的错误发生在10.2.0.4上:
Hdr: 8337808 10.2.0.4 RDBMS 10.2.0.4 PRG INTERFACE PRODID-5 PORTID-46 ORA-600 4359111
Abstract: ORA-600 [17281] [1001] EVEN AFTER APPLYING THE PATCH 4359111
In prod_ora_12985.trc we see that:
O/S info: user: Arun?Sharma, term: ARUN, ospid: 2556:3300,
machine: BVM-EDP\ARUN
Got the ORA-600 at 2009-04-17 13:27:21 .
From the trace this was likely because the PLSQL block in
cursor #2 has an instantiation entry indicating that it
has cursor #3 open:
INSTANTIATION OBJECT: object=0xf60e9ed0
type="PL/SQL"[0] lock=0xa383a928 handle=0xb48def6c body=(nil)
flags=[40] executions=0
CURSORS: size=4 count=1 next=3
index cursor tag context flags
----- ------ -------- -------- ---------------
2 3 0xf60d56f4 0xf60f832c LRU/PRS/[03]
^here
But there is no cursor#3 so it has likely been closed independently
So the immediate thing to do would be to find out what this OS
user (Arun Sharma) was doing at 13.27 on 17th April from the
ARUN machine under OS pid 2556:3300 ?
- what was the client program ?
- what is this clients Oracle RSF version ?
- what was being done in that client at that time.
It is unlikely that the user will remember such fine detail
so you may want to track the alert log closely and as soon
as you seen the ORA-600 find the O/S user , machine etc..
and try to contact them ASAP to confirm what they were doing
etc.. If we can get a handle on the client version / program /
actions that may help. Beyond that the next step is likely
to need a diagnostic on the server side to note cursor close
operations from the client without extranoues additional trace.
From the above update his client is TOAD using 8.0.6.0 on Windows.
TOAD is known to be affected by bug 4359111 so
you should upgrade this client to a version where
bug 4359111 is fixed. (4359111 is a CLIENT SIDE fix)
Marking as an unconfirmed duplicate of bug 4359111
as it looks like some specific client may be connecting
which does not have that patch in place.
与以上文档描述相同,trace中存在以下记录:
INSTANTIATION OBJECT: object=1105e4fe8
type=”PL/SQL”[0] lock=70000044155a8b0 handle=70000042f5e54f8 body=0 level=0
flags=[40] executions=0
CURSORS: size=4 count=3 next=5
index cursor tag context flags
—– —— ——– ——– —————
2 4 11049ecc8 110525f60 LRU/PRS/[03]
3 6 11049ecc8 1105261f0 LRU/PRS/[03]
4 5 11049ecc8 110526338 LRU/PRS/[03]
但实际上这里cursor 3所打开的cursor#:4,5,6均不存在,所以cursor# 3也被单独关闭了。文档中问题是由toad引起的,首先toad连接数据库是不需要安装Oracle client的,它通过一些客制化过的c/c++的接口连接到DB;如文档所述可以确定toad V8.0.6.0受到 Bug4359111的影响,而我们的环境中是通过PL/SQL developer连接到数据库的,该工具需要用到Oracle client,而开发人员安装的Oracle client一般为9.2.0.1,极有可能是这一较低版本的客户端软件造成了问题发生,到这里触发Bug的条件基本清晰了。
8i/9i的oracle client虽然仍能够连接到10g,但难保不发生一些兼容性问题或者将早期版本中的Bug再次代入,Oracle对这些连接形式或已不提供技术支持,或提供扩展模式(可能收费)的技术支持。以下列表列出了各版本Server-client的兼容性:
其实在我们升级或迁移Oracle数据库的时候就因该考虑到客户端软件也需要升级到合适版本才能满足今后兼容性及应用程序健壮度的要求,当然客户端软件并不一定只是oracle client,它可能是jdbc,也许是odbc,也许是dbi等等。
Copyright © 2025 · Genesis Framework · WordPress · Log in