mutex

【Maclean Liu技术分享】深入理解Oracle中 Mutex的内部原理

2013/04/14 by admin Leave a Comment

【Maclean Liu技术分享】深入理解Oracle中 Mutex的内部原理

本文适合对Oracle Mutex/latch有兴趣了解其深入内部原理的同学

下载地址：

【Maclean Liu技术分享】深入理解Oracle中Mutex的内部原理

深入理解Oracle中的Mutex

2013/04/13 by admin 10 Comments

了解 Oracle Mutex

虽然Mutex中文翻译为互斥锁，但为了和OS mutex充分的区别，所以我们在本文里称Oracle Mutex为Mutex。

Oracle中的mutex，类似于Latch，是一种低级的串行机制，用以控制对SGA中部分共享数据结构的访问控制。 Oracle中的串行机制有不少，引入它们的目的是避免一个对象出现下述现象：

当某些进程在访问该对象时，该资源被重新分配
当某些进程在修改它时，被其他进程读取
当某些进程在修改它时，被其他进程修改
当某些进程在读取它时，被其他进程修改

不同于Latch，Mutex的使用更灵活，用途更多，例如:

哪些需要被mutex保护的共享数据结构可以有自己独立的mutex，即一个对象拥有自己独立的mutex，不像Latch往往一个需要保护大量对象，举例来说，每一个父游标有其对应的mutex，而每一个子游标也有其对应的mutex
每一个数据结构可能有一个或多个mutex保护，每一个mutex负责保护其结构的不同部分
当然一个mutex也可以用来保护多于一个的数据结构

理论上mutex即可以存放在其保护的结构本身中(其实是嵌入在结构里)，也可以存放在其他地方。一般情况下Mutex是在数据结构需要被保护时动态创建出来的。如是嵌在需要保护结构体内的mutex，则当所依附的数据结构被清理时该mutex也将被摧毁。

Mutex带来的好处

虽然mutex和latch都是Oracle中的串行机制，但是mutex具有一些latch没有的好处

更轻量级且更快

Mutex作为Latch的替代品，具有更快速获得，更小等优势。获取一个mutex进需要大约30~35个指令，而Latch则需要150~200个指令。一个mutex结构的大小大约为16 bytes，而在10.2版本中一个latch需要112个bytes，在更早的版本中是200个bytes。从200个bytes 精简到112个是通过减少不必要的统计指标 SLEEP1~SLEEP11、WAITERS_WOKEN, WAITS_HOLDING_LATCH等从而实现的。今后我们将看到更多关于Latch的代码优化。

减少伪争用

典型情况下一个Latch保护多个对象。当一个Latch保护多个热对象时，并行地对这些对象的频繁访问让latch本身变成性能的串行点。这也就是我们此处说的伪争用点，因为争用是发生在这个串行保护的机制上，而不是进程去访问的对象本身。与latch不同，使用mutex的情况下Oracle开发人员可以为每一个要保护的数据结构创建一个独立的mutex。这意味着Latch的那种伪争用将大大减少，因为每一个对象均被自己独立拥有的mutex保护

Mutex在一些地方替代了latch和PIN

一个Mutex可供多个Oracle进程并行地参考，反过来说进程们可以以S(Shared 共享) mode模式参考一个Mutex。以S mode一起共享参考这个mutex的进程的总数成为参考总数reference count。Mutex自身结构中存放了这个ref count的数据。另一方面，mutex也可以被以X (Exclusive)mode排他模式被仅有一个进程所持有Held。

Mutex有2种用途，一方面他们可以充当维护必要串行机制的结构，如同latch那样；同时也可以充当pin，避免对象被age out。

举例来说，mutex结构中包含的ref count信息可以用作替代library cache pin。在mutex充当cursor pin之前，当一个进程要执行=>pin一个cursor时需要做的是针对性地创建library cache pin和删除这个library cache pin(均为S mode)。mutex充当cursor pin之后，进程只需要增加和减少mutex上的ref count即可。

当某一个进程首次解析一个游标 Cursor，他将临时创建并移除一个library cache pin，但是该进程后续的解析或执行进要求增加或者减少ref count。注意在这个增加/减少ref count的过程中无需acquire latch，这是因为mutex自身能起到限制串行访问修改ref count的作用。当一个进程要移除自己的mutex pin时，它减少ref count，同样的无需acquire 任何latch。

Mutex和Latch的交互

Latches和Mutex 是独立的串行机制，举例来说一个进程可以同时持有latch和mutex。在进程异常dead的情况下，一般latch要比Mutex更早被PMON清理。一般情况下不存在mutex的死锁。不像latch，在早期版本例如9i之前我们经常遇到latch死锁的问题。

Mutex的用途

在版本10.2中仅仅有 KKS 这个内核层是mutex的客户，KKS 意为 Kernel Kompile Shared，它是Library Cache中的shared cursor游标共享部分层次的代码。在之后的版本中，ORACLE开发部门更多地使用了Mutex，不局限于KKS。

KKS游标共享如何使用Mutex

kks 使用mutex以便保护对于下述基于parent cursor父游标和子游标child cursor的一系列操作。

对于父游标parent cursor的操作：

基于发生的不同操作，对应不同的等待事件：
- 在某个父游标名下创建一个新的游标 ==> cursor:mutex X
- 检查一个父游标 ==> cursor:mutex S
- 绑定值捕获 ==> cursor:mutex X
保护父游标的mutex嵌入在父游标结构内
针对父游标parent cursor的Mutex类型为’Cursor Parent’ (kgx_kks2).
针对父游标parent cursor的Mutex等待事件均为’ Cursor: mutex *’的形式

针对游标统计信息的操作

基于对不同的游标统计信息的操作有不同的等待事件：
- 构造，更新游标相关的统计信息 ==> cursor:mutex X
- 检测游标相关的统计信息，例如访问V$SQLSTATS ==> cursor : mutex S
相关的游标可能在父游标中，也可能在游标统计信息相关的hash table上
针对游标统计信息的Mutex类型为Cursor Stat (kgx_kks1)
针对游标统计信息的Mutex等待事件均为’ Cursor: mutex *’的形式

Mutex是如何替代library cache pin来保护cursor heap的？

传统的’library cache pin’在10.2.0.2之后默认被取代，此处PIN被Mutex及其ref count取代。当进程执行游标语句时或者需要PIN，或者需要hard parse一个子游标heap。
在版本10.2.0.1中，使用mutex部分代码替代PIN的功能默认是不激活的，实际上这取决于隐藏参数_KKS_USE_MUTEX_PIN，在10.2.0.2之后_KKS_USE_MUTEX_PIN默认为TRUE。换而言之在版本10.2中我们还是可以关闭KKS使用MUTEX替代PIN保护CURSOR的，但是在版本11g中则几乎无法关闭MUTEX。注意10.2中仅当KKS真正使用MUTEX时，library cache pin不再用作cursor pin。
基于对不同的游标统计信息的操作有不同的等待事件：
- 为执行某个SQL而PIN一个游标Cursor ==>Cursor: Pin S Wait on X
- 当执行一个游标而PIN Cursor，而该Cursor正被其他进程以S mode检测 ==> cursor:pin S
当试图重建一个游标Cursor ==> Cursor: pin X 该等待事件一般不太会看到，因为当一个游标正被执行，且其需要重建时会有另一个游标被创建
保护游标的mutex嵌入在游标结构内
Mutex类型为’Cursor Pin’ (kgx_kks3)
等待事件均为 ‘cursor: pin *’的形式

KKS使用MUTEX情况下SQL语句的解析与执行的收益

在版本10.2中，以下是几个SQL解析与执行从MUTEX哪里获得主要收益：

在某个父游标下构建一个新的子游标
- 首先这种构建新子游标的操作更廉价了，当时Maclean仍要告诫你一个父游标下过多的子游标仍不是一件好事情
对父游标的检测
- 在找到一个合适的游标并执行前，父游标需要被适当检测。对父游标的这种检测目前也使用mutex来保护了，所以这种检测更的成本更低了
对于已经加载在Library Cache 中的SQL语句重复执行
- 常规情况下，当一个进程要执行SQL游标前总是必须要先pin它
- 不使用MUTEX的情况：若游标处于OPEN状态下以便今后的重复执行，且参数cursor_space_for_time(CSFT 目前已不推荐使用该参数)为TRUE，则每一次重复执行可以不需要library cache pin。若游标处于OPEN状态下但是cursor_space_for_time=false，则进程在重复执行SQL游标前总是要先拿library cache pin
- 使用MUTEX的情况: 相反，若使用mutex来替代library cache pin时，则无需关心cursor_space_for_time 。仅第一个进程需要做一个PIN，其他后续进程都只需要简单地在对应保护cursor heap的mutex上拿一个共享reference 。

查询SQL统计信息

通过V$SQLSTAT视图(本质上是X$KKSSQLSTAT)访问SQL统计信息时，其所需要的CPu和获取的Latch数量要远远少于访问其他V$SQL视图。在早期版本中, 并行地访问V$SQL或者V$SQLAREA视图会造成 library cache latch的争用。

Mutex 的统计信息

下面是一个AWR中的Mutex Sleep Statistics，这些数据主要来源于V$MUTEX_SLEEP视图。

Mutex Sleep Summary

ordered by number of sleeps desc

Mutex Type	Location	Sleeps
Library Cache	kglhdgh1 64	2,356
Library Cache	kglpnal2 91	2,345
Cursor Pin	kkslce [KKSCHLPIN2]	2,084
Library Cache	kglpin1 4	956
Library Cache	kglhdgn2 106	784
Library Cache	kglpndl1 95	691
Library Cache	kglpnal1 90	605
Library Cache	kgllkdl1 85	580
Library Cache	kgllkal1 80	404
Library Cache	kglllal3 111	282
Library Cache	kglllal1 109	218
Library Cache	kglhdgn1 62	163
Library Cache	kgllldl2 112	156
Library Cache	kgllkc1 57	105
Library Cache	kglget2 2	100
Library Cache	kglini1 32	53
Library Cache	kglget1 1	31
Cursor Pin	kksLockDelete [KKSCHLPIN6]	22
Library Cache	kgllkal3 82	18
Library Cache	kglUnsetHandleReference 120	10
Cursor Pin	kksxsccmp [KKSCHLPIN5]	10
Library Cache	kglobld1 75	8
Cursor Pin	kksfbc [KKSCHLPIN1]	8
Library Cache	kglUpgradeLock 119	7
Library Cache	kglhdgc1 102	2
Cursor Pin	kksfbc [KKSCHLFSP2]	2
Library Cache	kgldtin1 42	1
Library Cache	kglhbh1 63	1
Library Cache	kgllkal5 84	1
Library Cache	kglrdtin1 44	1
Cursor Parent	kkscsPruneChild [KKSPRTLOC35]	1

Mutex的类型Mutex Type

Mutex的类型其实就是 mutex对应的客户的名字，在版本10.2中基本只有KKS使用Mutex，所以仅有3种:

Cursor Stat (kgx_kks1)
Cursor Parent (kgx_kks2)
Cursor Pin (kgx_kks3)

在版本11g中扩展了对Mutex的使用，在Library Cache的HASH BUCKET中嵌入了mutex以保护hash bucket，所以多了一种mutex type : Library Cache

哪些代码函数会申请Mutex?

Oracle中哪些代码函数会申请Mutex? 例如KKSFBC等，其实很像 V$LATCH_MISSES 的location列

10.2中最常见的下面的几个函数

kkspsc0 -负责解析游标 – 检测我们正在解析的游标是否有对象的parent cursor heap 0存在

kksfbc – 负责找到合适的子游标或者创建一个新的子游标

kksFindCursorstat

11g开始有大量函数需要用到Mutex了

SQL> select location from X$MUTEX_SLEEP_HISTORY;

LOCATION
—————————————-
kkslce [KKSCHLPIN2]
kksfbc [KKSCHLFSP2]
kglhdgn2 106
kglpin1 4
kglhdgn2 106
kglllal1 109
kglpin1 4
kglpndl1 95
kglpin1 4
kglpin1 4
kksfbc [KKSCHLFSP2]
kglhdgn1 62
kglpnal1 90
kglllal3 111
kglpnal1 90
kglpnal1 90
kglget2 2
kglllal3 111
kglget2 2
kglobld1 75
kkslce [KKSCHLPIN2]
kglpndl1 95
kglpndl1 95
kglpin1 4
kkslce [KKSCHLPIN2]
kglpin1 4
kglget2 2
kglllal1 109
kgllkc1 57
kglget2 2
kglpnal1 90
kglpin1 4
kglpin1 4
kglpin1 4
kgllkdl1 85
kglllal3 111
kgllldl2 112
kglpin1 4
kglpndl1 95
kkslce [KKSCHLPIN2]
kksLockDelete [KKSCHLPIN6]
kglpndl1 95
kkslce [KKSCHLPIN2]
kglpnal1 90
kglpin1 4
kglpin1 4
kgllldl2 112
kgllkdl1 85
kglpin1 4
kglhdgn2 106
kglhdgn2 106
kksLockDelete [KKSCHLPIN6]
kglhdgn1 62

Mutex的Get和Sleep

当一个Mutex被申请时，一般称为一个get request。若初始的申请未能得到授权，则该进程会因为此次申请而进入到255次SPIN中(_mutex_spin_count Mutex spin count)，每次SPIN循环迭代过程中该进程都会去看看Mutex被释放了吗。

若该Mutex在SPIN之后仍未被释放，则该进程针对申请的mutex进入对应的mutex wait等待事件中。实际进程的等待事件和等待方式由mutex的类型锁决定，例如 Cursor pin、Cursor Parent。举例来说，这种等待可能是阻塞等待，也可以是sleep。

但是请注意在V$MUTEX_SLEEP_*视图上的sleep列意味着等待的次数。相关代码函数在开始进入等待时自加这个sleep字段。

等待计时从进程进入等待前开始计算等待时间，当一个进程结束其等待，则等待的时间加入都总和total中。该进程再次尝试申请之前的Mutex，若该Mutex仍不可用，则它再次进入spin/wait的循环。

V$MUTEX_SLEEP_HISTORY视图的GETS列仅在成功申请到一个Mutex时才增加。

短期持有一个mutex: spin 循环255次一般可以有效以S mode获得一个mutex, 前提是该Mutex 已经被以S mode持有。简单来说若有2个进程同时以S mode去申请一个Mutex，则稍晚的一个申请者需要进入SPIN并等稍早一点的申请者完成它的例如创建针对该mutex的一个reference的操作，但这都是非常迅速的操作。

长期持有一个Mutex: 如若一个Mutex已经被某进程以X mode持有，则往往有其进程以SHRD模式去申请该mutex时仍发现该mutex 以X mode被其他进程所持有，则往往这个EXCL 持有是 LONG_EXCL(可以通过SSD DUMP发现)，则后续的申请者往往要进入spin循环，甚至需要等待

上面我讲了willing-to-wait的mutex，实际上mutex 的申请也可以是 nowait的。进程以nowait申请mutex时不会进入spin-cycle也不sleep，它只继续常规处理。当一个nowait get失败时，将增加一次miss，但是实际上V$MUTEX_SLEEP_*中记录的miss不是这样的miss，视图中记录的miss是等待的次数，对于真正的miss没有统计项。

Wait Time等待时间

类似于latch，spin time 不算做mutex的消耗时间，它只包含等待消耗的时间。

真正理解Mutex相关的等待

Mutex数据结构中存放了Holder id持有者ID ， Ref Count，和其他Mutex相关的统计信息。 Holder id对应于持有该Mutex的session id (v$session.sid) 。特别注意， Ref Count是进程并发以S mode参考该Mutex的进程数量(如下文的演示)。

当一个Mutex被以X mode 持有，则Holder id 为对应持有该mutex的session id，而Ref Count为0。

每一个共享S mode持有者仅仅增加mutex上的Ref Count。可供大量session并发以S mode持有参考一个Mutex。但是注意更新ref count的操作是串行的，这是为了避免错漏并维护mutex中正确的ref count。

下面我们详细介绍一个执行游标过程中对mutex share pin的过程：

某进程以SHRD 模式申请一个Mutex，并尝试临时修改该Mutex的Holder ID
若该Mutex正被他人更新，则该session会将Holder id设置为本session的sid，之后该进程将增加ref count，之后再清楚mutex上的Holder id。简单来说这个Holder id是真正做了并行控制的功能。若该Holder id 被设置了，则说明该Mutex要么被以EXCL模式持有，要么正有一个其他进程在以S mode申请该Mutex的过程中(例如更新Ref Count)。当更新Ref Count时临时设置holder id的目的就是为了实现避免其他进程并发更新该Mutex的机制。通过这些例子说明了 , Mutex既可以用作Latch并发控制，也可用作pin。
若Holder id已被设置，则申请进程将可能进入等待事件。例如若当前Mutex的持有者进程正以X mode更新该Mutex，则申请者的等待事件应为”cursor: pin S on X” 。而若当持有者Holder并不是”真的要持有” 该Mutex，而仅仅是尝试更新其Ref Count,则第二个进程将等在’ Cursor :pin S’等待事件上；实际正在更新Ref count的操作时很快的，是一种轻微的操作。当第一个进程正在更新mutex，则后续的申请进程将进入spin 循环中255次等待前者结束。当mutex上不再有 Holder id时(如前者的进程已经更新完Ref Count)时，则申请者进程将Holder ID设为自身的SID，并更新Ref Count，并清除Holder id。若在255次循环SPIN后mutex仍不被释放，则该进程进入等待并不再跑在CPU上。

Mutex相关的等待事件

cursor: mutex * events等待事件

cursor: mutex * events等待事件用于Cursor Parent 和 Cursor stats类型的操作：
- ‘cursor: mutex X’ ，某个进程申请以EXCL mode持有mutex时进入该等待，该Mutex要么正被其他进程以SHRD模式参考，这导致X mode的申请必须要等待直到Ref count=0，或者该mutex正被另一个进程以X mode持有。
- 相关操作要求以EXCL X mode持有Mutex的：
  - 在一个父游标下创建一个新的子游标
  - 捕获SQL中的绑定变量
  - 更新或构件SQL统计信息V$SQLSTATS
‘Cursor: Mutex S’ ，某个进程以SHRD S mode申请一个Mutex，而该Mutex要么被其他进程已EXCL X mode所持有，要么其他进程正在更新mutex 上的Ref Count。
- 相关类型的操作一般是检测父游标或者CURSOR统计信息数据，此外查询V$SQLSTATS也会造成CURSOR statistics被查询

‘cursor: pin * events’等待事件

该类等待事件一般是为了pin相关的子游标

cursor: pin S 当一个进程以共享pin模式申请一个Mutex，而不能立即获得时，进入cursor: Pin S等待事件。 Mutex Pin是以共享类型的操作，例如执行一个游标。
- 当一个进程等在cursor: pin S上，说明该进程在对一个共享的mutex pin 参考或取消参考时，有其他的进程也正在为同样的cursor heap创建或者取消一个共享Mutex pin。实际上cursor: pin S 等待事件应当很少见，因为更新共享Mutex pin 的reference 应当是很快的。再重复一次，S mode的Mutex可以被并发持有，但是更新Mutex的Ref Count仍需要串行地处理。一旦reference count被增加好，则后续进程将可以为同样的cursor heap增加reference count。因此此处mutex 即可以扮演Latch的角色(串行控制ref count的更新)，又可以扮演pin的角色(ref count本身)。
‘cursor: pin X’ 当一个进程需要以EXCL X mode获得mutex时，这类需要EXCL X 模式的串行操作包括:
- 构建一个子游标
- 某个进程已经以X mode持有该Mutex
- 一个或多个进程正在reference 该Mutex (shared mutex pin)
‘Cursor: pin S on X’ 最常见的等待事件，进程为了共享操作例如执行pin游标而以SHRD S mode申请mutex，但是未立即获得。原因是该游标被其他进程以EXCL X mode 持有了。

Mutex的相关统计视图

V$MUTEX_SLEEP

shows the wait time, and the number of sleeps for each combination of mutex type and location.

Column	Datatype	Description
`MUTEX_TYPE`	`VARCHAR2(32)`	Type of action/object the mutex protects
`LOCATION`	`VARCHAR2(40)`	The code location where the waiter slept for the mutex
`SLEEPS`	`NUMBER`	Number of sleeps for this `MUTEX_TYPE` and `LOCATION`
`WAIT_TIME`	`NUMBER`	Wait time in microseconds

V$MUTEX_SLEEP_HISTORY

displays time-series data. Each row in this view is for a specific time, mutex type, location, requesting session and blocking session combination. That is, it shows data related to a specific session (requesting session) that slept while requesting a specific mutex type and location, because it was being held by a specific blocking session. The data in this view is contained within a circular buffer, with the most recent sleeps shown.

Column	Datatype	Description
`SLEEP_TIMESTAMP`	`TIMESTAMP(6)`	The last date/time this `MUTEX_TYPE` and `LOCATION` was slept for by the`REQUESTING_SESSION`, while being held by the `BLOCKING_SESSION`.
`MUTEX_TYPE`	`VARCHAR2(32)`	Type of action/object the mutex protects
`GETS`	`NUMBER`	The number of times the mutex/location was requested by the requesting session while being held by the blocking session. `GETS` is only incremented once per request, irrespective of the number of sleeps required to obtain the mutex.
`SLEEPS`	`NUMBER`	The number of times the requestor had to sleep before obtaining the mutex
`REQUESTING_SESSION`	`NUMBER`	The SID of a session requesting the mutex
`BLOCKING_SESSION`	`NUMBER`	The SID of a session holding the mutex
`LOCATION`	`VARCHAR2(40)`	The code location where the waiter slept for the mutex
`MUTEX_VALUE`	`RAW(4)`	If the mutex is held in exclusive (X) mode, this column shows the SID of the blocking session, else it shows the number of sessions referencing the mutex in S mode.
`P1`	`NUMBER`	Internal use only
`P1RAW`	`RAW(4)`	Internal use only
`P2`	`NUMBER`	Internal use only
`P3`	`NUMBER`	Internal use only
`P4`	`NUMBER`	Internal use only
`P5`	`VARCHAR2(64)`	Internal use only

接着我们会在环境中模拟cursor pin S wait on X的场景，并通过systemstate dump和v$mutex_sleep , v$mutex_sleep_history等视图观察这一现象

session A:
SQL> select * from v$version;

BANNER
—————————————————————-
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bi
PL/SQL Release 10.2.0.5.0 – Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 – Production
NLSRTL Version 10.2.0.5.0 – Production
www.askmac.cn

SQL> show parameter kks
SQL>
SQL> create table mac_kks tablespace users nologging as select * from dba_objects;

Table created.

SQL> insert /*+ append */ into mac_kks select * from mac_kks;

77386 rows created.

SQL> commit;

Commit complete.

SQL> insert /*+ append */ into mac_kks select * from mac_kks;

154772 rows created.

SQL> commit;

Commit complete.

SQL> insert /*+ append */ into mac_kks select * from mac_kks;

309544 rows created.

SQL> commit;

Commit complete.

SQL> insert /*+ append */ into mac_kks select * from mac_kks;

619088 rows created.

SQL> commit;

Commit complete.

SQL> insert /*+ append */ into mac_kks select * from mac_kks;

1238176 rows created.

SQL> commit;

Commit complete.

SQL> oradebug setmypid
Statement processed.
SQL> oradebug tracefile_name
Statement processed.
SQL> alter table mac_kks add t2 char(2000) default ‘MACLEAN’;

session B:
SQL> oradebug setospid 32424
Oracle pid: 17, Unix process pid: 32424, image: oracle@vrh8.oracle.com (TNS V1-V3)
SQL> oradebug suspend;
Statement processed.

session C:

select * from mac_kks where rownum=1; ==> hang

session D:

select * from mac_kks where rownum=1; ==> hang

session E:

SQL> select sid,event from v$session where wait_class!=’Idle’;

SID EVENT
———- —————————————————————-
141 SQL*Net message to client
145 library cache lock
149 cursor: pin S wait on X
159 log buffer space

SQL> oradebug setmypid
Statement processed.

SQL> oradebug dump systemstate 266;
Statement processed.

SQL> oradebug tracefile_name
/s01/admin/G10R25/udump/g10r25_ora_32537.trc
Object Names
~~~~~~~~~~~~
LOCK: handle=a7115ef0
Mutex 7fff7abadecf

KGX Atomic Operation Log 0x8d88a8d8
Mutex 0x954eaff8(145, 0) idn 7fff7abadecf oper EXCL
Cursor Pin uid 145 efd 0 whr 1 slp 0
opr=3 pso=0x97951af0 flg=0
pcs=0x954eaff8 nxt=(nil) flg=35 cld=0 hd=0xa7864b08 par=0x9523a9e0
ct=0 hsh=0 unp=(nil) unn=0 hvl=9595c3d8 nhv=1 ses=0xa8416738
hep=0x954eb078 flg=80 ld=1 ob=0x95ac6128 ptr=0x8fd90128 fex=0x8fd8f438

0x954eaff8(145, 0) ==> sid和 ref count
pso ==> parent state object
hd=0xa7864b08 ==>cursor 对应的handle address
par ==> 父游标的heap 0 pointer
ses=0xa8416738 ==》一般 EXCL是才有 session address v$session.saddr

SID=145 对Mutex 0x954eaff8 oper EXCL以X mode Hold 该Mutex, SID=145 在等 SYS.MAC_KKS表的library cache lock，该表被X mode pin和lock，而解析SQL要求以S mode lock该表

SID=149 对Mutex 0x954eaff8 申请 oper GET_SHRD， SID=149在等cursor: pin S wait on X

KGX Atomic Operation Log 0x8db79798
Mutex 0x954eaff8(145, 0) idn 7fff7abadecf oper GET_SHRD
Cursor Pin uid 149 efd 0 whr 5 slp 13893
opr=2 pso=0x8e6bd518 flg=0
pcs=0x954eaff8 nxt=(nil) flg=35 cld=0 hd=0xa7864b08 par=0x9523a9e0
ct=0 hsh=0 unp=(nil) unn=0 hvl=9595c3d8 nhv=1 ses=0xa8416738
hep=0x954eb078 flg=80 ld=1 ob=0x95ac6128 ptr=0x8fd90128 fex=0x8fd8f438

SO: 0xa841bd18, type: 4, owner: 0xa830cf98, flag: INIT/-/-/0x00
(session) sid: 149 trans: (nil), creator: 0xa830cf98, flag: (80000041) USR/- BSY/-/-/-/-/-
DID: 0001-0019-00000066, short-term DID: 0000-0000-00000000
txn branch: (nil)
oct: 0, prv: 0, sql: 0x8d88bf90, psql: (nil), user: 0/SYS
service name: SYS$USERS
O/S info: user: oracle, term: pts/5, ospid: 32510, machine: vrh8.oracle.com
program: sqlplus@vrh8.oracle.com (TNS V1-V3)
application name: sqlplus@vrh8.oracle.com (TNS V1-V3), hash value=543908804
waiting for ‘cursor: pin S wait on X’ wait_time=0, seconds since wait started=0
idn=7abadecf, value=9100000000, where|sleeps=500003645
blocking sess=0x(nil) seq=13901
Dumping Session Wait History
for ‘cursor: pin S wait on X’ count=1 wait_time=0.266596 sec
idn=7abadecf, value=9100000000, where|sleeps=500003644
for ‘cursor: pin S wait on X’ count=1 wait_time=0.010679 sec
idn=7abadecf, value=9100000000, where|sleeps=500003643
for ‘cursor: pin S wait on X’ count=1 wait_time=0.010633 sec
idn=7abadecf, value=9100000000, where|sleeps=500003642
for ‘cursor: pin S wait on X’ count=1 wait_time=0.010843 sec
idn=7abadecf, value=9100000000, where|sleeps=500003641
for ‘cursor: pin S wait on X’ count=1 wait_time=0.011008 sec
idn=7abadecf, value=9100000000, where|sleeps=500003640
for ‘cursor: pin S wait on X’ count=1 wait_time=0.010406 sec

SO: 0xa8416738, type: 4, owner: 0xa830bfa8, flag: INIT/-/-/0x00
(session) sid: 145 trans: (nil), creator: 0xa830bfa8, flag: (80000041) USR/- BSY/-/-/-/-/-
DID: 0001-0017-0000008E, short-term DID: 0000-0000-00000000
txn branch: (nil)
oct: 3, prv: 0, sql: 0x8d88bf90, psql: (nil), user: 0/SYS
service name: SYS$USERS
O/S info: user: oracle, term: pts/4, ospid: 32485, machine: vrh8.oracle.com
program: sqlplus@vrh8.oracle.com (TNS V1-V3)
application name: sqlplus@vrh8.oracle.com (TNS V1-V3), hash value=543908804
waiting for ‘library cache lock’ wait_time=0, seconds since wait started=165
handle address=a7115ef0, lock address=978def20, 100*mode+namespace=c9
blocking sess=0x(nil) seq=9
Dumping Session Wait History
for ‘library cache lock’ count=1 wait_time=3.297287 sec
handle address=a7115ef0, lock address=978def20, 100*mode+namespace=c9
for ‘library cache lock’ count=1 wait_time=2.930321 sec
handle address=a7115ef0, lock address=978def20, 100*mode+namespace=c9
for ‘library cache lock’ count=1 wait_time=2.930856 sec
handle address=a7115ef0, lock address=978def20, 100*mode+namespace=c9
for ‘library cache lock’ count=1 wait_time=2.930698 sec
handle address=a7115ef0, lock address=978def20, 100*mode+namespace=c9
for ‘library cache lock’ count=1 wait_time=2.931518 sec
handle address=a7115ef0, lock address=978def20, 100*mode+namespace=c9

SO: 0x978def20, type: 53, owner: 0xa8456058, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=978def20 handle=a7115ef0 request=S
call pin=(nil) session pin=(nil) hpc=0005 hlc=0000
htl=0x978defa0[0x8e4a8950,0x8e4a8950] htb=0x8e4a8950 ssga=0x8e4a7a88
user=a8416738 session=a8416738 count=0 flags=RES/[0010] savepoint=0x1f
LIBRARY OBJECT HANDLE: handle=a7115ef0 mtx=0xa7116020(0) lct=8 pct=10 cdp=0
name=SYS.MAC_KKS
hash=c066b9b9c6c80736a15f5ba325563fdb timestamp=04-14-2013 00:43:14
namespace=TABL flags=KGHP/TIM/SML/[02000000]
kkkk-dddd-llll=0000-0701-0201 lock=X pin=X latch#=3 hpc=0006 hlc=0004
lwt=0xa7115f98[0x978def50,0x978def50] ltm=0xa7115fa8[0xa7115fa8,0xa7115fa8]
pwt=0xa7115f60[0xa7115f60,0xa7115f60] ptm=0xa7115f70[0xa7115f70,0xa7115f70]
ref=0xa7115fc8[0xa7115fc8,0xa7115fc8] lnd=0xa7115fe0[0x9cc2c260,0xa79fd198]
LIBRARY OBJECT: object=95f0e5b0
type=TABL flags=EXS/LOC/UPD[0905] pflags=[0000] status=VALD load=0
DATA BLOCKS:
data# heap pointer status pins change whr
—– ——– ——– ——— —- —— —
0 a78d3840 95f0e708 I/P/A/-/- 0 NONE 00
8 957a6ad8 94461f88 I/P/A/-/- 1 UPDATE 00

Mutex 0x954eaff8 被 SID=145 oper EXCL以X mode Hold 该Mutex

体现为子游标child cursor被以 X mode pin
SO: 0x8e6bd518, type: 53, owner: 0xa841bd18, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=8e6bd518 handle=a7864b08 mode=N
call pin=(nil) session pin=(nil) hpc=0000 hlc=0000
htl=0x8e6bd598[0x8e542f60,0x979da4d0] htb=0x979da4d0 ssga=0x979d96c8
user=a841bd18 session=a841bd18 count=1 flags=CBK[0020] savepoint=0x0
LIBRARY OBJECT HANDLE: handle=a7864b08 mtx=0xa7864c38(0) lct=2 pct=3 cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=X latch#=3 hpc=0002 hlc=0002
lwt=0xa7864bb0[0xa7864bb0,0xa7864bb0] ltm=0xa7864bc0[0xa7864bc0,0xa7864bc0]
pwt=0xa7864b78[0xa7864b78,0xa7864b78] ptm=0xa7864b88[0xa7864b88,0xa7864b88]
ref=0xa7864be0[0x954eb320,0x954eb320] lnd=0xa7864bf8[0xa7864bf8,0xa7864bf8]
LIBRARY OBJECT: object=95ac6128
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DATA BLOCKS:
data# heap pointer status pins change whr
—– ——– ——– ——— —- —— —
0 8d81aa70 95ac6240 I/P/A/-/- 0 NONE 00
6 954eb0f0 8fd90128 I/P/A/-/E 0 NONE 00

并行执行6个SQL语句并做systemstate dump :

[oracle@vrh8 ~]$ grep SHRD /s01/admin/G10R25/udump/g10r25_ora_32716.trc |grep 0x95987388
Mutex 0x95987388(0, 6) idn 7fff8f32fff6 oper SHRD
Mutex 0x95987388(0, 6) idn 7fff8f32fff6 oper SHRD
Mutex 0x95987388(0, 6) idn 7fff8f32fff6 oper SHRD
Mutex 0x95987388(0, 6) idn 7fff8f32fff6 oper SHRD
Mutex 0x95987388(0, 6) idn 7fff8f32fff6 oper SHRD
Mutex 0x95987388(0, 6) idn 7fff8f32fff6 oper SHRD

to be continued ……………………………

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: mutex

cursor: pin X等待事件

2011/08/08 by admin Leave a Comment

‘cursor: pin * events’等待事件

该类等待事件一般是为了pin相关的子游标

‘cursor: pin X’ 当一个进程需要以EXCL X mode获得mutex时，这类需要EXCL X 模式的串行操作包括:

构建一个子游标
某个进程已经以X mode持有该Mutex
一个或多个进程正在reference 该Mutex (shared mutex pin)

如果自己搞不定可以找诗檀软件专业ORACLE数据库优化团队成员帮您调优!

诗檀软件专业数据库优化团队

服务热线： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

详见《深入理解Oracle中的Mutex》一文

Cursor: pin X

A session waits on this event when it needs to get a mutex in X mode for serialized operations such as building a cursor.
When performing exclusive operation on cursor like dbms_shared_pool.purge, keep/unkeep etc.
Mutex is in the cursor or child cursor.

A session waits on this event when it is requesting an exclusive mutex pin for a cursor object and it must wait because the resource is busy. The mutex pin for a cursor object can be busy either because a session is already holding it exclusive, or there are one or more sessions which are holding shared mutex pin(s). The exclusive waiter must wait until all holders of the pin for that cursor object have released it, before it can be granted.

Solutions

Solutions for mutex wait problems are very similar to each other. See Analysis of Waits on Mutexes for more information and troubleshooting ideas.

Filed Under: Oracle Tagged With: mutex

详解cursor: pin S wait on X等待事件

2011/08/08 by admin Leave a Comment

‘cursor: pin * events’等待事件

该类等待事件一般是为了pin相关的子游标

‘Cursor: pin S on X’ 最常见的等待事件，进程为了共享操作例如执行pin游标而以SHRD S mode申请mutex，但是未立即获得。原因是该游标被其他进程以EXCL X mode 持有了。

实际该 cursor: pin S wait on X等待事件往往是由于其他因素诱发的。Mutex争用仅仅是问题的症状，但根本原因需要Database Consultant 进一步挖掘。

下面我们列出一些已知的常见案例，在这些例子中可以看到我上面提到的 Mutex的争用仅仅是伪争用:

过多的子游标 High Version Counts

过多的子游标版本Version Count可能导致Mutex 争用，一般一个SQL的Version Count不要高于500。

检查High Version Count很简单，在AWR里就有SQL ordered by High Version Count，也可以写SQL查V$SQL、V$SQLAREA

昂贵的X$、V$视图查询

一些对于V$、X$视图的查询，需要访问X$KGL*之类的fixed table，可能触发Mutex争用。

Mutex持有者得不到CPU

Mutex持有者若得不到足够的CPU片可能一直阻塞他人，直到它拿到需要的CPU。

这种情况可能由于OS操作系统的实际情况或者使用Resource Manager而引起。需要配合AWR中的Host CPU、Instance CPu一起看。

已经被KILLED的SESSION仍持有Mutex

当session正持有Mutex，而其对应的Process被强制KILL掉，则直到PMON彻底清理掉该Dead Process并释放Mutex，其他session才能不再等待。诊断该类问题，最好能检查PMON的TRACE。当然也存在部分BUG会导致PMON清理过程非常慢。

举例来说，bug 9312879描述了一种场景：PMON 需要获得某个Mutex以便清理某个dead process，但是该Mutex又被其他进程持有，则PMON甚至无法开始真正清理并释放Mutex。

详见《深入理解Oracle中的Mutex》一文

如何模拟 cursor pin S wait on X 等待事件

session A:
SQL> create table emp(t1 int);

Table created.

SQL> var b1 number;
SQL> begin :b1 := 7902; end;
2 /

SQL> select * from emp where t1=:b1;

no rows selected

oracle@localhost:~$ ps -ef|grep LOCAL
oracle 11734 11723 0 05:05 ? 00:00:00 oraclecdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11751 11748 0 05:06 ? 00:00:00 oraclecdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11757 11754 0 05:06 ? 00:00:00 oraclecdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11785 11760 0 05:06 pts/3 00:00:00 grep LOCAL
oracle@localhost:~$
oracle@localhost:~$ gdb `which oracle` 11734

(gdb) break kxsPeekBinds
Breakpoint 1 at 0x417e870
(gdb) commands
Type commands for when breakpoint 1 is hit, one per line.
End with a line saying just “end”.
>bt 4
>end
(gdb) cont
Continuing.

打开sessio B 和session C
session B:
SQL> alter system flush shared_pool;

System altered.
此时session A执行之前的语句
select * from emp where t1=:b1; ==> 被gdb break住

GDB显示breakpoint kxsPeekBinds,窥视绑定值

Breakpoint 1, 0x000000000417e870 in kxsPeekBinds ()
#0 0x000000000417e870 in kxsPeekBinds ()
#1 0x00000000047e5b52 in opitca ()
#2 0x0000000003ee67e9 in kksSetBindType ()
#3 0x0000000003ef80fa in kksfbc ()
session C:
SQL> conn c##maclean/oracle
Connected.
SQL> var b1 number;
SQL> begin :b1 := 7902; end;
2 /

PL/SQL procedure successfully completed.

SQL> select * from emp where t1=:b1;

==》等在cursor pin S on X上

sessioD 验证：

select event from v$session where wait_class!=’Idle’;
EVENT
—————————————————————-
cursor: pin S wait on X

SQL> select spid from v$process where addr=(select paddr from v$SESSION where sid=43);

SPID
————————
12318

SQL> oradebug setospid 12318;
Oracle pid: 40, Unix process pid: 12318, image: oracle@localhost.localdomain (TNS V1-V3)
SQL> oradebug short_stack;
ksedsts()+313<-ksdxfstk()+34<-ksdxcb()+886<-sspuser()+148<-__sighandler()<-__select()+19<-skgpwwait()+230<-kgxWait()+1486<-kgxSharedExamine()+352<-kxsGetRuntimeLock()+254<-kkscsCheckCursor()+555<-kkscsSearchChildList()+1772<-kksfbc()+11554<-kkspsc0()+1296<-kksParseCursor()+118<-opiosq0()+1577<-kpooprx()+289<-kpoal8()+737<-opiodr()+929<-ttcpip()+1230<-opitsk()+1638<-opiino()+883<-opiodr()+929<-opidrv()+589<-sou2o()+120<-opimai_real()+146<-ssthrdmain()+392<-ssthrdMain_e()+18<-main()+220<-__libc_start_main()+244
kkscsCheckCursor=》kxsGetRuntimeLock=》kgxSharedExamine=》kgxWait
col mutex_type for a20
col location for a20
SQL> select * from v$mutex_sleep;

MUTEX_TYPE LOCATION SLEEPS WAIT_TIME CON_ID
——————– ——————– ———- ———- ———-
Cursor Pin kkslce [KKSCHLPIN2] 183369 161405107 0

SQL> /

MUTEX_TYPE LOCATION SLEEPS WAIT_TIME CON_ID
——————– ——————– ———- ———- ———-
Cursor Pin kkslce [KKSCHLPIN2] 191003 161405107 0

SQL> /

MUTEX_TYPE LOCATION SLEEPS WAIT_TIME CON_ID
——————– ——————– ———- ———- ———-
Cursor Pin kkslce [KKSCHLPIN2] 193763 161405107 0

这些location 代码存放在kksc中

col requesting_session for 9999
col blocking_session for 9999
select mutex_identifier, requesting_session REQ,
blocking_session BLOCKING,location
from v$mutex_sleep_history;

MUTEX_IDENTIFIER REQ BLOCKING LOCATION
—————- ———- ———- ——————–
2970438836 43 1 kkslce [KKSCHLPIN2]

SQL> select sql_text from v$sqlarea where hash_value =’2970438836′;

SQL_TEXT
——————————————————————————–
select * from emp where t1=:b1
SQL> conn c##maclean/oracle
Connected.
SQL> select * from emp where t1=:b1;

SQL> var b1 number;
SQL> begin :b1 := 7902; end;
2 /

PL/SQL procedure successfully completed.

SQL> select * from emp where t1=:b1; ==》也HANG住

再打开一个session :
SQL> oradebug setmypid
Statement processed.

SQL> oradebug dump systemstate 266;
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/cdb1/cdb1/trace/cdb1_ora_12702.trc

下面是父游标”select * from emp where t1=:b1″的 library cache dump，他只有一个child Handle=0x76aef758
SO: 0x7d8c28b0, type: 90, owner: 0x8496c6e8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x84cca8f8, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8739, pg=0 pdbuid=1

LibraryObjectLock: Address=0x7d8c28b0 Handle=0x7b1ee490 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0

User=0x8496c6e8 Session=0x8496c6e8 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=5168f05b
LibraryHandle: Address=0x7b1ee490 Hash=b10d4cb4 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=select * from emp where t1=:b1
FullHashValue=3f0fd3f68247850db833fe62b10d4cb4 Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=1 Identifier=2970438836 OwnerIdn=101
Statistics: InvalidationCount=0 ExecutionCount=0 LoadCount=2 ActiveLocks=2 TotalLockCount=2 TotalPinCount=1
Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=1 Version=0 BucketInUse=1 HandleInUse=1 HandleReferenceCount=0
Concurrency: DependencyMutex=0x7b1ee540(0, 1, 0, 0) Mutex=0x7b1ee5d0(0, 27, 0, 0)
Flags=RON/PIN/TIM/PN0/DBN/[10012841]
WaitersLists:
Lock=0x7b1ee520[0x7b1ee520,0x7b1ee520]
Pin=0x7b1ee500[0x7b1ee500,0x7b1ee500]
LoadLock=0x7b1ee578[0x7b1ee578,0x7b1ee578]
Timestamp: Current=04-13-2013 05:42:51
HandleReference: Address=0x7b1ee660 Handle=(nil) Flags=[00]
LibraryObject: Address=0x7e0c7bc8 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
ChildTable: size=’16’
Child: id=’0′ Table=0x7e0c8a60 Reference=0x7e0c84d8 Handle=0x76aef758
NamespaceDump:
Parent Cursor: sql_id=bhczycashum5n parent=0x7e0c7c90 maxchild=1 plk=y ppn=n

下面是子游标 child cursor的library cache dump，可以看到它的PinMode=X 即他被PIN住了
SO: 0x77c888e0, type: 90, owner: 0x84920248, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x84ccff88, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8739, pg=0 pdbuid=1

LibraryObjectLock: Address=0x77c888e0 Handle=0x76aef758 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0
Context=0x7f3b9f18a140 User=0x84920248 Session=0x84920248 ReferenceCount=1
Flags=CBK/[0020] SavepointNum=0
LibraryHandle: Address=0x76aef758 Hash=0 LockMode=N PinMode=X LoadLockMode=0 Status=VALD
Name: Namespace=SQL AREA(00) Type=CURSOR(00)
Statistics: InvalidationCount=0 ExecutionCount=0 LoadCount=1 ActiveLocks=2 TotalLockCount=2 TotalPinCount=3
Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=0 Version=0 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
Concurrency: DependencyMutex=0x76aef808(0, 0, 0, 0) Mutex=0x7b1ee5d0(0, 27, 0, 0)
Flags=RON/PIN/PN0/EXP/CHD/[10012111]
WaitersLists:
Lock=0x76aef7e8[0x76aef7e8,0x76aef7e8]
Pin=0x76aef7c8[0x76aef7c8,0x76aef7c8]
LoadLock=0x76aef840[0x76aef840,0x76aef840]
LibraryObject: Address=0x7e3a10c0 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
DataBlocks:
Block: #=’0′ name=KGLH0^b10d4cb4 pins=0 Change=NONE
Heap=0x78f80a48 Pointer=0x7e3a1188 Extent=0x7e3a1030 Flags=I/-/P/A/-/-
FreedLocation=0 Alloc=2.679688 Size=3.976562 LoadTime=4318064770
Block: #=’6′ name=SQLA^b10d4cb4 pins=0 Change=NONE
Heap=0x7e0c82a8 Pointer=0x7bae6bf8 Extent=0x7bae5fc0 Flags=I/-/P/A/-/E
FreedLocation=0 Alloc=0.000000 Size=0.000000 LoadTime=0
NamespaceDump:
Child Cursor: Heap0=0x7e3a1188 Heap6=0x7bae6bf8 Heap0 Load Time=04-13-2013 05:42:51 Heap6 Load Time=04-13-2013 05:42:51

KGLH0^b10d4cb4 =》　b10d4cb4= 2970438836 即 KGLH0^2970438836

下面我们列出一些Mutex相关的BUG:

Bug 5399325 : Mutex latch spin causes high cpu on non-CAS platforms
Bug 5443568 : Excess cpu from concurrent calls to kksclearcursorstat()
Bug 5594520 : Some columns in the mutex V$ views have trailing null characters.
Bug 5522582 : Ora-600 [kxsGetRuntimeLock2] might be encountered under rare circumstances.
Bug 4755704 : Incorrect V$SQL statistics when mutex feature enabled.
Bug 5485914 : Self deadlock with ‘cursor: pins S wait on X’ when trying to explain / tracing a remote
mapped cursor.
Bug 5907779 : Self deadlock with ‘cursor: pins S wait on X’ when executing dbms_stats under
10.2.0.x and where the deadlocked cursor involves recursive dictionary SQL.
Bug 5983020 : Documents a rare case where we might see the ‘cursor: pin S wait on X’ wait event.
Bug 5476091 : Interrupt support added to the mutex feature.
Bug 6143117 : Ora-600 [kgllockownerslistdelete] could be raised in early 11.1 releases because of
incorrect use of SL_WRITE_BARRIER in the mutex code.
Bug 6753486 : PMON can spin trying to acquire a mutex during process cleanup.
Bug 6523644 : Latch violation might be seen with the mutex feature.
Bug 7202166 : The view V$MUTEX_SLEEP is broken in early 11g releases.
Bug 7211086 : The view V$MUTEX_SLEEP_HISTORY can return a null LOCATION column.
Bug 6904068 : High CPU usage possible for “cursor: pin S” waits.
Bug 7234778 : Excessive “cursor: pin S wait on X” waits when no X mutex holder.
Bug 5928271 : Mutex performance improvement for 11g’s use of librarycache (KGL) mutexes.
Bug 7462072 : Waits on “cursor: pin S wait on X” for the same reasons as bug 7234778. This is the
10g fix.
Bug 7226463 : Use of execute immediate can lead to mutex waits under pre-11g releases if an
exception is hit.
Note 580273.1 states that mutex waits might be seen under HP-UX Itanium because of a problem
with the operating system’s pw_wait() call.
Bug 5500044 : DDL on partitioned tables with CLOB or BLOB columns might lead to waits on
‘cursor: pin X’ for a pseudo cursor.
Bug 7585574 : The view v$mutex_sleep might not return all the rows that it should.
Bug 7254221 : Library cache mutex usage doesn’t record wait_time.
Bug 7568642 : states that the BLOCKING_SESSION column in v$active_session_history and
v$session is unreliable when there is a mutex blocker. Bug 5159195 is another such case.
Bug 7307972 covers a case where there were lots of waits on “library cache: mutex X” with waits on
the “kglpin1 4” subclass.
Bug 8426816 is a problem with pre-11g releases where a dead process holding a mutex might not be
cleaned correctly by PMON.
Bug 8508078 covers a case where incorrect SQL, executed frequently enough, can result in mutex
related waits being seen.
Bug 8573601 corrects the v$mutex_sleep wait times so that they are correctly reported.
Bug 7441165 makes mutex critical sections non-preemptable if the OS supports this.
Bug 9066130 documents a case where one symptom might be lots of waits on ‘cursor: mutex S’.
Bug 8518755 covers a related problem with the yield implementation that affects mutex usage (see
Bug 8694205 covers for one such case). The 8518755 fix was slightly amended by the fix for Bug
9169219 which itself was slightly amended by Bug 9440214 .
Bug 9188470 states that library cache mutexes might not be released in kgllkde under certain cases
and that we could see incorrect deadlock detection when things are in flux.
Bug 8860198 covers a case where there was high contention on library cache mutexes (“kglpin1 4”)
when the XMLTYPE type was used.
Bug 8499043 says that this bug fix can reduce contention on “library cache: mutex X”. The fix changes
the way in which context kgl objects are handled under RAC.
Bug 9069388 makes it easier to correlate the SIDs of dead processes with the mutex SIDs reported in
state object dumps and derivable from v$ tables (v$ wait information).
Bug 9312879 mutex contention might be seen when a dead process holds the mutex and PMON
cannot deal with the dead process in a timely fashion.
Bug 8431767 mutex contention might be seen when using application contexts.
Bug 5949171 library cache mutex contention might be seen when using Streams Apply feature.
Bug 9499302 requested that the instruction count taken in the mutex code is reduced.
Bug 9821207 adds markhot and unmarkhot to dbms_shared_pool.
Bug 9530750 improves performance of cursor build locks to reduce “library cache: mutex X” waits in
releases from 11.2.0.1 onwards. This is fixed in 12.1.
Bug 8890002 covers the case where we need to implement the pause instruction on x86-64 and ia-64
platforms. This is recommended if high CPU issues are seen on these platforms with mutex contention
also being observed. (IA-64 can be Linux Itanium or HP-UX Itanium).
The fix will allow the CPU to do some useful work if we are seeing lots of waits.
Bug 8625030 documents that DDL on partitions can lead to high waits on ‘library cache: mutex X’.
Bug 9591812 states that some ‘cursor: mutex S’ waits should actually be ‘cursor: mutex X’ waits under
11.2 to 12.1.

A session waits on this event when requesting a mutex for sharable operations related to pins (such as executing a cursor), but the mutex cannot be granted because it is being held exclusively by another session (which is most likely parsing the cursor). This is a replacement for library cache pin waits in prior versions of Oracle.

Solutions

Solutions for mutex wait problems are very similar to each other. See Analysis of Waits on Mutexes for more information and troubleshooting ideas.

Filed Under: Oracle Tagged With: mutex

cursor: pin S等待事件

2011/08/08 by admin Leave a Comment

‘cursor: pin * events’等待事件

该类等待事件一般是为了pin相关的子游标

cursor: pin S 当一个进程以共享pin模式申请一个Mutex，而不能立即获得时，进入cursor: Pin S等待事件。 Mutex Pin是以共享类型的操作，例如执行一个游标。
- 当一个进程等在cursor: pin S上，说明该进程在对一个共享的mutex pin 参考或取消参考时，有其他的进程也正在为同样的cursor heap创建或者取消一个共享Mutex pin。实际上cursor: pin S 等待事件应当很少见，因为更新共享Mutex pin 的reference 应当是很快的。再重复一次，S mode的Mutex可以被并发持有，但是更新Mutex的Ref Count仍需要串行地处理。一旦reference count被增加好，则后续进程将可以为同样的cursor heap增加reference count。因此此处mutex 即可以扮演Latch的角色(串行控制ref count的更新)，又可以扮演pin的角色(ref count本身)。

详见《深入理解Oracle中的Mutex》一文

cursor: pin S 相关的AWR报告。

A session waits on this event when it wants to update a shared mutex pin and another session is currently in the process of updating a shared mutex pin for the same cursor object. This wait event should rarely be seen because a shared mutex pin update is very fast.

Solutions

Solutions for mutex wait problems are very similar to each other. See Analysis of Waits on Mutexes for more information and troubleshooting ideas.

Filed Under: Oracle Tagged With: mutex

cursor: mutex X等待事件

2011/08/08 by admin Leave a Comment

cursor: mutex * events等待事件用于Cursor Parent 和 Cursor stats类型的操作：

‘cursor: mutex X’ ，某个进程申请以EXCL mode持有mutex时进入该等待，该Mutex要么正被其他进程以SHRD模式参考，这导致X mode的申请必须要等待直到Ref count=0，或者该mutex正被另一个进程以X mode持有。
相关操作要求以EXCL X mode持有Mutex的：
- 在一个父游标下创建一个新的子游标
- 捕获SQL中的绑定变量
- 更新或构件SQL统计信息V$SQLSTATS

详见《深入理解Oracle中的Mutex》一文

如果自己搞不定可以找诗檀软件专业ORACLE数据库优化团队成员帮您调优!

诗檀软件专业数据库优化团队

服务热线： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

Cursor: Mutex X

Building a new cursor under a parent
- Although this operation is cheaper, building many cursors under a parent cursor is not recommended.
Capture SQL bind data
Build or Update statistics blocks
Mutex is in the parent cursor.

A session waits on this event when it has requested a mutex for a cursor object in exclusive mode but must wait because it is busy. The mutex is busy because either the mutex is being held in exclusive mode by another session or the mutex is being held shared by one or more sessions. The existing mutex holder(s) must release the mutex before the mutex can be granted exclusively.

Solutions

Solutions for mutex wait problems are very similar to each other. See Analysis of Waits on Mutexes for more information and troubleshooting ideas.

Filed Under: Oracle Tagged With: mutex

cursor: mutex S等待事件

2011/08/08 by admin Leave a Comment

cursor: mutex * events等待事件

cursor: mutex * events等待事件用于Cursor Parent 和 Cursor stats类型的操作：

‘Cursor: Mutex S’ ，某个进程以SHRD S mode申请一个Mutex，而该Mutex要么被其他进程已EXCL X mode所持有，要么其他进程正在更新mutex 上的Ref Count。

相关类型的操作一般是检测父游标或者CURSOR统计信息数据，此外查询V$SQLSTATS也会造成CURSOR statistics被查询

对父游标parent cursor的检测:

当找到一个游标并准备执行前需要对父游标进行检测。在这个对父游标的检测过程中要用到MUTEX:cursor: mutex S.。
当一个父游标下有很多个子游标时，则服务进程要遍历这些父游标下的子游标以便找到合适的子游标child cursor。
检测的mutex在parent cursor父游标中。

详见《深入理解Oracle中的Mutex》一文

如果自己搞不定可以找诗檀软件专业ORACLE数据库优化团队成员帮您调优!

诗檀软件专业数据库优化团队

服务热线： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

Parent examination

When finding a cursor to execute, the parent must be examined. The examination of the parent is performed using the mutex, cursor: mutex S.
When the parent cursor has many child cursors involved, this waits will come as the server process has to traverse the entire list of child cursors under the parent to find a match.
Mutex is in the parent cursor.

A session waits on this event when it is requesting a mutex in shared mode, when another session is currently holding a this mutex in exclusive mode on the same cursor object.

Solutions

Solutions for mutex wait problems are very similar to each other. See Analysis of Waits on Mutexes for more information and troubleshooting ideas.

Filed Under: Oracle Tagged With: mutex

Waiting For KKSFBC CHILD COMPLETION?

2011/05/09 by admin 4 Comments

客户有一套AIX 5.3上的10.2.0.4.5生产库系统，最近频繁出现”KKSFBC CHILD COMPLETION”等待，同时导致session不断spin消耗CPU并hang住，从表象看这似乎是由bug引起的。以KKSFBC CHILD COMPLETION为关键字到MOS查询可以找到<Bug 6795880 – Session spins / OERI after ‘kksfbc child completion’ wait – superceded [ID 6795880.8]>,该Bug的症状为进程不断spin且hang住、出现’KKSFBC CHILD COMPLETION’等待事件、还可能伴有’Waits for “cursor: pin S”‘等待事件，直接影响的版本有11.1.0.6、10.2.0.3和10.2.0.4。

对于该Bug的描述是在发生’kksfbc child completion’等待事件后会话陷入无休止的自旋(spins)中，这种自旋(spins)发生在由堆栈调用(stack call)kksSearchChildList->kkshgnc陷入对kksSearchChildList函数的无限循环中。

就当前用户提供的版本号及等待事件信息仍不足以定位到该Bug，我们需要更详细的stack call。所幸的是这个trouble是可以重现的(reproduceable),在之后的一次案发现场我们得到了必要的信息:

Name            PID  CPU%  PgSp Owner
oracle      3723390  10.0   7.0 oracle

SQL> oradebug setospid 3723390
Oracle pid: 155, Unix process pid: 3723390, image: oracle@lmsdb1

SQL> oradebug unlimit;
Statement processed.

SQL> oradebug short_stack;
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044F0
<-kksfbc+0b9c<-kkspsc0+0ffc<-kksParseCursor+00d4
<-opiosq0+0ae0<-kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124
<-opiino+0990<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

SQL> oradebug dump processstate 10;
Statement processed.

SQL> oradebug dump systemstate 266;
Statement processed.

=========================process state dump=========================

    SO: 7000003f72e3378, type: 4, owner: 7000003f225c798, flag: INIT/-/-/0x00
    (session) sid: 270 trans: 0, creator: 7000003f225c798, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0001-009B-0000017F, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 700000243ef9540, psql: 7000002fe76d9c0, user: 28/OLSUSER
    service name: SYS$USERS
    O/S info: user: newprepay, term: unknown, ospid: 1234, machine: newprepayC
              program: JDBC Thin Client
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'kksfbc child completion' blocking sess=0x0 seq=4397 wait_time=48840 seconds since wait started=144918
                =0, =0, =0
    Dumping Session Wait History
     for 'kksfbc child completion' count=1 wait_time=48840
                =0, =0, =0

可以从以上trace中看到会话确实曾长时间处于’kksfbc child completion’等待中，之后陷入无限自旋(spins)中消耗了大量CPU时间。但这里实际的表现又存有差异，引发无限循环的函数是kksfbc而不是kksSearchChildList(常规的调用序列是:kksParseCursor->kkspsc0->kksfbc ->kksSearchChildList->kkshgnc)。

kksfbc意为K[Kernel]K[Kompile]S[Shared]F[Find]B[Best]C[Child]该函数用以在软解析时找寻合适的子游标，在10.2.0.2以后引入了mutex互斥体来取代原有的Cursor Pin机制，Mutex较Latch更为轻量级。著名的Tanel Poder指出虽然mutex的引入改变了众多cursor pin的内部机制，但kksfbc仍需要持有library cache latches才能扫描library cache hash chains(However the traversing of library cache hash chains (the right child cursor lookup using kksfbc()) was still protected by library cache latches)。另一方面当kksfbc函数针对某个parent cursor找到合适child cursor后，可能使用KKSCHLPINx方法将该child cursor pin住，这个时候就需要exclusive地持有该child cursor相应的mutex，如:

SQL>  select mutex_addr,requesting_session rs,blocking_session bs,location from x$mutex_sleep_history;

MUTEX_ADDR               RS         BS LOCATION
---------------- ---------- ---------- ----------------------------------------
00000000A3CF8928        159        148 kksfbc [KKSCHLPIN1]
00000000A3CEA8A8        159          0 kksfbc [KKSCHLPIN1]
00000000A3CF5508        159          0 kksfbc [KKSCHLPIN1]
00000000A3CF10E8        148        159 kksLockDelete [KKSCHLPIN6]

当有一个进程执行kksfbc，而其他进程可能需要陷入’kksfbc child completion’等待中(更多的是cursor:pin S等待事件)，但这种等待一般是十分轻微的，你很难从某个”仪表”上观察到这一事件。因为一系列kksfbc相关的bug仅发生在10.2.0.2以后，可以猜测是由于mutex的引入引起的。

打破’kksfbc child completion’异常等待的一种行之有效的workaround方法是设置隐藏参数_use_kks_mutex_pin为false,即使用传统的由latch保护的Cursor pin；设置该hidden parameter需要重启实例，具体的设置方法如下:

alter system set "_kks_use_mutex_pin"=false scope=spfile;

restart instance...........

但这种方法也仅仅是行之有效，而非万试万灵。从消极的角度来说，因为放弃了mutex保护cursor pin的机制，在解析频繁的系统中library cache latch的争用将白热化，有可能成为Top 5等待事件。

另一种值得推荐的workaround是减少硬编码的SQL语句，使用session_cached_cursor减少实际的软解析操作也有助于避免引发该Bug；当然这要求我们修改代码，但从长远来看这种改良是值得的。

~~最后Oracle在10.2.0.4上提供了该Bug的one-off Patch 8575528,其在10.2.0.4 psu4以后的等价补丁为(Equivalent patch)为merge patch 9696904：~~

8557428

9696904 7527908

Both fixes are needed. 6795880 superceded by 8575528 in 9696904 which includes extra files so may cause new conflicts

但merge patch 9696904目前仅有Linux x86/64平台上的版本，而问题数据库所在平台为IBM AIX on POWER Systems (64-bit)。如果要通过补丁来fix这个问题的话，AIX平台的用户可以要求Oracle development开发部门build一个Power版的9696904 patch，也可以升级到10.2.0.5上去；

注意Patch 8575528: MISSING ENTRIES IN V$MUTEX_SLEEP.LOCATION目前有IBM AIX on POWER Systems (64-bit)平台上10.2.0.4.4的版本，经过验证该one-off patch可以在10.2.0.4 PSU4以后的版本上实施(包括10.2.0.4.5/10.2.0.4.6等)，不需要如以上描述地去apply 9696904这个merge patch。当然升级到10.2.0.5/11.1.0.7或以上版本依然有效。

此外apply以上8575528:后需要修改隐藏参数”_cursor_features_enabled”并重启才能使fix生效，具体设置防范如下:

ALTER SYSTEM SET "_cursor_features_enabled"=10 scope=spfile;
restart instance.............

注意以上参数仅在使用one-off patch时需要设置，而当通过升级到10.2.0.5/11.1.0.7来修复问题时无需设置上述”_cursor_features_enabled”参数。

Filed Under: Oracle, Oracle补丁patch Tagged With: 10.2.0.4, cursor pin, latch, mutex, One-off patch, psu

分析Oracle Mutex等待事件

2010/08/08 by admin 5 Comments

Troubleshooting or analyzing waits on mutex events, i.e. ones that start with “cursor:”, are very similar to each other and below are some guidelines.

There are many bugs related to problems using mutexes in Oracle 10g, the first version where mutexes were introduced. If you are seeing excessive waits on mutexes, a quick fix may be to go back to the library cache latching mechanism. To do this, you can set the parameter _kks_use_mutex_pin=false.

Note: For mutex related waits, v$session.blocking_session is not populated in 10.2. However, v$session.blocking_session is populated in 11g R1.

To do further analysis, the P2RAW column in v$session gives the blocking session, i.e. the holder SID of the mutex in the upper 8 bytes. It is in hex so it needs to be converted in decimal. The following query provides the blockers with the most sessions blocked behind it as well as converts the P2RAW to an usable SID:

select p2raw, to_number(substr(to_char(rawtohex(p2raw)), 1, 8), ‘XXXXXXXX’) sid, count(1) sessions_waiting
from v$session
where event = ‘cursor: pin S wait on X’
group by p2raw, to_number(substr(to_char(rawtohex(p2raw)), 1, 8), ‘XXXXXXXX’);

P2RAW                      SID    SESSIONS_WAITING
—————-            —                  —————-
0000001F00000000    31                                    79

On 64 bit platforms 8 bytes are used, so the top 4 bytes are the holding session Id (if the mutex is held X), and the bottom 4 bytes are the ref count (if the mutex is held S). On 32 bit platforms 4 bytes are used, so the top 2 bytes are the holding session Id (if the mutex is held X) and the bottom 2 bytes are the ref count (if the mutex is held S).

p1 = The mutex Id
This has the same definition as v$mutex_sleep_history.mutex_identifier

p2raw = Holding Session Id | Ref Count

The most significant bytes always store the Holding Session Id (Holding SID). The least significant bytes always store the Ref Count.

Filed Under: Oracle Tagged With: mutex

single-task message and cursor pin S on X

2010/06/06 by admin 1 Comment

Recently i find many session are waiting for Event <cursor pin s on x> in my customer’s AIX 10.2.0.4 instance, it’s a new event in the 10gr2,after mutex take place of latch .

When a session is going to parse or reparse one cursor, need to pin X this cursor first. Then the others who what to execute this child cursor have to wait until pin X released.

I looked for the session who was pining that cursor on X mode ,and saw it was waiting on event single-task messag:
the mutex idn is d322e445,and operated in EXCL mode; It correspond to the cursor whose hash is “ff2c2fd459ac5ee188586781d322e445” as after:

PROCESS 1753:
----------------------------------------
SO: 70000048e9d5e68, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=1753, calls cur/top: 70000043308cb08/70000043798a6e0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 70000048f8deaf0 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 70000048f8deaf0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000048ca877c0
O/S info: user: orauser, term: UNKNOWN, ospid: 3318124
OSD pid info: Unix process pid: 3318124, image: oracle@p595crm1
.........................
----------------------------------------
SO: 70000048eed0d30, type: 4, owner: 70000048e9d5e68, flag: INIT/-/-/0x00
(session) sid: 3311 trans: 70000046be86948, creator: 70000048e9d5e68, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-06D9-008347C7, short-term DID: 0001-06D9-008347C8
txn branch: 70000046a261488
oct: 3, prv: 0, sql: 700000414e4f570, psql: 700000415f47a90, user: 50/SHUCRM1C
service name: CRMDB1
O/S info: user: pausr12, term: unknown, ospid: 1234, machine: p570web2
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
waiting for 'single-task message' blocking sess=0x0 seq=22963 wait_time=0 seconds since wait started=6
=0, =0, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=325
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=383
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=265
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=677
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=0
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=237
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
temporary object counter: 0
SO:  70000043e695968, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043e695968 handle=700000410dd46c8 mode=N                           --the child cursor
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043e6959e8[70000041e9eb830,70000042411c178] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=CBK[0020]  savepoint=0x0
LIBRARY OBJECT HANDLE: handle=700000410dd46c8 mtx=700000410dd47f8(0)  cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001  lock=N pin=X latch#=23 hpc=fffc  hlc=fffc                                     --lock  in NULL mode,pin on X mode
lwt=700000410dd4770[700000410dd4770,700000410dd4770]  ltm=700000410dd4780[700000445169a08,700000453b69228]
pwt=700000410dd4738[700000410dd4738,700000410dd4738]  ptm=700000410dd4748[700000410dd4748,700000410dd4748]
ref=700000410dd47a0[70000043d1b5e58,70000043d1b5e58]  lnd=700000410dd47b8[700000410dd47b8,700000410dd47b8]
LIBRARY OBJECT: object=7000004354983b0
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
TRANSLATIONS: count=2 size=16
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004359e4768 7000004354984c8 I/P/A/-/-    0 NONE   00
6 700000414b0e4d0 700000450c0bc28 I/P/A/-/E    0 NONE   00

----------------------------------------

KGX Atomic  Operation Log 70000041853c068
Mutex 700000414b0e3d8(3311, 0) idn d322e445 oper EXCL
Cursor Pin uid 3311 efd 0 whr 1 slp  0      -- pin sid 3311
opr=3 pso=70000043e695968  flg=0            -- operate code is 3
pcs=700000414b0e3d8 nxt=0 flg=35 cld=0 hd=700000410dd46c8   par=70000041d78b1e0    -- child cursor handle address is 700000410dd46c8
ct=0 hsh=0 unp=0 unn=0 hvl=1d78b4b8 nhv=1  ses=70000048eed0d30                              -- heap 0 pointer address 70000041d78b1e0
hep=700000414b0e458 flg=80 ld=1 ob=7000004354983b0 ptr=700000450c0bc28  fex=700000450c0af38
----------------------------------------
SO: 70000041e9eb7b0, type: 53, owner: 70000048eed0d30, flag:  INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000041e9eb7b0 handle=700000414e4f570 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000041e9eb830[70000044590d030,70000043e6959e8] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=[0000]  savepoint=0x4c08a856
LIBRARY OBJECT HANDLE: handle=700000414e4f570 mtx=700000414e4f6a0(0)  cdp=1
name=
SELECT * FROM(SELECT A.*, rownum r FROM( select  account_id,billing_nbr,calling_nbr,called_nbr,to_date,call_duration,charge_item_name,to_char(rate  /100, '99999999999990.99') rate, charge,channel_id from ct_05
where  1=1 and account_id ='300187744'  and key_source_type ='52001'  order by to_date ) A WHERE rownum <= 15 ) B WHERE r > 0
hash=ff2c2fd459ac5ee188586781d322e445 timestamp=06-04-2010 15:14:33
namespace=CRSR flags=RON/KGHP/TIM/PN0/MED/DBN/[50010040]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=23 hpc=0018 hlc=0018
lwt=700000414e4f618[700000414e4f618,700000414e4f618]  ltm=700000414e4f628[700000414e4f628,700000414e4f628]
pwt=700000414e4f5e0[700000414e4f5e0,700000414e4f5e0]  ptm=700000414e4f5f0[700000414e4f5f0,700000414e4f5f0]
ref=700000414e4f648[700000414e4f648,700000414e4f648]  lnd=700000414e4f660[700000414e4f660,700000414e4f660]
LIBRARY OBJECT: object=70000041d78b0c8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child#    table reference   handle
------ -------- --------- --------
0 70000043d1b61e8 70000043d1b5e58 700000410dd46c8                                 -- the child cursor
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004467f6078 70000041d78b1e0  I/P/A/-/-    0 NONE   00                                      --  heap 0

The session is active,and its wait history indicates there are some dblink operation inside the running SQL, When this session waited to reparse the only child cursor for that SQL , it pin this child cursor in X mode first,and then it need more information from remote DB , but no response until SQL*NET MESSAGE from DBLINK timeout,it was waiting single-task message for more than 6 seconds when tracing.

In this scenario other sessions who wait to execute this child cursor were all waiting , so sadly.

The oracle document describes <single-task message> as When running single task, this event indicates that the session waits for the client side of the executable.

A little odd,the blocked also did some single task when it lost one dblink connection.

I searched the metalink with keyword: single-task message, there were few documents found , but fortunately below note:

Hdr: 7757687 10.2.0.3.0 RDBMS 10.2.0.3.0 PRG INT/DISTR PRODID-5 PORTID-212
Abstract: CURSOR: PIN S WAIT ON X BLOCKER “SINGLE-TASK MESSAGE”

PROBLEM:
——–
The databse has multiple processes waiting on “CURSOR: PIN S WAIT ON X”

The holder of the mutes is waiitng on “single-task message” form almost
745687
seconds.
The query run by the holder does conatin a dblink.

DIAGNOSTIC ANALYSIS:
——————–
All the process waiting on the same mutex:-

waiting for ‘cursor: pin S wait on X’ blocking sess=0x0
idn=5d779f21, value=7af00000000, where|sleeps=57a0dfdb6

Holder
PROCESS 26:
KGX Atomic Operation Log 700000061490300
Mutex 70000006226e0f0(1967, 0) idn 5d779f21 oper EXCL
Cursor Pin uid 1967 efd 0 whr 1 slp 0
opr=3 pso=7000000670f93a0 flg=0
pcs=70000006226e0f0 nxt=0 flg=35 cld=0 hd=7000000606838f0
par=7000000629006a8
ct=0 hsh=0 unp=0 unn=0 hvl=62900980 nhv=1 ses=700000079543fe0
hep=70000006226e170 flg=80 ld=1 ob=700000062fe69a8
ptr=700000065f124f8 fex=700000065f11808

Session object for this process:-

SO: 700000079543fe0, type: 4, owner: 70000007a6ad538, flag: INIT/-/-/0x00
(session) sid: 1967 trans: 0, creator: 70000007a6ad538, flag: (41)
USR/- BSY/-/-/-/-/-
DID: 0001-001A-00000A38, short-term DID: 0001-001A-00000A39
txn branch: 0
oct: 3, prv: 0, sql: 700000062900e18, psql:
70000005d797bc8, user: 21/V500
O/S info: user: d_c11, term: , ospid: 6271140, machine: aruput5
program: cpm_srvscript@aruput5 (TNS V1-V3)
application name: cpm_srvscript@aruput5 (TNS V1-V3), hash value=0
waiting for ‘single-task message’ blocking sess=0x0 seq=7071
wait_time=0 seconds since wait
started=745687

This bug which occurs on Version 10.2.0.3 is so similar to my case on AIX version 10.2.0.4, so as the stack trace:

ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttrd+0120<-nsprecv+07a0<-
nscon+0218<-nsdo+157c<-nscall3+012c<-nscall+0778<-niotns+0888<-nigcall+0028<-osnco
n+0540<-kpkiadef+006c<-upiini+03fc<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+011
4<-kpnconn+02a4<-npicon0+042c<-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn
+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree
+0318<-qcsprfro_tree+01c8<-qcspafq+0068<-qcspqb+0470<-kkmdrv+003c<-opiSem+13b4<-opiDe
ferredSem+0234<-opitca+01e8<-kksFullTypeCheck+001c<-rpiswu2+034c<-kksLoadChild+30b0<-
kxsGetRuntimeLock+0810<-kksfbc+2930<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0b30<
-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1020<-opitsk+10b0<-opiino+0990<-opiod
r+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

My case stack trace:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttcni+01b8<-nttcon+04f4<-ntconn+0160<
-nsopen+0960<-nscall1+01b0<-nscall+049c<-niotns+0880<-nigcall+0028<-osncon+0540<-kpkiadef+006c<
-upiini+0408<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+0114<-kpnconn+02a4<-npicon0+042c<
-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<
-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<-qcspafq+0068<
-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-kkmevw+0a58<-kkmfcbvw+0178<
-kkmfcblo+0e38<-kkmpfcbk+0250<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-kkmdrv+003c<-opiSem+13c0<
-opiprs+01ac<-kksParseChildCursor+05e0<-rpiswu2+034c<-kksLoadChild+1d88<
-kxsGetRuntimeLock+0810<-kksfbc+28b0<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0ae0<
-kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<
-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

Both of them have called OCIKDBLinkConn->OCIKDBLinkConn2->kpndbcon->npicon0->kpnconn->OCIServerAttach->kpuatch and follow.
Till now this bug has no patch to apply or workaround way what’s awful.

As “Bug 5111335: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO” described:

Hdr: 5111335 10.2.0.2 RDBMS 10.2.0.2 STREAMS PRODID-5 PORTID-226
Abstract: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO
This is a three cluster environment. Two nodes per cluster and on that
cluster running RAC with 2 instances.
The databases involved are:

GCTGCMU is the “GCT” database in GCM = Greenwich USA, will be named USA from
now on
GCTHKGU is the “GCT” database in HKG = Hong Kong, will be named HKG from now
on
GCTLONU is the “GCT” database in LON = London, will be named LON from now on

On this environment there is a Streams bi-directional replication environment
defined between the 3 databases.

USA database is propagating changes captured to LON and HKG.
There are two propagations defined from this site to every site. 4
propagations as a total.

GCTHKGU_BATCH_PROPAGATE
GCTLONU_BATCH_PROPAGATE
GCTHKGU_DAILY_PROPAGATE
GCTLONU_DAILY_PROPAGATE

Daily propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_DAILY.
Batch propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_BATCH

This databases have been upgraded to 10.2.0.2 from 10.2.0.1 last 13-Mar-2006,
and no problems were reported.

Customer reported that he could see that propagations from USA to HKG were
stucked on library cache lock.
This situation was solved by customer by bouncing USA instances.

Analyzing the systemstate dumps we could see that there were propagations
locked on this situation and QMON slaves too.
The library cache lock was on the queue STRMADMIN.STREAMSOUT_DAILY and the
owner of the lock was process J000 that was stucked ‘single-task message’.

From the information on the systemstate dump we can see that this process has
been on that situation for more than 3 days and 20 hours.
Based on the call stack it seems that the process was about or in the middle
of a call to a listener.

DIAGNOSTIC ANALYSIS:
——————–
Files to be uploaded:

USA:
– RDA.zip , RDA node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 2
– gctgcmu2_ora_21349.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 2
– gctgcmu1_ora_858.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 1.

HKG:
– alert_GCTHKGU1.log, alert log file node 1
– alert_GCTHKGU2.log, alert log file node 2
– listener_hkg0223xus.log_20060320, listener log file node 1.
– listener_hkg0224xus.log_20060319, listener log file node 2.

LON:
– alert_GCTLONU1.log, alert log file node 1
– alert_GCTLONU2.log, alert log file node 2
– listener_lon0223xus.log_20060318, listener log file node 1.
– listener_lon3166xus.log_20060320, listener log file node 2.

Hang was identified at USA on node 1.

Analysis of gctgcmu1_ora_858.trc:

System State 1
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

System State 2
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

LOCK: handle=25ea4afc8 QUEU:STRMADMIN.STREAMSOUT_DAILY

Resource Holder State
LOCK: handle=25ea4afc8 50: waiting for ‘single-task message’

PROCESS 50:
~~~~~~~~~~~

OSD pid info: Unix process pid: 12949, image: oracle@slodb001 (J000)

From the call stack looks like process was doing an SQL*Net connection to a
listenerand doing this
connection we get hanged because the stack trace is the same for the two
process dumps shown on
both system states.

(session)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 92/STRMADMIN
waiting for ‘single-task message’ blocking sess=0x(nil) seq=5 wait_time=0
seconds since wait
started=332308
=0, =0, =0

It is running a propagation job: next_date :=
sys.dbms_aqadm.aq$_propaq(job);

Knowing that the seconds in wait are 332308 and the dump of the process
occurred on 2006-03-21 08:15:12.640,
this process is waiting on this event:

1. On USA the process is waiting since 17-MAR-2006 11:56:44
2. That time on London is 17-MAR-2006 16:56:44
3. That time on HONG KONG is 18-MAR-2006 01:56:44

At that time nothing relevant was found on alert log files on listener log
files.

Processes waiting on library cache lock
~~~~~~~~~~~~~~~

21 (PZ99) : Slave of J002
42 (q004) : Slave of process 44 (QMNC)
51 (J001) :
67 (q003) : Slave of process 44 (QMNC)
71 (q005) : Slave of process 44 (QMNC)
51 (TNS shadow) :
77 (q007) : Slave of process 44 (QMNC)

WORKAROUND:
———–
No workaround found so far.
Killing the process or bouncing instance should solve the problem.

RELATED BUGS:
————-
No similar bugs found.

This environment is also involved on bug 5089630

REPRODUCIBILITY:
—————-
Problem has reproduced just one time on customer site.

TEST CASE:
———-

STACK TRACE:
————
For processes waiting on library cache lock we can see different call stacks,
here is the one for process 21:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
kslwait kjusuc ksipgetctx kqlmli kgllkal kglget
kglgob kwqicgob kwqicrqo0 kwqmsnrmsg kwqsldqs kwqsif
qerfxFetch qervwFetch qertqoFetch qerpxSlaveFetch qerpxFetch
opiexe kpoal8 opiodr kpoodr upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute

kxfxsStmtExecute kxfxsExecute kxfxsp kxfxmai kxfprdp opirip opidrv sou2o
opimai_real main __libc_start_main _start

For process waiting on single-task message, we can see the following call
stack:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
clntudp_call do_ypcall yp_match internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr
ntacbnd2addr
nsc2addr nscall1 nscall niotns nigcall osncon
kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn
npicon0 kpndbcon OCIKDBLinkConn2 OCIKDBLinkConn kwqpcon kwqpgps
spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefca
pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe
opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv rpiexe kkjex1e kkjsexe kkjrdp

opirip opidrv sou2o opimai_real main libc_start_main start

I do not see any Oracle bug here but it is hard to be
certain without the SQLNET.ORA settings and multiple
snapshots of the problem process. It is quite valid
for an Oracle session to be holding a library cache lock
when it makes an outbound DBlink call. If another session
wants that lock in an incompatible mode then it has to wait.

In this case the blocking session making the outbound link
call appears from the one stack dump to be in OS code
under gethostbyname().
However it could have been spinning down the stack in
Oracle code but I cannot comment on the info we have.

The “single-task message” wait event is started / ended
in kpnconn() so if there was a “spin” it would have to be
above that point in the stack.

In 10.2 there is the option to configure NET to timeout
on outbound connect attempts by setting
SQLNET.OUTBOUND_CONNECT_TIMEOUT = N seconds.
The timeout occurs in niotns() in the stack so if this
is set you would expect a session in this state to then
timeout which may be useful if there are underlying issues with
TCP hostname resolution.
(Watch out for bug 4933023 if you use this SQLNET parameter).

If the issue recurs then get CPU usage info, multiple stacks
and you may want to attach with gdb and see if the code ever
returns above gethostbyname or not to get an idea if this

Sorry , my explanation is probably not clear .

Here is the stack you have but annotated:

Callback to get the short stack. Confirms we are not HUNG in a device
driver.
ksdxfstk
ksdxcb
sspuser
pthread_sighandler_rt
GI___libc_sigaction

Inside OS supplied code executing gethostbyname() C call.
GI___poll
clntudp_call
do_ypcall
yp_match
internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname

Oracle Net code:
snlpcgthstbynm
snlinGetAddrInfo
nttbnd2addr
ntacbbnd2addr
ntacbnd2addr
nsc2addr
nscall1
nscall
niotns <<< This is where SQLNET.OUTBOUND_CONNECT_TIMEOUT
is implemented IF it has been configured.
nigcall

Oracle RDBMS code:
osncon
kpkiadef
upiini
upiah0
kpuatch
OCIServerAttach
kpnconn <<< This is where “single-task message” starts/ends
npicon0
…

If we just get two stack snapshots of the process, as was the case, and
both have the above stack then we can say:
As the “seq=” of both snaps is the same (5) then we did not return
below kpnconn() in the call stack.

But as poll() is typically a blocking call, even if only for a short
time, then we cannot be sure if this process was blocked permanently
in the poll() waiting on some large timout or similar, or if the
code was executing and looping. If it was looping it could have done so
at any point on the above stack down to kpnconn() where we would have
changed the seq# on the wait event if we exited here.

We can say that as the short stack dump triggered a callout then the
Linux kernel was not stuck inside a device driver.

We can also say it looks like the gethostbyname is in YP code doing a UDP
call so at OS level you may want to check out the name resolution
configuration. You probably dont want to be falling all the way back to
UDP requests to get hostnames resolved.

So if the problem recurs it would help to get:
ps information of the process (to see if it uses CPU at all)

strace for a few minutes – this should help show if the
process is entering / leaving any system call (eg: poll)

A full call stack with arguments from either gdb or errorstack
and these should help show the arguments passed up the stack.

In gdb you can also do something like attach to the problem process
and :
break osncon
break nigcall
break niotns
etc.. up the stack to gethostbyname and then “cont”
and it should stop if it enters into those functions from below
helping indicate a spin/loop point below that point.

It may also be worth getting Linux “crash” information for the
process to see what the process looks like from the Linux side
but follow up with the OS team on that side.

SQLNET.OUTBOUND_CONNECT_TIMEOUT is a backup mechanism to help prevent

This note claimed that one session may hold library cache lock , and wait in single-task message
after it makes an outbound DBlink call.If another session wants that lock in an incompatible mode then it has to wait.
The “single-task message” wait event is started / ended in kpnconn() so if there was a “spin” it would have to be
above that point in the stack. Oracle support advise to set SQLNET.OUTBOUND_CONNECT_TIMEOUT so that
we can expect a session in this state to then timeout which may be useful if there are underlying issues with
TCP hostname resolution. The principle in these cases is identical.
Do deep digging ,you can see this Knowledge:

Hdr: 8427478 10.2.0.3 RDBMS 10.2.0.3 PI/DISTRIB PRODID-5 PORTID-23 ORA-12170
Abstract: WHEN USING DBLINK,IT CONNECT TWICE AND TAKE DOBULE-TIME FOR TIME-OUT( ORA-12170)
PROBLEM:
——–
When client connect to database using TCP/IP, if there are some problems
in the network and server , The client recevies ORA-12170 (ETIMEDOUT)
The timeout time is depend on OS TCP parameter.(e.g. tcp_syn_retries
parameter on Linux platform).

ERROR:
ORA-12170 : TNS:Connect timeout occurred

** sqlnet.log ***************************************************
VERSION INFORMATION:
TNS for Linux: Version 10.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for Linux: Version 10.2.0.3.0 – Production
Time: 10-APR-2009 21:38:41
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-505: Operation timed out
nt secondary err code: 110 <= ETIMEDOUT nt OS err code: 0 Client address:
** sqlnet.log ***************************************************

client server
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |when setting tcp_syn_retries=5(default)
|——-x–>| | It takes about 189 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

But, the connection via DBLINK, it sometimes try to connect twice. and
it takes double-time for time-out.

local site remote site
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
ORA-12170 occur ,but it not returning to client.
and server process of local site try to connect automatically.(2nd connect)
connect() | |
|——-x–>| |
| SYN | |
|——-x–>| | It takes about 378 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

The problem is that it takes dobule-time by this behavior using DBLINK.

DIAGNOSTIC ANALYSIS:
——————–
We got stack trace when trying connect.

1st connect
connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335 <- nscall1()+449<- nscall()+609<- niotns()+1678<- nigcall()+112 <- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463 <- npicon0()+1250<- kpndbcon()+909<- OCIKDBLinkConn2()+40 <- OCIKDBLinkConn()+36<- ddfnet3Share()+151<- kksarm()+480<- kksauc()+481 <- kksCheckCriteria()+1711<- kksCheckCursor()+226 <- kksSearchChildList()+1496 <- kksfbc()+10276<- kkspsc0()+1903<- kksParseCursor()+132<- opiosq0()+1349 <- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093<- opitsk()+1031 <- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161 2nd connect <- connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<- nscall1()+449<- nscall()+609 <- niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<- npicon0()+1250<- kpndbcon()+909 <- OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet2Normal()+204<- kkmfcbrm()+132 <- __PGOSF172_kkmpfcbk()+473<- qcsprfro()+503<- qcsprfro_tree()+302<- qcsprfro_tree()+118 <- qcspafq()+102<- qcspqb()+347<- kkmdrv()+60<- opiSem()+1044<- opiDeferredSem()+283 <- opitca()+275<- __PGOSF302_kksFullTypeCheck()+20<- rpiswu2()+334<- kksLoadChild()+5852 <- kxsGetRuntimeLock()+1314<- kksfbc()+12270<- kkspsc0()+1903<- kksParseCursor()+132 <- opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093 <- opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161<- WORKAROUND: ----------- execute following command before connect. SQL> alter system flush shared_pool;

You can find the kpnconn call in stack trace , so the single-task message may occur in two time dblink reconnect and take a long time.
The document provide one workaround way: flush shared_pool.which I suspected.
In summary , still no Mature program to resolve this problem , All we can do is pray.
In my opinion , Version 10.2.0.4 is stable enough within most layer , but still a lots of bug or unexpected behavior reside in K2(kernel Distributed Execution Layer). Distributed system is a really complicated subject in computer science.

Filed Under: Oracle Tagged With: cursor pin S on X, hang, kpnconn, mutex, sqlnet, wait event

了解 Oracle Mutex

Mutex带来的好处

更轻量级且更快

减少伪争用

Mutex在一些地方替代了latch和PIN

Mutex和Latch的交互

Mutex的用途

KKS游标共享如何使用Mutex

对于父游标parent cursor的操作：

针对游标统计信息的操作

Mutex是如何替代library cache pin来保护cursor heap的？

KKS使用MUTEX情况下SQL语句的 解析与执行的收益

查询SQL统计信息

Mutex 的统计信息

Mutex Sleep Summary

Mutex的Get和Sleep

真正理解Mutex相关的等待

Mutex相关的等待事件

Mutex的相关统计视图

诗檀软件专业数据库优化团队

服务热线 ： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

诗檀软件专业数据库优化团队

服务热线 ： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

诗檀软件专业数据库优化团队

服务热线 ： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

KKS使用MUTEX情况下SQL语句的解析与执行的收益

服务热线： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

服务热线： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com

服务热线： 13764045638 QQ号:47079569 邮箱：service@parnassusdata.com