Oracle Internal Research内部原理研究

了解你所不知道的SMON功能(三):清理obj$基表

2011/06/08 by admin 6 Comments

SMON的作用还包括清理obj$数据字典基表(cleanup obj$)

OBJ$字典基表是Oracle Bootstarp启动自举的重要对象之一:

SQL> set linesize 80 ;
SQL> select sql_text from bootstrap$ where sql_text like 'CREATE TABLE OBJ$%';

SQL_TEXT
--------------------------------------------------------------------------------
CREATE TABLE OBJ$("OBJ#" NUMBER NOT NULL,"DATAOBJ#" NUMBER,"OWNER#" NUMBER NOT N
ULL,"NAME" VARCHAR2(30) NOT NULL,"NAMESPACE" NUMBER NOT NULL,"SUBNAME" VARCHAR2(
30),"TYPE#" NUMBER NOT NULL,"CTIME" DATE NOT NULL,"MTIME" DATE NOT NULL,"STIME"
DATE NOT NULL,"STATUS" NUMBER NOT NULL,"REMOTEOWNER" VARCHAR2(30),"LINKNAME" VAR
CHAR2(128),"FLAGS" NUMBER,"OID$" RAW(16),"SPARE1" NUMBER,"SPARE2" NUMBER,"SPARE3
" NUMBER,"SPARE4" VARCHAR2(1000),"SPARE5" VARCHAR2(1000),"SPARE6" DATE) PCTFREE
10 PCTUSED 40 INITRANS 1 MAXTRANS 255 STORAGE (  INITIAL 16K NEXT 1024K MINEXTEN
TS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 OBJNO 18 EXTENTS (FILE 1 BLOCK 121))

触发场景

OBJ$基表是一张低级数据字典表，该表几乎对库中的每个对象(表、索引、包、视图等)都包含有一行记录。很多情况下，这些条目所代表的对象是不存在的对象(non-existent)，引起这种现象的一种可能的原因是对象本身已经被从数据库中删除了，但是对象条目仍被保留下来以满足消极依赖机制(negative dependency)。因为这些条目的存在会导致OBJ$表不断膨胀，这时就需要由SMON进程来删除这些不再需要的行。SMON会在实例启动(after startup of DB is started cleanup function again)时以及启动后的每12个小时执行一次清理任务(the cleanup is scheduled to run after startup and then every 12 hours)。

我们可以通过以下演示来了解SMON清理obj$的过程:

SQL>  BEGIN
  2      FOR i IN 1 .. 5000 LOOP
  3      execute immediate ('create synonym gustav' || i || ' for
  4  perfstat.sometable');
  5      execute immediate ('drop   synonym gustav' || i );
  6      END LOOP;
  7    END;
  8    /

PL/SQL procedure successfully completed.

SQL> startup force;
ORACLE instance started.

Total System Global Area 1065353216 bytes
Fixed Size                  2089336 bytes
Variable Size             486542984 bytes
Database Buffers          570425344 bytes
Redo Buffers                6295552 bytes
Database mounted.
Database opened.

SQL>   select count(*) from user$ u, obj$ o
  2        where u.user# (+)=o.owner# and o.type#=10 and not exists
  3        (select p_obj# from dependency$ where p_obj# = o.obj#);

  COUNT(*)
----------
      5000

SQL> /

  COUNT(*)
----------
      5000

SQL> /

  COUNT(*)
----------
      4951

SQL> oradebug setospid 18457;
Oracle pid: 8, Unix process pid: 18457, image: oracle@rh2.oracle.com (SMON)

SQL> oradebug event 10046 trace name context forever ,level 1;
Statement processed.

SQL> oradebug tracefile_name;
/s01/admin/G10R2/bdump/g10r2_smon_18457.trc

select o.owner#,
       o.obj#,
       decode(o.linkname,
              null,
              decode(u.name, null, 'SYS', u.name),
              o.remoteowner),
       o.name,
       o.linkname,
       o.namespace,
       o.subname
  from user$ u, obj$ o
 where u.use r#(+) = o.owner#
   and o.type# = :1
   and not exists
 (select p_obj# from dependency$ where p_obj# = o.obj#)
 order by o.obj#
   for update

select null
  from obj$
 where obj# = :1
   and type# = :2
   and obj# not in
       (select p_obj# from dependency$ where p_obj# = obj$.obj#)

delete from obj$ where obj# = :1

/* 删除过程其实较为复杂，可能要删除多个字典基表上的记录 */

现象

我们可以通过以下查询来了解obj$基表中NON-EXISTENT对象的条目总数(type#=10)，若这个总数在不断减少说明smon正在执行清理工作

    select trunc(mtime), substr(name, 1, 3) name, count(*)
      from obj$
     where type# = 10
       and not exists (select * from dependency$ where obj# = p_obj#)
     group by trunc(mtime), substr(name, 1, 3);

      select count(*)
        from user$ u, obj$ o
       where u.user#(+) = o.owner#
         and o.type# = 10
         and not exists
       (select p_obj# from dependency$ where p_obj# = o.obj#);

如何禁止SMON清理obj$基表

我们可以通过设置诊断事件event=’10052 trace name context forever’来禁止SMON清理obj$基表，当我们需要避免SMON因cleanup obj$的相关代码而意外终止或spin从而开展进一步的诊断时可以设置该诊断事件。在Oracle并行服务器或RAC环境中，也可以设置该事件来保证只有特定的某个节点来执行清理工作。

10052, 00000, "don't clean up obj$"

alter system set events '10052 trace name context forever, level 65535';

Problem Description: We are receiving the below warning during db startup:
WARNING: kqlclo() has detected the following :
Non-existent object 37336 NOT deleted because an object
of the same name exists already.
Object name: PUBLIC.USER$

This is caused by the SMON trying to cleanup the SYS.OJB$.
SMON cleans all dropped objects which have a SYS.OBJ$.TYPE#=10.
This can happen very often when you create an object that have the same name as a public synonym.

When SMON is trying to remove non-existent objects and fails because there are duplicates,
multiple nonexistent objects with same name.
This query will returned many objects with same name under SYS schema:

select o.name,u.user# from user$ u, obj$ o where u.user# (+)=o.owner# and o.type#=10
and not exists (select p_obj# from dependency$ where p_obj# = o.obj#);

To cleanup this message:

Take a full backup of the database - this is crucial. If anything goes wrong during this procedure,
your only option would be to restore from backup, so make sure you have a good backup before proceeding.
We suggest a COLD backup. If you plan to use a HOT backup, you will have to restore point in time if any problem happens

Normally DML against dictionary objects is unsupported,
but in this case we know exactly what the type of corruption,
also you are instructing to do this under guidance from Support.

Data dictionary patching must be done by an experienced DBA.
This solution is unsupported.
It means that if there were problems after applying this solution, a database backup must be restored.

1. Set event 10052 at parameter file to disable cleanup of OBJ$ by SMON

EVENT="10052 trace name context forever, level 65535"

2. Startup database in restricted mode

3. Delete from OBJ$, COMMIT

SQL> delete from obj$ where (name,owner#) in ( select o.name,u.user# from user$ u, obj$ o
where u.user# (+)=o.owner# and o.type#=10 and not exists (select p_obj# from
dependency$ where p_obj# = o.obj#) );

SQL> commit;

SQL> Shutdown abort.

4. remove event 10052 from init.ora

5. Restart the database and monitor for the message in the ALERT LOG file

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: bootstrap$, cleanup, obj$, smon

了解你所不知道的SMON功能(二):合并空闲区间

2011/06/07 by admin 2 Comments

SMON的作用还包括合并空闲区间(coalesces free extent)

触发场景

早期Oracle采用DMT字典管理表空间，不同于今时今日的LMT本地管理方式，DMT下通过对FET$和UET$2张字典基表的递归操作来管理区间。SMON每5分钟(SMON wakes itself every 5 minutes and checks for tablespaces with default pctincrease != 0)会自发地去检查哪些默认存储参数pctincrease不等于0的字典管理表空间，注意这种清理工作是针对DMT的，而LMT则无需合并。SMON对这些DMT表空间上的连续相邻的空闲Extents实施coalesce操作以合并成一个更大的空闲Extent，这同时也意味着SMON需要维护FET$字典基表。

现象

以下查询可以检查数据库中空闲Extents的总数，如果这个总数在持续减少那么说明SMON正在coalesce free space：

SELECT COUNT(*) FROM DBA_FREE_SPACE;

在合并区间时SMON需要排他地(exclusive)持有ST(Space Transaction)队列锁，其他会话可能因为得不到ST锁而等待超时出现ORA-01575错误。同时SMON可能在繁琐的coalesce操作中消耗100%的CPU。

如何禁止SMON合并空闲区间

可以通过设置诊断事件event=’10269 trace name context forever, level 10’来禁用SMON合并空闲区间(Don’t do coalesces of free space in SMON)

10269, 00000, "Don't do coalesces of free space in SMON"
// *Cause:    setting this event prevents SMON from doing free space coalesces

alter system set events '10269 trace name context forever, level 10';

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: coalesce, dmt, extent, smon

了解你所不知道的SMON功能(一):清理临时段

2011/06/07 by admin 9 Comments

SMON(system monitor process)系统监控后台进程，有时候也被叫做system cleanup process，这么叫的原因是它负责完成很多清理(cleanup)任务。但凡学习过Oracle基础知识的技术人员都会或多或少对该background process的功能有所了解。

曾几何时对SMON功能的了解程度可以作为评判一位DBA理论知识的重要因素，至今仍有很多公司在DBA面试中会问到SMON有哪些功能这样的问题。首先这是一道开放式的题目，并不会奢求面试者能够打全(答全几乎是不可能的，即便是在你阅读本篇文章之后)，答出多少可以作为知识广度的评判依据(如果面试人特意为这题准备过，那么也很好，说明他已经能系统地考虑问题了)，接着还可以就具体的某一个功能说开去，来了解面试者的知识深度，当然这扯远了。

我们所熟知的SMON是个兢兢业业的家伙，它负责完成一些列系统级别的任务。与PMON(Process Monitor)后台进程不同的是，SMON负责完成更多和整体系统相关的工作，这导致它会去做一些不知名的”累活”，当系统频繁产生这些”垃圾任务”，则SMON可能忙不过来。因此在10g中SMON变得有一点懒惰了，如果它在短期内接收到过多的工作通知(SMON: system monitor process posted)，那么它可能选择消极怠工以便让自己不要过于繁忙(SMON: Posted too frequently, trans recovery disabled)，之后会详细介绍。

SMON的主要作用包括:

1.清理临时段(SMON cleanup temporary segments)

触发场景

很多人错误地理解了这里所说的临时段temporary segments，认为temporary segments是指temporary tablespace临时表空间上的排序临时段(sort segment)。事实上这里的临时段主要指的是永久表空间(permanent tablespace)上的临时段，当然临时表空间上的temporary segments也是由SMON来清理(cleanup)的，但这种清理仅发生在数据库实例启动时(instance startup)。

永久表空间上同样存在临时段，譬如当我们在某个永久表空间上使用create table/index等DDL命令创建某个表/索引时，服务进程一开始会在指定的永久表空间上分配足够多的区间(Extents)，这些区间在命令结束之前都是临时的(Temporary Extents)，直到表/索引完全建成才将该temporary segment转换为permanent segment。另外当使用drop命令删除某个段时，也会先将该段率先转换为temporary segment，之后再来清理该temporary segment(DROP object converts the segment to temporary and then cleans up the temporary segment)。常规情况下清理工作遵循谁创建temporary segment，谁负责清理的原则。换句话说，因服务进程rebuild index所产生的temporary segment在rebuild完成后应由服务进程自行负责清理。一旦服务进程在成功清理temporary segment之前就意外终止了，亦或者服务进程在工作过程中遇到了某些ORA-错误导致语句失败，那么SMON都会被要求(posted)负责完成temporary segment的清理工作。

对于永久表空间上的temporary segment，SMON会三分钟清理一次(前提是接到post)，如果SMON过于繁忙那么可能temporary segment长期不被清理。temporary segment长期不被清理可能造成一个典型的问题是:在rebuild index online失败后，后续执行的rebuild index命令要求之前产生的temporary segment已被cleanup，如果cleanup没有完成那么就需要一直等下去。在10gR2中我们可以使用dbms_repair.online_index_clean来手动清理online index rebuild的遗留问题:

The dbms_repair.online_index_clean function has been created to cleanup online index rebuilds.
Use the dbms_repair.online_index_clean function to resolve the issue.
Please note if you are unable to run the dbms_repair.online_index_clean function it is due to the fact
that you have not installed the patch for Bug 3805539 or are not running on a release that includes this fix.
The fix for this bug is a new function in the dbms_repair package called dbms_repair.online_index_clean,
which has been created to cleanup online index [[sub]partition] [re]builds.

New functionality is not allowed in patchsets;
therefore, this is not available in a patchset but is available in 10gR2.

Check your patch list to verify the database is patched for Bug 3805539
using the following command and patch for the bug if it is not listed:

opatch lsinventory -detail

Cleanup after a failed online index [re]build can be slow to occurpreventing subsequent such operations
until the cleanup has occured.

接着我们通过实践来看一下smon是如何清理永久表空间上的temporary segment的:

设置10500事件以跟踪smon进程，这个诊断事件后面会介绍

SQL> alter system set events '10500 trace name context forever,level 10';
System altered.

在第一个会话中执行create table命令，这将产生一定量的Temorary Extents

SQL> create table smon as select * from ymon;

在另一个会话中执行对DBA_EXTENTS视图的查询，可以发现产生了多少临时区间

SQL> SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';

COUNT(*)
----------
117

终止以上create table的session，等待一段时间后观察smon后台进程的trc可以发现以下信息:

*** 2011-06-07 21:18:39.817
SMON: system monitor process posted msgflag:0x0200 (-/-/-/-/TMPSDROP/-/-)

*** 2011-06-07 21:18:39.818
SMON: Posted, but not for trans recovery, so skip it.

*** 2011-06-07 21:18:39.818
SMON: clean up temp segments in slave

SQL> SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';

COUNT(*)
----------
0

可以看到smon通过slave进程完成了对temporary segment的清理

与永久表空间上的临时段不同，出于性能的考虑临时表空间上的Extents并不在操作(operations)完成后立即被释放和归还。相反，这些Temporary Extents会被标记为可用，以便用于下一次的排序操作。SMON仍会清理这些Temporary segments，但这种清理仅发生在实例启动时(instance startup):

For performance issues, extents in TEMPORARY tablespaces are not released ordeallocated
once the operation is complete.Instead, the extent is simply marked as available for the next sort operation.
SMON cleans up the segments at startup.

A sort segment is created by the first statement that used a TEMPORARY tablespacefor sorting, after startup.
A sort segment created in a TEMPOARY tablespace is only released at shutdown.
The large number of EXTENTS is caused when the STORAGE clause has been incorrectly calculated.

现象

可以通过以下查询了解数据库中Temporary Extent的总数，在一定时间内比较其总数，若有所减少那么说明SMON正在清理Temporary segment

SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';

也可以通过v$sysstat视图中的”SMON posted for dropping temp segment”事件统计信息来了解SMON收到清理要求的情况:

SQL> select name,value from v$sysstat where name like '%SMON%';
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
total number of times SMON posted                                         8
SMON posted for undo segment recovery                                     0
SMON posted for txn recovery for other instances                          0
SMON posted for instance recovery                                         0
SMON posted for undo segment shrink                                       0
SMON posted for dropping temp segment                                     1

另外在清理过程中SMON会长期持有Space Transacton(ST)队列锁，其他会话可能因为得不到ST锁而等待超时出现ORA-01575错误:

01575, 00000, "timeout waiting for space management resource"
// *Cause: failed to acquire necessary resource to do space management.
// *Action: Retry the operation.

如何禁止SMON清理临时段

可以通过设置诊断事件event=’10061 trace name context forever, level 10’禁用SMON清理临时段(disable SMON from cleaning temp segments)。

alter system set events '10061 trace name context forever, level 10';

相关诊断事件

除去10061事件外还可以用10500事件来跟踪smon的post信息，具体的事件设置方法见<EVENT: 10500 “turn on traces for SMON>

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: cleanup, event, smon, temporary

Oracle内部视图:x$ktfbue

2011/06/02 by admin Leave a Comment

x$ktfbue:kernel transaction, file bitmap used extent,used extent bitmap in file header for LMT (equivalent to uet$ in DMT); check dba_extents view definition,ktfb –space/spcmgmt support for bitmapped space manipulation of files/tablespaces

KTFBUE means K[Kernel] T[Transaction] F[File] B[Bitmap] U[Used] E[Extents]


SQL> desc x$ktfbue;

Name	        Meaning
ADDR	        N/A
INDX	        N/A
INST_ID	        N/A
KTFBUESEGTSN	ts# containing this segment
KTFBUESEGFNO	Relative number of the file containing the segment header
KTFBUESEGBNO	segment header block number
KTFBUEEXTNO	Extent number
KTFBUEFNO	Relative number of the file containing the extent
KTFBUEBNO	Starting block number of the extent
KTFBUEBLKS	Size of the extent in ORACLE blocks
KTFBUECTM	commit_jtime,Commit Time of the undo in the extent expressed as Julian date
KTFBUESTT	commit_wtime,Commit Time of the undo in the extent expressed as wall clock time
KTFBUESTA	Transaction Status of the undo in the extent;1, 'ACTIVE', 2, 'EXPIRED', 3, 'UNEXPIRED', 0 for non-undo-extent

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: x$ view

Only ARCH Bgprocess may create archivelog?

2011/05/30 by admin 2 Comments

我们在学习Oracle入门知识时都会介绍到ARCH归档进程，归档进程ARCH负责将在线重做日志归档，注意ARCH只会将日志文件中存在的重做内容复制到归档日志文件中，举例来说重做日志文件的大小是512MB，但当前写入的redo entry只占用10MB空间，此时由某些条件触发了日志切换，那么产生的归档文件的大小仍是10MB。

一直以来存在着这样一种误解：归档操作只会由ARCH进程负责完成。

实际上归档操作可能由多种进程完成，举例来说前台进程Fore ground process就可以完成归档操作，当在前台进程中执行alter system archive log current命令时实际执行归档操作的是前台进程(server process)，而非ARCH，这可能是大多数人料想之外的，口说无凭，我们具体验证一下:

[oracle@rh2 ~]$ ps -ef|grep LOCAL=YES|grep G10R2
oracle   20790 20789  0 19:17 ?        00:00:00 oracleG10R2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

SQL> set linesize 200 pagesize 1400;
SQL> select le.leseq "Current log sequence No",
  2         100 * cp.cpodr_bno / le.lesiz "Percent Full",
  3         (cpodr_bno - 1) * 512  "bytes used exclude header",
  4         le.lesiz * 512 - cpodr_bno * 512 "Left space",
  5         le.lesiz  *512       "logfile size"
  6    from x$kcccp cp, x$kccle le
  7   where LE.leseq = CP.cpodr_seq
  8     and bitand(le.leflg, 24) = 8;

Current log sequence No Percent Full bytes used exclude header Left space logfile size
----------------------- ------------ ------------------------- ---------- ------------
                    720   62.0244141                  32518144   19910144     52428800

/* 通过以上脚本可以了解当前日志文件(current logfile)的使用情况，
    这里可以看到日志文件大小为50MB，已使用32MB                    */

[oracle@rh2 ~]$ strace -p 20972 -o archive.log
Process 20972 attached - interrupt to quit

SQL> alter system archive log current;
System altered.

[oracle@rh2 ~]$ cat archive.log |egrep "open|read|write"
read(8, "\1\2\0\0\6\0\0\0\0\0\3^\10!\200\0\0\0\0\0\0\30\352\341\0\0\0\0\0`\0\0"..., 2064) = 258
open("/s01/admin/G10R2/bdump/alert_G10R2.log", O_WRONLY|O_CREAT|O_APPEND, 0660) = 6
writev(6, [{"Mon May 30 19:31:23 2011\n", 25}, {"ALTER SYSTEM ARCHIVE LOG", 24}, {"\n", 1}], 3) = 50
open("/flashcard/oradata/G10R2/controlfile/1.ctl", O_RDWR|O_SYNC|O_DIRECT) = 12
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\361\374\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\17\0\0\0\2324\0\0\377\377\1\4?^\0\0\0\6\0\0\0\0\0\0\0\0\0\2"..., 16384, 245760) = 16384
pread(12, "\25\302\0\0\21\0\0\0\2324\0\0\377\377\1\4\377\37\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 278528) = 16384
pread(12, "\25\302\0\0\26\0\0\0\2234\0\0\377\377\1\4\327\312\0\0\0\220\1\0\316\2\0\0\2\0\0\0"..., 16384, 360448) = 16384
pread(12, "\25\302\0\0\23\0\0\0\2174\0\0\377\377\1\4\372\202\0\0\17\0\0\0\265!\315\0\0\0/\234"..., 16384, 311296) = 16384
pwrite(12, "\25\302\0\0\22\0\0\0\2334\0\0\377\377\1\4B\35\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 294912) = 16384
pwrite(12, "\25\302\0\0\20\0\0\0\2334\0\0\377\377\1\4\240^\0\0\200\6\0\0\0\0\0\0\0\0\0\2"..., 16384, 262144) = 16384
pwrite(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\t\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\t\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\r\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\20\0\0\0\2354\0\0\377\377\1\4\241\177\0\0\200'\0\0\0\0\0\0\0\0\0\2"..., 16384, 262144) = 16384
pread(12, "\25\302\0\0\22\0\0\0\2354\0\0\377\377\1\4X\35\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 294912) = 16384
pread(12, "\25\302\0\0\26\0\0\0\2354\0\0\377\377\1\4\236W\0\0\0\220\1\0\321\2\0\0\1\0\0\0"..., 16384, 360448) = 16384
pread(12, "\25\302\0\0\36\0\0\0\2354\0\0\377\377\1\4\367\5\0\0\3\0\3\0\2\0\0\0\0\0/f"..., 16384, 491520) = 16384
open("/flashcard/oradata/G10R2/onlinelog/o1_mf_3_6v34jpmp_.log", O_RDONLY|O_DIRECT) = 13
open("/flashcard/oradata/G10R2/onlinelog/o1_mf_3_6v34jpmp_.log", O_RDWR|O_SYNC|O_DIRECT) = 13
open("/s01/flash_recovery_area/G10R2/onlinelog/o1_mf_3_6v34jpyn_.log", O_RDONLY|O_DIRECT) = 14
open("/s01/flash_recovery_area/G10R2/onlinelog/o1_mf_3_6v34jpyn_.log", O_RDWR|O_SYNC|O_DIRECT) = 14
pread(13, "\1\"\0\0\1\0\0\0\320\2\0\0\0\200Um\0\0\0\0\0\3 \n\22\235\205\254G10R"..., 512, 512) = 512
pread(14, "\1\"\0\0\1\0\0\0\320\2\0\0\0\200Um\0\0\0\0\0\3 \n\22\235\205\254G10R"..., 512, 512) = 512
open("/flashcard/oradata/G10R2/onlinelog/o1_mf_3_6v34jpmp_.log", O_RDONLY|O_DIRECT) = 13
open("/flashcard/oradata/G10R2/onlinelog/o1_mf_3_6v34jpmp_.log", O_RDWR|O_SYNC|O_DIRECT) = 13
open("/s01/flash_recovery_area/G10R2/onlinelog/o1_mf_3_6v34jpyn_.log", O_RDONLY|O_DIRECT) = 14
open("/s01/flash_recovery_area/G10R2/onlinelog/o1_mf_3_6v34jpyn_.log", O_RDWR|O_SYNC|O_DIRECT) = 14
pread(13, "\1\"\0\0\1\0\0\0\320\2\0\0\0\200Um\0\0\0\0\0\3 \n\22\235\205\254G10R"..., 512, 512) = 512
pread(14, "\1\"\0\0\1\0\0\0\320\2\0\0\0\200Um\0\0\0\0\0\3 \n\22\235\205\254G10R"..., 512, 512) = 512
pread(12, "\25\302\0\0\24\0\0\0\2354\0\0\377\377\1\4\341\202\0\0\17\0\0\0\265!\315\0\0\0/\234"..., 16384, 327680) = 16384
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\r\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\20\0\0\0\2354\0\0\377\377\1\4\241\177\0\0\200'\0\0\0\0\0\0\0\0\0\2"..., 16384, 262144) = 16384
pread(12, "\25\302\0\0\22\0\0\0\2354\0\0\377\377\1\4X\35\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 294912) = 16384
pread(12, "\25\302\0\0D\1\0\0\2234\0\0\377\377\1\4L\250\0\0\0\0\0\0\0\0\0\0008\251\332,"..., 16384, 5308416) = 16384
pread(12, "\25\302\0\0A\1\0\0\2234\0\0\377\377\1\4\3715\0\0\0\236<\216\5\0\0\0\0\300[\310"..., 16384, 5259264) = 16384
pwrite(12, "\25\302\0\0C\1\0\0\2364\0\0\377\377\1\4\210\232\0\0\0000\360\1\0\0\0\0\v\252\332,"..., 16384, 5292032) = 16384
pwrite(12, "\25\302\0\0\21\0\0\0\2364\0\0\377\377\1\4[\35\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 278528) = 16384
pwrite(12, "\25\302\0\0\17\0\0\0\2364\0\0\377\377\1\4?\177\0\0\0'\0\0\0\0\0\0\0\0\0\2"..., 16384, 245760) = 16384
pwrite(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\r\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\r\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
open("/s01/flash_recovery_area/G10R2/archivelog/2011_05_30/o1_mf_1_720_6y700ch9_.arc", O_RDWR|O_CREAT|O_EXCL|O_SYNC, 0660) = 15
open("/s01/flash_recovery_area/G10R2/archivelog/2011_05_30/o1_mf_1_720_6y700ch9_.arc", O_RDWR|O_SYNC|O_DIRECT) = 15
write(15, "\0\"\0\0\0\0\300\377\0\0\0\0\0\0\0\0~\240\0\0\0\2\0\0\30\370\0\0}|{z"..., 512) = 512
open("/s01/flash_recovery_area/G10R2/archivelog/2011_05_30/o1_mf_1_720_6y700ch9_.arc", O_RDONLY|O_DIRECT) = 15
open("/s01/flash_recovery_area/G10R2/archivelog/2011_05_30/o1_mf_1_720_6y700ch9_.arc", O_RDWR|O_DIRECT) = 15
pread(13, "\1\"\0\0\1\0\0\0\320\2\0\0\0\200Um\0\0\0\0\0\3 \n\22\235\205\254G10R"..., 1048576, 512) = 1048576
pwrite(15, "\1\"\0\0\1\0\0\0\320\2\0\0\0\200tm\0\0\0\0\0\3 \n\22\235\205\254G10R"..., 1048576, 512) = 1048576
pread(13, "\1\"\0\0\1\10\0\0\320\2\0\0\0\200\271\242:22:48:01\5VALID\1"..., 1048576, 1049088) = 1048576
pwrite(15, "\1\"\0\0\1\10\0\0\320\2\0\0\0\200\271\242:22:48:01\5VALID\1"..., 1048576, 1049088) = 1048576
pread(13, "\1\"\0\0\1\20\0\0\320\2\0\0\0\200\220\271\n\0232011-05-12:22:"..., 1048576, 2097664) = 1048576
pwrite(15, "\1\"\0\0\1\20\0\0\320\2\0\0\0\200\220\271\n\0232011-05-12:22:"..., 1048576, 2097664) = 1048576
pread(13, "\1\"\0\0\1\30\0\0\320\2\0\0\364\201\261/\21\0232011-05-12:22:"..., 1048576, 3146240) = 1048576
pwrite(15, "\1\"\0\0\1\30\0\0\320\2\0\0\364\201\261/\21\0232011-05-12:22:"..., 1048576, 3146240) = 1048576
pread(13, "\1\"\0\0\1 \0\0\320\2\0\0\0\200\31~:22:54:21\5VALID\1"..., 1048576, 4194816) = 1048576
pwrite(15, "\1\"\0\0\1 \0\0\320\2\0\0\0\200\31~:22:54:21\5VALID\1"..., 1048576, 4194816) = 1048576
pread(13, "\1\"\0\0\1(\0\0\320\2\0\0\0\200\320\210\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 5243392) = 1048576
pwrite(15, "\1\"\0\0\1(\0\0\320\2\0\0\0\200\320\210\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 5243392) = 1048576
pread(13, "\1\"\0\0\0010\0\0\320\2\0\0\374\200\265$\5\f\27<\10\7xo\5\f\27<\10\02320"..., 1048576, 6291968) = 1048576
pwrite(15, "\1\"\0\0\0010\0\0\320\2\0\0\374\200\265$\5\f\27<\10\7xo\5\f\27<\10\02320"..., 1048576, 6291968) = 1048576
pread(13, "\1\"\0\0\0018\0\0\320\2\0\0\0\200W&796508\377\5\304\2\6\2\t\377\tP"..., 1048576, 7340544) = 1048576
pwrite(15, "\1\"\0\0\0018\0\0\320\2\0\0\0\200W&796508\377\5\304\2\6\2\t\377\tP"..., 1048576, 7340544) = 1048576
pread(13, "\1\"\0\0\1@\0\0\320\2\0\0\0\200*2SYS\7D807355\377\5\304\2\7"..., 1048576, 8389120) = 1048576
pwrite(15, "\1\"\0\0\1@\0\0\320\2\0\0\0\200*2SYS\7D807355\377\5\304\2\7"..., 1048576, 8389120) = 1048576
pread(13, "\1\"\0\0\1H\0\0\320\2\0\0\0\200~\5\tPROCEDURE\7xo\5\f\30"..., 1048576, 9437696) = 1048576
pwrite(15, "\1\"\0\0\1H\0\0\320\2\0\0\0\200~\5\tPROCEDURE\7xo\5\f\30"..., 1048576, 9437696) = 1048576
pread(13, "\1\"\0\0\1P\0\0\320\2\0\0\0\200ll\5\f\30\10'\0232011-05-12"..., 1048576, 10486272) = 1048576
pwrite(15, "\1\"\0\0\1P\0\0\320\2\0\0\0\200ll\5\f\30\10'\0232011-05-12"..., 1048576, 10486272) = 1048576
pread(13, "\1\"\0\0\1X\0\0\320\2\0\0\0\200\320\370\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 11534848) = 1048576
pwrite(15, "\1\"\0\0\1X\0\0\320\2\0\0\0\200\320\370\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 11534848) = 1048576
pread(13, "\1\"\0\0\1`\0\0\320\2\0\0\0\200\27\216 \377\tPROCEDURE\7xo\5"..., 1048576, 12583424) = 1048576
pwrite(15, "\1\"\0\0\1`\0\0\320\2\0\0\0\200\27\216 \377\tPROCEDURE\7xo\5"..., 1048576, 12583424) = 1048576
pread(13, "\1\"\0\0\1h\0\0\320\2\0\0008\201fj\1N\1N,\0\r\3SYS\7D860"..., 1048576, 13632000) = 1048576
pwrite(15, "\1\"\0\0\1h\0\0\320\2\0\0008\201fj\1N\1N,\0\r\3SYS\7D860"..., 1048576, 13632000) = 1048576
pread(13, "\1\"\0\0\1p\0\0\320\2\0\0\0\200\25%\7xo\5\f\30\20:\0232011-05"..., 1048576, 14680576) = 1048576
pwrite(15, "\1\"\0\0\1p\0\0\320\2\0\0\0\200\25%\7xo\5\f\30\20:\0232011-05"..., 1048576, 14680576) = 1048576
pread(13, "\1\"\0\0\1x\0\0\320\2\0\0\0\200G\36\7xo\5\f\30\23\6\7xo\5\f\30\23\6"..., 1048576, 15729152) = 1048576
pwrite(15, "\1\"\0\0\1x\0\0\320\2\0\0\0\200G\36\7xo\5\f\30\23\6\7xo\5\f\30\23\6"..., 1048576, 15729152) = 1048576
pread(13, "\1\"\0\0\1\200\0\0\320\2\0\0004\200\"\34311-05-12:23:20:1"..., 1048576, 16777728) = 1048576
pwrite(15, "\1\"\0\0\1\200\0\0\320\2\0\0004\200\"\34311-05-12:23:20:1"..., 1048576, 16777728) = 1048576
pread(13, "\1\"\0\0\1\210\0\0\320\2\0\0\0\200\320(\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 17826304) = 1048576
pwrite(15, "\1\"\0\0\1\210\0\0\320\2\0\0\0\200\320(\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 17826304) = 1048576
pread(13, "\1\"\0\0\1\220\0\0\320\2\0\0\0\200\343\367\377\tPROCEDURE\7xo\5\f"..., 1048576, 18874880) = 1048576
pwrite(15, "\1\"\0\0\1\220\0\0\320\2\0\0\0\200\343\367\377\tPROCEDURE\7xo\5\f"..., 1048576, 18874880) = 1048576
pread(13, "\1\"\0\0\1\230\0\0\320\2\0\0\350\201\200\263\33\32\0232011-05-12:23"..., 1048576, 19923456) = 1048576
pwrite(15, "\1\"\0\0\1\230\0\0\320\2\0\0\350\201\200\263\33\32\0232011-05-12:23"..., 1048576, 19923456) = 1048576
pread(13, "\1\"\0\0\1\240\0\0\320\2\0\0\34\201\36\326-12:23:28:30\5VAL"..., 1048576, 20972032) = 1048576
pwrite(15, "\1\"\0\0\1\240\0\0\320\2\0\0\34\201\36\326-12:23:28:30\5VAL"..., 1048576, 20972032) = 1048576
pread(13, "\1\"\0\0\1\250\0\0\320\2\0\0\0\200P\331VALID\1N\1N\1N,\0\r\3S"..., 1048576, 22020608) = 1048576
pwrite(15, "\1\"\0\0\1\250\0\0\320\2\0\0\0\200P\331VALID\1N\1N\1N,\0\r\3S"..., 1048576, 22020608) = 1048576
pread(13, "\1\"\0\0\1\260\0\0\320\2\0\0\0\200\342X\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 23069184) = 1048576
pwrite(15, "\1\"\0\0\1\260\0\0\320\2\0\0\0\200\342X\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 23069184) = 1048576
pread(13, "\1\"\0\0\1\270\0\0\320\2\0\0\0\200k\251,\0\r\3SYS\7D968280\377"..., 1048576, 24117760) = 1048576
pwrite(15, "\1\"\0\0\1\270\0\0\320\2\0\0\0\200k\251,\0\r\3SYS\7D968280\377"..., 1048576, 24117760) = 1048576
pread(13, "\1\"\0\0\1\300\0\0\320\2\0\0\310\200Ix:37:19\5VALID\1N\1N"..., 1048576, 25166336) = 1048576
pwrite(15, "\1\"\0\0\1\300\0\0\320\2\0\0\310\200Ix:37:19\5VALID\1N\1N"..., 1048576, 25166336) = 1048576
pread(13, "\1\"\0\0\1\310\0\0\320\2\0\0\0\200\34*#\7\377\tPROCEDURE\7xo"..., 1048576, 26214912) = 1048576
pwrite(15, "\1\"\0\0\1\310\0\0\320\2\0\0\0\200\34*#\7\377\tPROCEDURE\7xo"..., 1048576, 26214912) = 1048576
pread(13, "\1\"\0\0\1\320\0\0\320\2\0\0\0\200\343\336A\377\tPROCEDURE\7xo\5"..., 1048576, 27263488) = 1048576
pwrite(15, "\1\"\0\0\1\320\0\0\320\2\0\0\0\200\343\336A\377\tPROCEDURE\7xo\5"..., 1048576, 27263488) = 1048576
pread(13, "\1\"\0\0\1\330\0\0\320\2\0\0P\200\367{/\377\tPROCEDURE\7xo\5"..., 1048576, 28312064) = 1048576
pwrite(15, "\1\"\0\0\1\330\0\0\320\2\0\0P\200\367{/\377\tPROCEDURE\7xo\5"..., 1048576, 28312064) = 1048576
pread(13, "\1\"\0\0\1\340\0\0\320\2\0\0\0\200\315_-05-12:23:46:08\5"..., 1048576, 29360640) = 1048576
pwrite(15, "\1\"\0\0\1\340\0\0\320\2\0\0\0\200\315_-05-12:23:46:08\5"..., 1048576, 29360640) = 1048576
pread(13, "\1\"\0\0\1\350\0\0\320\2\0\0\0\200\263Xo\5\f\0302\31\7xo\5\f\0302\31\0232"..., 1048576, 30409216) = 1048576
pwrite(15, "\1\"\0\0\1\350\0\0\320\2\0\0\0\200\263Xo\5\f\0302\31\7xo\5\f\0302\31\0232"..., 1048576, 30409216) = 1048576
pread(13, "\1\"\0\0\1\360\0\0\320\2\0\0\0\200\213K\5VALID\1N\1N\1N,\0\r\3"..., 1048576, 31457792) = 1048576
pwrite(15, "\1\"\0\0\1\360\0\0\320\2\0\0\0\200\213K\5VALID\1N\1N\1N,\0\r\3"..., 1048576, 31457792) = 1048576
pread(13, "\1\"\0\0\1\370\0\0\320\2\0\0\300\200\333N\24\0\t\0T\0\210\22\"\0\0\0\t\0\t\0"..., 12288, 32506368) = 12288
pwrite(15, "\1\"\0\0\1\370\0\0\320\2\0\0\300\200\333N\24\0\t\0T\0\210\22\"\0\0\0\t\0\t\0"..., 12288, 32506368) = 12288
open("/s01/flash_recovery_area/G10R2/archivelog/2011_05_30/o1_mf_1_720_6y700ch9_.arc", O_RDWR|O_DIRECT) = 16
write(16, "\0\"\0\0\0\0\300\377\0\0\0\0\0\0\0\0~\240\0\0\0\2\0\0\30\370\0\0}|{z"..., 512) = 512
open("/s01/flash_recovery_area/G10R2/archivelog/2011_05_30/o1_mf_1_720_6y700ch9_.arc", O_RDWR|O_DIRECT) = 16
write(16, "\0\"\0\0\0\0\300\377\0\0\0\0\0\0\0\0~\240\0\0\0\2\0\0\30\370\0\0}|{z"..., 512) = 512
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\r\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\17\0\0\0\2364\0\0\377\377\1\4?\177\0\0\0'\0\0\0\0\0\0\0\0\0\2"..., 16384, 245760) = 16384
pread(12, "\25\302\0\0\21\0\0\0\2364\0\0\377\377\1\4[\35\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 278528) = 16384
pread(12, "\25\302\0\0\26\0\0\0\2354\0\0\377\377\1\4\236W\0\0\0\220\1\0\321\2\0\0\1\0\0\0"..., 16384, 360448) = 16384
pread(12, "\25\302\0\0\313\0\0\0\2234\0\0\377\377\1\4\2\245\0\0005\251\332,D\"\315\0\0\0\0\0"..., 16384, 3325952) = 16384
pread(12, "\25\302\0\0(\1\0\0d4\0\0\377\377\1\4\6*\0\0\33\0\4\0\333\7\0\0\0\0\0\0"..., 16384, 4849664) = 16384
pread(12, "\25\302\0\0A\1\0\0\2234\0\0\377\377\1\4\3715\0\0\0\236<\216\5\0\0\0\0\300[\310"..., 16384, 5259264) = 16384
pread(12, "\25\302\0\0\313\0\0\0\2234\0\0\377\377\1\4\2\245\0\0005\251\332,D\"\315\0\0\0\0\0"..., 16384, 3325952) = 16384
pwrite(12, "\25\302\0\0B\1\0\0\2374\0\0\377\377\1\4\3725\0\0\0\236<\216\5\0\0\0\0\300[\310"..., 16384, 5275648) = 16384
pread(12, "\25\302\0\0C\1\0\0\2364\0\0\377\377\1\4\210\232\0\0\0000\360\1\0\0\0\0\v\252\332,"..., 16384, 5292032) = 16384
pread(12, "\25\302\0\0B\1\0\0\2374\0\0\377\377\1\4\3725\0\0\0\236<\216\5\0\0\0\0\300[\310"..., 16384, 5275648) = 16384
pwrite(12, "\25\302\0\0D\1\0\0\2374\0\0\377\377\1\4\177\253\0\0\0\0\0\0\0\0\0\0\v\252\332,"..., 16384, 5308416) = 16384
pread(12, "\25\302\0\0\313\0\0\0\2234\0\0\377\377\1\4\2\245\0\0005\251\332,D\"\315\0\0\0\0\0"..., 16384, 3325952) = 16384
pwrite(12, "\25\302\0\0B\1\0\0\2374\0\0\377\377\1\4\353{\0\0\0\316,\220\5\0\0\0\0\300[\310"..., 16384, 5275648) = 16384
pread(12, "\25\302\0\0\26\0\0\0\2354\0\0\377\377\1\4\236W\0\0\0\220\1\0\321\2\0\0\1\0\0\0"..., 16384, 360448) = 16384
pwrite(12, "\25\302\0\0\25\0\0\0\2374\0\0\377\377\1\4\236W\0\0\0\220\1\0\321\2\0\0\1\0\0\0"..., 16384, 344064) = 16384
pwrite(12, "\25\302\0\0\314\0\0\0\2374\0\0\377\377\1\4\231@\0\0005\251\332,D\"\315\0\0\0\0\0"..., 16384, 3342336) = 16384
pwrite(12, "\25\302\0\0\22\0\0\0\2374\0\0\377\377\1\4\323\35\0\0\0\0\0\0\0\0\0\0Q\350\255,"..., 16384, 294912) = 16384
pwrite(12, "\25\302\0\0\20\0\0\0\2374\0\0\377\377\1\4\261\375\0\0\200%\0\0\0\0\0\0\0\0\0\2"..., 16384, 262144) = 16384
pwrite(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\0043\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
pread(12, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\0043\377\0\0\0\0\0\0\0\3 \n\22\235\205\254"..., 16384, 16384) = 16384
write(11, "\0\313\0\0\6\0\0\0\0\0\10\6\0\315\"\315\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0"..., 203) = 203

在以上system call trace中可以看到前台进程首先读取并更新了控制文件(controlfile/1.ctl)，之后读取了在线日志文件的头部(read o1_mf_3_6v34jpmp_.log=512bytes),接着以相关属性创建了归档日志文件(o1_mf_1_720_6y700ch9_.arc，注意这里使用了O_CREAT、O_EXCL、O_SYNC、0660属性)并得到该打开文件描述符15，接着以每次buffer为1048576字节来读取在线日志并写入归档日志pread(onlinelog)->pwrite(archivelog)，直到在线日志中的redo记录被写完。这里的buffer大小1048576实际上受到隐藏参数_log_archive_buffer_size(Size of each archival buffer in log file blocks)的影响，该参数指定了归档缓冲的大小，其单位为日志文件块(log file block)，日志文件块大小在不同平台上各有差异，具体可以参考这里。目前主流的UNIX平台上日志文件标准块大小为512 bytes，而_log_archive_buffer_size的默认值一般为2048，所以得出了这里的(2048*512)=1048576=1MB buffer大小。需要注意的是与”alter system archive log current”命令不同，”switch logfile”不会引起前台进程去执行归档操作，归档操作仍将由ARCH进程完成。

我们可以很方便的通过查询v$archived_log视图来了解具体完成某个日志文件归档操作的是何种进程，这得益于其creator列:


SQL> select distinct creator  from v$archived_log;

CREATOR
-------
RMAN
FGRD
ARCH


CREATOR 	VARCHAR2(7) 	Creator of the archivelog:

    ARCH - Archiver process
    FGRD - Foreground process
    RMAN - Recovery Manager
    SRMN - RMAN at standby
    LGWR - Logwriter process

以上可以看到可能创建归档日志的进程不仅仅局限于FGRD与ARCH，RMAN(注意实际上并非直接由rman进程完成)在执行热备操作时也可能掺合进来归档一把，此外在DataGuard环境中LGWR也会参与归档操作。

V$archived_log视图中另外一个有趣的字段是REGISTRAR，一般来说在非DataGuard环境中CREATOR总是等于REGISTRAR，而在DataGuard环境中REGISTRAR常是RFS(remote file server process (RFS) on the standby system where the RFS process writes the redo data to either archived redo log files or standby redo log files)后台进程，而实际的CREATOR则往往仍是ARCH或LGWR。

REGISTRAR 	VARCHAR2(7) 	Registrar of the entry:

    RFS - Remote File Server process
    ARCH - Archiver process
    FGRD - Foreground process
    RMAN - Recovery manager
    SRMN - RMAN at standby
    LGWR - Logwriter process

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: arch, archivelog, lgwr, Oracle Data Guard Standby, redo

Oracle Latch:一段描绘Latch运作的伪代码

2011/05/21 by admin 3 Comments

以下这段伪代码来自于OraPub的Craig A. Shallahamer，这段代码并不长但基本对获取latch、spin、sleep的行为都描述清楚了，如果你对latch仍不甚了了，那么这段代码会对你很有帮助:

Function Get_Latch(latch_name,mode)
{
  If Mode eq ‘immediate’ {
    If Fast_Get(latch_name) {
      return TRUE
    Else {
      return FALSE
    }
  }
  Else {
    If Fast_Get(latch_name)
    Then {
      v$latch.gets++
      return TRUE
    }
    Else {
      v$latch.misses++
      for try = 0 .. large_number
      {
        if Spin_Get(latch_name)
        Then {
          return TRUE
        }
        Else {
          T0 = time
          Sleep(try)
          T1=time
          Register_Event("latch free",T1-T0)
        }
      } -- spin/sleep loop
    }
  }
}

Function Fast_Get(latch_name)
{
  If try_to_get_latch(latch_name)
  Then {
    return TRUE
  }
  Else {
    return FALSE
  }
}

Function Spin_Get(latch_name)
{
  v$latch.spin_get++
  for i = 1 to _spin_count
  {
    If Fast_Get(latch_name)
    Then {
      return TRUE
    }
  }
}

Function Sleep(try)
{
  v$latch.sleeps++
  v$latch.sleep[try]++
  sleeptime =
    decode(try,0,0,1,10,2,20,3,~40,4,~80,...~2000)
  sleep(sleeptime)
}

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: latch, pseudo code, sleep, spin

Who pulls the background process dbwr’s trigger?

2011/02/10 by admin 1 Comment

到底是谁扣动了database writer的扳机？初学Oracle的朋友都会对dbwr这个后台进程有一种模糊的印象，dbwr何时会被触发？很多人大约会回答当发生检查点或者当某些脏块在LRU链表上处于较冷的一端时。同时又有许多关注于宏观架构的工程师会将dbwr的写出规律归结为是lazy(懒)的。Oracle作为目前市场占有率最高的商用数据库，其各种内部算法都可以算得上是商业机密；虽然不断有专家为我们”解密”，但在我的观念中这些内部原理都与真理之冠有着不大不小的差别。所以显然我要描述的是我个人对于database writer以及cache management(缓存管理)的理解，这些理解在一定程度上是能够自洽的，但我无法保证它们必然准确无误。

要详细描述dbwr的工作原理，我们需要从久远年代的版本V7323说起，当时的db writer和cache management已经十分成熟了，8i以后只是引入了增量检查点等特性，dba不用再关心db writer受一些细节参数的影响，而只需要关注增量检查点的活跃程度就可以了。以下我们列出在V7323中，dbwr可能被触发写出的几种情况:

a.当前台进程需要将磁盘上的物理数据块读取到数据库高速缓存中(db cache)时，其首先需要在数据库缓存中寻找到一块可用的free(空闲) buffer。为了寻找这样free buffer，该前台进程首先需要以排他方式持有相关LRU链表的latch(闩)，并在该LRU链表上扫描所可用的Free buffer，扫描都会从LRU链表的尾端开始，也就是”较冷”的一端。在此过程中，前台进程沿着由尾到头的方向所遍历到的脏块将被移动到LRUW链表上(注意：一个buffer同时只可能处于一个链表上)；此外相关的统计信息如dirty buffer inspected及free buffers inspected将会累增。若该前台进程在LRU链表上搜索的范围超过了整个LRU链表长度*(隐式参数_db_block_max_scan_count/100)所规定的阀值时，其搜索操作将自行中止，该前台进程还会以信号通知dbwr进程并释放其所持有的LRU latch。dbwr后台进程在收到前台进程的信号信息后，会执行一次大批量的写出操作以使得LRU链表上有干净的clean buffer可用，在此过程中前台进程将处于free buffer wait等待事件中。dbwr后台进程为了写出LRUW与(LRU链表尾部)的脏块，其会主动去持有LRU latch并扫描该LRU链表(也是从尾部开始)试图找出脏块，并批量写出这些收集到得脏块。该DBWR的扫描深度(DBWR scan depth)由隐式参数_db_writer_scan_depth_pct的所指定，当DBWR所扫描的LRU链表长度等于整个LRU链表长度*(_db_writer_scan_depth_pct/100)时，DBWR将停止继续扫描LRU链表。

8i以后：以上这种情况一言以蔽之就是DBWR write for Free Request，这种情况在8i以后仍然奏效；hidden parameter _db_block_max_scan_pct依然健在，其默认值为40，当然也可以从x$kvit视图中”Max percentage of LRU list foreground can scan for free”相关列观察到。到10.1版本中_db_writer_scan_depth_pct(Percentage of LRU buffers for dbwr to scan when looking for dirty)仍健在其默认值为25，在10.2中被彻底废弃。由于引入了增量检查点，DBWn也会主动去遍历LRU链表，将发现的Dirty Buffer移至Checkpoint Queue(dirty queue)上，该扫描同样也受到隐式参数_db_writer_scan_depth_pct的限制。

b.若前台进程在遍历LRU链表，顺带将脏块(dirty block)移动到LRUW链表上时，前台进程可能意识到LRUW链表的长度已经达到了某种阀值。该阀值定义了dirty queue(脏队列)的最大长度，该阀值一般受隐式参数2*_db_block_write_batch或_db_large_dirty_queue的影响。当此时刻LRUW链表将不在接受任何脏块(dirty buffer)。DBWR后台进程将被通知完成批量清理脏块的任务。在此情形下，DBWR处于一种”慌乱”状态中将不遗余力地去清理LRUW与LRU(仍需扫描，扫描深度如上述参数)链表中的脏块。与此同时，前台进程将被禁止访问LRU链表以避免产生进一步弄脏块和扫描操作。这种情形很像操作系统空闲内存小于所设minfree时，所发生的paging。此外在V7323中DBWR进程批量写出IO的大小受制于隐式参数_db_block_write_batch的值，因为该参数在之后的版本中被彻底废弃了，所以不再介绍。

8i以后：此情形仍将触发DBWR写出，但_db_block_write_batch隐式参数被彻底废弃，所以实际限制dirty queue(checkpoint queue)长度的是隐式参数_db_large_dirty_queue，其默认为25个buffer；该参数也可以通过x$kvit内部视图的”kcbldq large dirty queue if kcbclw reaches this”观察，该参数到11.2中依然健在。

c.DBWR后台进程每三秒空闲超时后被唤醒，每次超时均会唤醒DBWR去遍历buffer headers(扫描2*_db_block_write_batch个块)以寻找并写出任意脏块(排序块，临时块会被跳过)。这样做的目的是不让DBWR过于空闲。

8i以后:DBWR仍会通过调用semtimedop函数陷入3s一次的睡眠中，若在过去三秒中DBWR始终空闲，那么在它醒来后会写出少量的脏块到磁盘上。但不同于早期版本，目前版本中DBWR会每三秒根据增量检查点的要求写出脏块(脏块来源于CKPT-Q链表及由DBWR进程主动去扫描LRU链表的tail)。详见Jonathan Lewis的<How have the log_checkpoint_interval and log_checkpoint_timeout changed from version 7?>:

“(Oracle decided to keep trickling dirty blocks to disc at a higher rate than had been effected by the old 3-second idle write rate (every 3 seconds, dbwr wakes up and writes a few blocks to disc if it has had no other work in the interval).? To achieve this, they changed the meaning of the two log checkpoint parameters. This change was made possible by a change in the architecture of the buffer management, which now allows Oracle to queue dirty buffers in the order that they were first made dirty ??.Amongst other things, Oracle already kept a low redo block address (lrba)on each buffer header for each dirty buffer. This identifies the first redo block that started the process of changing that buffered block from the state that is currently on disc to the dirty state that is in the buffer. The function of the log_checkpoint_interval is simply to limit the distance between a buffer’s lrba and the addreess of the redo block that lgwr is currently writing. If the lrba is too low, then the block was first dirtied too long ago and it has to be written to disc (and its lrba set to zero). Since Oracle now queues dirty blocks in the order they were first dirtied (i.e. lrba order) it is a quick and cheap process to find such blocks.For example: if lgwr is currently writing redo block 12,100 and the log_checkpoint_interval is set to 4,000, then dbwr will be cued to write any dirty blocks with an lrba less than 8,100. This check is carried out every 3 seconds, and I believe the control files and possibly any affected data files are updated with the SCN at which this incremental checkpoint took place.)” http://www.jlcomp.demon.co.uk/faq/log_checkpoint.html

d.8i以前不存在incremental checkpoint增量检查点，完全检查点发生时，LGWR将通知DBWR写出一系列的current,dirty和非临时buffer到磁盘上。注意因检查点写出的buffer块并不会被置为free状态，这些脏块仍保留在数据库高速缓存中，以减少不必要的物理IO。因为8i以后的检查点已发生巨大转变，故不再展开。

8i以后:完全检查点为增量检查点所”取代”，完全检查点仅在alter system checkpoint或shutdown(非abort)等少数情况下发生，增量检查点会引发checkpoint queue(dirty queue)上的脏块递进地被写出，每三秒CKPT后台进程将计算检查点目标RBA(Redo Block Address)，该目标RBA基于:当前RBA,log_checkpoint_timeout,log_checkpoint_interval,fast_start_mttr_target,fast_start_io_target,最小在线日志的大小等因素。当增量检查点发生时所有在目标RBA相应时间之前被弄脏的buffer块都当被写出(When a checkpoint is initiated, DBWR writes all buffers on the queue until the checkpoint RBA is less than the head of the CKPTQ RBA)。在Oracle 10g中实现了自动调整检查点，只要不显示地(explicitly)设置参数FAST_START_MTTR_TARGET为零，自动检查点调整都将被启用，数据库将以较低的I/O负载写出脏块以提高性能。需要注意的是当启用mttr advisor时有必要将log_checkpoint_timeout,log_checkpoint_interval,fast_start_io_target三个参数设置为零。– DBWR write For checkpoint。

e.9i以前的rac称作”ops oracle parallel server”，因为当时还没有出现cache fusion功能，所以节点间同步缓存需要通过将current块写入到磁盘上来完成，这种因为block ping request的需求而导致的DBWR写出，可以归类为write for ping request (The biggest performance robber in the OPS architecture was the DB block ping. A DB block ping would occur when an instance participating in an OPS database had a block in its cache that another participating instance required. In OPS, if another instance required the block in the cache of a second instance, the block would have to be written out to disk, the locks transferred, and then the block re-read into the requesting instance.)

8i以后:在Oracle 8i的ops中初步实现了cache fusion(Oracle 8i (OPS) introduced the initial phase of cache fusion. The data blocks were transferred from the SGA of one instance to the SGA of another instance without the need to write the blocks to disk.)到9i中cache fusion技术日渐成熟，ops也更名为”RAC real application cluster”以区别于老式的cluster。因为cache fusion这一革命性的特性出现，9i以后的oracle cluster中dbwr被从需要不断完成write for ping request的性能窘境中拯救出来了；因而block ping request这种触发写出的条件也不再成立。

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: cache management, checkpoint, dbwr, dirty buffer

基于行跟踪的ROWDEPENDENCIES ORA_ROWSCN信息

2011/01/02 by admin Leave a Comment

在Oracle 10g中的引入了ORA_ROWSCN伪列新特性。基于此种伪列所提供的信息，我们可以方便地找出某个数据块或某一个行最近被修改
的时间戳。在默认情况下，10g下表会以非行依赖性(NOROWDEPENDENCIES)的属性创建，这意味着我们可观察的ORA_ROWSCN信息是以块级跟踪的，无法分辨同一块内的多行间不同的修改时间。为了达到行级粒度的跟踪我们需要在建表时指定基于行依赖性的ROWDEPENDENCIES字句。如：

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> drop table maclean;
Table dropped.

SQL> create table maclean(t1 int,t2 timestamp) ;
Table created.

SQL> insert into maclean values(1,systimestamp);
1 row created.

SQL> commit;
Commit complete.

SQL> insert into maclean values(2,systimestamp);
1 row created.

SQL> commit;
Commit complete.


SQL> alter session set nls_timestamp_format='hh24:mi:ss';
Session altered.

SQL> col t2 for a35
SQL> col orscn for a35

SQL> select scn_to_timestamp(ora_rowscn) orscn,t2 from maclean;

ORSCN                               T2
----------------------------------- -----------------------------------
20:30:11                            20:29:56
20:30:11                            20:30:10

/* 可以看到默认情况下创建的数据表使用块级依赖性追踪(Block-level Dependency Tracking)
   故而其返回的ORA_ROWSCN伪列仅能代表某数据块最近被更新的SCN
*/

create table maclean_rd(t1 int,t2 timestamp) rowdependencies;
Table created.

SQL> select table_name,dependencies from user_tables where dependencies!='DISABLED';
TABLE_NAME                     DEPENDEN
------------------------------ --------
MACLEAN_RD                     ENABLED


/* 包括字典基表在内所有的表都会默认以NOROWDEPENDENCIES创建*/

SQL> insert into maclean_rd values(1,systimestamp);

1 row created.

SQL> commit;

Commit complete.

SQL> insert into maclean_rd values(2,systimestamp);

1 row created.

SQL> commit;

Commit complete.

SQL> select scn_to_timestamp(ora_rowscn) orscn,t2 from maclean_rd;

ORSCN                               T2
----------------------------------- -----------------------------------
20:31:26                            20:31:25
20:31:35                            20:31:37

/* 可以看到在行依赖性跟踪情况下，ORA_ROWSCN反映的时间戳与插入的时间戳间仍会有误差；
   显然这种误差部分源于scn_to_timestamp函数使用的smon_scn_time SCN记录表也仅是粗略记录SCN对应的时间戳。
*/

SQL> select dbms_rowid.rowid_relative_fno(rowid),dbms_rowid.rowid_block_number(rowid) from maclean_rd;

DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
------------------------------------ ------------------------------------
                                   1                                94122
                                   1                                94122

/* 以上通过rowid找到了插入的2行所在的数据块*/



SQL> alter system dump datafile '/s01/10gdb/oradata/CLINICA/datafile/o1_mf_system_6fp8d3f0_.dbf' block 94122;
System altered.

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug tracefile_name;
/s01/10gdb/admin/clinica/udump/clinica_ora_12934.trc

block_row_dump:
tab 0, row 0, @0x1f88
tl: 24 fb: --H-FL-- lb: 0x0  cc: 2
dscn 0x0000.0351b8bd
col  0: [ 2]  c1 02
col  1: [11]  78 6f 01 02 15 20 1a 21 d8 52 68
tab 0, row 1, @0x1f70
tl: 24 fb: --H-FL-- lb: 0x0  cc: 2
dscn 0x0000.0351b8c4
col  0: [ 2]  c1 03
col  1: [11]  78 6f 01 02 15 20 26 02 ab c2 f8

/* 可以从block dump中看到每行都多出了dscn信息，这就是基于行追踪的行级ORA_ROWSCN信息的来源。
   注意这里的dscn需要占用6个字节的空间，换而言之启用ROWDEPENDENCIES会为每一行多出6个字节的磁盘开销。
*/

/* 此外行级追踪仅能在创建表(create table)的同时指定，而无法使用ALTER TABLE来修改  */

SQL> alter table maclean move tablespace users ROWDEPENDENCIES;
alter table maclean move tablespace users ROWDEPENDENCIES
                                          *
ERROR at line 1:
ORA-14133: ALTER TABLE MOVE cannot be combined with other operations

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: block dump, ora_rowscn, row, scn

Oracle闩:Cache Buffers chains

2010/12/30 by admin 2 Comments

Latch cache buffers chains大约是Oracle中child latch数量最多，使用最为频繁的闩锁了。其子闩总数受到初始化参数(8i中的db_block_buffers/4)的影响，Oracle中有大量kernel函数有机会接手持有cache buffer chains latch:

col parent_name for a25
col location for a40

SELECT t1.ksllasnam "parent_name",
       t2.ksllwnam  "location"
  FROM x$ksllw t2, x$kslwsc t1
 WHERE t2.indx = t1.indx
   AND ksllasnam = 'cache buffers chains';

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbw_activate_granule
cache buffers chains      kcbw_first_buffer_free
cache buffers chains      kcbwxb
cache buffers chains      kcbw_examine_granule
cache buffers chains      kcbw_next_free
cache buffers chains      kcbw_first_buffer_free_2
cache buffers chains      kcbbckb
cache buffers chains      kcbbioe
cache buffers chains      kcbbic1
cache buffers chains      kcbbcwd
cache buffers chains      kcbbxsv

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbbwdb
cache buffers chains      kcbbic2
cache buffers chains      kcbkzs
cache buffers chains      kcbrmf2so
cache buffers chains      kcbget: exchange rls
cache buffers chains      kcbralloc
cache buffers chains      kcbgcur: deadlock
cache buffers chains      kcbgcur: buf pinned
cache buffers chains      kcbgtcr
cache buffers chains      kcbchg: no fast path
cache buffers chains      kcbchg: apply change

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbrra: buf exists
cache buffers chains      kcbrra: update buf flags
cache buffers chains      kcbema: find buf
cache buffers chains      kcbtema: find buf
cache buffers chains      kcbget: prewarm wait
cache buffers chains      kcbrfrebuf
cache buffers chains      kcbsod1
cache buffers chains      kcbrbrl
cache buffers chains      kcbgcur: kslbegin
cache buffers chains      kcbgtcr: kslbegin shared
cache buffers chains      kcbrls: kslbegin

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbchg: kslbegin: bufs not pinned
cache buffers chains      kcbchg: kslbegin: call CR func
cache buffers chains      kcbnlc
cache buffers chains      kcbget: exchange
cache buffers chains      kcbget: pin buffer
cache buffers chains      kcbnew: new latch again
cache buffers chains      kcbgkcbcr
cache buffers chains      kcbget: in cur_read
cache buffers chains      kcbesc: escalate
cache buffers chains      kcblbi
cache buffers chains      kcbcge

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbfrl
cache buffers chains      kcbzsc
cache buffers chains      kcbibr
cache buffers chains      kcbnew_1
cache buffers chains      kcbema
cache buffers chains      kcbsrbd
cache buffers chains      kcbso1: set no access
cache buffers chains      kcbtema
cache buffers chains      kcbso1: in done_clr
cache buffers chains      kcbsod2
cache buffers chains      kcbzcg

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbzre1
cache buffers chains      kcbrlb1
cache buffers chains      kcbchkrsod
cache buffers chains      kcbxbh
cache buffers chains      kcbzsck
cache buffers chains      kcbgtcr: fast path
cache buffers chains      kcbgtcr: kslbegin excl
cache buffers chains      kcbgtcrf
cache buffers chains      kcbfdgd
cache buffers chains      kcbdng
cache buffers chains      kcbbufaddr2hdr

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbo_ivd_process
cache buffers chains      kcbo_write_process
cache buffers chains      kcbo_exam_buf
cache buffers chains      kcb_pre_apply: kcbhq61
cache buffers chains      kcb_post_apply: kcbhq62
cache buffers chains      kcb_post_apply: kcbhq63
cache buffers chains      kcbnew : new esc failed
cache buffers chains      kcbesc : escalate failed
cache buffers chains      kcb_private_owner
cache buffers chains      kcb_is_private
cache buffers chains      kcb_unprivatize

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcb_restore_block_headers
cache buffers chains      kcb_flush_undo_buffers
cache buffers chains      kcbgcur - DEADL
cache buffers chains      kcbtbd
cache buffers chains      kcbzwc
cache buffers chains      kcbzwx
cache buffers chains      kcbrmflx
cache buffers chains      kcbzwb
cache buffers chains      kcbzgb: get latch after post
cache buffers chains      kcbzgb: scan from tail. nowait
cache buffers chains      kcbzgb: exit_loop

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbzib: multi-block read: nowait
cache buffers chains      kcbzib: finish free bufs
cache buffers chains      kcbzcb
cache buffers chains      kcbzdh
cache buffers chains      kcbdpr
cache buffers chains      kcbcxx
cache buffers chains      kcbzrn
cache buffers chains      kcbdpd: for specific dba
cache buffers chains      kcbdpd: dump all buffers
cache buffers chains      kcbzib: exchange rls
cache buffers chains      kcbzpnd: dump buffers

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbzhngcbk1: get hash chain latch no wai
                          t

cache buffers chains      kcbo_cxx
cache buffers chains      kcbz_check_obj_reuse_sanity
cache buffers chains      kcbzib_grlk
cache buffers chains      kcbz_force_maps
cache buffers chains      kcbrldflx: recover in-flux bufs
cache buffers chains      kcbra1fbuf: recover in-flux bufs
cache buffers chains      kcbrafb: flashback bufs:1
cache buffers chains      kcbrafb: flashback bufs:2

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kcbr_media_apply: find buffer
cache buffers chains      kcbr_issue_read: alloc buffer
cache buffers chains      kcbr_issue_read: retry alloc
cache buffers chains      kcbr_validate_read: mark corrupt
cache buffers chains      kcbr_apply_change: after apply
cache buffers chains      kcbr_mapply_change
cache buffers chains      kcbr_mrcv_clear_fgda
cache buffers chains      kclwlr
cache buffers chains      kclebs_1
cache buffers chains      kclcls
cache buffers chains      kclcsr_1

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclpred
cache buffers chains      kclcls_1
cache buffers chains      kclple_1
cache buffers chains      kclple_2
cache buffers chains      kclcls_2
cache buffers chains      kcllwr
cache buffers chains      kclwcrs
cache buffers chains      kclcrs_1
cache buffers chains      kclcsr
cache buffers chains      kclrls
cache buffers chains      kclwcrs_1

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclfbst_1
cache buffers chains      kclpdc_1
cache buffers chains      kclwcrs_2
cache buffers chains      kclwcrs_3
cache buffers chains      kclfpdb
cache buffers chains      kclfpdb_2
cache buffers chains      kclpdc_2
cache buffers chains      kcllkopb
cache buffers chains      kclgrantlk
cache buffers chains      kclwrt
cache buffers chains      kcllkopb_1

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclwcrs_4
cache buffers chains      kcllkopb_2
cache buffers chains      kclcls_4
cache buffers chains      kclpred_1
cache buffers chains      kclrclr_2
cache buffers chains      kclrecbst
cache buffers chains      kclgrantlk_1
cache buffers chains      kclcls_5
cache buffers chains      kclrwrite_1
cache buffers chains      kclrwrite_2
cache buffers chains      kclcopy

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclswrite
cache buffers chains      kclchash
cache buffers chains      kclcfusion
cache buffers chains      kclfchk_1
cache buffers chains      kclcfusion_1
cache buffers chains      kclblkdone
cache buffers chains      kclcfusion_2
cache buffers chains      kclrenounce
cache buffers chains      kclbla
cache buffers chains      kclpto_1
cache buffers chains      kclgrantlk_2

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclcomplete
cache buffers chains      kclshrshr
cache buffers chains      kclclaim
cache buffers chains      kclhngcbk1
cache buffers chains      kclblkdone_1
cache buffers chains      kclgvlk
cache buffers chains      kclblkdone_2
cache buffers chains      kclcclaim
cache buffers chains      kclrechk_1
cache buffers chains      kclrechk_2
cache buffers chains      kclbr

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclpto
cache buffers chains      kclpdcl
cache buffers chains      kclpdc_3
cache buffers chains      kclpdc_4
cache buffers chains      kclgcr_1
cache buffers chains      kclcls_6
cache buffers chains      kclevict
cache buffers chains      kcldle
cache buffers chains      kclrcopy
cache buffers chains      kclenter
cache buffers chains      kclrbast

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclexpand
cache buffers chains      kclcls_3
cache buffers chains      kclverify
cache buffers chains      kclaffinity
cache buffers chains      kclassert
cache buffers chains      kclobj
cache buffers chains      kclobj_1
cache buffers chains      kclobj_2
cache buffers chains      kclgclk
cache buffers chains      kclwcrs_5
cache buffers chains      kclscrs

parent_name               location
------------------------- ----------------------------------------
cache buffers chains      kclstalepi
cache buffers chains      kclstalepi_1
cache buffers chains      kclstalepi_2
cache buffers chains      kclgvlk_1
cache buffers chains      kclgclk_1
cache buffers chains      kclgclk_2
cache buffers chains      kclcsr_2
cache buffers chains      kcbvm

204 rows selected.

我们一般认为Latch结构是Mostly exclusive access的，也就是极少会有共享访问闩的机会。但Oracle一般对外宣称读取数据时服务进程是以共享模式使用cache buffers chains闩，这就造成了许多人误以为读读是不会出现latch: cache buffers chains争用的。
但是实际上查询语句大部分情况下仍需要以exclusive模式持有该类子闩(有时会以SHARED模式持有，这取决于读取时是使用kcbgtcr: kslbegin shared还是kcbgtcr: kslbegin excl；kcbgtcr是Oracle rdbms中重要的获取一致性读的函数，其含义为Kernal Cache Buffer GeT Cosistents Read，显然该函数存在两种获取cache buffers chains的方式即kslbegin shared和excl。与之相对应的是kcbgcur: kslbegin，kcbgcur的含义为Kernel Cache Buffer Get Current，该函数用以获取当前块以便修改，也就是”写”；很显然kcbgcur: kslbegin函数只有以excl排他方式持有child cache buffers chains latch的必要)，原因在于虽然是查询语句但同样需要修改buffer header结构，譬如修改tch抢手度、holder list的hash变量us_nxt、us_prv以及waiter list的hash变量wa_prv、wa_nxt等。换而言之读读是会引起Latch free:cache buffers chains等待的，而非许多人认为的仅有读写、写写会导致缓存链闩争用。

这个问题我们再用实验证明一遍:

SQL> drop table maclean;
drop table maclean
           *
ERROR at line 1:
ORA-00942: table or view does not exist

SQL> create table maclean tablespace users as select * from dba_objects;
Table created.

SQL> select count(*) from maclean;

  COUNT(*)
----------
     51944

SQL> /

  COUNT(*)
----------
     51944

SQL> /

  COUNT(*)
----------
     51944

SQL> select spid from v$process where addr =(select paddr from v$session where sid=(select distinct sid from v$mystat));

SPID
------------
6023

另开一个命令行窗口:

[maclean@rh2 ~]$ gdb $ORACLE_HOME/bin/oracle 6023
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /s01/10gdb/bin/oracle...(no debugging symbols found)...done.
Attaching to program: /s01/10gdb/bin/oracle, process 6023
Reading symbols from /s01/10gdb/lib/libskgxp10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libskgxp10.so
Reading symbols from /s01/10gdb/lib/libhasgen10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libhasgen10.so
Reading symbols from /s01/10gdb/lib/libskgxn2.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libskgxn2.so
Reading symbols from /s01/10gdb/lib/libocr10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libocr10.so
Reading symbols from /s01/10gdb/lib/libocrb10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libocrb10.so
Reading symbols from /s01/10gdb/lib/libocrutl10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libocrutl10.so
Reading symbols from /s01/10gdb/lib/libjox10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libjox10.so
Reading symbols from /s01/10gdb/lib/libclsra10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libclsra10.so
Reading symbols from /s01/10gdb/lib/libdbcfg10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libdbcfg10.so
Reading symbols from /s01/10gdb/lib/libnnz10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/10gdb/lib/libnnz10.so
Reading symbols from /usr/lib64/libaio.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaio.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00000039f280d8e0 in __read_nocancel () from /lib64/libpthread.so.0
(gdb) break kcbgtcr
Breakpoint 1 at 0x108c72c

回到原sqlplus窗口再次执行查询语句,会hang住:
SQL> select count(*) from maclean;

在gdb窗口:
(gdb) break kslfre
Breakpoint 2 at 0x7a133c

Breakpoint 1, 0x000000000108c72c in kcbgtcr ()
(gdb) c
Continuing.

Breakpoint 1, 0x000000000108c72c in kcbgtcr ()
(gdb) c
Continuing.

Breakpoint 1, 0x000000000108c72c in kcbgtcr ()
(gdb) c
Continuing.

Breakpoint 2, 0x00000000007a133c in kslfre ()
(gdb) c
Continuing.

多次continue直到出现kslfre内核函数，开一个新的sqlplus窗口:
SQL> set autotrace on;
SQL> select count(*) from v$latchholder;

  COUNT(*)
----------
         3

Execution Plan
----------------------------------------------------------
Plan hash value: 1575818826

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     1 |    13 |     0   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |            |     1 |    13 |            |          |
|*  2 |   FIXED TABLE FULL| X$KSUPRLAT |     1 |    13 |     0   (0)| 00:00:01 |
--------------------------------------------------------------------------------

v$latchholder的数据来源于X$KSUPRLAT内部视图，因为v$latchholder不带mode字段，所以我们直接观察X$KSUPRLAT

SQL> select KSUPRLAT "address",KSUPRLNM "name",KSUPRLMD "mode" from X$KSUPRLAT;

address          name                      mode
---------------- ------------------------- ---------------
00000000FCE40040 cache buffers chains      EXCLUSIVE
00000000FA696978 simulator lru latch       EXCLUSIVE
00000000FA6CDCE0 simulator hash latch      EXCLUSIVE

/* 可以看到即便是查询语句也是以EXCLUSIVE mode持有child cache buffers chains latch */

(gdb) quit
A debugging session is active.

        Inferior 1 [process 6235] will be detached.

Quit anyway? (y or n) y
Detaching from program: /s01/10gdb/bin/oracle, process 6235

/* 可以通过以下statistics可以看到以上读取为纯粹的逻辑读，没有物理读取的部分干扰*/

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        719  consistent gets
          0  physical reads
          0  redo size
        517  bytes sent via SQL*Net to client
        492  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: cache, chain, latch

Latches and Tuning:Latches

2010/12/27 by admin Leave a Comment

1. Query the V$FIXED_TABLE to determine the V$ views that provide latch information.

a. Execute the following query:> select name from v$fixed_table where name like ‘V$LATCH%’;

2. Remember that there are 14 levels of latches, numbered 0 – 13.

a. Take a look at three common latches and their level by executing the following query:> select name, level# from v$latch where name in
(‘cache buffers chain’,’library cache’,’redo allocation’);

You may or may not see each of these latches listed depending upon the state and activity on your database.

3. Each parent and child latch has one instance in X$KSLLT.

a. Execute the following query to view a KSLLT structure for a latch:> select v$latch_parent.addr,v$latch_parent.level#,v$latch_parent.latch#,
v$latch_parent.name
from v$latch_parent,x$ksllt
where v$latch_parent.addr = x$ksllt.addr;

You should at least see information for the ‘latch wait list’ latch. However, your results will vary depending upon the state and activity of your database.

4. Using the latch# from the previous query, you can query the X$KSLLD table to see the array of latch descriptors. One of these arrays is stored in every PGA.

a. Execute the following query to view the array for the associated latch#:> select * from X$KSLLD where indx = <latch# from previous query>;

A given process can select from this table to view the latch array.

Note: As you can see, output from the X$ tables is somewhat cryptic. However, knowledge of where to gather information regarding latches will help you to more effectively diagnose latch contention and communicate with Oracle support. The follow on lessons to this module will use the dynamic performance views and system tables for diagnosing and resolving latch contention.

5. Query the V$SYSTEM_PARAMETER to view the SPIN_COUNT and MAX_EXPONENTIAL_SLEEP parameters:

a. Execute the following query:> select name from v$system_parameter where name like ‘%spin%’ or name like ‘%max%’;

Note: If you are running Oracle8i you will get no rows returned. That’s because each of these parameters are now hidden.

b. Execute the following query using the X$KSPPI table which lists hidden parameters:

> select ksppinm from x$ksppi where (ksppinm like ‘%spin%’ or ksppinm like ‘%max%’) and ksppinm like ‘\_%’ escape ‘\’ ;

Also note the _MAX_SLEEPING_HOLDING_LATCH and _LATCH_WAIT_POSTING parameters. You can query the ksppidesc column to get a description of these parameters.

6. Remember that some latches are held for long periods of time, while others are not.

a. Execute the following query to view three latches that are held for long periods of time:> select name, level# from v$latch where
name in (‘library cache’,’shared pool’,’library cache load lock’,’latch wait list’);

Also notice the level numbers of these latches taking into account that most latches are in the range of 0 – 7. The latch wait list latch protects the linked list of waiters.

7. Certain wait events are associated with latches. A session process may wait on a latch and the wait time will be updated in the V$SESSION_EVENT and
V$SYSTEM_EVENT dynamice performance views. V$SESSION_WAIT also provides event informaton.

a. Execute the following query to view real time information on which sessions are currently waiting or have just waited for an event:> select sid, event, wait_time
from v$session_wait
order by wait_time, event;

Latch Free will appear in the output if there were statistics for this event.

b. Query V$SYSTEM_EVENT to view aggregate statistics for the latch free event:

> select * from v$system_event where event = ‘latch free’;

c. Query V$SESSION_EVENT to view statistics for the latch free event for each session:

> select * from v$session_event where event = ‘latch free’;

8. You can obtain information about specific latches by using both the V$SESSION_WAIT and V$LATCH views..

a. Execute the following query to view three latches that are held for long periods of time:> select distinct l.name “Latch”
from v$session_wait s, v$latch l
where s.p2 = l.latch#
and s.event = ‘latch free’;

You may or may not get a result set from this query if there are no session latch free events. If so then execute the next query to get an idea of wait events:

> select DISTINCT l.name “Latch”, p1, p2
from v$session_wait s, v$latch l
where s.p2 = l.latch#
and s.wait_time = 0;

Note: Remember that P1 is the latch address and P2 is the latch number. The DISTINCT operator is used because a process may hold more than one of the same latch. Run the same query without the DISTINCT operator and notice the result set will have multiple P1 values for the same latch.

> select l.name “Latch”, p1, p2
from v$session_wait s, v$latch l
where s.p2 = l.latch#
and s.wait_time = 0;

9. Latch contention occurs when a latch is requested by several processes at the same time. The impact of latch contention can be assessed using the sleep histogram in V$LATCH.

a. Execute the following query to derive information about latch sleeps:> select name, sleeps * sleeps / (misses – spin_gets ) impact
from v$latch
where sleeps > 0;

b. The SLEEP columns in V$LATCH represent the sleep buckets in the histogram. Describe V$LATCH to view the sleep buckets:

> desc v$latch

Note: SPIN_GETS is also a sleep bucket. Only the first four buckets are used.
GETS = immediate_gets + (MISSES(SPIN_GETS+sleep_gets(SLEEP1 + SLEEP2 + SLEEP3))

10. You can also derive information about latches by creating a trace file.

a. To create a trace file execute the following command:> alter session set events ‘immediate trace name latches level 10’

b. The trace file will be created in the directory that is specified by the parameter USER_DUMP_DEST. View this file and you will see information about sleeps and nowait times.

c. If child latch#’s are listed in the trace file, you can query V$LATCH_CHILDREN to acquire additional information.

Filed Under: Oracle, Oracle Internal Research内部原理研究 Tagged With: get, latch, sleep, spin