cache management

到底是谁扣动了database writer的扳机？初学Oracle的朋友都会对dbwr这个后台进程有一种模糊的印象，dbwr何时会被触发？很多人大约会回答当发生检查点或者当某些脏块在LRU链表上处于较冷的一端时。同时又有许多关注于宏观架构的工程师会将dbwr的写出规律归结为是lazy(懒)的。Oracle作为目前市场占有率最高的商用数据库，其各种内部算法都可以算得上是商业机密；虽然不断有专家为我们”解密”，但在我的观念中这些内部原理都与真理之冠有着不大不小的差别。所以显然我要描述的是我个人对于database writer以及cache management(缓存管理)的理解，这些理解在一定程度上是能够自洽的，但我无法保证它们必然准确无误。

要详细描述dbwr的工作原理，我们需要从久远年代的版本V7323说起，当时的db writer和cache management已经十分成熟了，8i以后只是引入了增量检查点等特性，dba不用再关心db writer受一些细节参数的影响，而只需要关注增量检查点的活跃程度就可以了。以下我们列出在V7323中，dbwr可能被触发写出的几种情况:

a.当前台进程需要将磁盘上的物理数据块读取到数据库高速缓存中(db cache)时，其首先需要在数据库缓存中寻找到一块可用的free(空闲) buffer。为了寻找这样free buffer，该前台进程首先需要以排他方式持有相关LRU链表的latch(闩)，并在该LRU链表上扫描所可用的Free buffer，扫描都会从LRU链表的尾端开始，也就是”较冷”的一端。在此过程中，前台进程沿着由尾到头的方向所遍历到的脏块将被移动到LRUW链表上(注意：一个buffer同时只可能处于一个链表上)；此外相关的统计信息如dirty buffer inspected及free buffers inspected将会累增。若该前台进程在LRU链表上搜索的范围超过了整个LRU链表长度*(隐式参数_db_block_max_scan_count/100)所规定的阀值时，其搜索操作将自行中止，该前台进程还会以信号通知dbwr进程并释放其所持有的LRU latch。dbwr后台进程在收到前台进程的信号信息后，会执行一次大批量的写出操作以使得LRU链表上有干净的clean buffer可用，在此过程中前台进程将处于free buffer wait等待事件中。dbwr后台进程为了写出LRUW与(LRU链表尾部)的脏块，其会主动去持有LRU latch并扫描该LRU链表(也是从尾部开始)试图找出脏块，并批量写出这些收集到得脏块。该DBWR的扫描深度(DBWR scan depth)由隐式参数_db_writer_scan_depth_pct的所指定，当DBWR所扫描的LRU链表长度等于整个LRU链表长度*(_db_writer_scan_depth_pct/100)时，DBWR将停止继续扫描LRU链表。

8i以后：以上这种情况一言以蔽之就是DBWR write for Free Request，这种情况在8i以后仍然奏效；hidden parameter _db_block_max_scan_pct依然健在，其默认值为40，当然也可以从x$kvit视图中”Max percentage of LRU list foreground can scan for free”相关列观察到。到10.1版本中_db_writer_scan_depth_pct(Percentage of LRU buffers for dbwr to scan when looking for dirty)仍健在其默认值为25，在10.2中被彻底废弃。由于引入了增量检查点，DBWn也会主动去遍历LRU链表，将发现的Dirty Buffer移至Checkpoint Queue(dirty queue)上，该扫描同样也受到隐式参数_db_writer_scan_depth_pct的限制。

b.若前台进程在遍历LRU链表，顺带将脏块(dirty block)移动到LRUW链表上时，前台进程可能意识到LRUW链表的长度已经达到了某种阀值。该阀值定义了dirty queue(脏队列)的最大长度，该阀值一般受隐式参数2*_db_block_write_batch或_db_large_dirty_queue的影响。当此时刻LRUW链表将不在接受任何脏块(dirty buffer)。DBWR后台进程将被通知完成批量清理脏块的任务。在此情形下，DBWR处于一种”慌乱”状态中将不遗余力地去清理LRUW与LRU(仍需扫描，扫描深度如上述参数)链表中的脏块。与此同时，前台进程将被禁止访问LRU链表以避免产生进一步弄脏块和扫描操作。这种情形很像操作系统空闲内存小于所设minfree时，所发生的paging。此外在V7323中DBWR进程批量写出IO的大小受制于隐式参数_db_block_write_batch的值，因为该参数在之后的版本中被彻底废弃了，所以不再介绍。

8i以后：此情形仍将触发DBWR写出，但_db_block_write_batch隐式参数被彻底废弃，所以实际限制dirty queue(checkpoint queue)长度的是隐式参数_db_large_dirty_queue，其默认为25个buffer；该参数也可以通过x$kvit内部视图的”kcbldq large dirty queue if kcbclw reaches this”观察，该参数到11.2中依然健在。

c.DBWR后台进程每三秒空闲超时后被唤醒，每次超时均会唤醒DBWR去遍历buffer headers(扫描2*_db_block_write_batch个块)以寻找并写出任意脏块(排序块，临时块会被跳过)。这样做的目的是不让DBWR过于空闲。

8i以后:DBWR仍会通过调用semtimedop函数陷入3s一次的睡眠中，若在过去三秒中DBWR始终空闲，那么在它醒来后会写出少量的脏块到磁盘上。但不同于早期版本，目前版本中DBWR会每三秒根据增量检查点的要求写出脏块(脏块来源于CKPT-Q链表及由DBWR进程主动去扫描LRU链表的tail)。详见Jonathan Lewis的<How have the log_checkpoint_interval and log_checkpoint_timeout changed from version 7?>:

“(Oracle decided to keep trickling dirty blocks to disc at a higher rate than had been effected by the old 3-second idle write rate (every 3 seconds, dbwr wakes up and writes a few blocks to disc if it has had no other work in the interval).? To achieve this, they changed the meaning of the two log checkpoint parameters. This change was made possible by a change in the architecture of the buffer management, which now allows Oracle to queue dirty buffers in the order that they were first made dirty ??.Amongst other things, Oracle already kept a low redo block address (lrba)on each buffer header for each dirty buffer. This identifies the first redo block that started the process of changing that buffered block from the state that is currently on disc to the dirty state that is in the buffer. The function of the log_checkpoint_interval is simply to limit the distance between a buffer’s lrba and the addreess of the redo block that lgwr is currently writing. If the lrba is too low, then the block was first dirtied too long ago and it has to be written to disc (and its lrba set to zero). Since Oracle now queues dirty blocks in the order they were first dirtied (i.e. lrba order) it is a quick and cheap process to find such blocks.For example: if lgwr is currently writing redo block 12,100 and the log_checkpoint_interval is set to 4,000, then dbwr will be cued to write any dirty blocks with an lrba less than 8,100. This check is carried out every 3 seconds, and I believe the control files and possibly any affected data files are updated with the SCN at which this incremental checkpoint took place.)” http://www.jlcomp.demon.co.uk/faq/log_checkpoint.html

d.8i以前不存在incremental checkpoint增量检查点，完全检查点发生时，LGWR将通知DBWR写出一系列的current,dirty和非临时buffer到磁盘上。注意因检查点写出的buffer块并不会被置为free状态，这些脏块仍保留在数据库高速缓存中，以减少不必要的物理IO。因为8i以后的检查点已发生巨大转变，故不再展开。

8i以后:完全检查点为增量检查点所”取代”，完全检查点仅在alter system checkpoint或shutdown(非abort)等少数情况下发生，增量检查点会引发checkpoint queue(dirty queue)上的脏块递进地被写出，每三秒CKPT后台进程将计算检查点目标RBA(Redo Block Address)，该目标RBA基于:当前RBA,log_checkpoint_timeout,log_checkpoint_interval,fast_start_mttr_target,fast_start_io_target,最小在线日志的大小等因素。当增量检查点发生时所有在目标RBA相应时间之前被弄脏的buffer块都当被写出(When a checkpoint is initiated, DBWR writes all buffers on the queue until the checkpoint RBA is less than the head of the CKPTQ RBA)。在Oracle 10g中实现了自动调整检查点，只要不显示地(explicitly)设置参数FAST_START_MTTR_TARGET为零，自动检查点调整都将被启用，数据库将以较低的I/O负载写出脏块以提高性能。需要注意的是当启用mttr advisor时有必要将log_checkpoint_timeout,log_checkpoint_interval,fast_start_io_target三个参数设置为零。– DBWR write For checkpoint。

e.9i以前的rac称作”ops oracle parallel server”，因为当时还没有出现cache fusion功能，所以节点间同步缓存需要通过将current块写入到磁盘上来完成，这种因为block ping request的需求而导致的DBWR写出，可以归类为write for ping request (The biggest performance robber in the OPS architecture was the DB block ping. A DB block ping would occur when an instance participating in an OPS database had a block in its cache that another participating instance required. In OPS, if another instance required the block in the cache of a second instance, the block would have to be written out to disk, the locks transferred, and then the block re-read into the requesting instance.)

8i以后:在Oracle 8i的ops中初步实现了cache fusion(Oracle 8i (OPS) introduced the initial phase of cache fusion. The data blocks were transferred from the SGA of one instance to the SGA of another instance without the need to write the blocks to disk.)到9i中cache fusion技术日渐成熟，ops也更名为”RAC real application cluster”以区别于老式的cluster。因为cache fusion这一革命性的特性出现，9i以后的oracle cluster中dbwr被从需要不断完成write for ping request的性能窘境中拯救出来了；因而block ping request这种触发写出的条件也不再成立。

Who pulls the background process dbwr’s trigger?