诗檀软件Biot 分享《使用VirtualBox在Oracle Linux 5.7上安装Oracle Database 11g Release 2 RAC的最佳实践》,下载地址:
GC FREELIST等待事件 freelist empty
kclevrpg Lock element event number is the name hash bucket the LE belongs to.
kclnfndnew – Find an le for a given name
ORACLE RAC节点意外重启Node Eviction诊断流程图
导致实例逐出的五大问题 (Doc ID 1526186.1)
Oracle Database – Enterprise Edition – 版本 10.2.0.1 到 11.2.0.3 [发行版 10.2 到 11.2]
本文档所含信息适用于所有平台
本文档针对导致实例驱逐的主要问题为 DBA 提供了一个快速概述。
DBA
实例崩溃,警报日志显示“ORA-29740:evicted by member …(被成员…驱逐)”错误。
检查所有实例的 lmon 跟踪文件,这对确定实例驱逐的原因代码而言非常重要。查找包含“kjxgrrcfgchk:Initiating reconfig”的行。
这将提供一个原因代码,如“kjxgrrcfgchk:Initiating reconfig, reason 3”。实例驱逐时发生的大多数 ora-29740 错误是由于原因 3(“通信故障”) 造成的。
Document 219361.1 (Troubleshooting ORA-29740 in a RAC Environment) 介绍了以下几种可能造成原因 3的 ora-29740 错误原因:
a) 网络问题。
b) 资源耗尽(CPU、I/O 等)
c) 严重的数据库争用。
d) Oracle bug。
实例驱逐时,警报日志显示许多“IPC send timeout”错误。此消息通常伴随数据库性能问题。
lmon、lms 和 lmd 进程报告“IPC send timeout”错误的另一个原因是网络问题或服务器资源(CPU 和内存)问题。这些进程可能无法获得 CPU 运行调度或这些进程发送的网络数据包丢失。
涉及 lmon、lmd 和 lms 进程的通信问题导致实例驱逐。被驱逐实例的警报日志显示的信息类似于如下示例
IPC Send timeout detected.Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309
如果某实例被驱逐,警报日志中的“IPC Send timeout detected(检测到 IPC 发送超时)”通常伴随着其它问题,如 ora-29740 和“Waiting for clusterware split-brain resolution(等待集群件“脑裂”解决方案)”
1) 检查网络,确保无网络错误,如 UDP 错误或 IP 数据包丢失或故障错误。
2) 检查网络配置,确保所有节点上的所有网络配置均设置正确。
例如,所有节点上 MTU 的大小必须相同,并且如果使用巨帧,交换机也能够支持大小为 9000 的 MTU。
3) 检查服务器是否存在 CPU 负载问题或可用内存不足。
4) 检查数据库在实例驱逐之前是否正处于挂起状态或存在严重的性能问题。
5) 检查 CHM (Cluster Health Monitor) 输出,以查看服务器是否存在 CPU 或内存负载问题、网络问题或者 lmd 或 lms 进程出现死循环。CHM 输出只能在特定平台和版本中使用,因此请参阅 CHM 常见问题 Document 1328466.1
6) 如果 OSWatcher 尚未设置,请按照 Document 301137.1 中的说明进行设置以运行 OSWatcher。
CHM 输出不可用时,使用 OSWatcher 输出将有所帮助。
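下面是一个与上述第 1)~3) 点对应的简单检查示例(仅为示意脚本,假设为 Linux 环境;其中网卡名 eth1 和对端私网 IP 192.168.1.102 均为假设值,请按实际环境替换):

#!/bin/bash
# 示例脚本:核对私网错误统计、MTU 一致性与服务器负载(仅为示意,按实际环境调整)
IFACE=eth1                 # 假设的私网网卡名
PEER_IP=192.168.1.102      # 假设的远端节点私网 IP

# 1) 检查 UDP 错误与丢包统计
netstat -su | egrep -i "error|loss|failed"
ifconfig $IFACE | egrep "MTU|errors|dropped|overruns"

# 2) 检查 MTU 是否一致;若使用 9000 的巨帧,用不允许分片的大包 ping 验证交换机支持
ping -c 3 -M do -s 8972 $PEER_IP    # 8972 = 9000 - 28 字节包头

# 3) 检查 CPU 负载与可用内存
uptime
free -m
vmstat 3 5

如果巨帧 ping 失败或各节点 MTU 不一致,应优先排查网络配置。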
在实例崩溃/驱逐前,该实例或数据库正处于挂起状态。当然,也可能是节点挂起。
在执行驱逐其他实例动作的实例警报日志中,您可能会看到与以下消息类似的消息:
Remote instance kill is issued [112:1]:8
或者
Evicting instance 2 from cluster
在一个或多个实例崩溃之前,警报日志显示“Waiting for clusterware split-brain resolution(等待集群件“脑裂”解决方案)”。这通常伴随着“Evicting instance n from cluster(从集群驱逐实例 n)”,其中 n 是指被驱逐的实例编号。
常见原因有:
1) 实例级别的“脑裂”通常由网络问题导致,因此检查网络设置和连接非常重要。但是,因为如果网络已关闭,集群件 (CRS) 就会出现故障,所以只要 CRS 和数据库使用同一网络,则网络不太可能会关闭。
2) 服务器非常繁忙和/或可用内存量低(频繁的交换和内存扫描),将阻止 lmon 进程被调度。
3) 数据库或实例正处于挂起状态,并且 lmon 进程受阻。
4) Oracle bug
以上原因与问题 1的原因相似(警报日志显示 ora-29740 是实例崩溃/驱逐的原因)。
1) 检查网络,确保无网络错误,如 UDP 错误或 IP 数据包丢失或故障错误。
2) 检查网络配置,确保所有节点上的所有网络配置均设置正确。
例如,所有节点上 MTU 的大小必须相同,并且如果使用巨帧,交换机也能够支持大小为 9000 的 MTU。
3) 检查服务器是否存在 CPU 负载问题或可用内存不足。
4) 检查数据库在实例驱逐之前是否正处于挂起状态或存在严重的性能问题。
5) 检查 CHM (Cluster Health Monitor) 输出,以查看服务器是否存在 CPU 或内存负载问题、网络问题或者 lmd 或 lms 进程出现死循环。CHM 输出只能在特定平台和版本中使用,因此请参阅 CHM 常见问题 Document 1328466.1
6) 如果 OSWatcher 尚未设置,请按照 Document 301137.1 中的说明进行设置以运行 OSWatcher。
CHM 输出不可用时,使用 OSWatcher 输出将有所帮助。
一个实例驱逐其他实例时,在问题实例自己关闭之前,所有实例都处于等待状态,但是如果问题实例因为某些原因不能终止自己,发起驱逐的实例将发出 Member Kill 请求。Member Kill 请求会要求 CRS 终止问题实例。此功能适用于 11.1 及更高版本。
例如,以上消息表示终止实例 8 的 Member Kill 请求已发送至 CRS。
问题实例由于某种原因正处于挂起状态且无响应。这可能是由于节点存在 CPU 和内存问题,并且问题实例的进程无法获得 CPU 运行调度。
第二个常见原因是数据库资源争用严重,导致问题实例无法完成远程实例驱逐该实例的请求。
另一个原因可能是由于实例尝试中止自己时,一个或多个进程“幸存”了下来。除非实例的所有进程全部终止,否则 CRS 不认为该实例已终止,而且不会通知其它实例该问题实例已经被终止。这种情况下的一个常见问题是一个或多个进程变成僵尸进程且未终止。
这会导致 CRS 通过节点重启或 rebootless restart(只重启 CRS 堆栈而不重启节点)来进行恢复。这种情况下,问题实例的警报日志显示:
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 23305
(实例终止未能终止一个或多个进程
实例被 LMON, pid = 23305 终止)
1) 查找数据库或实例挂起的原因。对数据库或实例挂起问题进行故障排除时,获取全局 systemstate 转储和全局hang analyze 转储是关键。如果无法获取全局 systemstate 转储,则应获取在大致相同时间所有实例的本地 systemstate 转储。
2) 检查 CHM (Cluster Health Monitor) 输出,以查看服务器是否存在 CPU 或内存负载问题、网络问题或者 lmd 或 lms 进程出现死循环。CHM 输出只能在某些平台和版本中使用,因此请参阅 CHM 常见问题Document 1328466.1
3) 如果 OSWatcher 尚未设置,请按照 Document 301137.1 中的说明进行设置以运行 OSWatcher。
CHM 输出不可用时,使用 OSWatcher 输出将有所帮助.
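针对上面第 1) 点提到的全局 systemstate 和 hanganalyze 转储,下面给出一个简单的获取示例(仅为示意,假设可以用 SYSDBA 登录问题实例;转储级别 3 和 258 是常见取值,实际操作请以 Oracle 支持的建议为准):

$ sqlplus / as sysdba <<EOF
oradebug setmypid
oradebug unlimit
-- 全局(所有实例)hang analyze 转储
oradebug -g all hanganalyze 3
-- 全局 systemstate 转储,级别 258 带有简短调用栈
oradebug -g all dump systemstate 258
-- 显示本地跟踪文件位置
oradebug tracefile_name
exit
EOF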
【转】导致 Scan VIP 和 Scan Listener(监听程序)出现故障的最常见的 5 个问题 (Doc ID 1602038.1)
Oracle Database – Enterprise Edition – 版本 11.2.0.1 到 11.2.0.3 [发行版 11.2]
本文档所含信息适用于所有平台
本说明简要总结了导致 SCAN VIP 和 SCAN LISTENERS 故障的最常见问题
所有遇到 SCAN 问题的用户
在其中一个节点上,SCAN VIP 显示状态“UNKNOWN”和“CHECK TIMED OUT”
另两个 SCAN VIP 在其他节点上启动,显示状态“ONLINE”
crsctl stat res -t
——————————————————————————–
Cluster Resources
——————————————————————————–
ora.scan1.vip 1 ONLINE UNKNOWN rac2 CHECK TIMED OUT
ora.scan2.vip 1 ONLINE ONLINE rac1
ora.scan3.vip 1 ONLINE ONLINE rac1
SCAN VIP 是 11.2 版本集群件的新功能。
安装之后,验证 SCAN 配置和状态:
– crsctl status resource -w 'TYPE = ora.scan_vip.type' -t
必须显示3 个SCAN 地址 ONLINE
ora.scan1.vip 1 ONLINE ONLINE rac2
ora.scan2.vip 1 ONLINE ONLINE rac1
ora.scan3.vip 1 ONLINE ONLINE rac1
– crsctl status resource -w 'TYPE = ora.scan_listener.type' -t
应显示 LISTENER_SCAN<x> ONLINE
– srvctl config scan /srvctl config scan_listener
显示 SCAN 和 SCAN listener(监听程序)配置:scan 名称、网络和所有 SCAN VIP(名称和 IP)、端口
– cluvfy comp scan
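下面把上述验证命令整理成一个简单的检查脚本(仅为示意,假设以 grid 用户执行,GRID_HOME 已指向集群件安装目录,SCAN 名称 rac-scan 为假设值):

#!/bin/bash
# 安装后验证 SCAN VIP / SCAN Listener 的配置与状态
$GRID_HOME/bin/crsctl status resource -w 'TYPE = ora.scan_vip.type' -t
$GRID_HOME/bin/crsctl status resource -w 'TYPE = ora.scan_listener.type' -t
$GRID_HOME/bin/srvctl config scan
$GRID_HOME/bin/srvctl config scan_listener
$GRID_HOME/bin/cluvfy comp scan

# nslookup 应以轮询方式依次返回各个 SCAN VIP(rac-scan 为示例名称)
nslookup rac-scan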
在执行 SCAN VIP 和 SCAN listener故障切换后,实例未注册到 SCAN listener。这种情况只会发生在其中1 个 scan listener上。客户机连接间歇性出现“ORA-12514 TNS:listener does not currently know of service requested in connect descriptor”。
1. 未发布的 Bug 12659561:在执行 scan listener故障切换后,数据库实例可能未注册到 scan listener(请参阅 Note 12659561.8),这一问题已在 11.2.0.3.2 中修复,针对 11.2.0.2 的 Merge patch13354057 适用于特定平台。
2. 未发布的 Bug 13066936:在执行 scan 故障切换时,实例未注册服务(请参阅 Note 13066936.8)。
1) 对于以上两个 Bug,解决方法是执行以下步骤,在未注册到 SCAN listener的数据库实例上注销并重新注册remote_listener。
show parameter remote_listener
alter system set remote_listener='';
alter system register;
alter system set remote_listener='<scan>:<port>';
alter system register;
2) 服务未注册到 SCAN listener(监听程序)时要检查的其他要点:
a. 正确定义了 remote_listener 和 local_listener
b. sqlnet.ora 中定义了 EZCONNECT,示例:NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)
c. /etc/hosts 或 DNS 中定义了 SCAN,并且如果在两处都定义,则检查是否存在任何不匹配情况
d. nslookup <scan> 应以 round-robin (循环)方式显示 SCAN VIP
e. 如果未配置 Class of Secure Transports (COST),则不要在 listener.ora 中设置 SECURE_REGISTER_<listener>。关于服务注册状态的检查,可参考下面的示例。
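下面是检查服务注册情况的一个简单示例(仅为示意,假设 SCAN listener 名为 LISTENER_SCAN1,并在运行该 listener 的节点上以 grid 用户执行):

# 确认 SCAN listener 当前运行在哪个节点
srvctl status scan_listener

# 在运行 LISTENER_SCAN1 的节点上查看已经注册到它的服务和实例
lsnrctl services LISTENER_SCAN1

# 确认 SCAN 名称解析(应轮询返回各个 SCAN VIP)
nslookup <scan>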
公网关闭时,Scan Vip 应切换到下一个节点。在 11.2.0.1 的一些环境中,Scan Vip 可能会停留在错误的节点上。
Database – RAC/Scalability 社区
为了与 Oracle 专家和业内同行进一步讨论这个话题,我们建议您加入 My Oracle Support 的 Database – RAC/Scalability 社区参与讨论。
10.2.0.4以后vip不会自动relocate back回原节点, 原因是ORACLE开发人员发现在实际使用中会遇到这样的情况: relocate back回原节点 需要停止VIP并在原始节点再次启动该VIP,但是如果原始节点上的公共网络仍不可用,则这个relocate的尝试将再次失败而failover到第二节点。 在此期间VIP将不可用,所以从10.2.0.4和11.1开始,默认的实例检查将不会自动relocate vip到原始节点。
详细见下面的Note介绍:
Applies to:
Oracle Server – Enterprise Edition – Version 10.2.0.4 to 11.1.0.7 [Release 10.2 to 11.1]
Information in this document applies to any platform.
Symptoms
Starting from 10.2.0.4 and 11.1, VIP does not fail-over back to the original node even after the public network problem is resolved. This behavior is the default behavior in 10.2.0.4 and 11.1 and is different from that of 10.2.0.3
Cause
This is actually the default behavior in 10.2.0.4 and 11.1
In 10.2.0.3, on every instance check, the instance attempted to relocate the VIP back to the preferred node (original node), but that required stopping the VIP and then attempt to restart the VIP on the original node. If the public network on the original node is still down, then the attempt to relocate VIP to the original node will fail and the VIP will fail-over back to the secondary node. During this time, the VIP is not available, so starting from 10.2.0.4 and 11.1, the default behavior is that the instance check will not attempt to relocate the VIP back to the original node.
Solution
If the default behavior of 10.2.0.4 and 11.1 is not desired and if there is a need to have the VIP relocate back to the original node automatically when the public network problem is resolved, use the following workaround
Uncomment the line
ORA_RACG_VIP_FAILBACK=1 && export ORA_RACG_VIP_FAILBACK
in the racgwrap script in $ORACLE_HOME/bin
With the above workaround, VIP will relocate back to the original node when CRS performs the instance check, so in order for the VIP to relocate automatically, the node must have at least one instance running.
The instance needs to be restarted or CRS needs to be restarted to have the VIP start relocating back to the original node automatically if the change is being made on the existing cluster.
Relying on automatic relocation of VIP can take up to 10 minutes because the instance check is performed once every 10 minutes. Manually relocating the VIP is the only way to guarantee quick relocation of the VIP back to the original node.
To manually relocate the VIP, start the nodeapps by issuing
srvctl start nodeapps -n <node name>
Starting the nodeapps does not harm the online resources such as ons and gsd.
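下面是上述 workaround 的一个操作示例(仅为示意;节点名 rac1 为假设值,修改 racgwrap 前请先备份,且只在确实需要 VIP 自动 failback 时使用):

# 1) 在每个节点的 $ORACLE_HOME/bin/racgwrap 中取消下面这一行的注释
#      ORA_RACG_VIP_FAILBACK=1 && export ORA_RACG_VIP_FAILBACK
vi $ORACLE_HOME/bin/racgwrap

# 2) 在已有集群上修改后,需要重启实例或 CRS,之后实例检查(约每 10 分钟一次)
#    才会自动把 VIP 迁回原节点

# 3) 如需立即迁回,可手动启动 nodeapps(对 ons、gsd 等在线资源无影响)
srvctl start nodeapps -n rac1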
【Maclean Liu技术分享】12c 12.1.0.1 RAC Real Application Cluster 安装教学视频 基于Vbox+Oracle Linux 5.7
安装步骤脚本下载:
Maclean技术分享 12c RAC 安装OEL 5.7 12.1.0.1 RAC VBox安装脚本.txt
视频观看地址:
本文永久地址:https://www.askmac.cn/archives/oracle-clusterware-11-2.html
在向 Oracle Database 11g 第 2 版(11.2)过渡的过程中,Oracle 集群件发生了大量变化:CRSD 被完全重新设计,引入了"本地 CRS"(OHASD)以及紧密集成的代理层来取代原有的 RACG 层,并新增了 Grid Naming Service、即插即用(GPnP)、集群时间同步服务和 Grid IPC 等功能。集群同步服务(CSS)的变化可能最小,但它为这些新功能提供了支撑,并增加了诸如 IPMI 支持之类的新能力。
借助这份技术文档,我们希望把多年来在 11.2 开发过程中积累的知识整理出来,传递给刚刚开始学习 Oracle 11.2 集群件的人。本文提供了总体概述,以及相关诊断和调试的详细信息。
由于这是 Oracle 集群件诊断文章的第一个版本,并未覆盖 11.2 集群件的全部细节。如果你觉得可以对本文档提供补充或修改,请告诉我们。
这节将介绍Oracle集群的主要守护进程
下图是关于Oracle集群11.2版本所使用的守护进程,资源和代理的高度概括。
11.2版本和之前的版本的第一个大的区别就是OHASD守护进程,替代了所有在11.2之前版本里的初始化脚本。
Oracle 集群由两个独立的堆栈组成:上层的 Cluster Ready Services 守护进程(CRSD)堆栈和下层的 Oracle High Availability Services 守护进程(ohasd)堆栈。这两个堆栈各自包含若干用于支撑集群操作的进程,后面的章节将详细介绍这些内容。
OHASD是启动一个节点的所有其他后台程序的守护进程。OHASD将替换所有存在于11.2之前版本的初始化脚本。
OHASD 的入口点是 /etc/inittab 文件,它会执行 /etc/init.d/ohasd 和 /etc/init.d/init.ohasd。/etc/init.d/ohasd 脚本是包含 start 和 stop 操作的 RC 脚本;/etc/init.d/init.ohasd 脚本是 OHASD 的框架控制脚本,由它派生 Grid_home/bin/ohasd.bin 可执行文件。
集群控制文件位于 /etc/oracle/scls_scr/<hostname>/root(这是 Linux 上的位置),由 CRSCTL 维护;换句话说,"crsctl enable/disable crs" 命令会更新该目录中的文件。
如:
[root@rac1 root]# ls /etc/oracle/scls_scr/rac1/root
crsstart  ohasdrun  ohasdstr
# crsctl enable -h
Usage:
  crsctl enable crs
     Enable OHAS autostart on this server

# crsctl disable -h
Usage:
  crsctl disable crs
     Disable OHAS autostart on this server
scls_scr/<hostname>/root/ohasdstr文件的内容是控制CRS堆栈的自动启动;文件中的两个可能的值是“enable” – 启用自动启动,或者“disable” – 禁用自动启动。
scls_scr/<hostname>/root/ohasdrun文件控制init.ohasd脚本。三个可能的值是“reboot” – 和OHASD同步,“restart” – 重启崩溃的OHASD,“stop” – 计划OHASD关机。
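下面的示例演示如何查看这些控制文件的内容,以及 crsctl enable/disable crs 对 ohasdstr 的影响(主机名 rac1 仅为示例,需要 root 权限执行):

# 查看控制文件当前内容
cat /etc/oracle/scls_scr/rac1/root/ohasdstr   # enable 或 disable
cat /etc/oracle/scls_scr/rac1/root/ohasdrun   # reboot / restart / stop

# 禁用再重新启用 CRS 堆栈自动启动,并观察 ohasdstr 的变化
crsctl disable crs
cat /etc/oracle/scls_scr/rac1/root/ohasdstr
crsctl enable crs
cat /etc/oracle/scls_scr/rac1/root/ohasdstr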
OHASD 给 Oracle 11.2 集群件带来的最大好处之一,是能够以集群方式运行某些 CRSCTL 命令。这些命令完全独立于操作系统,只依赖 ohasd。只要 ohasd 正在运行,就可以执行诸如启动、停止和检查远程节点堆栈状态之类的远程操作。
集群命令包括:
[root@rac2 bin]# ./crsctl stop cluster
CRS-2673: Attempting to stop 'ora.crsd' on 'rac2'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac2'
CRS-2673: Attempting to stop 'ora.OCR_VOTEDISK.dg' on 'rac2'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'rac2'
......

[root@rac2 bin]# ./crsctl start cluster
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac2'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac2'
CRS-2676: Start of 'ora.diskmon' on 'rac2' succeeded
......

[root@rac2 bin]# ./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
OHASD 还承担更多功能,例如处理和管理 Oracle 本地注册表(OLR),并充当 OLR 服务器。在集群环境中,OHASD 以 root 身份运行;在 Oracle Restart(单机)环境中,OHASD 以 oracle 用户运行并管理应用程序资源。
Oracle 11.2 的集群堆栈由 OHASD 守护进程启动,OHASD 本身则是在节点启动时由 /etc/init.d/init.ohasd 脚本派生的;此外,在节点上执行 'crsctl stop crs' 之后再执行 'crsctl start crs',也会重新启动 ohasd。随后 OHASD 守护进程将启动其他守护进程和代理。每个集群守护进程都对应存储在 OLR 中的一个 OHASD 资源。下表显示了 OHASD 资源/集群守护进程与各自代理进程和所有者的关系。
Resource Name | Agent Name | Owner |
ora.gipcd | oraagent | crs user |
ora.gpnpd | oraagent | crs user |
ora.mdnsd | oraagent | crs user |
ora.cssd | cssdagent | Root |
ora.cssdmonitor | cssdmonitor | Root |
ora.diskmon | orarootagent | Root |
ora.ctssd | orarootagent | Root |
ora.evmd | oraagent | crs user |
ora.crsd | orarootagent | Root |
ora.asm | oraagent | crs user |
ora.driver.acfs | orarootagent | Root |
ora.crf (new in 11.2.0.2) | orarootagent | root |
下面的图片显示OHASD管理资源/守护进程之间的所有资源依赖关系:
一个节点典型的守护程序资源列表如下。要获得守护资源列表,我们需要使用-init标志和CRSCTL命令。
[grid@rac1 admin]$ crsctl stat res -init -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                     Started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  OFFLINE
ora.crsd
      1        ONLINE  ONLINE       rac1
......
下面的列表会显示所使用的类型和层次。一切是建立在基本“resource”类型上。cluster_resource使用“resource”类型作为基本类型。cluster_resource作为基本类型构建出ora.daemon.type,守护进程资源都是使用“ora.daemon.type”类型作为基本类型。
[grid@rac1 admin]$ crsctl stat type -init
TYPE_NAME=application BASE_TYPE=cluster_resource
TYPE_NAME=cluster_resource BASE_TYPE=resource
TYPE_NAME=generic_application BASE_TYPE=cluster_resource
TYPE_NAME=local_resource BASE_TYPE=resource
TYPE_NAME=ora.asm.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.crf.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.crs.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.cssd.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.cssdmonitor.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.ctss.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.daemon.type BASE_TYPE=cluster_resource
TYPE_NAME=ora.diskmon.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.drivers.acfs.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.evm.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.gipc.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.gpnp.type BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.haip.type BASE_TYPE=cluster_resource
TYPE_NAME=ora.mdns.type BASE_TYPE=ora.daemon.type
TYPE_NAME=resource BASE_TYPE=
用ora.cssd资源作为一个例子,所有的ora.cssd属性可以使用crsctl stat res ora.cssd -init -f显示。(列出一部分比较重要的)
[grid@rac1 admin]$ crsctl stat res ora.cssd -init -f
NAME=ora.cssd
TYPE=ora.cssd.type
STATE=ONLINE
TARGET=ONLINE
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:grid:r-x
AGENT_FILENAME=%CRS_HOME%/bin/cssdagent%CRS_EXE_SUFFIX%
CHECK_INTERVAL=30
CLEAN_ARGS=abort
CLEAN_COMMAND=
CREATION_SEED=6
CSSD_MODE=
CSSD_PATH=%CRS_HOME%/bin/ocssd%CRS_EXE_SUFFIX%
CSS_USER=grid
ID=ora.cssd
LOGGING_LEVEL=1
START_DEPENDENCIES=weak(concurrent:ora.diskmon)hard(ora.cssdmonitor,ora.gpnpd,ora.gipcd)pullup(ora.gpnpd,ora.gipcd)
STOP_DEPENDENCIES=hard(intermediate:ora.gipcd,shutdown:ora.diskmon,intermediate:ora.cssdmonitor)
为了调试守护进程资源,必须始终使用 -init 标志。例如,要为 ora.cssd 启用额外的调试:
[root@rac2 bin]# ./crsctl set log res ora.cssd:3 -init
Set Resource ora.cssd Log Level: 3
检查log级别
[root@rac2 bin]# ./crsctl get log res ora.cssd -init
Get Resource ora.cssd Log Level: 3
要检查资源属性,如log级别:
[root@rac2 bin]# ./crsctl stat res ora.cssd -init -f | grep LOGGING_LEVEL
DAEMON_LOGGING_LEVELS=CSSD=2,GIPCNM=2,GIPCGM=2,GIPCCM=2,CLSF=0,SKGFD=0,GPNP=1,OLR=0
LOGGING_LEVEL=3
Oracle 11.2 集群件引入了一个新概念:代理(agent),它使 Oracle 集群件更强大、更高效。这些代理是多线程的守护进程,为多种资源类型实现入口点,并能以不同用户身份派生新进程。代理本身是高可用的;除了 oraagent、orarootagent 和 cssdagent/cssdmonitor 之外,还可以有应用程序代理(application agent)和脚本代理(script agent)。
两个主要代理是oraagent和orarootagent。 ohasd和CRSD各使用一个oraagent和一个orarootagent。如果CRS用户和Oracle用户不同,那么CRSD将利用两个oraagent和一个orarootagent。
ohasd’s oraagent:
crsd’s oraagent:
ohasd’s orarootagent:
crsd’s orarootagent:
请参照章节: “cssdagent and cssdmonitor”.
请参照章节:“application and scriptagent”.
ohasd/crsd 代理的日志位于 Grid_home/log/<hostname>/agent/{ohasd|crsd}/<agentname>_<owner>/<agentname>_<owner>.log。例如,ora.crsd 由 ohasd 管理且属于 root 用户,那么其代理日志为:
Grid_home/log/<hostname>/agent/ohasd/orarootagent_root/orarootagent_root.log
[grid@rac2 orarootagent_root]$ ls /u01/app/11.2.0/grid/log/rac2/agent/ohasd/orarootagent_root
orarootagent_root.log  orarootagent_rootOUT.log  orarootagent_root.pid
同一个代理日志可以存放不同资源的日志,如果这些资源是由相同的守护进程管理的。
如果一个代理进程崩溃了,
-核心文件将被写入
Grid_home/log/<hostname>/agent/{ohasd|crsd}/<agentname>_<owner>
-堆栈调用将写入
Grid_home/log/<hostname>/agent/{ohasd|crsd}/<agentname>_<owner>/<agentna me>_<owner>OUT.log
代理日志的格式如下:
<timestamp>:[<component>][<thread id>]…
<timestamp>:[<component>][<thread id>][<entry point>]…
例如:
2016-04-01 13:39:23.070: [ora.drivers.acfs][3027843984]{0:0:2} [check] execCmd ret = 0
[ clsdmc][3015236496]CLSDMC.C returnbuflen=8, extraDataBuf=A6, returnbuf=8D33FD8
2016-04-01 13:39:24.201: [ora.ctssd][3015236496]{0:0:213} [check] clsdmc_respget return: status=0, ecode=0, returnbuf=[0x8d33fd8], buflen=8
2016-04-01 13:39:24.201: [ora.ctssd][3015236496]{0:0:213} [check] translateReturnCodes, return = 0, state detail = OBSERVERCheckcb data [0x8d33fd8]: mode[0xa6] offset[343 ms].
如果发生错误,确定发生了什么的入口点:
-集群告警日志,Grid_home/log/<hostname>/alert<hostname>.log
如:/u01/app/11.2.0/grid/log/rac2/alertrac2.log
-OHASD/CRSD日志
Grid_home/log/<hostname>/ohasd/ohasd.log
Grid_home/log/<hostname>/crsd/crsd.log
-对应的代理日志文件
请记住,一个代理日志文件将包含多个资源的启动/停止/检查。以crsd orarootagent资源名称”ora.rac2.vip”为例。
[root@rac2 orarootagent_root]# grep ora.rac2.vip orarootagent_root.log
........
2016-04-01 12:30:33.606: [ora.rac2.vip][3013606288]{2:57434:199} [check] Failed to check 192.168.1.102 on eth0
2016-04-01 12:30:33.607: [ora.rac2.vip][3013606288]{2:57434:199} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-04-01 12:30:33.607: [ora.rac2.vip][3013606288]{2:57434:199} [check] VipAgent::checkIp returned false
........
CSS守护进程(OCSSD)管理集群的配置,集群里有哪些节点,并在有节点离开或加入时通知集群成员。
ASM、数据库实例以及其他集群守护进程都依赖于一个正常工作的 CSS。如果 OCSSD 因任何原因无法完成引导(例如找不到投票文件信息),所有其他层级都将无法启动。
OCSSD 还通过网络心跳(NHB)和磁盘心跳(DHB)监控集群的健康状况。NHB 是判断某个节点是否存活并能参与集群的主要指标,而 DHB 主要用于解决脑裂。
下面的部分将列出并解释ocssd使用的线程。
– 集群监听线程(CLT) – 试图在启动时连接到所有远程节点,接收和处理所有收到的消息,并响应其他节点的连接请求。每当从节点收到一个数据包,监听重置该节点漏掉的统计数量。
– 发送线程(ST) -专门每秒发送一次网络心跳(NHB)到所有节点,和使用grid IPC(GIPC)每秒发送一次当地的心跳(LHB)到cssdagent和cssdmonitor。
– 轮询线程(PT) – 监视远程节点的 NHB。如果 CSS 守护进程之间的通信通道发生故障,心跳就会丢失;如果某个节点丢失的心跳过多,它将被怀疑已经关闭或断开,此时重新配置线程会被唤醒,发生重新配置,并最终可能驱逐一个节点。
– 重新配置管理线程(RMT) – 在重新配置主节点上,被唤醒的重新配置管理线程会检查每个节点,找出哪些节点丢失 NHB 的时间过长。该线程与其他 CSS 守护进程一起参与投票过程,一旦确定了新的集群成员,就会在投票文件中写入驱逐通知,并向被驱逐的节点发送关闭消息。在脑裂场景下,还会通过投票文件监控被怀疑节点的磁盘心跳,只有当远程节点的磁盘心跳停止超过 <misscount> 秒后,该节点才会被踢出。
– 发现进程 -发现投票文件
– 避开线程 – 用于I / O防护diskmon进程通信,如果使用EXADATA。
投票文件群集成员线程
– 磁盘ping线程(每个投票的文件)
与它相关联的节点数量和递增序列号的投票文件一起写入群集成员的当前视图;
读取驱逐通知看它的主机节点是否已被驱逐;
这个线程还监视远程节点投票磁盘心跳信息。磁盘心跳信息,以便重新配置过程中用于确定一个远程OCSSD是否已经终止。
– kill block线程 – (每个投票的文件)监控投票文件可用性,以确保足够可访问的投票文件的数量。如果使用的Oracle冗余,我们需要配置多数投票磁盘在线。
– 工作线程 – (11.2.0.1里新增加的,每个投票文件)各种I / O在投票文件。
– 磁盘Ping监视器 – 监视器的I / O投票文件状态
该监视线程确保磁盘 ping 线程能够正确读取多数投票文件中的 kill block。如果由于 I/O 挂起、I/O 故障或其他原因无法对某个投票文件进行 I/O,就会将该投票文件置为离线。如果 CSS 无法读取多数投票文件,它就可能无法与所有节点共享至少一块投票盘,从而可能错过驱逐通知;换句话说,此时 CSS 已无法正常协作,必须被终止。
– 节点杀死线程 – (瞬时的)用于通过IPMI杀死节点
– 成员杀死线程 – (瞬时的)杀成员期间使用
成员-杀死(监控)线程
本地杀死线程 – 当一个CSS客户端开始杀死成员,当地CSS杀死线程将被创建
– SKGXN监视器(skgxnmon只出现在供应商集群)
该线程向 SKGXN 注册为节点组成员,以观察节点组成员身份的变化。当重新配置事件发生时,该线程从 SKGXN 请求当前节点组成员位图,并将其与上一次收到的位图以及另外两个位图的当前值进行比较:一个是"待驱逐"位图,标识正在被关闭的节点;另一个是 VMON 组成员位图,表明该节点上的 oclsmon 进程仍在运行(即节点仍然存活)。当成员变化得到确认后,节点监视线程会启动相应的操作。
Oracle 集群件 11g 第 2 版(11.2)降低了配置要求:节点启动时会自动加回集群,停机太久则会被自动删除。停机超过一个星期的服务器将不再出现在 olsnodes 的输出中。这些服务器在离开集群时会被自动管理,因此你不必手动把它们从集群中删除。
固定节点
更改节点固定行为(固定或取消固定某个节点)的命令是 crsctl pin css 和 crsctl unpin css。固定节点是指节点名称与节点编号的关联保持固定;如果节点未固定,当租约到期时节点编号可能会改变,而固定节点的租约永不过期。使用 crsctl delete node 命令删除节点会隐含地取消该节点的固定。
– 在Oracle集群升级,所有服务器都固定,而经过Oracle集群的全新安装11g第2版(11.2),您添加到集群中的所有服务器都不固定。
– 在安装了11.2集群的服务器上有比11.2早版本的实例,那么您无法取消固定。
滚动升级到 Oracle 集群件 11g 第 2 版(11.2)要求节点是固定的,这一步在升级时会自动完成。我们已经见过客户因为节点未被固定而导致手动升级失败的情况。下面给出一个检查和调整节点固定状态的操作示例。
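以下示例中节点名 rac1 仅为示例,crsctl pin/unpin 需要 root 权限执行:

# olsnodes:-n 显示节点号,-s 显示状态,-t 显示节点是 Pinned 还是 Unpinned
olsnodes -n -s -t

# 固定 / 取消固定某个节点
crsctl pin css -n rac1
crsctl unpin css -n rac1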
端口分配
对于CSS和节点监视器固定端口分配已被删除,所以不应该有与其他应用程序的端口竞争。唯一的例外是滚动升级过程中我们分配两个固定的端口。
GIPC
CSS 层使用新的通信层 Grid IPC(GIPC),同时仍支持 11.2 之前使用的 CLSC 通信层。在 11.2.0.2 中,GIPC 将支持在单个通信链路(例如 CSS/NM 之间的通信)上使用多个网卡。
集群告警日志
多个cluster_alert.log消息已被添加便于更快的定位问题。标识符将在alert.log和链接到该问题的守护程序日志条目都被打印。标识符是组件中唯一的,例如CSS或CRS。
2009-11-24 03:46:21.110
[crsd(27731)]CRS-2757:Command ‘Start’ timed out waiting for response from the resource ‘ora.stnsp006.vip’. Details at (:CRSPE00111:) in
/scratch/grid_home_11.2/log/stnsp005/crsd/crsd.log.
2009-11-24 03:58:07.375
[cssd(27413)]CRS-1605:CSSD voting file is online: /dev/sdj2; details in
/scratch/grid_home_11.2/log/stnsp005/cssd/ocssd.log.
独占模式
独占模式是 Oracle 集群件 11g 第 2 版(11.2)中的一个新概念。此模式允许您只在一个节点上启动堆栈,而无需启动其他节点的堆栈;它不需要投票文件,也不需要网络连接。此模式用于维护或故障定位。由于这是由用户手工调用的命令,需要由用户确保同一时刻只有一个节点以该模式启动。在独占模式下,root 用户可在某个节点上使用 crsctl start crs -excl 命令启动堆栈。
如果集群中的另一个节点已经启动,那么以独占模式启动将会失败:OCSSD 守护进程会主动检查其他节点,如果发现有节点已经启动,启动就会以 CRS-4402 报错失败。这不是错误,而是另一节点已经启动时的预期行为。正如约翰·利思所说:"收到 CRS-4402 时并不会生成错误文件"。
发现投票文件
识别投票文件的方法在 11.2 中已经改变:在 11.1 及更早版本中,投票文件配置在 OCR 里;在 11.2 中,投票文件的位置通过 GPnP 配置文件中的 CSS 投票文件发现字符串(discovery string)来确定。例如:
CSS voting file discovery string referring to ASM
当 CSS 投票文件发现字符串指向 ASM 时,将使用 ASM 发现字符串的值。最常见的情况是在仍可配置裸设备的系统上(例如使用较旧 2.6 内核的 Linux)看到这种配置,此时 CRS 和 ASM 都使用裸设备。
例如:
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="" SPFile=""/>
对于ASM搜寻字符串空值意味着它将恢复到特定的操作系统默认情况下。在Linux上就是/dev/raw/raw*。
CSS voting file discovery string referring to list of LUN’s/disks
在下面的例子中,CSS 投票文件发现字符串直接指向一个磁盘/LUN 列表。这可能出现在使用块设备或设备位于非默认路径的配置中。在这种情况下,CSS 投票文件发现字符串与 ASM 发现字符串的值相同。
<orcl:CSS-Profile id="css" DiscoveryString="/dev/shared/sdsk-a[123]-*-part8" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/shared/sdsk-a[123]-*-part8" SPFile=""/>
必须在磁盘上找到若干投票文件标识信息,才能把它当作投票盘接受:文件的唯一标识符、集群 GUID 以及匹配的配置化身号(CIN)。可以使用 vdpatch 检查某个设备是否是投票文件,也可以先用下面的命令列出集群当前识别到的投票文件。
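示意命令如下,需在集群件已启动的节点上执行:

# 列出当前投票文件的状态、File Universal Id、路径及所在磁盘组
crsctl query css votedisk

# 如投票文件位于 ASM 中,还可结合 gpnptool get 查看 CSS 的发现字符串
gpnptool get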
节点编号是通过获取租约(lease)的机制获得的。租约表示一个节点在租约期限所定义的时间段内拥有与之关联的节点编号。租约期限在 GPnP 配置文件中硬编码为一个星期。节点从上次续租时间起拥有一个租期,并且每次写 DHB 时租约都会被更新。因此,租约到期时间定义如下:lease expiry time = last DHB time + lease duration(租约到期时间 = 最后一次 DHB 时间 + 租约期限)。
有两种类型的租约
– 固定租约
节点使用硬编码的静态节点数量。固定租约是指涉及到使用静态节点号旧版本集群的升级方案中使用。
– 不固定租约
一个节点动态获取节点号使用租约获得算法。租约获得算法旨在解决其试图在同一时间获得相同的插槽节点之间的冲突。
对于一个成功的租约操作会记录在Grid_home/log/<hostname>/alert<hostname>.log,
[cssd(8433)]CRS-1707:Lease acquisition for node staiv10 number 5 completed
对于租约获取失败的情况,会在 alert<hostname>.log 和 ocssd.log 中记录相关信息。当前版本中没有可调整的租约参数。
下面的章节将描述主要的组件和技术用来解决脑裂。
Heartbeats心跳
CSS 使用两种主要的心跳机制来维护集群成员关系:网络心跳(NHB)和磁盘心跳(DHB)。心跳机制是有意冗余的,二者用途不同:NHB 用于检测集群连接的丢失,而 DHB 主要用于脑裂的解决。每个集群节点都必须参与心跳协议,才能被认为是健康的集群成员。
Network Heartbeat (NHB)
NHB 通过集群安装时配置为私有互连的专用网络接口发送。CSS 每秒从本节点向集群中所有其他节点发送一次 NHB,并每秒从每个远程节点接收一次 NHB。NHB 也会发送给 cssdmonitor 和 cssdagent。
NHB 包含来自本地节点的时间戳信息,远程节点据此判断 NHB 是何时发送的。NHB 表明一个节点可以参与集群活动,例如组成员变化、消息发送等。如果 NHB 丢失达到 <misscount> 秒(在 Linux 上 11.2 中为 30 秒),就必须进行集群成员变更(集群重新配置)。如果网络连接在少于 <misscount> 秒内恢复,则不一定是致命的。
要调试NHB问题,增加OCSSD日志级别为3对于看每个心跳消息有时是有帮助的。在每个节点上的root用户运行CRSCTL设置日志命令:
# crsctl set log css ocssd:3
监测misstime最大值,看看是否misscount正在增加,这将定位网络问题。
# tail -f ocssd.log | grep -i misstime
2009-10-22 06:06:07.275: [ ocssd][2840566672]clssnmPollingThread: node 2, stnsp006, ninfmisstime 270, misstime 270, skgxnbit 4, vcwmisstime 0, syncstage 0
2009-10-22 06:06:08.220: [ ocssd][2830076816]clssnmHBInfo: css timestmp 1256205968 220 slgtime 246596654 DTO 28030 (index=1) biggest misstime 220 NTO 28280
2009-10-22 06:06:08.277: [ ocssd][2840566672]clssnmPollingThread: node 2, stnsp006, ninfmisstime 280, misstime 280, skgxnbit 4, vcwmisstime 0, syncstage 0
2009-10-22 06:06:09.223: [ ocssd][2830076816]clssnmHBInfo: css timestmp 1256205969 223 slgtime 246597654 DTO 28030 (index=1) biggest misstime 1230 NTO 28290
2009-10-22 06:06:09.279: [ ocssd][2840566672]clssnmPollingThread: node 2, stnsp006, ninfmisstime 270, misstime 270, skgxnbit 4, vcwmisstime 0, syncstage 0
2009-10-22 06:06:10.226: [ ocssd][2830076816]clssnmHBInfo: css timestmp 1256205970 226 slgtime 246598654 DTO 28030 (index=1) biggest misstime 2785 NTO 28290
要显示当前misscount设置的值,使用命令crsctl get css misscount。我们不支持misscount设置默认值以外的值。
[grid@rac2 ~]$ crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
Disk Heartbeat (DHB)
除了NHB,我们需要使用DHB来解决脑裂。它包含UNIX本地时间的时间戳。
DHB明确的机制,决定有关节点是否还活着。当DHB心跳丢失时间过长,则该节点被假定为死亡。当连接到磁盘丢失时间’过长’,盘会考虑脱机。
关于“太长”的定义取决于对DHB下列情形。首先,长期磁盘I / O超时(LIOT),其中有一个默认的200秒的设定。如果我们不能在时间内完成一个投票文件内的I / O,我们将此投票文件脱机。其次,短期磁盘I / O超时(SIOT),其中CSS集群重新配置过程中使用。SIOT是有关misscount(misscount(30) – reboottime(3)=27秒)。默认重启时间为3秒。要显示CSS 的disktimeout参数的值,使用命令crsctl get css disktimeout。
[grid@rac2 ~]$ crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
网络分离检测
最后NHB的时间戳和最近DHB的时间戳进行比较,以确定一个节点是否仍然活着。
当最近DHB和最后NHB的时间戳之间的差是大于SIOT (misscount – reboottime),一个节点被认为仍然活跃。
当时间戳之间的增量小于重启时间,节点被认为是还活着。
如果该最后DHB读取的时间大于SIOT,该节点被认为是死的。
如果时间戳之间的增量比SIOT大比reboottime少,节点的状态不明确,我们必须等待做出决定,直到我们陷入上述三个类别之一。
当网络发生故障,并且仍有活动的节点无法和其他节点通信,网络被认为是分裂。为了保持分裂发生时数据完整性,节点中的一个必须失败,尚存节点应该是最优子群集。节点通过三种可能的方式逐出:
– 通过网络发送驱逐消息。在大多数情况下,这将失败,因为现有的网络故障。
– 通过投票的文件
– 通过IPMI,如果支持和配置
在我们用下面的例子与节点A,B,C和D集群更为详细的解释:
Nodes A and B receive each other’s heartbeats
Nodes C and D receive each other’s heartbeats
Nodes A and B cannot see heartbeats of C or D
Nodes C and D cannot see heartbeats of A or B
Nodes A and B are one cohort, C and D are another cohort
Split begins when 2 cohorts stop receiving NHB’s from each other
CSS 假定这是一次对称的故障,也就是说,A+B 组停止接收 C+D 组发送的 NHB,同时 C+D 组也停止接收 A+B 组发送的 NHB。
在这种情况下,CSS使用投票文件和DHB解决脑裂。kill block,是投票文件结构的一个组成部分,将更新和用于通知已被驱逐的节点。每个节点每一秒读取它的kill block,当另一个节点已经更新kill block后,就会自杀。
在像上面,有相似大小的子群集的情况下,子群中含有低节点编号的子群的节点将生存下来,而其他子群集节点将重新启动。
在一个更大的群集分裂的情况下,更大的子群将生存。在两个节点的群集的情况下,节点号小的节点在网络分离下将存活下来,独立于网络发生错误的位置。
连接到一个节点所需的多数投票文件保持活跃。
在 11.2.0.1 中,kill 守护进程用于杀掉 CSS 组的成员。它由执行 I/O 的客户端在加入组时通过 OCSSD 库代码派生,并在需要时重新派生。每个用户(例如 crs 属主用户、oracle 用户)各有一个 kill 守护进程(oclskd)。
杀死成员说明
下面这些OCSSD线程参与杀死成员:
client_listener – receives group join and kill requests
peer_listener – receives kill requests from remote nodes
death_check – provides confirmation of termination
member_kill – spawned to manage a member kill request
local_kill – spawned to carry out member kills on local node
node termination – spawned to carry out escalation
Member kills are issued by clients who want to eliminate group members doing IO, for example:
LMON of the ASM instance
LMON of a database instance
crsd on Policy Engine (PE) master node (new in 11.2)
成员杀死总是涉及远程目标;无论是远程ASM或数据库实例。成员杀死请求被移交到本地OCSSD,再把请求发送到OCSSD目标节点上。
在某些情况下(例如 11.2.0.1 及更早版本中出现极端的 CPU 和内存资源短缺),远程节点的 kill 守护进程或远程 OCSSD 无法在规定时间(misscount 秒)内完成本地 OCSSD 发来的成员杀死请求,导致成员杀死请求超时。如果是 LMON(ASM 和/或 RDBMS)发起的成员杀死,则本地 OCSSD 会把请求升级为对远程节点的节点杀死。由 CRSD 发起的成员终止请求永远不会被升级为节点杀死;相反,我们依靠 orarootagent 的 check 动作来检测功能失常的 CRSD 并将其重启。目标节点的 OCSSD 收到成员杀死升级请求后会自我终止,从而迫使该节点重新启动。
在 11.2.0.2 中,kill 守护进程的功能改由以实时优先级运行线程的 cssdagent/cssdmonitor 承担,因此即使系统负载很高,杀死请求成功的机会也更高。
如果配置了 IPMI 且其功能正常,OCSSD 的节点监视器将使用 IPMI 派生一个节点终止线程来关闭远程节点。节点终止线程通过管理 LAN 与远程 BMC 通信;它会建立一个认证会话(只有特权用户才能关闭节点),并检查电源状态。下一步是请求断电,并反复检查状态直至节点状态为 OFF。确认 OFF 状态后,我们会再次给远程节点上电,然后节点终止线程退出。
成员杀死的例子:
由于CPU紧缺,数据库实例3的LMON发起成员杀死比如在节点2:
2009-10-21 12:22:03.613810 : kjxgrKillEM: schedule kill of inst 2 inc 20
in 20 sec
2009-10-21 12:22:03.613854 : kjxgrKillEM: total 1 kill(s) scheduled kgxgnmkill: Memberkill called – group: DBPOMMI, bitmap:1
2009-10-21 12:22:22.151: [ CSSCLNT]clssgsmbrkill: Member kill request: Members map 0x00000002
2009-10-21 12:22:22.152: [ CSSCLNT]clssgsmbrkill: Success from kill call rc 0
本地的ocssd(第三节点,内部节点2号)收到的成员杀死请求:
2009-10-21 12:22:22.151: [ ocssd][2996095904]clssgmExecuteClientRequest: Member kill request from client (0x8b054a8)
2009-10-21 12:22:22.151: [ ocssd][2996095904]clssgmReqMemberKill: Kill requested map 0x00000002 flags 0x2 escalate 0xffffffff
2009-10-21 12:22:22.152: [ ocssd][2712714144]clssgmMbrKillThread: Kill requested map 0x00000002 id 1 Group name DBPOMMI flags 0x00000001 start time 0x91794756 end time 0x91797442 time out 11500 req node 2
DBPOMMI is the database group where LMON registers as primary member time out = misscount (in milliseconds) + 500ms
map = 0x2 = 0010 = second member = member 1 (other example: map = 0x7 = 0111 = members 0,1,2)
远程的ocssd(第二节点,内部节点1号)收到的PID的杀死守护进程的请求和提交:
2009-10-21 12:22:22.201: [ ocssd][3799477152]clssgmmkLocalKillThread: Local kill requested: id 1 mbr map 0x00000002 Group name DBPOMMI flags 0x00000000 st time 1088320132 end time 1088331632 time out 11500 req node 2
2009-10-21 12:22:22.201: [ ocssd][3799477152]clssgmmkLocalKillThread: Kill requested for member 1 group (0xe88ceda0/DBPOMMI)
2009-10-21 12:22:22.201: [ ocssd][3799477152]clssgmUnreferenceMember: global grock DBPOMMI member 1 refcount is 7
2009-10-21 12:22:22.201: [ ocssd][3799477152]GM Diagnostics started for mbrnum/grockname: 1/DBPOMMI
2009-10-21 12:22:22.201: [ ocssd][3799477152]group DBPOMMI, member 1 (client
0xe330d5b0, pid 23929)
2009-10-21 12:22:22.201: [ ocssd][3799477152]group DBPOMMI, member 1 (client 0xe331fd68, pid 23973) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.201: [ ocssd][3799477152]group DG_LOCAL_POMMIDG, member 0
(client 0x89f7858, pid 23957) sharing group DBPOMMI, member 1, share type xmbr
2009-10-21 12:22:22.201: [ ocssd][3799477152]group DBPOMMI, member 1 (client 0x8a1e648, pid 23949) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.201: [ ocssd][3799477152]group DBPOMMI, member 1 (client 0x89e7ef0, pid 23951) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.202: [ ocssd][3799477152]group DBPOMMI, member 1 (client 0xe8aabbb8, pid 23947) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.202: [ ocssd][3799477152]group DG_LOCAL_POMMIDG, member 0
(client 0x8a23df0, pid 23949) sharing group DG_LOCAL_POMMIDG, member 0, share type normal
2009-10-21 12:22:22.202: [ ocssd][3799477152]group DG_LOCAL_POMMIDG, member 0
(client 0x8a25268, pid 23929) sharing group DG_LOCAL_POMMIDG, member 0, share type normal
2009-10-21 12:22:22.202: [ ocssd][3799477152]group DG_LOCAL_POMMIDG, member 0
(client 0x89e9f78, pid 23951) sharing group DG_LOCAL_POMMIDG, member 0, share type normal
在这一点上,oclskd.log将表明成功杀死这些进程,并完成此杀死请求。在11.2.0.2及更高版本,杀死守护线程将执行杀死:
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsnkillagent_main:killreq received:
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23929
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23973
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23957
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23949
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23951
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23947
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23949
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23929
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23951
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23947
但是,如果在(misscount+1/2秒)的请求没完成,本地节点上的OCSSD升级节点杀死请求:
2009-10-21 12:22:33.655: [ ocssd][2712714144]clssgmMbrKillThread: Time up:
Start time -1854322858 End time -1854311358 Current time -1854311358 timeout 11500
2009-10-21 12:22:33.655: [ ocssd][2712714144]clssgmMbrKillThread: Member kill request complete.
2009-10-21 12:22:33.655: [ ocssd][2712714144]clssgmMbrKillSendEvent: Missing answers or immediate escalation: Req member 2 Req node 2 Number of answers expected 0 Number of answers outstanding 1
2009-10-21 12:22:33.656: [ ocssd][2712714144]clssgmQueueGrockEvent: groupName(DBPOMMI) count(4) master(0) event(11), incarn 0, mbrc 0, to member 2, events 0x68, state 0x0
2009-10-21 12:22:33.656: [ ocssd][2712714144]clssgmMbrKillEsc: Escalating node
1 Member request 0x00000002 Member success 0x00000000 Member failure 0x00000000 Number left to kill 1
2009-10-21 12:22:33.656: [ ocssd][2712714144]clssnmKillNode: node 1 (staiu02) kill initiated
2009-10-21 12:22:33.656: [ ocssd][2712714144]clssgmMbrKillThread: Exiting
ocssd目标节点将中止,迫使一个节点重启:
2009-10-21 12:22:33.705: [ ocssd][3799477152]clssgmmkLocalKillThread: Time up.
Timeout 11500 Start time 1088320132 End time 1088331632 Current time 1088331632
2009-10-21 12:22:33.705: [ ocssd][3799477152]clssgmmkLocalKillResults: Replying to kill request from remote node 2 kill id 1 Success map 0x00000000 Fail map 0x00000000
2009-10-21 12:22:33.705: [ ocssd][3799477152]clssgmmkLocalKillThread: Exiting
…
2009-10-21 12:22:34.679: [ ocssd][3948735392](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node 2, sync 151438398, stamp 2440656688
2009-10-21 12:22:34.679: [ ocssd][3948735392]###################################
2009-10-21 12:22:34.679: [ ocssd][3948735392]clssscExit: ocssd aborting from thread clssnmvKillBlockThread
2009-10-21 12:22:34.679: [ ocssd][3948735392]###################################
如何识别客户端是谁最初发出成员杀死请求?
在ocssd.log里,请求者可以被找到:
2009-10-21 12:22:22.151: [ocssd][2996095904]clssgmExecuteClientRequest: Member kill request from client (0x8b054a8)
<search backwards to when client registered>
2009-10-21 12:13:24.913: [ocssd][2996095904]clssgmRegisterClient:
proc(22/0x8a5d5e0), client(1/0x8b054a8)
<search backwards to when process connected to ocssd>
2009-10-21 12:13:24.897: [ocssd][2996095904]clssgmClientConnectMsg: Connect from con(0x677b23) proc(0x8a5d5e0) pid(20485/20485) version 11:2:1:4, properties: 1,2,3,4,5
用’ps’,或从其他历史(如trace文件,IPD/OS,OSWatcher),这个进程可以通过进程id识别:
$ ps -ef|grep ora_lmon
spommere 20485 1 0 01:46 ? 00:01:15 ora_lmon_pommi_3
智能平台管理接口(IPMI),今天是包含在许多服务器的行业标准管理协议。 IPMI独立于操作系统系统,如果系统不通电也能工作。IPMI服务器包含一个基板管理控制器(BMC),其用于与服务器通信(BMC)。
关于使用 IPMI 进行节点隔离(fencing)
为了支持把成员杀死升级为节点终止,您必须配置并使用一个能够独立于 Oracle 集群件和所运行操作系统来重启问题节点的外部机制。IPMI 就是这样一种机制,从 11.2 开始受到支持。通常情况下,IPMI 在安装过程中配置;如果安装时没有配置,也可以在 CRS 安装完成后用 CRSCTL 配置。
About Node-termination Escalation with IPMI
To use IPMI for node termination, each cluster member node must be equipped with a Baseboard Management Controller (BMC) running firmware compatible with IPMI version 1.5, which supports IPMI over a local area network (LAN). During database operation, member-kill escalation is accomplished by communication from the evicting ocssd daemon to the victim node’s BMC over LAN. The IPMI over LAN protocol is carried over an authenticated session protected by a user name and password, which are obtained from the administrator during installation. If the BMC IP addresses are DHCP assigned, ocssd requires direct communication with the local BMC during CSS startup. This is accomplished using a BMC probe command (OSD), which communicates with the BMC through an IPMI driver, which must be installed and loaded on each cluster system.
OLR Configuration for IPMI
There are two ways to configure IPMI, either during the Oracle Clusterware installation via the Oracle Universal Installer or afterwards via crsctl.
OUI – asks about node-fencing via IPMI
tests for driver to enable full support (DHCP addresses)
obtains IPMI username and password and configures OLR on all cluster nodes
Manual configuration – after install or when using static IP addresses for BMCs
crsctl query css ipmidevice
crsctl set css ipmiadmin <ipmi-admin>
crsctl set css ipmiaddr
参见: Oracle Clusterware Administration and Deployment Guide, “Configuration and Installation for Node Fencing” for more information and Oracle Grid Infrastructure Installation Guide, “Enabling Intelligent Platform Management Interface (IPMI)”
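下面是安装完成后用 crsctl 手工配置 IPMI 的一个示例(仅为示意;IPMI 管理员用户名 ipmiadmin 和 BMC 地址 192.168.10.45 均为假设值,ipmiaddr 只在 BMC 使用静态地址时需要设置,设置 ipmiadmin 时会提示输入口令):

# 确认本节点已安装并加载 IPMI 驱动
crsctl query css ipmidevice

# 在每个节点上登记 BMC 的管理员账号(会提示输入口令)
crsctl set css ipmiadmin ipmiadmin

# 如果 BMC 使用静态 IP,在每个节点上登记其地址(DHCP 分配时可省略)
crsctl set css ipmiaddr 192.168.10.45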
有时有必要改变ocssd的默认日志级别。
在11.2的日志默认级别是2.要改变日志级别,root用户在一个节点上执行下面命令:
# crsctl set log css CSSD:N (where N is the logging level)
Logging level 2 = 默认的
Logging level 3 =详细信息,显示各个心跳信息包括misstime,有助于调试NHB的相关问题。
Logging level 4 = 超级详细
大多数问题在级别 2 就能诊断,有一些需要级别 3,很少需要级别 4。使用级别 3 或 4 时,跟踪信息可能只能保留几个小时(甚至几分钟),因为跟踪文件会很快写满并循环覆盖。请注意,由于跟踪量大,较高的日志级别会对 ocssd 造成性能影响。如果需要把数据保留更长时间,可以创建一个 cron 作业来备份和压缩 CSS 日志,例如下面的示例。
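下面是一个用 cron 定期备份并压缩 ocssd 日志的简单脚本示例(仅为示意;Grid 安装目录 /u01/app/11.2.0/grid 和备份目录 /backup/css_logs 均为假设值):

#!/bin/bash
# backup_css_log.sh:备份并压缩当前的 ocssd.log(路径均为示例,可由 cron 每小时调用一次)
GRID_HOME=/u01/app/11.2.0/grid
HOST=$(hostname -s)
DEST=/backup/css_logs
STAMP=$(date +%Y%m%d%H%M)

mkdir -p $DEST
cp $GRID_HOME/log/$HOST/cssd/ocssd.log $DEST/ocssd.log.$STAMP
gzip $DEST/ocssd.log.$STAMP

# 对应的 root crontab 条目示例:
# 0 * * * * /usr/local/bin/backup_css_log.sh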
为了增强对cssdagent或cssdmonitor的跟踪,可以通过crsctl命令实现:
# crsctl set log res ora.cssd=2 -init
# crsctl set log res ora.cssdmonitor=2 -init
在 Oracle 11.2 集群件中,CSS 会把堆栈的 dump 信息输出到 cssdOUT.log,这有助于在节点重启之前把诊断数据刷新到磁盘。因此在 11.2 上我们认为没有必要修改 diagwait(默认为 0),除非 Oracle 支持或开发人员有相关建议。
在非常罕见的情况下(仅限调试期间),可能有必要禁止 ocssd 触发重启。这可以通过以下 crsctl 命令实现。禁用重启应当在 Oracle 支持或开发人员的指导下进行;该操作可以在线完成,不需要重启堆栈。
# crsctl modify resource ora.cssd -attr "ENV_OPTS=DEV_ENV" -init
# crsctl modify resource ora.cssdmonitor -attr "ENV_OPTS=DEV_ENV" -init
在 11.2.0.2 中,可以针对不同的模块分别启用更高的日志级别。
用下面命令列出css守护进程的所有模块的名字:
[root@rac2 bin]# ./crsctl lsmodules css
List CSSD Debug Module: CLSF
List CSSD Debug Module: CSSD
List CSSD Debug Module: GIPCCM
List CSSD Debug Module: GIPCGM
List CSSD Debug Module: GIPCNM
List CSSD Debug Module: GPNP
List CSSD Debug Module: OLR
List CSSD Debug Module: SKGFD
CLSF and SKGFD – 关于仲裁盘的I/O
CSSD – same old one
GIPCCM – gipc communication between applications and CSS
GIPCGM – communication between peers in the GM layer
GIPCNM – communication between nodes in the NM layer
GPNP – trace for gpnp calls within CSS
OLR – trace for olr calls within CSS
下面是如何对不同的模块设置不同的日志级别的例子:
# crsctl set log css GIPCCM=1,GIPCGM=2,GIPCNM=3
# crsctl get log css CSSD=4
检查当前的跟踪日志级别用下面的命令:
# crsctl get log ALL
# crsctl get log css GIPCCM
CSSDAGENT and CSSDMONITOR
CSSDAGENT 和 CSSDMONITOR 提供几乎相同的功能:cssdagent 负责启动、停止和检查 ocssd 守护进程,cssdmonitor 则监控 cssdagent。并不存在名为 ora.cssdagent 的资源;ora.cssd 资源对应的是 ocssd 守护进程本身。
在 11.2 之前,上述两个代理的功能由 oprocd 和 oclsomon 守护进程实现。cssdagent 和 cssdmonitor 与 ocssd 一样,以实时优先级运行并锁定内存。
另外,cssdagent 和cssdmonitor提供下面的服务来确保数据完整性:
监控ocssd,如果ocssd失败,那么cssd* 重启节点
监控节点调度:如果节点夯住了/没有进程调度,重启节点。
为了更全面地决策是否需要重启节点,cssdagent 和 cssdmonitor 会随本地心跳从 ocssd 接收状态信息,以确保远程节点对本地节点状态的感知是准确的。此外,还会利用其他节点感知到本地节点故障所需的时间窗口,完成诸如文件系统同步之类的操作,以便获得完整的诊断数据。
为了启用 ocssd 代理的调试,可以使用 crsctl set log res ora.cssd:3 -init 命令。该操作会记录在 Grid_home/log/<hostname>/agent/ohasd/oracssdagent_root/oracssdagent_root.log 中,之后更多的跟踪信息也会写入 oracssdagent_root.log。
2009-11-25 10:00:52.386: [ AGFW][2945420176] Agent received the message: RESOURCE_MODIFY_ATTR[ora.cssd 1 1] ID 4355:106099
2009-11-25 10:00:52.387: [ AGFW][2966399888] Executing command:
res_attr_modified for resource: ora.cssd 1 1
2009-11-25 10:00:52.387: [ USRTHRD][2966399888] clsncssd_upd_attr: setting trace to level 3
2009-11-25 10:00:52.388: [ CSSCLNT][2966399888]clssstrace: trace level set to 2
2009-11-25 10:00:52.388: [ AGFW][2966399888] Command: res_attr_modified for resource: ora.cssd 1 1 completed with status: SUCCESS
2009-11-25 10:00:52.388: [ AGFW][2945420176] Attribute: LOGGING_LEVEL for
resource ora.cssd modified to: 3
2009-11-25 10:00:52.388: [ AGFW][2945420176] config version updated to : 7 for ora.cssd 1 1
2009-11-25 10:00:52.388: [ AGFW][2945420176] Agent sending last reply for: RESOURCE_MODIFY_ATTR[ora.cssd 1 1] ID 4355:106099
2009-11-25 10:00:52.484: [ CSSCLNT][3031063440]clssgsgrpstat: rc 0, gev 0, incarn
2, mc 2, mast 1, map 0x00000003, not posted
同样适用于cssdmonitor(ora.cssdmonitor)资源。
heartbeats(心跳)
Disk HeartBeat (DHB) 磁盘心跳,定期的写在投票文件里,一秒钟一次
Network HeartBeat (NHB)网络心跳,每一秒钟发送一次到其他节点上
Local HeartBeat (LHB)本地心跳,每一秒钟一次发送到代理或监控
ocssd 线程
Sending Thread (ST) 同一时间发送网络心跳和本地心跳
Disk Ping thread 每一秒钟把磁盘心跳写到投票文件里
Cluster Listener (CLT) 接收其他节点发送过来的消息,主要是网络心跳
agent/monitor线程
HeartBeat thread (HBT)从ocssd接收本地心跳和检测连接失败
OMON thread (OMT) 监控连接失败
OPROCD thread (OPT) 监控agent/moniter调度进程
VMON thread (VMT)取代clssvmon可执行文件,注册在skgxn组供应商集群软件
Timeouts(超时)
Misscount (MC) 一个节点在被删除之前没有网络心跳的时间
Network Time Out (NTO) 一个节点在被删除之前没有网络心跳的最大保留时间
Disk Time Out (DTO) 大多数投票文件被认为是无法访问的最大时间
ReBoot Time (RBT) 允许重新启动的时间,默认是三秒钟。
Misscount, SIOT, RBT
Disk I/O Timeout amount of time for a voting file to be offline before it is unusable
SIOT – Short I/O Timeout, in effect during reconfig
LIOT – Long I/O Timeout, in effect otherwise
Long I/O Timeout – (LIOT)通过crsctl set css disktimeout配置超时时间,默认200秒。
Short I/O Timeout (SIOT) is (misscount – reboot time)
In effect when NHB’s missed for misscount/2
ocssd terminates if no DHB for SIOT
Allows RBT seconds after termination for reboot to complete
Disk Heartbeat Perceptions
Other node perception of local state in reconfig
No NHB for misscount, node not visible on network
No DHB for SIOT, node not alive
If node alive, wait full misscount for DHB activity to be missing, i.e. node not alive
As long as DHB’s are written, other nodes must wait
Perception of local state by other nodes must be valid to avoid data corruption
Disk Heartbeat Relevance
DHB only read starting shortly before a reconfig to remove the node is started
When no reconfig is impending, the I/O timeout not important, so need not be monitored
If the disk timeout expires, but the NHB’s have been sent to and received from other nodes, it will still be misscount seconds before other nodes will start a reconfig
The proximity to a reconfig is important state information for OPT
Clocks
Time Of Day Clock (TODC) the clock that indicates the hour/minute/second of the day (may change as a result of commands)
aTODC is the agent TODC
cTODC is the ocssd TODC
Invariant Time Clock (ITC) a monotonically increasing clock that is invariant i.e. does not change as a result of commands). The invariant clock does not change if time set backwards or forwards; it is always constant.
aITC is the agent ITC
cITC is the ocssd ITC
是如何工作的
ocssd state information contains the current clock information, the network time out (NTO) based on the node with the longest time since the last NHB and a disk I/O timeout based on the amount of time since the majority of voting files was last online. The sending thread gathers this current state information and sends both a NHB and local heartbeat to ensure that the agent perception of the aliveness of ocssd is the same as that of other nodes.
The cluster listener thread monitors the sending thread. It ensures the sending thread has been scheduled recently and wakes up if necessary. There are enhancements here to ensure that even after clock shifts backwards and forwards, the sending thread is scheduled accurately.
There are several agent threads, one is the oprocd thread which just sleeps and wakes up periodically. Upon wakeup, it checks if it should initiate a reboot, based on the last known ocssd state information and the local invariant time clock (ITC). The wakeup is timer driven. The heartbeat thread is just waiting for a local heartbeat from the ocssd. The heartbeat thread will calculate the value that the oprocd thread looks at, to determine whether to reboot. It checks if the oprocd thread has been awake recently and if not, pings it awake. The heartbeat thread is event driven and not timer driven.
文件系统同步
当 ocssd 失败时,会启动文件系统同步。此时有足够的时间来完成这件事,可以等待几秒钟让同步完成;最后一次本地心跳表明还能等多久,等待时间基于 misscount。等待超时后,oprocd 会重启该节点。大多数情况下,诊断数据都会写到磁盘上;只有在极少数情况下(例如因为 CSS 挂起导致同步尚未执行)才会没有写入磁盘。
集群就绪服务(CRS)是管理高可用操作的主要程序。CRS 守护进程根据 OCR 中为每个资源存储的配置信息来管理集群资源,包括启动、停止、监控和故障转移操作。crsd 守护进程监控数据库实例、监听程序等,并在发生故障时自动重启这些组件。
crsd 守护进程以 root 用户运行,并在发生故障后自动重启。在为单实例环境安装 Grid Infrastructure(即 Oracle Restart,用于 ASM 和数据库)时,由 ohasd 代替 crsd 管理应用资源。
Policy Engine
概述
在 11.2 中,资源的高可用由 OHASD(通常负责基础设施资源)和 CRSD(负责部署在集群上的应用程序)共同处理。这两个守护进程共享相同的体系结构和大部分代码;在大多数意义上,可以把 OHASD 看作运行在单节点集群上的 CRSD。后续部分的讨论同时适用于这两个守护进程。
从11.2开始,CRSD的体系结构实现了主从模型:一个单一的CRSD在集群里被选作主,其他的都是从。在守护进程启动和每次主被重新选择,CRSD把当前主写入crsd.log日志里。
grep “PE MASTER” Grid_home/log/hostname/crsd/crsd.*
crsd.log:2010-01-07 07:59:36.529: [ CRSPE][2614045584] PE MASTER NAME: staiv13
CRSD 是一个由若干"模块"组成的分布式应用程序。这些模块基本上是无状态的,通过交换消息来协同工作;状态(上下文)总是随每条消息一起传递,大多数交互在本质上是异步的。有些模块拥有专用线程,有些模块共享同一个线程,还有一些操作共享一个线程池。重要的 CRSD 模块如下:
例如,一个客户机请求修改资源将产生以下交互
crsctl -> UI Server -> PE -> OCR Module
                       PE -> Reporter (event publishing)
                          -> Proxy (to notify the agent)
crsctl <- UI Server <- PE
注意UiServer/PE/Proxy每一个可以在不同的节点上,如下图:
Resource Instances & IDs
在11.2中,CRS模块支持资源多样性的两个概念:基数和程度。In 11.2, CRS modeling supports two concepts of resource multiplicity: cardinality and degree. The former controls the number of nodes where the resource can run concurrently while the latter controls the number of instances of the resource that can be run on each node. To support the concepts, the PE now distinguishes between resources and resource instances. The former can be seen as a configuration profile for the entire resource while the latter represents the state data for each instance of the resource. For example, a resource with CARDINALITY=2, DEGREE=3 will have 6 resource instances. Operations that affect resource state (start/stopping/etc.) are performed using resource instances. Internally, resource instances are referred to with IDs which following the following format: “<A> <B>
<C>” (note space separation), where <A> is the resource name, <C> is the degree of the instance (mostly 1), and <B> is the cardinality of the instance for cluster_resource resources or the name of the node to which the instance is assigned for local_resource names. That’s why resource name have “funny” decorations in logs:
[ CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new target state: [ONLINE] old
value: [OFFLINE]
Log Correlation
CRSD is event-driven in nature. Everything of interest is an event/command to process. Two kinds of commands are distinguished: planned and unplanned. The former are usually administrator-initiated (add/start/stop/update a resource, etc.) or system-initiated (resource auto start at node reboot, for instance) actions while the latter are normally unsolicited state changes (a resource failure, for example). In either case, processing such events/commands is what CRSD does and that’s when module interaction takes place. One can easily follow the interaction/processing of each event in the logs, right from the point of origination (say from the UI module) through to PE and then all the way to the agent and back all the way using the concept referred to as a “tint”. A tint is basically a cluster-unique event ID of the following format: {X:Y:Z}, where X is the node number, Y a node-unique number of a process where the event first entered the system, and Z is a monotonically increasing sequence number, per process. For instance, {1:25747:254} is a tint for the 254th event that originated in some process internally referred to us 25747 on node number 1. Tints are new in 11.2.0.2 and can be seen in CRSD/OHASD/agent logs. Each event in the system gets assigned a unique tint at the point of entering the system and modules prefix each log message while working on the event with that tint.
例如,在3节点的集群,node0是PE,在node1上执行“crsctl start resource r1 –n node2”,恰好如上面的图形,将会在日志里产生下面信息:
节点1上的CRSD日志(crsctl总是连接本地CRSD;UI服务器把命令转发到PE)
2009-12-29 17:07:24.742: [UiServer][2689649568] {1:25747:256} Container [ Name: UI_START
…
RESOURCE:
TextMessage[r1]
2009-12-29 17:07:24.742: [UiServer][2689649568] {1:25747:256} Sending message to PE. ctx= 0xa3819430
节点0上的CRSD日志(with PE master)
2009-12-29 17:07:24.745: [ CRSPE][2660580256] {1:25747:256} Cmd : 0xa7258ba8 :
flags: HOST_TAG | QUEUE_TAG
2009-12-29 17:07:24.745: [ CRSPE][2660580256] {1:25747:256} Processing PE
command id=347. Description: [Start Resource : 0xa7258ba8]
2009-12-29 17:07:24.748: [ CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new
target state: [ONLINE] old value: [OFFLINE]
2009-12-29 17:07:24.748: [ CRSOCR][2664782752] {1:25747:256} Multi Write Batch
processing…
2009-12-29 17:07:24.753: [ CRSPE][2660580256] {1:25747:256} Sending message to
agfw: id = 2198
这里,PE执行政策评估和目标节点与代理进行交互(开始行动)和OCR(记录目标的新值)。
CRSD节点2上的日志(启动代理,将消息转发给它)
2009-12-29 17:07:24.763: [ AGFW][2703780768] {1:25747:256} Agfw Proxy Server
received the message: RESOURCE_START[r1 1 1] ID 4098:2198
2009-12-29 17:07:24.767: [ AGFW][2703780768] {1:25747:256} Starting the agent:
/ade/agusev_bug/oracle/bin/scriptagent with user id: agusev and incarnation:1
节点2上的代理日志 (代理执行启动命令)
2009-12-29 17:07:25.120: [ AGFW][2966404000] {1:25747:256} Agent received the
message: RESOURCE_START[r1 1 1] ID 4098:1459
2009-12-29 17:07:25.122: [ AGFW][2987383712] {1:25747:256} Executing command:
start for resource: r1 1 1
2009-12-29 17:07:26.990: [ AGFW][2987383712] {1:25747:256} Command: start for
resource: r1 1 1 completed with status: SUCCESS
2009-12-29 17:07:26.991: [ AGFW][2966404000] {1:25747:256} Agent sending reply
for: RESOURCE_START[r1 1 1] ID 4098:1459
节点2上的CRSD日志(代理回复,将信息传回PE)
2009-12-29 17:07:27.514: [ AGFW][2703780768] {1:25747:256} Agfw Proxy Server
received the message: CMD_COMPLETED[Proxy] ID 20482:2212
2009-12-29 17:07:27.514: [ AGFW][2703780768] {1:25747:256} Agfw Proxy Server
replying to the message: CMD_COMPLETED[Proxy] ID 20482:2212
节点0上的CRSD 日志(收到回复信息,通知通讯员并返回给UI服务器,通讯员发布信息到EVM)
2009-12-29 17:07:27.012: [ CRSPE][2660580256] {1:25747:256} Received reply to
action [Start] message ID: 2198
2009-12-29 17:07:27.504: [ CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new
external state [ONLINE] old value: [OFFLINE] on agusev_bug_2 label = []
2009-12-29 17:07:27.504: [ CRSRPT][2658479008] {1:25747:256} Sending UseEvm mesg
2009-12-29 17:07:27.513: [ CRSPE][2660580256] {1:25747:256} UI Command [Start
Resource : 0xa7258ba8] is replying to sender.
节点1上的CRSD日志(crsctl命令执行完成,UI服务器写出响应,完成API请求)
2009-12-29 17:07:27.525: [UiServer][2689649568] {1:25747:256} Container [ Name:
UI_DATA
r1: TextMessage[0]
]
2009-12-29 17:07:27.526: [UiServer][2689649568] {1:25747:256} Done for
ctx=0xa3819430
The above demonstrates the ease of following distributed processing of a single request across 4 processes on 3 nodes by using tints as a way to filter, extract, group and correlate information pertaining to a single event across a plurality of diagnostic logs.
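在实际排查时,可以直接用 tint 作为关键字,在各节点的 CRSD 和代理日志中过滤出同一事件的完整处理过程,例如(tint 值 {1:25747:256} 与 Grid 安装路径均为示例):

# 在本节点的 crsd 日志和所有 crsd 代理日志中过滤同一个 tint 对应的事件
GRID_HOME=/u01/app/11.2.0/grid
HOST=$(hostname -s)
grep -h "{1:25747:256}" \
    $GRID_HOME/log/$HOST/crsd/crsd.log \
    $GRID_HOME/log/$HOST/agent/crsd/*/*.log | sort

在每个节点上重复同样的 grep,即可拼出该请求跨节点的完整时间线。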
11.2 集群件的一个新特性是网格即插即用(GPnP),由 GPnP 守护进程(GPnPD)管理。GPnPD 提供对 GPnP 概要文件的访问,并在集群各节点之间协调概要文件的更新,以确保所有节点都持有最新的概要文件。
GPnP 配置由概要文件(profile)和钱包(wallet)组成,它们在每个节点上都是相同的,并在 Grid 安装过程中被创建和复制。GPnP 概要文件是一个 XML 文本文件,其中包含组成集群所必需的引导信息,例如集群名称、GUID、发现字符串以及预期的网络配置,但不包含具体节点的细节信息。概要文件由 GPnPD 管理,缓存在每个节点的 GPnP 缓存中;如果没有进行更改,则在所有节点上都是相同的,并通过序列号来区分版本。
GPnP 钱包只是一个二进制 blob,包含用于签名和验证 GPnP 概要文件的 RSA 公钥/私钥。钱包在所有节点上都是相同的,在安装 Grid 软件时创建,之后不会更改并一直有效。
一个典型的配置文件包含以下信息。永远不要直接修改这个 XML 文件,而应通过受支持的工具(例如 ASMCA、asmcmd、oifcfg 等)来修改 GPnP 配置信息。
也不建议用 gpnptool 直接修改 GPnP 配置文件:修改配置文件需要很多步骤,一旦添加了无效信息,就会破坏配置文件并在后续引发问题。
# gpnptool get
Warning: some command line parameters were defaulted. Resulting command line:
/scratch/grid_home_11.2/bin/gpnptool.bin get -o-
<?xml version=”1.0″ encoding=”UTF-8″?><gpnp:GPnP-Profile Version=”1.0″
xmlns=”http://www.grid-pnp.org/2005/11/gpnp-profile”
xmlns:gpnp=”http://www.grid- pnp.org/2005/11/gpnp-profile”
xmlns:orcl=”http://www.oracle.com/gpnp/2005/11/gpnp- profile”
xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance”
xsi:schemaLocation=”http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd”
ProfileSequence=”4″ ClusterUId=”0cd26848cf4fdfdebfac2138791d6cf1″
ClusterName=”stnsp0506″ PALocation=””><gpnp:Network-Profile><gpnp:HostNetwork
id=”gen” HostName=”*”><gpnp:Network id=”net1″ IP=”10.137.8.0″ Adapter=”eth0″
Use=”public”/><gpnp:Network id=”net2″ IP=”10.137.20.0″ Adapter=”eth2″
Use=”cluster_interconnect”/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-
Profile id=”css” DiscoveryString=”+asm”
LeaseDuration=”400″/><orcl:ASM-Profile id=”asm”
DiscoveryString=”/dev/sdf*,/dev/sdg*,/voting_disk/vote_node1″
SPFile=”+DATA/stnsp0506/asmparameterfile/registry.253.699162981″/>
<ds:Signature xmlns:ds=”http://www.w3.org/2000/09/xmldsig#”>
<ds:SignedInfo><ds:CanonicalizationM ethod
Algorithm=”http://www.w3.org/2001/10/xml-exc-c14n#”/><ds:SignatureMethod
Algorithm=”http://www.w3.org/2000/09/xmldsig#rsa-sha1″/><ds:Reference URI=””>
<ds:Transforms><ds:Transform Algorithm=”http://www.w3.org/2000/09/xmldsig#enveloped-signature”/>
<ds:Transform Algorithm=”http://www.w3.org/2001/10/xml-exc-c14n#”>
<InclusiveNamespaces xmlns=”http://www.w3.org/2001/10/xml-exc-c14n#”
PrefixList=”gpnp orcl xsi”/></ds:Transform></ds:Transforms>
<ds:DigestMethod Algorithm=”http://www.w3.org/2000/09/xmldsig#sha1″/><ds:DigestValue>ORAmrPMJ/plFtG Tg/mZP0fU8ypM=</ds:DigestValue>
</ds:Reference></ds:SignedInfo><ds:SignatureValue>
K u7QBc1/fZ/RPT6BcHRaQ+sOwQswRfECwtA5SlQ2psCopVrO6XJV+BMJ1UG6sS3vuP7CrS8LXrOTyoIxSkU 7xWAIB2Okzo/Zh/sej5O03GAgOvt+2OsFWX0iZ1+2e6QkAABHEsqCZwRdI4za3KJeTkIOPliGPPEmLuImu
DiBgMk=</ds:SignatureValue></ds:Signature></gpnp:GPnP-Profile>
Success.
初始的 GPnP 配置在安装 Grid 集群软件时由 root 脚本创建并传播。在全新安装时,配置文件的内容来自安装程序写入的 Grid_home/crs/install/crsconfig_params。
GPnP 守护进程与其他守护进程一样,由 OHASD 的 oraagent 启动和管理。GPnPD 的主要职责是对外提供配置文件服务,以支撑堆栈的启动。GPnPD 的主要启动顺序如下:
有几个客户端工具能够直接修改GPnP配置文件。要求ocssd是运行的:
注意,对配置文件的更改会通过整个集群范围的 CSS 锁进行串行化(bug 7327595)。
Grid_home/bin/gpnptool是真正维护gpnp文件的工具。查看详细的信息,可以运行:
Oracle GPnP Tool Usage:
“gpnptool <verb> <switches>”, where verbs are:
create Create a new GPnP Profile
edit Edit existing GPnP Profile
getpval Get value(s) from GPnP Profile
get Get profile in effect on local node
rget Get profile in effect on remote GPnP node
put Put profile as a current best
find Find all RD-discoverable resources of given type
lfind Find local gpnpd server
check Perform basic profile sanity checks
c14n Canonicalize, format profile text (XML C14N)
sign Sign/re-sign profile with wallet’s private key
unsign Remove profile signature, if any
verify Verify profile signature against wallet certificate
help Print detailed tool help
ver Show tool version
为了获取更多的日志和跟踪信息,可以设置环境变量 GPNP_TRACELEVEL(取值范围为 0-6,见下面的示例)。GPnP 跟踪文件位于:
Grid_home/log/<hostname>/alert*
Grid_home/log/<hostname>/client/gpnptool*, other client logs
Grid_home/log/<hostname>/gpnpd|mdnsd/*
Grid_home/log/<hostname>/agent/ohasd/oraagent_<username>/*
产品安装文件里有基本信息,位置在:
Grid_home/crs/install/crsconfig_params
Grid_home/cfgtoollogs/crsconfig/root*
Grid_home/gpnp/*,
Grid_home/gpnp/<hostname>/* [profile+wallet]
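下面是一个临时提高 GPnP 客户端跟踪级别的示例(级别 5 仅为示例;该环境变量只影响当前会话中运行的 GPnP 客户端工具):

# 提高 GPnP 客户端的跟踪级别后再执行诊断命令
export GPNP_TRACELEVEL=5
gpnptool lfind
gpnptool get
# 生成的客户端跟踪可在 Grid_home/log/<hostname>/client/gpnptool* 下查看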
如果GPnP 安装失败,应该进行下面失败场景的检查:
如果是在GPnP运行过程中产生错误,应该进行如下检查:
上述所有排查的第一步,都应该先查看相关守护进程的日志文件,并通过 crsctl stat res -init -t 检查资源的状态。
GPnPD没有运行的其他解决步骤:
# gpnptool check -\
p=/scratch/grid_home_11.2/gpnp/stnsp006/profiles/peer/profile.xml
Profile cluster=”stnsp0506″, version=4
GPnP profile signed by peer, signature valid.
Got GPnP Service current profile to check against.
Current GPnP Service Profile cluster=”stnsp0506″, version=4
Error: profile version 4 is older than- or duplicate of- GPnP Service current profile version 4.
Profile appears valid, but push will not succeed.
# gpnptool verify Oracle GPnP Tool
verify Verify profile signature against wallet certificate Usage:
“gpnptool verify <switches>”, where switches are:
-p[=profile.xml] GPnP profile name
-w[=file:./] WRL-locator of OracleWallet with crypto keys
-wp=<val> OracleWallet password, optional
-wu[=owner] Wallet certificate user (enum: owner,peer,pa)
-t[=3] Trace level (min..max=0..7), optional
-f=<val> Command file name, optional
-? Print verb help and exit
– 如果GPnPD服务在本地,可以用gpnptool lfind进行检查
# gpnptool lfind
Success. Local gpnpd found.
'gpnptool get' 可以返回本地生效的配置文件内容。如果 gpnptool lfind 或 get 挂起了,客户端侧的挂起信息以及 Grid_home/log/<hostname>/gpnpd 下的 GPnPD 日志,将对进一步排查问题有很大帮助。
– 检查远程GPnPD是响应的,’find’选项将很有帮助:
# gpnptool find -h=stnsp006
Found 1 instances of service 'gpnp'.
mdns:service:gpnp._tcp.local.://stnsp006:17452/agent=gpnpd,cname=stnsp0506,host=stnsp006,pid=13133/gpnpd h:stnsp006 c:stnsp0506
如果上面的操作挂起了或者返回错误了,检查
Grid_home/log/<hostname>/mdnsd/*.log files和 gpnpd日志。
– 检查所有节点都是响应的,运行gpnptool find –c=<clustername>
# gpnptool find -c=stnsp0506
Found 2 instances of service 'gpnp'.
mdns:service:gpnp._tcp.local.://stnsp005:23810/agent=gpnpd,cname=stnsp0506,host=stnsp005,pid=12408/gpnpd h:stnsp005 c:stnsp0506
mdns:service:gpnp._tcp.local.://stnsp006:17452/agent=gpnpd,cname=stnsp0506,host=stnsp006,pid=13133/gpnpd h:stnsp006 c:stnsp0506
GPnP 配置文件的备份存放在本地的 OLR 和 OCR 中。如果配置文件丢失或损坏,GPnPD 会从备份中重建配置文件。
GNS 在集群内执行名称解析;出于性能方面的考虑,GNS 并不总是使用 mDNS。
在 11.2 中,我们支持在私有互连以及公共网络上的几乎所有虚拟 IP 地址上使用 DHCP。为了让集群之外的客户端能够发现集群中的虚拟主机名,我们提供了 GNS,它与上级 DNS 协同工作,为外部提供名称解析。
本节介绍如何简单的进行DHCP和GNS的配置。一个复杂的网络环境可能需要更复杂的解决方案。配置GNS和DHCP必须在grid安装之前。
GNS提供什么
DHCP 可以为主机动态分配 IP 地址,但无法提供一个便于外部客户端使用的名字,因此在服务器环境中很少单独使用。Oracle 11.2 集群件通过提供自己的名称解析服务(GNS)来解决这个问题,并通过与 DNS 的对接让这些名字对客户端可见。
设置网络配置
要让 GNS 为客户端工作,需要在上级 DNS 中把一个子域(subdomain)委派给集群,并且集群必须在一个 DNS 已知的地址上运行 GNS。该 GNS 地址由集群中配置的一个静态 VIP 承载,GNS 守护进程跟随这个集群 VIP 一起迁移,并为该子域中的名字提供解析服务。
需要配置四方面:
获取一个IP地址作为GNS-VIP
向网络管理员申请分配一个 IP 地址作为 GNS-VIP。这个 IP 地址必须在公司 DNS 中注册为给定集群的 GNS-VIP 名称,例如 strdv0108-gns.mycorp.com。集群软件安装之后,该地址将由集群软件管理。
创建一个下面格式的条目在适当的DNS区域文件里:
# Delegate to gns on strdv0108
strdv0108-gns.mycorp.com NS strdv0108.mycorp.com
# Let the world know to go to the GNS vip
strdv0108.mycorp.com 10.9.8.7
在这里,子区域是strdv0108.mycorp.com,GNS VIP 已经分配了的名称是strdv0108-gns.us.mycorp.com(对应于一个静态IP地址),GNS守护进程将监听默认端口53。
注意:这并不是为 strdv0108.mycorp.com 这个名字本身建立一个地址,而是创建了一种解析该子域中名字(例如 clusterNode1-VIP.strdv0108.mycorp.com)的途径。配置完成后,可以用下面的示例进行验证。
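完成 DNS 委派并启动 GNS 之后,可以从集群外的客户端用 nslookup 验证子域中的名字能否被解析(以下名字沿用上文示例):

# 验证 GNS VIP 名称本身可以被公司 DNS 解析
nslookup strdv0108-gns.mycorp.com

# 验证子域中的名字(例如某个节点 VIP)能够通过委派转交给 GNS 解析
nslookup clusterNode1-VIP.strdv0108.mycorp.com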
DHCP
主机在申请 IP 地址时会向所在物理网络发送广播消息。DHCP 服务器可以响应该请求并返回一个地址,同时附带其他信息,比如应使用的网关、DNS 服务器、域名、NTP 服务器等等。
当我们在公共网络上使用 DHCP 时,需要获取若干 IP 地址:
GNS VIP 不能从DHCP获取,因为它必须提前知道,因此必须静态分配。
DHCP配置文件在/etc/dhcp.conf
使用下面的配置例如:
/etc/dhcp.conf 将包含类似的信息:
subnet 10.228.212.0 netmask 255.255.252.0
{
default-lease-time 43200;
max-lease-time 86400;
option subnet-mask 255.255.252.0;
option broadcast-address 10.228.215.255;
option routers 10.228.212.1;
option domain-name-servers M.N.P.Q, W.X.Y.Z;
option domain-name "strdv0108.mycorp.com";
pool
{
range 10.228.212.10 10.228.215.254;
}
}
名称解析
/etc/resolv.conf必须包含可以解析企业DNS服务器的命名服务器条目,和总超时周期配置必须低于30秒。例如:
/etc/resolv.conf:
options attempts: 2
options timeout: 1
search us.mycorp.com mycorp.com
nameserver 130.32.234.42
nameserver 133.2.2.15
/etc/nsswitch.conf 控制名称服务的查找顺序。在某些系统上,如果配置了网络信息系统(NIS),可能会在解析 Oracle SCAN 时出现问题。建议把 NIS 条目放在 hosts 查找顺序的最后,如下例所示。
/etc/nsswitch.conf
hosts: files dns nis
请参阅:Oracle Grid Infrastructure Installation Guide,
“DNS Configuration for Domain Delegation to Grid Naming Service” for more information.
在 11.2 中,GNS 由集群代理 orarootagent 管理,该代理负责启动、停止和检查 GNS。通过 srvctl add gns -d <mycluster.company.com> 命令可以把 GNS 信息添加到 OCR 并把 GNS 加入集群。
GNS 服务器启动时,会从 OCR 中检索其服务的子域名等所需信息并启动各个线程。所有线程运行起来之后,GNS 服务器做的第一件事是进行一次自检,以测试名称解析是否正常工作:它通过客户端 API 注册一个虚拟的名称和地址,然后尝试解析这个名字。如果解析成功且返回的地址与该虚拟地址匹配,自检即告成功,并把信息写入 alert<hostname>.log。自检只做一次,即使测试失败,GNS 服务器也会继续运行。
GNS服务的默认trace路径是Grid_home/log/<hostname>/gnsd/。trace文件看起来像下面的格式:
<Time stamp>: [GNS][Thread ID]<Thread name>::<function>:<message>
2009-09-21 10:33:14.344: [GNS][3045873888] Resolve::clsgnmxInitialize: initializing mutex 0x86a7770 (SLTS 0x86a777c).
GNS代理orarootagent会定期检查GNS服务。检查是通过查询GNS的状态。
要确认代理是否已成功向 GNS 通告(advertise)各个 VIP,执行:
# grep -i 'updat.*gns'
Grid_home/log/<hostname>/agent/crsd/orarootagent_root/orarootagent_*
orarootagent_root.log:2009-10-07 10:17:23.513: [ora.gns.vip] [check] Updating GNS with stnsp0506-gns-vip 10.137.13.245
orarootagent_root.log:2009-10-07 10:17:23.540: [ora.scan1.vip] [check] Updating GNS with stnsp0506-scan1-vip 10.137.12.200
orarootagent_root.log:2009-10-07 10:17:23.562: [ora.scan2.vip] [check] Updating GNS with stnsp0506-scan2-vip 10.137.8.17
orarootagent_root.log:2009-10-07 10:17:23.580: [ora.scan3.vip] [check] Updating GNS with stnsp0506-scan3-vip 10.137.12.214
orarootagent_root.log:2009-10-07 10:17:23.597: [ora.stnsp005.vip] [check] Updating GNS with stnsp005-vip 10.137.12.228
orarootagent_root.log:2009-10-07 10:17:23.615: [ora.stnsp006.vip] [check] Updating GNS with stnsp006-vip 10.137.12.226
与 GNS 交互的命令行接口是 srvctl(这是唯一受支持的途径)。crsctl 也可以停止和启动 ora.gns,但除非开发人员明确要求,否则我们不支持这种用法。
GNS 的操作通过下面的命令完成:
# srvctl {start|stop|modify|etc.} gns …
启动 gns
# srvctl start gns [-l <log_level>]
其中 -l 指定 GNS 运行时的日志级别。
停止gns
# srvctl stop gns
发布名称和地址
# srvctl modify gns -N <name> -A <address>
默认的 GNS 服务日志级别是 0,可以通过 ps -ef | grep gnsd.bin 简单查看:
/scratch/grid_home_11.2/bin/gnsd.bin -trace-level 0 -ip-address 10.137.13.245 -startup-endpoint ipc://GNS_stnsp005_31802_429f8c0476f4e1
调试 GNS 服务时可能需要提高日志级别。必须先通过 srvctl stop gns 命令停掉 GNS 服务,再通过 srvctl start gns -v -l 5 重启。只有 root 用户可以停止和启动 GNS。
Usage: srvctl start gns [-v] [-l <log_level>] [-n <node_name>]
-v Verbose output
-l <log_level> Specify the level of logging that GNS should run with.
-n <node_name> Node name
-h Print usage
trace 级别在 0-6 之间,级别 5 在所有情况下应该都够用了;不推荐设置为级别 6,那会使 gnsd 消耗大量的 CPU。
在 11.2.0.1 中,由于 bug 8705125,初始安装后默认的 GNS 服务日志级别是 6。用 srvctl stop / start 命令重启 GNS 可以把日志级别设回 0;这只需要停止和启动 gnsd.bin,不会对正在运行的集群产生其他影响。
用srvctl 可用查看当前GNS配置
$ srvctl config gns -a
GNS is enabled.
GNS is listening for DNS server requests on port 53
GNS is using port 5353 to connect to mDNS
GNS status: OK
Domain served by GNS: stnsp0506.oraclecorp.com
GNS version: 11.2.0.1.0
GNS VIP network: ora.net1.network
从11.2.0.2开始,使用-l 选项对调试GNS很有帮助。
Grid 进程间通讯(GIPC)是一个通用的通讯设施,用来替代 CLSC/NS。它对从操作系统到任何客户端的整个通讯栈提供完全的控制。在 11.2 中已经不再依赖 NS,但为了向下兼容,仍然支持 CLSC 客户端(主要来自 11.1)。
GIPC可以支持多种通讯类型:CLSC, TCP, UDP, IPC和GIPC。
关于 GIPC 端点的监听配置则有些不同:私有/集群互连现在定义在 GPnP 配置文件里。
The requirement for the same interfaces to exist with the same name on all nodes is more relaxed, as long as communication will be established. GPnP 配置文件里关于私有和公共网络的配置如下:
<gpnp:Network id="net1" IP="10.137.8.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP="10.137.20.0" Adapter="eth2" Use="cluster_interconnect"/>
日志和诊断
GIPC的默认trace级别只是输出错误,默认的trace级别在不同组件之间是0-2。要调试和GIPC相关的问题,你应该提高跟踪日志的级别,下面将进行介绍。
通过crsctl设置跟踪日志级别
用crsctl设置不同组件的GIPC trace级别。
例如:
# crsctl set log css COMMCRS:abcd(其中 abcd 为各个子组件对应的日志级别)
如果只想提高 GIPC 相关的跟踪日志级别,其余保持默认值 2,执行:
# crsctl set log css COMMCRS:2242
要为所有的组件(NM、GM 等等)打开 GIPC 跟踪,设置:
# crsctl set log css COMMCRS:3
或
# crsctl set log css COMMCRS:4
级别为4的话,会产生大量的跟踪日志,因此ocssd.log就会很快的进行循环覆盖。
通过GIPC_TRACE_LEVEL和GIPC_FIELD_LEVEL设置跟踪级别
Another option is to set a pair of environment variables for the component using GIPC as communication e.g. ocssd. In order to achieve this, a wrapper script is required. Taking ocssd as an example, the wrapper script is Grid_home/bin/ocssd that invokes ‘ocssd.bin’. Adding the variables below to the wrapper script (under the LD_LIBRARY_PATH) and restarting ocssd will enable GIPC tracing. To restart ocssd.bin, perform a crsctl stop/start cluster.
case `/bin/uname` in
Linux)
    LD_LIBRARY_PATH=/scratch/grid_home_11.2/lib
    export LD_LIBRARY_PATH
    export GIPC_TRACE_LEVEL=4
    export GIPC_FIELD_LEVEL=0x80
    # forcibly eliminate LD_ASSUME_KERNEL to ensure NPTL where available
    LD_ASSUME_KERNEL=
    export LD_ASSUME_KERNEL
    LOGGER="/usr/bin/logger"
    if [ ! -f "$LOGGER" ]; then
        LOGGER="/bin/logger"
    fi
    LOGMSG="$LOGGER -puser.err"
    ;;
上面的例子把跟踪级别设置为 4。这两个环境变量的有效取值为:
GIPC_TRACE_LEVEL:有效范围 [0-6]
GIPC_FIELD_LEVEL:仅支持 0x80
通过GIPC_COMPONENT_TRACE设置跟踪级别
使用 GIPC_COMPONENT_TRACE 环境变量可以进行更细粒度的跟踪。已定义的组件为 GIPCGEN, GIPCTRAC, GIPCWAIT, GIPCXCPT, GIPCOSD, GIPCBASE, GIPCCLSA, GIPCCLSC, GIPCEXMP, GIPCGMOD, GIPCHEAD, GIPCMUX, GIPCNET, GIPCNULL, GIPCPKT, GIPCSMEM, GIPCHAUP, GIPCHALO, GIPCHTHR, GIPCHGEN, GIPCHLCK, GIPCHDEM, GIPCHWRK
例如:
# export GIPC_COMPONENT_TRACE=GIPCWAIT:4,GIPCNET:3
跟踪信息样子如下:
2009-10-23 05:47:40.952: [GIPCMUX][2993683344]gipcmodMuxCompleteSend: [mux] Completed send req 0xa481c0e0 [00000000000093a6] { gipcSendRequest : addr ”, data 0xa481c830, len 104, olen 104, parentEndp 0x8f99118, ret gipcretSuccess (0), objFlags 0x0, reqFlags 0x2 }
2009-10-23 05:47:40.952: [GIPCWAIT][2993683344]gipcRequestSaveInfo: [req]
Completed req 0xa481c0e0 [00000000000093a6] { gipcSendRequest : addr ”, data 0xa481c830, len 104, olen 104, parentEndp 0x8f99118, ret gipcretSuccess (0), objFlags 0x0, reqFlags 0x4 }
目前只有部分组件(CSS、GPnPD、GNSD,以及 MDNSD 的很小一部分)使用 GIPC。
其他组件如 CRS/EVM/OCR/CTSS 从 11.2.0.2 开始使用 GIPC。设置 GIPC 跟踪日志级别对于调试连接问题非常重要。
The CTSS is a new feature in Oracle Clusterware 11g release 2 (11.2), which takes care of time synchronization in a cluster, in case the network time protocol daemon is not running or is not configured properly.
The CTSS synchronizes the time on all of the nodes in a cluster to match the time setting on the CTSS master node. When Oracle Clusterware is installed, the Cluster Time Synchronization Service (CTSS) is installed as part of the software package. During installation, the Cluster Verification Utility (CVU) determines if the network time protocol (NTP) is in use on any nodes in the cluster. On Windows systems, CVU checks for NTP and Windows Time Service.
If Oracle Clusterware finds that NTP is running or that NTP has been configured, then NTP is not affected by the CTSS installation. Instead, CTSS starts in observer mode (this condition is logged in the alert log for Oracle Clusterware). CTSS then monitors the cluster time and logs alert messages, if necessary, but CTSS does not modify the system time. If Oracle Clusterware detects that NTP is not running and is not configured, then CTSS designates one node as a clock reference, and synchronizes all of the other cluster member time and date settings to those of the clock reference.
Oracle Clusterware considers an NTP installation to be misconfigured if one of the following is true:
To check whether CTSS is running in active or observer mode run crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
or
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset from the reference node (in msec): 100
The tracing for the ctssd daemon is written to the octssd.log. The alert log (alert<hostname>.log) also contains information about the mode in which CTSS is running.
[ctssd(13936)]CRS-2403:The Cluster Time Synchronization Service on host node1 is in observer mode.
[ctssd(13936)]CRS-2407:The new Cluster Time Synchronization Service reference node
is host node1.
[ctssd(13936)]CRS-2401:The Cluster Time Synchronization Service started on host node1.
There are pre-install CVU checks performed automatically during installation, like: cluvfy stage -pre crsinst <...>
This step will check and make sure that the operating system time synchronization software (e.g. NTP) is either properly configured and running on all cluster nodes, or on none of the nodes.
During the post-install check, CVU will run cluvfy comp clocksync –n all. If CTSS is in observer mode, it will perform a configuration check as above. If the CTSS is in active mode, we verify that the time difference is within the limit.
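作为示意,安装后的时钟同步检查也可以随时手工运行(-verbose 显示每个节点的详细结果):
$ cluvfy comp clocksync -n all -verbose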
When CTSS comes up as part of the clusterware startup, it performs step time sync, and if everything goes well, it publishes its state as ONLINE. There is a start dependency on ora.cssd but note that it has no stop dependency, so if for some reasons (maybe faulted CTSSD), CTSSD dumps core or exits, nothing else should be affected.
The chart below shows the start dependency build on ora.ctssd for other resources.
crsctl stat res ora.ctssd -init -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER          STATE_DETAILS
--------------------------------------------------------------------------------
ora.ctssd
      1        ONLINE  ONLINE       node1           OBSERVER
Debugging mdnsd
In order to capture mdnsd network traffic, use the mDNS Network Monitor located in
Grid_home/bin:
# mkdir Grid_home/log/$HOSTNAME/netmon
# Grid_home/bin/oranetmonitor &
The output from oranetmonitor will be captured in netmonOUT.log in the above directory.
在 ASM 上存储 OCR 和投票文件,免去了第三方卷管理器,也免去了安装 Oracle 集群件时为 OCR 和投票文件做复杂磁盘分区的工作。
ASM 管理投票文件的方式和其他文件不一样。当投票文件放在 ASM 磁盘组的磁盘上时,Oracle 集群件会准确记录投票文件位于哪个磁盘组的哪个磁盘上;即使 ASM 宕掉,CSS 仍能继续访问投票文件。如果你选择把投票文件存放在 ASM 上,则所有的投票文件都必须存放在 ASM 上;我们不支持一部分投票文件在 ASM 上、另一部分在 NAS 上。
一个磁盘组上能够存放的投票文件数量取决于 ASM 磁盘组的冗余级别。
By default, Oracle ASM puts each voting file in its own failure group within the disk group. A failure group is a subset of the disks in a disk group, which could fail at the same time because they share hardware, e.g. a disk controller. The failure of common hardware must be tolerated. For example, four drives that are in a single removable tray of a large JBOD (Just a Bunch of Disks) array are in the same failure group because the tray could be removed, making all four drives fail at the same time. Conversely, drives in the same cabinet can be in multiple failure groups if the cabinet has redundant power and cooling so that it is not necessary to protect against failure of the entire cabinet. However, Oracle ASM mirroring is not intended to protect against a fire in the computer room that destroys the entire cabinet. If voting files stored on Oracle ASM with Normal or High redundancy, and the storage hardware in one failure group suffers a failure, then if there is another disk available in a disk group in an unaffected failure group, Oracle ASM recovers the voting file in the unaffected failure group.
$ crsctl query css votedisk
## STATE | File Universal Id | File Name | Disk group |
— —– | —————– | ——— | ———- |
投票文件存放在 ASM 里时,如果某个现存的投票文件损坏,它可能会被自动删除并重新添加回去。要把投票文件迁移到其他位置,执行:
$ crsctl replace css votedisk /nas/vdfile1 /nas/vdfile2 /nas/vdfile3
或
$ crsctl replace css votedisk +OTHERDG
假如是拓展的 Oracle 集群/拓展的 RAC 配置,第三个投票文件必须存放在第三个位置,以防止某个数据中心宕机。我们支持把第三个投票文件放在标准 NFS 上。更多信息参考附录 "Oracle Clusterware 11g release 2 (11.2) – Using standard NFS to support a third voting file on a stretch cluster configuration"。
参见: Oracle Clusterware Administration and Deployment Guide, “Voting file, Oracle Cluster Registry, and Oracle Local Registry” for more information. For information about extended clusters and how to configure the quorum voting file see the Appendix.
在 11.2 中,OCR 可以存放在 ASM 中。与 ASM 的成员关系和状态表(PST)一样,OCR 在磁盘组的多个磁盘上有副本,因此 OCR 能容忍的磁盘损坏数量与底层磁盘组相同;磁盘故障时,其内容可以被重定位/重新均衡到其他磁盘。
为了在磁盘组中存放 OCR,磁盘组中有一种特殊的文件类型,叫做 'ocr'。
默认的配置文件的位置是/etc/oracle/ocr.loc
# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE
From a user and maintenance perspective, the rest remains the same. The OCR can only be configured in ASM when the cluster has completely migrated to 11.2 (crsctl query crs activeversion >= 11.2.0.1.0). We still support mixed configurations, so we could have one OCR stored in ASM and another stored on a supported NAS device, as we support up to 5 OCR locations in 11.2.0.1. We no longer support raw or block devices for either OCR or voting files.
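作为示意(+OCRDG 为假设的磁盘组名,NAS 路径沿用上文输出中的例子),可以用 ocrconfig 添加或删除 OCR 位置,在 ASM 和 NAS 之间迁移 OCR:
# 以 root 用户执行
# ocrconfig -add +OCRDG
# ocrconfig -delete /nas/cluster3/ocr3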
在 ASM 实例启动时,存放 OCR 的磁盘组会自动挂载。CRSD 对 ASM 的依赖关系由 OHASD 维护。
OCRCHECK
There are small enhancements in ocrcheck like the –config which is only checking the configuration. Run ocrcheck as root otherwise the logical corruption check will not run. To check OLR data use the –local keyword.
Usage: ocrcheck [-config] [-local]
Shows OCR version, total, used and available space
Performs OCR block integrity (header and checksum) checks
Performs OCR logical corruption checks (11.1.0.7)
'-config' checks just configuration (11.2)
'-local' checks OLR (default is OCR)
Can be run when stack is up or down
输出结果就像:
# ocrcheck
Status of Oracle Cluster Registry is as follows:
         Version                  : 3
         Total space (kbytes)     : 262120
         Used space (kbytes)      : 3072
         Available space (kbytes) : 259048
         ID                       : 701301903
         Device/File Name         : +DATA
                                    Device/File integrity check succeeded
         Device/File Name         : /nas/cluster3/ocr3
                                    Device/File integrity check succeeded
         Device/File Name         : /nas/cluster5/ocr1
                                    Device/File integrity check succeeded
         Device/File Name         : /nas/cluster2/ocr2
                                    Device/File integrity check succeeded
         Device/File Name         : /nas/cluster4/ocr4
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
OLR 的结构和 OCR 相似,是节点本地的信息库,由 OHASD 管理。OLR 中的配置信息只属于本地节点,不与其他节点共享。
其配置存放在 /etc/oracle/olr.loc(Linux 上)或其他操作系统的类似位置上,其中记录了 Oracle 集群件安装后 OLR 文件的默认位置。
OLR 里存放的是 OHASD 启动集群或把节点加入集群所必需的信息,包括 GPnP 钱夹、集群配置和版本信息等数据。
OLR 的键(key)结构和 OCR 是一样的,检查或转储 OLR 信息的工具也和 OCR 的相同。
查看OLR的位置,运行命令:
# ocrcheck -local -config
Oracle Local Registry configuration is :
         Device/File Name         : Grid_home/cdata/node1.olr
转储OLR的内容,执行命令:
# ocrdump -local -stdout (or filename)
ocrdump -h to get the usage
参见:Oracle Clusterware Administration and Deployment Guide, “Managing the Oracle Cluster Registry and Oracle Local Registries” for more information about using the ocrconfig and ocrcheck.
OCR 操作必须在 ASM 挂载了相应磁盘组之后才能执行。强制卸载该磁盘组或强制关闭 ASM 实例会报错。
当堆栈是运行的,CRSD保持读写OCR。
OHASD maintains the resource dependency and will bring up ASM with the required diskgroup mounted before it starts CRSD.
Once ASM is up with the diskgroup mounted, the usual ocr* commands (ocrcheck, ocrconfig, etc.) can be used.
关闭一个其上有活动 OCR 的 ASM 实例(意味着该节点上正运行着 CRSD)会报 ORA-15097 错误。要查看哪些客户端正在访问 ASM,执行命令:
asmcmd lsct (v$asm_client)
DB_Name Status Software_Version Compatible_version Instance_Name Disk_Group
+ASM CONNECTED 11.2.0.1.0 11.2.0.1.0 +ASM2 DATA
asmcmd lsof
DB_Name Instance_Name Path
+ASM +ASM2 +data.255.4294967295
+data.255用来标识在ASM上的OCR。
The ASM Diskgroup Resource
当创建一个磁盘组时,会自动创建名为 ora.<DGNAME>.dg 的磁盘组资源,其状态被设置成 ONLINE;如果磁盘组被卸载,状态就会被设置成 OFFLINE,因为这是由 CRS 管理的资源。当删除一个磁盘组时,对应的磁盘组资源也会被删除。
数据库访问 ASM 文件时,会在数据库和磁盘组之间自动建立依赖关系。然而,当数据库不再使用某个 ASM 文件或者 ASM 文件被移除时,这个依赖关系无法自动移除,需要用 srvctl 命令行工具手工处理。
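下面是一个用 srvctl 手工调整数据库对磁盘组依赖关系的示意(数据库名 orcl 和磁盘组列表均为假设值):
# 查看数据库当前的配置(包括依赖的磁盘组)
$ srvctl config database -d orcl
# 用 -a 重新指定数据库所依赖的磁盘组列表
$ srvctl modify database -d orcl -a "DATA,FRA"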
典型的ASM alert.log 里的成功/失败和警告信息:
Success:
NOTE: diskgroup resource ora.DATA.dg is offline
NOTE: diskgroup resource ora.DATA.dg is online
Failure
ERROR: failed to online diskgroup resource ora.DATA.dg
ERROR: failed to offline diskgroup resource ora.DATA.dg
Warning
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
This warning may appear when the stack is started.
WARNING: unknown state for diskgroup resource ora.DATA.dg
如果错误发生了,查看alert.log里关于资源操作的状态信息,如:
“ERROR”: the resource operation failed; check CRSD log and Agent log for more details
Grid_home/log/<hostname>/crsd/
Grid_home/log/<hostname>/agent/crsd/oraagent_user/
“WARNING”: cannot communicate with CRSD.
在集群启动引导阶段,ASM 实例先于 CRSD 启动并挂载磁盘组,此时这个警告可以忽略。
磁盘组资源的状态和磁盘组的状态是一致的。在少数情况下,二者会出现短暂的不同步:执行 srvctl 命令可以让状态同步,或者等待一段时间让代理去刷新状态。如果不同步的时间比较长,请检查 CRSD 日志和 ASM 日志获取更多细节。
要打开更全面的跟踪,可以设置事件 event="39505 trace name context forever, level 1"。
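作为示意(假设在 ASM 实例中以 SYSASM 身份执行,建议先咨询 Oracle 支持再在生产环境设置该事件):
SQL> alter system set events '39505 trace name context forever, level 1';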
仲裁(quorum)故障组是一种特殊类型的故障组:它不包含用户数据,在决定冗余要求时也不需要考虑它。
要在磁盘组里存放 OCR 或投票文件,磁盘组的 COMPATIBLE.ASM 兼容性属性必须设置为 11.2 或更高。
在拓展/延伸集群中,或者在两个存储阵列之外还需要第三个投票文件时,会用到仲裁故障组;但在安装软件的过程中并不支持直接创建仲裁故障组。
要创建一个在第三个阵列上带有仲裁故障组的磁盘组,可以执行:
SQL> CREATE DISKGROUP PROD NORMAL REDUNDANCY
FAILGROUP fg1 DISK '<a disk in SAN1>'
FAILGROUP fg2 DISK '<a disk in SAN2>'
QUORUM FAILGROUP fg3 DISK '<another disk or file on a third location>'
ATTRIBUTE 'compatible.asm' = '11.2.0.0';
如果磁盘组是用 asmca 创建的,把仲裁盘添加到磁盘组后,Oracle 集群件会自动更新 CSS 使用的投票盘位置,例如:
$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
— —– —————– ——— ———
Located 3 voting file(s).
如果是通过 SQL*Plus 创建的,就需要手工执行 crsctl replace css votedisk。
参见:Oracle Database Storage Administrator’s Guide, “Oracle ASM Failure Groups” for more information. Oracle Clusterware Administration and Deployment Guide, “Voting file, Oracle Cluster Registry, and Oracle Local Registry” for more information about backup and restore and failure recovery.
Oracle 建议把 ASM SPFILE 存放在磁盘组上。你不能给已经存在的 ASM SPFILE 创建别名。
如果没有使用共享的 Oracle Grid 家目录,Oracle ASM 实例可以使用 PFILE。适用于数据库初始化参数文件的文件名、默认位置和查找顺序规则,同样适用于 ASM 的初始化参数文件。
ASM 查找初始化参数文件的顺序是:
例如:在Linux环境下,SPFILE的默认路径是在Oracle grid的家目录下:
$ORACLE_HOME/dbs/spfile+ASM.ora
Backing Up, Copying or Moving an ASM SPFILE
你可以用 ASMCMD 的 spbackup、spcopy 或 spmove 命令来备份、复制或移动 ASM SPFILE。关于这些 ASMCMD 命令的细节,参见 Oracle Database Storage Administrator's Guide。
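下面是备份和移动 ASM SPFILE 的示意命令(磁盘组名与路径均为假设值):
$ asmcmd spbackup +DATA/spfileASM.ora /backup/spfileASM.bak
$ asmcmd spmove +DATA/spfileASM.ora +NEWDG/spfileASM.ora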
参见:Oracle Database Storage Administrator’s Guide “Configuring Initialization Parameters for an Oracle ASM Instance” for more information.
Oracle 集群件通过管理注册到集群中的资源来管理应用和进程。注册多少资源取决于你的应用:只由一个进程组成的应用通常只需要一个资源;更复杂的应用可能由多个进程或组件组成,需要多个资源。
通常所有的资源都是唯一的,但有些资源可能有共同的属性。Oracle 集群件用资源类型来组织这些相似的资源。使用资源类型有以下好处:
每个在 Oracle 集群件中注册的资源都必须指定一个资源类型。除了 Oracle 集群件自带的资源类型,还可以用 crsctl 工具自定义资源类型。资源类型包括:
所有用户自定义的资源类型都必须直接或间接地基于 local_resource 或 cluster_resource 这两种基础类型。
执行 crsctl stat type 命令可以列出所有已定义的类型:
TYPE_NAME=application
BASE_TYPE=cluster_resource
TYPE_NAME=cluster_resource
BASE_TYPE=resource
TYPE_NAME=local_resource
BASE_TYPE=resource
TYPE_NAME=ora.asm.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.cluster_resource.type
BASE_TYPE=cluster_resource
TYPE_NAME=ora.cluster_vip.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.cluster_vip_net1.type
BASE_TYPE=ora.cluster_vip.type
TYPE_NAME=ora.database.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.diskgroup.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.eons.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.gns.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.gns_vip.type
BASE_TYPE=ora.cluster_vip.type
TYPE_NAME=ora.gsd.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.listener.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.local_resource.type
BASE_TYPE=local_resource
TYPE_NAME=ora.network.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.oc4j.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.ons.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.registry.acfs.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.scan_listener.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.scan_vip.type
BASE_TYPE=ora.cluster_vip.type
TYPE_NAME=resource
BASE_TYPE=
要列出某个类型的所有属性和默认值,执行 crsctl stat type <typeName> -f(完整配置)或 -p(静态配置)。
这一节说明组成 resource 类型定义的属性。resource 类型的定义是抽象的、只读的,只能作为其他类型的基础类型使用。在 11.2.0.1 中,集群件不允许直接拓展用户定义的类型。
要查看基础 resource 类型的所有属性名称和默认值,运行 crsctl stat type resource -p 命令。
Name | History | Description |
NAME | From 10gR2 | The name of the resource. Resource names must be unique and may not be modified once the resource is created. |
TYPE | From 10gR2,
modified |
Semantics are unchanged; values other than application exist
Type: string Special Values: No |
CHECK_INTERVAL | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
DESCRIPTION | From 10gR2 | Unchanged Type: string
Special Values: No |
RESTART_ATTEMPTS | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
START_TIMEOUT | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
STOP_TIMEOUT | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
SCRIPT_TIMEOUT | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
UPTIME_THRESHOLD | From 10gR2 | Unchanged Type: string
Special Values: No Per-X Support: Yes |
AUTO_START | From 10gR2 | Unchanged Type: string
Format: restore|never|always Required: No Default: restore Special Values: No |
BASE_TYPE | New | The name of the base type from which this type extends. This is the value of the “TYPE” in the base type’s profile.
Type: string Format: [name of the base type] Required: Yes Default: empty string (none) Special Values: No Per-X Support: No |
DEGREE | New | This is the count of the number of instances of the resource that are allowed to run on a single server. Today’s application has a fixed degree of one. Degree supports multiplicity within a server
Type: unsigned integer Format: [number of attempts, >=1] Required: No Default: 1 Special Values: No |
ENABLED | New | The flag that governs the state of the resource as far as being managed by Oracle Clusterware, which will not attempt to manage a disabled resource whether directly or because of a dependency to another resource. However, stopping of the resource when requested by the administrator will be allowed
(so as to make it possible to disable a resource without having to stop it). Additionally, any change to the resource’s state performed by an ‘outside force’ will still be proxied into the clusterware. Type: unsigned integer Format: 1 | 0 Required: No Default: 1 Special Values: No Per-X Support: Yes |
START_DEPENDENCIES | New | Specifies a set of relationships that govern the start of the resource.
Type: string Required: No Default: Special Values: No |
STOP_DEPENDENCIES | New | Specifies a set of relationships that govern the stop of the resource.
Type: string Required: No Default: Special Values: No |
AGENT_FILENAME | New | An absolute filename (that is, inclusive of the path and file name) of the agent program that handles this type. Every resource type must have an agent program that handles its resources. Types can do so by either specifying the value for this attribute or inheriting it from their base type.
Type: string Required: Yes Special Values: Yes Per-X Support: Yes (per-server only) |
ACTION_SCRIPT | From 10gR2,
modified |
An absolute filename (that is, inclusive of the path and file name) of the action script file. This attribute is used in conjunction with the AGENT_FILENAME. CRSD will invoke the script in the manner it did in 10g for all entry points (operations) not implemented in the agent binary. That is, if the agent program implements a particular entry point, it is invoked; if it does not, the script specified in this attribute will be executed.
Please note that for backwards compatibility with previous releases, a built-in agent for the application type will be included with CRS. This agent is implemented to always invoke the script specified with this attribute. Type: string Required: No Default: Special Values: Yes Per-X Support: Yes (per-server only) |
ACL | New | Contains permission attributes. The value is populated at resource creation time based on the identity of the process creating the resource, unless explicitly overridden. The value can subsequently be changed using the APIs/command line utilities, provided that such a change is allowed based on the existing permissions of the resource.
Format:owner:<user>:rwx,pgrp:<group>:rwx,other::r— Where owner: the OS User of the resource owner, followed by the permissions that the owner has. Resource actions will be executed as with this user ID. pgrp: the OS Group that is the resource’s primary group, followed by the permissions that members of the group have other: followed by permissions that others have Type: string Required: No Special Values: No |
STATE_CHANGE_EVENT_TEM PLATE | New | The template for the State Change events. Type: string Required: No
Default: Special Values: No |
PROFILE_CHANGE_EVENT_TE MPLATE | New | The template for the Profile Change events. Type: string Required: No
Default: Special Values: No |
ACTION_FAILURE_EVENT_TE MPLATE | New | The template for the State Change events.
Type: string Required: No Default: Special Values: No |
LAST_SERVER | New | An internally managed, read-only attribute that contains the name of the server on which the last start action has succeeded.
Type: string Required: No, read-only Default: empty Special Values: No |
OFFLINE_CHECK_INTERVAL | New | Used for controlling off-line monitoring of a resource. The value represents the interval (in seconds) to use for implicitly monitoring the resource when it is OFFLINE. The monitoring is turned off if the value is 0
Type: unsigned integer Required: No Default: 0 Special Values: No Per-X Support: Yes |
STATE_DETAILS | New | An internally managed, read-only attribute that contains details about the state of the resource. The attribute fulfills the following needs:
1. CRSD understood resource states (Online, Offline, Intermediate, etc) may map to different resource-specific values (mounted, unmounted, open, closed, etc). In order to provide a better description of this mapping, resource agent developers may choose to provide a ‘state label’ as part of providing the value of the STATE. 2. Providing the label, unlike the value of the resource state, is optional. If not provided, the Policy Engine will use CRSD- understood state values (Online, Offline, etc). Additionally, in the event the agent is unable to provide the label (as may also happen to the value of STATE), the Policy Engine will set the value of this attribute to do it is best at providing the details as to why the resource is in the state it is (why it is Intermediate and/or why it is Unknown) Type: string Required: No, read-only Default: empty Special Values: No |
The local_resource type is the basic building block for resources that are instantiated for each server but are cluster oblivious and have a locally visible state. While the definition of the type is global to the clusterware, the exact property values of the resource instantiation on a particular server are stored on that server. This resource type has no equivalent in Oracle Clusterware 10gR2 and is a totally new concept to Oracle Clusterware.
The following table specifies the attributes that make up the local_resource type definition. To see all default values run the command crsctl stat type local_resource –p.
Name | Description |
ALIAS_NAME | Type: string Required: No Special Values: Yes Per-X Support: No |
LAST_SERVER | Overridden from resource: the name of the server to which the
resource is assigned (“pinned”).
|
Only Cluster Administrators will be allowed to register local resources.
The cluster_resource is the basic building block for resources that are cluster aware and have globally visible state. 11.1‘s application is a cluster_resource. The type’s base is resource. The type definition is read-only. The following table specifies the attributes that make up the cluster_resource type definition.
The following table specifies the attributes that make up the cluster_resource type definition. Run crsctl stat type cluster_resource –p to see all default values.
Name | History | Description |
ACTIVE_PLACEMENT | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No |
FAILOVER_DELAY | From 10gR2 | Unchanged, Deprecated Special Values: No |
FAILURE_INTERVAL | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
FAILURE_THRESHOLD | From 10gR2 | Unchanged
Type: unsigned integer Special Values: No Per-X Support: Yes |
PLACEMENT | From 10gR2 | Format: value
where value is one of the following: restricted Only servers that belong to the associated server pool(s) or hosting members may host instances of the resource. favored If only SERVER_POOLS or HOSTING_MEMBERS attribute is non-empty, servers belonging to the specified server pool(s)/hosting member list will be considered first if available; if/when none are available, any other server will be used. If both SERVER_POOLS and HOSTING_MEMBERS are populated, the former indicates preference while the latter – restricts the choices to the servers within that preference balanced Any ONLINE, enabled server may be used for placement. Less loaded servers will be preferred to more loaded ones. To measure how loaded a server is, clusterware will use the LOAD attribute of resources that are ONLINE on the server. The sum total of LOAD values is used as the absolute measure of the current server load. Type: string Default: balanced Special Values: No |
HOSTING_MEMBERS | From 10g | The meaning from this attribute is taken from the previous release.
Although not officially deprecated, the use of this attribute is discouraged. Special Values: No Required: @see SERVER_POOLS
|
SERVER_POOLS | New | Format:
* | [<pool name1> […]] This attribute creates an affinity between the resource and one or more server pools as far as placement goes. The meaning of this attribute depends on what the value of PLACEMENT is. When a resource should be able to run on any server of the cluster, a special value of * needs to be used. Note that only Cluster Administrators can specify * as the value for this attribute. Required: restricted PLACEMENT requires either SERVER_POOLS or HOSTING_MEMBERS favored PLACEMENT requires either SERVER_POOLS or HOSTING_MEMBERS but allows both. Balanced PLACEMENT does not require a value Type: string Default: * Special Values: No |
CARDINALITY | New | The count of the number of servers on which a resource wants to be running simultaneously. In other words, this is the ‘upper’ limit for resource cardinality. There’s currently no support for the ‘lower’ cardinality limit.
Please note CRS special values may be used for specifying values of this attribute. Type: string Format: max Required: No Default: 1 Special Values: Yes |
LOAD | New | A non-negative, numeric value designed to represent a quantitative measure of how much server capacity an instance of the resource consumes. The value of this parameter is interpreted in conjunction with that of the PLACEMENT attribute. For balanced placement policy, the value of this attribute place a role in determining where the resource is best placed. This value is an improvement to the original behavior of the balanced placement policy which assumed that the load of every resource is a constant and equal number (1).
Type: unsigned integer Format: non-negative number Required: No Default:1 Special Values: No Per-X Support: Yes |
With Oracle Clusterware 11.2 a new dependency concept is introduced, making it possible to build dependencies for start and stop actions independently and with much better granularity.
If resource A has a hard dependency on resource B, B must be ONLINE before A will be started. Please note there is no requirement that A and B be located on the same server.
A possible parameter to this dependency would allow resource B to be in either in ONLINE or INTERMEDIATE state. Such a variation is sometimes referred to as the intermediate dependency.
Another possible parameter to this dependency would make it possible to differentiate if A requires that B be present on the same server or on any server in the cluster. In other words, this illustrates that the presence of resource B on the same server as A is a must for resource A to start.
If the dependency is on a resource type, as opposed to a concrete resource, this should be interpreted as “any resource of the type”. The aforementioned modifiers for locality/state still apply accordingly.
If resource A has a weak dependency on resource B, an attempt to start A will attempt to start B if it is not ONLINE. The result of the attempt to start B is, however, of no consequence to the result of starting A (it is ignored). Additionally, if start of A causes an attempt to start B, failure to start A has no effect on B.
A possible parameter to this dependency is whether or not the start of A should wait for start of B to complete or may execute concurrently.
Another possible parameter to this dependency would make it possible to differentiate if A desires that B be running on the same server or on any server in the cluster. In other words, this illustrates that the presence of resource B on the same server as A is a desired for resource A to start. In addition to the desire to have the dependent resource started locally or on any server in the cluster, another possible parameter is to start the dependent resource on every server where it can run。
If the dependency is on a resource type, as opposed to a concrete resource, this should be interpreted as “every resource of the type”. The aforementioned modifiers for locality/state still apply accordingly.
If resource A attracts B, then whenever B needs to be started, servers that currently have A running will be first on the list of placement candidates. Since a resource may have more than one resource to which it is attracted, the number of attraction-exhibiting resources will govern the order of precedence as far as server placement goes.
If the dependency is on a resource type, as opposed to a concrete resource, this should be interpreted as “any resource of the type”.
A possible flavor of this relation is to require that a resource’s placement be re-evaluated when a related resource’s state changes. For example, resource A is attracted to B and C. At the time of starting A, A is started where B is. Resource C may either be running or started thereafter. Resource B is subsequently shut down/fails and does not restart. Then resource A requires that at this moment its placement be re-evaluated and it be moved to C. This is somewhat similar to the AUTOSTART attribute of the resource profile, with the dependent resource’s state change acting as a trigger as opposed to a server joining the cluster.
A possible parameter to this relation is whether or not resources in intermediate state should be counted as running thus exhibit attraction or not.
If resource A excludes resource B, this means that starting resource A on a server where B is running will be impossible. However, please see the dependency’s namesake for STOP to find out how B may be stopped/relocated so A may start.
If a resource A needs to be auto-started whenever resource B is started, this dependency is used. Note that the dependency will only affect A if it is not already running. As is the case for other dependency types, pull-up may cause the dependent resource to start on any or the same server, which is parameterized. Another possible parameter to this dependency would allow resource B to go to either in ONLINE or INTERMEDIATE state to trigger pull-up of A. Such a variation is sometimes referred to as the intermediate dependency. Note that if resource A has pull-up relation to resources B and C, then it will only be pulled up when both B and C are started. In other words, the meaning of resources mentioned in the pull-up specification is interpreted as a Boolean AND.
Another variation in this dependency is if the value of the TARGET of resource A plays a role: in some cases, a resource needs to be pulled-up irrespective of its TARGET while in others only if the value of TARGET is ONLINE. To accommodate both needs, the relation offers a modifier to let users specify if the value of the TARGET is irrelevant; by default, pull-up will only start resources if their TARGET is ONLINE. Note that this modifier is on the relation, not on any of the targets as it applies to the entire relation.
If the dependency is on a resource type, as opposed to a concrete resource, this should be interpreted as “any resource of the type”. The aforementioned modifiers for locality/state still apply accordingly.
The property between two resources that desire to avoid being co-located, if there’s no alternative other than one of them being stopped, is described by the use of the dispersion relation. In other words, if resource A prefers to run on a different server than the one occupied by resource B, then resource A is said to have a dispersion relation to resource B at start time. This sort of relation between resources has an advisory effect, much like that of attraction: it is not binding as the two resources may still end up on the same server.
A special variation on this relation is whether or not crsd is allowed/expected to disperse resources, once it is possible, that are already running. In other words, normally, crsd will not disperse co-located resources when, for example, a new server becomes online: it will not actively relocate resources once they are running, only disperse them when starting them. However, if the dispersion is ‘active’, then crsd will try to relocate one of the resources that disperse to the newly available server.
A possible parameter to this relation is whether or not resources in intermediate state should be counted as running thus exhibit attraction or not.
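把上面介绍的几种关系写进资源 profile 中,大致如下面的示意(资源名 appB、appsvip 均为假设值;真实语法可参考本文后面 user VIP 的例子):
START_DEPENDENCIES=hard(appsvip) weak(type:ora.listener.type) pullup(appsvip) dispersion(appB)
STOP_DEPENDENCIES=hard(appsvip)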
在 11.2 中,围绕 CRSD 组织了大量事件,RLB 事件则来源于数据库。如果 eONS 没有运行,ReporterModule 会尝试缓存事件,直到 eONS 启动。事件的发送和接收顺序与动作发生的顺序保持一致。
每个节点都在 crsd 的 oraagent 进程里运行一个数据库代理、一个 ONS 代理和一个 eONS 代理。这些代理负责停止/启动/检查操作。每个代理并不使用专用线程,而是用一个线程池来执行多种资源的操作。
在 oraagent 日志里,可以通过字符串 "Thread:[EonsSub ONS]"、"Thread:[EonsSub EONS]" 和 "Thread:[EonsSub FAN]" 辨识出 eONS subscriber 线程。在下面的例子中,一个服务被停止,该节点 crsd oraagent 进程里的三个 eONS subscriber 线程都会收到这个事件:
2009-05-26 23:36:40.479: [AGENTUSR][2868419488][UNKNOWN] Thread:[EonsSub FAN]
process {
2009-05-26 23:36:40.500: [AGENTUSR][2868419488][UNKNOWN] Thread:[EonsSub FAN]
process }
2009-05-26 23:36:40.540: [AGENTUSR][2934963104][UNKNOWN] Thread:[EonsSub ONS]
process }
2009-05-26 23:36:40.558: [AGENTUSR][2934963104][UNKNOWN] Thread:[EonsSub ONS]
process {
2009-05-26 23:36:40.563: [AGENTUSR][2924329888][UNKNOWN] Thread:[EonsSub EONS]
process {
2009-05-26 23:36:40.564: [AGENTUSR][2924329888][UNKNOWN] Thread:[EonsSub EONS]
process }
On one node of the cluster, the eONS subscriber of the following agents also assumes the role of a publisher or processor or master (pick your favorite terminology):
The publishers/processors can be identified by searching for “got lock”:
staiu01/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 19:51:41.549: [AGENTUSR][2934959008][UNKNOWN] CssLock::tryLock, got lock CLSN.ONS.ONSPROC
staiu02/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 19:51:41.626: [AGENTUSR][3992972192][UNKNOWN] CssLock::tryLock, got lock CLSN.ONS.ONSNETPROC
staiu03/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 20:00:21.214: [AGENTUSR][2856319904][UNKNOWN] CssLock::tryLock, got lock
CLSN.RLB.pommi
staiu02/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 20:00:27.108: [AGENTUSR][3926576032][UNKNOWN] CssLock::tryLock, got lock CLSN.FAN.pommi.FANPROC
These CSS-based locks work in such a way that any node can grab the lock if it is not already held. If the process of the lock holder goes away, or CSS thinks the node went away, the lock is released and someone else tries to get the lock. The different processors try to grab the lock whenever they see an event. If a processor previously was holding the lock, it doesn’t have to acquire it again. There is currently no implementation of a “backup” or designated failover-publisher.
In a cluster of 2 or more nodes, one onsagent’s eONS subscriber will also assume the role of CLSN.ONS.ONSNETPROC, i.e. is responsible for just publishing network down events. The publishers with the roles of CLSN.ONS.ONSPROC and CLSN.ONS.ONSNETPROC cannot and will not run on the same node, i.e. they must run on distinct nodes.
If both the CLSN.ONS.ONSPROC and CLSN.ONS.ONSNETPROC simultaneously get their public network interface pulled down, there may not be any event.
Another additional thread tied to the dbagent thread in the oraagent process of only one node in the cluster, is ” Thread:[RLB:dbname]”, and it dequeues the LBA/RLB/affinity event from the SYS$SERVICE_METRICS queue, and publishes the event to eONS clients. It assumes the lock role of CLSN.RLB.dbname. The CLSN.RLB.dbname publisher can run on any node, and is not related to the location of the MMON master (who enqueues LBA events into the SYS$SERVICE_METRICS queue. So since the RLB publisher (RLB.dbname) can run on a different node than the ONS publisher (ONSPROC), RLB events can be dequeued on one node, and published to ONS on another node. There is one RLB publisher per database in the cluster
Sample trace, where Node 3 is the RLB publisher, and Node 2 has the ONSPROC role:
– Node 3:
2009-05-28 19:29:10.754: [AGENTUSR][2857368480][UNKNOWN]
Thread:[RLB:pommi] publishing message srvname = rlb
2009-05-28 19:29:10.754: [AGENTUSR][2857368480][UNKNOWN]
Thread:[RLB:pommi] publishing message payload = VERSION=1.0 database=pommi service=rlb { {instance=pommi_3 percent=25 flag=UNKNOWN aff=FALSE}{instance=pommi_4 percent=25 flag=UNKNOWN aff=FALSE}{instance=pommi_2 percent=25 flag=UNKNOWN aff=FALSE}{instance=pommi_1 percent=25 flag=UNKNOWN aff=FALSE} } timestamp=2009-05-28 19:29:10
The RLB events will be received by the eONS subscriber of the ONS publisher (ONSPROC) who then posts the event to ONS:
– Node 2:
2009-05-28 19:29:40.773: [AGENTUSR][3992976288][UNKNOWN] Publishing the
ONS event type database/event/servicemetrics/rlb
The above description is only valid for 11.2.0.1. In 11.2.0.2, the eONS proxy a.k.a eONS server will be removed, and its functionality will be assumed by evmd. In addition, the tracing as described above, will change significantly. The major reason for this change was the high resource usage of the eONS JVM.
In order to find the publishers in the oraagent.log in 11.2.0.2, search for these patterns:
“ONS.ONSNETPROC CssLockMM::tryMaster I am the master” “ONS.ONSPROC CssLockMM::tryMaster I am the master” “FAN.<dbname> CssLockMM::tryMaster I am the master” “RLB.<dbname> CssSemMM::tryMaster I am the master”
Oracle 不建议为集群件和 RAC 配置各自独立的私有接口。如果配置了多个私有接口,我们建议把它们绑定(bond)成一个接口,以便在网卡故障时提供冗余;不绑定的话,多个私有接口只能提供负载均衡,不能提供故障转移。
改变接口名字的后果取决于你改变的是哪个接口的名字,以及是否同时改变了 IP 地址。如果只是改变接口名字,影响较小;如果改变的是存储在 OCR 中的公共接口名字,那么必须在每个节点上修改相应的应用,因此需要停掉节点上的应用来进行修改。
可以用 oifcfg delif / setif 修改集群的公共或私有网络互连定义(如下面的示意);对私有互连的修改在集群件重启后生效。
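下面是用 oifcfg 查看和修改私有互连定义的示意(接口名 eth2 和子网为假设值):
$ oifcfg getif
$ oifcfg delif -global eth2
$ oifcfg setif -global eth2/10.137.20.0:cluster_interconnect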
Oracle RAC 的网络互连必须使用与集群件相同的接口。不要把私有互连配置在集群件中没有定义的其他接口上。
参见: Oracle Clusterware Administration and Deployment Guide, “Changing Network Addresses on Manually Configured Networks” for more information.
misscount 的值很重要,Oracle 不支持修改 misscount 的默认值。可以通过下面的命令获取 misscount 的值:
# crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
使用第三方集群软件时,misscount 的默认值是 600,这是为了给第三方集群软件更多时间来做节点加入/删除的决定。不要修改使用第三方集群软件时的 misscount 默认设置。
当集群成功安装或者某个节点启动之后,就可以检查整个集群或单个节点的健康状况了。
'crsctl check has' 检查本地节点的 OHASD 是否已经启动并且健康运行。
# crsctl check has
CRS-4638: Oracle High Availability Services is online
'crsctl check crs' 检查 OHASD、CRSD、OCSSD 和 EVM 守护进程。
# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
'crsctl check cluster -all' 将检查集群里所有节点上的所有守护进程
# crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
在用 crsctl start cluster 命令启动集群时,注意监控输出:所有资源的启动都应该成功;如果有资源启动失败,到相应的日志里查找错误信息。
# crsctl start cluster
CRS-2672: Attempting to start ‘ora.cssdmonitor’ on ‘node1’
CRS-2676: Start of ‘ora.cssdmonitor’ on ‘node1’ succeeded
CRS-2672: Attempting to start ‘ora.cssd’ on ‘node1’
CRS-2672: Attempting to start ‘ora.diskmon’ on ‘node1’
CRS-2676: Start of ‘ora.diskmon’ on ‘node1’ succeeded
CRS-2676: Start of ‘ora.cssd’ on ‘node1’ succeeded
CRS-2672: Attempting to start ‘ora.ctssd’ on ‘node1’
CRS-2676: Start of ‘ora.ctssd’ on ‘node1’ succeeded
CRS-2672: Attempting to start ‘ora.evmd’ on ‘node1’
CRS-2672: Attempting to start ‘ora.asm’ on ‘node1’
CRS-2676: Start of ‘ora.evmd’ on ‘node1’ succeeded
CRS-2676: Start of ‘ora.asm’ on ‘node1’ succeeded
CRS-2672: Attempting to start ‘ora.crsd’ on ‘node1’
CRS-2676: Start of ‘ora.crsd’ on ‘node1’ succeeded
Oracle 集群件管理工具(CRSCTL)提供的命令可以管理集群框架下的所有实体,包括集群的守护进程,以及集群所有节点上的钱夹管理等。
你可以用 CRSCTL 命令在集群上执行一些操作,比如:
几乎所有这些操作都是集群范围的。
参见:Oracle Clusterware Administration and Deployment Guide, “CRSCTL Utility Reference” for more information about using crsctl.
可以在 root 用户下用 crsctl set log 命令动态开启对 CRS、CSS、EVM 以及集群各子组件的调试,也可以用 crsctl debug 命令动态修改调试级别。调试设置保存在 OCR 中,供下次启动时使用。你也可以开启针对资源的调试。
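例如下面的示意命令(组件名与级别仅作演示,格式沿用本文前面 COMMCRS 例子的写法),需以 root 用户执行:
# crsctl set log crs "CRSRTI:1,CRSCOMM:2"
# crsctl get log css CSSD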
调试性能和选项的完整列表在“Oracle Clusterware Administration and Deployment Guide”的“Troubleshooting and Diagnostic Output”章节里有列出。
Oracle 集群件用统一的日志目录结构来集中各组件的日志文件。这种统一结构简化了诊断信息的收集,并为问题分析提供帮助。
Oracle 集群件的日志文件采用循环覆盖的方式。如果在当前日志文件里找不到某条告警的细节信息,那么相关内容可能已经被滚动到一个归档版本中,这类文件名通常以 *.l<number> 结尾,编号从 01 开始,随着日志增多而递增;不同组件的日志对应不同的日志文件。一般不需要查看这些归档文件,除非 Oracle 支持提出要求。不过日志保留策略会根据生成日志的数量按需清除较旧的归档。
GRID_HOME/log/<host>/diskmon – Disk Monitor Daemon
GRID_HOME/log/<host>/client – OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL – edit the
GRID_HOME/srvm/admin/ocrlog.ini file to increase the trace level
GRID_HOME/log/<host>/admin – not used
GRID_HOME/log/<host>/ctssd – Cluster Time Synchronization Service
GRID_HOME/log/<host>/gipcd – Grid Interprocess Communication Daemon
GRID_HOME/log/<host>/ohasd – Oracle High Availability Services Daemon
GRID_HOME/log/<host>/crsd – Cluster Ready Services Daemon
GRID_HOME/log/<host>/gpnpd – Grid Plug and Play Daemon
GRID_HOME/log/<host>/mdnsd – Mulitcast Domain Name Service Daemon
GRID_HOME/log/<host>/evmd – Event Manager Daemon
GRID_HOME/log/<host>/racg/racgmain – RAC RACG
GRID_HOME/log/<host>/racg/racgeut – RAC RACG
GRID_HOME/log/<host>/racg/racgevtf – RAC RACG
GRID_HOME/log/<host>/racg – RAC RACG (only used if pre-11.1 database is installed)
GRID_HOME/log/<host>/cssd – Cluster Synchronization Service Daemon
GRID_HOME/log/<host>/srvm – Server Manager
GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11 – HA Service Daemon Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root – HA Service Daemon CSS Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdmonitor_root – HA Service Daemon ocssdMonitor Agent
GRID_HOME/log/<host>/agent/ohasd/orarootagent_root – HA Service Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/oraagent_oracle11 – CRS Daemon Oracle Agent
GRID_HOME/log/<host>/agent/crsd/orarootagent_root – CRS Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/ora_oc4j_type_oracle11 – CRS Daemon OC4J Agent (11.2.0.2 feature and not used in 11.2.0.1)
GRID_HOME/log/<host>/gnsd – Grid Naming Services Daemon
获取某个事件所有相关跟踪文件的最好方法是使用 Grid_home/bin/diagcollection.pl。以 root 用户在所有节点上运行 "diagcollection.pl --collect --crshome <GRID_HOME>",即可收集所有 trace 并执行一次 OCRDUMP。
# Grid_home/bin/diagcollection.pl
Production Copyright 2004, 2008, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection
–collect
[–crs] For collecting crs diag information
[–adr] For collecting diag information for ADR
[–ipd] For collecting IPD-OS data
[–all] Default.For collecting all diag information.
[–core] UNIX only. Package core files with CRS data
[–afterdate] UNIX only. Collects archives from the specified date. Specify in mm/dd/yyyy format
[–aftertime] Supported with -adr option. Collects archives after the specified time. Specify in YYYYMMDDHHMISS24 format
[–beforetime] Supported with -adr option. Collects archives before the specified date. Specify in YYYYMMDDHHMISS24 format
[–crshome] Argument that specifies the CRS Home location
[–incidenttime] Collects IPD data from the specified time.Specify in MM/DD/YYYY24HH:MM:SS format If not specified, IPD data generated in the past 2 hours are collected
[–incidentduration] Collects IPD data for the duration after the specified time. Specify in HH:MM format.If not specified, all IPD data after incidenttime are collected
NOTE:
./diagcollection.pl –collect –crs –crshome <CRS Home>
–clean cleans up the diagnosability information gathered by this script
–coreanalyze UNIX only. Extracts information from core files and stores it in a text file
更多的关于收集IPD的信息看6.4章节
如果是安装的供应商的集群软件,就需要给Oracle支持提供更多的关于集群的文件。
从 11.2 开始,有些集群件消息里包含用 "(:" 和 ":)" 括起来的文本。通常情况下(和下面的例子类似),这个标识符出现在以 "Details in…" 开头、包含日志文件路径的消息中。这个标识符叫做 DRUID,即诊断记录唯一 ID:
2009-07-16 00:18:44.472
[/scratch/11.2/grid/bin/orarootagent.bin(13098)]CRS-5822:Agent
‘/scratch/11.2/grid/bin/orarootagent_root’ disconnected from server.
Details at (:CRSAGF00117:) in
/scratch/11.2/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log.
DRUID 用来把外部产品的信息与集群件内部诊断日志文件关联起来。它对用户诊断问题没有直接帮助,主要是提供给 Oracle 支持人员使用。
一些基于 Java 的 GUI 工具在遇到问题时,可以设置以下环境变量来开启跟踪:
“setenv SRVM_TRACE true” (or “export SRVM_TRACE=true”)
“setenv SRVM_TRACE_LEVEL 2” (or “export SRVM_TRACE_LEVEL=2″)
OUI 安装出错时可以使用 -debug 选项运行(如安装时执行 "./runInstaller -debug")。
集群件在某些情况下会重启一个节点,以确保整个集群上的数据库和其他应用健康运行。当决定重启问题节点时,普通的活动日志(比如集群件的 alert 日志)就变得不可靠了:重启往往发生在操作系统把缓存的日志刷写到磁盘之前,这意味着导致重启的原因信息可能会丢失。
11.2 集群件引入了一个叫 Reboot Advisory 的新特性,用来更好地保留集群件发起重启的原因说明。当集群件决定重启节点时,会生成一条简短的解释性消息,并尝试通过下面两种途径发布:
重启决策信息会被写到一个小文件里(通常位于本地连接的存储上),写入时不经过 I/O 缓存。这个文件是在故障(导致集群重启)发生之前就预先创建并格式化好的,因此即使在发生故障的系统上,这个 I/O 也有非常高的成功率。同时,重启决策信息还会通过可用的网络接口广播出去。
这些操作是并行进行的,并且有时间限制,因此不会推迟重启。由于会尝试多个磁盘和网络来发布信息,通常至少有一个会成功,往往是全部成功。成功存储或发送的 Reboot Advisory 信息,最终会出现在集群中一个或多个节点的 alert 日志里。
当 Reboot Advisory 信息通过网络广播成功后,集群中其他节点的告警日志里就会立即出现相关信息,因此马上就能看到并确定产生重启的原因。这些消息包含将要重启节点的主机名,以便与集群里的其他节点区分;只有同一个集群里的节点会显示这些信息。
如果 Reboot Advisory 成功地把信息写到了磁盘文件里,那么在该节点下次启动集群件时,相关信息会出现在告警日志的前部。
Reboot Advisory 带有时间戳,3 天内的启动都会扫描这些文件。扫描不会清空文件,也不会把它标记为已发布,因此如果 3 天内在一个节点上多次重启,同一条 Reboot Advisory 可能会在告警日志里多次出现。
无论通过哪种途径发布,Reboot Advisory 在告警日志里的格式相同,一般由两部分组成。第一部分是 CRS-8011,显示重启节点的主机名和时间戳(重启的大约时间点),例如:
[ohasd(24687)]CRS-8011:reboot advisory message from host: sta00129, component: CSSMON, with timestamp: L-2009-05-05-10:03:25.340
紧跟在 CRS-8011 之后的是 CRS-8013,给出了导致强制重启的具体信息,例如:
[ohasd(24687)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from ocssd at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
请注意,CRS-8013 中 "text" 后面的内容由发起重启的集群件组件直接提供,其中可能包含重要的关键信息。这些文本不是来自 Oracle 的 NLS 消息文件,通常是英语和 US ASCII7 字符集。
在某些情况下,Reboot Advisory 还会在文本信息之外附带二进制诊断数据,此时会出现 CRS-8014 和一条或多条 CRS-8015 消息。这些二进制数据只有在把重启问题报告给 Oracle 支持时才有用。
不同的组件可能在同一时间往告警日志里写数据,因此 Reboot Advisory 的消息可能会出现在其他信息中间。不过不同 Reboot Advisory 的消息不会交叉在一起:一个 Reboot Advisory 产生的所有消息都会出现在另一个 Reboot Advisory 的消息之前。
更多信息可以参照 Oracle Errors manual 中关于消息 CRS-8011 和 CRS-8013 的讨论。
ocrpatch 开发于 2005 年,是为了在官方工具(如 ocrconfig 或 crsctl)无法处理某些变更时,给开发和支持人员提供一个修复错误和修改 OCR 的工具。ocrpatch 不随集群件版本一起发布。ocrpatch 的功能在单独的文档里有描述,这里不再深入细节;该文档位于 stcontent 的 public RAC Performance Group Folder 里。
介绍
vdpatch 是一个适用于 11.2 集群件的新的 Oracle 内部工具。vdpatch 和 ocrpatch 共享很多代码,外观和用法也很相似。这个工具的目的是便于诊断 CSS 与投票文件相关的问题。vdpatch 基于块进行操作,例如,它可以按块号或块名从投票文件读取(而不是写入)512 字节的块。
一般用法
vdpatch 只能由 root 用户运行,其他用户会收到报错:
$ vdpatch
VD Patch Tool Version 11.2 (20090724) Oracle Clusterware Release 11.2.0.2.0
Copyright (c) 2008, 2009, Oracle. All rights reserved. [FATAL] not privileged
[OK] Exiting due to fatal error …
投票文件的名字和路径可以通过 'crsctl query css votedisk' 命令获取。这个命令只能在 OCSSD 运行时执行;如果 OCSSD 没有启动,crsctl 会报错:
# crsctl query css votedisk
Unable to communicate with the Cluster Synchronization Services daemon.
如果OCSSD是运行的,你能收到下面的输出:
$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
— —– —————– ——— ———
Located 3 voting file(s).
上面的输出表明在磁盘组 +VDDG 上定义了三个投票文件,每个投票文件位于属于该 ASM 磁盘组的一个特定裸设备上。vdpatch 每次只能查看一个设备的内容:
# vdpatch
VD Patch Tool Version 11.2 (20090724)
Oracle Clusterware Release 11.2.0.2.0
Copyright (c) 2008, 2009, Oracle. All rights reserved.
vdpatch> op /dev/raw/raw100
[OK] Opened /dev/raw/raw100, type: ASM
如果投票文件在裸设备上,crsctl和vdpatch可以显示:
# vdpatch
VD Patch Tool Version 11.2 (20090724) Oracle Clusterware Release 11.2.0.2.0
Copyright (c) 2008, 2009, Oracle. All rights reserved. vdpatch> op /dev/raw/raw126
[OK] Opened /dev/raw/raw126, type: Raw/FS
要打开其他投票文件,简单的再运行’op’:
vdpatch> op /dev/raw/raw126
[OK] Opened /dev/raw/raw126, type: Raw/FS
vdpatch> op /dev/raw/raw130
[INFO] closing voting file /dev/raw/raw126
[OK] Opened /dev/raw/raw130, type: Raw/FS
用’h’命令,可以列出所有的可用命令:
vdpatch> h
Usage: vdpatch
BLOCK operations
op <path to voting file>          open voting file
rb <block#>                       read block by block#
rb status|kill|lease <index>      read named block
    index=[0..n] => Devenv nodes 1..(n-1)
    index=[1..n] => shiphome nodes 1..n
rb toc|info|op|ccin|pcin|limbo    read named block
du                                dump native block from offset
di                                display interpreted block
of <offset>                       set offset in block, range 0-511
MISC operations
i                                 show parameters, version, info
h                                 this help screen
exit / quit                       exit vdpatch
投票文件的块可以按块号或块类型名读取。TOC、INFO、OP、CCIN、PCIN 和 LIMBO 这些类型的块在投票文件里各只有一个,因此读取时只需给出类型名,例如执行 'rb toc';输出是该 512 字节块的十六进制/ASCII 转储,以及对块内容的解释:
vdpatch> rb toc [OK] Read block 4
[INFO] clssnmvtoc block
0 73734C63 6B636F54 01040000 00020000 00000000 ssLckcoT…………
20 00000000 40A00000 00020000 00000000 10000000 ….@……………
40 05000000 10000000 00020000 10020000 00020000 ………………..
…
…
420 00000000 00000000 00000000 00000000 00000000 ………………..
440 00000000 00000000 00000000 00000000 00000000 ………………..
460 00000000 00000000 00000000 00000000 00000000 ………………..
480 00000000 00000000 00000000 00000000 00000000 ………………..
500 00000000 00000000 00000000 …………
[OK] Displayed block 4 at offset 0, length 512 [INFO] clssnmvtoc block
magic1_clssnmvtoc: 0x634c7373 – 1665954675
magic2_clssnmvtoc: 0x546f636b – 1416586091
fmtvmaj_clssnmvtoc: 0x01 – 1
fmtvmin_clssnmvtoc: 0x04 – 4
resrvd_clssnmvtoc: 0x0000 – 0
maxnodes_clssnmvtoc: 0x00000200 – 512
incarn1_clssnmvtoc: 0x00000000 – 0
incarn2_clssnmvtoc: 0x00000000 – 0
filesz_clssnmvtoc: 0x0000a040 – 41024
blocksz_clssnmvtoc: 0x00000200 – 512
hdroff_clssnmvtoc: 0x00000000 – 0
hdrsz_clssnmvtoc: 0x00000010 – 16
opoff_clssnmvtoc: 0x00000005 – 5
statusoff_clssnmvtoc: 0x00000010 – 16
statussz_clssnmvtoc: 0x00000200 – 512
killoff_clssnmvtoc: 0x00000210 – 528
killsz_clssnmvtoc: 0x00000200 – 512
leaseoff_clssnmvtoc: 0x0410 – 1040
leasesz_clssnmvtoc: 0x0200 – 512
ccinoff_clssnmvtoc: 0x0006 – 6
pcinoff_clssnmvtoc: 0x0008 – 8
limbooff_clssnmvtoc: 0x000a – 10
volinfooff_clssnmvtoc: 0x0003 – 3
对于 STATUS、KILL 和 LEASE 这几种块类型,每个集群节点各有一个块,因此 'rb' 命令必须带上一个表示节点的索引。在开发环境下索引从 0 开始,在正式发布(shiphome)环境下索引从 1 开始。因此要读开发环境下第五个节点的 KILL 块,执行 'rb kill 4';在正式环境下则执行 'rb kill 5'。
在开发环境下读第三个节点的STATUS块:
vdpatch> rb status 2 [OK] Read block 18
[INFO] clssnmdsknodei vote block
0 65746F56 02000000 01040B02 00000000 73746169 etoV…………stai
20 75303300 00000000 00000000 00000000 00000000 u03……………..
40 00000000 00000000 00000000 00000000 00000000 ………………..
60 00000000 00000000 00000000 00000000 00000000 ………………..
80 00000000 3EC40609 8A340200 03000000 03030303 ….> 4……….
100 00000000 00000000 00000000 00000000 00000000 ………………..
120 00000000 00000000 00000000 00000000 00000000 ………………..
140 00000000 00000000 00000000 00000000 00000000 ………………..
160 00000000 00000000 00000000 00000000 00000000 ………………..
180 00000000 00000000 00000000 00000000 00000000 ………………..
200 00000000 00000000 00000000 00000000 00000000 ………………..
220 00000000 00000000 00000000 00000000 00000000 ………………..
240 00000000 00000000 00000000 00000000 00000000 ………………..
260 00000000 00000000 00000000 00000000 00000000 ………………..
280 00000000 00000000 00000000 00000000 00000000 ………………..
300 00000000 00000000 00000000 00000000 00000000 ………………..
320 00000000 00000000 00000000 00000000 00000000 ………………..
340 00000000 00000000 00000000 8E53DF4A ACE84A91 ………….S.J..J.
360 E4350200 00000000 03000000 441DDD4A 6051DF4A .5……….D..J`Q.J
380 00000000 00000000 00000000 00000000 00000000 ………………..
400 00000000 00000000 00000000 00000000 00000000 ………………..
420 00000000 00000000 00000000 00000000 00000000 ………………..
440 00000000 00000000 00000000 00000000 00000000 ………………..
460 00000000 00000000 00000000 00000000 00000000 ………………..
480 00000000 00000000 00000000 00000000 00000000 ………………..
500 00000000 00000000 00000000 …………
[OK] Displayed block 18 at offset 0, length 512
[INFO] clssnmdsknodei vote block
magic_clssnmdsknodei: 0x566f7465 – 1450144869
nodeNum_clssnmdsknodei: 0x00000002 – 2
fmtvmaj_clssnmdsknodei: 0x01 – 1
fmtvmin_clssnmdsknodei: 0x04 – 4
prodvmaj_clssnmdsknodei: 0x0b – 11
prodvmin_clssnmdsknodei: 0x02 – 2
killtime_clssnmdsknodei: 0x00000000 – 0
nodeName_clssnmdsknodei: staiu03
inSync_clssnmdsknodei: 0x00000000 – 0
reconfigGen_clssnmdsknodei: 0x0906c43e – 151438398
dskWrtCnt_clssnmdsknodei: 0x0002348a – 144522
nodeStatus_clssnmdsknodei: 0x00000003 – 3
nodeState_clssnmdsknodei[CLSSGC_MAX_NODES]:
node 0: 0x03 – 3 – MEMBER
node 1: 0x03 – 3 – MEMBER
node 2: 0x03 – 3 – MEMBER
node 3: 0x03 – 3 – MEMBER
timing_clssnmdsknodei.sts_clssnmTimingStmp: 0x4adf538e – 1256149902 – Wed Oct 21 11:31:42 2009
timing_clssnmdsknodei.stms_clssnmTimingStmp: 0x914ae8ac – 2437605548
timing_clssnmdsknodei.stc_clssnmTimingStmp: 0x000235e4 – 144868
timing_clssnmdsknodei.stsi_clssnmTimingStmp: 0x00000000 – 0
timing_clssnmdsknodei.flags_clssnmdsknodei: 0x00000003 – 3
unique_clssnmdsknodei.eptime_clssnmunique: 0x4add1d44 – 1256004932 – Mon Oct 19 19:15:32 2009
ccinid_clssnmdsknodei.cin_clssnmcinid: 0x4adf5160 – 1256149344 – Wed Oct 21 11:22:24 2009
ccinid_clssnmdsknodei.unique_clssnmcinid: 0x00000000 – 0
pcinid_clssnmdsknodei.cin_clssnmcinid: 0x00000000 – 0 – Wed Dec 31 16:00:00 1969
pcinid_clssnmdsknodei.unique_clssnmcinid: 0x00000000 – 0
目前没有计划让 vdpatch 支持修改投票文件。删除和重建投票文件请使用 crsctl 命令。
在 11.2 中,你可以通过 Grid_home/bin/appvipcfg 创建和删除应用 VIP(user VIP):
Production Copyright 2007, 2008, Oracle. All rights reserved
Usage:
appvipcfg create -network=<network_number>
-ip=<ip_address>
-vipname=<vipname>
-user=<user_name>[-group=<group_name>]
delete -vipname=<vipname>
appvipcfg 命令行工具只能在默认网络(默认创建的资源 ora.net1.network)上创建应用 VIP。如果要在不同的网络或子网上创建应用 VIP,必须手工配置。
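例如,在默认网络上创建并启动一个应用 VIP(IP、VIP 名称和用户名均为假设值),需以 root 用户执行:
# appvipcfg create -network=1 -ip=10.137.11.160 -vipname=appsvip0 -user=oracle11
# crsctl start resource appsvip0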
例如创建一个uservip在一个不同的网络上(ora.net2.network)。
srvctl add vip -n node1 -k 2 -A appsvip1/255.255.252.0/eth2
crsctl add type coldfailover.vip.type -basetype ora.cluster_vip_net2.type
crsctl add resource coldfailover.vip -type coldfailover.vip.type -attr \
"DESCRIPTION=USRVIP_resource, RESTART_ATTEMPTS=0, START_TIMEOUT=0, STOP_TIMEOUT=0, \
CHECK_INTERVAL=10, USR_ORA_VIP=10.137.11.163, \
START_DEPENDENCIES=hard(ora.net2.network)pullup(ora.net2.network), \
STOP_DEPENDENCIES=hard(ora.net2.network), \
ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:oracle11:r-x'"
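Once registered, the VIP resource can be started and verified with standard crsctl commands. A minimal sketch (the node names node1/node2 are hypothetical; run as root, since the resource is owned by root):
# crsctl start resource coldfailover.vip -n node1
# crsctl status resource coldfailover.vip
# crsctl relocate resource coldfailover.vip -n node2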
There are a few known bugs in this area:
– 8623900 srvctl remove vip -i <ora.vipname> is removing the associated ora.netx.network
– 8620119 appvipcfg should be expanded to create a network resource
– 8632344 srvctl modify nodeapps -a will modify the vip even if the interface is not valid
– 8703112 appsvip should have the same behavior as ora.vip like vip failback
– 8758455 uservip start failed and orarootagent core dump in clsn_agent::agentassert
– 8761666 appsvipcfg should respect /etc/hosts entry for apps ip even if gns is configured
– 8820801 using a second network (k 2) I’m able to add and start the same ip twice
Application (script) agents manage application resources through user-specific code. Oracle Clusterware includes a special shared library that allows users to plug in their own code for well-defined operations through a defined interface.
The following sections describe how to create an agent using the Oracle Clusterware agent framework interface.
The action entry points are the hooks into the user-defined code that is executed when an action is performed on a resource. For every resource type, Clusterware requires that entry points be defined for the following actions:
start : actions to be taken to start the resource
stop : actions to gracefully stop the resource
check : actions taken to check the status of the resource
clean : actions to forcefully stop the resource
These action entry points can be defined either in C++ code or in a script. If the actions are not explicitly defined, Clusterware assumes that they are defined in the script whose location is given by the ACTION_SCRIPT attribute. It is therefore possible to build a hybrid agent, with some entry points implemented in a script and others in C++.
Consider a file on disk that Clusterware is to manage as a resource. An agent that manages such resources has the following tasks (a minimal action-script sketch follows this list):
On startup: create the file.
On shutdown: gracefully delete the file.
On the check command: detect whether the file is present.
On the clean command: forcefully delete the file.
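A minimal action-script sketch for such a file agent could look like the following. This is an illustrative outline only, not the shipped demo script, and for simplicity the monitored file name is hard-coded; a real agent would take it from the PATH_NAME resource attribute:
#!/bin/sh
# Illustrative file agent: start/stop/check/clean entry points for a single file
FILE=/tmp/r1.txt   # a real agent would derive this from the PATH_NAME attribute
case "$1" in
start) touch "$FILE" ;;                        # create the file
stop)  rm -f "$FILE" ;;                        # gracefully delete the file
check) [ -f "$FILE" ] && exit 0 || exit 1 ;;   # exit 0 = ONLINE, non-zero = OFFLINE
clean) rm -f "$FILE" ;;                        # forcefully delete the file
esac
exit 0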
To describe this special resource to Oracle Clusterware, first create a dedicated resource type that contains all characteristic attributes of this resource class. In this case the only special attribute to describe is the name of the file to be monitored. This can be done with the crsctl command. When defining the resource type, the ACTION_SCRIPT and AGENT_FILENAME attributes can also be specified; they point to the shell script and to the executable that contain the action entry points used by the agent.
Once the resource type is defined, there are several options for writing the dedicated agent that performs the required tasks: the agent can be written as a script, as a C/C++ program, or as a hybrid of the two.
The demo action script provided under Grid_home/crs/demo is a shell script agent that already contains all action entry points for this file resource. To test the script, perform the following steps:
(1) Start up a Clusterware installation.
(2) Add a new resource type using the crsctl utility:
$ crsctl add type test_type1 -basetype cluster_resource -attr \
"ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" -attr \
"ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/to/demoActionScript"
Modify the file paths as appropriate. This adds a new resource type to Clusterware. The attributes can also be placed in a text file that is passed as a parameter to the crsctl utility (a sketch of this variant follows).
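A sketch of that text-file variant (the file name /tmp/test_type1.txt is a placeholder, and the -file option and attribute-file format are assumed from the crsctl add type syntax):
$ cat /tmp/test_type1.txt
ATTRIBUTE=PATH_NAME
TYPE=string
DEFAULT_VALUE=default.txt
ATTRIBUTE=ACTION_SCRIPT
TYPE=string
DEFAULT_VALUE=/path/to/demoActionScript
$ crsctl add type test_type1 -basetype cluster_resource -file /tmp/test_type1.txt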
(3) Add new resources to the cluster using the crsctl utility, as follows:
$ crsctl add resource r1 -type test_type1 -attr "PATH_NAME=/tmp/r1.txt"
$ crsctl add resource r2 -type test_type1 -attr "PATH_NAME=/tmp/r2.txt"
Specifying the path name for each resource is mandatory.
(4) Start and stop the resources using the crsctl utility (a brief verification sketch follows these commands):
$ crsctl start res r1
$ crsctl start res r2
$ crsctl check res r1
$ crsctl stop res r2
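To verify that the agent really performs the file operations, the resource state and the monitored file can be checked side by side; a brief sketch:
$ crsctl status resource r1
$ ls -l /tmp/r1.txt     # should exist while r1 is ONLINE
$ crsctl stop res r1
$ ls -l /tmp/r1.txt     # should be gone after the stop action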
Oracle provides demoagent1.cpp in the Grid_home/crs/demo directory. This is a simple C++ program with functionality similar to the shell script above; it also monitors a given file on the local machine. To test this program, perform the following steps (a note on wiring in the compiled agent follows the steps):
(1) Compile demoagent1.cpp using the supplied makefile.
The makefile needs to be adapted to the local compiler/linker paths and installation locations. The output is an executable named demoagent1.
(2) Start Clusterware.
(3) Add a new resource type using the crsctl utility:
$ crsctl add type test_type1 -basetype cluster_resource -attr \
"ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" -attr \
"ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/to/demoActionScript"
Modify the file paths as appropriate. This adds a new resource type to Clusterware. The attributes can also be placed in a text file that is passed as a parameter to the crsctl utility.
(4) Add new resources to the cluster using the crsctl utility, as follows:
$ crsctl add resource r3 -type test_type1 -attr "PATH_NAME=/tmp/r1.txt"
$ crsctl add resource r4 -type test_type1 -attr "PATH_NAME=/tmp/r2.txt"
Specifying the path name for each resource is mandatory.
(5) Start and stop the resources using the crsctl utility:
$ crsctl start res r3
$ crsctl start res r4
$ crsctl check res r3
$ crsctl stop res r4
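Note that the resource type in step (3) still points at the action script. To have Clusterware call the compiled demoagent1 executable instead, the AGENT_FILENAME attribute can be set on the resource; a hedged sketch (the path /path/to/demoagent1 is a placeholder):
$ crsctl modify resource r3 -attr "AGENT_FILENAME=/path/to/demoagent1"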
Oracle also provides demoagent2.cpp in the Grid_home/crs/demo directory. This is a simple C++ program similar to the one above that monitors a given file on the local machine. However, this program defines only the check action entry point; all other entry points are left undefined and are read from the ACTION_SCRIPT attribute. To test this program, perform the following steps:
(1) Compile demoagent2.cpp using the supplied makefile.
The makefile needs to be adapted to the local compiler/linker paths and installation locations. The output is an executable named demoagent2.
(2) Start Clusterware.
(3) Add a new resource type using the crsctl utility:
$ crsctl add type test_type1 -basetype cluster_resource -attr \
"ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" -attr \
"ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/to/demoActionScript"
Modify the file paths as appropriate. This adds a new resource type to Clusterware. The attributes can also be placed in a text file that is passed as a parameter to the crsctl utility.
(4) Add new resources to the cluster using the crsctl utility, as follows:
$ crsctl add resource r5 -type test_type1 -attr "PATH_NAME=/tmp/r1.txt"
$ crsctl add resource r6 -type test_type1 -attr "PATH_NAME=/tmp/r2.txt"
Specifying the path name for each resource is mandatory.
(5) Start and stop the resources using the crsctl utility:
$ crsctl start res r5
$ crsctl start res r6
$ crsctl check res r5
$ crsctl stop res r6
Overview
This tool (formerly known as the Instantaneous Problem Detection tool, IPD/OS) detects and analyzes degradations and failures of the operating system (OS) and of cluster resources, in order to better explain many Oracle Clusterware and Oracle RAC problems such as node evictions.
It continuously tracks OS resource consumption at the node, process and device level, and it collects and analyzes the data cluster-wide. In real-time mode an alert is shown to the operator when a threshold is reached. For root-cause analysis, historical data can be replayed to understand what was happening at the time of a failure. Installation is very simple and is described in the README inside the zip file. The latest version is uploaded to OTN at the following link:
http://www.oracle.com/technology/products/database/clustering/ipd_download_homepage.html
To install the tool on a list of nodes, run the following basic steps (read the README for more details; a condensed command-line sketch follows the startup notes below):
– Unzip the package
– Create the user crfuser (group oinstall) on all nodes
– Make sure the home directory of crfuser is identical on all nodes
– Set the password for crfuser on all nodes
– Log in as crfuser and run crfinst.pl with the appropriate options
– To finish the installation, log in as root and run crfinst.pl -f on all nodes being installed
– On Linux, CRF_home is set to /usr/lib/oracrf
The OS tool must be started via /etc/init.d/init.crfd. This starts the osysmond process, which spawns the ologgerd daemon. The ologgerd then picks a replica node (if there are two or more nodes) and instructs the osysmond on that node to spawn a replica ologgerd daemon.
The OS tool stack can be disabled on a node as follows:
# /etc/init.d/init.crfd disable
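Putting the installation and startup steps above together, a condensed sketch (node names, the zip file name, and the crfinst.pl options other than -f are illustrative; consult the README for the exact syntax):
$ unzip crfpack-linux.zip           # as crfuser, on the installing node
$ ./crfinst.pl -i node1,node2       # node list and options are illustrative
# then, as root, on every node being installed:
# ./crfinst.pl -f
# /etc/init.d/init.crfd enable      # assumed counterpart of the 'disable' command above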
osysmond (one collector process on every node) does the following to gather the data:
– Monitors and periodically samples system metrics
– Runs as a real-time process
– Verifies the system metrics against validation rules
– Flags color-coded alerts based on thresholds
– Sends the data to the master logger daemon (ologgerd)
– Logs the data to local disk if sending fails
osysmond also alerts on a perceived node hang (for example, resources left under-utilized despite many pending user tasks).
Oracle Cluster Health Monitor ships with two data-retrieval tools. One is the CRF GUI (crfgui), the main graphical display.
crfgui connects to the local or a remote master loggerd. The loggerd is detected automatically when the GUI is installed inside the cluster; otherwise, when running outside the cluster, a cluster node must be specified with the "-m" switch.
The GUI alerts on critical resource-usage events and on perceived system hangs. Once started, it supports different views such as a cluster view, a node view and a device view (an example invocation follows the usage text).
Usage: crfgui [-m <node>] [-d <time>] [-r <sec>] [-h <sec>]
[-W <sec>] [-i] [-f <name>] [-D <int>]
-m <node> Name of the master node (tmp)
-d <time> Delayed at a past time point
-r <sec> Refresh rate
-h <sec> Highlight rate
-W <sec> Maximal poll time for connection
-i Interactive with cmd prompt
-f <name> read from file, “.trc” added if no suffix given
-D <int> sets an internal debug level
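For example, to run the GUI from a machine outside the cluster against the cluster node node1 (a hypothetical host name), refreshing every 10 seconds:
$ crfgui -m node1 -r 10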
oclumon
A command-line tool, oclumon, is included in the package. It can be used to query the Berkeley DB back end and print node-specific metrics for a specified time period to the terminal. The tool also supports querying, for a given time window, how long and in which state a resource on a node has been. These states are based on predefined thresholds for each resource metric and are denoted red, orange, yellow and green, in decreasing order of criticality. For example, you can ask how many seconds node 'node1' stayed in the red state for CPU during the last hour. oclumon can also be used for administrative tasks such as changing the debug level, querying the tool version and changing the size of the metrics database.
The oclumon usage help can be printed with oclumon -h. For more information about the options of each verb, run oclumon <verb> -h.
The currently supported verbs are:
showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help.
Below are examples of some useful arguments that can be passed to oclumon. The default location of oclumon is /usr/lib/oracrf/bin/oclumon.
Showobjects
oclumon showobjects -n node -time "2009-10-07 15:11:00"
Dumpnodeview
oclumon dumpnodeview -n node
Showgaps
oclumon showgaps -n node1 -s "2009-10-07 02:40:00" \
-e "2009-10-07 03:59:00"
Number of gaps found = 0
Showtrail
oclumon showtrail -n node1 -diskid sde qlen totalwaittime \
-s "2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \
-c "red" "yellow" "green"
Parameter=QUEUE LENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
Parameter=TOTAL WAIT TIME
oclumon showtrail -n node1 -sys cpuqlen -s \
"2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \
-c "red" "yellow" "green"
Parameter=CPU QUEUELENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
In Oracle Clusterware 11g Release 2, Grid_home/bin/diagcollection.pl also collects the Oracle Cluster Health Monitor data if it finds the tool installed in the cluster, and Oracle recommends using it.
To collect data after a node hang or eviction and analyze the problem, perform the following steps (a consolidated example follows the list):
– As the IPD owner, run 'Grid_home/bin/diagcollection.pl --collect --ipd --incidenttime <inc time> --incidentduration <duration>' on the master loggerd node, where --incidenttime is in the format MM/DD/YYYY24HH:MM:SS and --incidentduration is in the format HH:MM.
– Identify the master loggerd node with the command /usr/lib/oracrf/bin/oclumon manage -getkey "MASTER=". Starting with 11.2.0.2, oclumon is located under Grid_home/bin.
– Collect data covering at least 30 minutes before and after the incident time, for example: masterloggerhost:$ ./bin/diagcollection.pl --collect --ipd --incidenttime 10/05/200909:10:11 --incidentduration 02:00. Starting with 11.2.0.2 and the CRS-integrated IPD/OS, the syntax to collect the IPD data is: masterloggerhost:$ ./bin/diagcollection.pl --collect --crshome /scratch/grid_home_11.2/ --ipdhome /scratch/grid_home_11.2/ --ipd --incidenttime 01/14/201001:00:00 --incidentduration 04:00
– The IPD data files look like ipdData_<hostname>_<curr time>.tar.gz, for example ipdData_node1_20091006_2321.tar.gz
– How long does it take to run diagcollect?
4 node cluster, 4 hour data – 10 min
32 node cluster, 1 hour data – 20 min
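Putting the steps above together, a typical collection on the master loggerd node might look like this sketch (the incident time and duration are the example values from above):
$ /usr/lib/oracrf/bin/oclumon manage -getkey "MASTER="    # identify the master loggerd node
$ cd <Grid_home>                                          # on that node, as the IPD owner
$ ./bin/diagcollection.pl --collect --ipd --incidenttime 10/05/200909:10:11 --incidentduration 02:00
$ ls ipdData_*.tar.gz                                     # e.g. ipdData_node1_20091006_2321.tar.gz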
To enable debugging for osysmond or loggerd, run 'oclumon debug log all allcomp:5' as root. This turns on debugging for all components.
Starting with 11.2.0.2, the IPD (Cluster Health Monitor) log files are located under: Grid_home/log/<hostname>/crfmond, Grid_home/log/<hostname>/crfproxy and Grid_home/log/<hostname>/crflogd.
In a development environment, installing and starting IPD/OS is even simpler:
$ cd crfutl && make setup && runcrf
osysmond usually starts immediately, while it may take a few seconds (minutes, if your I/O subsystem is slow) for ologgerd and oproxyd to start, because of the Berkeley Database (BDB) initialization. The first node to call 'runcrf' is configured as the master; the first node to run 'runcrf' after the master is configured as the replica. From there, the roles move around if needed. The daemons to look out for are: osysmond (all nodes), ologgerd (master and replica nodes) and oproxyd (all nodes).
In a development environment, the IPD/OS processes run neither as root nor in real time.
In an 11.2.0.3 RAC environment, because each instance caches temp space, a node may fail to allocate temp space and report ORA-1652 even though plenty of free temp space is still available cluster-wide.
For details, refer to the following note; a patch is now available for Bug 14383007 - Sort runs out of temp space in RAC even when temp space is available.
Applies to:
Oracle Database – Enterprise Edition – Version 11.2.0.3 to 11.2.0.3 [Release 11.2]
Information in this document applies to any platform.
Symptoms
Temporary tablespace space allocation fails in RAC even when there is still free temp space.
ORA-12801: error signaled in parallel query server P017
ORA-01652: unable to extend temp segment by 640 in tablespace XY_TEMP
Cause
Unbalanced temp space distribution in RAC. One instance seems to consume and cache most of the temp space, causing another instance to hit the ora-1652.
SQL> select inst_id, tablespace_name, round((total_blocks*8192)/(1024*1024*1024),2) "Space(GB)"
2 from gv$sort_segment
3 where tablespace_name='XY_TEMP'
4 order by 1;
INST_ID TABLESPACE_NAME Space(GB)
———- —————————— ———-
1 XY_TEMP
2 XY_TEMP
3 XY_TEMP
4 XY_TEMP
5 XY_TEMP
6 XY_TEMP 1118.79 <<<<very unbalanced
7 XY_TEMP
8 XY_TEMP
This is reported in Bug 14383007 – sort runs out of temp space on 2 nodes even when temp space is available
This bug will be fixed in 11.2.0.4 (future release). Refer to Document 14383007.8 for more details.
Useful queries for debugging:
Collect the information every few seconds (a small collection-loop sketch follows the queries):
1. select * from gv$sort_segment
2. select sum(bytes), owner from gv$temp_extent_map group by owner;
3. select inst_id, blocks_cached, blocks_used, extents_cached, extents_used from GV$TEMP_EXTENT_POOL;
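A small collection-loop sketch for gathering these numbers every few seconds (the interval, log file and connection method are placeholders; it runs query 3 above plus a timestamp):
while true; do
  sqlplus -s / as sysdba <<'EOF' >> /tmp/temp_usage.log
select systimestamp from dual;
select inst_id, blocks_cached, blocks_used, extents_cached, extents_used from gv$temp_extent_pool;
EOF
  sleep 10
done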
Solution
Workaround: retry the operation.
A one-off patch for bug 14383007 has been provided for certain platforms; please check My Oracle Support for patch details.
Bug 14383007 Sort runs out of temp space in RAC even when temp space is available
Versions confirmed as being affected 11.2.0.3
Description
Temp space allocation fails with out-of-space ORA-1652 errors in RAC even when
there is still free temp space available.
Rediscovery Notes:
User might hit this issue if temp space allocation fails with out-of-space in
RAC even when there is still free temp space.