Exadata Database Machine Health Check Report

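The Exadata Database Machine combines hardware and software and embodies the best of Oracle's technology, so its regular health checks should not be taken lightly.
Exadata health checks are based mainly on Exachk, the standardized tool from Oracle Support. For a detailed introduction to Exachk, see:
An overview of Exachk and its best practices: running Exachk regularly collects system information from the Exadata machine and compares the customer's current configuration against Oracle best practices to produce recommended values, so that potential problems are found early, hidden risks are removed, and the Exadata system keeps running stably.
Here we give only the concrete steps for using Exachk: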
$ ./exachk
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/grid?[y/n][y]y
Checking ssh user equivalency settings on all nodes in cluster
./exachk: line 8674: [: 5120: unary operator expected
Space available on ware at /tmp is KB and required space is 5120 KB
Please make at least 10MB space available at above location and retry to continue.[y/n][y]?
RAT_CLUSTERNODES needs to be set to specify the names of the nodes to check:
su - oracle
$ export RAT_CLUSTERNODES="dm01db01-priv dm01db02-priv"
$ export RAT_DBNAMES="orcl,dbm"
$ ./exachk
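
The failed first run above reported an empty free-space figure for /tmp, which appears to be what tripped the script's numeric test. As a minimal sketch, not part of exachk itself, the same pre-flight conditions can be checked by hand before starting the tool. The helper names free_kb and has_space are hypothetical. The 10240 KB threshold is taken from the "at least 10MB" message in the log. The node and database names are the ones used in this session.

```shell
#!/bin/sh
# Pre-flight sketch before launching exachk. free_kb and has_space are
# hypothetical helper names, not functions provided by exachk.

# Free space, in KB, on the filesystem holding the given directory.
# "df -P" forces POSIX single-line output; field 4 of line 2 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when directory $1 has at least $2 KB free.
has_space() {
    avail=$(free_kb "$1")
    # Reject an empty value explicitly so the numeric test cannot
    # fail with a "unary operator expected" style error.
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Node and database names copied from the session above; adjust as needed.
RAT_CLUSTERNODES="dm01db01-priv dm01db02-priv"
RAT_DBNAMES="orcl,dbm"
export RAT_CLUSTERNODES RAT_DBNAMES

if has_space /tmp 10240; then
    echo "OK: /tmp has at least 10 MB free; start the tool with ./exachk"
else
    echo "Not enough free space in /tmp" >&2
fi
```

Because the variables are exported, an exachk process started from the same shell inherits them.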

[oracle@192 tmp]$ ./exachk

CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/grid?[y/n][y]y

Checking ssh user equivalency settings on all nodes in cluster

Node dm01db01-priv is configured for ssh user equivalency for oracle user

Node dm01db02-priv is configured for ssh user equivalency for oracle user

Searching out ORACLE_HOME for selected databases.

. . . . 

Checking Status of Oracle Software Stack - Clusterware, ASM, RDBMS

. . . . . . . . . . . . . . . . . . . /u01/app/11.2.0.3/grid/bin/cemutlo.bin: Failed to initialize communication with CSS daemon, error code 3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
-------------------------------------------------------------------------------------------------------
                                                 Oracle Stack Status                            
-------------------------------------------------------------------------------------------------------
Host Name  CRS Installed  ASM HOME       RDBMS Installed  CRS UP    ASM UP    RDBMS UP  DB Instance Name
-------------------------------------------------------------------------------------------------------
192         Yes             Yes             Yes             Yes        No       Yes                
dm01db01-priv Yes             Yes             Yes             Yes        No       Yes                
dm01db02-priv Yes             Yes             Yes             Yes        No       Yes                
-------------------------------------------------------------------------------------------------------

root user equivalence is not setup between 192 and STORAGE SERVER dm01cel01.

1. Enter 1 if you will enter root password for each STORAGE SERVER when prompted.

2. Enter 2 to exit and configure root user equivalence manually and re-run exachk.

3. Enter 3 to skip checking best practices on STORAGE SERVER.

Please indicate your selection from one of the above options[1-3][1]:- 1

Is root password same on all STORAGE SERVER?[y/n][y]y

Enter root password for STORAGE SERVER :- 

97 of the included audit checks require root privileged data collection on DATABASE SERVER. If sudo is not configured or the root password is not available, audit checks which  require root privileged data collection can be skipped.

1. Enter 1 if you will enter root password for each on DATABASE SERVER host when prompted

2. Enter 2 if you have sudo configured for oracle user to execute root_exachk.sh script on DATABASE SERVER

3. Enter 3 to skip the root privileged collections on DATABASE SERVER

4. Enter 4 to exit and work with the SA to configure sudo on DATABASE SERVER or to arrange for root access and run the tool later.

Please indicate your selection from one of the above options[1-4][1]:- 1

Is root password same on all compute nodes?[y/n][y]y

Enter root password on DATABASE SERVER:- 

9 of the included audit checks require nm2user privileged data collection on INFINIBAND SWITCH .

1. Enter 1 if you will enter nm2user password for each INFINIBAND SWITCH when prompted

2. Enter 2 to exit and to arrange for nm2user access and run the exachk later.

3. Enter 3 to skip checking best practices on INFINIBAND SWITCH

Please indicate your selection from one of the above options[1-3][1]:- 3

*** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***

Log file for collections and audit checks are at
/tmp/exachk_040613_105703/exachk.log

Starting to run exachk in background on dm01db01-priv

Starting to run exachk in background on dm01db02-priv

=============================================================
                    Node name - 192                                
=============================================================
Collecting - Compute node PCI bus slot speed for infiniband HCAs
Collecting - Kernel parameters
Collecting - Maximum number of semaphore sets on system
Collecting - Maximum number of semaphores on system
Collecting - Maximum number of semaphores per semaphore set
Collecting - Patches for Grid Infrastructure 
Collecting - Patches for RDBMS Home 
Collecting - RDBMS patch inventory
Collecting - number of semaphore operations per semop system call
Preparing to run root privileged commands on DATABASE SERVER 192.

Starting to run root privileged commands in background on STORAGE SERVER dm01cel01

root@192.168.64.131's password: 

Starting to run root privileged commands in background on STORAGE SERVER dm01cel02

root@192.168.64.132's password: 

Starting to run root privileged commands in background on STORAGE SERVER dm01cel03

root@192.168.64.133's password: 
Collecting - Ambient Temperature on storage server 
Collecting - Exadata Critical Issue EX10 
Collecting - Exadata Critical Issue EX11 
Collecting - Exadata software version on storage server 
Collecting - Exadata software version on storage servers 
Collecting - Exadata storage server system model number  
Collecting - RAID controller version on storage servers 
Collecting - Verify Disk Cache Policy on storage servers 
Collecting - Verify Electronic Storage Module (ESM) Lifetime is within Specification  
Collecting - Verify Hardware and Firmware on Database and Storage Servers (CheckHWnFWProfile) [Storage Server] 
Collecting - Verify Master (Rack) Serial Number is Set [Storage Server] 
Collecting - Verify PCI bridge is configured for generation II on storage servers 
Collecting - Verify RAID Controller Battery Condition [Storage Server] 
Collecting - Verify RAID Controller Battery Temperature [Storage Server] 
Collecting - Verify There Are No Storage Server Memory (ECC) Errors  
Collecting - Verify service exachkcfg autostart status on storage server 
Collecting - Verify storage server disk controllers use writeback cache  
Collecting - verify asr exadata configuration check via ASREXACHECK on storage servers 
Collecting - Configure Storage Server alerts to be sent via email 
Collecting - Exadata Celldisk predictive failures 
Collecting - Exadata Critical Issue EX9 
Collecting - Exadata storage server root filesystem free space 
Collecting - HCA firmware version on storage server 
Collecting - OFED Software version on storage server 
Collecting - OSWatcher status on storage servers 
Collecting - Operating system and Kernel version on storage server 
Collecting - Scan storage server alerthistory for open alerts 
Collecting - Storage server flash cache mode 
Collecting - Verify Data Network is Separate from Management Network on storage server 
Collecting - Verify Ethernet Cable Connection Quality on storage servers 
Collecting - Verify Exadata Smart Flash Cache is created 
Collecting - Verify Exadata Smart Flash Log is Created 
Collecting - Verify InfiniBand Cable Connection Quality on storage servers 
Collecting - Verify Software on Storage Servers (CheckSWProfile.sh)  
Collecting - Verify average ping times to DNS nameserver 
Collecting - Verify celldisk configuration on disk drives 
Collecting - Verify celldisk configuration on flash memory devices 
Collecting - Verify griddisk ASM status 
Collecting - Verify griddisk count matches across all storage servers where a given prefix name exists 
Collecting - Verify storage server metric CD_IO_ST_RQ 
Collecting - Verify there are no griddisks configured on flash memory devices 
Collecting - Verify total number of griddisks with a given prefix name is evenly divisible of celldisks 
Collecting - Verify total size of all griddisks fully utilizes celldisk capacity 
Collecting - mpt_cmd_retry_count from /etc/modprobe.conf on Storage Servers 

Data collections completed. Checking best practices on 192.
--------------------------------------------------------------------------------------

 FAIL =>    CSS misscount should be set to the recommended value of 60
 FAIL =>    Database Server InfiniBand network MTU size is NOT 65520
 WARNING => Database has one or more dictionary managed tablespace
 WARNING => RDBMS Version is NOT 11.2.0.2 as expected
 FAIL =>    Storage Server alerts are not configured to be sent via email
 FAIL =>    Management network is not separate from data network
 WARNING => NIC bonding is NOT configured for public network (VIP)
 WARNING => NIC bonding is  not configured for interconnect
 WARNING => SYS.AUDSES$ sequence cache size < 10,000
 WARNING => GC blocks lost is occurring
 WARNING => Some tablespaces are not using Automatic segment storage management.
 WARNING => SYS.IDGEN1$ sequence cache size < 1,000
 WARNING => Interconnect is configured on routable network addresses
 FAIL =>    Some data or temp files are not autoextensible
 FAIL =>    One or more Ethernet network cables are not connected.
 WARNING => Multiple RDBMS instances discovered, observe database consolidation best practices
 INFO =>    ASM griddisk,diskgroup and Failure group mapping not checked.
 FAIL =>    One or more storage server has stateless alerts with null "examinedby" fields.
 WARNING => Standby is not opened read only with managed recovery in real time apply mode
 FAIL =>    Managed recovery process is not running
 FAIL =>    Flashback on PRIMARY is not configured
 WARNING => Standby is not in READ ONLY WITH APPLY mode
 FAIL =>    Flashback on STANDBY is not configured
 FAIL =>    No one high redundancy diskgroup configured
 INFO =>    Operational Best Practices
 INFO =>    Consolidation Database Practices
 INFO =>    Network failure prevention best practices
 INFO =>    Computer failure prevention best practices
 INFO =>    Data corruption prevention best practices
 INFO =>    Logical corruption prevention best practices
 INFO =>    Storage failures prevention best practices
 INFO =>    Database/Cluster/Site failure prevention best practices
 INFO =>    Client failover operational best practices
 FAIL =>    Some bigfile tablespaces do not have non-default maxbytes values set
 FAIL =>    Standby database is not in sync with primary database
 FAIL =>    Redo transport from primary to standby has more than 5 minutes or more lag
 FAIL =>    Standby database is not in sync with primary database
 FAIL =>    System may be exposed to Exadata Critical Issue DB11 /u01/app/oracle/product/11.2.0.3/dbhome_1
 FAIL =>    System may be exposed to Exadata Critical Issue DB11 /u01/app/oracle/product/11.2.0.3/orcl
 INFO =>    Software maintenance best practices
 FAIL =>    Operating system hugepages count does not satisfy total SGA requirements
 FAIL =>    Table AUD$[FGA_LOG$] should use Automatic Segment Space Management
 INFO =>    Database failure prevention best practices
 WARNING => Database Archivelog Mode should be set to ARCHIVELOG
 WARNING => Some tablespaces are not using Automatic segment storage management.
 WARNING => Database has one or more dictionary managed tablespace
 WARNING => Unsupported data types preventing Data Guard (transient logical standby or logical standby) rolling upgrade
Collecting patch inventory on  CRS HOME /u01/app/11.2.0.3/grid
Collecting patch inventory on ORACLE_HOME /u01/app/oracle/product/11.2.0.3/dbhome_1 
Collecting patch inventory on ORACLE_HOME /u01/app/oracle/product/11.2.0.3/orcl 

Copying results from dm01db01-priv and generating report. This might take a while. Be patient.

---------------------------------------------------------------------------------
                      CLUSTERWIDE CHECKS
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------

Detailed report (html) - /tmp/exachk_192_dbm_040613_105703/exachk_192_dbm_040613_105703.html

UPLOAD(if required) - /tmp/exachk_192_dbm_040613_105703.zip
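Finally, a report packaged as a zip file is generated; it can be uploaded to GCS on a regular basis.
The HTML version of the generated report is shown in the figure below:
[Figure: Exadata Database Machine health check report]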
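
For a quick overview before opening the HTML report, the FAIL / WARNING / INFO findings printed on the console can be tallied with standard tools. The following is a rough sketch that assumes the console output was saved to a text file, for example with tee. The function names count_findings and summarize are hypothetical, and the sketch relies on grep -o, which GNU and BSD grep support but POSIX does not require.

```shell
#!/bin/sh
# Count how often a severity marker such as "FAIL =>" occurs in a file.
# "grep -o" prints every match on its own line, so two findings that
# happen to share one physical line are both counted.
count_findings() {
    grep -o "$1 =>" "$2" | wc -l | tr -d ' '
}

# Print a one-line tally for a saved copy of the exachk console output.
summarize() {
    echo "FAIL=$(count_findings FAIL "$1") WARNING=$(count_findings WARNING "$1") INFO=$(count_findings INFO "$1")"
}
```

With the output saved under a name of your choosing, say exachk_console.txt (a hypothetical file name), `summarize exachk_console.txt` prints the three counts on one line. That makes it easy to compare one periodic run with the next.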
