Exadata FAQ: How is the Exadata IOPS figure calculated?

Original link: http://www.dbaleet.org/exadata_how_to_caculate_iops/

Thomas Zhang once raised an interesting question: how is the IOPS figure in the Exadata datasheet calculated? I suspect many Exadata users share the same doubt. The customer's unspoken point is: "I have dealt with plenty of servers and storage, and this number looks inflated to me; vendors do like to brag."

Take a quarter-rack Exadata with HC (High Capacity) disks: three storage cells, each with twelve 7200 rpm 3TB SAS disks. The official figure is 6000 IOPS. So where does that 6000 come from? Let's make a simple estimate below.

As the datasheet screenshot above shows, the IOPS for HC and HP disks are quoted separately. To keep things simple, here is the table:

 

Exadata Rack   Disk Type   Disk Count   Disk Model            IOPS
FULL (1/1)     HP          14*12=168    15000rpm SAS 600GB    50000
FULL (1/1)     HC          14*12=168    7200rpm SAS 3TB       28000
HALF (1/2)     HP          7*12=84      15000rpm SAS 600GB    25000
HALF (1/2)     HC          7*12=84      7200rpm SAS 3TB       14000
QUAR (1/4)     HP          3*12=36      15000rpm SAS 600GB    10800
QUAR (1/4)     HC          3*12=36      7200rpm SAS 3TB       6000

A quick look reveals the pattern: the total IOPS in the datasheet is simply the per-disk figure added up.

For example, the HC quarter rack is rated at 6000 IOPS, which gives 6000/3 = 2000 IOPS per storage cell and 2000/12 = 166.67 IOPS per high-capacity disk. The HP quarter rack is rated at 10800 IOPS, i.e. 10800/3 = 3600 IOPS per cell and 3600/12 = 300 IOPS per high-performance disk. The conclusion: a high-performance disk is counted as 300 IOPS and a high-capacity disk as 166.67 IOPS.
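
As a quick sanity check, the arithmetic above can be reproduced with a throwaway awk one-liner; the rack-level numbers are the datasheet values from the table, everything else is derived:

awk 'BEGIN {
    hc = 6000; hp = 10800;                 # quarter-rack datasheet IOPS, 3 cells x 12 disks
    printf "HC: %.0f IOPS per cell, %.2f IOPS per disk\n", hc/3, hc/3/12;
    printf "HP: %.0f IOPS per cell, %.2f IOPS per disk\n", hp/3, hp/3/12;
}'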

The real question, then, is whether a 7200 rpm 3TB SAS disk actually delivers 166.67 IOPS, and whether a 15000 rpm 600GB SAS disk actually delivers 300 IOPS.

Let's first hand the question over to Wikipedia (see its IOPS article).

Wikipedia puts a single 15000 rpm SAS disk at roughly 175-210 IOPS and a single 7200 rpm SATA disk at 75-100 IOPS. Taking the top of those ranges, the total for a 1/4 HC rack works out to only 3600 IOPS, about 40% below the official 6000; the same calculation for HP also falls short by roughly a third. Why such a large gap? Maybe those numbers are off, so let's try another widely circulated way of estimating disk IOPS: IOPS = 1s / (seek time + rotational latency + data transfer time).

Assuming an average physical seek time of 3 ms, the theoretical maximum IOPS for 7200, 10K and 15K rpm disks is:

 

IOPS(7200rpm)= 1000 / (3 + 60000/7200/2) = 140
IOPS (10000rpm) = 1000 / (3 + 60000/10000/2) = 167
IOPS (15000rpm)= 1000 / (3 + 60000/15000/2) = 200
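
For convenience, here is a small shell sketch of the same estimate; the 3 ms average seek time is the assumption stated above, and rotational latency is taken as half a revolution:

for rpm in 7200 10000 15000; do
    # IOPS ~= 1000 ms / (avg seek ms + half-revolution latency ms)
    awk -v rpm=$rpm 'BEGIN { printf "%5d rpm: ~%.0f IOPS\n", rpm, 1000/(3 + 60000/rpm/2) }'
done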

 

By this estimate the 7200 rpm figure comes out roughly 40% higher than the Wikipedia range, while the 15000 rpm figure barely changes. Either way, the realistic IOPS is still about 30% below what Oracle claims. So what causes the difference? Oracle actually uses a different term, Database Disk IOPS (see the screenshot above). What is that? The official explanation reads: "Based on read IO requests of size 8K running SQL. Note that the IO size greatly affects Flash IOPS. Others quote IOPS based on 2K, 4K or smaller IOs and are not relevant for databases. Exadata Flash read IOPS are so high they are typically limited by database server CPU, not IO. This is especially true for the Storage Expansion Racks." There are two important pieces of information here: 1. the metric is based on read I/O requests; 2. it is based on an 8K I/O size. The implication is that early Exadata was designed for DW, where read performance matters most, and that database I/O is 8K or larger, so I/O requests smaller than 8K are of little interest to an Oracle database.

In fact, the Exadata installation includes a step that tests disk I/O performance, normally step 9: INFO: Step 9 RunCalibrate. It runs IOPS and MBPS tests against the cell disks, and the installation fails if any disk cannot reach the required IOPS. One common case: Seagate disks deliver noticeably lower IOPS when the ambient temperature is below 20 degrees Celsius; see Bug 9476044: CALIBRATE IOPS SUBSTANDARD. This is a "feature" of the Seagate SAS disks; the Hitachi disks also used in Exadata do not show the problem (these two vendors are currently Exadata's only disk suppliers). For this reason we generally advise against pointing the machine-room air conditioning straight at an Exadata for cooling.

Below is real calibrate output from a customer running Seagate 7200 rpm 3TB SAS high-capacity disks. For brevity only cel01 is shown; the other cells look much the same.

 

INFO: Running /usr/local/bin/dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root cellcli -e calibrate force to calibrate cells...
SUCCESS: Ran /usr/local/bin/dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root cellcli -e calibrate force and it returned: RC=0
cel01: Calibration will take a few minutes...
 cel01: Aggregate random read throughput across all hard disk luns: 1466 MBPS
 cel01: Aggregate random read throughput across all flash disk luns: 4183.67 MBPS
 cel01: Aggregate random read IOs per second (IOPS) across all hard disk luns: 2369
 cel01: Aggregate random read IOs per second (IOPS) across all flash disk luns: 157585
 cel01: Controller read throughput: 2020.61 MBPS
 cel01: Calibrating hard disks  ...(read only)
 cel01: Lun 0_0  on drive [20:0     ] random read throughput: 126.08 MBPS, and 194 IOPS
 cel01: Lun 0_1  on drive [20:1     ] random read throughput: 122.98 MBPS, and 190 IOPS
 cel01: Lun 0_10 on drive [20:10    ] random read throughput: 132.91 MBPS, and 200 IOPS
 cel01: Lun 0_11 on drive [20:11    ] random read throughput: 126.68 MBPS, and 199 IOPS
 cel01: Lun 0_2  on drive [20:2     ] random read throughput: 132.73 MBPS, and 204 IOPS
 cel01: Lun 0_3  on drive [20:3     ] random read throughput: 126.32 MBPS, and 201 IOPS
 cel01: Lun 0_4  on drive [20:4     ] random read throughput: 131.33 MBPS, and 202 IOPS
 cel01: Lun 0_5  on drive [20:5     ] random read throughput: 129.67 MBPS, and 202 IOPS
 cel01: Lun 0_6  on drive [20:6     ] random read throughput: 131.65 MBPS, and 201 IOPS
 cel01: Lun 0_7  on drive [20:7     ] random read throughput: 127.67 MBPS, and 200 IOPS
 cel01: Lun 0_8  on drive [20:8     ] random read throughput: 127.63 MBPS, and 201 IOPS
 cel01: Lun 0_9  on drive [20:9     ] random read throughput: 130.88 MBPS, and 201 IOPS
 cel01: Calibrating flash disks (read only, note that writes will be significantly slower) ...
 cel01: Lun 1_0  on drive [FLASH_1_0] random read throughput: 273.71 MBPS, and 20013 IOPS
 cel01: Lun 1_1  on drive [FLASH_1_1] random read throughput: 272.84 MBPS, and 20014 IOPS
 cel01: Lun 1_2  on drive [FLASH_1_2] random read throughput: 272.78 MBPS, and 19996 IOPS
 cel01: Lun 1_3  on drive [FLASH_1_3] random read throughput: 273.64 MBPS, and 19962 IOPS
 cel01: Lun 2_0  on drive [FLASH_2_0] random read throughput: 273.73 MBPS, and 20738 IOPS
 cel01: Lun 2_1  on drive [FLASH_2_1] random read throughput: 273.81 MBPS, and 20724 IOPS
 cel01: Lun 2_2  on drive [FLASH_2_2] random read throughput: 273.69 MBPS, and 20734 IOPS
 cel01: Lun 2_3  on drive [FLASH_2_3] random read throughput: 273.96 MBPS, and 20737 IOPS
 cel01: Lun 4_0  on drive [FLASH_4_0] random read throughput: 273.63 MBPS, and 19959 IOPS
 cel01: Lun 4_1  on drive [FLASH_4_1] random read throughput: 273.85 MBPS, and 19933 IOPS
 cel01: Lun 4_2  on drive [FLASH_4_2] random read throughput: 273.76 MBPS, and 19944 IOPS
 cel01: Lun 4_3  on drive [FLASH_4_3] random read throughput: 272.97 MBPS, and 19911 IOPS
 cel01: Lun 5_0  on drive [FLASH_5_0] random read throughput: 273.87 MBPS, and 20022 IOPS
 cel01: Lun 5_1  on drive [FLASH_5_1] random read throughput: 273.04 MBPS, and 20002 IOPS
 cel01: Lun 5_2  on drive [FLASH_5_2] random read throughput: 273.66 MBPS, and 19998 IOPS
 cel01: Lun 5_3  on drive [FLASH_5_3] random read throughput: 273.77 MBPS, and 19991 IOPS
 cel01: CALIBRATE results are within an acceptable range.

 

 

 

From this log we can clearly see that each disk delivers around 200 Database IOPS and the whole cell 2369, slightly higher than the official per-disk 166.67 and per-cell 2000. Incidentally, the earlier HC SAS disks measured around 180 per disk.

To sum up, my personal view is that the IOPS Oracle quotes for Exadata is specifically Database IOPS, i.e. IOPS measured under a particular set of conditions. Real-world IOPS will be noticeably lower, and since Oracle Database on Exadata uses ASM normal redundancy, the usable Database IOPS is effectively halved again. Users can measure the real IOPS with professional third-party tools; some people suggest Oracle's own Orion (ahem). If Oracle wanted an authoritative number it could submit Exadata to SPC-1, the industry's well-known storage benchmark. Why doesn't it? Hardware purists may call this marketing hype, but to be fair, Exadata has never relied on simply piling up hardware for performance. Exadata is an engineered system, and the most important thing in an engineered system is balance; judging from customer experience, disk I/O is usually the bottleneck. If you run an OLTP system with lots of small, intensive write I/O, do not order the high-capacity X2 (the default order is now X3). Oracle seems to have recognized this and in the next generation, X3, leans heavily on flash to compensate for the IOPS limits of the physical disks, since flash IOPS is far higher than hard-disk IOPS.

Appendix 1:

The disk vendors and models used in Exadata are listed below; readers can google the detailed specs (note that the 2TB disk has been discontinued):

600GB 15000rpm HP disk: Seagate ST3600057SS, Hitachi HUS156060VLS600
3TB 7200rpm HC disk: Seagate ST33000650SS
2TB 7200rpm HC disk: Seagate ST32000444SS, Hitachi HUS723020ALS640

Appendix 2:

metric_iorm.pl, a tool that measures database IOPS on Exadata hard disks and flash cards, is sometimes useful when diagnosing Exadata I/O problems:

Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl (Doc ID 1337265.1)

That is all.

 

If we look purely at the hardware, I personally do not consider the disk IOPS figures Oracle publishes reliable, for the following reasons:
1. The quoted IOPS is a maximum value, a momentary peak that can usually only be sustained for a short time, not a rate the system can hold continuously.
2. The claim "Others quote IOPS based on 2K, 4K or smaller IOs and are not relevant for databases" is questionable. The database block size may be 8192 bytes, but the OS block size on Linux is 4096 bytes and on Solaris 512 bytes, so blocks larger than that are not handled as a single unit at the OS level and database blocks end up being split. Saying that 2K or 4K I/O is irrelevant to databases is therefore not accurate. Moreover, the smallest unit a disk can read or write is a sector, and a great many current disks use 4096-byte sectors (see http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/), so the "not relevant" claim does not hold. (A quick way to check the sizes involved is sketched below.)
3. Most IOPS benchmarks are based on 4K I/O, and the F20 flash card does publish 4K random read/write IOPS figures, so why can't the same metric be provided for the hard disks?
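
For reference, the I/O granularities mentioned above can be inspected directly on a Linux node; /dev/sda is only an assumed example device here:

blockdev --getss  /dev/sda    # logical sector size reported by the disk (often 512)
blockdev --getpbsz /dev/sda   # physical sector size (4096 on Advanced Format disks)
getconf PAGESIZE              # OS memory page size (4096 on Linux x86-64)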

 

 

Exadata FAQ: Who stole my space?

Original link: http://www.dbaleet.org/exadata_faq_who_stoled_my_space/

After an Exadata installation is finished, almost every customer asks the same question:

"We bought your Exadata X2-2 quarter rack with 600GB high-performance disks. There are three storage cells with twelve disks each, so I should get 600GB x 12 x 3 = 21600GB, roughly 21.6TB of raw disk. With ASM normal redundancy that leaves 10.8TB net, and after a bit of overhead I expected about 10TB of usable space, which is also what your sales people told us. Why does ASM now show only 6.5TB usable? Did something go wrong? Was it created with high redundancy? What, no? Then did you reserve some space on every disk? That won't do; space is already very tight for us, and this result is unacceptable!"

 

ASMCMD [+] > lsdg
State    Type    Rebal  Sector  Block      AU  Total_MB   Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  15593472  15220648          5197824         5011412              0         N  DATA_DM01/
MOUNTED  NORMAL  N         512   4096  4194304    894720    893464           298240          297612              0         Y  DBFS_DG/
MOUNTED  NORMAL  N         512   4096  4194304   3896064   3879104          1298688         1290208              0         N  RECO_DM01/

 

 

This is in fact one of the most common misunderstandings about Exadata; I have been asked exactly this by customers at least ten times, which suggests the problem lies less with the customers than with some small gap in Oracle's messaging. Let's look at the customer's reasoning. First, pull up the Exadata X2-2 datasheet; for the quarter rack, here is the relevant excerpt on usable capacity:

The datasheet does state that a quarter rack with high-performance disks has 21.6TB of raw disk space and 9.5TB of usable space, which is indeed close to 10TB, and it carries two small footnotes, 3 and 4:

3. For raw capacity, 1 GB = 1 billion bytes. Capacity calculated using normal space terminology of 1 TB = 1024 * 1024 * 1024 * 1024 bytes. Actual formatted capacity is less.

4. Actual space available for a database after mirroring (ASM normal redundancy) while also providing adequate space (one disk on Quarter and Half Racks and two disks on a Full Rack) to reestablish the mirroring protection after a disk failure.

Footnotes 3 and 4 already explain part of the story, but they were not present in early versions of the datasheet and were only added fairly recently. Footnote 3 says that disk vendors count capacity with 1TB = 1000 * 1000 * 1000 * 1000 bytes, whereas the operating system counts 1TB = 1024 * 1024 * 1024 * 1024 bytes. Footnote 4 says the quarter rack reserves roughly one disk's worth of space so that mirroring can be re-established after a disk failure. Let's go through it in more detail.

Several factors combine to produce this result:
1. Disk vendors and operating systems use different units: the OS counts in powers of 1024, the vendor in powers of 1000. A "600GB" disk is therefore only 600*1000*1000*1000/1024/1024/1024 = 558GB. Each cell has 12 disks and a quarter rack has 3 cells, so there are 36 disks of 558GB, giving a true raw capacity of about 20116GB = 19.65TB.

2. The first two disks of each cell each set aside a 30GB partition for the operating system, so the available raw space is 20116GB - 30GB*2*3 = 19936GB = 19.47TB.

3. As footnote 4 says, Exadata also reserves space so that re-mirroring remains possible after a disk failure; on a quarter rack this is close to one disk's worth (the relative overhead is largest on a quarter rack). After ASM normal redundancy, the usable capacity is just under 9.5TB: (19936GB - 600GB)/2/1024 = 9.44TB.
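
The whole chain of arithmetic can be reproduced with a small awk sketch; all of the inputs are the figures quoted above:

awk 'BEGIN {
    raw    = 600 * 1000^3 / 1024^3 * 12 * 3;  # 36 vendor-"600GB" disks, counted in GiB (~20116)
    os     = raw - 30 * 2 * 3;                # minus the 30GB OS partitions on 2 disks per cell
    usable = (os - 600) / 2;                  # reserve ~1 disk, then halve for ASM normal redundancy
    printf "raw=%.0fGB  after OS=%.0fGB  usable=%.2fTB\n", raw, os, usable/1024;
}'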

For the meaning of Req_mir_free_MB and Usable_file_MB, see the documentation:

Req_mir_free_MB Amount of space that must be available in the disk group to restore full redundancy after the most severe failure that can be tolerated by the disk group. This is the REQUIRED_MIRROR_FREE_MB column from the V$ASM_DISKGROUP view.

Usable_file_MB Amount of free space, adjusted for mirroring, that is available for new files. From the V$ASM_DISKGROUP view.

Consider this scenario: one cell node goes down and is completely unavailable. If the space in use is below Usable_file_MB, ASM can still complete the rebalance normally. If usage is above that threshold, the remaining two cells do not have enough space to rebalance, which means part of the data is left with only a single copy (under normal redundancy all data normally has two copies) and the rebalance cannot finish. If a disk then fails in one of the two surviving cells, data loss becomes possible.

So Usable_file_MB is simply the amount of space that can be consumed while staying inside that safety margin.
In practice the cells can hold around 9TB (you can verify this yourself); it is just that once usage passes roughly 6.5TB, Usable_file_MB turns negative, and from that point a cell failure leads to the situation described above.
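
To watch this threshold directly, the same columns can be queried from V$ASM_DISKGROUP; a minimal sketch, assuming it is run as the grid owner with the ASM environment set:

sqlplus -s / as sysasm <<'EOF'
SELECT name, total_mb, free_mb, required_mirror_free_mb, usable_file_mb
  FROM v$asm_diskgroup;
EOF

A negative USABLE_FILE_MB is exactly the warning sign discussed above: there is no longer enough headroom to re-mirror after losing a whole failure group.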

Another point worth noting: a quarter rack cannot be configured with high redundancy, because it has only three cells and each cell maps to one failure group. High redundancy requires 5 voting disks, each placed in a different failure group, so you need at least a half rack before high redundancy is possible.

That is all.

Overview of the Exadata Storage Architecture

Original link: http://www.dbaleet.org/overview_of_exadata_storage_architecture/

This post gives a brief introduction to the Exadata storage architecture; after reading it you should have an initial picture of Exadata Storage.

The disk hierarchy in Exadata is very clear. From the bottom up it is: Physicaldisk => LUN => celldisk => griddisk => ASM disk

 

 

Note 1: the upper diagram shows the layout of the disks that hold the operating system (normally the first two disks); the lower diagram shows the layout of the non-OS disks (the remaining ten disks, which are not partitioned).

Note 2: at the RDBMS level a griddisk corresponds to an ASM disk; a griddisk and an ASM disk are the same thing, viewed from the Exadata Storage side and from the RDBMS side respectively.

Note 3: the current mapping between celldisks and griddisks is 1:m, i.e. one to many. Kevin Closson notes that, strictly speaking, the relationship is many to many, but for ease of understanding it is fine to think of one celldisk mapping to many griddisks:

we (Exadata development) considered supporting celldisk creation from HW RAID volumes 
but opted for a 1:1 relationship instead for many reasons. Griddisks are the virtualization of celldisks (the presentation form to ASM). 
To that end there is a M:M relationship between celldisks and griddisks.

Let's now walk through each layer from the bottom up.

First, the physical disks:

[root@dm01cel01 ~]#cellcli -e list physicaldisk

dm01cel01: 20:0 RETS0D normal
dm01cel01: 20:1 REXHAD normal
dm01cel01: 20:2 RE5VTD normal
dm01cel01: 20:3 RE5SYD normal
dm01cel01: 20:4 RDDTYD normal
dm01cel01: 20:5 RETB5D normal
dm01cel01: 20:6 RDDS0D normal
dm01cel01: 20:7 RDDULD normal
dm01cel01: 20:8 RDDPZD normal
dm01cel01: 20:9 REXS8D normal
dm01cel01: 20:10 RDDTBD normal
dm01cel01: 20:11 RDDT9D normal
dm01cel01: FLASH_1_0 1202M0CPA5 normal
dm01cel01: FLASH_1_1 1202M0CPA7 normal
dm01cel01: FLASH_1_2 1202M0CQKE normal
dm01cel01: FLASH_1_3 1202M0CPA6 normal
dm01cel01: FLASH_2_0 1202M0CQE6 normal
dm01cel01: FLASH_2_1 1202M0CQE0 normal
dm01cel01: FLASH_2_2 1202M0CQA3 normal
dm01cel01: FLASH_2_3 1202M0CQAL normal
dm01cel01: FLASH_4_0 1202M0CP0E normal
dm01cel01: FLASH_4_1 1202M0CP0D normal
dm01cel01: FLASH_4_2 1202M0CNXH normal
dm01cel01: FLASH_4_3 1202M0CP0A normal
dm01cel01: FLASH_5_0 1202M0CQAE normal
dm01cel01: FLASH_5_1 1202M0CQ9V normal
dm01cel01: FLASH_5_2 1202M0CQA0 normal
dm01cel01: FLASH_5_3 1202M0CQAD normal

From this output we can see that:

the entries named 20:* are the 12 physical hard disks in this cell;

the entries named FLASH_*_* are the flash cards: each card carries 4 FMODs and there are 4 cards, so 16 flash devices are visible.

At the OS level we can use df/fdisk to look at the cell's operating system, hard disk and flash device layout.

[root@dm01cel01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md6 9.9G 3.6G 5.9G 38% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/md8 2.0G 647M 1.3G 34% /opt/oracle
/dev/md4 116M 60M 50M 55% /boot
/dev/md11 2.3G 130M 2.1G 6% /var/log/oracle

The partitions used by the operating system are md6, md8, md4 and md11.

[root@dm01cel01 ~]# mdadm -Q -D /dev/md6
/dev/md6:
Version : 0.90
Creation Time : ......
Raid Level : raid1
Array Size : 10482304 (10.00 GiB 10.73 GB)
Used Dev Size : 10482304 (10.00 GiB 10.73 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 6
Persistence : Superblock is persistent

Update Time : ......
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
......

Number Major Minor RaidDevice State
0 8 6 0 active sync /dev/sda6
1 8 22 1 active sync /dev/sdb6

So /dev/md6 is in fact a software RAID1 built from /dev/sda6 and /dev/sdb6. Likewise md8, md4 and md11 are mirrors of /dev/sda8 and /dev/sdb8, /dev/sda4 and /dev/sdb4, and /dev/sda11 and /dev/sdb11 respectively. For brevity they are not all listed here.
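
Instead of running mdadm -Q -D once per device, all of the md mirrors on a cell can be checked in one pass; a small sketch:

cat /proc/mdstat                                  # summary of every md device and its member partitions
for md in /dev/md[0-9]*; do
    mdadm --query --detail "$md" | grep -E '^/dev|State :|Active Devices'
done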

[root@dm01cel01 ~]# fdisk -l

Disk /dev/sda: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          15      120456   fd  Linux raid autodetect
/dev/sda2              16          16        8032+  83  Linux
/dev/sda3              17       69039   554427247+  83  Linux
/dev/sda4           69040       72824    30403012+   f  W95 Ext'd (LBA)
/dev/sda5           69040       70344    10482381   fd  Linux raid autodetect
/dev/sda6           70345       71649    10482381   fd  Linux raid autodetect
/dev/sda7           71650       71910     2096451   fd  Linux raid autodetect
/dev/sda8           71911       72171     2096451   fd  Linux raid autodetect
/dev/sda9           72172       72432     2096451   fd  Linux raid autodetect
/dev/sda10          72433       72521      714861   fd  Linux raid autodetect
/dev/sda11          72522       72824     2433816   fd  Linux raid autodetect

Disk /dev/sdb: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          15      120456   fd  Linux raid autodetect
/dev/sdb2              16          16        8032+  83  Linux
/dev/sdb3              17       69039   554427247+  83  Linux
/dev/sdb4           69040       72824    30403012+   f  W95 Ext'd (LBA)
/dev/sdb5           69040       70344    10482381   fd  Linux raid autodetect
/dev/sdb6           70345       71649    10482381   fd  Linux raid autodetect
/dev/sdb7           71650       71910     2096451   fd  Linux raid autodetect
/dev/sdb8           71911       72171     2096451   fd  Linux raid autodetect
/dev/sdb9           72172       72432     2096451   fd  Linux raid autodetect
/dev/sdb10          72433       72521      714861   fd  Linux raid autodetect
/dev/sdb11          72522       72824     2433816   fd  Linux raid autodetect

Disk /dev/sdc: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table

Disk /dev/sde: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sde doesn't contain a valid partition table

Disk /dev/sdf: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdf doesn't contain a valid partition table

Disk /dev/sdg: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdg doesn't contain a valid partition table

Disk /dev/sdh: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdh doesn't contain a valid partition table

Disk /dev/sdi: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdi doesn't contain a valid partition table

Disk /dev/sdj: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdj doesn't contain a valid partition table

Disk /dev/sdk: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdk doesn't contain a valid partition table

Disk /dev/sdl: 598.9 GB, 598999040000 bytes
255 heads, 63 sectors/track, 72824 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdl doesn't contain a valid partition table

Disk /dev/sdm: 4009 MB, 4009754624 bytes
126 heads, 22 sectors/track, 2825 cylinders
Units = cylinders of 2772 * 512 = 1419264 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdm1               1        2824     3914053   83  Linux

Disk /dev/md1: 731 MB, 731906048 bytes
2 heads, 4 sectors/track, 178688 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md11: 2492 MB, 2492137472 bytes
2 heads, 4 sectors/track, 608432 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md11 doesn't contain a valid partition table

Disk /dev/md2: 2146 MB, 2146697216 bytes
2 heads, 4 sectors/track, 524096 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/md8: 2146 MB, 2146697216 bytes
2 heads, 4 sectors/track, 524096 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md8 doesn't contain a valid partition table

Disk /dev/md7: 2146 MB, 2146697216 bytes
2 heads, 4 sectors/track, 524096 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md7 doesn't contain a valid partition table

Disk /dev/md6: 10.7 GB, 10733879296 bytes
2 heads, 4 sectors/track, 2620576 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md6 doesn't contain a valid partition table

Disk /dev/md5: 10.7 GB, 10733879296 bytes
2 heads, 4 sectors/track, 2620576 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md5 doesn't contain a valid partition table

Disk /dev/md4: 123 MB, 123273216 bytes
2 heads, 4 sectors/track, 30096 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md4 doesn't contain a valid partition table

Disk /dev/sdn: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdn doesn't contain a valid partition table

Disk /dev/sdo: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdo doesn't contain a valid partition table

Disk /dev/sdp: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdp doesn't contain a valid partition table

Disk /dev/sdq: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdq doesn't contain a valid partition table

Disk /dev/sdr: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdr doesn't contain a valid partition table

Disk /dev/sds: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sds doesn't contain a valid partition table

Disk /dev/sdt: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdt doesn't contain a valid partition table

Disk /dev/sdu: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdu doesn't contain a valid partition table

Disk /dev/sdv: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdv doesn't contain a valid partition table

Disk /dev/sdw: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdw doesn't contain a valid partition table

Disk /dev/sdx: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdx doesn't contain a valid partition table

Disk /dev/sdy: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdy doesn't contain a valid partition table

Disk /dev/sdz: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdz doesn't contain a valid partition table

Disk /dev/sdaa: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdaa doesn't contain a valid partition table

Disk /dev/sdab: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdab doesn't contain a valid partition table

Disk /dev/sdac: 24.5 GB, 24575868928 bytes
255 heads, 63 sectors/track, 2987 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdac doesn't contain a valid partition table

The fdisk output tells us at least the following:

  • this cell uses HP (high performance) disks, each 600GB;
  • devices sda and sdb hold the operating system; each of these two disks is split into 11 partitions;
  • the 12 hard disks correspond to device names /dev/sda through /dev/sdl;
  • the 16 FDOMs correspond to device names /dev/sdn through /dev/sdac.

Next, let's look at the LUN information:

[root@dm01cel01 ~]#cellcli -e list LUN


dm01cel01: 0_0 0_0 normal
dm01cel01: 0_1 0_1 normal
dm01cel01: 0_2 0_2 normal
dm01cel01: 0_3 0_3 normal
dm01cel01: 0_4 0_4 normal
dm01cel01: 0_5 0_5 normal
dm01cel01: 0_6 0_6 normal
dm01cel01: 0_7 0_7 normal
dm01cel01: 0_8 0_8 normal
dm01cel01: 0_9 0_9 normal
dm01cel01: 0_10 0_10 normal
dm01cel01: 0_11 0_11 normal
dm01cel01: 1_0 1_0 normal
dm01cel01: 1_1 1_1 normal
dm01cel01: 1_2 1_2 normal
dm01cel01: 1_3 1_3 normal
dm01cel01: 2_0 2_0 normal
dm01cel01: 2_1 2_1 normal
dm01cel01: 2_2 2_2 normal
dm01cel01: 2_3 2_3 normal
dm01cel01: 4_0 4_0 normal
dm01cel01: 4_1 4_1 normal
dm01cel01: 4_2 4_2 normal
dm01cel01: 4_3 4_3 normal
dm01cel01: 5_0 5_0 normal
dm01cel01: 5_1 5_1 normal
dm01cel01: 5_2 5_2 normal
dm01cel01: 5_3 5_3 normal

Where:

0_0 through 0_11 are the LUNs of the 12 hard disks;

1_0 through 1_3 are the LUNs of the first flash card, and so on. The flash LUN numbers are sometimes not contiguous; that happens when a flash card has been replaced.
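
To see how these LUNs map back to the OS devices shown by fdisk, the LUN attributes can be listed as well; the attribute names below are the commonly available ones, so verify them on your image with "cellcli -e describe lun":

cellcli -e list lun attributes name, deviceName, diskType, isSystemLun, lunSize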

Next, let's look at the celldisks:

[root@dm01cel01 ~]#cellcli -e list celldisk


dm01cel01: CD_00_dm01cel01 normal
dm01cel01: CD_01_dm01cel01 normal
dm01cel01: CD_02_dm01cel01 normal
dm01cel01: CD_03_dm01cel01 normal
dm01cel01: CD_04_dm01cel01 normal
dm01cel01: CD_05_dm01cel01 normal
dm01cel01: CD_06_dm01cel01 normal
dm01cel01: CD_07_dm01cel01 normal
dm01cel01: CD_08_dm01cel01 normal
dm01cel01: CD_09_dm01cel01 normal
dm01cel01: CD_10_dm01cel01 normal
dm01cel01: CD_11_dm01cel01 normal
dm01cel01: FD_00_dm01cel01 normal
dm01cel01: FD_01_dm01cel01 normal
dm01cel01: FD_02_dm01cel01 normal
dm01cel01: FD_03_dm01cel01 normal
dm01cel01: FD_04_dm01cel01 normal
dm01cel01: FD_05_dm01cel01 normal
dm01cel01: FD_06_dm01cel01 normal
dm01cel01: FD_07_dm01cel01 normal
dm01cel01: FD_08_dm01cel01 normal
dm01cel01: FD_09_dm01cel01 normal
dm01cel01: FD_10_dm01cel01 normal
dm01cel01: FD_11_dm01cel01 normal
dm01cel01: FD_12_dm01cel01 normal
dm01cel01: FD_13_dm01cel01 normal
dm01cel01: FD_14_dm01cel01 normal
dm01cel01: FD_15_dm01cel01 normal

In the celldisk layer the hard disks and flash devices are simply prefixed CD and FD. The output also shows that every hard disk corresponds to one celldisk, and every FDOM of every flash card likewise corresponds to one celldisk.
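
The mapping from each celldisk down to its LUN/partition, and the space still unallocated to griddisks, can be listed the same way (attribute names as commonly available; check with "cellcli -e describe celldisk"):

cellcli -e list celldisk attributes name, diskType, devicePartition, size, freeSpace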

Finally, let's look at the griddisks:

[root@dm01cel01 ~]#cellcli -e list griddisk


dm01cel01: DATA_CD_00_dm01cel01 active
dm01cel01: DATA_CD_01_dm01cel01 active
dm01cel01: DATA_CD_02_dm01cel01 active
dm01cel01: DATA_CD_03_dm01cel01 active
dm01cel01: DATA_CD_04_dm01cel01 active
dm01cel01: DATA_CD_05_dm01cel01 active
dm01cel01: DATA_CD_06_dm01cel01 active
dm01cel01: DATA_CD_07_dm01cel01 active
dm01cel01: DATA_CD_08_dm01cel01 active
dm01cel01: DATA_CD_09_dm01cel01 active
dm01cel01: DATA_CD_10_dm01cel01 active
dm01cel01: DATA_CD_11_dm01cel01 active
dm01cel01: DBFS_DG_CD_02_dm01cel01 active
dm01cel01: DBFS_DG_CD_03_dm01cel01 active
dm01cel01: DBFS_DG_CD_04_dm01cel01 active
dm01cel01: DBFS_DG_CD_05_dm01cel01 active
dm01cel01: DBFS_DG_CD_06_dm01cel01 active
dm01cel01: DBFS_DG_CD_07_dm01cel01 active
dm01cel01: DBFS_DG_CD_08_dm01cel01 active
dm01cel01: DBFS_DG_CD_09_dm01cel01 active
dm01cel01: DBFS_DG_CD_10_dm01cel01 active
dm01cel01: DBFS_DG_CD_11_dm01cel01 active
dm01cel01: RECO_CD_00_dm01cel01 active
dm01cel01: RECO_CD_01_dm01cel01 active
dm01cel01: RECO_CD_02_dm01cel01 active
dm01cel01: RECO_CD_03_dm01cel01 active
dm01cel01: RECO_CD_04_dm01cel01 active
dm01cel01: RECO_CD_05_dm01cel01 active
dm01cel01: RECO_CD_06_dm01cel01 active
dm01cel01: RECO_CD_07_dm01cel01 active
dm01cel01: RECO_CD_08_dm01cel01 active
dm01cel01: RECO_CD_09_dm01cel01 active
dm01cel01: RECO_CD_10_dm01cel01 active
dm01cel01: RECO_CD_11_dm01cel01 active

From this output, every celldisk is carved into three griddisks, corresponding to the three ASM disk groups DATA, DBFS_DG and RECO; in other words the DATA diskgroup only uses the griddisks with the DATA prefix, and so on. With the default, non-interleaved layout, the DATA griddisks are created first and have the smallest offset, so they sit on the outermost and fastest tracks of the disk, while the RECO griddisks sit on the innermost and slowest tracks. (Update: if DBFS is present, DBFS sits on the innermost tracks and is slower than RECO.)
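
This ordering can be confirmed by listing each griddisk's offset on its celldisk; a smaller offset means an outer, faster region of the platter (again, verify the attribute names with "cellcli -e describe griddisk" on your cell image):

cellcli -e list griddisk attributes name, cellDisk, offset, size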

Built-in Security Features of Exadata

Original link: http://www.dbaleet.org/buildin_security_features_of_exadata/

Oracle's white paper "ORACLE EXADATA DATABASE MACHINE SECURITY OVERVIEW" gives an overall summary of Exadata's security features. But hold on, don't rush to open it. Why? Because it says essentially nothing! Let me summarize it briefly. The paper lists the following "major security features of Exadata":

1. Exadata can use the Advanced Security Option to encrypt and decrypt data; with AES, the hardware encryption support built into the Intel Xeon 5600 CPUs greatly speeds up encryption and decryption;

2. Exadata can use Oracle Database Vault for database access control;

3. Exadata can use Oracle Audit Vault to audit the database;

4. Exadata can use Oracle Database Firewall to guard against unauthorized access and injection attacks;

5. Oracle ERP/SAP has been certified with Advanced Security Option, Oracle Database Vault, Oracle Audit Vault and Oracle Database Firewall.

Come on, do these really count as Exadata security features? I give up...

Beyond supporting the standard security components and features of Oracle Database 11gR2, Exadata provides three security modes of its own: Open Security Mode, ASM-Scoped Security Mode and Database-Scoped Security Mode. The three modes differ in whether, and how, access from ASM/database clients to the grid disks on the cell nodes is restricted.

 

 

Open Security Mode

The simplest is obviously Open Security Mode; as the word "open" suggests, it is the lowest security level. In other words there is no restriction at all on access to the cell's grid disks, and any database client can reach them. The official manual says: "Open security mode is useful for test or development databases where there are no security requirements." It is clearly meant for systems without special security needs, mainly development and test. Grid disks created by default on Exadata are in this mode.

 

ASM-Scoped Security Mode

ASM-Scoped Security Mode tightens things up relative to Open Security Mode: the grid disks on a cell may only be accessed by a specific ASM client.

a. On the DB nodes:

Shut down all databases and ASM instances affected by the griddisk change.

b. On the cell nodes:

1. In CellCLI, create a key with the CREATE KEY command, for example:

CellCLI> CREATE KEY
66e12adb996805358bf82258587f5050

2. In CellCLI, use the ASSIGN KEY command to assign this key to the specific ASM client (strictly speaking, to the ASM unique name; the default +ASM is used below):

CellCLI> ASSIGN KEY FOR '+ASM'='66e12adb996805358bf82258587f5050'

The unique name can be obtained with either of the following statements:

SQL> SHOW PARAMETER db_unique_name;
SQL> SELECT name, value FROM V$PARAMETER WHERE name = 'db_unique_name';

3. In CellCLI, set the availableTo attribute on all the relevant grid disks to the ASM name.

To create new grid disks with this security policy:

CellCLI> CREATE GRIDDISK ALL PREFIX=report, size=75G, availableTo='+ASM'

To change the policy on existing grid disks:

CellCLI> ALTER GRIDDISK report_CD_01_cell01, report_CD_02_cell01, report_CD_03_cell01, report_CD_04_cell01, report_CD_05_cell01, report_CD_06_cell01 availableTo='+ASM'

Note: the commands above only restrict the grid disks on cell01. To restrict the grid disks on all cells it is best to use dcli, as sketched below.
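
A minimal dcli sketch for pushing the same key and policy to every cell at once; cell_group is the usual file listing all cell hostnames, so adjust the path, the key and the griddisk selection for your environment:

dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root \
    "cellcli -e \"ASSIGN KEY FOR '+ASM'='66e12adb996805358bf82258587f5050'\""
dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root \
    "cellcli -e \"ALTER GRIDDISK ALL availableTo='+ASM'\""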

c. On the DB nodes:

1. As the ASM software owner (usually grid), create /etc/oracle/cell/network-config/cellkey.ora with permissions 600:

key=66e12adb996805358bf82258587f5050
asm=+asm
realm=my_realm

2. Restart the affected ASM and database instances.

Database-Scoped Security Mode

Database-Scoped Security Mode is a further refinement of ASM-Scoped Security Mode: the grid disks on a cell may only be accessed by specific database clients.

a. On the DB nodes:

Shut down all databases and ASM instances affected by the griddisk change.

b. On the cell nodes:

1. In CellCLI, create one key per database with the CREATE KEY command, for example:

CellCLI> CREATE KEY
51a826646ebe1f29e33c6ed7c4965c9a

CellCLI> CREATE KEY
bd0843beeed5e18e6664576cf9805b69

CellCLI> CREATE KEY
6679ef9ec02fa664582c3464d4b0191f

 

2. In CellCLI, use the ASSIGN KEY command to assign each key to the corresponding database (strictly speaking, to its database unique name):

CellCLI> ASSIGN KEY FOR
'ebs'='51a826646ebe1f29e33c6ed7c4965c9a',
'peoplesoft'='bd0843beeed5e18e6664576cf9805b69',
'siebel'='6679ef9ec02fa664582c3464d4b0191f'

 

3. In CellCLI, set the availableTo attribute on the grid disks to include both the ASM name and the database name.

To create new grid disks with this security policy:

CellCLI> CREATE GRIDDISK report_CD_00_cell01, report_CD_01_cell01 size=75G, -
availableTo='+asm,ebs'
CellCLI> CREATE GRIDDISK report_CD_02_cell01, report_CD_03_cell01 size=75G, -
availableTo='+asm,peoplesoft'
CellCLI> CREATE GRIDDISK report_CD_04_cell01, report_CD_05_cell01 size=75G, -
availableTo='+asm,siebel'

To change the policy on existing grid disks:

CellCLI> ALTER GRIDDISK report_CD_01_cell01, report_CD_02_cell01 -
availableTo='+asm,ebs'
CellCLI> ALTER GRIDDISK report_CD_03_cell01, report_CD_04_cell01 -
availableTo='+asm,peoplesoft'
CellCLI> ALTER GRIDDISK report_CD_05_cell01, report_CD_06_cell01 -
availableTo='+asm,siebel'

c. On the DB nodes:

1. Create a cellkey.ora file under $ORACLE_HOME/admin/ebs/pfile/, owned by oracle:dba with permissions 600:

key=51a826646ebe1f29e33c6ed7c4965c9a
asm=+asm
realm=my_realm

Repeat the same steps to create cellkey.ora for the other databases (a small loop is sketched below).
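
A hedged bash sketch that creates all three files in one go, using the keys and databases from the example above (run as root on the DB node, with ORACLE_HOME set to the database home):

for entry in ebs:51a826646ebe1f29e33c6ed7c4965c9a \
             peoplesoft:bd0843beeed5e18e6664576cf9805b69 \
             siebel:6679ef9ec02fa664582c3464d4b0191f; do
    db=${entry%%:*}; key=${entry#*:}
    dir=$ORACLE_HOME/admin/$db/pfile
    mkdir -p "$dir"
    printf 'key=%s\nasm=+asm\nrealm=my_realm\n' "$key" > "$dir/cellkey.ora"
    chown oracle:dba "$dir/cellkey.ora"     # ownership and mode as required by this security mode
    chmod 600 "$dir/cellkey.ora"
done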

2. Restart the affected ASM and database instances.

The above describes the three security modes built into Exadata and how to implement them. Very few customers currently run anything other than Open Security Mode, so this is mainly material for studying Exadata's security features; if you need to implement it in production, test it thoroughly first.

That is all.

 


© Steven Lee for Oracle Exadata & Best practices, 2013. |

Exadata Database Machine Special Topic

[Figure: Exadata model]

What is Exadata?

Exadata is an engineered database machine that combines hardware and software. It is pre-configured at the factory, and once it arrives on site it can be unpacked, powered on and put to use.

The hardware is provided by SUN!

The software is provided by Oracle: the Database Server plus the Exadata Storage Server software, codenamed SAGE.

Exadata stands for massive parallelism, the highest RDBMS performance benchmark, fault tolerance and scalability.

[Figure: Exadata V1]

Exadata History

Version 1

First promoted during OOW 2008 and developed jointly by Oracle and HP, it was the fastest data warehouse machine in the world at the time. It added extra optimizations for sequential physical reads and was 10x faster than an Oracle data warehouse on other hardware platforms.

[Figure: Exadata V2]

Version 2

Released in September 2009 and developed jointly by Oracle and SUN, it was the fastest OLTP machine in the world at the time. It added extra optimizations for random reads, was 5x faster than the Version 1 Exadata for DW workloads, and introduced eye-catching new Exadata Storage Software capabilities.
The biggest selling point of Exadata is undoubtedly its Smart Scan Processing technology.

[Figure: Exadata Smart Scan]

The core of Smart Scan Processing is offloading. "Offload" is hard to translate precisely, but you can think of it as handing part of the scan work over to the Exadata storage cells.

Suppose a query has to scan a 1GB table but only 10MB of the data actually satisfies the predicates. A traditional architecture cannot avoid making the database server scan the whole 1GB.

Exadata instead offloads that work onto the Exadata Cell storage nodes: the cells scan the 1GB and return only the 10MB to the database server. This division of labor makes sense because the storage cells are better at physical scanning and sit closer to the physical disks.
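
On a running system the effect of offloading shows up in a couple of V$SYSSTAT statistics (the same statistics are discussed in the AWR section later in this document); a minimal sketch run from a DB node:

sqlplus -s / as sysdba <<'EOF'
SELECT name, ROUND(value/1024/1024/1024, 2) AS gb
  FROM v$sysstat
 WHERE name IN ('cell physical IO bytes eligible for predicate offload',
                'cell physical IO interconnect bytes returned by smart scan');
EOF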

 

What actually implements Smart Scan is the Oracle Exadata Storage Software, codenamed Sage, whose development probably started in 2006 or even earlier!

The Oracle Exadata Storage Software (SAGE) is the soul of Exadata: intelligent storage software, developed in-house by Oracle, that understands database SQL. Because SAGE is the soul of the machine, you cannot clone an Exadata database machine just by piling up flash cards, InfiniBand and other hardware.

Current Exadata users include Starbucks, Facebook, Huawei, China Mobile, Bank of Shanghai, ICBC, Apple, Samsung Electronics, LG, BNP Paribas, Korea Telecom, Asiana Airlines, Commonwealth Bank of Australia, SoftBank, Haier, Starwood Hotels, Nissan, PayPal, Turk Telekom, the Kanagawa Prefectural Police, Sumitomo Mitsui Banking Corporation, Hua Xia Bank, PICC Life Insurance, the Shenzhen, Qingdao, Urumqi and Benxi Human Resources and Social Security Bureaus, Xinjiang Telecom, Guangdong Mobile, Liaoning Mobile, Fujian Mobile, Shenhua Group, Dongfeng Motor, Haier Group, CISDI Information Technology (Chongqing), the Shanghai R&D Public Service Platform, COSCO Container Lines, Inner Mongolia Power Grid, Qirong Puhui (Beijing) Technology, Infosys and the Hong Kong Housing Department, among others.

 

The Exadata user base

The Exadata hardware has three main parts:

  • the Database Server, sometimes called the Compute Node
  • the Storage Server, also called the Cell Node
  • the InfiniBand Switch, abbreviated IB SW

As shown in the figures below:

[Figure: Exadata architecture]

 

 

 

[Figure: Exadata architecture (2)]

 

 

 

Storage Servers of Oracle Exadata X2-2 and X2-8

[Figure: Exadata X2-2/X2-8 Storage Server]

 

 

 

Oracle Exadata V2 Storage Servers

[Figure: Exadata V2 Storage Server]

 

 

 

 

Exadata machines on SUN's assembly line

[Figure: Exadata machines on the assembly line]

 

 

 

 

 

========================================================================

 

Exadata Storage Server Hardware Configuration

 

 

 

                          V2                       X2-2
CPU                       Nehalem-EP               Westmere-EP
DIMMs (compute node)      8 GB at 1066 MHz         8 GB LV DIMMs at 1333 MHz (post-RR)
HBA                       Niwot with BBU07         Niwot with BBU08
NIC (compute node only)   No 10 GE ports           Niantic 10 GE ports
Compute node HDD          146 GB SAS2 drives       300 GB SAS2 drives
Storage node - SAS        600 GB SAS2 15 kRPM      Same as V2
Storage node - SATA       2 TB SATA 7.2 kRPM       2 TB FAT-SAS 7.2 kRPM
Aura (4x)                 V1.0                     V1.0 (V1.1 post-RR)
IB                        CX                       CX2

 

 

 

Operating systems on Exadata

Compute node (database server) operating system: Solaris or Linux.

Oracle Enterprise Linux (OEL) and Solaris 11 (x86) are the two OS choices currently available for the Exadata database servers; Solaris is only an option on Exadata V2 and later models.

Users can choose their preferred OS at installation time, and the Exadata image ships in both flavors.

The factory-installed image offers a Linux/Solaris dual-boot menu, and the user picks the default boot OS.

 

 

Exadata storage capacity by configuration

                                  Full Rack   Half Rack   Quarter Rack   One Cell
Raw Disk Capacity                 432 TB      216 TB      96 TB          24 TB
Raw Flash Capacity                6.75 TB     3.4 TB      1.5 TB         0.375 TB
Usable Mirrored Capacity          194 TB      97 TB       42.5 TB        10.75 TB
Usable Triple Mirrored Capacity   130 TB      65 TB       29 TB          7.25 TB

Reimaging Exadata by remotely mounting an ISO via ILOM

Reference (PDF): https://www.askmac.cn/wp-content/uploads/2013/05/Exadata刷机实验-MacleanLiu.pdf

[Exadata] Exadata Cell Monitoring Best Practices

  1. Verify cable connections via the following steps

Visually inspect all cables for proper connectivity.

 

Confirm that the cables are connected properly:

 

 

 

[root@dm01db01 ~]# cat /sys/class/net/ib0/carrier

1

[root@dm01db01 ~]# cat /sys/class/net/ib1/carrier

1

 

Confirm that the output is 1.

 

 

Also check the InfiniBand port error counters with these commands:

ls -l /sys/class/infiniband/*/ports/*/*errors*

 

 

The /opt/oracle.SupportTools/ibdiagtools directory contains the verify-topology and infinicheck tools; run them to verify the network. Here is some information on these tools:

 

[root@dm01db01 ~]# cd /opt/oracle.SupportTools/

[root@dm01db01 oracle.SupportTools]# ls

asrexacheck         defaultOSchoose.pl  firstconf                        make_cellboot_usb  PS4ES            sys_dirs.tar

CheckHWnFWProfile   diagnostics.iso     flush_cache.sh                   MegaSAS.log        reclaimdisks.sh

CheckSWProfile.sh   em                  harden_passwords_reset_root_ssh  ocrvothostd        setup_ssh_eq.sh

dbserver_backup.sh  exachk              ibdiagtools                      onecommand         sundiag.sh

 

 

[root@dm01db01 oracle.SupportTools]# cd ibdiagtools/

[root@dm01db01 ibdiagtools]# ls

cells_conntest.log    dcli                  ibqueryerrors.log  perf_cells.log0  perf_mesh.log1     subnet_cells.log  VERSION_FILE

cells_user_equiv.log  diagnostics.output    infinicheck        perf_cells.log1  perf_mesh.log2     subnet_hosts.log  xmonib.sh

checkbadlinks.pl      hosts_conntest.log    monitord           perf_cells.log2  README             topologies

cleanup_remote.log    hosts_user_equiv.log  netcheck           perf_hosts.log0  SampleOutputs.txt  topology-zfs

clearcounters.log     ibping_test           netcheck_scratch   perf_mesh.log0   setup-ssh          verify-topology

 

 

 

[root@dm01db01 ibdiagtools]# ./verify-topology -h

 

[ DB Machine Infiniband Cabling Topology Verification Tool ]

[Version IBD VER 2.c 11.2.3.1.1  120607]

Usage: ./verify-topology [-v|--verbose] [-r|--reuse (cached maps)]  [-m|--mapfile]

[-ibn|--ibnetdiscover (specify location of ibnetdiscover output)]

[-ibh|--ibhosts (specify location of ibhosts output)]

[-ibs|--ibswitches (specify location of ibswitches output)]

[-t|--topology [torus | quarterrack ] default is fattree]

[-a|--additional [interconnected_quarterrack]

[-factory|--factory non-exadata machines are treated as error]

 

Please note that halfrack is now redundant. Checks for Half Racks

are now done by default.

-t quarterrack

option is needed to be used only if testing on a stand alone quarterrack

-a interconnected_quarterrack

option is to be used only when testing on large multi-rack setups

-t fattree

option is the default option and not required to be specified

 

Example : perl ./verify-topology

Example : ././verify-topology -t quarterrack

Example : ././verify-topology -t torus

Example : ././verify-topology -a interconnected_quarterrack

——— Some Important properties of the fattree cabling topology————–

(1) Every internal switch must be connected to every external switch

(2) No 2 external switches must be connected to each other

——————————————————————————-

Please note that switch guid can be determined by logging in to a switch and

trying either of these commands, depending on availability –

>module-firmware show

OR

>opensm

 

 

 

[root@dm01db01 ibdiagtools]# ./verify-topology -t fattree

 

[ DB Machine Infiniband Cabling Topology Verification Tool ]

[Version IBD VER 2.c 11.2.3.1.1  120607]

External non-Exadata-image nodes found: check for ZFS if on T4-4 – else ignore

Leaf switch found: dmibsw03.acs.oracle.com (212846902ba0a0)

Spine switch found: 10.146.24.251 (2128469c74a0a0)

Leaf switch found: dmibsw02.acs.oracle.com (21284692d4a0a0)

Spine switch found: 10.146.24.252 (2128b7f744c0a0)

Spine switch found: dmibsw01.acs.oracle.com (21286cc7e2a0a0)

Spine switch found: 10.146.24.253 (2128b7ac44c0a0)

 

Found 2 leaf, 4 spine, 0 top spine switches

 

Check if all hosts have 2 CAs to different switches……………[SUCCESS]

Leaf switch check: cardinality and even distribution…………..[SUCCESS]

Spine switch check: Are any Exadata nodes connected …………..[SUCCESS]

Spine switch check: Any inter spine switch links………………[ERROR]

Spine switches 10.146.24.251 (2128469c74a0a0) & 10.146.24.252 (2128b7f744c0a0) should not be connected

[ERROR]

Spine switches 10.146.24.251 (2128469c74a0a0) & 10.146.24.253 (2128b7ac44c0a0) should not be connected

[ERROR]

Spine switches 10.146.24.252 (2128b7f744c0a0) & dmibsw01.acs.oracle.com (21286cc7e2a0a0) should not be connected

[ERROR]

Spine switches 10.146.24.252 (2128b7f744c0a0) & 10.146.24.253 (2128b7ac44c0a0) should not be connected

[ERROR]

Spine switches dmibsw01.acs.oracle.com (21286cc7e2a0a0) & 10.146.24.253 (2128b7ac44c0a0) should not be connected

 

Spine switch check: Any inter top-spine switch links…………..[SUCCESS]

Spine switch check: Correct number of spine-leaf links…………[ERROR]

Leaf switch dmibsw03.acs.oracle.com (212846902ba0a0) must be linked

to spine switch 10.146.24.252 (2128b7f744c0a0) with

at least 1 links…0 link(s) found

[ERROR]

Leaf switch dmibsw02.acs.oracle.com (21284692d4a0a0) must be linked

to spine switch 10.146.24.252 (2128b7f744c0a0) with

at least 1 links…0 link(s) found

[ERROR]

Spine switch 10.146.24.252 (2128b7f744c0a0) has fewer than 2 links to leaf switches.

It has 0

[ERROR]

Leaf switch dmibsw03.acs.oracle.com (212846902ba0a0) must be linked

to spine switch 10.146.24.253 (2128b7ac44c0a0) with

at least 1 links…0 link(s) found

[ERROR]

Leaf switch dmibsw02.acs.oracle.com (21284692d4a0a0) must be linked

to spine switch 10.146.24.253 (2128b7ac44c0a0) with

at least 1 links…0 link(s) found

[ERROR]

Spine switch 10.146.24.253 (2128b7ac44c0a0) has fewer than 2 links to leaf switches.

It has 0

 

Leaf switch check: Inter-leaf link check……………………..[ERROR]

Leaf switches dmibsw03.acs.oracle.com (212846902ba0a0) & dmibsw02.acs.oracle.com (21284692d4a0a0) have 0 links between them

They should have 7 links instead.

 

Leaf switch check: Correct number of leaf-spine links………….[SUCCESS]

 

 

 

 

Verify the hardware and firmware:

 

cd /opt/oracle.cellos/

[root@dm01db01 oracle.cellos]# ./CheckHWnFWProfile

 

[SUCCESS] The hardware and firmware profile matches one of the supported profiles

 

 

Verify the platform software:

 

 

 

 

 

[root@dm01db01 oracle.cellos]# cd /opt/oracle.SupportTools/

[root@dm01db01 oracle.SupportTools]# ./CheckSWProfile.sh

usage: ./CheckSWProfile.sh options

 

This script returns 0 when the platform and software on the

machine on which it runs matches one of the suppored platform and

software profiles. It will return nonzero value in all other cases.

The check is applicable both to Exadata Cells and Database Nodes

with Oracle Enterprise Linux (OEL) and RedHat Enterprise Linux (RHEL).

 

OPTIONS:

-h    Show this message

-s    Show supported platforms and software profiles for this machine

-c    Check this machine for supported platform and software profiles

-I <No space comma separated list of Infiniband switch names/ip addresses>

To check configuration for SPINE switch prefix the switch host name or

ip address with IS_SPINE.

Example: CheckSWProfile.sh -I IS_SPINEswitch1.company.com,switch2.company.com

Check for the software revision on the managed Infiniband switches

in the Database Machine. You will need to supply the password for

admin user.

-S <No space comma separated list of Infiniband switch names/ip addresses>

Example: CheckSWProfile.sh -S switch1.company.com,switch2.company.com

Prints the Serial number and Hardware version for the switches

in the Database Machine. You will need to supply the password for

admin user for Voltaire switches and root user for Sun switches.

 

 

[root@dm01db01 oracle.SupportTools]# ./CheckSWProfile.sh  -c

[INFO] Software checker check option is only available on Exadata cells.

 

[root@dm01db01 oracle.SupportTools]# ssh dm01cel01-priv

 

[root@dm01cel01 oracle.SupportTools]# ./CheckSWProfile.sh -c

 

[INFO] SUCCESS: Meets requirements of operating platform and InfiniBand software.

[INFO] Check does NOT verify correctness of configuration for installed software.

 

 

[root@dm01cel01 oracle.SupportTools]# cd /opt/oracle.cellos/

[root@dm01cel01 oracle.cellos]# ./CheckHWnFWProfile

[SUCCESS] The hardware and firmware profile matches one of the supported profiles

 

 

 

If hardware is replaced, rerun the /opt/oracle.cellos/CheckHWnFWProfile script.

Exadata Health Check Report

The Exadata Database Machine combines the best of Oracle's software and hardware technology, and regular health checks should not be neglected either.

Exadata health checks are mainly based on Exachk, Oracle Support's standardized tool. For a detailed introduction to Exachk see:

An overview of Exachk and its best practices: run Exachk regularly to collect system information from the Exadata machine, compare the current configuration against Oracle best practices, and get recommended values, so that potential problems are found and eliminated early and the Exadata system keeps running stably.

Registration: https://oracleaw.webex.com/oracleaw/onstage/g.php?d=592264766&t=a

Here we only give the concrete steps for running Exachk:

 

$./exachk

CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/grid?[y/n][y]y

Checking ssh user equivalency settings on all nodes in cluster

./exachk: line 8674: [: 5120: unary operator expected

Space available on ware at /tmp is KB and required space is 5120 KB

Please make at least 10MB space available at above location and retry to continue.[y/n][y]?

 

 

You need to set RAT_CLUSTERNODES to specify the nodes to be checked:

 

 

su - oracle

$export RAT_CLUSTERNODES="dm01db01-priv dm01db02-priv"

export RAT_DBNAMES="orcl,dbm"
$ ./exachk

[oracle@192 tmp]$ ./exachk

CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.3/grid?[y/n][y]y

Checking ssh user equivalency settings on all nodes in cluster

Node dm01db01-priv is configured for ssh user equivalency for oracle user

Node dm01db02-priv is configured for ssh user equivalency for oracle user

Searching out ORACLE_HOME for selected databases.

. . . . 

Checking Status of Oracle Software Stack - Clusterware, ASM, RDBMS

. . . . . . . . . . . . . . . . . . . /u01/app/11.2.0.3/grid/bin/cemutlo.bin: Failed to initialize communication with CSS daemon, error code 3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
-------------------------------------------------------------------------------------------------------
                                                 Oracle Stack Status                            
-------------------------------------------------------------------------------------------------------
Host Name  CRS Installed  ASM HOME       RDBMS Installed  CRS UP    ASM UP    RDBMS UP  DB Instance Name
-------------------------------------------------------------------------------------------------------
192         Yes             Yes             Yes             Yes        No       Yes                
dm01db01-priv Yes             Yes             Yes             Yes        No       Yes                
dm01db02-priv Yes             Yes             Yes             Yes        No       Yes                
-------------------------------------------------------------------------------------------------------

root user equivalence is not setup between 192 and STORAGE SERVER dm01cel01.

1. Enter 1 if you will enter root password for each STORAGE SERVER when prompted.

2. Enter 2 to exit and configure root user equivalence manually and re-run exachk.

3. Enter 3 to skip checking best practices on STORAGE SERVER.

Please indicate your selection from one of the above options[1-3][1]:- 1

Is root password same on all STORAGE SERVER?[y/n][y]y

Enter root password for STORAGE SERVER :- 

97 of the included audit checks require root privileged data collection on DATABASE SERVER. If sudo is not configured or the root password is not available, audit checks which  require root privileged data collection can be skipped.

1. Enter 1 if you will enter root password for each on DATABASE SERVER host when prompted

2. Enter 2 if you have sudo configured for oracle user to execute root_exachk.sh script on DATABASE SERVER

3. Enter 3 to skip the root privileged collections on DATABASE SERVER

4. Enter 4 to exit and work with the SA to configure sudo on DATABASE SERVER or to arrange for root access and run the tool later.

Please indicate your selection from one of the above options[1-4][1]:- 1

Is root password same on all compute nodes?[y/n][y]y

Enter root password on DATABASE SERVER:- 

9 of the included audit checks require nm2user privileged data collection on INFINIBAND SWITCH .

1. Enter 1 if you will enter nm2user password for each INFINIBAND SWITCH when prompted

2. Enter 2 to exit and to arrange for nm2user access and run the exachk later.

3. Enter 3 to skip checking best practices on INFINIBAND SWITCH

Please indicate your selection from one of the above options[1-3][1]:- 3

*** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***

Log file for collections and audit checks are at
/tmp/exachk_040613_105703/exachk.log

Starting to run exachk in background on dm01db01-priv

Starting to run exachk in background on dm01db02-priv

=============================================================
                    Node name - 192                                
=============================================================
Collecting - Compute node PCI bus slot speed for infiniband HCAs
Collecting - Kernel parameters
Collecting - Maximum number of semaphore sets on system
Collecting - Maximum number of semaphores on system
Collecting - Maximum number of semaphores per semaphore set
Collecting - Patches for Grid Infrastructure 
Collecting - Patches for RDBMS Home 
Collecting - RDBMS patch inventory
Collecting - number of semaphore operations per semop system call
Preparing to run root privileged commands on DATABASE SERVER 192.

Starting to run root privileged commands in background on STORAGE SERVER dm01cel01

root@192.168.64.131's password: 

Starting to run root privileged commands in background on STORAGE SERVER dm01cel02

root@192.168.64.132's password: 

Starting to run root privileged commands in background on STORAGE SERVER dm01cel03

root@192.168.64.133's password: 
Collecting - Ambient Temperature on storage server 
Collecting - Exadata Critical Issue EX10 
Collecting - Exadata Critical Issue EX11 
Collecting - Exadata software version on storage server 
Collecting - Exadata software version on storage servers 
Collecting - Exadata storage server system model number  
Collecting - RAID controller version on storage servers 
Collecting - Verify Disk Cache Policy on storage servers 
Collecting - Verify Electronic Storage Module (ESM) Lifetime is within Specification  
Collecting - Verify Hardware and Firmware on Database and Storage Servers (CheckHWnFWProfile) [Storage Server] 
Collecting - Verify Master (Rack) Serial Number is Set [Storage Server] 
Collecting - Verify PCI bridge is configured for generation II on storage servers 
Collecting - Verify RAID Controller Battery Condition [Storage Server] 
Collecting - Verify RAID Controller Battery Temperature [Storage Server] 
Collecting - Verify There Are No Storage Server Memory (ECC) Errors  
Collecting - Verify service exachkcfg autostart status on storage server 
Collecting - Verify storage server disk controllers use writeback cache  
Collecting - verify asr exadata configuration check via ASREXACHECK on storage servers 
Collecting - Configure Storage Server alerts to be sent via email 
Collecting - Exadata Celldisk predictive failures 
Collecting - Exadata Critical Issue EX9 
Collecting - Exadata storage server root filesystem free space 
Collecting - HCA firmware version on storage server 
Collecting - OFED Software version on storage server 
Collecting - OSWatcher status on storage servers 
Collecting - Operating system and Kernel version on storage server 
Collecting - Scan storage server alerthistory for open alerts 
Collecting - Storage server flash cache mode 
Collecting - Verify Data Network is Separate from Management Network on storage server 
Collecting - Verify Ethernet Cable Connection Quality on storage servers 
Collecting - Verify Exadata Smart Flash Cache is created 
Collecting - Verify Exadata Smart Flash Log is Created 
Collecting - Verify InfiniBand Cable Connection Quality on storage servers 
Collecting - Verify Software on Storage Servers (CheckSWProfile.sh)  
Collecting - Verify average ping times to DNS nameserver 
Collecting - Verify celldisk configuration on disk drives 
Collecting - Verify celldisk configuration on flash memory devices 
Collecting - Verify griddisk ASM status 
Collecting - Verify griddisk count matches across all storage servers where a given prefix name exists 
Collecting - Verify storage server metric CD_IO_ST_RQ 
Collecting - Verify there are no griddisks configured on flash memory devices 
Collecting - Verify total number of griddisks with a given prefix name is evenly divisible of celldisks 
Collecting - Verify total size of all griddisks fully utilizes celldisk capacity 
Collecting - mpt_cmd_retry_count from /etc/modprobe.conf on Storage Servers 

Data collections completed. Checking best practices on 192.
--------------------------------------------------------------------------------------

 FAIL =>    CSS misscount should be set to the recommended value of 60
 FAIL =>    Database Server InfiniBand network MTU size is NOT 65520
 WARNING => Database has one or more dictionary managed tablespace
 WARNING => RDBMS Version is NOT 11.2.0.2 as expected
 FAIL =>    Storage Server alerts are not configured to be sent via email
 FAIL =>    Management network is not separate from data network
 WARNING => NIC bonding is NOT configured for public network (VIP)
 WARNING => NIC bonding is  not configured for interconnect
 WARNING => SYS.AUDSES$ sequence cache size < 10,000  WARNING => GC blocks lost is occurring
 WARNING => Some tablespaces are not using Automatic segment storage management.
 WARNING => SYS.IDGEN1$ sequence cache size < 1,000  WARNING => Interconnect is configured on routable network addresses
 FAIL =>    Some data or temp files are not autoextensible
 FAIL =>    One or more Ethernet network cables are not connected.
 WARNING => Multiple RDBMS instances discovered, observe database consolidation best practices
 INFO =>    ASM griddisk,diskgroup and Failure group mapping not checked.
 FAIL =>    One or more storage server has stateless alerts with null "examinedby" fields.
 WARNING => Standby is not opened read only with managed recovery in real time apply mode
 FAIL =>    Managed recovery process is not running
 FAIL =>    Flashback on PRIMARY is not configured
 WARNING => Standby is not in READ ONLY WITH APPLY mode
 FAIL =>    Flashback on STANDBY is not configured
 FAIL =>    No one high redundancy diskgroup configured
 INFO =>    Operational Best Practices
 INFO =>    Consolidation Database Practices
 INFO =>    Network failure prevention best practices
 INFO =>    Computer failure prevention best practices
 INFO =>    Data corruption prevention best practices
 INFO =>    Logical corruption prevention best practices
 INFO =>    Storage failures prevention best practices
 INFO =>    Database/Cluster/Site failure prevention best practices
 INFO =>    Client failover operational best practices
 FAIL =>    Some bigfile tablespaces do not have non-default maxbytes values set
 FAIL =>    Standby database is not in sync with primary database
 FAIL =>    Redo transport from primary to standby has more than 5 minutes or more lag
 FAIL =>    Standby database is not in sync with primary database
 FAIL =>    System may be exposed to Exadata Critical Issue DB11 /u01/app/oracle/product/11.2.0.3/dbhome_1
 FAIL =>    System may be exposed to Exadata Critical Issue DB11 /u01/app/oracle/product/11.2.0.3/orcl
 INFO =>    Software maintenance best practices
 FAIL =>    Operating system hugepages count does not satisfy total SGA requirements
 FAIL =>    Table AUD$[FGA_LOG$] should use Automatic Segment Space Management
 INFO =>    Database failure prevention best practices
 WARNING => Database Archivelog Mode should be set to ARCHIVELOG
 WARNING => Some tablespaces are not using Automatic segment storage management.
 WARNING => Database has one or more dictionary managed tablespace
 WARNING => Unsupported data types preventing Data Guard (transient logical standby or logical standby) rolling upgrade
Collecting patch inventory on  CRS HOME /u01/app/11.2.0.3/grid
Collecting patch inventory on ORACLE_HOME /u01/app/oracle/product/11.2.0.3/dbhome_1 
Collecting patch inventory on ORACLE_HOME /u01/app/oracle/product/11.2.0.3/orcl 

Copying results from dm01db01-priv and generating report. This might take a while. Be patient.

---------------------------------------------------------------------------------
                      CLUSTERWIDE CHECKS
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------

Detailed report (html) - /tmp/exachk_192_dbm_040613_105703/exachk_192_dbm_040613_105703.html

UPLOAD(if required) - /tmp/exachk_192_dbm_040613_105703.zip

 

 

Finally a zipped report is generated, which can be uploaded to GCS (Oracle Global Customer Support) on a regular basis.

The HTML version of the report looks like this:

[Figure: Exadata health check report]

 

[Enterprise Manager 12c] How to configure Exadata InfiniBand alert emails in EM 12c

EM 12c centralizes a large part of Exadata management. Here is how to configure Exadata InfiniBand alert emails in EM 12c:

  1. First add the IB network to the EM targets: Exadata machine => IB network => target setup => monitor setup
  2. Then go to Monitoring => Metric and Collection Settings
  3. See also: How To Configure Notification Rules in 12c Enterprise Manager Cloud Control? [ID 1368036.1]

How to calculate the benefit of Exadata from AWR

Article URL: https://www.askmac.cn/archives/oracle%E4%BB%8Eawr%E4%B8%AD%E8%AE%A1%E7%AE%97%E5%87%BAexadata%E6%95%88%E6%9E%9C%E7%9A%84%E6%96%B9%E6%B3%95.html

Exadata-specific information in the AWR report

Instance Activity statistics (I/O-related)

  • cell physical IO interconnect bytes

– total I/O shipped across the interconnect between the DB servers and the cells

  • cell physical IO bytes eligible for predicate offload

– I/O volume eligible for Smart Scan

  • cell physical IO interconnect bytes returned by smart scan

– I/O volume returned from the cells by Smart Scan

  • cell IO uncompressed bytes

– size of the uncompressed data processed by the cells

  • cell physical IO bytes saved by storage index

– I/O volume avoided thanks to storage indexes

  • cell physical IO bytes saved during optimized file creation

– I/O for data file creation work that was offloaded to the cells

  • cell flash cache read hits

– number of read requests satisfied by the Smart Flash Cache

[Figure: Exadata statistics in an AWR report]

 

How to quantify the Exadata benefit

  • I/O reduction (cell offload efficiency)

– computed from read I/O

– independent of whether compression is used

I/O reduction = [ 1 - {(cell physical IO interconnect bytes returned by smart scan) / (cell IO uncompressed bytes + cell physical IO bytes saved by storage index)} ] * 100

– i.e. the ratio of the I/O actually shipped to the database servers by Smart Scan to the sum of the uncompressed I/O volume and the I/O avoided by storage indexes.
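
The AWR report gives these statistics as interval deltas; on a live system the same ratio can be approximated from the cumulative values in V$SYSSTAT, for example with this hedged sketch:

sqlplus -s / as sysdba <<'EOF'
SELECT ROUND((1 - ss.val / NULLIF(unc.val + six.val, 0)) * 100, 1) AS offload_pct
  FROM (SELECT value val FROM v$sysstat
         WHERE name = 'cell physical IO interconnect bytes returned by smart scan') ss,
       (SELECT value val FROM v$sysstat
         WHERE name = 'cell IO uncompressed bytes') unc,
       (SELECT value val FROM v$sysstat
         WHERE name = 'cell physical IO bytes saved by storage index') six;
EOF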

  • Disk I/O reduction from storage indexes

– independent of whether compression is used

(cell physical IO bytes saved by storage index / physical read total bytes) * 100

– i.e. the proportion of the total read volume that was avoided thanks to storage indexes

  • Flash Cache hit ratio

– independent of whether compression is used

Flash Cache hit ratio = (cell flash cache read hits / physical read total IO requests) * 100

– i.e. the Flash Cache hit ratio for read requests

 

  • I/O reduction (including writes)

– counts not only read I/O but also the write I/O that can be offloaded to the storage

– can only be computed when compression is not used

I/O reduction = [ {physical read total bytes + (physical write total bytes) * 3 - cell physical IO interconnect bytes} / {physical read total bytes + (physical write total bytes) * 3} ] * 100

– i.e. the total I/O volume minus the I/O that crossed the interconnect is treated as the I/O saved by offloading work to the storage

※ Because cell physical IO interconnect bytes includes the writes generated by ASM mirroring, physical write total bytes is counted at three times its value (twice for two-way mirroring).

※ With compression, the I/O flowing over the interconnect has been seen to exceed the actual disk I/O, in which case the method described here cannot be applied.

 

 
