Exadata vs Greenplum vs Netezza
Starting with Exadata Storage Software 11.2.2.4, a new feature called Exadata Smart Flash Logging was introduced. It allows the LGWR process to write redo to the flash cache and the disk controller cache in parallel; as soon as either of the two writes completes, the RDBMS is notified and the database continues working. The feature is aimed at improving the response time and throughput of redo writes on Exadata, and is especially beneficial for systems whose main performance bottleneck is the log file sync wait event.
When the I/O of frequent redo log writes became a major performance bottleneck on the Exadata Database Machine, the Exadata development team naturally thought of using the flash cache already present on the DBM to reduce response time, while still guaranteeing that introducing the flash cache could never cause the loss of critical redo data:
The main problem is that the number of redo log writes is very high even when there is no activity in the database. A flash disk used for redo would therefore reach its write endurance limit very quickly, within days or months (I have not seen exact test results). At that point you lose the flash disk and all data on it, and losing redo logs is a very unpleasant cause of database unavailability, which can mean long downtime and possible data loss.
As you already know, 11.2.2.4.0 introduced the Smart Flash Log feature. For customers that are not on 11.2.2.4.0, do not suggest putting the redo logs on a diskgroup that uses griddisks carved from flash disks. There are various issues with redo logs on the flash cache in earlier versions, and those should be avoided.
Any system with the Exadata Storage Software 11.2.2.4 patch installed enables the Exadata Smart Flash Logging feature implicitly, but the feature also requires the database to be at Database 11.2.0.2 Bundle Patch 11 or later.
MOS currently does not describe how to disable the feature on a DBM where Exadata Smart Flash Logging is already enabled.
In practice each cell allocates 512 MB of flash for Smart Flash Logging, so the usable flash space per cell is now 364.75 GB.
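The documentation quoted further below also notes that this default 512 MB allocation may need to be enlarged for databases with very high redo generation rates or for consolidated deployments. On recent storage software this is typically done from CellCLI by dropping and recreating the flash log; treat the following as a sketch only, and verify that the DROP FLASHLOG / CREATE FLASHLOG commands and the size syntax exist in your cell software release before running anything (the 1g size is an arbitrary example value):

```
# run on each storage cell
cellcli -e drop flashlog
cellcli -e create flashlog all size=1g
cellcli -e list flashlog attributes name,size,status
```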
The benefit is not limited to online redo logs: standby redo logs can also gain from Smart Flash Logging, provided the required software combination is in place, namely cell patch 11.2.2.4 and Database 11.2.0.2 Bundle Patch 11 or greater.
You can inspect the current Smart Flash Logging configuration from the CellCLI command line; if the command returns output, Smart Flash Logging is configured.
CellCLI> LIST FLASHLOG DETAIL
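Each cell also maintains flash log metrics, corresponding to the statistics mentioned in the document quoted below (for example, counts of redo writes serviced first by flash versus first by disk). A hedged example, assuming the standard FLASHLOG metric object type and metric names of the FL_ family (such as FL_FLASH_FIRST and FL_DISK_FIRST), which may differ across releases:

```
CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHLOG'
CellCLI> LIST METRICDEFINITION WHERE name LIKE 'FL_.*' DETAIL
```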
For more information, refer to the document "Exadata Smart Flash Logging Explained", quoted here:
Smart Flash Logging works as follows. When receiving a redo log write request, Exadata will do parallel writes to the on-disk redo logs as well as a small amount of space reserved in the flash hardware. When either of these writes has successfully completed the database will be immediately notified of completion. If the disk drives hosting the logs experience slow response times, then the Exadata Smart Flash Cache will provide a faster log write response time. Conversely, if the Exadata Smart Flash Cache is temporarily experiencing slow response times (e.g., due to wear leveling algorithms), then the disk drive will provide a faster response time. Given the speed advantage the Exadata flash hardware has over disk drives, log writes should be written to Exadata Smart Flash Cache, almost all of the time, resulting in very fast redo write performance. This algorithm will significantly smooth out redo write response times and provide overall better database performance.

The Exadata Smart Flash Cache is not used as a permanent store for redo data – it is just a temporary store for the purpose of providing fast redo write response time. The Exadata Smart Flash Cache is a cache for storing redo data until this data is safely written to disk. The Exadata Storage Server comes with a substantial amount of flash storage. A small amount is allocated for database logging and the remainder will be used for caching user data.

The best practices and configuration of redo log sizing, duplexing and mirroring do not change when using Exadata Smart Flash Logging. Smart Flash Logging handles all crash and recovery scenarios without requiring any additional or special administrator intervention beyond what would normally be needed for recovery of the database from redo logs. From an end user perspective, the system behaves in a completely transparent manner and the user need not be aware that flash is being used as a temporary store for redo. The only behavioral difference will be consistently low latencies for redo log writes.

By default, 512 MB of the Exadata flash is allocated to Smart Flash Logging. Relative to the 384 GB of flash in each Exadata cell this is an insignificant investment for a huge performance benefit. This default allocation will be sufficient for most situations. Statistics are maintained to indicate the number and frequency of redo writes serviced by flash and those that could not be serviced, due to, for example, insufficient flash space being allocated for Smart Flash Logging. For a database with a high redo generation rate, or when many databases are consolidated on to one Exadata Database Machine, the size of the flash allocated to Smart Flash Logging may need to be enlarged. In addition, for consolidated deployments, the Exadata I/O Resource Manager (IORM) has been enhanced to enable or disable Smart Flash Logging for the different databases running on the Database Machine, reserving flash for the most performance critical databases.
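The last sentence of the quote, per-database control of Smart Flash Logging through IORM, maps to a flashlog directive in the interdatabase plan. A minimal sketch, where critdb and testdb are hypothetical database names; verify the exact dbplan syntax (and whether additional directives such as name=other are required) against your cell software documentation:

```
CellCLI> ALTER IORMPLAN dbplan=((name=critdb, flashlog=on), (name=testdb, flashlog=off))
```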
See also the official white paper <Exadata Smart Flash Cache Features and the Oracle Exadata Database Machine>, publicly available at:
http://www.oracle.com/technetwork/database/exadata/exadata-smart-flash-cache-366203.pdf
| | V2 Full Rack | X2-2 Full Rack | X2-8 Full Rack |
|---|---|---|---|
| Database servers | 8 x Sun Fire X4170 1U | 8 x Sun Fire X4170 M2 1U | 2 x Sun Fire X4800 5U |
| Database CPUs | Xeon E5540 quad core 2.53GHz | Xeon X5670 six cores 2.93GHz | Xeon X7560 eight cores 2.26GHz |
| Database cores | 64 | 96 | 128 |
| Database RAM | 576GB | 768GB | 2TB |
| Storage cells | 14 x Sun Fire X4275 | 14 x Sun Fire X4270 M2 | 14 x Sun Fire X4270 M2 |
| Storage cell CPUs | Xeon E5540 quad core 2.53GHz | Xeon L5640 six cores 2.26GHz | Xeon L5640 six cores 2.26GHz |
| Storage cell CPU cores | 112 | 168 | 168 |
| IO performance & capacity | 15K RPM 600GB SAS or 2TB SATA 7.2K RPM disks | 15K RPM 600GB SAS (HP model, high performance) or 2TB SAS 7.2K RPM disks (HC model, high capacity) | 15K RPM 600GB SAS (HP model, high performance) or 2TB SAS 7.2K RPM disks (HC model, high capacity) |
| Flash cache | 5.3TB | 5.3TB | 5.3TB |
| Database server networking | 4 x 1GbE x 8 servers = 32 x 1GbE | 4 x 1GbE x 8 servers + 2 x 10GbE x 8 servers = 32 x 1GbE + 16 x 10GbE | 8 x 1GbE x 2 servers + 8 x 10GbE x 2 servers = 16 x 1GbE + 16 x 10GbE |
| InfiniBand switches | QDR 40Gbit/s wire | QDR 40Gbit/s wire | QDR 40Gbit/s wire |
| InfiniBand ports on database servers (total) | 2 ports x 8 servers = 16 ports | 2 ports x 8 servers = 16 ports | 8 ports x 2 servers = 16 ports |
| Database server OS | Oracle Linux only | Oracle Linux (possibly Solaris later, still unclear) | Oracle Linux or Solaris x86 |

Note that the 2TB SAS drives on the X2-2 and X2-8 are the same old 2TB drives with new SAS electronics. (Thanks to Kevin Closson for the reference.)
A while ago a colleague asked me: what operating system does Exadata actually run?
The earliest Exadata V1, built in cooperation with HP, used Oracle Enterprise Linux. The Oracle-Sun Exadata V2 currently ships only with OEL, but Solaris 11 Express has already been tested on Exadata V2, so a Solaris option for Exadata V2 should arrive soon.
Existing Exadata X2-2 and X2-8 machines mostly ship with one of two minor releases of OEL 5:
Machines shipped earlier use OEL 5.3:

# cat /etc/enterprise-release
Enterprise Linux Enterprise Linux Server release 5.3 (Carthage)

Machines shipped more recently use OEL 5.5:

# cat /etc/enterprise-release
Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)
# uname -a
Linux vrh1.us.oracle.com 2.6.18-128.1.16.0.1.el5 #1 SMP Tue x86_64 x86_64 x86_64 GNU/Linux
The IB card should be one of the compatible cards specified in Note 888828.1.
If you build a backup server machine, it is best to build as close a clone of the Exadata compute nodes as you can get.
That is, install OEL 5 Update 5 and one of the IB cards specified in the note, and you will have the correct OFED versions and kernel.
This will guarantee interoperability and correct operation with the kernel and OFED drivers.
From the doc:
InfiniBand OFED Software
Exadata Storage Servers and database servers will interoperate with different InfiniBand OFED software versions, however, Oracle recommends that all versions be the same unless performing a rolling upgrade. Review Note 1262380.1 for database server software and firmware guidelines.
InfiniBand HCA
Exadata Storage Servers and database servers will interoperate with different InfiniBand host channel adapter (HCA) firmware versions, however, Oracle recommends that all versions be the same unless performing a rolling upgrade. Review Note 1262380.1 for database server software and firmware guidelines.
For a complete list of the Oracle QDR InfiniBand adapters see here:
http://www.oracle.com/technetwork/documentation/oracle-net-sec-hw-190016.html#infinibandadp
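To verify that a backup server's HCA firmware and OFED stack really match the compute nodes, output along the following lines can be compared on both hosts. This is a sketch using the standard infiniband-diags and OFED utilities; the exact output format varies by release:

```
# show HCA model and firmware version
ibstat | grep -E 'CA type|Firmware version'
# show the installed OFED release
ofed_info | head -1
```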
For the compute nodes, all firmware updates must be done via the bundle patches described in Doc 888828.1.
So I would advise upgrading to the latest supported bundle patch.
For your backup server, choose the same model card that came with the X2 compute nodes.
Install Oracle Enterprise Linux Release 5 Update 5.
Upgrade the firmware to the same firmware as on the X2, or higher, if it is not already the same.
Database Machine and Exadata Storage Server 11g Release 2 (11.2) Supported Versions [ID 888828.1]
Hybrid Columnar Compression (HCC) is one of the core features of the Exadata database machine. Unlike the ordinary Advanced Compression option, HCC is available only on the Exadata platform. With HCC, data is stored compressed in compression units (CUs); a CU spans multiple database blocks, because a single block is a poor fit for column-value compression algorithms, and only when a CU contains multiple blocks can the column-oriented compression achieve good ratios.
At the same time, ordinary INSERT/UPDATE operations cause row-level compression to be downgraded: rows that were originally HCC-compressed may, after an UPDATE/INSERT, fall back to the ordinary OLTP compression level.
Hybrid columnar compression is designed for data-warehouse-style bulk initial loads and direct-path loads, such as ALTER TABLE MOVE, IMPDP, or direct-path (APPEND) INSERT; the premise for using HCC is that the data will rarely or never be modified.
When you update a row in an HCC-enabled table, the entire compression unit (CU) containing that row is locked, and the updated rows have to be downgraded from their original HCC compression level to, for example, no compression or OLTP compression.
Consider the following example:
SQL*Plus: Release 11.2.0.2.0 Production on Wed Sep 12 06:14:53 2012

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

SQL> grant dba to scott;

Grant succeeded.

SQL> conn scott/oracle
Connected.

SQL> create table hcc_maclean tablespace users compress for query high as select * from dba_objects;

Table created.

  1* select rowid,owner,object_name,dbms_rowid.rowid_block_number(rowid) from hcc_maclean where owner='MACLEAN'
SQL> /

ROWID                          OWNER                OBJECT_NAME          DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
------------------------------ -------------------- -------------------- ------------------------------------
AAAThuAAEAAAHTJAOI             MACLEAN              SALES                                               29897
AAAThuAAEAAAHTJAOJ             MACLEAN              MYCUSTOMERS                                         29897
AAAThuAAEAAAHTJAOK             MACLEAN              MYCUST_ARCHIVE                                      29897
AAAThuAAEAAAHTJAOL             MACLEAN              MYCUST_QUERY                                        29897
AAAThuAAEAAAHTJAOh             MACLEAN              COMPRESS_QUERY                                      29897
AAAThuAAEAAAHTJAOi             MACLEAN              UNCOMPRESS                                          29897
AAAThuAAEAAAHTJAOj             MACLEAN              CHAINED_ROWS                                        29897
AAAThuAAEAAAHTJAOk             MACLEAN              COMPRESS_QUERY1                                     29897

8 rows selected.

select dbms_rowid.rowid_block_number(rowid),dbms_rowid.rowid_relative_fno(rowid) from hcc_maclean where owner='MACLEAN';

session A:
update hcc_maclean set OBJECT_NAME=OBJECT_NAME||'DBM' where rowid='AAAThuAAEAAAHTJAOI';

session B:
update hcc_maclean set OBJECT_NAME=OBJECT_NAME||'DBM' where rowid='AAAThuAAEAAAHTJAOJ';

SQL> select sid,wait_event_text,BLOCKER_SID from v$wait_chains;

       SID WAIT_EVENT_TEXT                                                  BLOCKER_SID
---------- ---------------------------------------------------------------- -----------
        13 enq: TX - row lock contention                                            136
       136 SQL*Net message from client

Session A blocks session B here, which confirms that with HCC an update of one row locks the entire CU containing it.

SQL> alter system checkpoint;

System altered.

SQL> /

System altered.

SQL> alter system dump datafile 4 block 29897;

Block header dump:  0x010074c9
 Object id on Block? Y
 seg/obj: 0x1386e  csc: 0x00.1cad7e  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x10074c8 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.001cabfa
0x02   0x000a.00a.00000430  0x00c051a7.0169.17  ----    1  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
avsp=0x14
tosp=0x14
r0_9ir2=0x0
mec_kdbh9ir2=0x0
            76543210
shcf_kdbh9ir2=----------
        76543210
flag_9ir2=--R-----
Archive compression: Y
fcls_9ir2[0]={ }
0x16:pti[0] nrow=1 offs=0
0x1a:pri[0] offs=0x30
block_row_dump:
tab 0, row 0, @0x30
tl: 8016 fb: --H-F--N lb: 0x2  cc: 1    ==> the whole CU points to ITL 0x02
nrid:  0x010074ca.0
col  0: [8004]
Compression level: 02 (Query High)
 Length of CU row: 8004
kdzhrh: ------PC CBLK: 1 Start Slot: 00
 NUMP: 01
 PNUM: 00 POFF: 7984 PRID: 0x010074ca.0
CU header:
CU version: 0   CU magic number: 0x4b445a30
CU checksum: 0xf8faf86e
CU total length: 8694
CU flags: NC-U-CRD-OP
ncols: 15
nrows: 995
algo: 0
CU decomp length: 8487   len/value length: 100111
row pieces per row: 1
num deleted rows: 1
deleted rows: 904,
START_CU:
We can check how a given row is compressed as follows:
SQL> select DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN','AAAThuAAEAAAHTJAOk') from dual;

DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN','AAATHUAAEAAAHTJAOK'
--------------------------------------------------------------------------------
                                                                               4
COMP_NOCOMPRESS CONSTANT NUMBER := 1;
COMP_FOR_OLTP CONSTANT NUMBER := 2;
COMP_FOR_QUERY_HIGH CONSTANT NUMBER := 4;
COMP_FOR_QUERY_LOW CONSTANT NUMBER := 8;
COMP_FOR_ARCHIVE_HIGH CONSTANT NUMBER := 16;
COMP_FOR_ARCHIVE_LOW CONSTANT NUMBER := 32;
COMP_RATIO_MINROWS CONSTANT NUMBER := 1000000;
COMP_RATIO_ALLROWS CONSTANT NUMBER := -1;
The list above gives the constant value for each compression type; for example, COMP_FOR_QUERY_HIGH is 4 and COMP_FOR_QUERY_LOW is 8.
Since the GET_COMPRESSION_TYPE query above returned 4 for the specified rowid, that row is compressed at the COMP_FOR_QUERY_HIGH level:
SQL> update hcc_maclean set OBJECT_NAME=OBJECT_NAME||'DBM' where owner='MACLEAN';

8 rows updated.

SQL> commit;

Commit complete.

SQL> select DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',rowid) from HCC_MACLEAN where owner='MACLEAN';

DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',ROWID)
------------------------------------------------------------------
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1

8 rows selected.
After updating this batch of rows, COMPRESSION_TYPE drops from COMP_FOR_QUERY_HIGH to COMP_NOCOMPRESS: although the table is declared compress for query high, some of its data is effectively no longer compressed after the update.
In 11g, rows that have reverted to the uncompressed state are not automatically upgraded back to HCC. When necessary, run ALTER TABLE MOVE or an online redefinition manually to convert the uncompressed data back into HCC:
SQL> ALTER TABLE hcc_MACLEAN move COMPRESS FOR ARCHIVE HIGH;

Table altered.

SQL> select DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',rowid) from HCC_MACLEAN where owner='MACLEAN';

DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',ROWID)
------------------------------------------------------------------
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16

8 rows selected.
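Beyond checking the compression type of individual rows, DBMS_COMPRESSION can also estimate in advance the ratio a given compression type would achieve. Here is a minimal sketch using the 11.2 GET_COMPRESSION_RATIO interface; the scratch tablespace name SCRATCH_TBS is a placeholder, and the parameter list should be verified against your release:

```
set serveroutput on
declare
  l_blkcnt_cmp   pls_integer;   -- blocks if compressed
  l_blkcnt_uncmp pls_integer;   -- blocks if uncompressed
  l_row_cmp      pls_integer;   -- rows per block if compressed
  l_row_uncmp    pls_integer;   -- rows per block if uncompressed
  l_cmp_ratio    number;        -- estimated compression ratio
  l_comptype_str varchar2(100);
begin
  dbms_compression.get_compression_ratio(
    scratchtbsname => 'SCRATCH_TBS',
    ownname        => 'SCOTT',
    tabname        => 'HCC_MACLEAN',
    partname       => null,
    comptype       => dbms_compression.comp_for_query_high,
    blkcnt_cmp     => l_blkcnt_cmp,
    blkcnt_uncmp   => l_blkcnt_uncmp,
    row_cmp        => l_row_cmp,
    row_uncmp      => l_row_uncmp,
    cmp_ratio      => l_cmp_ratio,
    comptype_str   => l_comptype_str);
  dbms_output.put_line(l_comptype_str||' estimated ratio: '||l_cmp_ratio);
end;
/
```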
This, of course, is a short video made by Netezza, a competitor of the Oracle-Sun Exadata V2 database machine, but it is not baseless!
Many people have told me that with a machine as intelligent as Exadata, DBAs are no longer needed. Is that really true?
In my own experience, after deploying Exadata the DBA spends more time on administration, not less.
The convoluted database upgrade sequence on Exadata:
InfiniBand Switch -> Exadata Storage Software -> Exadata Host Minimal Patch -> Grid Infrastructure -> Database
The detailed Exadata upgrade process from a customer's perspective:

The many component layers involved in one upgrade:

These upgrade steps were not written for human beings, right?!

A Bundle Patch on Exadata marked "Replaces recalled patch 8626760", so bundle patches can actually be recalled, right?!

# Don't bother searching for that patch; since it was recalled, you won't be able to find it!

A configuration claimed to be optimal straight from the factory, which nevertheless gets even the simple ulimit memlock parameter wrong, right?!
Yesterday a customer earnestly told me that they are planning to migrate their core application off Exadata V2. They started using Exadata in 2009; whether they were the first in China I don't know, but they were certainly among the first production users. Over those two years Exadata caused them endless trouble, and now they simply don't want to struggle with it any more. In effect, users keep seeing Exadata's strengths during testing, but in production they only see its weaknesses!
A review of the major fatal flaws of the Exadata V2 database machine
The story starts with an incident about two months ago. On a four-node 11.2.0.1 RAC database deployed on an Oracle-Sun Exadata V2 Database Machine, the critical RAC background process LMS on one node raised an ORA-00600 [kjbmprlst:shadow] error, after which the LMS process terminated the instance on that node. When the CRS software on the other nodes detected the unexpected termination, the database entered global resource reconfiguration, and the Reconfiguration completed successfully on all remaining nodes.
Soon afterwards, however, the alert log of one of the surviving nodes kept reporting "Process W000 died, see its trace file"; the instance apparently could not obtain the resources needed to spawn new processes, and at the same time applications could not log in to the instance on that node. Of the original four RAC nodes, one was down with the ORA-00600 and another no longer accepted logins, so the cluster was suddenly running at half capacity.
I then arrived on site to continue the diagnosis, and found the following symptoms, listed one by one:
1. Attempts to log in to the instance remotely failed with ORA-12516 "TNS:listener could not find available handler with matching protocol stack". Repeated login attempts produced the following messages:
Linux Error: 12: Cannot allocate memory
Additional information: 1
ORA-01034: ORACLE not available
2. After double-checking ORACLE_SID, ORACLE_HOME, and the other environment variables, "sqlplus / as sysdba" returned "Connected to an idle instance." (the most puzzling symptom of all). Without a sysdba connection we could not collect the usual diagnostics; a systemstate dump could still be taken through a preliminary connection or gdb, as sketched below, but I set that aside for the moment.
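When sqlplus reports an idle instance even though the SGA is clearly alive, a preliminary connection usually still allows diagnostics to be dumped. A sketch of the standard technique (systemstate level 266 includes process short stacks; adjust the level as needed):

```
$ sqlplus -prelim "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
```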
3. The background process W000 is spawned by the SMCO process; the SMCO trace reads as follows, reporting the state KSOSP_SPAWNED:
Process W000 is dead (pid=2648 req_ver=3812 cur_ver=3812 state=KSOSP_SPAWNED).
*** 2011-07-08 02:44:32.971
Process W000 is dead (pid=2650 req_ver=3813 cur_ver=3813 state=KSOSP_SPAWNED).
4. The shared memory and background processes making up the instance were confirmed to be alive, and trace output was still being produced:
[oracle@maclean04 trace]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        72         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2
0xac5ffd78 491524     oracle     660        4096       0
0x96c5992c 1409029    oracle     660        4096       0

[oracle@maclean04 trace]$ ls -l /dev/shm
total 34839780
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_0
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_1
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_10
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_100
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_101
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_102
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_103
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_104
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_105
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_106
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_107
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_108
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_109
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_11
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_110
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_111
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_112
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_113
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_114
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_115
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_116
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_117
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_118
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_119
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_12
.......................

[oracle@maclean04 trace]$ ps -ef|grep ora_
oracle    5466     1  0 Jul03 ?        00:00:18 ora_pz99_maclean4
oracle   14842 10564  0 19:54 pts/9    00:00:00 grep ora_
oracle   18641     1  0 Jun08 ?        00:00:02 ora_q002_maclean4
oracle   23932     1  0 Jun07 ?        00:04:26 ora_pmon_maclean4
oracle   23934     1  0 Jun07 ?        00:00:06 ora_vktm_maclean4
oracle   23938     1  0 Jun07 ?        00:00:00 ora_gen0_maclean4
oracle   23940     1  0 Jun07 ?        00:00:06 ora_diag_maclean4
oracle   23942     1  0 Jun07 ?        00:00:00 ora_dbrm_maclean4
oracle   23944     1  0 Jun07 ?        00:01:01 ora_ping_maclean4
oracle   23946     1  0 Jun07 ?        00:00:16 ora_psp0_maclean4
oracle   23948     1  0 Jun07 ?        00:00:00 ora_acms_maclean4
oracle   23950     1  0 Jun07 ?        02:27:29 ora_dia0_maclean4
oracle   23952     1  0 Jun07 ?        01:19:42 ora_lmon_maclean4
oracle   23954     1  0 Jun07 ?        02:23:59 ora_lmd0_maclean4
oracle   23956     1  5 Jun07 ?        1-13:50:36 ora_lms0_maclean4
oracle   23960     1  4 Jun07 ?        1-12:44:25 ora_lms1_maclean4
oracle   23964     1  0 Jun07 ?        00:00:00 ora_rms0_maclean4
oracle   23966     1  0 Jun07 ?        00:00:00 ora_lmhb_maclean4
oracle   23968     1  0 Jun07 ?        01:58:35 ora_mman_maclean4
oracle   23970     1  0 Jun07 ?        06:28:39 ora_dbw0_maclean4
oracle   23972     1  0 Jun07 ?        06:27:08 ora_dbw1_maclean4
oracle   23974     1  2 Jun07 ?        16:49:56 ora_lgwr_maclean4
oracle   23976     1  0 Jun07 ?        00:20:48 ora_ckpt_maclean4
oracle   23978     1  0 Jun07 ?        00:07:03 ora_smon_maclean4
oracle   23980     1  0 Jun07 ?        00:00:00 ora_reco_maclean4
oracle   23982     1  0 Jun07 ?        00:00:00 ora_rbal_maclean4
oracle   23984     1  0 Jun07 ?        00:01:00 ora_asmb_maclean4
oracle   23986     1  0 Jun07 ?        00:08:15 ora_mmon_maclean4
oracle   23988     1  0 Jun07 ?        00:18:19 ora_mmnl_maclean4
oracle   23992     1  0 Jun07 ?        00:00:00 ora_d000_maclean4
oracle   23994     1  0 Jun07 ?        00:00:00 ora_s000_maclean4
oracle   23996     1  0 Jun07 ?        00:00:00 ora_mark_maclean4
oracle   24065     1  0 Jun07 ?        01:16:54 ora_lck0_maclean4
oracle   24067     1  0 Jun07 ?        00:00:00 ora_rsmn_maclean4
oracle   24079     1  0 Jun07 ?        00:01:02 ora_dskm_maclean4
oracle   24174     1  0 Jun07 ?        00:08:18 ora_arc0_maclean4
oracle   24188     1  0 Jun07 ?        00:08:19 ora_arc1_maclean4
oracle   24190     1  0 Jun07 ?        00:00:59 ora_arc2_maclean4
oracle   24192     1  0 Jun07 ?        00:08:12 ora_arc3_maclean4
oracle   24235     1  0 Jun07 ?        00:00:00 ora_gtx0_maclean4
oracle   24237     1  0 Jun07 ?        00:00:00 ora_rcbg_maclean4
oracle   24241     1  0 Jun07 ?        00:00:00 ora_qmnc_maclean4
oracle   24245     1  0 Jun07 ?        00:00:00 ora_q001_maclean4
oracle   24264     1  0 Jun07 ?        00:08:28 ora_cjq0_maclean4
oracle   25782     1  0 Jun07 ?        00:00:00 ora_smco_maclean4
5. At the time of the problem the system was confirmed to still have a large amount of free memory, with no heavy swapping, and the /dev/shm shared memory directory still had 27 GB free.
6. Querying the global dynamic performance view gv$resource_limit from another node showed that on the troubled node the high-water mark of the processes resource was only 404, nowhere near the limit.
SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn

SQL> select * from gv$resource_limit where inst_id=4;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- -------------
processes                                       50             404 1500                 1500
sessions                                        61             616 2272                 2272
enqueue_locks                                  849            1599 31062                31062
enqueue_resources                              846            1007 15016                UNLIMITED
ges_procs                                       47             399 1503                 1503
ges_ress                                     65943          109281 67416                UNLIMITED
ges_locks                                    23448           37966 92350                UNLIMITED
ges_cache_ress                                7347           14716 0                    UNLIMITED
ges_reg_msgs                                   337            5040 3730                 UNLIMITED
ges_big_msgs                                    26             502 3730                 UNLIMITED
ges_rsv_msgs                                     0               1 1000                 1000
gcs_resources                              2008435         2876561 3446548              3446548
gcs_shadows                                1888276         2392064 3446548              3446548
dml_locks                                        0               0 9996                 UNLIMITED
temporary_table_locks                            0              45 UNLIMITED            UNLIMITED
transactions                                     0               0 2499                 UNLIMITED
branches                                         0               2 2499                 UNLIMITED
cmtcallbk                                        0               3 2499                 UNLIMITED
max_rollback_segments                          109             129 2499                 65535
sort_segment_locks                               0              14 UNLIMITED            UNLIMITED
k2q_locks                                        0               2 4544                 UNLIMITED
max_shared_servers                               1               1 UNLIMITED            UNLIMITED
parallel_max_servers                             1              19 160                  3600
7. The kernel parameters in the node's sysctl.conf were configured correctly:
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
# bug 8311668 file-max and aio-max-nr
fs.file-max = 6815744
# DB install guide says the above
fs.aio-max-nr = 1048576
# 8976963
net.ipv4.neigh.bond0.locktime=0
net.ipv4.ip_local_port_range = 9000 65500
# DB install guide says the above
net.core.rmem_default = 4194304
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 2097152
# The original DB deployment was net.core.wmem_max = 1048586 but IB works
# best for Exadata at the above net.core settings
# bug 8268393 remove vm.nr_hugepages = 2048
# bug 8778821 system reboots after 60 sec on panic
kernel.panic=60
########### END DO NOT REMOVE Added by Oracle Exadata ###########
########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
kernel.shmmax = 64547735961
kernel.shmall = 15758724
########### END DO NOT REMOVE Added by Oracle Exadata ###########
8. At this point the problem still looked baffling: the main background processes and the SGA were intact, the OS had plenty of free memory, and no instance resource had hit a ceiling. What on earth was preventing new processes from being spawned?!
Out of caution, I finally checked /etc/security/limits.conf on the system.
But the moment I opened the file I realized the configuration was wrong; something seemed to be missing. Here is the limits.conf from that Exadata:
########### BEGIN DO NOT REMOVE Added by Oracle Deployment Scripts ###########
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 65536
oracle hard nofile 65536
########### END DO NOT REMOVE Added by Oracle Deployment Scripts ###########
Clearly the limits.conf above lacks any setting for the memlock parameter; without it, the default memlock of 32 applies. Here is the ulimit output on the Exadata host:
[oracle@maclean4 shm]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 606208
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 2047
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
You can see that max locked memory here is indeed the default 32, while the memlock value Oracle recommends is far larger than 32.
In the Oracle Validated Configurations, the validated recommendation for memlock is 50000000; for more on Oracle Validated Configurations see my earlier post <Understand Oracle Validated Configurations>.
[oracle@rh2 ~]$ cat /etc/security/limits.conf
# Oracle-Validated setting for nofile soft limit is 131072
oracle soft nofile 131072
# Oracle-Validated setting for nofile hard limit is 131072
oracle hard nofile 131072
# Oracle-Validated setting for nproc soft limit is 131072
oracle soft nproc 131072
# Oracle-Validated setting for nproc hard limit is 131072
oracle hard nproc 131072
# Oracle-Validated setting for core soft limit is unlimited
oracle soft core unlimited
# Oracle-Validated setting for core hard limit is unlimited
oracle hard core unlimited
# Oracle-Validated setting for memlock soft limit is 50000000
oracle soft memlock 50000000
# Oracle-Validated setting for memlock hard limit is 50000000
oracle hard memlock 50000000
A search on MOS turns up the note "Ora-27102: Out Of Memory: Linux Error: 12: Cannot Allocate Memory with LOCK_SGA=TRUE [ID 401077.1]", which points out that a max locked memory value that is too small can cause exactly this "Linux Error: 12: Cannot allocate memory" allocation failure.
Because changes to limits.conf have no effect on an instance that is already running, we could not fix the live problem simply by correcting the parameter.
In practice I worked around the problem by freeing resources: the following script kills all foreground server processes of the instance to release them.
ps -ef|grep $SID|grep LOCAL=NO|grep -v grep| awk '{print $2}'|xargs kill -9
After running the command the terminal stalled briefly, then returned to normal. Both local and remote sysdba logins to the instance then succeeded, and application connections recovered as well.
Although the immediate problem was fixed, the customer still deserved a full explanation. In my email I described the problems in this Exadata machine's configuration files and made several recommendations:
1. Ask Oracle Support to confirm whether the configuration in /etc/security/limits.conf is reasonable and whether it needs to be changed.
2. Set the kernel parameter vm.min_free_kbytes = 51200 to avoid performance problems caused by a shortage of free memory (see the sketch after this list).
3. Install the OSWatcher monitoring tool to keep track of the essential system resources.
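For the second recommendation, the change is an ordinary kernel parameter edit; a minimal sketch, with the value as recommended above:

```
# apply immediately
sysctl -w vm.min_free_kbytes=51200
# persist across reboots
echo "vm.min_free_kbytes = 51200" >> /etc/sysctl.conf
sysctl -p
```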
The customer found my explanation convincing, but still CC'd the email to the vendor's Exadata pre-sales staff.
The pre-sales engineers later contacted me and I gave them the same explanation. But they argued that the Exadata machine had been configured and installed by Oracle in the US, so its configuration had to be optimal; and given the huge gap between the current memlock value (32) and the recommended value (50000000) in limits.conf, they could not believe the US deployment engineers would make such an elementary mistake.
So in practice they remained skeptical of my analysis of the memlock parameter. My advice was to follow up on the memlock question with MOS to settle it, but that was no longer my job. Because of this disagreement, the parameter was not changed in the short term.
The case was closed and time flew; two months passed. It so happened that I recently had a project to upgrade a database on Exadata to 11.2.0.2, so I went through the readme documents of the relevant patches. Since upgrading RAC to 11.2.0.2 requires compatible Exadata Storage Server Software and InfiniBand Switch software versions, I consulted the compatibility matrix:
The following table lists the Exadata Storage Server software versions that are compatible with each supported Oracle Database 11g Release 2 software version.
| Oracle Database software version | Required Exadata Storage Server software version |
|---|---|
| 11g Release 2 (11.2.0.2.0) Patch Set 1 | 11.2.2.x |
| 11g Release 2 (11.2.0.1.0) | 11.2.2.x or 11.2.1.x |
The following table lists the InfiniBand Switch software versions that are compatible with each supported Exadata Storage Server software version.
| Exadata Storage Server software version | Required InfiniBand Switch software version |
|---|---|
| 11.2.2.2.2 and later | Exadata Database Machine: Sun Datacenter InfiniBand Switch 36, switch software version 1.1.3-2 or later. HP Oracle Database Machine: Voltaire ISR 9024D-M and ISR 9024D, switch software 5.1.1 build ID 872 (ISR 9024D-M only), switch firmware 1.0.0 or higher |
| 11.2.2.2.0 or earlier | Exadata Database Machine: Sun Datacenter InfiniBand Switch 36, switch software version 1.0.1-1 or later. HP Oracle Database Machine: Voltaire ISR 9024D-M and ISR 9024D, switch software 5.1.1 build ID 872 (ISR 9024D-M only), switch firmware 1.0.0 or higher |
To upgrade the RAC database on Exadata to 11.2.0.2, the Exadata Storage Server Software must first be upgraded to 11.2.2.x; the version Oracle currently recommends is 11.2.2.3.2.
So I then read the update readme for Exadata Storage Server Software 11.2.2.3.2, namely <Oracle Exadata Database Machine README for patch 12577723 (Support note 1323958.1)>.
The patch installation is divided into two phases, "Applying the Patch to Exadata Cells" and "Applying the Patch to the Database Server"; that is, besides applying the patch on the Exadata cells, a small patch must also be applied on the database nodes.
The "Applying the Patch to the Database Server" section contains the following step:
Repeat the following steps for each database host. If you are taking deployment-wide downtime for the patching, then these steps may be performed in parallel on all database hosts.
First compute the memlock value as 75% of the physical memory in KB (step 01.a of the readme):

let -i x=($((`cat /proc/meminfo | grep 'MemTotal:' | awk '{print $2}'` * 3 / 4))); echo $x
Then edit the /etc/security/limits.conf file to update or add the following limits for the database owner (orauser) and the grid infrastructure user (griduser). Your deployment may use the same operating system user for both, and it may be named oracle; adjust the following as needed.
########## BEGIN DO NOT REMOVE Added by Oracle ###########
orauser  soft core    unlimited
orauser  hard core    unlimited
orauser  soft nproc   131072
orauser  hard nproc   131072
orauser  soft nofile  131072
orauser  hard nofile  131072
orauser  soft memlock <value of x from step 01.a>
orauser  hard memlock <value of x from step 01.a>
griduser soft core    unlimited
griduser hard core    unlimited
griduser soft nproc   131072
griduser hard nproc   131072
griduser soft nofile  131072
griduser hard nofile  131072
griduser soft memlock <value of x from step 01.a>
griduser hard memlock <value of x from step 01.a>
########### END DO NOT REMOVE Added by Oracle ###########
As you can see, before the patch is actually applied to the database server, a remedial step adds the memlock parameter for the oracle and grid users. The memlock value here is computed as 75% of MemTotal from /proc/meminfo. In <Exadata Server Hardware Details> I listed some hardware parameters of the Exadata database hosts: MemTotal is typically 70 GB (74027752 kB), and 74027752 * 75% = 55520814. In other words, the memlock value Oracle actually recommends on Exadata is 55520814, even higher than the validated value of 50000000 I had cited.
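Note that the new limits only apply to sessions started after limits.conf is edited, which is exactly why the live instance above could not be fixed in place. A quick sanity check along these lines confirms the change (user name as configured on your machine):

```
# confirm the memlock entries are present
grep memlock /etc/security/limits.conf
# a fresh login shell should now report the new limit
su - oracle -c 'ulimit -l'
```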
With that, the mystery was finally solved! And there is a lot to learn from it:
1. First, my bold guess: the team that actually deploys Sun Exadata machines is presumably Oracle's hardware division, the former Sun organization. Thorough communication between departments matters in a deployment, and for the Exadata V2 machines rushed into production in 2009 it clearly did not happen; the problem was only acknowledged and fixed in the Oracle Exadata Database Machine 11g Release 2 (11.2) 11.2.2.3.2 patch 12577723, released in May 2011.
2. IT is ultimately about people. No matter how high-end the server or how advanced the technology, without a team capable of mastering it you will get at most 50% of its value; when the staff are utterly unfamiliar with the technology, "intelligence" is empty talk!