Starting with 10.2.0.4, the VIP no longer relocates back to its original node automatically

Starting with 10.2.0.4, the VIP no longer relocates back to its original node automatically. The reason is that Oracle's developers found the following situation in real-world use: relocating back to the original node requires stopping the VIP and starting it again on the original node, but if the public network on the original node is still unavailable, the relocation attempt fails again and the VIP fails over to the second node. During that window the VIP is unavailable, so starting from 10.2.0.4 and 11.1, the default instance check no longer relocates the VIP back to the original node automatically.

See the following MOS note for details:

Applies to:
Oracle Server – Enterprise Edition – Version 10.2.0.4 to 11.1.0.7 [Release 10.2 to 11.1]
Information in this document applies to any platform.
Symptoms

Starting from 10.2.0.4 and 11.1, the VIP does not fail back to the original node even after the public network problem is resolved. This is the default behavior in 10.2.0.4 and 11.1 and differs from that of 10.2.0.3.
Cause

This is actually the default behavior in 10.2.0.4 and 11.1.

In 10.2.0.3, on every instance check the instance attempted to relocate the VIP back to the preferred node (the original node), but that required stopping the VIP and then attempting to restart it on the original node. If the public network on the original node was still down, the attempt to relocate the VIP to the original node would fail and the VIP would fail over back to the secondary node. During this time the VIP was not available, so starting from 10.2.0.4 and 11.1 the default behavior is that the instance check does not attempt to relocate the VIP back to the original node.
Solution

If the default behavior of 10.2.0.4 and 11.1 is not desired and the VIP needs to relocate back to the original node automatically once the public network problem is resolved, use the following workaround:
Uncomment the line
ORA_RACG_VIP_FAILBACK=1 && export ORA_RACG_VIP_FAILBACK

in the racgwrap script in $ORACLE_HOME/bin
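The edit can be scripted. The sketch below is a hypothetical example that operates on a scratch copy created in a temp file; on a real cluster you would point RACGWRAP at $ORACLE_HOME/bin/racgwrap instead, after verifying that your copy really carries the line commented out with a leading `#`:

```shell
#!/bin/sh
# Hypothetical sketch: uncomment the ORA_RACG_VIP_FAILBACK line in racgwrap.
# Demonstrated on a scratch copy; substitute $ORACLE_HOME/bin/racgwrap on a real node.
RACGWRAP=$(mktemp)
printf '%s\n' '#ORA_RACG_VIP_FAILBACK=1 && export ORA_RACG_VIP_FAILBACK' > "$RACGWRAP"

cp "$RACGWRAP" "$RACGWRAP.bak"   # always keep a backup before editing
sed 's/^#[[:space:]]*\(ORA_RACG_VIP_FAILBACK=1 && export ORA_RACG_VIP_FAILBACK\)/\1/' \
    "$RACGWRAP.bak" > "$RACGWRAP"

grep '^ORA_RACG_VIP_FAILBACK' "$RACGWRAP"   # confirm the line is now active
```

The backup copy makes the change trivially reversible if the modified racgwrap misbehaves.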

With the above workaround, the VIP will relocate back to the original node when CRS performs the instance check; therefore, for the VIP to relocate automatically, the node must have at least one instance running.

The instance needs to be restarted or CRS needs to be restarted to have the VIP start relocating back to the original node automatically if the change is being made on the existing cluster.

Relying on automatic relocation of the VIP can take up to 10 minutes because the instance check is performed once every 10 minutes. Manually relocating the VIP is the only way to guarantee quick relocation of the VIP back to the original node.
To manually relocate the VIP, start the nodeapps by issuing
srvctl start nodeapps -n <node name>

Starting the nodeapps does not harm the online resources such as ons and gsd.

Script: loop.sh, a RAC failover verification script

The following script can be used to verify failover availability in a RAC cluster:

loop.sh
  #!/bin/sh
  # start four concurrent test sessions, one second apart
  for i in 1 2 3 4
  do
    nohup sqlplus su/su@failover @verify.sql &
    sleep 1
  done

verify.sql (verification SQL)
  REM  set pagesize 1000
  REM  the following query is for TAF connection verification
  col sid format 999
  col serial# format 9999999
  col failover_type format a13
  col failover_method format a15
  col failed_over format a11
  select sid, serial#, failover_type, failover_method, failed_over
    from v$session where username = 'SU';

  REM  the following query is for load balancing verification
  select instance_name from v$instance;
  exit

  REM you can also combine two queries:
  col inst_id format 999
  col sid format 999
  col serial# format 9999999
  col failover_type format a13
  col failover_method format a15
  col failed_over format a11

  select inst_id, sid, serial#, failover_type, failover_method,
         failed_over from gv$session where username = 'SU';

  REM  a simple select to see the distribution of users when testing 
  REM  connection load balancing

  select inst_id, count(*) from gv$session group by inst_id;

Usage:
./loop.sh

Implementing primary/standby switchover with GoldenGate Live Standby (2)

In "Implementing primary/standby switchover with GoldenGate Live Standby (1)" we covered how to perform a planned switchover in a GoldenGate Live Standby environment. Beyond planned switchovers, however, most failovers in production happen when the primary host fails or the primary database becomes unavailable, and in that case we usually can no longer stop the application or the extract on the primary. Once we fail over to the standby under those conditions, then just as in a Data Guard environment we cannot simply switch back even after the primary database recovers: we must reconfigure OGG on the primary and restore the standby's data to it via an initial load, and only after the data is back in sync can we switch over to the primary again. Below we walk through failing over from the primary to the standby in an unplanned scenario, and then attempt to switch back:

1.
Use the lag replicat command to check the replicat's lag on the standby; if it returns "At EOF (end of file)", the replicat has applied all of the data in the trail to the standby database.
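Because every subsequent stop in this procedure is gated on the same "At EOF" message, the check is easy to script. A minimal sketch, assuming the GGSCI output has been captured into a shell variable (the sample text below is taken from the transcript in this article):

```shell
#!/bin/sh
# Hypothetical helper: succeed only when GGSCI LAG output reports EOF,
# i.e. the group has processed every record currently available to it.
at_eof() {
    printf '%s\n' "$1" | grep -q 'At EOF, no more records to process'
}

# Sample LAG output copied from the transcript; on a real system you would
# capture it with something like: out=$(echo 'lag replicat repstd1' | ggsci)
out='Sending GETLAG request to REPLICAT REPSTD1 ...
Last record lag: 5 seconds.
At EOF, no more records to process.'

if at_eof "$out"; then
    echo "safe to stop the group"   # prints: safe to stop the group
fi
```

The same helper works for extract and pump groups, since LAG prints the identical EOF message for all of them.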
GGSCI (rh3.oracle.com) 1> info all

Program     Status      Group       Lag           Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     STOPPED     EXTSTD2     00:00:00      23:42:47    
EXTRACT     STOPPED     PUMPSTD2    00:00:00      23:41:29    
REPLICAT    RUNNING     REPSTD1     00:00:00      00:00:00  

GGSCI (rh3.oracle.com) 5> lag replicat repstd1

Sending GETLAG request to REPLICAT REPSTD1 ...
Last record lag: 5 seconds.
At EOF, no more records to process.

2.
Stop the replicat on the standby
GGSCI (rh3.oracle.com) 6> stop replicat repstd1

Sending STOP request to REPLICAT REPSTD1 ...
Request processed.

3.
On the standby, run the necessary scripts to grant DML privileges and to enable triggers and cascade delete constraints

4.
Start the extract on the standby.
Before doing so, confirm that the data pump group on the standby is not started, so that trail files accumulate on the standby
GGSCI (rh3.oracle.com) 15> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING                                           
EXTRACT     STOPPED     EXTSTD2     00:00:00      24:04:16    
EXTRACT     STOPPED     PUMPSTD2    00:00:00      24:02:57    
REPLICAT    STOPPED     REPSTD1     00:00:00      00:00:06    

GGSCI (rh3.oracle.com) 16> start extstd2
Sending START request to MANAGER ...
EXTRACT EXTSTD2 starting

5.
The application can now be switched over to the standby
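For reference, steps 1 through 5 above can be collected into a GGSCI obey script. This is only a hypothetical sketch using the group names from this article; the privilege and trigger scripts in step 3 still have to be run separately in SQL*Plus, and each stop should only be issued after LAG reports "At EOF":

```text
-- failover_to_standby.oby (hypothetical; run with: GGSCI> obey failover_to_standby.oby)
-- 1/2. confirm lag, then stop the replicat
lag replicat repstd1
stop replicat repstd1
-- 3. (outside GGSCI) grant DML privileges, enable triggers and cascade delete constraints
-- 4. start only the local extract; leave pumpstd2 stopped so trail files accumulate
start extract extstd2
info all
```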

==============================================================================
The steps above complete the failover to the standby; next we attempt to restore the application to the primary
1. If the original primary host has been destroyed, reinstall the Oracle software and recreate the GoldenGate software directory on the primary system
2. Start GGSCI on the primary
3. Delete the relevant extract and EXTTRAIL on the primary, then recreate them

GGSCI (rh2.oracle.com) 6> delete extract extstd1
Deleted EXTRACT EXTSTD1.

GGSCI (rh2.oracle.com) 7> delete exttrail /d01/ext/cl

GGSCI (rh2.oracle.com) 14> add extract extstd1,tranlog,begin now
EXTRACT added.

GGSCI (rh2.oracle.com) 15> add exttrail /d01/ext/cl,megabytes 100,extract extstd1
EXTTRAIL added.

4.
Start the Manager on the primary
GGSCI (rh2.oracle.com) 18> start Manager
Manager started.

5.
Next, on the primary, run the scripts to disable triggers and cascade delete constraints

6.
Take a hot backup on the standby (either logical or physical), and record the time at which the backup completes

7.
After using the standby's hot backup to complete the initial load on the primary, start the replicat on the primary with the HANDLECOLLISIONS option
GGSCI (rh2.oracle.com) 22> view params repstd2

-- Identify the Replicat group:
REPLICAT repstd2
-- State that source and target definitions are identical:
ASSUMETARGETDEFS
-- Specify database login information as needed for the database:
userid maclean, password maclean
HANDLECOLLISIONS
-- Specify tables for delivery:
MAP clinic.*, TARGET clinic.*;
-- Exclude specific tables from delivery if needed:
-- MAPEXCLUDE 


GGSCI (rh2.oracle.com) 23> start replicat repstd2

Sending START request to MANAGER ...
REPLICAT REPSTD2 starting

8. Start the data pump group on the standby as well, to ship the accumulated trail files to the primary
GGSCI (rh3.oracle.com) 19> start pumpstd2
Sending START request to MANAGER ...
EXTRACT PUMPSTD2 starting

9. Use info replicat to monitor the replicat on the primary, checking whether its progress has passed the time at which the initial-load export completed


10.
Turn off the HANDLECOLLISIONS option currently in effect on the primary
GGSCI (rh2.oracle.com) 26> send replicat repstd2,NOHANDLECOLLISIONS

11.
Shut down all the applications that were previously switched over to the standby

12.
If data validation is required, stop the extract and pump on the standby and the replicat on the primary:
GGSCI (rh2.oracle.com) 31> lag replicat repstd2
Sending GETLAG request to REPLICAT REPSTD2 ...
Last record lag: 3 seconds.
At EOF, no more records to process.

GGSCI (rh3.oracle.com) 28> stop extstd2
Sending STOP request to EXTRACT EXTSTD2 ...
Request processed.


GGSCI (rh3.oracle.com) 26> stop pumpstd2
Sending STOP request to EXTRACT PUMPSTD2 ...
Request processed.

GGSCI (rh2.oracle.com) 34> stop replicat repstd2
Sending STOP request to REPLICAT REPSTD2 ...
Request processed.

/* Use a tool such as Oracle GoldenGate Veridata to verify data consistency,
   and repair any discrepancies
*/

On the standby:
SQL> select sum(t2) from tv;

   SUM(T2)
----------
5355944997

On the primary:
SQL> select sum(t2) from tv;

   SUM(T2)
----------
5355944997

13. On the primary system, grant the application the relevant DML privileges and enable triggers and cascade delete constraints

14.
Alter the begin time of the extract group on the primary to now, then start the primary-to-standby extract and pump and the corresponding replicat
GGSCI (rh2.oracle.com) 36> alter extstd1 ,begin now
EXTRACT altered.


GGSCI (rh2.oracle.com) 52> start extract extstd1

Sending START request to MANAGER ...
EXTRACT EXTSTD1 starting


GGSCI (rh2.oracle.com) 53> start extract pumpstd1

Sending START request to MANAGER ...
EXTRACT PUMPSTD1 starting


GGSCI (rh3.oracle.com) 3> start repstd1

Sending START request to MANAGER ...
REPLICAT REPSTD1 starting

The system is now back in its original primary->standby configuration.
That's great!

Implementing primary/standby switchover with GoldenGate Live Standby (1)

Oracle GoldenGate currently supports an active-passive bidirectional configuration; in other words, OGG can replicate every data change from the active primary database to a standby. While continuously synchronizing data, the standby stands ready to fail over during both planned and unplanned outages, which is what we call a Live Standby. Here we focus on the steps for configuring an Oracle GoldenGate Live Standby system and on the actual switchover procedure.

SQL> conn clinic/clinic
Connected.

SQL> drop table tv;

Table dropped.

SQL> create table tv (t1 int primary key,t2 int,t3 varchar2(30));

Table created.

SQL> drop sequence seqt1;

Sequence dropped.

SQL> create sequence seqt1 start with 1 increment by 1;

Sequence created.

declare
  rnd number(9,2);
begin
   for i in 1..100000 loop
     insert into tv values(seqt1.nextval,i*dbms_random.value,'MACLEAN IS TESTING');
     commit;
   end loop;
end;
/

/* The script above creates test data under an application account on the primary.
   We can then use any of various tools to initialize that data into the standby;
   if real-time online data migration is needed during this process,
   see "Implementing online data migration with GoldenGate"
*/

/* Note that in a Live Standby environment we usually need to replicate sequences as well, so that the business can run normally after switching to the standby */

/* After initializing the standby's data, make sure it is fully consistent with the primary */
primary:
SQL> select sum(t2) from tv;

   SUM(T2)
----------
2498624495

SQL> select last_number from user_sequences;

LAST_NUMBER
-----------
     100001

standby:
SQL> select sum(t2) from tv;

   SUM(T2)
----------
2498624495

SQL> select last_number from user_sequences;

LAST_NUMBER
-----------
     100001

With the preparation above complete, we can move on to the actual GoldenGate Live Standby configuration, which consists of the following steps:

  1. Configure the extract, replicat, and data pump from primary to standby; this step differs little from ordinary one-way replication
  2. Configure the extract, replicat, and data pump from standby to primary
  3. Start the extract, replicat, and data pump from primary to standby
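The group-creation commands of phase 1 can likewise be gathered into a GGSCI obey script. The sketch below is hypothetical, reusing the names and paths from this article; the params files still have to be edited separately, and dblogin will prompt for the password interactively:

```text
-- create_primary_to_standby.oby (hypothetical; run on the primary with GGSCI> obey ...)
dblogin userid maclean
add trandata clinic.*
add extract extstd1, tranlog, begin now
add exttrail /d01/ext/cl, megabytes 100, extract extstd1
add extract pumpstd1, exttrailsource /d01/ext/cl, begin now
add rmttrail /d01/rmt/cl, megabytes 100, extract pumpstd1
```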

Next we walk through the entire configuration process:

1.
Create the extract, data pump, and replicat from primary to standby
GGSCI (rh2.oracle.com) 10> dblogin userid maclean
Password: 
Successfully logged into database.

GGSCI (rh2.oracle.com) 11> add trandata clinic.*
Logging of supplemental redo data enabled for table CLINIC.TV


GGSCI (rh2.oracle.com) 4> add extract extstd1,tranlog,begin now
EXTRACT added.

GGSCI (rh2.oracle.com) 5> add exttrail /d01/ext/cl,megabytes 100,extract extstd1
EXTTRAIL added.

GGSCI (rh2.oracle.com) 7> view params extstd1

-- Identify the Extract group:
EXTRACT extstd1
-- Specify database login information as needed for the database:
userid maclean, password maclean
-- Specify the local trail that this Extract writes to:
EXTTRAIL /d01/ext/cl
-- Specify sequences to be captured:
SEQUENCE clinic.seqt1;
-- Specify tables to be captured:
TABLE clinic.*;
-- Exclude specific tables from capture if needed:
-- TABLEEXCLUDE 

GGSCI (rh2.oracle.com) 17> add extract pumpstd1,exttrailsource /d01/ext/cl,begin now
EXTRACT added.

GGSCI (rh2.oracle.com) 98> add rmttrail /d01/rmt/cl,megabytes 100,extract pumpstd1
RMTTRAIL added.

GGSCI (rh2.oracle.com) 129> view params pumpstd1
-- Identify the data pump group:
EXTRACT pumpstd1
-- Specify database login information as needed for the database:
userid maclean, password maclean
RMTHOST rh3.oracle.com, MGRPORT 7809
-- Specify the remote trail on the standby system:
RMTTRAIL /d01/rmt/cl
-- Pass data through without mapping, filtering, conversion:
PASSTHRU
sequence clinic.seqt1;
Table clinic.*;


On the standby, configure the replicat for the primary-to-standby path:

GGSCI (rh3.oracle.com) 4> add replicat repstd1,exttrail /d01/rmt/cl,begin now
REPLICAT added.

GGSCI (rh3.oracle.com) 49> view params repstd1
-- Identify the Replicat group:
REPLICAT repstd1
-- State that source and target definitions are identical:
ASSUMETARGETDEFS
-- Specify database login information as needed for the database:
userid maclean, password maclean
-- Specify tables for delivery:
MAP clinic.*, TARGET clinic.*;
-- Exclude specific tables from delivery if needed:
-- MAPEXCLUDE 

2.
Create the extract, data pump, and replicat from standby to primary

GGSCI (rh3.oracle.com) 51> dblogin userid maclean
Password: 
Successfully logged into database.

GGSCI (rh3.oracle.com) 52> add trandata clinic.*
Logging of supplemental redo data enabled for table CLINIC.TV.

/* Don't forget to enable supplemental logging for the relevant tables on the standby side as well */

GGSCI (rh3.oracle.com) 53> add extract extstd2,tranlog,begin now
EXTRACT added.

GGSCI (rh3.oracle.com) 54> add exttrail /d01/ext/cl,megabytes 100,extract extstd2
EXTTRAIL added.

GGSCI (rh3.oracle.com) 58> view params extstd2
-- Identify the Extract group:
EXTRACT extstd2
-- Specify database login information as needed for the database:
userid maclean, password maclean
-- Specify the local trail that this Extract writes to:
EXTTRAIL /d01/ext/cl
-- Specify sequences to be captured:
SEQUENCE clinic.seqt1;
-- Specify tables to be captured:
TABLE clinic.*;
-- Exclude specific tables from capture if needed:
-- TABLEEXCLUDE 

GGSCI (rh3.oracle.com) 59> add extract pumpstd2,exttrailsource /d01/ext/cl,begin now
EXTRACT added.

GGSCI (rh3.oracle.com) 60> add rmttrail /d01/rmt/cl,megabytes 100,extract pumpstd2
RMTTRAIL added.

GGSCI (rh3.oracle.com) 63> view params pumpstd2

-- Identify the data pump group:
EXTRACT pumpstd2
-- Specify database login information as needed for the database:
userid maclean, password maclean
RMTHOST rh2.oracle.com, MGRPORT 7809
-- Specify the remote trail on the standby system:
RMTTRAIL /d01/rmt/cl
-- Pass data through without mapping, filtering, conversion:
PASSTHRU
sequence clinic.seqt1;
Table clinic.*;

Configure the replicat on the primary:


GGSCI (rh2.oracle.com) 136> add replicat repstd2,exttrail /d01/rmt/cl,begin now,checkpointtable maclean.ck
REPLICAT added.

GGSCI (rh2.oracle.com) 138> view params repstd2

-- Identify the Replicat group:
REPLICAT repstd2
-- State that source and target definitions are identical:
ASSUMETARGETDEFS
-- Specify database login information as needed for the database:
userid maclean, password maclean
-- Specify tables for delivery:
MAP clinic.*, TARGET clinic.*;
-- Exclude specific tables from delivery if needed:
-- MAPEXCLUDE 

3.
With the OGG configuration above complete, start the primary-to-standby extract, pump, and replicat:
GGSCI (rh2.oracle.com) 141> start extstd1
Sending START request to MANAGER ...
EXTRACT EXTSTD1 starting


GGSCI (rh2.oracle.com) 142> start pumpstd1
Sending START request to MANAGER ...
EXTRACT PUMPSTD1 starting



GGSCI (rh3.oracle.com) 70> start repstd1
Sending START request to MANAGER ...
REPLICAT REPSTD1 starting

/* If you performed this configuration offline, the application can be enabled at this point */

Next we rehearse a planned primary/standby switchover:

1.
First stop all applications on the primary, just as in a Data Guard switchover; only with no active transactions can the switchover be clean.
2.
On the primary, use the LAG command to check the extract's lag; a message like "At EOF, no more records to process" means all transactions have been captured.
GGSCI (rh2.oracle.com) 144> lag extstd1
Sending GETLAG request to EXTRACT EXTSTD1 ...
Last record lag: 0 seconds.
At EOF, no more records to process.

Once at EOF, stop the extract:
GGSCI (rh2.oracle.com) 146> stop extstd1 
Sending STOP request to EXTRACT EXTSTD1 ...
Request processed.

3.
Likewise run the LAG command against the pump; a message like "At EOF, no more records to process" means all captured data has been shipped to the standby.
GGSCI (rh2.oracle.com) 147> lag pumpstd1
Sending GETLAG request to EXTRACT PUMPSTD1 ...
Last record lag: 3 seconds.
At EOF, no more records to process.

Once at EOF, stop the data pump
GGSCI (rh2.oracle.com) 148> stop pumpstd1
Sending STOP request to EXTRACT PUMPSTD1 ...
Request processed.

4.
Check the replicat's progress on the standby; a message like "At EOF, no more records to process." means all records have been applied.
GGSCI (rh3.oracle.com) 71> lag repstd1
Sending GETLAG request to REPLICAT REPSTD1 ...
Last record lag: 5 seconds.
At EOF, no more records to process.

Once at EOF, stop the replicat

GGSCI (rh3.oracle.com) 72> stop repstd1
Sending STOP request to REPLICAT REPSTD1 ...
Request processed.

5.
Next, grant the business application user on the standby the necessary insert, update, and delete privileges, and enable triggers and cascade delete constraints;
the corresponding operations on the primary are to revoke the application's privileges and to disable its triggers and cascade delete constraints.
The point is that, at any moment, whichever database is playing the standby role must not accept any manual or application-driven data changes other than those applied by OGG,
so that the primary and standby stay consistent.

6.
Alter the start time of the extract on the original standby to now, to ensure it does not capture the older redo logs

GGSCI (rh3.oracle.com) 75> alter extstd2 ,begin now
EXTRACT altered.

GGSCI (rh3.oracle.com) 76> start extstd2

Sending START request to MANAGER ...
EXTRACT EXTSTD2 starting


If the pump and replicat from standby to primary were not started earlier, start them now:

GGSCI (rh3.oracle.com) 78> start pumpstd2

Sending START request to MANAGER ...
EXTRACT PUMPSTD2 starting

GGSCI (rh2.oracle.com) 161> start repstd2

Sending START request to MANAGER ...
REPLICAT REPSTD2 starting

7. We can now officially start the application on the original standby, which is now the primary


Next we attempt to switch back to the original primary:
1. The preliminary steps mirror the earlier switchover: first stop any applications on the original standby,
then use the LAG command to confirm the progress of the extract and replicat, and stop them once confirmed.
Complete the maintenance work on the primary: granting privileges, enabling triggers, and so on.

2. Alter the begin time of the extract on the original primary to now, so that it does not process the older redo logs:
GGSCI (rh2.oracle.com) 165> alter extract extstd1,begin now
EXTRACT altered.

3. At this point we can already start the application on the original primary, which is now the primary again

4. Start the originally configured primary-to-standby extract, pump, and replicat:

GGSCI (rh2.oracle.com) 166> start extstd1

Sending START request to MANAGER ...
EXTRACT EXTSTD1 starting

GGSCI (rh2.oracle.com) 171> start pumpstd1

Sending START request to MANAGER ...
EXTRACT PUMPSTD1 starting

GGSCI (rh3.oracle.com) 86> start repstd1

Sending START request to MANAGER ...
REPLICAT REPSTD1 starting


This completes a planned switchover between the primary and standby databases in an OGG Live Standby configuration. That's great!
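The GGSCI portion of the planned switchover can be sketched as an obey script. This is hypothetical, using the group names from this article; issue each stop only after LAG reports "At EOF", and perform the privilege and trigger work in the database between the stop phase and the start phase:

```text
-- switchover.oby (hypothetical sketch; group names as in this article)
-- on the primary: drain and stop the source side
lag extract extstd1
stop extract extstd1
lag extract pumpstd1
stop extract pumpstd1
-- on the standby: drain and stop delivery
lag replicat repstd1
stop replicat repstd1
-- reverse direction: repoint extstd2 at the current redo and start it
alter extract extstd2, begin now
start extract extstd2
start extract pumpstd2
-- on the original primary: start delivery
start replicat repstd2
```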
