9i RAC: Manual Backup and Recovery

Shankar Govindan
Introduction
When we plan to move a large or heavily used OLTP database to a cluster to gain the horizontal
scaling of 9i RAC, we usually have many questions: how will we handle the database maintenance
tasks we have traditionally done, how will we set them up, and what changes or tools need to be
in place to handle RAC?
In this paper we look at one of the most important administration jobs, backing up a 9i RAC
database. We also look at how to recover from a simple data file loss and, at the other extreme,
from a disaster.
Backup Method
We can use RMAN to back up and recover a database, but if the database is huge or the backup
takes more than 5 hours, precious time is lost. Sites that have large databases can look at
shadow copy instead.
We use a Hitachi RAID system with Hitachi Shadow copy. (For more information on the Hitachi
Shadow copy options, you can visit their website.) The backup is done in the traditional manner:
• the tablespace is put in begin backup mode,
• the database files are copied, and
• the tablespace is then put back with end backup.
The shadow backup works by having a media server with its own set of disks. The disks are
attached to the production server and synced; syncing a terabyte takes about 6 hours, and it can
happen at any time. Once the media server disks are in sync with the production server, we fire a
script that puts all the tablespaces in backup mode. The synced mirror disk is then split off from
the production server, which takes less than 3 minutes. The tablespaces are then taken out of
backup mode. From the database's point of view the backup is now complete and has taken only
about 3 minutes. The mirror disk is then copied to tape offline using the media server.
We can also sync the backup archive log directories and copy them to tape.
Setting up environment for Manual Backup
Let’s see how we can do a traditional backup in a 9i RAC environment. A RAC setup has multiple
nodes, all pointing to a single database, but the backup of the database should be initiated from
a single node, usually the primary node. The primary node is the one you set up first and from
which you do most of the maintenance.
The first thing to do is set up your profile so that when you log in as the owner of the
database, in our case usually ‘ORACLE’, your environment is set up correctly. We have to
remember that the database name, RPTQ in our case, is used only by clients connecting to the
database (this can also be masked by using an alias in tnsnames). When we log in locally to one
of the nodes, we have to set our ORACLE_SID to the instance name and not to the database name.
Let’s say we have a two-node setup (LJCQS034 and LJCQS035) with instances rptq1 and rptq2
pointing to a database RPTQ (remember that SIDs are case sensitive in Unix, and it is better to
set them all up in lower case to avoid confusion), as shown in the figure below.
When we log in to server LJCQS034, which hosts instance rptq1, we set our ORACLE_SID to rptq1
(presuming our ORACLE_HOME is shared; otherwise point ORACLE_HOME to instance rptq1’s Oracle
home).
We will execute all the maintenance jobs as a DBA locally and set up our environment on a single
node, the primary node, to execute all our automatic maintenance and monitoring scripts. Your
cron for the database will run on the primary node, although you can copy the same cron to the
second node for failover and activate it manually.
Setting up your .profile for a single node will be as shown below:
#----------------------------------------------------------------------------------------
# If you are oracle and logging in one of the instance, then set that specific env
#----------------------------------------------------------------------------------------
if [ "`/usr/ucb/whoami`" = "oracle" ] && [ "`hostname`" = "LJCQS034" ]; then
    . ./rptq1.env
else
    . ./rptq2.env
fi
You can include the same lines in your second node’s .profile. The env script rptq1.env should
exist and will look like:
#!/bin/ksh
#|----------------------------------------------------------------------|
#|                                                                      |
#| filename: /dba/etc/rptq1.env                                         |
#|                                                                      |
#| Created by : Shankar Govindan                                        |
#| Dated      : 23-July-2003                                            |
#|                                                                      |
#|                                                                      |
#| History :                                                            |
#|                                                                      |
#|----------------------------------------------------------------------|
# If TWO_TASK is set, unset it
if [ -n "$TWO_TASK" ]; then
unset TWO_TASK
fi
export LD_LIBRARY_PATH=/usr/lib:/usr/ccs/lib:/usr/ucblib:/usr/dt/lib:/lib:/sv03/sw/oracle/rptqdb/9.2.0/lib:/sv03/sw/oracle/rptqdb/9.2.0/lib32:/sv03/sw/oracle/rptqdb/9.2.0/jdbc/lib:.
export NLS_DATE_FORMAT=DD-MON-RR
export NLS_DATE_LANGUAGE=AMERICAN
export NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
export NLS_NUMERIC_CHARACTERS=.,
export NLS_SORT=BINARY
export ORACLE_BASE=/sv03
export ORACLE_HOME=/sv03/sw/oracle/rptqdb/9.2.0
export ORACLE_SID=rptq1
export ORACLE_TERM=vt100
unset ORAENV_ASK; export ORAENV_ASK
export ORA_NLS32=/sv03/sw/oracle/rptqdb/9.2.0/ocommon/nls/admin/data
export ORA_NLS33=/sv03/sw/oracle/rptqdb/9.2.0/ocommon/nls/admin/data
export ORA_NLS=/sv03/sw/oracle/rptqdb/9.2.0/ocommon/nls/admin/data
export PATH=/sv03/sw/oracle/rptqdb/9.2.0/bin:/bin:/usr/bin:/usr/sbin:/usr/ccs/bin:/usr/ucb:/opt/local/bin:/opt/hpnp/bin:/usr/local/bin:/usr/openwin/bin:/dba/bin:/dba/sbin:.
export TNS_ADMIN=/sv03/sw/oracle/rptqdb/9.2.0/network/admin
export UDUMP=/sv03/oracle/admin/rptq/udump
export PFILE=/sv03/oracle/admin/rptq/pfile
export BDUMP=/sv03/oracle/admin/rptq/bdump
If you need to automate the backup and the scripts run from cron, then you need to source the
ORACLE_SID from somewhere, either from a file or from the oratab. The oratab entry for a 9i RAC
instance has the instance name and not the database name. For example, if we are initiating the
backup from the primary node, then the oratab of the primary node will have:
rptq1:/sv03/sw/oracle/rptqdb/9.2.0:N
The second node will have rptq2 in its oratab.
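As a minimal sketch of sourcing the instance name from the oratab in a cron script, the functions below read the first usable entry and its Oracle home. The function names and the ORATAB override variable are hypothetical; the default path is the Solaris location that appears later in this paper.

```shell
#!/bin/sh
# Sketch only: look up the instance name and ORACLE_HOME from an oratab file.
# ORATAB defaults to /var/opt/oracle/oratab; first_sid and home_for_sid are
# hypothetical helper names, not Oracle-supplied tools.
ORATAB=${ORATAB:-/var/opt/oracle/oratab}

# Print the instance name (field 1) of the first non-comment, non-blank entry.
first_sid() {
    awk -F: '!/^#/ && NF >= 2 && $1 != "" { print $1; exit }' "$ORATAB"
}

# Print the ORACLE_HOME (field 2) registered for the given instance name.
home_for_sid() {
    awk -F: -v sid="$1" '!/^#/ && $1 == sid { print $2; exit }' "$ORATAB"
}
```

A cron script on the primary node could then set ORACLE_SID from first_sid and ORACLE_HOME from home_for_sid "$ORACLE_SID" and export both before calling sqlplus.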
Manual Backup of data files
The backup of the database is initiated from the primary node or primary instance. We don’t have
to setup anything or initiate anything from the secondary node or any other nodes that are
pointing to the database.
Login to the primary instance as a user who has the privilege to backup the datafiles and initiate
the backup commands.
• Alter tablespace tablespace_name begin backup;
• Copy datafiles to backup directory/server and then end the tablespace backup.
• Alter tablespace tablespace_name end backup;
We need to do this for all the tablespaces. If you are using the shadow copy concept, as we do
for large databases, then we put all the tablespaces into backup mode at once and then break the
mirror disk. We then take all the tablespaces out of backup mode. The mirror copy goes to tape
offline.
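The begin and end steps for many tablespaces can be generated rather than typed. The sketch below only prints the statements; the function name and its calling convention are hypothetical, and the mirror split itself is site specific and not shown.

```shell
#!/bin/sh
# Sketch only: emit "begin backup" or "end backup" statements for a list of
# tablespaces. backup_mode_sql is a hypothetical helper name.
# usage: backup_mode_sql begin|end TABLESPACE...
backup_mode_sql() {
    case "$1" in
        begin|end) ;;
        *) echo "usage: backup_mode_sql begin|end tablespace..." >&2
           return 1 ;;
    esac
    mode=$1
    shift
    for ts in "$@"; do
        printf 'alter tablespace %s %s backup;\n' "$ts" "$mode"
    done
}
```

The output can be piped into a SQL*Plus session connected as a privileged user, once with begin before the mirror is broken and once with end afterwards. The tablespace list is assumed to come from the data dictionary, leaving out temporary tablespaces that use tempfiles.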
Manual Backup of archive logs
The archive logs generated during the backup must be included in the backup set for a meaningful
recovery. In case of a disaster where we need to recover to the last backup, the last few archive
logs are required to bring the database to a consistent state with a cancel-based recovery.
In a RAC environment we set up the archive format to include %t, which identifies the thread
(instance) that generated the archive log.
log_archive_format = arch%s_%t.arc
Traditionally we force a log switch before and after a backup to get the timestamps onto the data
files and to push the last few logs to be archived, so that they reach tape as part of the backup
set. We normally execute:
Alter system checkpoint;
Alter system switch logfile;
In a RAC environment, the switch logfile command switches the log file only for the instance
where it is issued. We have to remember that the other instances generate archive logs too, and
these archives need to be part of the backup set for any meaningful recovery.
To force a log switch and push the logs to archive for all the instances, we need to execute:
Alter system archive log current;
Alter system archive log current;
Once these archive logs are written and visible, they are compressed or moved to the backup
archivelog directory and become part of the backup. If you have set up shadow copy, the backup
archivelog directory can also be synced and mirrored, so that it becomes part of the backup set
when the mirror is broken after the data file backup.
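Because every thread must be represented in the backup set, it can help to check the backup archivelog directory before it goes to tape. The following is a sketch that counts archive logs per thread, assuming the arch%s_%t.arc naming shown above; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: report how many archived logs each thread has in a directory,
# assuming file names of the form arch<sequence>_<thread>.arc.
# arch_thread and threads_in_dir are hypothetical helper names.

# Print the thread number encoded in an archive log file name.
arch_thread() {
    name=${1##*/}          # strip any directory part
    name=${name%.arc}      # strip the extension
    echo "${name##*_}"     # the text after the last underscore is %t
}

# Print "thread count" pairs, one per line, sorted by thread number.
threads_in_dir() {
    for f in "$1"/arch*_*.arc; do
        [ -e "$f" ] || continue
        arch_thread "$f"
    done | sort -n | uniq -c | awk '{ print $2, $1 }'
}
```

For a two-instance database, a listing that shows only thread 1 would suggest that the logs of the second instance have not been copied yet.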
Manual backup of server config file
There is a server config file that stores all the database and instance information when the RAC
setup is created. The name of the file is srvm.dbf.
There is also a file in /var/opt/oracle called srvConfig.loc:
oracle ljcqs034:=> pwd
/var/opt/oracle
oracle ljcqs034:=> ls -ltr
total 8
-rw-r--r--   1 oracle   dba           47 Aug 18 16:47 srvConfig.loc
-rw-r--r--   1 oracle   dba          123 Aug 26 13:18 oraInst.loc
-rw-rw-r--   1 oracle   other        812 Aug 29 18:35 oratab
The srvConfig.loc file contains the pointer to the location of the srvm.dbf.
srvconfig_loc=/sv00/db00/oradata/rptq/srvm.dbf
We have to make sure that this file is in the Oracle data file (dbf) directory, so that it gets
backed up periodically. In case you are not shadow copying your data files, you need to back this
file up as part of your backup procedure. (You can always work around the loss of this file by
recreating the RAC setup using the srvctl commands.)
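A backup procedure can locate the configuration file through srvConfig.loc instead of hard-coding its path. This is a sketch under the assumption that srvm.dbf is an ordinary file on a file system, as it is in this paper, and not a raw device; the function names and the SRVCONFIG_LOC_FILE override are hypothetical.

```shell
#!/bin/sh
# Sketch only: copy the server configuration file to a backup directory.
# SRVCONFIG_LOC_FILE defaults to the Solaris path used in this paper;
# srvconfig_path and backup_srvconfig are hypothetical helper names.
SRVCONFIG_LOC_FILE=${SRVCONFIG_LOC_FILE:-/var/opt/oracle/srvConfig.loc}

# Print the path stored in the srvconfig_loc property.
srvconfig_path() {
    sed -n 's/^srvconfig_loc=//p' "$SRVCONFIG_LOC_FILE" | head -n 1
}

# Copy the configuration file into the directory given as the first argument.
backup_srvconfig() {
    src=$(srvconfig_path)
    if [ -z "$src" ] || [ ! -f "$src" ]; then
        echo "server configuration file not found" >&2
        return 1
    fi
    cp -p "$src" "$1/"
}
```

Calling backup_srvconfig with the backup directory as its argument after the data file copy keeps the file in the same backup set.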
Recovery from a lost data file
Let’s try to simulate the loss of a data file and try to recover the same in a RAC environment.
• Create a tablespace
SQL> create tablespace testrac datafile
'/Ora_base/db11/oradata/drac/testrac_01.dbf' size 100M
extent management local uniform size 4M
SEGMENT SPACE MANAGEMENT AUTO;
SQL> alter tablespace testrac add datafile
'/Ora_base/db11/oradata/drac/testrac_02.dbf' size 100M;
Tablespace altered.
SQL> select file_name,bytes from dba_data_files where tablespace_name
like 'TESTRAC';
FILE_NAME                                          BYTES
-------------------------------------------------- ----------
/Ora_base/db11/oradata/drac/testrac_01.dbf          104857600
/Ora_base/db11/oradata/drac/testrac_02.dbf          104857600
SQL> alter user sxgovind default tablespace testrac;
User altered.
SQL> connect sgovind
• Create some tables and load data
SQL> create table ruby as select * from dba_objects;
Table created.
SQL> create table hammerhead as select * from dba_tables;
Table created.
SQL> select segment_name,segment_type,tablespace_name from dba_segments
where tablespace_name like 'TESTRAC';
SEGMENT_NAME         SEGMENT_TYPE         TABLESPACE_NAME
-------------------- -------------------- --------------------
RUBY                 TABLE                TESTRAC
HAMMERHEAD           TABLE                TESTRAC
• Back up the tablespace
SQL> alter tablespace TESTRAC begin backup;
SQL>select d.name,b.status,b.time from v$datafile d,v$backup b
where d.file#=b.file# and b.status = 'ACTIVE';
NAME                                          STATUS             TIME
--------------------------------------------- ------------------ ---------
/Ora_base/db11/oradata/drac/testrac_01.dbf    ACTIVE             22-MAY-03
/Ora_base/db11/oradata/drac/testrac_02.dbf    ACTIVE             22-MAY-03
oracle ljcqs097:=> cp testrac_01.dbf $HOME
SQL> alter tablespace testrac end backup;
Tablespace altered.
• Remove data file associated with the tablespace
oracle ljcqs097:=> rm testrac_01.dbf
SQL> select file_name,bytes from dba_data_files where tablespace_name
like 'TESTRAC';
select file_name,bytes from dba_data_files where tablespace_name like
'TESTRAC'
*
ERROR at line 1:
ORA-01116: error in opening database file 64
ORA-01110: data file 64: '/Ora_base/db11/oradata/drac/testrac_01.dbf'
ORA-27041: unable to open file
SVR4 Error: 2: No such file or directory
Additional information: 3
• Recover data file associated with the tablespace
SQL> alter database datafile
'/Ora_base/db11/oradata/drac/testrac_01.dbf' offline;
Database altered.
oracle ljcqs098:=> cp testrac_01.dbf /Ora_base/db11/oradata/drac
SQL> alter database recover datafile
'/Ora_base/db11/oradata/drac/testrac_01.dbf';
alter database recover datafile
'/Ora_base/db11/oradata/drac/testrac_01.dbf'
*
ERROR at line 1:
ORA-00279: change 1936203896230 generated at 05/22/2003 14:01:09 needed for thread 2
ORA-00289: suggestion : /shared/arch/oradata/drac/arch/arch_2_12.arc
ORA-00280: change 1936203896230 for thread 2 is in sequence #12
SQL> ALTER DATABASE RECOVER CANCEL;
SQL> recover datafile '/Ora_base/db11/oradata/drac/testrac_01.dbf';
ORA-00279: change 1936203896230 generated at 05/22/2003 14:01:09 needed for thread 2
ORA-00289: suggestion : /shared/arch/oradata/drac/arch/arch_2_12.arc
ORA-00280: change 1936203896230 for thread 2 is in sequence #12
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
AUTO
ORA-00279: change 1936203896361 generated at 05/22/2003 14:06:13 needed for thread 2
ORA-00289: suggestion : /shared/arch/oradata/drac/arch/arch_2_13.arc
ORA-00280: change 1936203896361 for thread 2 is in sequence #13
ORA-00278: log file '/shared/arch/oradata/drac/arch/arch_2_12.arc' no longer needed for this recovery
Log applied.
Media recovery complete.
SQL> alter database datafile
'/Ora_base/db11/oradata/drac/testrac_01.dbf' online;
Database altered.
SQL> select file_name,bytes from dba_data_files where tablespace_name
like 'TESTRAC';
FILE_NAME
-------------------------------------------------/Ora_base/db11/oradata/drac/testrac_01.dbf
/Ora_base/db11/oradata/drac/testrac_02.dbf
BYTES
---------104857600
104857600
Done.
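The statements used in this walkthrough can be generated for any data file. The sketch below prints them in two parts, because the file has to be restored from the backup copy between the offline step and the recover step. The function names are hypothetical, and the use of the SQL*Plus setting AUTORECOVERY in place of answering AUTO at the prompt is an assumption about how one might script it, not something the test above did.

```shell
#!/bin/sh
# Sketch only: print the SQL*Plus statements for recovering one data file.
# datafile_offline_sql and datafile_recover_sql are hypothetical helper names.

# Step 1: take the damaged data file offline.
datafile_offline_sql() {
    printf "alter database datafile '%s' offline;\n" "$1"
}

# Step 2 (after the file has been copied back from backup): recover it,
# applying the suggested archive logs automatically, and bring it online.
datafile_recover_sql() {
    printf 'set autorecovery on\n'
    printf "recover datafile '%s';\n" "$1"
    printf "alter database datafile '%s' online;\n" "$1"
}
```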
Recovery from a Disaster
When we recover a lost data file, we recover it online while both instances of the RAC
environment are up and running. We only take the data file offline and recover it from a backup,
the same way we do in a non-RAC setup.
Let’s see what happens at the other extreme, when we have a disaster and need to restore the last
backup from tape and either recover to a point in time or apply the available archive logs and do
a cancel-based recovery.
We need to remember that we cannot bring up both instances of the database. We need to mount the
database in single instance mode and then initiate the recovery. Once the recovery is complete we
can bring all the other nodes up.
We don’t have to make any changes to the environment or the parameter files.
Verify that the srvm.dbf file exists in its location and has not been overwritten by the copy of
the data files. Check the node visibility and start the GSD daemon to check that the RAC
configuration is okay.
oracle ljcqs034:=> lsnodes
ljcqs034
ljcqs035
oracle ljcqs034:=> gsdctl stat
GSD is not running on the local node
oracle ljcqs035:=> gsdctl stat
GSD is not running on the local node
oracle ljcqs034:=> gsdctl start
Failed to start GSD on local node
We have to make sure that the server is configured correctly and the GSD daemon is up and
running. In case the srvm.dbf file is lost and the GSD does not come up, recreate the RAC
configuration as explained in the Recovery from loss of srvConfig information section of this
note.
Once the GSD daemon is up and running, we start up the database in single instance mode.
Log in to the primary node and verify the ORACLE_SID:
oracle ljcqs034:=> echo $ORACLE_SID
rptq1
oracle ljcqs032:=> sqlplus /nolog
SQL*Plus: Release 9.2.0.3.0 - Production on Wed Oct 8 09:24:28 2003
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
SQL> connect / as sysdba
Connected to an idle instance.
SQL> startup mount;
SQL> recover database using backup controlfile until cancel;
ORA-00279: change 1937810322614 generated at 08/16/2003 12:29:16 needed for thread 1
ORA-00289: suggestion : /sv04/data/arch/rptq/arch703203_1.arc
ORA-00280: change 1937810322614 for thread 1 is in sequence #703203
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
ORA-00279: change 1937810322614 generated at 08/16/2003 12:29:16 needed for thread 2
It does not suggest which archive log file to apply for thread 2. You need to choose the latest
one that matches the timestamp of the one applied for thread 1 and start applying from there.
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/sv04/data/arch/rptq/arch3636_2.arc
ORA-00279: change 1937810325033 generated at 08/16/2003 12:35:50 needed for thread 1
ORA-00289: suggestion : /sv04/data/arch/rptq/arch703204_1.arc
ORA-00280: change 1937810325033 for thread 1 is in sequence #703204
ORA-00278: log file '/sv04/data/arch/rptq/arch703203_1.arc' no longer needed for this recovery
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
This time it suggests which file is required for thread 2. Only the first prompt seems to be an
issue; once the first correct archive log file for thread 2 has been applied, recovery prompts
for the rest.
ORA-00279: change 1937810464267 generated at 08/16/2003 13:07:38 needed for thread 2
ORA-00289: suggestion : /sv04/data/arch/rptq/arch3637_2.arc
ORA-00280: change 1937810464267 for thread 2 is in sequence #3637
ORA-00278: log file '/sv04/data/arch/rptq/arch3636_2.arc' no longer needed for this recovery
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
ORA-00279: change 1937810500912 generated at 08/16/2003 13:12:27 needed for thread 2
ORA-00289: suggestion : /sv04/data/arch/rptq/arch3638_2.arc
ORA-00280: change 1937810500912 for thread 2 is in sequence #3638
ORA-00278: log file '/sv04/data/arch/rptq/arch3637_2.arc' no longer needed for this recovery
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
ORA-00279: change 1937810531734 generated at 08/16/2003 13:15:44 needed for thread 2
ORA-00289: suggestion : /sv04/data/arch/rptq/arch3639_2.arc
ORA-00280: change 1937810531734 for thread 2 is in sequence #3639
ORA-00278: log file '/sv04/data/arch/rptq/arch3638_2.arc' no longer needed for this recovery
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
CANCEL
Media recovery cancelled.
SQL> alter database open resetlogs;
Database altered.
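When recovery does not suggest a file for a thread, as happened with the first prompt for thread 2 above, a listing of that thread's archive logs in sequence order makes it easier to find the log to supply. This is a sketch assuming the arch%s_%t.arc naming and directory paths without spaces; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: list the archived logs of one thread in sequence order,
# assuming names of the form arch<sequence>_<thread>.arc.
# logs_for_thread is a hypothetical helper name.
# usage: logs_for_thread DIRECTORY THREAD
logs_for_thread() {
    for f in "$1"/arch*_"$2".arc; do
        [ -e "$f" ] || continue
        n=${f##*/arch}      # drop the directory and the "arch" prefix
        n=${n%%_*}          # keep the sequence digits before the underscore
        printf '%s %s\n' "$n" "$f"
    done | sort -n | cut -d' ' -f2-
}
```

The sort is numeric on the sequence number, so a log with sequence 10000 is listed after 3637 rather than before it, which a plain ls would not guarantee.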
If your database is set up to use tempfiles for the temporary tablespace, then you need to
recreate them.
ALTER TABLESPACE TEMP ADD TEMPFILE
'/sv00/db13/oradata/rptq/temp_01.dbf'
SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE
'/sv00/db13/oradata/rptq/temp_02.dbf'
SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE
'/sv00/db13/oradata/rptq/temp_03.dbf'
SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE
'/sv00/db13/oradata/rptq/temp_04.dbf'
SIZE 2044M REUSE AUTOEXTEND OFF;
ALTER TABLESPACE TEMP ADD TEMPFILE
'/sv00/db13/oradata/rptq/temp_05.dbf'
SIZE 2044M REUSE AUTOEXTEND OFF;
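Since the statements above differ only in the file number, they can be produced by a small loop. The sketch below prints them for a given directory and file count; the function name is hypothetical, and the default size simply repeats the value used above.

```shell
#!/bin/sh
# Sketch only: print one ADD TEMPFILE statement per file number.
# tempfile_sql is a hypothetical helper name.
# usage: tempfile_sql DIRECTORY COUNT [SIZE]
tempfile_sql() {
    dir=$1
    count=$2
    size=${3:-2044M}
    i=1
    while [ "$i" -le "$count" ]; do
        printf "ALTER TABLESPACE TEMP ADD TEMPFILE '%s/temp_%02d.dbf' SIZE %s REUSE AUTOEXTEND OFF;\n" \
            "$dir" "$i" "$size"
        i=$((i + 1))
    done
}
```

Running tempfile_sql /sv00/db13/oradata/rptq 5 prints the five statements shown above, each on a single line.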
Recovery from loss of srvConfig information
If you lose the srvm.dbf file and the GSD daemon does not come up, then it’s time to recreate
the srvm.dbf by recreating the server configuration.
oracle ljcqs098:=> which gsd
/shared/oracle/product/9.2.0/bin/gsd
oracle ljcqs098:=> cd $ORACLE_HOME
oracle ljcqs032:=> gsdctl start
Failed to start GSD on local node
The following command wipes out all the information that previously existed in the srvm.dbf file.
Once your environment is set up, you should not execute it again.
oracle ljcqs032:=> srvconfig -init -f
oracle.ops.mgmt.rawdevice.RawDeviceException: PRKR-1025 : file /var/opt/oracle/srvConfig.loc does not contain property srvconfig_loc
        at java.lang.Throwable.<init>(Compiled Code)
        at java.lang.Exception.<init>(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceException.<init>(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.getDeviceName(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(Compiled Code)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
The file srvm.dbf was not part of the backup and hence was not recovered. The file does not
exist, and the srvconfig command points that out.
The workaround is to create a new file and update /var/opt/oracle/srvConfig.loc with the new
location of the srvm.dbf file.
oracle ljcqs032:=>touch /sv00/db00/oradata/rpt1/srvm.dbf
oracle ljcqs032:=>chmod 755 /sv00/db00/oradata/rpt1/srvm.dbf
oracle ljcqs032:=>cd /var/opt/oracle
oracle ljcqs032:=>vi srvConfig.loc
Add this line and save:
srvconfig_loc=/sv00/db00/oradata/rpt1/srvm.dbf
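The same workaround can be scripted so that the property is replaced rather than edited by hand. This is a sketch only; the function name is hypothetical and it assumes the .loc file holds simple name=value lines.

```shell
#!/bin/sh
# Sketch only: create an empty configuration file if needed and point the
# srvconfig_loc property at it. point_srvconfig is a hypothetical helper name.
# usage: point_srvconfig LOC_FILE NEW_SRVM_PATH
point_srvconfig() {
    loc_file=$1
    srvm=$2
    if [ ! -e "$srvm" ]; then
        : > "$srvm" || return 1
    fi
    chmod 755 "$srvm" || return 1
    # Keep every other line of the .loc file and replace the property.
    tmp="$loc_file.tmp.$$"
    {
        [ -f "$loc_file" ] && grep -v '^srvconfig_loc=' "$loc_file"
        echo "srvconfig_loc=$srvm"
    } > "$tmp"
    # Overwrite in place so the original file keeps its owner and mode.
    cat "$tmp" > "$loc_file" && rm -f "$tmp"
}
```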
Now start the GSD daemon and then start adding the database and instance information to the
srvm.dbf file.
oracle ljcqs098:=> gsdctl start
Successfully started GSD on local node
oracle ljcqs097:=> srvctl add database -d rptq -o /shared/oracle/product/9.2.0
oracle ljcqs098:=> srvctl add instance -d rptq -i rptq1 -n ljcqs097
oracle ljcqs098:=> srvctl add instance -d rptq -i rptq2 -n ljcqs098
Check if the configuration has been setup correctly.
oracle ljcqs097:=> srvctl config
rptq
oracle ljcqs097:=> srvctl config database -d rptq
ljcqs097 rptq1 /shared/oracle/product/9.2.0
ljcqs098 rptq2 /shared/oracle/product/9.2.0
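For a cluster with more instances, the registration commands above follow a fixed pattern and can be printed from a list of instance and node pairs. The sketch below only prints the commands for review; the function name and the INSTANCE:NODE argument format are hypothetical.

```shell
#!/bin/sh
# Sketch only: print the srvctl commands that register a database and its
# instances. srvctl_commands is a hypothetical helper name.
# usage: srvctl_commands DB ORACLE_HOME INSTANCE:NODE...
srvctl_commands() {
    db=$1
    home=$2
    shift 2
    echo "srvctl add database -d $db -o $home"
    for pair in "$@"; do
        # ${pair%%:*} is the instance name, ${pair#*:} is the node name.
        echo "srvctl add instance -d $db -i ${pair%%:*} -n ${pair#*:}"
    done
    echo "srvctl config database -d $db"
}
```

Keeping the printed commands with the backup scripts documents the configuration, so it can be recreated quickly if srvm.dbf is lost again.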
Shankar Govindan works as a Sr. Oracle DBA at CNF Inc, Portland, Oregon. Shankar
Govindan is Oracle Certified 7, 8 and 8i; you can contact him at
shankargovindan@yahoo.com. Note: As usual, the above information reflects my individual tests
and opinions and has nothing to do with the company I work for or represent.