Bare Metal Restore Procedure for Compute Nodes in an Exadata Environment


Remove the Failed Database Server from the Cluster

1. Disable and stop the listener that runs on the failed database server:

[oragrid@surviving ~]$ srvctl disable listener -n dm01db01
[oragrid@surviving ~]$ srvctl stop listener -n dm01db01
PRCC-1017 : LISTENER was already stopped on dm01db01
PRCR-1005 : Resource ora.LISTENER.lsnr is already stopped

[oragrid@surviving ~]$


2. Delete the inactive instance using DBCA (Instance Management), and remove the failed node from the Oracle Home's node list in the Oracle inventory:

[oracle@surviving ~]$ dbca    (run from a vncserver session, since DBCA is a GUI tool)
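If a graphical session is not available, DBCA can also remove the instance in silent mode. A minimal sketch, assuming a database named dbm with instance dbm1 on the failed node (the database name, instance name, and password placeholder are assumptions, not values from this environment):

[oracle@surviving ~]$ dbca -silent -deleteInstance -nodeList dm01db01 -gdbName dbm -instanceName dbm1 -sysDBAUserName sys -sysDBAPassword <sys_password>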

[oracle@surviving ~]$ cd $ORACLE_HOME/oui/bin
[oracle@surviving bin]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/dbhome_1 "CLUSTER_NODES=dm01db02"
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 23984 MB    Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
'UpdateNodeList' was successful.
[oracle@surviving bin]$



3. Verify that the failed database server is unpinned:

[oragrid@surviving ~]$ olsnodes -s -t
dm01db01       Inactive        Unpinned
dm01db02       Active  Unpinned
[oragrid@surviving ~]$
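If the failed node were reported as Pinned, it would have to be unpinned (as root) before it could be deleted from the cluster; a sketch:

[root@surviving ~]# /u01/app/11.2.0.4/grid/bin/crsctl unpin css -n dm01db01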


4. Stop and remove the VIP resource for the failed database server:

[root@surviving ~]# cd /u01/app/11.2.0.4/grid/bin/
[root@surviving bin]# ./srvctl stop vip -i dm01db01-vip
[root@surviving bin]# ./srvctl remove vip -i dm01db01-vip
Please confirm that you intend to remove the VIPs dm01db01-vip (y/[n]) y
[root@surviving bin]#


5. Delete the node from the cluster:

[root@surviving bin]# ./crsctl delete node -n dm01db01
CRS-4661: Node dm01db01 successfully deleted.
[root@surviving bin]#
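At this point olsnodes should list only the surviving node; a quick sanity check (expected output shown for this two-node example):

[root@surviving bin]# ./olsnodes -s -t
dm01db02       Active  Unpinned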


6. Update the Oracle Inventory:

[root@surviving bin]# su - oragrid
[oragrid@surviving ~]$ cd ${ORACLE_HOME}/oui/bin
[oragrid@surviving bin]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/11.2.0.4/grid "CLUSTER_NODES=dm01db02" CRS=TRUE
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 23984 MB    Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
'UpdateNodeList' was successful.
[oragrid@surviving bin]$



7. Verify the node deletion is successful:

[oracle@surviving]$ cluvfy stage -post nodedel -n dm01db01 -verbose

Performing post-checks for node removal
Checking CRS integrity...
The Oracle clusterware is healthy on node "surviving"
CRS integrity check passed
Result:
Node removal check passed
Post-check for node removal was successful.

[oragrid@surviving bin]$


Image Replacement Database Server using the ISO file image

1. Transfer the ISO image file to the desktop from which the web ILOM will be used for the reimage process.
2. Log in to the ILOM web interface and enable the remote console.
3. Attach the ISO image to the CDROM device.
4. In the ILOM web interface, go to the Remote Control tab, then the Host Control tab. From the Next Boot Device list, select CDROM. The next time the server is rebooted, it will boot from the attached ISO image. This setting applies to one boot only; after that, the default BIOS boot order remains in effect. (For a command-line alternative, see the ipmitool sketch after this list.)
5. Reboot the server and let the boot process pick up the ISO image and start the reimage process.
6. The system boots and should detect the ISO image media. Allow the system to boot.
a. The first phase of the imaging process identifies any BIOS or firmware that is out of date, and upgrades the components to the level expected by the ImageMaker. If any components must be upgraded (or downgraded), the system is automatically rebooted.
b. The second phase of the imaging process installs the factory image onto the replacement database server. At the end of the imaging process, a message asks you to remove the installation media (here, the attached ISO image) from the server and then press Enter to power off the server.
7. Detach the ISO image from the replacement database server.
8. Press Enter to power off the server.
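As an alternative to the web interface in step 4, the one-time boot device can also be set over IPMI; a sketch, where the ILOM host name and credentials are placeholders:

ipmitool -I lanplus -H dm01db01-ilom -U root -P <ilom_password> chassis bootdev cdrom
ipmitool -I lanplus -H dm01db01-ilom -U root -P <ilom_password> chassis power cycle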


Configure Replacement Database Server


At this point, the replacement database server has no host name, IP, DNS, or NTP settings. This section describes how that information is supplied through a process called "Configuring Oracle Exadata." The following list shows the information you will be asked to provide:
1. Name servers
2. Time zone (for example, America/Chicago)
3. NTP servers
4. IP address information for the Management Network
5. IP address information for the Client Access Network
6. IP address information for the InfiniBand Network
7. The canonical host name
8. The default gateway

This information should be consistent across all database servers in the Database Machine, and the IP addresses can be obtained from DNS. In addition, a document containing all of this information should have been provided when the Database Machine was installed.

To begin the configuration process, power on the replacement database server. When the system boots, it automatically runs the "Configuring Oracle Exadata" routine and prompts you for the information listed above. After all the information is entered, the system prompts you to confirm the settings and then completes the boot process.

Note: If the database server does not use all of its network interfaces, then the configuration process stops with a warning that some network interfaces are disconnected, and prompts you whether to retry the discovery phase. Answer "Yes" or "No", as appropriate.

Note: If bonding is used for the Client Access Network, then this will be set up in the default active-passive mode at this time.


Prepare Replacement Database Server for the Cluster

1. Using the files on a surviving database server for reference, copy or merge the contents of the following files:

Copy the /etc/security/limits.conf file.

Merge the contents of /etc/hosts.

Copy the /etc/oracle/cell/network-config/cellinit.ora file and update the IP Address to reflect the IP Address of the bond0 interface on the replacement database server.

Copy the /etc/oracle/cell/network-config/cellip.ora file. The content of the cellip.ora file should be the same on all database servers.
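For reference, cellinit.ora holds this database server's own InfiniBand IP address, while cellip.ora lists the IP addresses of the storage cells. A sketch of typical contents, with assumed example addresses:

ipaddress1=192.168.10.1/24          (cellinit.ora: the bond0 address of this node)

cell="192.168.10.3"                 (cellip.ora: one line per storage cell)
cell="192.168.10.4"
cell="192.168.10.5"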

2. Set up the Oracle software owner accounts on the replacement database server.

a) Add the group (or groups) for the Oracle software owner (here, the Grid Infrastructure owner is oragrid):

On the surviving node, obtain the current group information:

[root@surviving ~]# id oragrid
uid=1000(oragrid) gid=1001(oinstall) groups=1001(oinstall),1004(asmdba),1005(asmoper),1006(asmadmin)
[root@surviving ~]#


On the replacement node, use the groupadd command to add the group information:

[root@replacement]# groupadd -g 1001 oinstall
[root@replacement]# groupadd -g 1002 dba
[root@replacement]# groupadd -g 1003 oper
[root@replacement]# groupadd -g 1004 asmdba
[root@replacement]# groupadd -g 1005 asmoper
[root@replacement]# groupadd -g 1006 asmadmin

b) Add the user (or users) for the Oracle environment (typically, this is oracle).

On the surviving node, obtain the current user information:

[root@surviving]# id oracle
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)

[root@surviving]# finger oracle
Login: oracle Name: (null)
Directory: /home/oracle Shell: /bin/bash
Never logged in.
No mail.
No Plan.

On the replacement node, add the user information (the example below recreates the oragrid user with the uid, gid, and group memberships obtained above):

[root@replacement]# useradd -u 1000 -g 1001 -G 1001,1004,1005,1006 -m -d /home/oragrid -s /bin/bash oragrid

c) Set the password for the Oracle software owner (a password for this account is typically configured during deployment):

[root@replacement]# passwd oragrid
Changing password for user oragrid.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

d) Create the ORACLE_BASE and Grid Infrastructure directories such as /u01/app/grid and /u01/app/11.2.0.4/grid, as follows:

[root@replacement]# mkdir -p /u01/app/grid
[root@replacement]# mkdir -p /u01/app/11.2.0.4/grid
[root@replacement]# chown -R oragrid:oinstall /u01/app

e) Change the ownership on the cellip.ora and cellinit.ora files. This is typically the Grid Infrastructure owner and the oinstall group (here, "oragrid:oinstall"):

[root@replacement]# chown -R oragrid:oinstall /etc/oracle/cell/network-config

f) Set up SSH for the Oracle software owner account.

i. Log in to the account:

[root@replacement]# su - oragrid

ii. Create a dcli group file listing the nodes in the Oracle cluster (a sample file is shown after step iii).

iii. Run the setssh script (this assumes the oracle password on all servers in the dbs_group list is set to welcome1):

[oracle@replacement]$ /opt/oracle.SupportTools/onecommand/setssh.sh -s -u oracle -p welcome1 -n N -h dbs_group
.........................
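The dbs_group file referenced above is a plain text file with one database server host name per line; for this two-node example:

[oracle@replacement]$ cat dbs_group
dm01db01
dm01db02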

(or)

Alternatively, set up SSH equivalence manually as the oragrid user.

On both nodes:

mkdir ~/.ssh
chmod 700 ~/.ssh
/usr/bin/ssh-keygen -t rsa

On node 1 (dm01db01) as oragrid:

cd ~/.ssh
cat id_rsa.pub >> authorized_keys
scp authorized_keys dm01db02:/home/oragrid/.ssh

On node 2 (dm01db02) as oragrid:

cd ~/.ssh
cat id_rsa.pub >> authorized_keys
scp authorized_keys dm01db01:/home/oragrid/.ssh

Then verify from each node:

ssh dm01db01 date
ssh dm01db02 date
ssh dm01db01.oracle.com date
ssh dm01db02.oracle.com date

g) Verify that SSH equivalence has been set up.
Log in to the account and check each database server using the dcli command:

[root@replacement]# su - oragrid

[oracle@replacement]$ dcli -g dbs_group -l oragrid date
dm01db01: Wed Mar 10 17:21:33 CST 2010
surviving: Wed Mar 10 17:21:34 CST 2010


h) Set up or copy any custom login scripts from a surviving database server:

[oracle@surviving]$ scp .bash* oragrid@dm01db01:.


Clone Oracle Grid Infrastructure to the Replacement Database Server

1. Verify the hardware and operating system installations with the Cluster Verification Utility (CVU):

[oracle@surviving]$ cluvfy stage -post hwos -n dm01db01,dm01db02 -verbose

At the end of the report, you should see the text "Post-check for hardware and operating system setup was successful."

2. Verify peer compatibility:

[oracle@surviving]$ cluvfy comp peer -refnode surviving -n dm01db01 -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched

Compatibility check: Available memory [reference node: surviving]
Node Name     Status                    Ref. node status          Comment
------------  ------------------------  ------------------------  ----------
dm01db01      31.02GB (3.2527572E7KB)   29.26GB (3.0681252E7KB)   mismatched
Available memory check failed

Compatibility check: Free disk space for "/tmp" [reference node: dm01db02]
Node Name     Status                    Ref. node status          Comment
------------  ------------------------  ------------------------  ----------
dm01db01      55.52GB (5.8217472E7KB)   51.82GB (5.4340608E7KB)   mismatched
Free disk space check failed

If the only components that failed are related to physical memory, swap space, and disk space, then it is safe to continue.

3. Perform requisite checks for node addition:

[oracle@surviving]$ cluvfy stage -pre nodeadd -n dm01db01 -fixup -fixupdir /home/oragrid/fixup.d

If the only component that fails is related to swap space, then it is safe to continue.
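If cluvfy reports fixable failures, it generates a runfixup.sh script in the fixup directory given on the command line; run it as root on the affected node. A sketch using the directory from the command above:

[root@replacement]# /home/oragrid/fixup.d/runfixup.sh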

4. Add the replacement database server into the cluster:

[oracle@surviving]$ cd /u01/app/11.2.0.4/grid/oui/bin/

[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={dm01db01-vip}"

This initiates the OUI to copy the clusterware software to the replacement database server.

WARNING: A new inventory has been created on one or more nodes in this session.
However, it has not yet been registered as the central inventory of this system.
To register the new inventory please run the script at
'/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes 'dm01db01'.
If you do not register the inventory, you may not be able to update or patch the
products you installed.
The following configuration scripts need to be executed as the "root" user in
each cluster node:

/u01/app/oraInventory/orainstRoot.sh #On nodes dm01db01
/u01/app/11.2.0.4/grid/root.sh #On nodes dm01db01

To execute the configuration scripts:

a) Open a terminal window.
b) Log in as root.
c) Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:
The Cluster Node Addition of /u01/app/11.2.0.4/grid was successful.
Please check '/tmp/silentInstall.log' for more details.

5. Run the orainstRoot.sh and root.sh scripts for the replacement database server:

[root@replacement]# /u01/app/oraInventory/orainstRoot.sh

Creating the Oracle inventory pointer file (/etc/oraInst.loc)
Changing permissions of /u01/app/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.
Changing groupname of /u01/app/oraInventory to oinstall.
The execution of the script is complete.

[root@replacement]# /u01/app/11.2.0.4/grid/root.sh

Check /u01/app/11.2.0.4/grid/install/root_dm01db01.acme.com_2010-03-10_17-59-15.log for the output of root script

The output file created above will report that the LISTENER resource on the replaced database server failed to start. This is expected, because the listener was disabled earlier:

PRCR-1013 : Failed to start resource ora.LISTENER.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER.lsnr on node dm01db01
CRS-2662: Resource 'ora.LISTENER.lsnr' is disabled on server 'dm01db01'
start listener on node=dm01db01 ... failed

6. Reenable the listener resource that was stopped and disabled in the "Remove the Failed Database Server from the Cluster" section earlier in this procedure.

[root@replacement]# /u01/app/11.2.0.4/grid/bin/srvctl enable listener -l LISTENER -n dm01db01
[root@replacement]# /u01/app/11.2.0.4/grid/bin/srvctl start listener -l LISTENER -n dm01db01
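A quick status check should now report the listener as running on the replacement node:

[root@replacement]# /u01/app/11.2.0.4/grid/bin/srvctl status listener -l LISTENER -n dm01db01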

Clone Oracle Database Homes to Replacement Database Server

The following steps are for example purposes only and are based on Chapter 9, "Adding Oracle RAC to Nodes with Oracle Clusterware Installed," in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2 (11.2). If you use Oracle Database 11g Release 1 (11.1), then review the documentation for that release.


1. Add the RDBMS ORACLE_HOME on the replacement database server:

[oracle@surviving]$ cd /u01/app/oracle/product/11.2.0/dbhome_1/oui/bin/
[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}"

These commands initiate the OUI (Oracle Universal Installer) to copy the Oracle Database software to the replacement database server. However, to complete the installation, you must run the root scripts on the replacement database server after the command completes.

WARNING: The following configuration scripts need to be executed as the "root"
user in each cluster node.

/u01/app/oracle/product/11.2.0/dbhome_1/root.sh #On nodes dm01db01

To execute the configuration scripts:
a) Open a terminal window.
b) Log in as root.
c) Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:

The Cluster Node Addition of /u01/app/oracle/product/11.2.0/dbhome_1 was successful.
Please check '/tmp/silentInstall.log' for more details.

2. Run the following scripts on the replacement database server:

[root@replacement]# /u01/app/oracle/product/11.2.0/dbhome_1/root.sh

Check /u01/app/oracle/product/11.2.0/dbhome_1/install/root_dm01db01.acme.com_2010-03-10_18-27-16.log for the output of the root script.

3. Validate the initialization parameter files:
• Verify that the init<SID>.ora file under $ORACLE_HOME/dbs references the spfile in the ASM shared storage.
• The password file that is copied under $ORACLE_HOME/dbs during the addNode.sh run needs to be renamed to orapw<SID>.
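A sketch of both checks; the database name dbm, instance name dbm1, and the DATA disk group below are assumed placeholders, not values from this environment:

[oracle@replacement]$ cat $ORACLE_HOME/dbs/initdbm1.ora
SPFILE='+DATA/dbm/spfiledbm.ora'

[oracle@replacement]$ cd $ORACLE_HOME/dbs
[oracle@replacement]$ mv orapwdbm2 orapwdbm1    (the copied file carries the source instance's name; rename it for the local instance)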
