Backup and restore Sun Cluster

Imagine for a second, however unusual the situation may be, that both nodes of your cluster have failed miserably and simultaneously because of a buggy component. Now you face a dilemma: reconfigure everything from scratch or, since you're a vigilant SA, use the backups. But how do you restore the cluster configuration, DID devices, resource and disk groups, and so on? I tried to simulate such a failure in my testbed environment, using the simplest SC configuration with one node and two VxVM disk groups, to give a general overview and to show how easy and straightforward the whole process is.

As you might already know, the CCR (Cluster Configuration Repository) is a central database that contains the cluster configuration: cluster and node names, cluster transport settings, device group definitions, resource and resource group configuration, and so on.

This information is kept under /etc/cluster/ccr, so it's quite obvious that we need a backup of the whole /etc/cluster directory, since some crucial data, e.g. nodeid, are stored under /etc/cluster as well:

# cd /etc/cluster; find ./ | cpio -co > /cluster.cpio

And don't forget about the /etc/vfstab, /etc/hosts, /etc/nsswitch.conf and /etc/inet/ntp.cluster files. Of course, there is a lot more to keep in mind, but here I'm speaking only about SC related files. Now, here comes the trickiest part. Why would I give it such a strong name? Because I learned this the hard way during my experiment: if you omit it, you won't be able to load the did module and recreate the DID device entries (here I deliberately decided not to back up the /devices and /dev directories). Write down or try to remember the output:

# grep did /etc/name_to_major
did 300

# grep did /etc/minor_perm
did:*,tp 0666 root sys

Of course, you could simply add those two files to the backup list – whatever you prefer.

Now it's safe to reinstall the OS and, once that's done, install the SC software. Don't run scinstall; it's unnecessary.
First, create a new partition for globaldevices on the root disk. It's better to assign it the same number it had before the crash, to avoid editing the /etc/vfstab file and bothering with the scdidadm command. Next, edit the /etc/name_to_major and /etc/minor_perm files and make the appropriate changes, or simply overwrite them with the copies from your backup. Now do a reconfiguration reboot:

# reboot -- -r

or, from the OBP:

ok> boot -r

or:

# touch /reconfigure; init 6

When you're back, check that the did module was loaded and that pseudo/did@0:admin exists under /devices:

# modinfo | grep did
285 786f2000 3996 300 1 did (Disk ID Driver 1.15 Aug 20 2006)

# ls -l /devices/pseudo/did\@0\:admin
crw-------   1 root     sys      300,  0 Oct  2 16:30 /devices/pseudo/did@0:admin

You should also see that /global/.devices/node@1 was successfully mounted. So far, so good. But we are still in non-cluster mode. Let's fix that.

# mv /etc/cluster /etc/cluster.orig
# mkdir /etc/cluster; cd /etc/cluster
# cpio -i < /cluster.cpio

Restore the other files, e.g. /etc/vfstab and others of that ilk, and reboot your system. Once it's back up, double check that the DID entries have been created:

#  scdidadm -l 
1        chuk:/dev/rdsk/c1t6d0          /dev/did/rdsk/d1
2        chuk:/dev/rdsk/c2t0d0          /dev/did/rdsk/d2
3        chuk:/dev/rdsk/c2t1d0          /dev/did/rdsk/d3
4        chuk:/dev/rdsk/c3t0d0          /dev/did/rdsk/d4
5        chuk:/dev/rdsk/c3t1d0          /dev/did/rdsk/d5
6        chuk:/dev/rdsk/c6t1d0          /dev/did/rdsk/d6
7        chuk:/dev/rdsk/c6t0d0          /dev/did/rdsk/d7

# for p in  `scdidadm -l | awk '{print $3"*"}' `; do ls -l $p; done
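If you want something a bit more defensive than the one-liner above, here is a hedged sketch that flags DID devices without slice nodes. The function name is mine, and the guard makes it a no-op on hosts without the cluster software:

```shell
#!/bin/sh
# Sketch: verify that every DID device reported by scdidadm exposes
# slice nodes under /dev/did/rdsk (e.g. /dev/did/rdsk/d1s0).
check_did_slices() {
    if command -v scdidadm >/dev/null 2>&1; then
        scdidadm -l | awk '{print $3}' | while read -r d; do
            # complain if a DID device has no slice entries at all
            ls -l "$d"s* 2>/dev/null || echo "no slices found for $d"
        done
    else
        echo "scdidadm not available; run this on a cluster node"
    fi
}

check_did_slices
```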

Finally, import the VxVM disk groups, if that hasn't been done automatically, and bring them online:

# vxdg import testdg
# vxdg import oradg

# scstat -D

-- Device Group Servers --

                          Device Group        Primary             Secondary
                          ------------        -------             ---------
  Device group servers:     testdg              testbed                -
  Device group servers:     oradg               testbed                -

-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        testdg              Offline
  Device group status:        oradg               Offline

# scswitch -z -D oradg -h testbed
# scswitch -z -D testdg -h testbed

# scstat -D

-- Device Group Servers --

                         Device Group        Primary             Secondary
                         ------------        -------             ---------
  Device group servers:  testdg              testbed                -
  Device group servers:  oradg               testbed                -

-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        testdg              Online
  Device group status:        oradg               Online


Posted on October 2, 2009 at 7:15 pm by sergeyt
In: Solaris

2 Responses


  1. Written by Leo
    on February 24, 2011 at 2:10 pm

    Nice post… have you ever tried to simulate a situation where you physically lost both nodes, including the server hardware, and had to restore the backup on two new servers with the same architecture but a different layout/components on the PCI buses?

    [ ]s

    • Written by sergeyt
      on February 24, 2011 at 11:58 pm

      Hello, Leo.

      To tell the truth, I've never tried that myself, so I can only surmise what might happen in this case and whether it's achievable at all.
      My assumption is that with a different PCI layout, and as a result absolutely different physical paths to the devices, one would have to recreate the DID devices, because they would no longer match the full device paths. If SAN is also involved then things get more complicated, because not only would the PCI controllers be different, but the WWNs of the fibre channel devices would also change, and a SAN zone reconfiguration would be required. Summing up, I think that a new cluster node would boot up, but none of its resource groups would be available, and manual intervention would be unavoidable.
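      For what it's worth, the DID re-creation that would likely be needed might look something like the sketch below. These are standard SC commands, but the exact sequence on replacement hardware is my guess and is untested, hence the guard:

```shell
#!/bin/sh
# Sketch: rebuild the DID namespace on replacement hardware.
# Guarded so it only acts on a real cluster node.
rebuild_did() {
    if command -v scdidadm >/dev/null 2>&1; then
        scdidadm -C   # clean up DID instances whose device paths are gone
        scdidadm -r   # rediscover devices and create fresh DID instances
        scgdevs       # repopulate the global devices namespace
    else
        echo "not a cluster node; nothing to do"
    fi
}

rebuild_did
```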

      I will try to experiment on this subject more thoroughly. Thank you for bringing in a fresh idea.
