Backup and restore Sun Cluster

Imagine for a second, even if it’s an extremely unusual situation, that both nodes of your cluster have failed miserably and simultaneously because of a buggy component. Now you face a dilemma: reconfigure everything from scratch or, since you’re a vigilant SA, restore from the backups. But how do you restore the cluster configuration, DID devices, resource and disk groups, etc.? I tried to simulate such a failure in my testbed environment using the simplest SC configuration, with one node and two VxVM disk groups, to give a general overview and to show how easy and straightforward the whole process is.

As you might already know, the CCR (Cluster Configuration Repository) is a central database that contains the cluster-wide configuration: node names, device groups, resource groups, transport settings and the like.

This information is kept under /etc/cluster/ccr, so it’s quite obvious that we need a backup of the whole /etc/cluster directory, since other crucial data, e.g. the nodeid file, are stored under /etc/cluster as well:

# cd /etc/cluster; find ./ | cpio -co > /cluster.cpio
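Just to be on the safe side, it doesn’t hurt to list the contents of the archive afterwards and make sure it’s readable:

# cpio -ict < /cluster.cpio | head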

And don’t forget about the /etc/vfstab, /etc/hosts, /etc/nsswitch.conf and /etc/inet/ntp.cluster files. Of course, there is a lot more to keep in mind, but here I’m speaking only about SC related files. Now, here goes the trickiest part. Why would I give it such a strong name? Because I learned this the hard way during my experiment: if you omit it you won’t be able to load the did module and recreate the DID device entries (here I deliberately decided not to back up the /devices and /dev directories). Write down or try to remember the output:

# grep did /etc/name_to_major
did 300

# grep did /etc/minor_perm
did:*,tp 0666 root sys

Of course, you could simply add those two files to the backup list – whatever you prefer.
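For example, something along these lines would grab everything mentioned above in one go (the archive name is arbitrary, of course):

# cd /; find ./etc/cluster ./etc/vfstab ./etc/hosts ./etc/nsswitch.conf \
    ./etc/inet/ntp.cluster ./etc/name_to_major ./etc/minor_perm | cpio -co > /sc_backup.cpio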

Now it’s safe to reinstall the OS and, once that’s done, install the SC software. Don’t run scinstall – it’s unnecessary.
First, create a new partition for /globaldevices on the root disk. It’s better to assign it the same slice number it had before the crash, to avoid editing the /etc/vfstab file and bothering with the scdidadm command. Next, edit the /etc/name_to_major and /etc/minor_perm files and make the appropriate changes, or simply overwrite them with the copies from your backup. Now do a reconfiguration reboot:

# reboot -- -r
or

# halt
ok> boot -r

or

# touch /reconfigure; init 6

When you’re back, check that the did module was loaded and that pseudo/did@0:admin exists under /devices:

# modinfo | grep did
285 786f2000 3996 300 1 did (Disk ID Driver 1.15 Aug 20 2006)

# ls -l /devices/pseudo/did\@0\:admin
crw-------   1 root     sys      300,  0 Oct  2 16:30 /devices/pseudo/did@0:admin

You should also see that /global/.devices/node@1 was successfully mounted. So far, so good. But we are still in non-cluster mode. Let’s fix that.

# mv /etc/cluster /etc/cluster.orig
# mkdir /etc/cluster; cd /etc/cluster
# cpio -icd < /cluster.cpio
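Before rebooting, a quick peek to confirm that the CCR tables and the nodeid file are back where they belong won’t hurt:

# ls /etc/cluster/ccr
# cat /etc/cluster/nodeid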

Restore the other files, e.g. /etc/vfstab and others of that ilk, and reboot the system. Once it’s back again, double check that the DID entries have been created:

#  scdidadm -l 
1        chuk:/dev/rdsk/c1t6d0          /dev/did/rdsk/d1
2        chuk:/dev/rdsk/c2t0d0          /dev/did/rdsk/d2
3        chuk:/dev/rdsk/c2t1d0          /dev/did/rdsk/d3
4        chuk:/dev/rdsk/c3t0d0          /dev/did/rdsk/d4
5        chuk:/dev/rdsk/c3t1d0          /dev/did/rdsk/d5
6        chuk:/dev/rdsk/c6t1d0          /dev/did/rdsk/d6
7        chuk:/dev/rdsk/c6t0d0          /dev/did/rdsk/d7

# for p in  `scdidadm -l | awk '{print $3"*"}' `; do ls -l $p; done
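By the way, if some of the links under /dev/did happen to be missing, rebuilding them should be as simple as asking scdidadm to re-run device discovery and then repopulating the global devices namespace with scgdevs:

# scdidadm -r
# scgdevs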

Finally, import the VxVM disk groups, if that hasn’t been done automatically, and bring them online:

# vxdg import testdg
# vxdg import oradg
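Depending on how the disk groups were deported, the volumes may still be stopped after the import, in which case plain VxVM commands will start them (nothing cluster specific here):

# vxvol -g testdg startall
# vxvol -g oradg startall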

# scstat -D

-- Device Group Servers --

                          Device Group        Primary             Secondary
                          ------------        -------             ---------
  Device group servers:     testdg              testbed                -
  Device group servers:     oradg               testbed                -


-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        testdg              Offline
  Device group status:        oradg               Offline


# scswitch -z -D oradg -h testbed
# scswitch -z -D testdg -h testbed

# scstat -D

-- Device Group Servers --

                         Device Group        Primary             Secondary
                         ------------        -------             ---------
  Device group servers:  testdg              testbed                -
  Device group servers:  oradg               testbed                -


-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        testdg              Online
  Device group status:        oradg               Online

Easy!
