Backup and restore Sun Cluster

Imagine for a second, however unusual the situation may be, that both nodes of your cluster have failed miserably and simultaneously because of a buggy component. Now you face a dilemma: reconfigure everything from scratch or, since you're a vigilant SA, use the backups. But how do you restore the cluster configuration, DID devices, resource and disk groups, and so on? I tried to simulate such a failure in my testbed environment, using the simplest SC configuration with one node and two VxVM disk groups, to give a general overview and to show how easy and straightforward the whole process is.

As you might already know, the CCR (Cluster Configuration Repository) is a central database that contains the cluster configuration: cluster and node names, cluster transport settings, device group definitions, resource and resource group configuration, and so on.

This information is kept under /etc/cluster/ccr, so it's quite obvious that we need a backup of the whole /etc/cluster directory, since some crucial data, e.g. nodeid, are stored under /etc/cluster as well:

# cd /etc/cluster; find ./ | cpio -co > /cluster.cpio

And don't forget about the /etc/vfstab, /etc/hosts, /etc/nsswitch.conf and /etc/inet/ntp.cluster files. Of course, there is a lot more to keep in mind, but here I'm speaking only about SC related files. Now, here comes the trickiest part. Why would I give it such a strong name? Because I learned this the hard way during my experiment: if you omit it, you won't be able to load the did module and recreate the DID device entries (here I deliberately decided not to back up the /devices and /dev directories). Write down or try to remember the output:

# grep did /etc/name_to_major
did 300

# grep did /etc/minor_perm
did:*,tp 0666 root sys

Of course, you could simply add those two files to the backup list – whatever you prefer.

Now it's safe to reinstall the OS and, once that's done, install the SC software. Don't run scinstall; it's unnecessary.
First, create a new partition for globaldevices on the root disk. It's better to assign it the same number it had before the crash, to avoid editing the /etc/vfstab file and bothering with the scdidadm command. Next, edit the /etc/name_to_major and /etc/minor_perm files and make the appropriate changes, or simply overwrite them with the copies from your backup. Now do a reconfiguration reboot:

# reboot -- -r

or, from the OBP:

ok> boot -r

or:

# touch /reconfigure; init 6

When you're back, check that the did module was loaded and that pseudo/did@0:admin exists under /devices:

# modinfo | grep did
285 786f2000 3996 300 1 did (Disk ID Driver 1.15 Aug 20 2006)

# ls -l /devices/pseudo/did\@0\:admin
crw-------   1 root     sys      300,  0 Oct  2 16:30 /devices/pseudo/did@0:admin

You should also see that /global/.devices/node@1 was successfully mounted. So far, so good. But we are still in non-cluster mode. Let's fix that.

# mv /etc/cluster /etc/cluster.orig
# mkdir /etc/cluster; cd /etc/cluster
# cpio -i < /cluster.cpio

Restore the other files, e.g. /etc/vfstab and others of that ilk, and reboot your system. Once it's back up, double check that the DID entries have been created:

#  scdidadm -l 
1        chuk:/dev/rdsk/c1t6d0          /dev/did/rdsk/d1
2        chuk:/dev/rdsk/c2t0d0          /dev/did/rdsk/d2
3        chuk:/dev/rdsk/c2t1d0          /dev/did/rdsk/d3
4        chuk:/dev/rdsk/c3t0d0          /dev/did/rdsk/d4
5        chuk:/dev/rdsk/c3t1d0          /dev/did/rdsk/d5
6        chuk:/dev/rdsk/c6t1d0          /dev/did/rdsk/d6
7        chuk:/dev/rdsk/c6t0d0          /dev/did/rdsk/d7

# for p in  `scdidadm -l | awk '{print $3"*"}' `; do ls -l $p; done
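If you want something a bit more defensive than the one-liner above, here is a hedged sketch that flags DID devices without slice nodes. The function name is mine, and the guard makes it a no-op on hosts without the cluster software:

```shell
#!/bin/sh
# Sketch: verify that every DID device reported by scdidadm exposes
# slice nodes under /dev/did/rdsk (e.g. /dev/did/rdsk/d1s0).
check_did_slices() {
    if command -v scdidadm >/dev/null 2>&1; then
        scdidadm -l | awk '{print $3}' | while read -r d; do
            # complain if a DID device has no slice entries at all
            ls -l "$d"s* 2>/dev/null || echo "no slices found for $d"
        done
    else
        echo "scdidadm not available; run this on a cluster node"
    fi
}

check_did_slices
```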

Finally, import the VxVM disk groups, if that hasn't been done automatically, and bring them online:

# vxdg import testdg
# vxdg import oradg

# scstat -D

-- Device Group Servers --

                          Device Group        Primary             Secondary
                          ------------        -------             ---------
  Device group servers:     testdg              testbed                -
  Device group servers:     oradg               testbed                -

-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        testdg              Offline
  Device group status:        oradg               Offline

# scswitch -z -D oradg -h testbed
# scswitch -z -D testdg -h testbed

# scstat -D

-- Device Group Servers --

                         Device Group        Primary             Secondary
                         ------------        -------             ---------
  Device group servers:  testdg              testbed                -
  Device group servers:  oradg               testbed                -

-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        testdg              Online
  Device group status:        oradg               Online


Posted on October 2, 2009 at 7:15 pm by sergeyt
In: Solaris

2 Responses


  1. Written by Leo
    on February 24, 2011 at 2:10 pm

    Nice post… have you ever tried to simulate a situation where you physically lost both nodes, including the server hardware, and had to restore the backup on two new servers with the same architecture but a different layout/components on the PCI buses?

    [ ]s

    • Written by sergeyt
      on February 24, 2011 at 11:58 pm

      Hello, Leo.

      To tell the truth, I've never tried that myself, so I can only surmise what might happen in this case and whether it's achievable at all.
      My assumption is that with a different PCI layout, and as a result absolutely different physical paths to the devices, one would have to recreate the DID devices, because they would no longer match the full device paths. If SAN is also involved then things get more complicated, because not only would the PCI controllers be different, but the WWNs of the fibre channel devices would also change, and a SAN zone reconfiguration would be required. Summing up, I think that a new cluster node would boot up, but none of its resource groups would be available, and manual intervention would be unavoidable.
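      For what it's worth, the DID re-creation that would likely be needed might look something like the sketch below. These are standard SC commands, but the exact sequence on replacement hardware is my guess and is untested, hence the guard:

```shell
#!/bin/sh
# Sketch: rebuild the DID namespace on replacement hardware.
# Guarded so it only acts on a real cluster node.
rebuild_did() {
    if command -v scdidadm >/dev/null 2>&1; then
        scdidadm -C   # clean up DID instances whose device paths are gone
        scdidadm -r   # rediscover devices and create fresh DID instances
        scgdevs       # repopulate the global devices namespace
    else
        echo "not a cluster node; nothing to do"
    fi
}

rebuild_did
```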

      I will try to experiment on this subject more thoroughly. Thank you for bringing in a fresh idea.
