Two days of outage

Recently I had an unpleasant experience when I found my small blog unavailable, whilst a day before it had been up and running at full speed. Thankfully, the support team behind my VPS box recovered the system from their backup quite rapidly, putting it back on stage. The data were intact, so there was no need to use my own local backup copies to recover MySQL’s or Apache’s configuration, since only system files were affected. There was a small issue with a few init files, which had apparently been clobbered by the cPanel restoration process, and I had to reinstall them using the “yum reinstall” command. So far, so good.
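
For the record, the fix amounted to something like this (a minimal sketch; the package name here is illustrative, so first check which package actually owns the damaged file):

# Find which package owns the clobbered init script
rpm -qf /etc/init.d/httpd

# Reinstall that package to restore its files from the repository
yum reinstall httpd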

An error during APA creation

If you see an error message like “unplumb operation for lan was unsuccessful” when creating an APA interface, double check that all members of the APA configuration are down and that none of them has an IP address assigned.
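
From memory, a quick pre-flight checklist looks roughly like this (lan1 is just an example member port; substitute your own):

# List the physical LAN interfaces
lanscan

# Verify that no candidate port still has an IP address configured
netstat -in

# Clear the address and bring the member down before adding it to the APA
ifconfig lan1 0.0.0.0
ifconfig lan1 down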

Khor Virap – Noravank – Tatev

That was one of those rare days filled to the brim with breathtaking views and spectacular historical places. We covered 500 km, and it took us a whole day to get to Tatev from Yerevan. Here are just a few photos I’d like to share. Now I’ve got a clear understanding of what a mountain pass is and how difficult, and at times exceptionally dangerous, it might be. Drive safe.

Expanding SB40c with hpacucli

One of the tasks that we had to deal with during the last trip was to expand SB40c blade storage by replacing 4x146GB SAS disks with 6x300GB SAS disks. Preferably, this operation should be done online, causing our client zero downtime. Since these 4x146GB disks had been configured into RAID10, the initial idea was quite simple and straightforward:

  1. Pull out any one of the disks from the SB40c.
  2. Replace it with a new 300GB disk and wait till the LogicalDrive is reconstructed (see the status check after this list).
  3. Remove another disk from the array, but this time from the other mirror stripe.
  4. Insert a new 300GB disk and wait till the reconstruction is over.
  5. Do exactly the same with the two remaining old 146GB disks.
  6. Expand the array by growing the logical drive (we had only one in the array’s configuration).

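While waiting for a rebuild, the progress can be monitored from the hpacucli prompt with something along these lines (the slot number matches our controller below; the output line is purely illustrative):

=> ctrl slot=3 ld all show status

   logicaldrive 1 (273.4 GB, RAID 1+0): Recovering, 27% complete
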
Replacing the first disk worked as planned, and as soon as a 300GB replacement was swapped in a green LED went on, indicating that the reconstruction had begun. So far so good. But when we replaced the second disk our joy diminished: the new disk was giving no signs of life (all LEDs were dark), and the system stalled to the point where it was impossible to shut it down gracefully. So we pressed reset, and thankfully once the system rebooted the reconstruction process continued, so the data were safe. The other two 146GB disks were replaced online without a single hiccup, so the rest of the expansion plan was very easy: all we had to do was insert another 2x300GB disks into the box to make the total number of disks equal to six. After that we just grew the logical drive, as shown below:

=> ctrl all show config

Smart Array P400 in Slot 3                (sn: PAFGL0N9SWK2OA)

   array A (SAS, Unused Space: 584359 MB)

      logicaldrive 1 (273.4 GB, RAID 1+0, OK)

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 300 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 300 GB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 300 GB, OK)

   unassigned

      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 300 GB, OK)

=> ctrl slot=3 ld 1 add drives=allunassigned 
=> ctrl all show config

Smart Array P400 in Slot 3                (sn: PAFGL0N9SWK2OA)

   array A (SAS, Unused Space: 1156500 MB)

      logicaldrive 1 (273.4 GB, RAID 1+0, Transforming, 0% complete)

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 300 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 300 GB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 300 GB, OK)

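Note that “ld 1 add drives=allunassigned” expands the array and re-stripes the logical drive across all six spindles, which is the “Transforming” state above. If memory serves, once the transformation completes, the logical drive itself still has to be grown to claim the unused space, with something like:

=> ctrl slot=3 ld 1 modify size=max
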
The initial explanation of what could have been the root cause was that one shouldn’t leave the hpacucli tool running whilst replacing a disk. It sounds plausible, since it resembles the problem of someone deleting a file whilst another process is writing to it, or, more closely, a process reading and writing over NFS when the server becomes unavailable.
Since we had another SB40c and the task was identical, we had a second chance. This time we double checked that no one was running hpacucli and began replacing the disks. Two disks were replaced flawlessly, but the third one hit us with exactly the same problem and we had to do a hard reset once again.
It’s still unclear what the real culprit was in the first place. Who knows, maybe it was a firmware issue, but we didn’t have a third SB40c to check that theory. Anyway, I think such behavior is unacceptable even if these arrays had the oldest firmware possible.
So if anyone knows how to avoid this in the future, or can point to a possible error on our side, shoot away. Your comments are truly welcome.

Yerevan. Armenia.

Yes, I’m in Yerevan and will spend at least the next two weeks here doing random SA stuff, from installing HP-UX and HP hardware to migrating a Veritas Cluster to a new node and configuring space-optimized snapshots.