Identifying a broken disk in HP DL360

Since I don’t have much experience with the HP DL series, I was puzzled for a bit when I found out that one of the disks was broken. How did I find out? Easy: by gazing at the fault LED, which was steadily lit. Not bad, but I wanted to be able to grab more detailed information from the console. Since the DL360 has a built-in “HP Smart Array” controller, you won’t be able to squeeze much information out of the system with ordinary tools such as fdisk, because the system sees only the logical drive presented by the array controller.
The solution was on the surface: I headed to www.hp.com, downloaded and installed the hpacucli RPM, and that was it. So now I could do everything I wanted:

# hpacucli ctrl all show

Smart Array 6i in Slot 0 (Embedded)     

# hpacucli ctrl all show detail

Smart Array 6i in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Chassis Slot: 
   Hardware Revision: Rev B
   Firmware Version: 2.36
   Rebuild Priority: Low
   Expand Priority: Low
   Surface Scan Delay: 15 secs
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 100% Read / 0% Write
   Total Cache Size: 64 MB
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
   SATA NCQ Supported: False

# hpacucli ctrl slot=0 logicaldrive all show 

Smart Array 6i in Slot 0 (Embedded)

   array A (Failed)

      logicaldrive 1 (136.7 GB, RAID 1, Interim Recovery Mode)

# hpacucli ctrl slot=0 physicaldrive all show 

Smart Array 6i in Slot 0 (Embedded)

   array A (Failed)

      physicaldrive 1:0   (port 1:id 0 , Parallel SCSI, ??? GB, Failed)
      physicaldrive 1:1   (port 1:id 1 , Parallel SCSI, 146.8 GB, Predictive Failure)

Sorted.
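
For unattended monitoring, the physicaldrive listing above can be scanned by a small script. The following is only a sketch: it assumes the line layout shown in this post (status as the last comma-separated field) and assumes that healthy drives report the status "OK"; the function name is made up for illustration.

```python
import re

# Sample text copied from the `physicaldrive all show` output in this post.
SAMPLE = """\
Smart Array 6i in Slot 0 (Embedded)

   array A (Failed)

      physicaldrive 1:0   (port 1:id 0 , Parallel SCSI, ??? GB, Failed)
      physicaldrive 1:1   (port 1:id 1 , Parallel SCSI, 146.8 GB, Predictive Failure)
"""

# Matches: physicaldrive <id> (<comma-separated attributes ending in status>)
DRIVE_RE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\((.*)\)\s*$")


def unhealthy_drives(text):
    """Return (drive_id, status) pairs for drives whose status is not "OK".

    Assumption: "OK" is the status string printed for a healthy drive.
    """
    result = []
    for line in text.splitlines():
        match = DRIVE_RE.match(line)
        if not match:
            continue  # controller header, array line or blank line
        drive_id, attrs = match.groups()
        status = attrs.split(",")[-1].strip()  # status is the last field
        if status != "OK":
            result.append((drive_id, status))
    return result


if __name__ == "__main__":
    for drive_id, status in unhealthy_drives(SAMPLE):
        print(f"physicaldrive {drive_id}: {status}")
```

In practice the text would come from running hpacucli itself (for example via subprocess) rather than from an embedded sample.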

Posted on June 10, 2009 at 1:06 pm by sergeyt
In: Linux

Missing volboot file

If running “vxdctl enable” gives you the following error:

VxVM vxdctl ERROR V-5-1-1589 enable failed: Volboot file not loaded

then this sequence could help you to resolve the problem:

vxiod set 10
vxconfigd -d
vxdctl init
vxdctl enable

Good luck.

Posted on June 8, 2009 at 5:43 pm by sergeyt
In: Veritas

Steadfast tin soldier

Last weekend was packed with events of every sort and kind. First of all, since summer has finally arrived, I moved out of stuffy and noisy Moscow to the countryside. The drawback is quite noticeable: since I will be commuting between Moscow and my new place daily for the next three months, I will also have to spend at least 4-5 hours per day locked in a car in traffic jams. Gosh! Anyway, the benefits tremendously outweigh the inconveniences; apart from living in a quiet and neat place, all the nice and beautiful sights near Moscow are now closer. On Sunday we set off to one such place, Borodino.

This is one of the most memorable and exciting places I’ve ever been to. Endless, picturesque scenery, profusely sprinkled with the blood of soldiers defending the Fatherland.

Here, every last Sunday of May, a kids’ festival called “Steadfast tin soldier” is held. During the festival everyone can see an old military bivouac of the Napoleonic era and admire marching soldiers dressed in period uniforms, with shakos on their heads and armed with muskets. Many people in quaint costumes can also be seen wandering around. Purely fantastic!

Frankly speaking, this event is very rich in every aspect: the atmosphere, the mood and the overall openness, the moral and historical weight. It is very touching, and it’s simply impossible to stay indifferent.

The culmination of the festival is a reenactment of episodes of the Battle of Borodino, which took place here almost 200 years ago, in 1812, with cannons, cavalry, real explosions, flying clods of soil and smoke from musket fire. A spectacular sight!

Posted on June 5, 2009 at 11:18 pm by sergeyt
In: Life

Unified Storage Simulator

Recently I had a chance to fiddle with the Sun Storage 7000 Simulator and was totally amazed by this product. It’s absolutely fantastic because it gives everyone an opportunity to study and get familiar with the Sun 7xxx storage appliance using nothing more than VirtualBox (or VMware). Once you boot it and go through the initial configuration, you are presented with 15 virtual disks on which you can create filesystems and/or LUNs and share them in whatever manner you prefer: iSCSI, WebDAV, NFS, CIFS, FTP, NDMP.
Initially I was thinking about writing a step-by-step installation and configuration review, but once I went through it myself I cast the idea aside because the process is so simple. It’s just that easy and straightforward. More than that, it comes with easy-to-understand documentation, and if you prefer the CLI you won’t feel deprived: tab completion and “help command” don’t give you a single chance to get lost or confused.
The web interface is certainly friendlier, and in my opinion it is your day-to-day assistant and the place where you will do most of your work. But all the background and cron job scripts will definitely go through the CLI.
And of course I can’t pass over the famous Analytics feature in silence. It’s epic! From a single menu, without having any idea what DTrace is, you can drill down to the very source of your problem and identify the culprit no matter which tier it is on. View the data online and in real time; analyze CPUs, caches, disks and protocols broken down by dozens of metrics, e.g. type of operation (read or write), clients, files, latency and much, much more. Just see it and spend some time playing with it.

Posted on June 5, 2009 at 4:29 pm by sergeyt
In: Sun

OpenSolaris 2009.06 is here

Today, during the ongoing CommunityOne conference, the new OpenSolaris release was announced with a bunch of compelling new features, e.g. Crossbow, ClearView, COMSTAR, SPARC support and a lot more. Release details can be found on the OpenSolaris web site.

Posted on June 1, 2009 at 6:45 pm by sergeyt
In: Sun

Utterly depressed

Online petition – Let Alexandra Come Back to Portugal

Posted on June 1, 2009 at 5:36 pm by sergeyt
In: Life

Spontaneous domain reboot on SunFire 6800

Because the respective case at Sun was closed, I want to add this note for future reference, just in case. So… One day I came to my desk and found that one of the domains on the SF6800 had been rebooted for no apparent reason; at least that was the very first impression. A quick, superficial look at /var/adm/messages and the prtdiag output revealed no hardware or software issues. The next step was to log in to the SC (System Controller) and dig a bit deeper. showboards, showfru, showchs, showplatform: everything looked fine, but the showlogs output, and especially showlogs -d C, put me on my guard:

May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 757768 local6.crit] 
                           ErrorMonitor: Domain C has a SYSTEM ERROR
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8
May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 894554 local6.error] 
/partition1/domain0/IB8/ar0: 
>>> L2CheckError[0x6150] : 0x06068606
             CMDVSyncErr [12:09] : 0x3 Ports [9:6] command valid mismatched against internal expected command valid
             PreqSyncErr [04:01] : 0x3 Ports [9:6] prereq mismatched against internal expected prereq
          AccCMDVSyncErr [28:25] : 0x3 accumulated valid command mismatch
                      FE [15:15] : 0x1 
          AccPreqSyncErr [20:17] : 0x3 accumulated prerequisite mismatch

May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 612655 local6.error] 
/partition1/RP2/sdc0: 
>>> SafariPortError8[0x280] : 0x00088008
                      FE [15:15] : 0x1 
           AccParL2ErrDT [19:19] : 0x1 
              ParL2ErrDT [03:03] : 0x1 L2 parity error for DTransID

May 15 07:38:52 SF6900-1-sc0 Domain-C.SC: [ID 286372 local6.error] [AD] Event: SF6800.ASIC.SDC.PAR_L2_ERR_DT.60143038
     CSN: 0344MM204E DomainID: C ADInfo: 1.SCAPP.20.3
     Time: Fri May 15 07:38:52 MSD 2009
     FRU-List-Count: 2; FRU-PN: 5014404; FRU-SN: 046286; FRU-LOC: /N0/IB8
                        FRU-PN: 5016418; FRU-SN: 004613; FRU-LOC: RP2
     Recommended-Action: Service action required

Does it look like a bunch of cryptic messages that only those initiated into Sun’s engineering secrets could decipher? Well, as always, the truth is somewhere in between, because in our case we can only make an assumption about which part of our big system is faulty or just went off the beam for a jiffy. So, let’s go forward…
First, we see two errors that took place simultaneously:

May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8

Since we have FE (First Error) [15:15]: 0x1 in both errors, each of them was the first error seen by its own ASIC, and both were logged at the same time. But keep in mind that the two FE bits are independent of each other: the FE bit is only valid within a single ASIC and has no relation to errors reported by other ASICs in the system. Next:

/partition1/domain0/IB8/ar0: 
>>> L2CheckError[0x6150] : 0x06068606
             CMDVSyncErr [12:09] : 0x3 Ports [9:6] command valid mismatched against internal expected command valid
             PreqSyncErr [04:01] : 0x3 Ports [9:6] prereq mismatched against internal expected prereq
          AccCMDVSyncErr [28:25] : 0x3 accumulated valid command mismatch
                      FE [15:15] : 0x1 
          AccPreqSyncErr [20:17] : 0x3 accumulated prerequisite mismatch

It just tells us that ports 6 through 9 of the AR (Address Repeater) on I/O board 8 reported CMDVSyncErr and PreqSyncErr. More details can be found here.
0x3 is a hint that RP2/RP3 were involved. Acc stands for “accumulated”, hence the Acc[CMDVSyncErr|PreqSyncErr] lines just inform us that these errors occurred more than once.

Let’s continue with the second error.

/partition1/RP2/sdc0: 
>>> SafariPortError8[0x280] : 0x00088008
                      FE [15:15] : 0x1 
           AccParL2ErrDT [19:19] : 0x1 
              ParL2ErrDT [03:03] : 0x1 L2 parity error for DTransID

This is a clear indication of the parity error on port 8 of SDC (Serengeti Data Controller), on RP2. Consulting “Sun Fire™ 6800/4800/4810/3800 Systems Troubleshooting Manual” revealed that port 8 connects to IB8.
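
The per-field values printed by the SC can be re-derived from the raw register words, which is a handy sanity check when reading such dumps. Below is a minimal sketch that assumes only what the log itself shows: the [hi:lo] notation denotes an inclusive bit range with bit 0 as the least significant bit. The helper names are made up for illustration.

```python
def field(value, hi, lo):
    """Extract bits hi..lo (inclusive, bit 0 = least significant) from value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def decode(value, fields):
    """Map each field name to the value found in its bit range."""
    return {name: field(value, hi, lo) for name, (hi, lo) in fields.items()}


# Raw register words and bit ranges, copied from the showlogs output above.
L2_CHECK_ERROR = 0x06068606
L2_CHECK_FIELDS = {
    "CMDVSyncErr": (12, 9),
    "PreqSyncErr": (4, 1),
    "AccCMDVSyncErr": (28, 25),
    "FE": (15, 15),
    "AccPreqSyncErr": (20, 17),
}

SAFARI_PORT_ERROR8 = 0x00088008
SAFARI_PORT_FIELDS = {
    "FE": (15, 15),
    "AccParL2ErrDT": (19, 19),
    "ParL2ErrDT": (3, 3),
}

if __name__ == "__main__":
    for name, value in decode(L2_CHECK_ERROR, L2_CHECK_FIELDS).items():
        print(f"{name:>15} : {value:#x}")
    for name, value in decode(SAFARI_PORT_ERROR8, SAFARI_PORT_FIELDS).items():
        print(f"{name:>15} : {value:#x}")
```

Both words decode to exactly the values the SC printed: 0x3 for the four sync-error fields and 0x1 for every single-bit field.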

In the end we have a list of suspected FRUs:

  1. RP2
  2. IB8

What’s next? With a probability of 99%, you will be advised to monitor your box for a couple of weeks; only if the same error knocks your server down again will one of those parts be replaced and the investigation move to a deeper level.

Posted on June 1, 2009 at 1:42 pm by sergeyt
In: Sun

Maximum number of processes

If you have some Linux background under your belt, then probably the first command you would think of is ulimit -a. The same command exists under Solaris:

root@root # ulimit -a
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
open files                    (-n) 32768
pipe size          (512 bytes, -p) 10
stack size            (kbytes, -s) 8192
cpu time             (seconds, -t) unlimited
max user processes            (-u) 19995
virtual memory        (kbytes, -v) unlimited

But there is a small difference. Whilst under Linux you are free to use it to change the maximum number of processes available to a single user, under Solaris it won’t work and complains:

ulimit: max user processes: cannot modify limit: Invalid argument

So what’s next? Remember that the maximum size of the process table depends on the total amount of physical memory installed in the system. This dependence is reflected in an internal kernel variable called maxusers, which is determined at boot time.

#define MIN_DEFAULT_MAXUSERS 8u
#define MAX_DEFAULT_MAXUSERS 2048u
#define MAX_MAXUSERS  4096u

if (maxusers == 0) {
      pgcnt_t physmegs = physmem >> (20 - PAGESHIFT);
      pgcnt_t virtmegs = vmem_size(heap_arena, VMEM_FREE) >> 20;
      maxusers = MIN(MAX(MIN(physmegs, virtmegs),
      MIN_DEFAULT_MAXUSERS), MAX_DEFAULT_MAXUSERS);
}

OpenSolaris source code

It is also used to set two other kernel variables, max_nprocs and maxuprc, which describe the maximum number of processes system-wide and the maximum number of processes an ordinary user can have, respectively.

if (max_nprocs == 0)
     max_nprocs = (10 + 16 * maxusers);
if (platform_max_nprocs > 0 && max_nprocs > platform_max_nprocs)
     max_nprocs = platform_max_nprocs;
if (max_nprocs > maxpid)
     max_nprocs = maxpid;
if (maxuprc == 0)
     maxuprc = (max_nprocs - reserved_procs);

OpenSolaris source code

To display the current values from the console, just run mdb -k and examine these variables:

> maxusers/D
maxusers:
maxusers:       2048

> max_nprocs/D
max_nprocs:
max_nprocs:     20000

> maxuprc/D
maxuprc:
maxuprc:        19995
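
The arithmetic in the two kernel excerpts can be replayed outside the kernel to see where these numbers come from. The following is only a sketch of that arithmetic, not kernel code: maxpid of 30000 is an assumed value, reserved_procs of 5 is inferred from 20000 - 19995 in the mdb output above, and the function names are made up for illustration.

```python
MIN_DEFAULT_MAXUSERS = 8
MAX_DEFAULT_MAXUSERS = 2048


def default_maxusers(physmegs, virtmegs):
    """Clamp min(physical MB, free kernel heap MB) into the range [8, 2048]."""
    return min(max(min(physmegs, virtmegs), MIN_DEFAULT_MAXUSERS),
               MAX_DEFAULT_MAXUSERS)


def process_limits(maxusers, maxpid, reserved_procs,
                   max_nprocs=0, platform_max_nprocs=0, maxuprc=0):
    """Replay the max_nprocs / maxuprc derivation; 0 means "not preset"."""
    if max_nprocs == 0:
        max_nprocs = 10 + 16 * maxusers
    if platform_max_nprocs > 0 and max_nprocs > platform_max_nprocs:
        max_nprocs = platform_max_nprocs
    if max_nprocs > maxpid:
        max_nprocs = maxpid
    if maxuprc == 0:
        maxuprc = max_nprocs - reserved_procs
    return max_nprocs, maxuprc


if __name__ == "__main__":
    # Assumed example machine: 16 GB of RAM and plenty of free kernel heap.
    maxusers = default_maxusers(physmegs=16384, virtmegs=100000)
    print("maxusers:", maxusers)
    # Nothing preset: 10 + 16 * 2048 = 32778, then capped by the assumed maxpid.
    print(process_limits(maxusers, maxpid=30000, reserved_procs=5))
    # With max_nprocs preset to the 20000 seen in mdb, maxuprc matches 19995.
    print(process_limits(maxusers, maxpid=30000, reserved_procs=5,
                         max_nprocs=20000))
```

Note that with maxusers at 2048 the default formula yields 32778 rather than 20000, so on the system shown max_nprocs was presumably preset or capped by the platform; that reading is an inference, not something the output proves.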

To set the maximum number of processes a non-root user can have, just update the maxuprc value through either mdb or the /etc/system file. Keep in mind that:

Whilst what I’ve said here is true for both Solaris 9 and 10, in Solaris 10 you can use “Resource Management” to create more refined constraints on the way a user can run his/her processes.

Posted on May 28, 2009 at 11:09 pm by sergeyt
In: Solaris

What’s new in OpenSolaris 2009.06

If you’re curious about the new features and technologies that are going to be introduced in the upcoming OpenSolaris release, then this presentation prepared by Peter Dennis is a must-read.

Posted on May 27, 2009 at 10:53 pm by sergeyt
In: Solaris

No way I want to make the same mistakes again

To avoid stepping on the same rake again and to fix the issue described in this post, I came up with a simple expect script to save the current configuration of QLogic SANbox switches.

#!/usr/local/bin/expect -f

set switches "switch1 switch2"
set user {user}
set pass {pass}
set ftp_user {ftp_user}
set ftp_pass {ftp_pass}
set timeout 10
log_user 0
set prompt "(%|#|\\$) $"
catch {set prompt $env(EXPECT_PROMPT)}

set sec [clock seconds]
set date [clock format $sec -format %d%m%Y]

set back [clock add $sec -7 days]
set bdate [clock format $back -format %d%m%Y]

for {set x 0} {$x<[llength $switches]} {incr x} {

set current_switch [lindex $switches $x]

spawn telnet $current_switch

expect {
        timeout {puts "timeout while connecting to $current_switch"; exit 1}
        "login:"
}
send "$user\r"

expect {
        timeout {puts "timed out waiting for the password prompt"; exit 1}
        "Password:"
}
send "$pass\r"

expect {
        timeout {puts "timed out after login"; exit 1}
        "#>"
}
send "admin start\r"

expect {
        timeout {puts "timed out waiting for admin mode"; exit 1}
        "(admin) #>"
}
send "config backup\r"

expect {
        "(admin) #>"
}
send "admin end\r"

expect {
        "#>"
}
send "quit\r"

spawn ftp $current_switch
expect {
        timeout {puts "timed out waiting for ftp login request"; exit 1}
        "Name" 
}
send "$ftp_user\r"

expect {
        timeout {puts "timed out waiting for ftp password request"; exit 1}
        "Password:"
}
send "$ftp_pass\r"

expect {
        timeout {puts "timed out waiting for ftp prompt"; exit 1}
        "ftp>"
}
send "get configdata /path_to_backup_directory/configdata_$current_switch-$date\r"
expect "ftp>"
send "quit\r"

if {[file exists /path_to_backup_directory/configdata_$current_switch-$bdate]} {
        exec /usr/bin/rm /path_to_backup_directory/configdata_$current_switch-$bdate
}
}
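
The rotation logic at the end of the script (store today’s file, delete the one stamped exactly seven days earlier) can be checked in isolation. This is a minimal sketch that only mirrors the file naming used above; the function name and the keep_days parameter are made up for illustration, and /path_to_backup_directory is the same placeholder as in the script.

```python
from datetime import date, timedelta


def backup_names(switch, today, keep_days=7,
                 directory="/path_to_backup_directory"):
    """Return (file to write today, file to delete) using the %d%m%Y stamp."""
    stamp = today.strftime("%d%m%Y")
    expired = (today - timedelta(days=keep_days)).strftime("%d%m%Y")
    return (f"{directory}/configdata_{switch}-{stamp}",
            f"{directory}/configdata_{switch}-{expired}")


if __name__ == "__main__":
    current, stale = backup_names("switch1", date(2009, 5, 27))
    print("write :", current)
    print("delete:", stale)
```

One consequence of this scheme is that only the file stamped exactly seven days back is removed, so if a daily run is skipped, the backup it would have deleted stays on disk until cleaned up by hand.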
Posted on May 27, 2009 at 6:44 pm by sergeyt
In: Scripting