Default Linux I/O multipathd configuration, SCSI timeout and Oracle RAC caveat

I’ve recently been involved in a project to migrate from old and rusty Cisco MDS 9222i switches to the new MDS 9506 SAN switches, and during the first phase of the migration the primary node in a two-node Oracle RAC cluster lost access to its voting disks and went down. And that happened when only half of the paths to the SAN storage were unreachable, whilst the other half was absolutely OK and active.

Oracle support pointed to the following errors:

WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.

Metalink document 1581684.1 gives a more thorough explanation:

Generally this kind messages comes in ASM alertlog file on below situations:

  • Too many delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
    thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.
  • the heart beat delays are sort of ignored for external redundancy diskgroup.
    ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
    but the heart beat delays do not dismount external redundancy diskgroup directly.

The ASM disk could go into unresponsiveness, normally in the following scenarios:

+ Some of the paths of the physical paths of the multipath device are offline or lost
+ During path ‘failover’ in a multipath set up
+ Server load, or any sort of storage/multipath/OS maintenance

One way to solve that is to set _asm_hbeatiowait on all the nodes of Oracle RAC to a higher value (in seconds), but not higher than 200.

But before doing that, it is a good idea to take a look at multipathd’s configuration.

# multipathd -k"show conf"

Since Oracle RAC in our case was backed by an EMC VMAX array, the following device section is of most interest:

device {
                vendor "EMC"
                product "SYMMETRIX"
                path_grouping_policy multibus
                getuid_callout "/sbin/scsi_id -g -u -ppre-spc3-83 -s /block/%n"
                path_selector "round-robin 0"
                path_checker tur
                features "0"
                hardware_handler "0"
                rr_weight uniform
                no_path_retry 6
                rr_min_io 1000
}

And it might seem that no_path_retry was one part of the problem:

A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.

In essence, instead of failing over to the active paths, I/O was queued. The negative effect of this option was multiplied by another option, this time in the defaults section, called polling_interval, which is set to 5 seconds by default. So I/O was queued for polling_interval * no_path_retry = 5 * 6 = 30 seconds in total, twice as long as the 15 seconds ASM is prepared to wait.
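
To make the arithmetic concrete, here is a minimal Python sketch of that estimate. The values are the ones quoted in this post (polling_interval of 5, no_path_retry of 6, the 15-second ASM wait). The function name is mine and purely illustrative, and the simple product is this post's approximation, not an exact model of what multipathd does internally.

```python
# Rough estimate of how long I/O stays queued, following the reasoning in this post.
# queue_window_seconds is a hypothetical helper, not part of any multipath tool.

def queue_window_seconds(polling_interval: int, no_path_retry: int) -> int:
    """Approximate the queueing window as polling_interval * no_path_retry."""
    return polling_interval * no_path_retry


POLLING_INTERVAL = 5   # seconds, the default quoted in this post
NO_PATH_RETRY = 6      # from the EMC SYMMETRIX device section
ASM_HBEAT_WAIT = 15    # seconds, from the ASM warnings

window = queue_window_seconds(POLLING_INTERVAL, NO_PATH_RETRY)
print(f"I/O may be queued for about {window}s; ASM waits {ASM_HBEAT_WAIT}s")
print("ASM gives up first" if window > ASM_HBEAT_WAIT else "queueing ends first")
```

With these numbers the queueing window outlasts the ASM heartbeat wait, which is consistent with the warnings in the alert log.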

One obvious solution was, as expected, to disable queueing on the Oracle voting disks by setting no_path_retry to fail. This was certainly low-hanging fruit, but there was more in the details, since there are several layers where I/O commands issued to a device can time out:

  • At the SCSI layer, defined in /sys/class/scsi_device/h:c:t:l/device/timeout.
  • At the FC HBA driver layer (in our case it was qla2xxx). Use modinfo to list the available module parameters; their current values are under /sys/module/qla2xxx/parameters/.
  • At the dm-multipath or block layer.
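
For the first of these layers, a small Python sketch like the one below can dump the per-device SCSI timeout. It only reads the sysfs path named above; the function name read_scsi_timeouts and its base parameter are my own, added so that the function can be pointed at a test directory.

```python
from pathlib import Path


def read_scsi_timeouts(base: str = "/sys/class/scsi_device") -> dict[str, int]:
    """Map each h:c:t:l address under `base` to its command timeout in seconds."""
    timeouts: dict[str, int] = {}
    root = Path(base)
    if not root.is_dir():
        return timeouts  # not Linux, or no SCSI devices present
    for timeout_file in sorted(root.glob("*/device/timeout")):
        hctl = timeout_file.parent.parent.name  # directory name, e.g. "0:0:0:0"
        try:
            timeouts[hctl] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that vanish or cannot be parsed
    return timeouts


for hctl, seconds in read_scsi_timeouts().items():
    print(f"{hctl}: {seconds}s")
```

On a host with many LUNs this gives a quick way to spot paths whose timeout differs from the rest.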

The following quote from a Red Hat engineer adds a more detailed explanation:

Also, please note that the timeout set in “/sys/class/scsi_device/h:c:t:l/device/timeout” is the minimum amount of time that it will take for the scsi error handler to start when a device is not responding, and *NOT* the amount of time it will take for the device to return a SCSI error. For example if the I/O timeout set to 60s, that means there’s a worst case of 120s before the error handler would ever be able to run.

Since IO commands can be submitted to the device up until the first submitted command is timed out, and that may take 60s for first command to get timed out, we could summarize the worst case scenario for longest time required to return IO errors on a device as follows:

[1] Command submitted to the sub path of device, inherits 60s timeout from /sys.

[2] just before 60s is up, another command is submitted, also inheriting a 60s timeout.

[3] first command times out at 60s, error handler starts but must sleep until all other commands have completed or timed out. Since we had a command submitted just before this, we wait another 60s for it to timeout.

[4] Now we attempt to abort all timed out commands. Note that each abort also sends a Test Unit Ready (TUR SCSI command) to the device, which have a 10 second timeout, adding extra time to the total.

[5] depending on the result of the abort, we may also have to reset the device/bus/host. This would add an indeterminate amount of time to the process, including more Test Unit Ready (TUR SCSI command) at 10 seconds each.

[6] Now that we’ve aborted all commands and possibly reset the device/bus/host, we requeue the cancelled commands. This is where we wait (number of allowed attempts + 1 * timeout_per_command) = (5+1 * 60s) = 360s. (**Note: in above formula number of allowed attempts defaults to 5 for any IO commands issued through VFS layer, and “timeout_per_command” is the timeout value set in “/sys/class/scsi_device/h:c:t:l/device/timeout” file).

[7] As commands reach their “(number of allowed attempts + 1 * timeout_per_command)” timeout, they will be failed back up to the DM-Multipath or application layer with an error code. This is where you finally see SCSI errors, and if multipath software is involved, for a path failure.
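
The steps above can be added up in a short Python sketch. It is only a lower bound under my own assumptions: exactly two commands are aborted (the two from steps [1] and [2]), each abort's TUR runs into its full 10-second timeout, and the indeterminate device/bus/host resets from step [5] are left out. I read the quoted formula as (allowed attempts + 1) * timeout, which is the reading that yields the quoted 360s. The function names are hypothetical.

```python
def requeue_wait(timeout: int, allowed_attempts: int = 5) -> int:
    """Step [6]: (number of allowed attempts + 1) * timeout_per_command."""
    return (allowed_attempts + 1) * timeout


def worst_case_floor(timeout: int = 60, tur_timeout: int = 10,
                     aborted_commands: int = 2, allowed_attempts: int = 5) -> int:
    """Sum steps [1]-[6], excluding the indeterminate resets of step [5]."""
    first_command = timeout                   # steps [1] and [3]: first command times out
    late_command = timeout                    # steps [2] and [3]: wait for the late command
    aborts = aborted_commands * tur_timeout   # step [4]: one TUR per abort (assumed worst case)
    requeue = requeue_wait(timeout, allowed_attempts)  # step [6]
    return first_command + late_command + aborts + requeue


print(requeue_wait(60))    # 360
print(worst_case_floor())  # 500
```

Even without any resets, that is more than eight minutes before an error reaches dm-multipath, against a 15-second ASM heartbeat wait.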

So the basic idea is that it is very hard to predict exactly how long a failover will take, and it is worth experimenting with the different timeout settings, i.e. those already mentioned plus fast_io_fail_tmo and dev_loss_tmo from multipath.conf, as well as looking at the problem from the application’s side and updating _asm_hbeatiowait accordingly. The question remains: why did Oracle decide to set this parameter to 15 seconds by default?

Solaris 11.2 beta is available

Yesterday Oracle announced the availability of Solaris 11.2 beta with a bunch of sweet enhancements, e.g. OpenStack, Solaris Kernel Zones, Unified Archives, compliance checking and reporting, automation with Puppet and more.
Find out more by reading Solaris 11.2 Beta – What’s new
For those who are interested in hands-on experience, Solaris 11.2 beta is also available for download in different formats, including a VirtualBox VM template.
Now I know what I will be doing during the upcoming four-day state holiday.

Jumped onto the Oracle Solaris 11 Express wagon

It’s no longer news that Oracle Solaris 11 Express has been released.
Unsurprisingly, upgrading from OpenSolaris (snv 134) turned out to be trivial; it was just a matter of updating the publisher and doing the usual image-update:

opensolaris:~$ pfexec pkg set-publisher --non-sticky
opensolaris:~$ pfexec pkg set-publisher --non-sticky extra
opensolaris:~$ pkg set-publisher -P -g solaris
opensolaris:~$ pfexec pkg image-update --accept

Some time later:

opensolaris:~$ uname -a
SunOS opensolaris 5.11 snv_151a i86pc i386 i86pc Solaris

Oracle Secure Backup

In preparation for a possible trip that will never happen, I had to go through the configuration steps of Oracle Secure Backup to be able to do it quickly and professionally on-site.

Since I didn’t have a spare tape library to play with, I used the built-in tape drive of a D240 box and an SF6900 as my test bed.

First, I had to configure my tape drive by attaching it to the sgen driver, since st is not supported by OSB.

# update_drv -d -i '"scsiclass,01"' st
# add_drv -f -m '* 0666 bin bin' -i '"scsiclass,01" "scsiclass,08" "scsa,01.bmpt" "scsa,08.bmpt"' sgen
# ln -s /dev/scsi/sequential/c9t6d0 /dev/obt0

The next step is to create a host and assign it a few roles.

ob> mkhost -r admin,mediaserver,client -i server's_ip_address hostname
ob> lshost                                                             
sf6900-2      admin,mediaserver,client (via OB) in service

Once the host is defined, it’s time to proceed with the tape drive:

ob> mkdev -t tape -o -a sf6900:/dev/obt0 tc-tape
ob> lsdev 
    Device type:            tape
    Model:                  [none]
    Serial number:          0005306351
    In service:             yes
    Automount:              yes
    Error rate:             8
    Query frequency:        [undetermined]
    Debug mode:             no
    Blocking factor:        (default)
    Max blocking factor:    (default)
    UUID:                   907cedda-ad0e-102d-a8d3-d67ad710fa01
    Attachment 1:
        Host:               sf6900
        Raw device:         /dev/obt0

An OSB user must be bound to a Unix account so that the OS user is authorized to start a backup using RMAN:

ob> chuser --preauth sf6900:system_username+rman admin

In the end I created a database backup storage selector and a media family, using mkssel and mkmf respectively, to add more granularity to my backup configuration, e.g. the “Write window”, the “Keep volume set” period, or whether the volumes are appendable.

Finally, I used the following trivial RMAN script to make sure that everything was fine:

run {
  allocate channel c1 device type sbt
    parms 'ENV=(OB_MEDIA_FAMILY=OracleBackup)';
  backup database include current controlfile;
  backup archivelog all not backed up;
}

To be on the safe side, I also used obtool to confirm that everything was indeed all right:

ob> lsj
Job ID           Sched time  Contents                       State
---------------- ----------- ------------------------------ ---------------------------------------
admin/3          none        database orcltst (dbid=1449400826) processed; Oracle job(s) scheduled
admin/3.1        none        datafile backup                running since 2010/11/02.17:06

ob> lspiece 
    POID Database   Content    Copy Created      Host             Piece name
     101 orcltst   full          0 11/02.17:07  sf6900      02ls0ppf_1_1
     102 orcltst   archivelog    0 11/02.17:09  sf6900      03ls0pua_1_1

Orphaned DTrace, Fishworks and ZFS

First it was Bryan Cantrill, and then Adam Leventhal followed. After that the exodus continued with Jeff Bonwick and Mike Shapiro both leaving Oracle. And today another big name from Sun Microsystems has closed Oracle’s door behind him: Brendan Gregg is leaving, and all we have been left with is a new DTrace book from Brendan and Jim Mauro.

Enjoy the videos.

Below is the list, taken from the OpenSolaris mailing list, of all the big names that have abandoned Sun/Oracle so far:

  • Ian Murdock (Emerging systems, i.e. new distro architecture)
  • Tim Bray (SGML/XML) (1 March 2010)
  • Simon Phipps (Open Source) (March 2010)
  • James Gosling (Java) (2 April 2010)
  • Sunay Tripathi (CrossBow) (2 April 2010)
  • Garrett D’Amore (networking, audio, device drivers; formerly with General Dynamics, which had bought Tadpole)
  • Bryan Cantrill (DTrace) (July 2010)
  • Adam Leventhal (DTrace)
  • Jeff Bonwick (ZFS)
  • Michael W. Shapiro (DTrace, storage) (October 2010)
  • Brendan Gregg (DTrace, storage) (October 2010)

Who cares about TCO and ROI?!

As expected, people care less about business acronyms and lofty words, i.e. TCO, ROI, integrated stack, when real money is involved. I saw this confirmed during the Oracle+Sun welcome event, where all the Oracle/Sun consultants were touting their integrated stack but fell short once asked about the price the customer will have to pay for the new support contract or the license fee for using Oracle on Sun hardware. Innovation and green technology are cool, but everything grows dim in the face of a bill. To sweeten the pill, it was said that once the integration process is completed the price list for Sun hardware will probably be revised downwards.

What was really useful about attending this event was the talk in the corridors, which gave hope that:

  • Oracle will actually use its privilege of owning the whole stack (from software to disks) to make the Sun+Oracle platform more attractive than any competing solution from a performance perspective.
  • The next SPARC64 processors and M-series platform, which are planned to hit the market in 2012, will be 50/50, against today’s 20/80, in terms of the Sun/Fujitsu partnership. It’s well known that contemporary SPARC64 CPUs are more Fujitsu’s brainchild than Sun’s.
  • We were offered a Sun T5220 equipped with SSDs and a Sun Flash F20 PCIe card for testing. Very sweet.

Tender thanks for the invitation

As I mentioned in my last post, there is going to be an Oracle+Sun welcome event on the 20th of May at the Moscow Marriott Hotel and, although I was a bit skeptical about my chances of being allowed to attend, I received the confirmation today. Frankly speaking, I don’t expect to hear any breathtaking revelations or confessions, since they are all well known from similar events that have already taken place in other countries, but I still expect it to be a cheerful moment, as well as a chance to have a personal touch on a historic event.

See you there…