Default Linux I/O multipathd configuration, SCSI timeout and Oracle RAC caveat

I’ve been recently involved in a project to migrate from old and rusty Cisco MDS 9222i to the new MDS 9506 SAN switches and during the first phase of the migration the primary node in a two-node Oracle RAC cluster lost access to its voting disks and went down. And that’s when only half paths to SAN storage was unreachable whilst the other half was absolutely ok and active.

Oracle support pointed out to the following errors:

WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.

Metalink document 1581684.1 at support.oracle.com gives more thorough explanation:

Generally this kind messages comes in ASM alertlog file on below situations:

  • Too many delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
    thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.
  • the heart beat delays are sort of ignored for external redundancy diskgroup.
    ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
    but the heart beat delays do not dismount external redundancy diskgroup directly.

The ASM disk could go into unresponsiveness, normally in the following scenarios:

+ Some of the paths of the physical paths of the multipath device are offline or lost
+ During path ‘failover’ in a multipath set up
+ Server load, or any sort of storage/multipath/OS maintenance

One way to solve that is to set _asm_hbeatiowait on all the modes of Oracle RAC to a higher value (in seconds) but not higher that 200.

But before that it would be a good idea to take a look at multipathd’s configuration first.

# multipathd -k"show conf"

Since Oracle RAC in our case was backed up by EMC VMAX array the following device section is of the most interest:

device {
                vendor "EMC"
                product "SYMMETRIX"
                path_grouping_policy multibus
                getuid_callout "/sbin/scsi_id -g -u -ppre-spc3-83 -s /block/%n"
                path_selector "round-robin 0"
                path_checker tur
                features "0"
                hardware_handler "0"
                rr_weight uniform
                no_path_retry 6
                rr_min_io 1000
        }

And it might seem that no_path_retry was one part of the problem:

A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.

In essence, instead of failing over to the active paths I/O was queued. The negative effect of this option was multiplied by the presence of another option, this time in the default section, called polling_interval which by default is set to 5 seconds. Now you see that I/O was queued by polling_interval*no_path_retry which is 30 seconds in total.

One obvious solution was, as expected, to disable queueing on Oracle voting disks by setting no_path_retry = fail. This was certainly a low hanging fruit but there were more in the details since there are several layers where IO commands issued to a device could experience the timeout:

  • At SCSI layer defined in /sys/class/scsi_device/h:c:t:l/device/timeout.
  • FC HBA’s driver layer (in our case it was qla2xxx). Use modinfo to list the current settings.
  • At dm-multipath or block layer.

The following quote from Redhat’s engineer adds more detailed explanation:

Also, please note that the timeout set in “/sys/class/scsi_device/h:c:t:l/device/timeout” is the minimum amount of time that it will take for the scsi error handler to start when a device is not responding, and *NOT* the amount of time it will take for the device to return a SCSI error. For example if the I/O timeout set to 60s, that means there’s a worst case of 120s before the error handler would ever be able to run.

Since IO commands can be submitted to the device up until the first submitted command is timed out, and that may take 60s for first command to get timed out, we could summarize the worst case scenario for longest time required to return IO errors on a device as follows:

[1] Command submitted to the sub path of device, inherits 60s timeout from /sys.

[2] just before 60s is up, another command is submitted, also inheriting a 60s timeout.

[3] first command times out at 60s, error handler starts but must sleep until all other commands have completed or timed out. Since we had a command submitted just before this, we wait another 60s for it to timeout.

[4] Now we attempt to abort all timed out commands. Note that each abort also sends a Test Unit Ready (TUR SCSI command) to the device, which have a 10 second timeout, adding extra time to the total.

[5] depending on the result of the abort, we may also have to reset the device/bus/host. This would add an indeterminate amount of time to the process, including more Test Unit Ready (TUR SCSI command) at 10 seconds each.

[6] Now that we’ve aborted all commands and possibly reset the device/bus/host, we requeue the cancelled commands. This is where we wait (number of allowed attempts + 1 * timeout_per_command) = (5+1 * 60s) = 360s. (**Note: in above formula number of allowed attempts defaults to 5 for any IO commands issued through VFS layer, and “timeout_per_command” is the timeout value set in “/sys/class/scsi_device/h:c:t:l/device/timeout” file).

[7] As commands reach their “(number of allowed attempts + 1 * timeout_per_command)” timeout, they will be failed back up to the DM-Multipath or application layer with an error code. This is where you finally see SCSI errors, and if multipath software is involved, for a path failure.

So the basic idea is that it’s very hard to predict the exact time it would take to failover and it’s worth trying to fiddle with different timeout settings, i.e. already mentioned and fast_io_fail_tmo, dev_loss_tmo from multipath.conf, as well as to look at the problem from the application’s side and update _asm_hbeatiowait accordingly. The question remains, why Oracle decided to set this parameter to 15 sec by default?