Configuring FCoE in Linux (RHEL) and HP FlexFabric

Actually it’s easy. Very easy indeed, like going 1, 2, 3.

  1. Collect the MAC addresses so you can distinguish pure Ethernet NICs from the CNAs that will pass FCoE traffic. The latter have both MAC and WWN addresses.
  2. Power on a server and update /etc/udev/rules.d/70-persistent-net.rules if required.
  3. Activate new dev rules:
    # udevadm trigger
    
  4. Install fcoe-utils and lldpad packages:
    # yum install fcoe-utils.x86_64
    
  5. Copy the /etc/fcoe/cfg-ethx template using the name of your CNA interface. For example, if eth5 is your CNA interface, then:
    # cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth5
    
  6. Edit the /etc/fcoe/cfg-ethX files and change DCB_REQUIRED="yes" to DCB_REQUIRED="no" (a sample file is shown after this list).
  7. Start the FCoE and LLDPAD services and set adminStatus to disabled for ALL interfaces, as HP requires for Broadcom-based CNAs. Please note that

    …In a FlexFabric environment, LLDPAD must be disabled on all network adapters…

    # chkconfig lldpad on
    # chkconfig fcoe on
    # service lldpad start
    # service fcoe start
    # for d in `ip link ls | grep mtu | awk -F \: '{print $2}'`; do lldptool set-lldp -i $d adminStatus=disabled; done
    
  8. Create an ifcfg file (e.g. /etc/sysconfig/network-scripts/ifcfg-eth5) for every CNA interface to make sure they are brought online after reboot:
    DEVICE=eth5
    ONBOOT=yes
    BOOTPROTO=none
    USERCTL=NO
    MTU=9000
    
  9. Run ifup to bring the FCoE interfaces up. If everything is OK, reboot the server as a final test and start enjoying FCoE.
    # ifup eth5
    
  10. Why MTU=9000? Because an FC frame payload can be up to 2,112 bytes, jumbo frames must be enabled so that a full FC frame fits into a single Ethernet frame without fragmentation.
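
For reference (step 6), a minimal /etc/fcoe/cfg-eth5 might look like the sketch below. Only the two parameters discussed above are shown; the template shipped with your fcoe-utils version may contain more settings:

FCOE_ENABLE="yes"
DCB_REQUIRED="no"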

OpenSSL TLS 1.1 and wrong version number

If you, like myself, have been living under a rock, you'd also be surprised to learn that OpenSSL didn't support TLSv1.1 and TLSv1.2 until version 1.0.1.
I found that out accidentally while trying to disable TLSv1 in Nginx running on a RHEL5 box with OpenSSL 0.9.8e. Below is how the TLS handshake looked when TLSv1.1 was deliberately requested:

$ openssl s_client -host some_host_name_here -port 443 -tls1_1 -state -msg
CONNECTED(00000003)
SSL_connect:before/connect initialization
>>> TLS 1.1 Handshake [length 0096], ClientHello
    01 00 00 92 03 02 54 e6 ea 6b bc f9 c7 bc 47 4e
    da a9 74 2e c8 27 c4 90 18 94 eb cf 21 40 ef 11
    fe 09 a0 38 bf 2a 00 00 4c c0 14 c0 0a 00 39 00
    38 00 88 00 87 c0 0f c0 05 00 35 00 84 c0 13 c0
    09 00 33 00 32 c0 12 c0 08 00 9a 00 99 00 45 00
    44 00 16 00 13 c0 0e c0 04 c0 0d c0 03 00 2f 00
    96 00 41 00 0a 00 07 c0 11 c0 07 c0 0c c0 02 00
    05 00 04 00 ff 01 00 00 1d 00 0b 00 04 03 00 01
    02 00 0a 00 08 00 06 00 19 00 18 00 17 00 23 00
    00 00 0f 00 01 01
SSL_connect:SSLv3 write client hello A
>>> TLS 1.0 Alert [length 0002], fatal protocol_version
    02 46
SSL3 alert write:fatal:protocol version
SSL_connect:error in SSLv3 read server hello A
140075793618760:error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number:s3_pkt.c:337:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 5 bytes and written 7 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.1
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1424419435
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
---
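
Before telling Nginx to drop TLSv1 it is worth checking what the server-side OpenSSL actually supports. A quick sketch (the grep pattern and the protocol list are just illustrations):

# on the box running Nginx:
$ openssl version
$ nginx -V 2>&1 | grep -o 'built with OpenSSL [^ ]*'

# nginx.conf: listing only the newer protocols here has the desired effect
# only when Nginx is linked against OpenSSL 1.0.1 or newer
ssl_protocols TLSv1.1 TLSv1.2;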

Linux pptp stumbling blocks that I was hit by

While configuring pptp on a Linux box I bumped into several smallish issues which I'd like to blog about.

  1. Make sure that your network engineers have enabled PPTP traffic inspection on all intermediate firewalls between the tunnel's endpoints. PPTP carries the actual PPP frames over GRE (IP protocol 47), so it is not enough to open just the control channel; without it LCP won't be able to finish its configuration negotiation phase even if the control connection on TCP port 1723 was successfully established before that.
  2. All you would get are admonitions similar to the ones listed below:

    pppd call connection_name debug nodetach
    using channel 5
    Using interface ppp0
    Connect: ppp0 <--> /dev/pts/2
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    Modem hangup
    Connection terminated.
    Script pptp xxx.xxx.xxx.xxx --nolaunchpppd finished (pid 10385), status = 0x0
    

    Just remember that without a working LCP there will be no PPP connection. Period.

  3. If you are running a Red Hat distro or any of its derivatives and want to start the pptp tunnel using the ifup command, just do the following:
    • Create a configuration file /etc/sysconfig/network-scripts/ifcfg-your_connection_name
    • In my case the content of the file is rather ascetic; depending on your requirements yours might have different options:

      DEVICE=ppp0
      ONBOOT=yes
      USERCTL=yes
      DEFROUTE=no
      PEERDNS=no
      
    • Make sure that the your_connection_name part of the /etc/sysconfig/network-scripts/ifcfg-your_connection_name filename matches exactly the name of the peers file you have under /etc/ppp/peers/. Otherwise ifup simply won't fly.
  4. Now you should be able to fire "ifup your_connection_name" and just a moment after that you should have your tunnel up and running (a minimal peers file is sketched below).
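
For reference, a minimal /etc/ppp/peers/your_connection_name might look like the sketch below. Every value is a placeholder and your VPN may require different pppd options:

pty "pptp vpn.example.com --nolaunchpppd"
name your_vpn_username
remotename PPTP
require-mppe-128
file /etc/ppp/options.pptp
ipparam your_connection_name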

Have a stable connection!

Default Linux I/O multipathd configuration, SCSI timeout and Oracle RAC caveat

I’ve been recently involved in a project to migrate from old and rusty Cisco MDS 9222i to new MDS 9506 SAN switches, and during the first phase of the migration the primary node in a two-node Oracle RAC cluster lost access to its voting disks and went down. And that happened when only half of the paths to the SAN storage were unreachable, whilst the other half was absolutely fine and active.

Oracle support pointed to the following errors:

WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.

Metalink document 1581684.1 at support.oracle.com gives a more thorough explanation:

Generally this kind messages comes in ASM alertlog file on below situations:

  • Too many delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
    thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.
  • the heart beat delays are sort of ignored for external redundancy diskgroup.
    ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
    but the heart beat delays do not dismount external redundancy diskgroup directly.

The ASM disk could go into unresponsiveness, normally in the following scenarios:

+ Some of the paths of the physical paths of the multipath device are offline or lost
+ During path ‘failover’ in a multipath set up
+ Server load, or any sort of storage/multipath/OS maintenance

One way to solve that is to set _asm_hbeatiowait on all the nodes of the Oracle RAC to a higher value (in seconds), but not higher than 200.
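
For illustration only, the parameter is usually changed in the ASM instance along these lines; the value of 120 is just an example and the exact procedure should be confirmed with Oracle Support:

# run as the grid owner on every RAC node
sqlplus / as sysasm <<'EOF'
alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';
EOF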

But before that it would be a good idea to take a look at multipathd’s configuration first.

# multipathd -k"show conf"

Since the Oracle RAC in our case was backed by an EMC VMAX array, the following device section is of the most interest:

device {
                vendor "EMC"
                product "SYMMETRIX"
                path_grouping_policy multibus
                getuid_callout "/sbin/scsi_id -g -u -ppre-spc3-83 -s /block/%n"
                path_selector "round-robin 0"
                path_checker tur
                features "0"
                hardware_handler "0"
                rr_weight uniform
                no_path_retry 6
                rr_min_io 1000
        }

And it might seem that no_path_retry was one part of the problem:

A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.

In essence, instead of failing over to the active paths, I/O was queued. The negative effect of this option was multiplied by the presence of another option, this time in the defaults section, called polling_interval, which by default is set to 5 seconds. Now you see that I/O was queued for polling_interval * no_path_retry, which is 30 seconds in total.

One obvious solution was, as expected, to disable queueing on the Oracle voting disks by setting no_path_retry = fail. This was certainly low-hanging fruit, but the devil was in the details, since there are several layers at which I/O commands issued to a device can experience a timeout (a quick way to inspect each layer is shown right after this list):

  • At SCSI layer defined in /sys/class/scsi_device/h:c:t:l/device/timeout.
  • At the FC HBA driver layer (in our case it was qla2xxx). Use modinfo to list the available parameters.
  • At dm-multipath or block layer.
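
A quick, hedged way to inspect each of these layers from the shell (the qla2xxx parameter name is just the one relevant in our case; other drivers expose different knobs):

# SCSI layer timeout, in seconds, for every SCSI device:
grep . /sys/class/scsi_device/*/device/timeout
# HBA driver parameters, e.g. qlport_down_retry for qla2xxx:
modinfo qla2xxx | grep -i retry
cat /sys/module/qla2xxx/parameters/qlport_down_retry
# dm-multipath view of the paths and the effective settings:
multipathd -k"show config"
multipath -ll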

The following quote from Redhat’s engineer adds more detailed explanation:

Also, please note that the timeout set in “/sys/class/scsi_device/h:c:t:l/device/timeout” is the minimum amount of time that it will take for the scsi error handler to start when a device is not responding, and *NOT* the amount of time it will take for the device to return a SCSI error. For example if the I/O timeout set to 60s, that means there’s a worst case of 120s before the error handler would ever be able to run.

Since IO commands can be submitted to the device up until the first submitted command is timed out, and that may take 60s for first command to get timed out, we could summarize the worst case scenario for longest time required to return IO errors on a device as follows:

[1] Command submitted to the sub path of device, inherits 60s timeout from /sys.

[2] just before 60s is up, another command is submitted, also inheriting a 60s timeout.

[3] first command times out at 60s, error handler starts but must sleep until all other commands have completed or timed out. Since we had a command submitted just before this, we wait another 60s for it to timeout.

[4] Now we attempt to abort all timed out commands. Note that each abort also sends a Test Unit Ready (TUR SCSI command) to the device, which have a 10 second timeout, adding extra time to the total.

[5] depending on the result of the abort, we may also have to reset the device/bus/host. This would add an indeterminate amount of time to the process, including more Test Unit Ready (TUR SCSI command) at 10 seconds each.

[6] Now that we’ve aborted all commands and possibly reset the device/bus/host, we requeue the cancelled commands. This is where we wait ((number of allowed attempts + 1) * timeout_per_command) = ((5+1) * 60s) = 360s. (**Note: in the above formula the number of allowed attempts defaults to 5 for any IO commands issued through the VFS layer, and “timeout_per_command” is the timeout value set in the “/sys/class/scsi_device/h:c:t:l/device/timeout” file).

[7] As commands reach their “(number of allowed attempts + 1 * timeout_per_command)” timeout, they will be failed back up to the DM-Multipath or application layer with an error code. This is where you finally see SCSI errors, and if multipath software is involved, for a path failure.

So the basic idea is that it’s very hard to predict the exact time it would take to fail over, and it’s worth trying to fiddle with the different timeout settings, i.e. the ones already mentioned plus fast_io_fail_tmo and dev_loss_tmo from multipath.conf, as well as to look at the problem from the application’s side and update _asm_hbeatiowait accordingly. The question remains: why did Oracle decide to set this parameter to 15 seconds by default?
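
To tie the pieces together, a multipath.conf fragment along these lines illustrates the direction we took; the numbers are examples rather than recommendations and must be validated against your array vendor’s guidance:

defaults {
        polling_interval        5
}
devices {
        device {
                vendor                  "EMC"
                product                 "SYMMETRIX"
                no_path_retry           fail
                fast_io_fail_tmo        5
                dev_loss_tmo            30
        }
}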

Matching Oracle ASM disks names with the physical devices

Quite often it’s required to find the physical device that backs an ASM disk. The easiest way a Linux administrator can accomplish that is with the oracleasm command (provided, of course, that ASMlib was used to create the disks):

# oracleasm listdisks
ASMARCHIVE1
ASMARCHIVE2
ASMARCHIVE3
ASMDATA1
ASMDATA2
ASMDATA3
OCR1
OCR2
VOTEDISK1
VOTEDISK2
VOTEDISK3
# oracleasm querydisk -p OCR1
Disk "OCR1" is a valid ASM disk
/dev/mapper/mpath6: LABEL="OCR1" TYPE="oracleasm"
/dev/sdx: LABEL="OCR1" TYPE="oracleasm"
/dev/sdh: LABEL="OCR1" TYPE="oracleasm"
/dev/sdan: LABEL="OCR1" TYPE="oracleasm"
/dev/sdbd: LABEL="OCR1" TYPE="oracleasm"

Openfire. Migrating from HSQLDB to MySQL.

The other day I had to migrate Openfire from HSQLDB to MySQL using the MySQL Migration tool, and below are a couple of tips that could save you a bit of time if you are up to the same task:

  1. I used Windows XP.
  2. MySQL Migration tool has been EOLed but it is still available from mysql.com.
  3. Java 1.5 is required to run MySQL Migration tool.
  4. Set -Xmx to 512m or bigger, as shown below, if your openfire.script is big. Mine was 135MB and that was essential:

    cd "c:\Program Files\MySQL\MySQL Tools for 5.0"
    .\MySQLMigrationTool.exe -Xmx 512m

  5. Otherwise you will get the following error:

    Connecting to source database and retrieve schemata names.
    Initializing JDBC driver …
    Driver class Generic Jdbc
    Opening connection …
    Connection jdbc:hsqldb:c:\temp\embedded-db\openfire
    The list of schema names could not be retrieved (error: 0).
    ReverseEngineeringGeneric.getSchemata :Out of Memory
    Details:

  6. Using MySQL Migration tool is trivial but you should provide a proper connection string. If you don’t, none of your tables will be migrated and what you’ll see in the end is a report similar to this one:

    1. Schema Migration
    ——————-

    Number of migrated schemata: 1

    Schema Name: PUBLIC
    – Tables: 0
    – Views: 0
    – Routines: 0
    – Routine Groups: 0
    – Synonyms: 0
    – Structured Types: 0
    – Sequences: 0

  7. I stopped Openfire and copied the content of /opt/openfire/embedded-db (there are actually just two files inside – openfire.log and openfire.script) to c:\temp\embedded-db on my Windows PC.
  8. Copied hsqldb.jar from the server to the lib/ directory of MySQL Migration tool where it keeps various jars.
  9. Used the following connection string and class name respectively (also shown on the screenshot):

    jdbc:hsqldb:file:c:\temp\embedded-db\openfire
    org.hsqldb.jdbc.Driver

[Screenshot: MySQL Migration Tool connection settings]

The rest is just a series of clicks on the “Next” button.
Please note that if you choose to migrate the data directly into your MySQL DB, all the tables will be created with their names in UPPER case. If that's not what you prefer, then instead of checking “Create Objects Online” and “Transfer Data Online” simply select “Create Script File for Create Statements” and “Create Script File for Insert Statements”, and the tool will create two files, Creates.sql and Inserts.sql, which you can later edit to meet your preferences.
To solve that issue I came up with a dumb and bold script that fixes the table names, which I put in my GitHub repository.
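
If all you need is to lowercase the table names in the generated scripts, something as crude as the following GNU sed one-liners (a rough sketch, not the script from my repository) gets you most of the way:

sed -i 's/\(CREATE TABLE \)\([A-Za-z0-9_]*\)/\1\L\2/' Creates.sql
sed -i 's/\(INSERT INTO \)\([A-Za-z0-9_]*\)/\1\L\2/' Inserts.sql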

TAO: Facebook’s Distributed Data Store for the Social Graph

This post will be very succinct: just a single link to the USENIX publication describing TAO and a quote from the paper’s abstract to catch your eye:

We introduce a simple data model and API tailored for serving the social graph, and TAO, an implementation of this model. TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook’s demanding workload using a fixed set of queries. It is deployed at Facebook, replacing memcache for many data types that fit its model. The system runs on thousands of machines, is widely distributed, and provides access to many petabytes of data. TAO can process a billion reads and millions of writes each second.

Definitely worth reading.

Simple Python scp wrapper

Inspired by the following article I decided to write a simple Python scp wrapper based on the ideas taken from the Perl script mentioned on the same page.

The whole point of this wrapper is dead simple: to allow certain users to scp into their specific directories without granting them full ssh access.

The configuration file is straightforward:
/opt/etc/scponly.conf

scponly=/tmp:/home/scptest

Below is the wrapper in all its amateur (must admit I’m not a developer at all) glory ;-)

/opt/bin/scp-wrapper.py:

#!/usr/bin/python
 
import sys
import os
import pwd
from subprocess import call, Popen, PIPE
import re
 
def fail(msg):
    """(str) -> str Prints error message to STDOUT
    >>> fail('Error')
    Error
    >>> fail('Wrong argument')
    Wrong argument
    """
    
    print msg
    sys.exit(1)
 
def access_verify(user, dirto):
    """(str, str) -> bool  Return True iff user is allowed to scp into dirto
    """

    if user in users:
        for d in users[user]:
            d = d.rstrip()
            # accept an exact match and tolerate a single trailing slash
            # on either side, e.g. /tmp vs /tmp/
            if dirto == d:
                return True
            elif (dirto[-1:] == "/" or d[-1:] == "/") and \
                 (dirto[:-1] == d or dirto == d[:-1]):
                return True
    return False
 
if __name__ == '__main__':
 
    users = {}
    conf = "/usr/local/etc/scponly.conf"
 
    # Reading configuration file
 
    fp = open(conf, "r")
    for line in fp.readlines():
        record = line.split("=")
        users[record[0]] = record[1].split(":")
    fp.close()
  
    command = sys.argv[2]
    scp_args = command.split()
 
    if scp_args[0] != "scp":
        msg = "Only scp is allowed"
        fail(msg)
 
    if scp_args[1] != "-t" and not "-f" and not "-v":
        msg = "Restricted; only server mode is allowed"
        fail(msg)
 
    destdir = scp_args[-1]
    # if the destination is an existing directory authorize against it,
    # otherwise treat it as a file path and check its parent directory
    if os.path.isdir(destdir):
        destdirv = destdir
    else:
        destdirv = os.path.dirname(destdir)
    uname = pwd.getpwuid(os.getuid())[0]
    if not access_verify(uname, destdirv):
        msg = "User " + uname + " is not authorized to scp to this host."
        fail(msg)
    else:
        scp_args.pop(0)
        if len(scp_args) == 2:
            call(["/usr/bin/scp", scp_args[0], destdir])
        elif len(scp_args) == 3:
            call(["/usr/bin/scp", scp_args[0], scp_args[1], destdir])

Just create a new user and set its shell (-s option in useradd) to /opt/bin/scp-wrapper.py. Hope it helps.
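
A hedged usage sketch, where the user name, directories and file name are nothing more than examples:

# on the server
useradd -m -s /opt/bin/scp-wrapper.py scptest
passwd scptest
echo "scptest=/tmp:/home/scptest" >> /opt/etc/scponly.conf

# from a client
scp report.txt scptest@server:/tmp/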
Enjoy and have fun!

OpenLDAP do_syncrep retrying attempts

Do you observe error messages on your Linux OpenLDAP replica or master server similar to the ones listed below?

May 16 12:05:21 ldapserver1 slapd[5420]: do_syncrep2: rid=005 (-1) Can’t contact LDAP server
May 16 12:05:21 ldapserver1 slapd[5420]: do_syncrepl: rid=005 rc -1 retrying (4 retries left)
May 16 12:05:21 ldapserver1 slapd[5420]: do_syncrep2: rid=002 (-1) Can’t contact LDAP server
May 16 12:05:21 ldapserver1 slapd[5420]: do_syncrepl: rid=002 rc -1 retrying (4 retries left)
May 16 14:05:27 ldapserver1 slapd[5420]: do_syncrep2: rid=005 (-1) Can’t contact LDAP server
May 16 14:05:27 ldapserver1 slapd[5420]: do_syncrepl: rid=005 rc -1 retrying (4 retries left)
May 16 14:05:27 ldapserver1 slapd[5420]: do_syncrep2: rid=002 (-1) Can’t contact LDAP server
May 16 14:05:27 ldapserver1 slapd[5420]: do_syncrepl: rid=002 rc -1 retrying (4 retries left)
May 16 16:05:32 ldapserver1 slapd[5420]: do_syncrep2: rid=005 (-1) Can’t contact LDAP server
May 16 16:05:32 ldapserver1 slapd[5420]: do_syncrepl: rid=005 rc -1 retrying (4 retries left)
May 16 16:05:32 ldapserver1 slapd[5420]: do_syncrep2: rid=002 (-1) Can’t contact LDAP server
May 16 16:05:32 ldapserver1 slapd[5420]: do_syncrepl: rid=002 rc -1 retrying (4 retries left)

If yes, and these messages seem to pop up every two hours, then you might consider updating the following sysctl parameters. The two-hour period is a strong hint: it matches the default net.ipv4.tcp_keepalive_time of 7,200 seconds, i.e. an idle replication connection is most likely being silently dropped somewhere along the way and slapd only notices it when the keepalive probes fail.

net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes 

Where:

  • net.ipv4.tcp_keepalive_time – how often TCP sends out keepalive messages when keepalive is enabled. Default: 2 hours (7,200 seconds).
  • net.ipv4.tcp_keepalive_intvl – how frequently the probes are sent out. Multiplied by
    tcp_keepalive_probes it gives the time after which a non-responding connection is killed, once probing has started. Default value: 75 sec, i.e. the connection will be aborted after roughly 11 minutes of retries.
  • net.ipv4.tcp_keepalive_probes – how many keepalive probes TCP sends out before it decides that the connection is broken. Default value: 9.
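
For example, to make the kernel detect dead connections much sooner (the numbers below are only an illustration; pick values that suit your network):

sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5

# and persist them across reboots in /etc/sysctl.conf:
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5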

Hopefully that would make your OpenLDAP replication more reliable.