Watch the “Monitorama 2016: All of Your Network Monitoring is (probably) wrong” talk

I just came across this talk mentioned in the comments on Hacker News and, boy, it’s absolutely amazing!
Watch this hilarious talk here – Monitorama 2016: All of Your Network Monitoring is (probably) wrong

Btw, the talk is presented, presumably, by the same guy who wrote Monitoring and Tuning the Linux Networking Stack: Receiving Data and Monitoring and Tuning the Linux Networking Stack: Sending Data, both of which are must-reads.

TIL gethostbyname*() and gethostbyaddr*() functions are obsolete

It wasn’t obvious to me until I tried to run the netcat utility (aka nc) on an Ubuntu 10.04 (Lucid) box to check ZooKeeper’s status:

echo "stat" | nc zookeer_server_name 2181
zookeer_server_name: forward host lookup failed: No address associated with name

It wouldn’t have been a problem had the ZooKeeper server been using an IPv4 address, but it was configured with IPv6. So tools that used gethostbyname2(), e.g. getent, were still fine, and only those calling gethostbyname() were failing me. Luckily, netcat and the other important libraries had newer versions I could switch to. Once again, if you are on an old and rusty Linux distro, be aware that the gethostbyname*() and gethostbyaddr*() functions are obsolete.
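
A quick way to see the difference on such a box is getent, which, per its man page, exercises both resolver paths (the hostname below is just a placeholder):

$ getent hosts some_ipv6_only_host     # goes through gethostbyname2()/gethostbyaddr()
$ getent ahosts some_ipv6_only_host    # goes through the modern getaddrinfo()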

Update
As Anton mentioned in his comment below, getaddrinfo() has its own gotchas, which, if I got it right, are caused by the AI_ADDRCONFIG flag. There is a good summary page which goes into more detail about AI_ADDRCONFIG and the peculiarities of its current implementation in glibc.

jbd2 is munching your disks? Use ftrace to find out why.

Have you ever wondered why jbd2 (or jbd if you are still using ext3) is sitting at the top of iotop and consuming most of the I/O bandwidth? Well, it’s certainly not doing that just to drive you nuts; there is a reason, and most probably the reason is an application issuing a lot of sys_fsync(), sys_fdatasync() or sys_msync() calls.
In case you are not on the latest and greatest kernel and BPF is not available, there is an easy way to confirm that using ftrace.

Just enable tracing of ext4_sync_file_enter events:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable

And read the output from trace or trace_pipe (refer to the ftrace documentation for more information):

# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 16/16   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          mongod-2299  [001] ...1 661508.531446: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.543931: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.566134: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661511.255926: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [000] ...1 661511.703643: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
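
In my case the culprit was obviously mongod, but if several applications are hammering the disk, a quick aggregation over trace_pipe makes the guilty one obvious (a rough sketch that assumes the coreutils timeout command is available; the first column of each event is TASK-PID):

# timeout 10 cat /sys/kernel/debug/tracing/trace_pipe | awk '{print $1}' | sort | uniq -c | sort -rn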

Once you are done, just stop tracing the events and clear ftrace’s buffer:

# echo 0 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable
# echo > /sys/kernel/debug/tracing/trace

So in my case jbd2’s excessive activity was caused by MongoDB’s journaling, which syncs data every 50 ms (starting from version 3.2).

Have snmpd stalled in recvfrom()? Check Recv-Q

Not so long ago I had an issue with a monitoring system that paged about SNMP checks failing on a number of servers. Quick checking here and there (logs, strace, tcpdump, etc.) revealed that snmpd had stalled in recvfrom() without sending a single packet out in response to the constant queries from our monitoring system. Everything seemed to be OK, except that “netstat -s” showed a steady increase in the “Udp: packet receive errors” counter. Summon ss to the rescue:

# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     262680 0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Matching 262680 against “sysctl net.core.rmem_default” suggested that the receive buffer (Recv-Q) was full, but why? A closer look at the logs turned up the following segfault:

cmanicd[55673]: segfault at 0 ip 00007f041e721081 sp 00007f040e16c700 error 4 in libnetsnmp.so.20.0.0[7f041e6a1000+a0000]

It turned out to be a well-known issue with the HP NIC Agent (cmanicd):
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04912220&sp4ts.oid=316583

So it looked like our guy. Starting cmanicd back up immediately solved the problem:

[root@slon02db12 ~]# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     0      0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Recv-Q dropped to zero and the server went green in the monitoring dashboard. Bingo. Problem solved, so now it’s time for the upgrade.

Btw, if you don’t know how to read a Linux segfault message (I didn’t know that myself before this issue), then the following note could fix that:

Nov 27 15:26:19 machine kernel: fmg[6335]: segfault at 00000000ffffd2dc rip 00000000ffffd2dc rsp 00000000ffffd1bc error 15

What does the kernel message mean, in detail?

  • The rip value is the instruction pointer register value, the rsp is the stack pointer register value.
  • The error value is a bit mask of page fault error code bits (from arch/x86/mm/fault.c):

     *   bit 0 ==    0: no page found       1: protection fault
     *   bit 1 ==    0: read access         1: write access
     *   bit 2 ==    0: kernel-mode access  1: user-mode access
     *   bit 3 ==                           1: use of reserved bit detected
     *   bit 4 ==                           1: fault was an instruction fetch

  • Here is the error bit definition from the same file:

    enum x86_pf_error_code {
      PF_PROT   =       1 << 0,
      PF_WRITE  =       1 << 1,
      PF_USER   =       1 << 2,
      PF_RSVD   =       1 << 3,
      PF_INSTR  =       1 << 4,
    };

In my case the error code was 4, which means cmanicd tried to read address zero from user space and no page was mapped there; that reeks of a NULL pointer dereference.
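
If doing the bit arithmetic in your head is not your thing, a throwaway shell snippet decodes the error code just as well:

err=4   # the value from the segfault message
[ $((err & 1)) -ne 0 ] && echo "protection fault"  || echo "no page found"
[ $((err & 2)) -ne 0 ] && echo "write access"      || echo "read access"
[ $((err & 4)) -ne 0 ] && echo "user-mode access"  || echo "kernel-mode access"
[ $((err & 8)) -ne 0 ] && echo "use of reserved bit detected"
[ $((err & 16)) -ne 0 ] && echo "fault was an instruction fetch"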

Do the initrd dance before turning a Linux physical server into a VM

If one day you decide to convert your physical server into a VM, which is easily achieved if all its disks are presented from a SAN, then don’t forget to rebuild the initrd beforehand. Otherwise you will see something similar to this:

No device found
Scanning and configuring dmraid supported devices
Scanning logical volumes
  Reading all physical volumes. This may take a while...
  No volume groups found
Activating logical volumes
  Volume group "VolGroup00" not found
Trying to resume from /dev/VolGroup00/LogVol01
Unable to access resume device (/dev/VolGroup00/LogVol01)
Creating root device.
Mounting root filesystem.
mount: could not find filesystem '/dev/root'
Setting up other filesystems.
Setting up new root fs
setuproot: moving /dev failed: No such file or directory
no fstab.sys, mounting internal defaults
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
Switching to new root and running init
unmount old /dev
unmount old /proc
unmount old /sys
switchroot: mount failed: No such file or directory 
Kernel panic - not syncing: Attempted to kill init! 

Also, if your SAN disks are multipathed, which is the obvious and only correct choice, then you must (according to a Red Hat note) disable multipath by editing /etc/sysconfig/mkinitrd/multipath, otherwise the system won’t boot:

# vi /etc/sysconfig/mkinitrd/multipath
MULTIPATH=no

Root Cause
The multipath option should only be set to YES if your root volume (/) is on a multipathed device.
If multipath is enabled with root (/) on a local device, multipathing will be enabled at boot time and will lock down the device.
If the device is locked down, fsck will be unable to open it for checking.

There are two options to rebuild the initrd:

  1. Use mkinitrd or dracut, depending on the OS version you’re currently on, and pre-build a new initrd before detaching the disks from the old system.
  2. If the system has already been converted to a VM, i.e. all disks from the old system have been detached and presented as RDMs to the new VM, then boot from a rescue disk, chroot to /mnt/sysimage (if you are running RedHat or CentOS) and run mkinitrd or dracut from there. Keep in mind that the /boot partition as well as /proc, /dev and /sys must be mounted in the chrooted environment or, again, your system will not fly:
    mount --bind /proc /mnt/sysimage/proc
    mount --bind /dev /mnt/sysimage/dev
    mount --bind /sys /mnt/sysimage/sys
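
    With the bind mounts in place, the rebuild itself could look roughly like this (the kernel versions are purely illustrative; use the one installed on the old system, not the rescue kernel’s):

    # chroot /mnt/sysimage
    # mount /boot
    # mkinitrd -f /boot/initrd-2.6.18-398.el5.img 2.6.18-398.el5

    or, on dracut-based releases (RHEL/CentOS 6 and later):

    # dracut -f /boot/initramfs-2.6.32-573.el6.x86_64.img 2.6.32-573.el6.x86_64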

Good luck.

Workaround for Tomcat7 on Linux, JDBC and javax.naming.NamingException

A few days ago I was dabbling with JDBC and Tomcat 7, and a configuration that seemingly had no issues resulted in the following error in the log file:

org.apache.catalina.core.NamingContextListener addResource
WARNING: Failed to register in JMX: javax.naming.NamingException: Could not create resource factory instance
[Root exception is java.lang.ClassNotFoundException: org.apache.tomcat.dbcp.dbcp.BasicDataSourceFactory]

Thankfully, Google pointed me to this post at stackoverflow.com which had both the solution and the link to the details behind this behaviour.

In a nutshell, the workaround looks like the following:

  1. Grab the tomcat-dbcp-version.jar from Maven that matches the version of Tomcat you are running and place it in $CATALINA_HOME/lib. Copying it somewhere else and creating a symlink also works.
  2. Update the <Resource/> section in context.xml by adding the following attribute (see the example after this list):
    factory="org.apache.commons.dbcp.BasicDataSourceFactory"
  3. Restart Tomcat
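
For reference, a <Resource> entry with the factory attribute in place might look something like this (the JNDI name, driver, credentials and pool sizes are made up for illustration; only the factory line is the actual workaround):

<Resource name="jdbc/mydb" auth="Container" type="javax.sql.DataSource"
          factory="org.apache.commons.dbcp.BasicDataSourceFactory"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/mydb"
          username="dbuser" password="dbpass"
          maxActive="20" maxIdle="10" maxWait="10000"/>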

Peace.

P.S. Did a quick test and it looks like FreeBSD distributes tomcat-dbcp.jar as part of its tomcat7 package:

# pkg query %Fp tomcat7 | grep dbcp
/usr/local/apache-tomcat-7.0/lib/tomcat-dbcp.jar

Interview fizzle as a chance to get better

Not long ago I had one of those humiliating moments when a simple question makes you go numb or, even worse, mumble absolute rubbish. That’s exactly what happened to me recently, and being an afterthought person (which, of course, doesn’t give me any advantage) I decided to do some homework/recap on the questions I had failed miserably.

  • Linux PIPE
    – Read “man 2 pipe” as it basically says it all in a single sentence:

    pipe() creates a pair of file descriptors, pointing to a pipe inode, and places them in the array pointed to by filedes. filedes[0] is for reading, filedes[1] is for writing.

    – Want to go deeper? Then the source code is the best place to start.

  • Linux VM overcommit
    – Again, start by reading the documentation.
    – Take a look at the code to figure out how the heuristic overcommit handling works, especially __vm_enough_memory(), which is run by security_vm_enough_memory_mm(), which in turn can be called from different places, e.g. mmap_region(), acct_stack_growth(), do_brk(), insert_vm_struct(), dup_mmap().

  • MALLOC
    – “man 3 malloc”, “man 3 mallopt”
    – Go through the do_brk() code.

  • Swappiness
    – Read the vm sysctl documentation about the swappiness parameter.
    – swappiness comes into play in get_scan_count(), which is called from shrink_lruvec().
    – If the code looks murky, take a look at the answer published at unix.stackexchange.com, which goes into greater detail about vm.swappiness.
    – Read about the Split LRU.
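
To tie all of this back to a live system, the relevant knobs are a cat away (the comments reflect the usual defaults described in the vm sysctl documentation, not recommendations):

$ cat /proc/sys/vm/swappiness           # 60 by default; feeds get_scan_count()
$ cat /proc/sys/vm/overcommit_memory    # 0 = heuristic, 1 = always overcommit, 2 = strict accounting
$ grep -E 'CommitLimit|Committed_AS' /proc/meminfo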

And of course, buy, read and re-read Understanding the Linux Kernel even if it’s a bit dated.

Configuring FCoE in Linux (RHEL) and HP FlexFabric

Actually it’s easy. Very easy indeed, like going 1, 2, 3.

  1. Collect information about MAC addresses to distinguish pure Ethernet NICs from the CNAs that will pass FCoE traffic. The latter have both MAC and WWN addresses.
  2. Power on a server and update /etc/udev/rules.d/70-persistent-net.rules if required.
  3. Activate the new udev rules:
    # udevadm trigger
    
  4. Install fcoe-utils and lldpad packages:
    # yum install fcoe-utils.x86_64
    
  5. Copy /etc/fcoe/cfg-ethx to a file named after your CNA interface. For example, if eth5 is your CNA interface, then:
    # cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth5
    
  6. Edit the /etc/fcoe/cfg-ethX files and change DCB_REQUIRED="yes" to DCB_REQUIRED="no".
  7. Start the FCoE and LLDPAD services and set adminStatus to disabled for ALL Broadcom-based CNA interfaces, as stated by HP. Please note that

    …In a FlexFabric environment, LLPAD must be disabled on all network adapters…

    # chkconfig lldpad on
    # chkconfig fcoe on
    # service lldpad start
    # service fcoe start
    # for d in `ip link ls | grep mtu | awk -F \: '{print $2}'`; do lldptool set-lldp -i $d adminStatus=disabled; done
    
  8. Create an Ethernet configuration file for each CNA interface to make sure it will be brought up after a reboot:
    DEVICE=eth5
    ONBOOT=yes
    BOOTPROTO=none
    USERCTL=NO
    MTU=9000
    
  9. Run ifup to bring the FCoE interfaces up. If everything is OK, reboot the server as a final test and start enjoying FCoE (a few quick sanity checks are shown after this list).
    # ifup eth5
    
  10. Why MTU=9000? Because the FC payload is 2,112 bytes, jumbo frames must be turned on so that a full FC frame fits into a single Ethernet frame (FCoE frames cannot be fragmented).
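
Once the interfaces are up, fcoe-utils ships the fcoeadm utility, which is handy for a quick sanity check (the flag meanings are per its man page; adjust if your version differs):

# fcoeadm -i     # FCoE instances and their state
# fcoeadm -t     # FC targets discovered behind each instance
# fcoeadm -l     # LUNs presented over FCoE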

OpenSSL TLS 1.1 and wrong version number

If you, like me, have been living under a rock, you’d also be surprised to learn that OpenSSL didn’t support TLSv1.1 and TLSv1.2 until version 1.0.1.
I found that out accidentally while trying to disable TLSv1 in Nginx running on a RHEL5 box with OpenSSL 0.9.8e. Below is how the TLS handshake looked when TLSv1.1 was deliberately requested:

$ openssl s_client -host some_host_name_here -port 443 -tls1_1 -state -msg
CONNECTED(00000003)
SSL_connect:before/connect initialization
>>> TLS 1.1 Handshake [length 0096], ClientHello
    01 00 00 92 03 02 54 e6 ea 6b bc f9 c7 bc 47 4e
    da a9 74 2e c8 27 c4 90 18 94 eb cf 21 40 ef 11
    fe 09 a0 38 bf 2a 00 00 4c c0 14 c0 0a 00 39 00
    38 00 88 00 87 c0 0f c0 05 00 35 00 84 c0 13 c0
    09 00 33 00 32 c0 12 c0 08 00 9a 00 99 00 45 00
    44 00 16 00 13 c0 0e c0 04 c0 0d c0 03 00 2f 00
    96 00 41 00 0a 00 07 c0 11 c0 07 c0 0c c0 02 00
    05 00 04 00 ff 01 00 00 1d 00 0b 00 04 03 00 01
    02 00 0a 00 08 00 06 00 19 00 18 00 17 00 23 00
    00 00 0f 00 01 01
SSL_connect:SSLv3 write client hello A
>>> TLS 1.0 Alert [length 0002], fatal protocol_version
    02 46
SSL3 alert write:fatal:protocol version
SSL_connect:error in SSLv3 read server hello A
140075793618760:error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number:s3_pkt.c:337:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 5 bytes and written 7 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.1
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1424419435
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
---
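
A crude but quick way to tell up front whether the local OpenSSL can even ask for the newer protocol versions, before blaming the server side:

$ openssl version
$ openssl s_client -help 2>&1 | grep -E 'tls1_[12]'    # no output means the client itself has no TLS 1.1/1.2 support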

Linux pptp stumbling blocks that I was hit by

While configuring pptp on a Linux box I bumped into several smallish issues which I’d like to blog about.

  1. Make sure that your network engineers have enabled PPTP traffic inspection on all intermediate firewalls between the tunnel’s endpoints (PPTP carries the PPP frames over GRE, so GRE has to be allowed through as well). Otherwise LCP won’t be able to finish its configuration negotiation phase even if the control channel on TCP port 1723 was successfully established before that.
  2. All you would get is admonitions similar to the ones listed below:

    pppd call connection_name debug nodetach
    using channel 5
    Using interface ppp0
    Connect: ppp0 <--> /dev/pts/2
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    sent [LCP ConfReq id=0x1    ]
    Modem hangup
    Connection terminated.
    Script pptp xxx.xxx.xxx.xxx --nolaunchpppd finished (pid 10385), status = 0x0
    

    Just remember that without working LCP there will be no ppp connection. Period. (A quick way to check whether GRE actually flows is shown after this list.)

  3. If you are running a Red Hat Linux distro or any of its derivatives and want to start the pptp tunnel using the ifup command, just do the following:
    • Create a configuration file /etc/sysconfig/network-scripts/ifcfg-your_connection_name
    • In my case the content of the file is rather ascetic; depending on your requirements yours might have different options:

      DEVICE=ppp0
      ONBOOT=yes
      USERCTL=yes
      DEFROUTE=no
      PEERDNS=no
      
    • Make sure that the your_connection_name part of the /etc/sysconfig/network-scripts/ifcfg-your_connection_name filename matches exactly the peer name you have under /etc/ppp/peers/. Otherwise ifup simply won’t fly.
  4. Now you should be able to fire “ifup your_connection_name” and just a moment later you should have your tunnel up and running.
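
And the GRE check promised above, worth running while pppd keeps retrying (eth0 and pptp_server_ip are placeholders for your outbound interface and the tunnel endpoint):

# tcpdump -ni eth0 'ip proto 47 and host pptp_server_ip'

If you only see outbound GRE packets and nothing coming back, the firewall in between is the one to blame.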

Have a stable connection!