Pair “Listen queue overflow” FreeBSD errors with pcb

Just yesterday, after an upgrade to MySQL 5.7.12, I saw plenty of errors being logged on the system:

sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)

There is a great post that explains how to find the culprit. In a nutshell, there are two quick options:

  1. Use “lsof -itcp -stcp:listen -P” and grep for pcb.
  2. Or, since “the overflow happens when the queue is at about 150% capacity” (as mentioned in the original post), it’s possible to match the number from the error (151 in my case, which points at a listener with a maxqlen of about 100) against the output of “netstat -an -p tcp -L”.

In my case that was trivial, as both Postfix and Dovecot complained about a missing libmysqlclient.so.18 shared library, which had been replaced with libmysqlclient.so.20 by the upgrade. Rebuilding both from ports and restarting them fixed the issue, and no fiddling with kern.ipc.somaxconn was needed.
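The second option can be scripted. Below is a minimal sketch, assuming the usual FreeBSD layout of “netstat -an -p tcp -L” (a qlen/incqlen/maxqlen column followed by the local address). The sample output, its addresses and numbers, and the function name find_overflow_candidates are made up for illustration; the 150% rule is the approximation quoted above, not an exact kernel formula.

```python
import re

# Hypothetical sample of `netstat -an -p tcp -L` output; values are invented.
SAMPLE = """\
Current listen queue sizes (qlen/incqlen/maxqlen)
Proto Listen                           Local Address
tcp4  0/0/128                          *.22
tcp4  151/0/100                        *.25
tcp4  0/0/50                           127.0.0.1.3306
"""

# Proto, qlen/incqlen/maxqlen, local address (assumed column layout).
LINE_RE = re.compile(r"^(\S+)\s+(\d+)/(\d+)/(\d+)\s+(\S+)$")


def find_overflow_candidates(netstat_output, reported):
    """Return listeners that could have produced the logged queue length.

    A listener is a candidate if its current qlen has already reached the
    reported number, or if the reported number is about 150% of its maxqlen.
    """
    candidates = []
    for line in netstat_output.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header or unrelated line
        proto, addr = m.group(1), m.group(5)
        qlen, maxqlen = int(m.group(2)), int(m.group(4))
        near_limit = abs(reported - (3 * maxqlen) // 2) <= 1
        if qlen >= reported or near_limit:
            candidates.append((proto, addr, qlen, maxqlen))
    return candidates


if __name__ == "__main__":
    for proto, addr, qlen, maxqlen in find_overflow_candidates(SAMPLE, 151):
        print(proto, addr, "qlen=%d maxqlen=%d" % (qlen, maxqlen))
```

On a real box you would feed it the actual netstat output instead of SAMPLE. Several listeners can share the same maxqlen, so treat the result as a shortlist rather than a verdict.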

Have stalled snmpd in recvfrom()? Check Recv-Q

Not so long ago I had an issue with a monitoring system that paged about SNMP checks failing on a number of servers. A quick check here and there (logs, strace, tcpdump, etc.) revealed that snmpd had stalled in recvfrom() without sending a single packet in response to the constant queries from our monitoring system. Everything seemed to be OK except “netstat -s”, which showed a steady increase in the “Udp: packet receive errors” counter. Summon ss to the rescue:

# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     262680 0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Matching 262680 with “sysctl net.core.rmem_default” suggested that the receive buffer (Recv-Q) was filling up, but why? A closer look at the logs turned up the following segfault:

cmanicd[55673]: segfault at 0 ip 00007f041e721081 sp 00007f040e16c700 error 4 in libnetsnmp.so.20.0.0[7f041e6a1000+a0000]

It turned out to be a well-known issue with the NIC Agent (CMANICD):
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04912220&sp4ts.oid=316583

So it looked to be our guy. Starting cmanicd back up immediately solved the problem:

[root@slon02db12 ~]# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     0      0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Recv-Q dropped to zero and the server turned green in the monitoring dashboard. Bingo. Problem solved, so now it’s time for the upgrade.
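The Recv-Q comparison is easy to automate if you want to catch this before the pager does. A rough sketch, assuming the ss column order shown above (State, Recv-Q, Send-Q, ...). The 262144 buffer size in the example is a hypothetical value, since the actual net.core.rmem_default of that server is not recorded here, and the 90% threshold and the function names are my own choices.

```python
def parse_recv_q(ss_line):
    """Extract Recv-Q (second column) from one data line of ss output."""
    fields = ss_line.split()
    return int(fields[1])


def read_rmem_default(path="/proc/sys/net/core/rmem_default"):
    """Read the default receive buffer size in bytes (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def recv_queue_saturated(recv_q, rmem_default, threshold=0.9):
    """True if the receive queue holds at least `threshold` of the buffer."""
    return recv_q >= threshold * rmem_default


if __name__ == "__main__":
    stalled = 'UNCONN 262680 0 *:161 *:* users:(("snmpd",52984,7))'
    # 262144 is an assumed buffer size for illustration only.
    print(recv_queue_saturated(parse_recv_q(stalled), 262144))
```

A queue that sits near the buffer size across several samples means the process has stopped reading from the socket, which is exactly what the stalled snmpd was doing.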

Btw, if you don’t know how to read a Linux segfault message (I didn’t know that myself before this issue), the following note should fix that:

Nov 27 15:26:19 machine kernel: fmg[6335]: segfault at 00000000ffffd2dc rip 00000000ffffd2dc rsp 00000000ffffd1bc error 15

What does the kernel message mean, in detail?

  • The rip value is the instruction pointer register value, the rsp is the stack pointer register value.
  • The error value is a bit mask of page fault error code bits (from arch/x86/mm/fault.c):
     *   bit 0 ==    0: no page found       1: protection fault
     *   bit 1 ==    0: read access         1: write access
     *   bit 2 ==    0: kernel-mode access  1: user-mode access
     *   bit 3 ==                           1: use of reserved bit detected
     *   bit 4 ==                           1: fault was an instruction fetch
  • Here are the error bit definitions:
    enum x86_pf_error_code {
      PF_PROT   =       1 << 0,
      PF_WRITE  =       1 << 1,
      PF_USER   =       1 << 2,
      PF_RSVD   =       1 << 3,
      PF_INSTR  =       1 << 4,
    };

In my case the error code was 4 (only bit 2 set), which means cmanicd made a user-mode read of address zero where no page was mapped. That reeks of a NULL pointer dereference.
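The bit mask is easy to decode by hand, but here is a small sketch that does it for you. The bit meanings are taken from the table above. The regex assumes the “segfault at ADDR ip IP sp SP error N” layout of the cmanicd line, and the function names are made up for this example.

```python
import re

# (mask, meaning when the bit is clear, meaning when the bit is set)
PF_BITS = [
    (1 << 0, "no page found", "protection fault"),
    (1 << 1, "read access", "write access"),
    (1 << 2, "kernel-mode access", "user-mode access"),
    (1 << 3, None, "use of reserved bit detected"),
    (1 << 4, None, "fault was an instruction fetch"),
]

SEGFAULT_RE = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) error (\d+)"
)


def decode_pf_error(code):
    """Translate a page fault error code into readable flags."""
    flags = []
    for mask, when_clear, when_set in PF_BITS:
        if code & mask:
            flags.append(when_set)
        elif when_clear is not None:
            flags.append(when_clear)
    return flags


def parse_segfault(line):
    """Pull the faulting address and error code out of a kernel log line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {"address": int(m.group(1), 16), "error": int(m.group(4))}


if __name__ == "__main__":
    line = ("cmanicd[55673]: segfault at 0 ip 00007f041e721081 "
            "sp 00007f040e16c700 error 4 in "
            "libnetsnmp.so.20.0.0[7f041e6a1000+a0000]")
    info = parse_segfault(line)
    print(hex(info["address"]), decode_pf_error(info["error"]))
```

Note that the older “rip/rsp” format from the example note above would need a slightly different regex.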