Have a stalled snmpd in recvfrom()? Check Recv-Q

Not so long ago I had an issue with a monitoring system that paged about SNMP checks failing on a number of servers. A quick check here and there (logs, strace, tcpdump, etc.) revealed that snmpd had stalled in recvfrom() without sending a single packet in response to the constant queries from our monitoring system. Everything seemed OK except that “netstat -s” showed a steady increase in the “Udp: packet receive errors” counter. Summon ss to the rescue:

# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     262680 0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Matching 262680 against “sysctl net.core.rmem_default” suggested that the receive buffer (Recv-Q) was filling up, but why? Taking a closer look at the logs turned up the following segfault:

cmanicd[55673]: segfault at 0 ip 00007f041e721081 sp 00007f040e16c700 error 4 in libnetsnmp.so.20.0.0[7f041e6a1000+a0000]

It turned out to be a well-known issue with the NIC Agent (CMANICD).

So it looked like our culprit. Starting cmanicd again immediately solved the problem:

[root@slon02db12 ~]# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     0      0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Recv-Q dropped to zero and the server turned green in the monitoring dashboard. Bingo. Problem solved, so now it’s time for the upgrade.
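If you'd rather watch for this condition from a script than eyeball ss output, here is a rough Python sketch that pulls the receive queue depth and drop counter for a UDP port out of /proc/net/udp. It assumes a kernel whose /proc/net/udp has the trailing "drops" column, and it assumes the rx_queue value there tracks what ss shows as Recv-Q. The SAMPLE row is made up for illustration (only the queue depth mirrors the ss output above; the inode, pointer and drop count are invented), and udp_queue_stats is just a name I picked.

```python
# Hypothetical /proc/net/udp excerpt: 0x00A1 is port 161, 0x00040218 is 262680.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
 100: 00000000:00A1 00000000:0000 07 00000000:00040218 00:00000000 00000000     0        0 12345 2 0000000000000000 42
 101: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12346 2 0000000000000000 0
"""


def udp_queue_stats(proc_net_udp_text, port):
    """Return a list of (rx_queue_bytes, drops) for UDP sockets bound to `port`."""
    stats = []
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 13:  # older kernels lack the drops column; skip such rows
            continue
        # local_address is HEXIP:HEXPORT
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        # Field 4 is "tx_queue:rx_queue", both in hex
        rx_queue = int(fields[4].split(":")[1], 16)
        drops = int(fields[12])  # drops is printed in decimal
        stats.append((rx_queue, drops))
    return stats


if __name__ == "__main__":
    try:
        with open("/proc/net/udp") as f:
            text = f.read()
        with open("/proc/sys/net/core/rmem_default") as f:
            rmem_default = int(f.read())
    except OSError:
        # Not on Linux (or no access): fall back to the made-up sample
        text, rmem_default = SAMPLE, None
    for rx_queue, drops in udp_queue_stats(text, 161):
        print("rx_queue=%d drops=%d rmem_default=%s" % (rx_queue, drops, rmem_default))
```

A queue depth that stays close to rmem_default while the drop counter keeps climbing is the same symptom as the one described above: nobody is draining the socket.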

By the way, if you don’t know how to read a Linux segfault message (I didn’t know that myself before this issue), the following note could fix that:

Nov 27 15:26:19 machine kernel: fmg[6335]: segfault at 00000000ffffd2dc rip 00000000ffffd2dc rsp 00000000ffffd1bc error 15

What does the kernel message mean, in detail?

  • The rip value is the instruction pointer register value, the rsp is the stack pointer register value.
  • The error value is a bit mask of page fault error code bits (from arch/x86/mm/fault.c):
     *   bit 0 ==    0: no page found       1: protection fault
     *   bit 1 ==    0: read access         1: write access
     *   bit 2 ==    0: kernel-mode access  1: user-mode access
     *   bit 3 ==                           1: use of reserved bit detected
     *   bit 4 ==                           1: fault was an instruction fetch
  • Here’s the error bit definition:
    enum x86_pf_error_code {
      PF_PROT   =       1 << 0,
      PF_WRITE  =       1 << 1,
      PF_USER   =       1 << 2,
      PF_RSVD   =       1 << 3,
      PF_INSTR  =       1 << 4,
    };

In my case the error code was 4, i.e. only bit 2 (user-mode access) is set. That means cmanicd, running in user space, tried to read address zero and no page was mapped there, which reeks of a NULL pointer dereference.
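To save myself from counting bits next time, here is a small Python sketch that decodes the error field using the bit table quoted above. The function names and the regular expression are my own and only cover the two message layouts shown in this post ("ip/sp" and "rip/rsp"). One thing to keep in mind: the kernel prints the error value in hexadecimal, so "error 15" is 0x15, not decimal 15.

```python
import re

# (meaning when bit is 0, meaning when bit is 1), indexed by bit number,
# taken from the arch/x86/mm/fault.c comment quoted above.
PF_BITS = [
    ("no page found", "protection fault"),
    ("read access", "write access"),
    ("kernel-mode access", "user-mode access"),
    (None, "use of reserved bit detected"),
    (None, "fault was an instruction fetch"),
]

# Matches both "ip ... sp ..." and the older "rip ... rsp ..." layout.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) r?ip (?P<ip>[0-9a-f]+) "
    r"r?sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)


def decode_pf_error(code):
    """Translate a page fault error code into a list of human-readable flags."""
    meanings = []
    for bit, (if_clear, if_set) in enumerate(PF_BITS):
        meaning = if_set if code & (1 << bit) else if_clear
        if meaning is not None:
            meanings.append(meaning)
    return meanings


def parse_segfault(line):
    """Extract address, ip, sp and error (as ints) from a segfault log line."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    # All four values are hexadecimal in the kernel message
    return {name: int(value, 16) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    line = ("cmanicd[55673]: segfault at 0 ip 00007f041e721081 "
            "sp 00007f040e16c700 error 4 in libnetsnmp.so.20.0.0[7f041e6a1000+a0000]")
    info = parse_segfault(line)
    print(hex(info["addr"]), decode_pf_error(info["error"]))
```

For the cmanicd line this yields address 0x0 with "no page found", "read access" and "user-mode access", which matches the NULL pointer reading above.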

Posted on May 14, 2016 at 9:17 pm by sergeyt
In: Linux
