jbd2 is munching your disks? Use ftrace to find why.

Have you ever been wondering why jbd2 (or jbd if your are still using ext3) is sitting at the top of iotop and consuming the most of IO bandwidth? Well, it’s certainly not because it’s doing that just to drive you nuts but there is a reason. And the reason is most probably there is an app that is doing a lot of sys_fsync(), sys_fdatasync() or sys_msync().
In case your are not on the latest and greatest kernel and BPF is not available, there is an easy way to confirm that using ftrace.

Just enable tracing of ext4_sync_file_enter events:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable

And print out the output from trace or trace_pipe (refer to the documentation of ftrace for more information):

# cat /sys/kernel/debug/tracing/trace/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 16/16   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          mongod-2299  [001] ...1 661508.531446: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.543931: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.566134: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661511.255926: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [000] ...1 661511.703643: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 

Once you are done, just stop tracing the events and clear ftrace’s buffer:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable
# echo > /sys/kernel/debug/tracing/trace

So in my case jbd2 excessive activity was caused by MongoDB’s journaling which syncs data every 50ms (starting from version 3.2).

MongoDB – Defeating RangeDeleter

Not closing a cursor in MongoDB could hurt you big, so it’s generally not recommended to use no_cursor_timeout=True (pymongo3) or timeout=False (pymongo2). Especially when you run shared MongoDB installation:

PyMongo does “close” cursors when they are garbage collected, but they aren’t closed immediately and closing a cursor in all current versions of MongoDB is asynchronous. Depending on the python implementation, relying on garbage collection to close the cursor is not a great idea. Discarded, not fully iterated cursors can live for some time when using Jython or PyPy which do not do reference counting garbage collection. That’s why the Cursor object has an explicit close() method.

Not using close() method could potentially be the reason behind the following lines in the MongoDB’s log file:

SHARDING [RangeDeleter] waiting for open cursors before removing range

And even if it looks innocuous, it’s actually not quite, since what it means is that a source shard can’t delete its copy of the documents – Step 7 in chunk migration procedure.

At the time of this writing there is still no way (SERVER-3090) to glean more information that pertains to a cursor’s id, so the only way out that I was able to come up with was to kill those dangling cursors using an undocumented (as of this writing) killCursors command:

> use your_db_name
> db.runCommand({killCursors: 'your_collection_name', cursors: [NumberLong(51518759968), NumberLong(51484189302), NumberLong(51451409949), NumberLong(51434938438), NumberLong(51429435438), NumberLong(51383912702)]})
{
	"cursorsKilled" : [
		NumberLong("51518759968"),
		NumberLong("51484189302"),
		NumberLong("51451409949"),
		NumberLong("51434938438"),
		NumberLong("51429435438"),
		NumberLong("51383912702")
	],
	"cursorsNotFound" : [ ],
	"cursorsAlive" : [ ],
	"cursorsUnknown" : [ ],
	"ok" : 1
}

And as a reward look for the following message in the log file:

SHARDING [RangeDeleter] Deleter starting delete for

Another FreeBSD drop into the Digital Ocean

After several happy years with FreeBSD running in AWS I finally have switched to Digital Ocean. That happened a few days ago and was driven mainly by the lack of the console support which “aws ec2 get-console-output”, in my opinion, is certainly not.
After the upgrade to FreeBSD 11 I found my instance unreachable and had absolutely no clue what was wrong with it. In that situation “aws ec2 get-console-output” was totally useless with its succinct single-worded output – “Output”. Last time when I had a similar issue after another upgrade I at least was able to glean some helpful information with get-console-output to fix the problem. Not this time though.
So without further ado and armed with Tarsnap backups, I jumped on to DO’s bandwagon with ZFS and HTML5 console which, I hope, would be able to save me should I hit the same boot problem again. As an extra bonus, DO instance is a bit cheaper and beefier than the one I had in AWS. But as always… horses for courses.

Turned the page

After 5 exciting and tumultuous years in the enterprise IT as a Unix and SAN engineer it’s time to switch gears and taste something new. Of course, the alley I’m stepping into is not universally novel but for me personally it’s like an uncharted territory and something I could only dream about. The idea for a change had been ripening for quite a while so it was an effortless decision to say goodbye and move forward without turning back and holding no regrets.
Don’t want to paint with a broad brush but the enterprise IT is notoriously known for its conservatism, red-taping and being usually reluctant to any sort of changes. That’s ok for their business goals but, based solely on my personal experience, that could turn IT engineer into a bench sitter. Hopefully, at my new position I will be more exposed to the bleeding edge technologies, systems’ internals, programming and could become a better practitioner. Time will certainly tell but for now I’m emphatically waiting to face new challenges…

P.S.
Speaking of practitioners, if you haven’t seen/heard the latest Bryan Cantrill’s talk I highly encourage you to do so
A Wardrobe for the Emperor: Stitching Practical Bias into Systems Software Research

TIL Common Name (CN) is legacy and subjectAltName must always be used.

Seems I’ve been living under a rock for far too log. From RFC2818:

Although the use of the Common Name is existing practice, it is deprecated and Certification Authorities are encouraged to use the dNSName instead.

So in today’s world CN is only evaluated when subjectAltName is not present and if it’s set all host names, IPs, emails, etc. must be specified in subjectAltName.

As a bonus, below is a one-liner to generate CSR with subjectAltName:

openssl req -new -newkey rsa:2048 -keyout example.com.key -sha256 -nodes -days 36500 -out example.com.csr -subj "/C=US/ST=IL/L=Chicago/O=Fortune500/OU=IT/CN=example.com" -reqexts v3_req -config <(cat /etc/pki/tls/openssl.cnf <(printf "[ v3_req ]\nsubjectAltName = DNS:example.com,DNS:www.example.com"))

Pair “Listen queue overflow” FreeBSD errors with pcb

Just yesterday, after an upgrade to MySQL 5.7.12, saw plenty of errors were being logged in the system:

sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
sonewconn: pcb 0xfffff8006311c870: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)

There is a great post that explains how to find the culprit. In a nutshell, there are two quick options:

  1. Use “lsof -itcp -stcp:listen -P” and grep for pcb.
  2. Or since “the overflow happens when the queue is at about 150% capacity” (as mentioned in the original post), it’s possible to match the number from the error (151 in my case) with an output from “netstat -an -p tcp -L”.

In my case that was trivial as both Postfix and Dovecot complained about missing libmysqlclient.so.18 shared library which was replaced with libmysqlclient.so.20 after the upgrade. Rebuilding from ports and restarting both of them fixed the issue and no hassling with kern.ipc.somaxconn was needed.

Have stalled snmpd in recvfrom()? Check Recv-Q

Not so while ago I had an issue with a monitoring system that paged about SNMP checks failing on a number of servers. Quick checking here and there (logs, strace, tcpdump, etc.) revealed that snmpd had stalled in recvfrom() without sending a single packet out in response to the constant queries from our monitoring system. Everything seemed to be ok except “netstat -s” that showed a steady increase in “Udp: packet receive errors” counter. Summon ss to the rescue:

# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     262680 0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Matching 262680 with “sysctl net.core.rmem_default” suggested that the receiving buffers (Recv-Q) were filling up but why Taking a close look at the logs returned the following segfault:

cmanicd[55673]: segfault at 0 ip 00007f041e721081 sp 00007f040e16c700 error 4 in libnetsnmp.so.20.0.0[7f041e6a1000+a0000]

It turned out to be a well known issue with NIC Agent (CMANICD):
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04912220&sp4ts.oid=316583

So it looked to be our guy. Starting cmanicd back immediately solved the problem:

[root@slon02db12 ~]# ss -ianump \( sport = *:161 \)
State      Recv-Q Send-Q                                                                                       Local Address:Port                                                                                         Peer Address:Port
UNCONN     0      0                                                                                                        *:161                                                                                                     *:*      users:(("snmpd",52984,7))

Recv-Q was dropped to zero and a server became green in the monitoring dashboard. Bingo. Problem solved so now it’s time for the upgrade.

Btw, If you don’t know how to read Linux segfault message (I didn’t know that myself before this issue) then the following note could fix that:

Nov 27 15:26:19 machine kernel: fmg[6335]: segfault at 00000000ffffd2dc rip 00000000ffffd2dc rsp 00000000ffffd1bc error 15

What does the kernel message mean, in detail?

  • The rip value is the instruction pointer register value, the rsp is the stack pointer register value.
  • The error value is a bit mask of page fault error code bits (from arch/x86/mm/fault.c):
  • Raw
     *   bit 0 ==    0: no page found       1: protection fault
     *   bit 1 ==    0: read access         1: write access
     *   bit 2 ==    0: kernel-mode access  1: user-mode access
     *   bit 3 ==                           1: use of reserved bit detected
     *   bit 4 ==                           1: fault was an instruction fetch
  • Here’s error bit definition:
  • Raw
    enum x86_pf_error_code {
      PF_PROT   =       1 << 0,
      PF_WRITE  =       1 << 1,
      PF_USER   =       1 << 2,
      PF_RSVD   =       1 << 3,
      PF_INSTR  =       1 << 4,
    };

In my case error code was 4 which means cmanicd tried to access address zero from the user space which reeks a NULL pointer dereference.