Restart your Mongos after maxConsecutiveFailedChecks

Take it literally.
If you configured your MongoDB config servers as a replica set and for some reason, say a network outage, Mongos server lost connection to all of them and is not able to reconnect during maxConsecutiveFailedChecks attempts then, surprise, it becomes useless. Even if the network is up and running again, Mongos will not reconnect to the config servers and you won’t be able to authenticate to your shard cluster until Mongos is restarted.

From https://api.mongodb.com/cplusplus/current/classmongo_1_1_replica_set_monitor.html

static int 	maxConsecutiveFailedChecks = 30
 	If a ReplicaSetMonitor has been refreshed more than this many times in a row without finding any live nodes claiming to be in the set, the ReplicaSetMonitorWatcher will stop periodic background refreshes of this set. 

And if you check the source code of 3.2.x (3.2.12 as of this writing) branch you will see the following (./src/mongo/client/replica_set_monitor.cpp):

if (_scan->foundAnyUpNodes) {
            _set->consecutiveFailedScans = 0;
        } else {
            _set->consecutiveFailedScans++;
            if (timeOutMonitoringReplicaSets) {
                warning() << "All nodes for set " << _set->name << " are down. "
                          << "This has happened for " << _set->consecutiveFailedScans
                          << " checks in a row. Polling will stop after "
                          << maxConsecutiveFailedChecks - _set->consecutiveFailedScans
                          << " more failed checks";
            }
        }

So once you go pass maxConsecutiveFailedChecks the replica set will become unusable:

bool SetState::isUsable() const {
    return consecutiveFailedScans < maxConsecutiveFailedChecks;
}

As far as I can't tell 3.4.x doesn't have maxConsecutiveFailedChecks and hopefully one will not have to intervene and restart Mongos manually.

Posted on February 9, 2017 at 5:03 pm by sergeyt · Permalink · Leave a comment
In: MongoDB

Watch “Monitorama 2016: All of Your Networking Monitoring is (probably) wrong” talk

Just came across this talk being mentioned in the comments on Hacker news and, boy, it’s absolutely amazing!
Watch this hilarious talk here – Monitorama 2016: All of Your Networking Monitoring is (probably) wrong

Btw, the talk is presented, presumably, by the same guy who wrote Monitoring and Tuning the Linux Networking Stack: Receiving Data and Monitoring and Tuning the Linux Networking Stack: Sending Data which are both must-read.

Posted on February 8, 2017 at 11:55 am by sergeyt · Permalink · Leave a comment
In: Linux

TIL gethostbyname*() and gethostbyaddr*() functions are obsolete

It wasn’t obvious to me until I tried to run netcat utility (aka nc) on Ubuntu 10.04 (lucid) release to check Zookeeper’s status:

echo "stat" | nc zookeer_server_name 2181
zookeer_server_name: forward host lookup failed: No address associated with name

It wouldn’t have been a problem had Zookeeper server used IPv4 address but it was configured with IPv6. So tools that used gethostbyname2(), e.g. getent, were still ok, and only those with gethostbyname() were failing me. Luckily, netcat and other important libraries had newer versions I could use. Once again, if you are on an old and rusty Linux distro be aware that gethostbyname*() and gethostbyaddr*() functions are obsolete

Update
As Anton mentioned in his comment below, getaddrinfo() had its own gotchas, which, if I got it right, were caused by AI_ADDRCONFIG flag. There is a good summary page which goes in more details regarding AI_ADDRCONFIG and the peculiarities pertaining to its current implementation in glibc.

Posted on January 28, 2017 at 11:54 am by sergeyt · Permalink · 2 Comments
In: Linux, TIL

jbd2 is munching your disks? Use ftrace to find why.

Have you ever been wondering why jbd2 (or jbd if your are still using ext3) is sitting at the top of iotop and consuming the most of IO bandwidth? Well, it’s certainly not because it’s doing that just to drive you nuts but there is a reason. And the reason is most probably there is an app that is doing a lot of sys_fsync(), sys_fdatasync() or sys_msync().
In case your are not on the latest and greatest kernel and BPF is not available, there is an easy way to confirm that using ftrace.

Just enable tracing of ext4_sync_file_enter events:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable

And print out the output from trace or trace_pipe (refer to the documentation of ftrace for more information):

# cat /sys/kernel/debug/tracing/trace/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 16/16   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          mongod-2299  [001] ...1 661508.531446: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.543931: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.566134: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661511.255926: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [000] ...1 661511.703643: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 

Once you are done, just stop tracing the events and clear ftrace’s buffer:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable
# echo > /sys/kernel/debug/tracing/trace

So in my case jbd2 excessive activity was caused by MongoDB’s journaling which syncs data every 50ms (starting from version 3.2).

Posted on January 27, 2017 at 11:51 am by sergeyt · Permalink · Leave a comment
In: Linux, MongoDB

MongoDB – Defeating RangeDeleter

Not closing a cursor in MongoDB could hurt you big, so it’s generally not recommended to use no_cursor_timeout=True (pymongo3) or timeout=False (pymongo2). Especially when you run shared MongoDB installation:

PyMongo does “close” cursors when they are garbage collected, but they aren’t closed immediately and closing a cursor in all current versions of MongoDB is asynchronous. Depending on the python implementation, relying on garbage collection to close the cursor is not a great idea. Discarded, not fully iterated cursors can live for some time when using Jython or PyPy which do not do reference counting garbage collection. That’s why the Cursor object has an explicit close() method.

Not using close() method could potentially be the reason behind the following lines in the MongoDB’s log file:

SHARDING [RangeDeleter] waiting for open cursors before removing range

And even if it looks innocuous, it’s actually not quite, since what it means is that a source shard can’t delete its copy of the documents – Step 7 in chunk migration procedure.

At the time of this writing there is still no way (SERVER-3090) to glean more information that pertains to a cursor’s id, so the only way out that I was able to come up with was to kill those dangling cursors using an undocumented (as of this writing) killCursors command:

> use your_db_name
> db.runCommand({killCursors: 'your_collection_name', cursors: [NumberLong(51518759968), NumberLong(51484189302), NumberLong(51451409949), NumberLong(51434938438), NumberLong(51429435438), NumberLong(51383912702)]})
{
	"cursorsKilled" : [
		NumberLong("51518759968"),
		NumberLong("51484189302"),
		NumberLong("51451409949"),
		NumberLong("51434938438"),
		NumberLong("51429435438"),
		NumberLong("51383912702")
	],
	"cursorsNotFound" : [ ],
	"cursorsAlive" : [ ],
	"cursorsUnknown" : [ ],
	"ok" : 1
}

And as a reward look for the following message in the log file:

SHARDING [RangeDeleter] Deleter starting delete for

Posted on January 19, 2017 at 7:47 pm by sergeyt · Permalink · Leave a comment
In: MongoDB

Another FreeBSD drop into the Digital Ocean

After several happy years with FreeBSD running in AWS I finally have switched to Digital Ocean. That happened a few days ago and was driven mainly by the lack of the console support which “aws ec2 get-console-output”, in my opinion, is certainly not.
After the upgrade to FreeBSD 11 I found my instance unreachable and had absolutely no clue what was wrong with it. In that situation “aws ec2 get-console-output” was totally useless with its succinct single-worded output – “Output”. Last time when I had a similar issue after another upgrade I at least was able to glean some helpful information with get-console-output to fix the problem. Not this time though.
So without further ado and armed with Tarsnap backups, I jumped on to DO’s bandwagon with ZFS and HTML5 console which, I hope, would be able to save me should I hit the same boot problem again. As an extra bonus, DO instance is a bit cheaper and beefier than the one I had in AWS. But as always… horses for courses.

Posted on November 9, 2016 at 7:30 pm by sergeyt · Permalink · 2 Comments
In: FreeBSD, Life

Turned the page

After 5 exciting and tumultuous years in the enterprise IT as a Unix and SAN engineer it’s time to switch gears and taste something new. Of course, the alley I’m stepping into is not universally novel but for me personally it’s like an uncharted territory and something I could only dream about. The idea for a change had been ripening for quite a while so it was an effortless decision to say goodbye and move forward without turning back and holding no regrets.
Don’t want to paint with a broad brush but the enterprise IT is notoriously known for its conservatism, red-taping and being usually reluctant to any sort of changes. That’s ok for their business goals but, based solely on my personal experience, that could turn IT engineer into a bench sitter. Hopefully, at my new position I will be more exposed to the bleeding edge technologies, systems’ internals, programming and could become a better practitioner. Time will certainly tell but for now I’m emphatically waiting to face new challenges…

P.S.
Speaking of practitioners, if you haven’t seen/heard the latest Bryan Cantrill’s talk I highly encourage you to do so
A Wardrobe for the Emperor: Stitching Practical Bias into Systems Software Research

Posted on August 13, 2016 at 11:28 am by sergeyt · Permalink · 4 Comments
In: Life

TIL Common Name (CN) is legacy and subjectAltName must always be used.

Seems I’ve been living under a rock for far too log. From RFC2818:

Although the use of the Common Name is existing practice, it is deprecated and Certification Authorities are encouraged to use the dNSName instead.

So in today’s world CN is only evaluated when subjectAltName is not present and if it’s set all host names, IPs, emails, etc. must be specified in subjectAltName.

As a bonus, below is a one-liner to generate CSR with subjectAltName:

openssl req -new -newkey rsa:2048 -keyout example.com.key -sha256 -nodes -days 36500 -out example.com.csr -subj "/C=US/ST=IL/L=Chicago/O=Fortune500/OU=IT/CN=example.com" -reqexts v3_req -config <(cat /etc/pki/tls/openssl.cnf <(printf "[ v3_req ]\nsubjectAltName = DNS:example.com,DNS:www.example.com"))
Posted on July 6, 2016 at 2:52 pm by sergeyt · Permalink · Leave a comment
In: TIL

Calculating percentile in Python

If you need to find the percentile in Python do that correctly. Which means is by following one of the following receipts:

Posted on June 29, 2016 at 11:13 pm by sergeyt · Permalink · Leave a comment
In: TIL

Lecture on OpenZFS read and write code paths

If you are interested in ZFS this is the absolute must see video (which is actually a lecture) from Matt Ahrens (one of the two original ZFS creators):
Matt Ahrens – Lecture on OpenZFS read and write code paths

Posted on June 19, 2016 at 6:02 pm by sergeyt · Permalink · Leave a comment
In: ZFS