TIL Remove a Znode from Zookeeper

Yep, you could easily achieve that (and much more) using zkCli.sh (Zookeeper client):

$ /usr/share/zookeeper/bin/zkCli.sh 
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: localhost:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
	connect host:port
	get path [watch]
	ls path [watch]
	set path data [version]
	rmr path
	delquota [-n|-b] path
	quit 
	printwatches on|off
	create [-s] [-e] path data acl
	stat path [watch]
	close 
	ls2 path [watch]
	history 
	listquota path
	setAcl path acl
	getAcl path
	sync path
	redo cmdno
	addauth scheme auth
	delete path [version]
	setquota -n|-b val path

Issue “rmr” (to remove recursively) or “delete” to remove a znode.

TIL HSTS requires a secure transport

Otherwise (quoting RFC6797):

If an HTTP response is received over insecure transport, the UA MUST ignore any present STS header field(s).

That means SSL certificate on your server must be valid, i.e. no errors or warnings when you open a page from a browser over https.

Restart your Mongos after maxConsecutiveFailedChecks

Take it literally.
If you configured your MongoDB config servers as a replica set and for some reason, say a network outage, Mongos server lost connection to all of them and is not able to reconnect during maxConsecutiveFailedChecks attempts then, surprise, it becomes useless. Even if the network is up and running again, Mongos will not reconnect to the config servers and you won’t be able to authenticate to your shard cluster until Mongos is restarted.

From https://api.mongodb.com/cplusplus/current/classmongo_1_1_replica_set_monitor.html

static int 	maxConsecutiveFailedChecks = 30
 	If a ReplicaSetMonitor has been refreshed more than this many times in a row without finding any live nodes claiming to be in the set, the ReplicaSetMonitorWatcher will stop periodic background refreshes of this set. 

And if you check the source code of 3.2.x (3.2.12 as of this writing) branch you will see the following (./src/mongo/client/replica_set_monitor.cpp):

if (_scan->foundAnyUpNodes) {
            _set->consecutiveFailedScans = 0;
        } else {
            _set->consecutiveFailedScans++;
            if (timeOutMonitoringReplicaSets) {
                warning() << "All nodes for set " << _set->name << " are down. "
                          << "This has happened for " << _set->consecutiveFailedScans
                          << " checks in a row. Polling will stop after "
                          << maxConsecutiveFailedChecks - _set->consecutiveFailedScans
                          << " more failed checks";
            }
        }

So once you go pass maxConsecutiveFailedChecks the replica set will become unusable:

bool SetState::isUsable() const {
    return consecutiveFailedScans < maxConsecutiveFailedChecks;
}

As far as I can't tell 3.4.x doesn't have maxConsecutiveFailedChecks and hopefully one will not have to intervene and restart Mongos manually.

Watch “Monitorama 2016: All of Your Networking Monitoring is (probably) wrong” talk

Just came across this talk being mentioned in the comments on Hacker news and, boy, it’s absolutely amazing!
Watch this hilarious talk here – Monitorama 2016: All of Your Networking Monitoring is (probably) wrong

Btw, the talk is presented, presumably, by the same guy who wrote Monitoring and Tuning the Linux Networking Stack: Receiving Data and Monitoring and Tuning the Linux Networking Stack: Sending Data which are both must-read.

TIL gethostbyname*() and gethostbyaddr*() functions are obsolete

It wasn’t obvious to me until I tried to run netcat utility (aka nc) on Ubuntu 10.04 (lucid) release to check Zookeeper’s status:

echo "stat" | nc zookeer_server_name 2181
zookeer_server_name: forward host lookup failed: No address associated with name

It wouldn’t have been a problem had Zookeeper server used IPv4 address but it was configured with IPv6. So tools that used gethostbyname2(), e.g. getent, were still ok, and only those with gethostbyname() were failing me. Luckily, netcat and other important libraries had newer versions I could use. Once again, if you are on an old and rusty Linux distro be aware that gethostbyname*() and gethostbyaddr*() functions are obsolete

Update
As Anton mentioned in his comment below, getaddrinfo() had its own gotchas, which, if I got it right, were caused by AI_ADDRCONFIG flag. There is a good summary page which goes in more details regarding AI_ADDRCONFIG and the peculiarities pertaining to its current implementation in glibc.

jbd2 is munching your disks? Use ftrace to find why.

Have you ever been wondering why jbd2 (or jbd if your are still using ext3) is sitting at the top of iotop and consuming the most of IO bandwidth? Well, it’s certainly not because it’s doing that just to drive you nuts but there is a reason. And the reason is most probably there is an app that is doing a lot of sys_fsync(), sys_fdatasync() or sys_msync().
In case your are not on the latest and greatest kernel and BPF is not available, there is an easy way to confirm that using ftrace.

Just enable tracing of ext4_sync_file_enter events:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable

And print out the output from trace or trace_pipe (refer to the documentation of ftrace for more information):

# cat /sys/kernel/debug/tracing/trace/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 16/16   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          mongod-2299  [001] ...1 661508.531446: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.543931: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661508.566134: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [003] ...1 661511.255926: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 
          mongod-2299  [000] ...1 661511.703643: ext4_sync_file_enter: dev 252,1 ino 267191 parent 267956 datasync 1 

Once you are done, just stop tracing the events and clear ftrace’s buffer:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable
# echo > /sys/kernel/debug/tracing/trace

So in my case jbd2 excessive activity was caused by MongoDB’s journaling which syncs data every 50ms (starting from version 3.2).

MongoDB – Defeating RangeDeleter

Not closing a cursor in MongoDB could hurt you big, so it’s generally not recommended to use no_cursor_timeout=True (pymongo3) or timeout=False (pymongo2). Especially when you run shared MongoDB installation:

PyMongo does “close” cursors when they are garbage collected, but they aren’t closed immediately and closing a cursor in all current versions of MongoDB is asynchronous. Depending on the python implementation, relying on garbage collection to close the cursor is not a great idea. Discarded, not fully iterated cursors can live for some time when using Jython or PyPy which do not do reference counting garbage collection. That’s why the Cursor object has an explicit close() method.

Not using close() method could potentially be the reason behind the following lines in the MongoDB’s log file:

SHARDING [RangeDeleter] waiting for open cursors before removing range

And even if it looks innocuous, it’s actually not quite, since what it means is that a source shard can’t delete its copy of the documents – Step 7 in chunk migration procedure.

At the time of this writing there is still no way (SERVER-3090) to glean more information that pertains to a cursor’s id, so the only way out that I was able to come up with was to kill those dangling cursors using an undocumented (as of this writing) killCursors command:

> use your_db_name
> db.runCommand({killCursors: 'your_collection_name', cursors: [NumberLong(51518759968), NumberLong(51484189302), NumberLong(51451409949), NumberLong(51434938438), NumberLong(51429435438), NumberLong(51383912702)]})
{
	"cursorsKilled" : [
		NumberLong("51518759968"),
		NumberLong("51484189302"),
		NumberLong("51451409949"),
		NumberLong("51434938438"),
		NumberLong("51429435438"),
		NumberLong("51383912702")
	],
	"cursorsNotFound" : [ ],
	"cursorsAlive" : [ ],
	"cursorsUnknown" : [ ],
	"ok" : 1
}

And as a reward look for the following message in the log file:

SHARDING [RangeDeleter] Deleter starting delete for

Another FreeBSD drop into the Digital Ocean

After several happy years with FreeBSD running in AWS I finally have switched to Digital Ocean. That happened a few days ago and was driven mainly by the lack of the console support which “aws ec2 get-console-output”, in my opinion, is certainly not.
After the upgrade to FreeBSD 11 I found my instance unreachable and had absolutely no clue what was wrong with it. In that situation “aws ec2 get-console-output” was totally useless with its succinct single-worded output – “Output”. Last time when I had a similar issue after another upgrade I at least was able to glean some helpful information with get-console-output to fix the problem. Not this time though.
So without further ado and armed with Tarsnap backups, I jumped on to DO’s bandwagon with ZFS and HTML5 console which, I hope, would be able to save me should I hit the same boot problem again. As an extra bonus, DO instance is a bit cheaper and beefier than the one I had in AWS. But as always… horses for courses.

Turned the page

After 5 exciting and tumultuous years in the enterprise IT as a Unix and SAN engineer it’s time to switch gears and taste something new. Of course, the alley I’m stepping into is not universally novel but for me personally it’s like an uncharted territory and something I could only dream about. The idea for a change had been ripening for quite a while so it was an effortless decision to say goodbye and move forward without turning back and holding no regrets.
Don’t want to paint with a broad brush but the enterprise IT is notoriously known for its conservatism, red-taping and being usually reluctant to any sort of changes. That’s ok for their business goals but, based solely on my personal experience, that could turn IT engineer into a bench sitter. Hopefully, at my new position I will be more exposed to the bleeding edge technologies, systems’ internals, programming and could become a better practitioner. Time will certainly tell but for now I’m emphatically waiting to face new challenges…

P.S.
Speaking of practitioners, if you haven’t seen/heard the latest Bryan Cantrill’s talk I highly encourage you to do so
A Wardrobe for the Emperor: Stitching Practical Bias into Systems Software Research

TIL Common Name (CN) is legacy and subjectAltName must always be used.

Seems I’ve been living under a rock for far too log. From RFC2818:

Although the use of the Common Name is existing practice, it is deprecated and Certification Authorities are encouraged to use the dNSName instead.

So in today’s world CN is only evaluated when subjectAltName is not present and if it’s set all host names, IPs, emails, etc. must be specified in subjectAltName.

As a bonus, below is a one-liner to generate CSR with subjectAltName:

openssl req -new -newkey rsa:2048 -keyout example.com.key -sha256 -nodes -days 36500 -out example.com.csr -subj "/C=US/ST=IL/L=Chicago/O=Fortune500/OU=IT/CN=example.com" -reqexts v3_req -config <(cat /etc/pki/tls/openssl.cnf <(printf "[ v3_req ]\nsubjectAltName = DNS:example.com,DNS:www.example.com"))