Restart your Mongos after maxConsecutiveFailedChecks

Take it literally.
If you configured your MongoDB config servers as a replica set and for some reason, say a network outage, Mongos server lost connection to all of them and is not able to reconnect during maxConsecutiveFailedChecks attempts then, surprise, it becomes useless. Even if the network is up and running again, Mongos will not reconnect to the config servers and you won’t be able to authenticate to your shard cluster until Mongos is restarted.

From https://api.mongodb.com/cplusplus/current/classmongo_1_1_replica_set_monitor.html

static int 	maxConsecutiveFailedChecks = 30
 	If a ReplicaSetMonitor has been refreshed more than this many times in a row without finding any live nodes claiming to be in the set, the ReplicaSetMonitorWatcher will stop periodic background refreshes of this set. 

And if you check the source code of 3.2.x (3.2.12 as of this writing) branch you will see the following (./src/mongo/client/replica_set_monitor.cpp):

if (_scan->foundAnyUpNodes) {
            _set->consecutiveFailedScans = 0;
        } else {
            _set->consecutiveFailedScans++;
            if (timeOutMonitoringReplicaSets) {
                warning() << "All nodes for set " << _set->name << " are down. "
                          << "This has happened for " << _set->consecutiveFailedScans
                          << " checks in a row. Polling will stop after "
                          << maxConsecutiveFailedChecks - _set->consecutiveFailedScans
                          << " more failed checks";
            }
        }

So once you go pass maxConsecutiveFailedChecks the replica set will become unusable:

bool SetState::isUsable() const {
    return consecutiveFailedScans < maxConsecutiveFailedChecks;
}

As far as I can't tell 3.4.x doesn't have maxConsecutiveFailedChecks and hopefully one will not have to intervene and restart Mongos manually.

Posted on February 9, 2017 at 5:03 pm by sergeyt · Permalink
In: MongoDB

Leave a Reply