The title is too short to hold it; the full error message is:
Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.
The setup was a three-master, three-replica cluster. A master and its replica were taken down at the same time, putting the cluster into the fail state. After the replica was restarted, it did not promote itself to master as one might expect, and the cluster stayed in the fail state. This is expected behavior rather than a bug with a fix; this article analyzes under which conditions a replica cannot take over from its master.
1:S 18 Aug 2021 17:00:07.622 # Error condition on socket for SYNC: Connection refused
1:S 18 Aug 2021 17:00:08.626 * Connecting to MASTER 127.0.0.1:8518
1:S 18 Aug 2021 17:00:08.626 * MASTER <-> REPLICA sync started
1:S 18 Aug 2021 17:00:08.626 # Error condition on socket for SYNC: Connection refused
1:S 18 Aug 2021 17:00:09.528 # Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.
1:S 18 Aug 2021 17:00:09.628 * Connecting to MASTER 127.0.0.1:8518
1:S 18 Aug 2021 17:00:09.628 * MASTER <-> REPLICA sync started
1:S 18 Aug 2021 17:00:09.628 # Error condition on socket for SYNC: Connection refused
1:S 18 Aug 2021 17:00:10.631 * Connecting to MASTER 127.0.0.1:8518
1:S 18 Aug 2021 17:00:10.631 * MASTER <-> REPLICA sync started
1:S 18 Aug 2021 17:00:10.631 # Error condition on socket for SYNC: Connection refused
1:S 18 Aug 2021 17:00:11.636 * Connecting to MASTER 127.0.0.1:8518
1:S 18 Aug 2021 17:00:11.636 * MASTER <-> REPLICA sync started
1:S 18 Aug 2021 17:00:11.636 # Error condition on socket for SYNC: Connection refused
1:S 18 Aug 2021 17:00:12.640 * Connecting to MASTER 127.0.0.1:8518
The source-code analysis below shows that once a replica has been disconnected from its master for longer than a computed limit (160 s with the default configuration, as worked out below), it will no longer promote itself automatically; a manual failover is then required.
The replication link between replica and master must not have been down longer than a configurable limit. The point is to guarantee that the replica's data is sufficiently complete, so operationally you should not let a replica stay unavailable for a long time; monitoring should detect and restore abnormal replicas promptly.
Computing the failover timeout
The timeout is max(cluster_node_timeout*2, 2000) milliseconds.
The default cluster-node-timeout is 15000 ms, so by default auth_timeout is 15000*2 = 30 s and auth_retry_time is 60 s.
/* Compute the failover timeout (the max time we have to send votes
* and wait for replies), and the failover retry time (the time to wait
* before trying to get voted again).
*
* Timeout is MAX(NODE_TIMEOUT*2,2000) milliseconds.
* Retry is two times the Timeout.
*/
auth_timeout = server.cluster_node_timeout*2;
if (auth_timeout < 2000) auth_timeout = 2000;
auth_retry_time = auth_timeout*2;
Failover is not attempted when any of the following holds (each is the negation of a precondition listed in the source comment below):
1) The current node is a master, or has no master.
2) The master is not flagged as FAIL and this is not a manual failover.
3) The no-failover configuration is set (cluster-slave-no-failover yes) and this is not a manual failover.
4) The master is serving no slots.
/* Pre conditions to run the function, that must be met both in case
* of an automatic or manual failover:
* 1) We are a slave.
* 2) Our master is flagged as FAIL, or this is a manual failover.
* 3) We don't have the no failover configuration set, and this is
* not a manual failover.
* 4) It is serving slots. */
if (nodeIsMaster(myself) ||
myself->slaveof == NULL ||
(!nodeFailed(myself->slaveof) && !manual_failover) ||
(server.cluster_slave_no_failover && !manual_failover) ||
myself->slaveof->numslots == 0)
{
/* There are no reasons to failover, so we set the reason why we
* are returning without failing over to NONE. */
server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
return;
}
// How long we have been disconnected from the master, converted to milliseconds
/* Set data_age to the number of seconds we are disconnected from
* the master. */
if (server.repl_state == REPL_STATE_CONNECTED) {
data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
* 1000;
} else {
data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
}
// Subtract the cluster node timeout from data_age
/* Remove the node timeout from the data age as it is fine that we are
* disconnected from our master at least for the time it was down to be
* flagged as FAIL, that's the baseline. */
if (data_age > server.cluster_node_timeout)
data_age -= server.cluster_node_timeout;
Next, the code checks whether the replica's data is recent enough according to the user-configured cluster_slave_validity_factor; the check is bypassed for manual failovers.
Default parameter values:
cluster-node-timeout 15000
cluster-replica-validity-factor 10
repl-ping-replica-period 10
repl-timeout 60
Automatic failover is refused when all of the following hold:
1. cluster_slave_validity_factor is non-zero
2. data_age > 10*1000 + 15000*10 = 160000 ms = 160 s (with the defaults above)
3. this is not a manual failover
/* Check if our data is recent enough according to the slave validity
* factor configured by the user.
* Check bypassed for manual failovers. */
if (server.cluster_slave_validity_factor &&
data_age >
(((mstime_t)server.repl_ping_slave_period * 1000) +
(server.cluster_node_timeout * server.cluster_slave_validity_factor)))
{
if (!manual_failover) {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
return;
}
}
Once none of the blocking cases above applies, the code below defines how the failover is actually scheduled.
If the previous failover attempt timed out and the retry time has elapsed, a new attempt time can be set.
A replica starts an election only when its master is in the FAIL state, and not immediately: the start is delayed by the following random amount (at least 0.5 s) so that multiple replicas do not start elections at the same time:
500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds
/* If the previous failover attempt timedout and the retry time has
* elapsed, we can setup a new one. */
if (auth_age > auth_retry_time) {
server.cluster->failover_auth_time = mstime() +
500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
random() % 500; /* Random delay between 0 and 500 milliseconds. */
server.cluster->failover_auth_count = 0;
server.cluster->failover_auth_sent = 0;
server.cluster->failover_auth_rank = clusterGetSlaveRank();
/* We add another delay that is proportional to the slave rank.
* Specifically 1 second * rank. This way slaves that have a probably
* less updated replication offset, are penalized. */
server.cluster->failover_auth_time +=
server.cluster->failover_auth_rank * 1000;
/* However if this is a manual failover, no delay is needed. */
if (server.cluster->mf_end) {
server.cluster->failover_auth_time = mstime();
server.cluster->failover_auth_rank = 0;
clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
}
serverLog(LL_WARNING,
"Start of election delayed for %lld milliseconds "
"(rank #%d, offset %lld).",
server.cluster->failover_auth_time - mstime(),
server.cluster->failover_auth_rank,
replicationGetSlaveOffset());
/* Now that we have a scheduled election, broadcast our offset
* to all the other slaves so that they'll updated their offsets
* if our offset is better. */
clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
return;
}