Redis split-brain

2019-12-27  wenfh2020


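Because of network problems, cluster nodes lose contact with each other and the master and its replicas fall out of sync. A failover election then promotes a new master, so two masters run at the same time and their data diverges. For details, see 《redis 脑裂等极端情况分析》 (an analysis of Redis split-brain and other extreme cases).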

A relatively simple mitigation is to set the following options in the Redis configuration:

# The master must have at least 3 replicas connected
min-slaves-to-write 3
# The replication lag of those replicas must not exceed 10 seconds
min-slaves-max-lag 10
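To make the effect of these two options concrete, here is a minimal sketch of the write gate they imply. It models only a master, and `master_state` and `write_allowed` are hypothetical names for illustration, not Redis identifiers. In real Redis the check lives in the command-processing path and, as far as I know, the client gets a NOREPLICAS error when the write is refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant fields of a master's state. */
typedef struct {
    int min_slaves_to_write; /* min-slaves-to-write, 0 disables the check */
    int min_slaves_max_lag;  /* min-slaves-max-lag in seconds, 0 disables the check */
    int good_slaves_count;   /* replicas that are online and within the lag limit */
} master_state;

/* Returns true if the master may accept a write command. */
bool write_allowed(const master_state *m) {
    assert(m != NULL);
    /* Setting either option to 0 turns the feature off. */
    if (m->min_slaves_to_write == 0 || m->min_slaves_max_lag == 0) return true;
    /* Otherwise enough "good" replicas must currently be attached. */
    return m->good_slaves_count >= m->min_slaves_to_write;
}
```

With the settings above, a master that is cut off from its replicas sees its good-replica count drop below 3 once their last acknowledgement is more than 10 seconds old, and from then on it refuses writes. That bounds how much data the isolated master can accept while a new master is being elected on the other side of the partition.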

Relevant redis.conf documentation

# It is possible for a master to stop accepting writes if there are less than
# N slaves connected, having a lag less or equal than M seconds.
# The N slaves need to be in "online" state.
# The lag in seconds, that must be <= the specified value, is calculated from
# the last ping received from the slave, that is usually sent every second.
# This option does not GUARANTEE that N replicas will accept the write, but
# will limit the window of exposure for lost writes in case not enough slaves
# are available, to the specified number of seconds.
# For example to require at least 3 slaves with a lag <= 10 seconds use:
# min-slaves-to-write 3
# min-slaves-max-lag 10
# Setting one or the other to 0 disables the feature.
# By default min-slaves-to-write is set to 0 (feature disabled) and
# min-slaves-max-lag is set to 10.


#define run_with_period(_ms_) if ((_ms_ <= 1000/server.hz) || !(server.cronloops%((_ms_)/(1000/server.hz))))

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    /* ... */
    run_with_period(1000) replicationCron();
    /* ... */
}
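The `run_with_period` condition is easier to follow as a plain function. The sketch below restates it under that assumption; `period_elapsed` is a hypothetical name, `hz` stands for `server.hz` and `cronloops` for `server.cronloops`.

```c
#include <assert.h>
#include <stdbool.h>

/* Same condition as run_with_period, written as a function.
 * hz:        how many times per second serverCron is called
 * cronloops: how many times serverCron has been called so far */
bool period_elapsed(int ms, int hz, long cronloops) {
    assert(ms > 0 && hz > 0 && hz <= 1000);
    int tick_ms = 1000 / hz;                 /* interval between serverCron calls */
    if (ms <= tick_ms) return true;          /* period no longer than one tick: run on every call */
    return cronloops % (ms / tick_ms) == 0;  /* otherwise run once every (ms / tick_ms) calls */
}
```

Assuming `hz` is 10 (the redis.conf default), serverCron runs every 100 ms, so `run_with_period(1000)` fires on every 10th call and `replicationCron()` is invoked about once per second.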

/* Replication cron function, called 1 time per second. */
void replicationCron(void) {
    /* ... */
    /* Refresh the number of slaves with lag <= min-slaves-max-lag. */
    refreshGoodSlavesCount();
    /* ... */
}

/* This function counts the number of slaves with lag <= min-slaves-max-lag.
 * If the option is active, the server will prevent writes if there are not
 * enough connected slaves with the specified lag (or less). */
// Refresh the count of replicas whose lag is within min-slaves-max-lag
void refreshGoodSlavesCount(void) {
    listIter li;
    listNode *ln;
    int good = 0;

    // Return early if the feature is not configured
    if (!server.repl_min_slaves_to_write ||
        !server.repl_min_slaves_max_lag) return;

    // Iterate over all replica clients
    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;
        // Compute the lag since the replica's last acknowledgement
        time_t lag = server.unixtime - slave->repl_ack_time;

        // Count replicas that are online and within the lag limit
        if (slave->replstate == SLAVE_STATE_ONLINE &&
            lag <= server.repl_min_slaves_max_lag) good++;
    }
    server.repl_good_slaves_count = good;
}
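The counting logic can be tried outside Redis with a small standalone sketch. The types and names here (`replica`, `replica_state`, `count_good_replicas`) are hypothetical simplifications: a plain array replaces the `server.slaves` list, and `last_ack` plays the role of `repl_ack_time`.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified replica states; Redis has more of them. */
typedef enum { REPLICA_WAIT_BGSAVE, REPLICA_ONLINE } replica_state;

typedef struct {
    replica_state state;
    time_t last_ack; /* time the master last heard from this replica */
} replica;

/* Counts replicas that are online and whose lag is at most max_lag seconds. */
int count_good_replicas(const replica *r, size_t n, time_t now, time_t max_lag) {
    assert(r != NULL || n == 0);
    int good = 0;
    for (size_t i = 0; i < n; i++) {
        time_t lag = now - r[i].last_ack;
        if (r[i].state == REPLICA_ONLINE && lag <= max_lag) good++;
    }
    return good;
}
```

For example, at time 100 with a 10 second limit, online replicas last heard from at 100 and 90 are counted, while an online replica last heard from at 85 (lag 15) and a replica that is not yet online are skipped.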


