
018. Redis Cluster Failover Principles

2020-03-25  CoderJed

1. Failure Detection

When a node in the cluster runs into trouble, the cluster needs a robust way to recognize whether that node has actually failed. Nodes in a Redis Cluster communicate through ping/pong messages, which carry not only slot information but also other state such as master/slave roles and node failures. Failure detection is therefore built on the same message-propagation mechanism, and it consists of two main stages: subjective offline (PFAIL, possibly failed) and objective offline (FAIL).

One node believing that another node is unreachable does not mean every node believes so. The cluster therefore has to go through a round of agreement: only when a majority of nodes consider a node unreachable does the cluster decide that a master/slave switchover is needed to tolerate the fault. Redis Cluster nodes use the Gossip protocol to broadcast their own state and changes in their view of the cluster. For example, when a node finds that another node is unreachable (PFAIL), it spreads this information across the cluster so that the other nodes receive it too. When a node sees that the number of PFAIL reports for some node has reached a majority of the cluster's master nodes, it marks that node as definitively offline (FAIL), broadcasts this to the whole cluster so that every other node accepts the fact that the node is down, and the master/slave switchover for the failed node starts immediately.

1.1 Subjective Offline (PFAIL)

Every node in the cluster periodically sends ping messages to the other nodes, and the receiving node replies with a pong. If communication keeps failing for longer than cluster-node-timeout, the sending node concludes that the receiving node is faulty and marks it as subjectively offline (PFAIL).

Put simply, subjective offline means that when a node cannot complete ping communication with another node within cluster-node-timeout, it marks that node as subjectively offline.
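
For reference, this timeout is set per node in redis.conf; a minimal sketch with the Redis default of 15000 ms (tune it to your own deployment):

# redis.conf
cluster-enabled yes
# A peer that stays unreachable for longer than this many milliseconds
# is marked as subjectively offline (PFAIL) by the node doing the pinging
cluster-node-timeout 15000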

1.2 Objective Offline (FAIL)

Redis Cluster is very careful about the final decision that a node has failed; a single node's subjective-offline judgment is not enough to conclude that the node is really down. After a node marks another node as subjectively offline, that state travels with the Gossip messages through the cluster, and each node keeps collecting offline reports about the suspected node. When more than half of the slot-holding masters have marked a node as subjectively offline, the objective-offline process is triggered.

Why must the masters responsible for slots take part in the failure-detection decision?

Because in cluster mode only the masters that serve slots handle read/write requests and maintain key information such as the slot mapping; the slaves merely replicate their master's data and state.

Why more than half of the slot-serving masters?

Requiring more than half guards against the cluster being split by a network partition or a similar failure: the smaller partition cannot complete the crucial step from subjective offline to objective offline, which prevents it from finishing a failover and then continuing to serve clients on its own.

The objective-offline process works roughly as follows:
(1) When a node receives a Gossip message in which another slot-holding master reports some node as subjectively offline, it records an offline report for that node.
(2) It then checks whether more than half of the slot-holding masters have reported that node as subjectively offline.
(3) If they have, it marks the node as objectively offline (FAIL) and broadcasts a fail message to the whole cluster.
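
Both states can be observed in the flags column of CLUSTER NODES (a sketch; the address is taken from the drill in section 4 and the output is not reproduced here):

[root@node01 redis]# bin/redis-cli -h 10.0.0.100 -p 6380 cluster nodes
# In the flags column, a node that only this node suspects shows "fail?"
# (subjectively offline); once enough slot-holding masters agree it shows "fail".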

Note:

If offline reports from more than half of the slot-holding masters cannot be collected within cluster-node-timeout * 2, the earlier reports expire. In other words, if subjective-offline reports arrive more slowly than they expire, the failed node can never be marked as objectively offline and the failover fails. For this reason, cluster-node-timeout should not be set too small.

Broadcasting the fail message is the last step of the objective-offline process, and it serves two important purposes: it tells every node in the cluster to mark the failed node as objectively offline, taking effect immediately, and it tells the slaves of the failed node to start the failover process.

Note, however, that even with the fail broadcast there is no guarantee that every node learns that the failed node has been marked objectively offline. For example, a network partition may split the cluster into one large and one small partition. The large partition, which holds more than half of the slot-serving masters, can complete the objective-offline decision and broadcast the fail message, but the small partition never receives it. If all of the failed master's slaves happen to sit in the small partition, the subsequent failover cannot be completed. When planning master/slave placement, therefore, take your data-center and rack topology into account to reduce the chance that a master's slaves end up isolated on the wrong side of a partition.
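
If a master and all of its slaves do end up in the same failure domain, a slave can be re-pointed at a master elsewhere with the CLUSTER REPLICATE command (a sketch; the address and the placeholder node ID are illustrative):

# run on the slave that should follow a master in another rack/data center
[root@node02 redis]# bin/redis-cli -h 10.0.0.101 -p 6380 cluster replicate <target-master-node-id>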

2. Failure Recovery

Once a failed node has been marked objectively offline, if it is a master holding slots, one of its slaves has to be promoted in its place so that the cluster stays highly available. All slaves of the offline master share this recovery duty: when a slave's internal timer task finds that the master it replicates has entered the objectively-offline state, it triggers the failure-recovery process.
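
Whether a slave is even eligible to start the election depends on how stale its replication link is; this is governed by a redis.conf setting (a sketch with its default value; from Redis 5 on the directive is also called cluster-replica-validity-factor):

# redis.conf
# A slave does not attempt a failover if its last interaction with the master
# is older than roughly:
#   (cluster-node-timeout * cluster-slave-validity-factor) + repl-ping-slave-period
# Setting the factor to 0 means slaves always consider themselves eligible.
cluster-slave-validity-factor 10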

3. Failover Time
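
Putting the stages above together gives a rough upper bound on the total failover time (a back-of-the-envelope estimate, assuming the default cluster-node-timeout of 15000 ms):

# subjective offline detection       <= cluster-node-timeout
# propagation of the pfail reports   <= cluster-node-timeout / 2
# slave election and promotion       <= roughly 1000 ms
# failover-time <= cluster-node-timeout + cluster-node-timeout / 2 + 1000 ms
# with the default: 15000 + 7500 + 1000 = 23500 ms, i.e. about 23.5 seconds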

4. Failover Drill

First, list the Redis processes on node01 and kill the master listening on 10.0.0.100:6379:

[root@node01 redis]# ps -ef | grep redis
root       3423      1  0 11:38 ?        00:01:06 bin/redis-server 10.0.0.100:6379 [cluster]
root       3428      1  0 11:38 ?        00:01:05 bin/redis-server 10.0.0.100:6380 [cluster]
root       3840   3004  0 17:09 pts/0    00:00:00 grep --color=auto redis
[root@node01 redis]# kill -9 3423

The slave of the killed master (node03, port 6380, pid 2654) notices the broken link and keeps failing to reconnect:

[root@node03 redis]# cat /var/log/redis/redis_6380.log

2654:S 25 Mar 17:10:29.783 # Connection with master lost.
2654:S 25 Mar 17:10:29.784 * Caching the disconnected master state.
2654:S 25 Mar 17:10:29.784 * Connecting to MASTER 10.0.0.100:6379
2654:S 25 Mar 17:10:29.784 * MASTER <-> SLAVE sync started
2654:S 25 Mar 17:10:29.785 # Error condition on socket for SYNC: Connection refused

The remaining masters reach quorum and mark the dead node as failing, and the slave starts a failover election and collects their votes:

[root@node02 redis]# cat /var/log/redis/redis_6379.log
2876:M 25 Mar 17:10:45.391 * Marking node 9c02aef2d45e44678202721ac923c615dd8300ea as failing (quorum reached).

[root@node03 redis]# cat /var/log/redis/redis_6379.log
2649:M 25 Mar 17:10:45.411 * Marking node 9c02aef2d45e44678202721ac923c615dd8300ea as failing (quorum reached).
2654:S 25 Mar 17:10:45.415 # Cluster state changed: fail
2654:S 25 Mar 17:10:45.510 # Start of election delayed for 724 milliseconds (rank #0, offset 21930).
2654:S 25 Mar 17:10:46.322 # Starting a failover election for epoch 7.
2649:M 25 Mar 17:10:46.327 # Failover auth granted to 0955dc1eeeec59c1e9b72eca5bcbcd04af108820 for epoch 7
2876:M 25 Mar 17:10:46.310 # Failover auth granted to 0955dc1eeeec59c1e9b72eca5bcbcd04af108820 for epoch 7

Restarting the old master on node01 shows it rejoining the cluster as a slave of the new master, after which every node clears the FAIL state:

[root@node01 redis]# bin/redis-server conf/redis_6379.conf
3873:M 25 Mar 17:24:32.823 * Node configuration loaded, I'm 9c02aef2d45e44678202721ac923c615dd8300ea
3873:M 25 Mar 17:24:32.825 # Configuration change detected. Reconfiguring myself as a replica of 0955dc1eeeec59c1e9b72eca5bcbcd04af108820
3873:S 25 Mar 17:24:32.825 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
3428:S 25 Mar 17:24:32.830 * Clear FAIL state for node 9c02aef2d45e44678202721ac923c615dd8300ea: master without slots is reachable again.
2876:M 25 Mar 17:24:32.914 * Clear FAIL state for node 9c02aef2d45e44678202721ac923c615dd8300ea: master without slots is reachable again.
2881:S 25 Mar 17:24:32.916 * Clear FAIL state for node 9c02aef2d45e44678202721ac923c615dd8300ea: master without slots is reachable again.
2654:M 25 Mar 17:24:32.853 * Clear FAIL state for node 9c02aef2d45e44678202721ac923c615dd8300ea: master without slots is reachable again.
2649:M 25 Mar 17:24:32.854 * Clear FAIL state for node 9c02aef2d45e44678202721ac923c615dd8300ea: master without slots is reachable again.
# slave side (node01:6379, the restarted old master)
3873:S 25 Mar 17:24:33.832 * Connecting to MASTER 10.0.0.102:6380
3873:S 25 Mar 17:24:33.833 * MASTER <-> SLAVE sync started
3873:S 25 Mar 17:24:33.835 * Non blocking connect for SYNC fired the event.
3873:S 25 Mar 17:24:33.837 * Master replied to PING, replication can continue...
3873:S 25 Mar 17:24:33.840 * Trying a partial resynchronization (request b3a120153f855c5b200783267f6d88655d616318:1).
3873:S 25 Mar 17:24:33.843 * Full resync from master: 6b10906d0f362be8f9dfcb373c47d2ab44f8f805:21930

# master side (node03:6380, the newly promoted master)
2654:M 25 Mar 17:24:33.845 * Slave 10.0.0.100:6379 asks for synchronization
2654:M 25 Mar 17:24:33.845 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'b3a120153f855c5b200783267f6d88655d616318', my replication IDs are '6b10906d0f362be8f9dfcb373c47d2ab44f8f805' and 'e5a8131d602c8d58155a74b1bad17fae955431f1')
2654:M 25 Mar 17:24:33.846 * Starting BGSAVE for SYNC with target: disk
2654:M 25 Mar 17:24:33.846 * Background saving started by pid 3089
3089:C 25 Mar 17:24:33.851 * DB saved on disk
3089:C 25 Mar 17:24:33.852 * RDB: 0 MB of memory used by copy-on-write
2654:M 25 Mar 17:24:33.861 * Background saving terminated with success
2654:M 25 Mar 17:24:33.862 * Synchronization with slave 10.0.0.100:6379 succeeded
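
Finally, the resulting topology can be checked from any node (a sketch; the node IDs are the ones appearing in the logs above, and the full output is omitted):

[root@node01 redis]# bin/redis-cli -h 10.0.0.100 -p 6379 cluster nodes
# Expected: 0955dc1eeeec59c1e9b72eca5bcbcd04af108820 (10.0.0.102:6380) is now listed
# as "master" for the failed-over slots, and 9c02aef2d45e44678202721ac923c615dd8300ea
# (10.0.0.100:6379) is listed as its "slave".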