Paxos Made Practical（译）

2019-03-11 本文已影响0人 WILL_HUNTING

Paxos [3] is a simple protocol that a group of ma- chines in a distributed system can use to agree on a value proposed by a member of the group. If it terminates, the protocol reaches consensus even if the network was unreliable and multiple machines simultaneously tried to propose differ- ent values. The basic idea is that each proposal has a unique number. Higher numbered pro- posals override lower-numbered ones. However, a “proposer” machine must notify the group of its proposal number before proposing a partic- ular value. If, after hearing from a majority of the group, the proposer learns one or more values from previous proposals, it must re-use the same value as the highest-numbered previ- ous proposal. Otherwise, the proposer can select any value to propose.

Paxos是一个简单的分布式系统中节点一致性协议。Paxos协议可以保证，即使网络不可靠且多台机器同时尝试提出不同的值，协议也会达成一致。协议的基本思想是每个提议（proposal）都有一个唯一的id，id高的提议会覆盖id底的提议。同时，一个“proposer”节点在正式提议某个值之前必须要通知到一组成员。通过询问组内多数成员，Proposer可以从之前的议案中学习到一个或者多个值，并且必须使用先序号最高的议案作为它的提议。否则（没有更高的议案序号），这个Proposer可以选择任何一个值发起提议。

The protocol has three rounds. In the first round, the proposer selects a proposal num- ber, n > 0. n’s low-order bits should con- tain a unique identifier for the proposer ma- chine, so that two different machines never se- lect the same n. The proposer then broadcasts the message prepare(n). Each group member either rejects this message if it has already seen a prepare message greater than n, replies with prepare-result(n′, v′) if the highest numbered proposal it has seen is n′ < n for value v′, or replies with prepare-result(0, nil) if it has not yet seen any value proposed.

Paxos协议分为三个阶段。第一个阶段，Proposer生成一个序号，n>0。其中序号n必须为全局唯一的，确保两个不同的节点生成相同的序号。然后Proposer广播prepare message prepare(n)。组内节点收到prepare消息后检查是否收到过序号大于n的消息，如果收到过则返回拒绝。否则，返回之前收到的序号n'最大的消息对应的序号，以及value，既prepare-result(n', v')，其中n'<n，如果之前没有收到过任何propose消息，则返回prepare-result(0, nil)。

If at least a majority of the group (including the proposer) accepts the prepare message, the proposer moves to the second round. It sets v to the value in the highest-numbered prepare- result it received. If v is nil, it selects any value it wishes for v. The proposer then broad-casts the message propose(n, v). Again, each group member rejects this message if it has seen a prepare(n′′) message with n′′ > n. Other- wise, it indicates acceptance in its reply to the proposer.

如果多数（包括proposer本身）接受了prepare message，Proposer进入第二阶段。这个Proposer选择收到accept消息中序号最大的n对应的value为v，如果v等于nil，则可以任选一个值最为v。然后再次广播propose message(n, v)。同样的，组内其他成员检查是否收到过prepare消息prepare(n'', v'')，如果n''>n则拒绝。否则回复接收消息。

If at least a majority of the group (includ- ing the proposer) accepts the propose message, the proposer broadcasts decide(n, v) to indicate that the group has agreed on value v.

如果至少多数（包含proposer本身）接受了propose message，Proposer广播decide(n, v)消息，通知组内成员这个值v已经被接受。

A number of fault-tolerant distributed sys- tems [1, 4, 8] have been published that claim to use Paxos for consensus. However, this is tanta- mount to saying they use sockets for consensus— it leaves many details unspecified. To begin with, systems must agree on more than one value. Moreover, in fault-tolerant systems, machines come and go. If one is using Paxos to agree on the set of machines replicating a service, does a majority of machines mean a majority of the old replica set, the new set, or both? How do you know it is safe to agree on a new set of replicas? Will the new set have all the state from the old set? What about operations in progress at the time of the change? What if machines fail and none of the new replicas receive the decide mes- sage? Many such complicated questions are just not addressed in the literature.

他们使用套接字来达成共识——这留下了许多未指明的细节。首先，系统必须就多个值达成一致。此外，在容错系统中，机器来来去去。如果使用Paxos就复制服务的机器集达成一致，那么大多数机器是指旧的节点、新节点还是两者都有？你怎么知道同意一套新的复制品是安全的呢？新集合是否拥有旧集合的所有状态？更改时正在进行的操作如何?如果机器出现故障，并且没有一个新的副本收到decide message怎么办？许多这样复杂的问题在文献中都没有提到。

The one paper that makes a comprehensive effort to explain how to use a Paxos-like pro- tocol in a real system is Viewstamped Replica- tion [6]. However, that paper has two shortcom- ings, the first cosmetic, the second substantive. First, Viewstamped Replication is described in terms of distributed transactions. As depicted in Figure 1, a system consists of groups of ma- chines. Each group contains of one or more co- horts, which are machines that maintain replicasof the same objects. Different groups store dif- ferent objects. Transactions span objects in mul- tiple groups, with a two-phase commit used to commit transactions across groups. Distributed transactions add a great deal of complexity that not all applications need, making it harder to fig- ure out how to replicate a simple system. Thus, in this paper, we concentrate on implementing a single group that replicates a state machine.

有一篇论文对如何在实际系统中使用类似于paxos的协议进行了全面的解释，它就是viewstamp Replication[6]。然而，这篇论文有两个短板，第一个是修饰性的，第二个是实质性的。首先，使用分布式事务描述viewstamp复制。如图1所示，一个系统由一组机器组成。每个组包含一个或多个队列，这些队列是维护相同对象的副本的机器。不同的组存储不同的对象。事务跨多组组中的对象，使用两阶段提交跨组提交事务。分布式事务增加了并非所有应用程序都需要的大量复杂性，使得确定如何复制一个简单的系统变得更加困难。因此，在本文中，我们主要实现一个复制状态机的组。

image.png

The second limitation of Viewstamped Replication is an assumption that the set of possible cohorts is fixed over time. The system requires participation of a majority of all possible cohorts, when it would be far better to require only a majority of all active cohorts. For example, if a group has five cohorts and two of them fail, the group will reconfigure itself to have three co- horts and continue to operate. However, if one more cohort fails before first two can be fixed, the group will also fail. This is unfortunate, since a majority of the three active cohorts is still functioning. More generally, for many applications it would be desirable to be able to add and remove cohorts from a group dynamically, for instance to facilitate migrating cohorts for maintenance or load-balancing purposes.

viewstamp复制的第二个限制是它假设一组可能的队列随着时间的推移是固定的。这就要求所有可能的团体的大多数参加，而只要求所有积极团体的多数参加则要好得多。例如，如果一个组有五个队列，其中两个失败，则该组将重新配置自己为有三个队列，并继续运行。然而，如果在前两个队列之前有一个队列失败，那么这个组也会失败。这是不幸的，因为这三个积极小组中的大多数仍在运作。更一般地说，对于许多应用程序来说，能够动态地从组中添加和删除队列是可取的，例如，为了便于迁移队列以进行维护或负载平衡。

The remainder of this paper describes a practical protocol one can use to replicate systems for fault tolerance. Unlike other papers on replicated systems, it doesn’t gloss over the details of how to use Paxos. It also overcomes a significant limitation of Viewstamped Replication and likely other Paxos-based systems.

本文的其余部分描述了一个实用的协议，可以用来复制系统的容错。与其他关于复制系统的论文不同，它没有掩盖如何使用Paxos的细节。它还克服了viewstamp复制和其他基于paxos的系统的显著限制。

Paxos Made Practical（译）

猜你喜欢

热点阅读