zk -- Building Distributed Systems with ZooKeeper (58)

龙博
2023-12-01

// (zk) Sources of concurrency issues in distributed systems

Processes in a distributed system have two broad options for communication: they can exchange messages directly through a network, or read and write to some shared storage. ZooKeeper uses the shared storage model to let applications implement coordination and synchronization primitives. But shared storage itself requires network communication between the processes and the storage. It is important to stress the role of network communication because it is an important source of complications in the design of a distributed system.

In real systems, it is important to watch out for the following issues:

Message delays
    Messages can get arbitrarily delayed; for instance, due to network congestion. Such arbitrary delays may introduce undesirable situations. 
    For example, process P may send a message before another process Q sends its message, according to a reference clock, but Q's message might be delivered first.

Processor speed
    Operating system scheduling and overload might induce arbitrary delays in message processing.
    When one process sends a message to another, the overall latency of this message is roughly the sum of the processing time on the sender, the transmission time, and the processing time on the receiver.

Clock drift
    It is not uncommon to find systems that use some notion of time, such as when determining the time at which events occur in the system.
    Processor clocks are not reliable and arbitrarily drift away from each other. Consequently, relying upon processor clocks might lead to incorrect decisions.
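The message-delay issue above can be reproduced with a toy simulation (the timings and process names here are made up): delivery order depends on send time plus network delay, not on send order alone.

```python
# Toy model of message reordering: arrival time = send time + network delay.
# All delays are hypothetical; this is an illustration, not a network model.
def deliver_order(messages):
    """messages: list of (send_time, delay, sender).
    Returns the senders in the order their messages arrive."""
    arrivals = [(send + delay, sender) for send, delay, sender in messages]
    return [sender for _, sender in sorted(arrivals)]

# P sends at t=0 but hits congestion (delay 10); Q sends later, at t=1 (delay 2).
order = deliver_order([(0, 10, "P"), (1, 2, "Q")])
print(order)  # ['Q', 'P'] -- Q's message arrives first even though P sent first
```

This is exactly the scenario in the text: by the reference clock P sends before Q, yet Q's message is delivered first.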

One important consequence of these issues is that it is very hard in practice to tell if a process has crashed or if any of these factors is introducing some arbitrary delay. Not receiving a message from a process could mean that it has crashed, that the network is delaying its latest message arbitrarily, that there is something delaying the process, or that the process clock is drifting away. A system in which such a distinction can’t be made is said to be asynchronous. // A system that cannot tell which kind of delay caused the problem is called asynchronous.

ZooKeeper does not make the problems disappear or render them completely transparent to applications, but it does make the problems more tractable. ZooKeeper implements solutions to important distributed computing problems and packages up these implementations in a way that is intuitive to developers… at least, this has been our hope all along. // more tractable, more intuitive

Master Failures

Recovering the state is not the only important issue. Suppose that the primary master is up, but the backup master suspects that the primary master has crashed. This false suspicion could happen because, for example, the primary master is heavily loaded and its messages are being delayed arbitrarily. The backup master will execute all necessary procedures to take over the role of primary master and may eventually start executing the role of primary master, becoming a second primary master. Even worse, if some workers can’t communicate with the primary master, say because of a network partition, they may end up following the second primary master. This scenario leads to a problem commonly called split-brain: two or more parts of the system make progress independently, leading to inconsistent behavior. As part of coming up with a way to cope with master failures, it is critical that we avoid split-brain scenarios. // avoiding split-brain is the key point
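One common way to rule out split-brain is fencing, sketched below as a toy model (this is not ZooKeeper's actual protocol, and the class and names are made up): the shared storage tracks the highest master epoch it has handed out and rejects writes from any master carrying a stale epoch.

```python
# Minimal fencing sketch (illustrative, not ZooKeeper's protocol):
# shared storage remembers the highest master epoch and fences off
# writes from an old primary that still believes it is in charge.
class FencedStore:
    def __init__(self):
        self.epoch = 0    # highest master epoch observed so far
        self.data = {}

    def register_master(self):
        self.epoch += 1   # each new master gets a strictly larger epoch
        return self.epoch

    def write(self, epoch, key, value):
        if epoch < self.epoch:
            return False  # stale master: write rejected
        self.data[key] = value
        return True

store = FencedStore()
m1 = store.register_master()            # primary master, epoch 1
m2 = store.register_master()            # backup takes over, epoch 2
print(store.write(m2, "task", "by-m2")) # True: current master succeeds
print(store.write(m1, "task", "by-m1")) # False: old primary is fenced off
```

Even if the old primary is alive and merely slow, its writes can no longer corrupt the state, so the two masters cannot both make progress.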

Worker Failures

Clients submit tasks to the master, which assigns the tasks to available workers. The workers receive assigned tasks and report the status of the execution once these tasks have been executed. The master next informs the clients of the results of the execution.
If a worker crashes, all tasks that were assigned to it and not completed must be reassigned. The first requirement here is to give the master the ability to detect worker crashes: the master must be able to detect when a worker crashes and to determine what other workers are available to execute its tasks. A crashed worker may also have partially executed tasks, or even fully executed tasks without reporting the results. If the computation has side effects, some recovery procedure might be necessary to clean up the state.
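Crash detection by heartbeat timeout can be sketched as follows; the class, method names, and timings are illustrative, not a ZooKeeper API. A worker that has not been heard from within the timeout is presumed crashed and its tasks are handed to a live worker.

```python
# Toy master that detects worker crashes via missed heartbeats and
# reassigns their tasks. Names and timings are hypothetical.
class Master:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}    # worker -> last heartbeat timestamp
        self.assignments = {}  # task -> worker

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def assign(self, task, worker):
        self.assignments[task] = worker

    def check(self, now):
        """Presume silent workers crashed; reassign their tasks to live ones."""
        alive = {w for w, t in self.last_seen.items()
                 if now - t <= self.timeout}
        for task, worker in list(self.assignments.items()):
            if worker not in alive and alive:
                self.assignments[task] = next(iter(sorted(alive)))
        return alive

m = Master(timeout=5)
m.heartbeat("w1", now=0)
m.heartbeat("w2", now=8)
m.assign("t1", "w1")
alive = m.check(now=10)   # w1 silent for 10s -> presumed crashed
print(m.assignments)      # {'t1': 'w2'}
```

Note that, per the asynchrony discussion above, "presumed crashed" is the best the master can do: w1 may merely be slow, which is exactly why the fencing and duplicate-execution issues in the following sections arise.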

Communication Failures
If a worker becomes disconnected from the master, say due to a network partition, reassigning a task could lead to two workers executing the same task. If executing a task more than once is acceptable, we can reassign without verifying whether the first worker has executed the task. If it is not acceptable, then the application must be able to accommodate the possibility that multiple workers may end up trying to execute the task. // accommodate


Exactly-Once and At-Most-Once Semantics

Using locks for tasks (as in the case of master election) is not sufficient to avoid having tasks executed multiple times because we can have, for example, the following succession of events:
1. Master M1 assigns task T1 to worker W1.
2. W1 acquires the lock for T1, executes it, and releases the lock.
3. Master M1 suspects that W1 has crashed and reassigns task T1 to W2.
4. W2 acquires the lock for T1, executes it, and releases the lock.
Here, the lock over T1 did not prevent the task from being executed twice because the two workers held the lock at different times rather than concurrently. When exactly-once or at-most-once semantics are required, an application relies on mechanisms that are specific to its nature. For example, if application data has timestamps and a task is supposed to modify application data, then a successful execution of the task could be conditional on the timestamp values of the data it touches. The application also needs the ability to roll back partial changes in the case that the application state is not modified atomically; otherwise, it might end up with an inconsistent state.
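The timestamp idea can be sketched in a few lines (the record layout and function name are hypothetical): a task's effect is applied only if the data's timestamp is older than the task's, so a duplicate execution after reassignment becomes a no-op on the data.

```python
# Sketch of timestamp-conditional task execution: a re-executed task
# finds the data already at (or past) its timestamp and does nothing.
# The record layout {"value": ..., "ts": ...} is illustrative.
def run_task(store, key, new_value, task_ts):
    record = store.get(key)
    if record is not None and record["ts"] >= task_ts:
        return False  # data already reflects this (or a newer) task: no-op
    store[key] = {"value": new_value, "ts": task_ts}
    return True

store = {}
print(run_task(store, "x", 42, task_ts=100))  # True: W1 executes T1
print(run_task(store, "x", 42, task_ts=100))  # False: W2's re-execution is a no-op
print(store["x"]["value"])                    # 42
```

This gives at-most-once *effect* on the data even though the task itself may run twice; it does not, by itself, handle tasks with external side effects, which is why the text mentions rollback of partial changes.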


Another important issue with communication failures is the impact they have on synchronization primitives like locks. Because nodes can crash and systems are prone to network partitions, locks can be problematic: if a node crashes or gets partitioned away, the lock can prevent others from making progress. ZooKeeper consequently needs to implement mechanisms to deal with such scenarios. First, it enables clients to say that some data in the ZooKeeper state is ephemeral. Second, the ZooKeeper ensemble requires that clients periodically notify it that they are alive. If a client fails to notify the ensemble in a timely manner, then all ephemeral state belonging to this client is deleted. Using these two mechanisms, we are able to prevent individual clients from bringing the application to a halt in the presence of crashes and communication failures. // ephemeral state & heartbeats
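These two mechanisms can be illustrated with a toy model of an ensemble (a sketch, not the real ZooKeeper protocol; names are made up): ephemeral data is tied to a client session, and a session that stops heartbeating expires and takes its ephemeral data with it.

```python
# Toy model of ephemeral state plus session heartbeats. When a session
# expires, its ephemeral data (e.g. a lock) is deleted automatically,
# so a crashed or partitioned client cannot block others forever.
class Ensemble:
    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.sessions = {}    # client -> last heartbeat time
        self.ephemerals = {}  # path -> owning client

    def heartbeat(self, client, now):
        self.sessions[client] = now

    def create_ephemeral(self, client, path):
        self.ephemerals[path] = client

    def tick(self, now):
        """Expire silent sessions and delete their ephemeral state."""
        expired = {c for c, t in self.sessions.items()
                   if now - t > self.session_timeout}
        for c in expired:
            del self.sessions[c]
        self.ephemerals = {p: c for p, c in self.ephemerals.items()
                           if c not in expired}

e = Ensemble(session_timeout=3)
e.heartbeat("lock-holder", now=0)
e.create_ephemeral("lock-holder", "/lock")
e.tick(now=2)
print("/lock" in e.ephemerals)  # True: session still alive
e.tick(now=10)
print("/lock" in e.ephemerals)  # False: session expired, lock released
```

In real ZooKeeper the same effect is obtained by creating a znode in ephemeral mode; when the owning client's session expires, the znode is removed by the ensemble.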

Summary of Tasks
From the preceding descriptions, we can extract the following requirements for our master-worker architecture:

Master election
It is critical for progress to have a master available to assign tasks to workers.

Crash detection
The master must be able to detect when workers crash or disconnect.

Group membership management
The master must be able to figure out which workers are available to execute tasks.

Metadata management
The master and the workers must be able to store assignments and execution statuses in a reliable manner.

Ideally, each of these tasks is exposed to the application in the form of a primitive, hiding the implementation details completely from the application developer. ZooKeeper provides key mechanisms to implement such primitives so that developers can implement the ones that best suit their needs and focus on the application logic. Throughout this book, we often refer to implementations of tasks like master election or crash detection as primitives, because these are concrete tasks that distributed applications build upon.

Byzantine Faults
The truly difficult problems you will encounter as you develop distributed applications have to do with faults—specifically, crashes and communication faults. These failures can crop up at any point, and it may be impossible to enumerate all the different corner cases that need to be handled.

// Consistency and availability
A similar result known as CAP, which stands for Consistency, Availability, and Partition-tolerance, says that when designing a distributed system we may want all three of those properties, but that no system can provide all three simultaneously.
ZooKeeper has been designed with mostly consistency and availability in mind, although it also provides read-only capability in the presence of network partitions.

// What do we compromise on?
Okay, so we cannot have an ideal fault-tolerant, distributed, real-world system that transparently takes care of all problems that might ever occur. We can strive for a slightly less ambitious goal, though. First, we have to relax some of our assumptions and/or our goals. For example, we may assume that the clock is synchronized within some bounds; we may choose to be always consistent and sacrifice the ability to tolerate some network partitions; there may be times when a process may be running, but must act as if it is faulty because it cannot be sure of the state of the system. While these are compromises, they are compromises that have allowed us to build some rather impressive distributed systems.
// Real-world deployment results
Over the years we have found that people can easily deploy a ZooKeeper cluster and develop applications for it—so easily, in fact, that some developers use it without completely understanding some of the cases that require the developer to make decisions that ZooKeeper cannot make by itself.
