The following describes how Storm handles fault tolerance. Although the points below are theoretical descriptions, they can be verified during actual testing.
1) What happens when a worker dies?
When a worker dies, the supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
When a worker dies, the supervisor is responsible for restarting it. If the worker repeatedly fails to start, so that Nimbus stops receiving its heartbeats for a long time, Nimbus reassigns the worker to another machine.
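The restart-then-escalate behavior can be sketched as a small simulation. This is an illustrative model only, not Storm's actual supervisor code; the function name, restart budget, and return values are assumptions made for the example:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=3):
    """Restart `cmd` until it exits cleanly or the restart budget is spent.

    Models the supervisor's role: keep relaunching a dead worker process.
    """
    restarts = 0
    while restarts <= max_restarts:
        proc = subprocess.Popen(cmd)
        proc.wait()
        if proc.returncode == 0:
            return "exited cleanly"
        restarts += 1
        time.sleep(0.1)  # back off briefly before restarting
    # A real cluster escalates at this point: Nimbus notices the missing
    # heartbeats and reassigns the worker's slot to another machine.
    return "gave up"
```

The key design point mirrored here is the separation of concerns: local restarts are cheap and handled on the machine itself, and only persistent failure escalates to cluster-level reassignment.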
2) What happens when a node dies?
The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines.
In other words, when a node goes down, the tasks running on it time out and Nimbus reassigns them to other machines.
3) What happens when Nimbus or Supervisor daemons die?
The Nimbus and Supervisor daemons are designed to be fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk). As described in Setting up a Storm cluster, the Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart like nothing happened.
Nimbus and Supervisor are designed to be stateless (all state is kept in Zookeeper or on disk) and fail-fast, so that they can be restarted quickly after a crash. Therefore, when running Storm it is best to have a supervision program (such as daemontools or monit) responsible for restarting a dead Nimbus or Supervisor.
In Storm, a brief outage of Nimbus or a Supervisor has essentially no impact on the workers; this is very different from Hadoop, where a JobTracker failure has serious consequences.
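As a concrete illustration of running the daemons under supervision, here is a hypothetical config for supervisord (an alternative to the daemontools/monit tools named above); the install path is an assumption:

```ini
; Hypothetical supervisord stanza -- the storm install path is an assumption.
[program:nimbus]
command=/opt/storm/bin/storm nimbus
autorestart=true
; wait this long before considering the start successful
startsecs=10
```

Because the daemons keep all state in Zookeeper or on disk, a blind `autorestart` like this is safe: the restarted process simply reads its state back and continues as if nothing happened.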
4) Is Nimbus a single point of failure?
This question asks whether Nimbus is a SPOF. When Nimbus dies, worker processes keep running, and the supervisor can restart dead workers on its own; that process does not involve Nimbus at all. However, if a worker repeatedly fails to restart on its own machine, a dead Nimbus cannot reassign that worker to another machine.
So Nimbus can be considered a SPOF, but its failure does not have consequences as severe as a JobTracker failure in Hadoop.
Storm provides mechanisms to guarantee data processing even if nodes die or messages are lost. See Guaranteeing message processing for the details.
As for how Storm guarantees data reliability: when a node dies or a message is lost, the message is replayed (retried). See: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
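The replay mechanism that the linked page describes can be sketched as follows. This is an illustrative model, not Storm's implementation: the class name, methods, and timeout value are all assumptions for the example (in real Storm, spouts receive `ack`/`fail` callbacks and the replay timeout is a topology setting):

```python
class ReplayTracker:
    """Illustrative at-least-once delivery: track in-flight messages and
    replay any that are not acknowledged within a timeout."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.pending = {}  # msg_id -> (payload, emit_time)

    def emit(self, msg_id, payload, now):
        # remember every emitted message until it is fully acknowledged
        self.pending[msg_id] = (payload, now)

    def ack(self, msg_id):
        # downstream processing succeeded: forget the message
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # explicit failure: hand the payload back for immediate replay
        payload, _ = self.pending.pop(msg_id)
        return payload

    def timed_out(self, now):
        # messages with no ack within the timeout are candidates for replay
        return [m for m, (_, t) in self.pending.items()
                if now - t > self.timeout]
```

Note that this gives at-least-once semantics: a message whose ack was merely delayed may be processed twice, which is exactly the trade-off Storm's guaranteed-processing mechanism makes.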