The following describes how Storm handles fault tolerance. Although the points below are theoretical descriptions, they can be verified during actual testing.
1) What happens when a worker dies?
When a worker dies, the supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
When a worker dies, the supervisor is responsible for restarting it. If the worker repeatedly fails to start, so that Nimbus stops receiving its heartbeats for a long time, Nimbus reassigns the worker to another machine.
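The restart-then-escalate behavior can be sketched as a small simulation. This is an illustrative model only, not Storm's actual supervisor code; the function name, restart budget, and return values are assumptions made for the example:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=3):
    """Restart `cmd` until it exits cleanly or the restart budget is spent.

    Models the supervisor's role: keep relaunching a dead worker process.
    """
    restarts = 0
    while restarts <= max_restarts:
        proc = subprocess.Popen(cmd)
        proc.wait()
        if proc.returncode == 0:
            return "exited cleanly"
        restarts += 1
        time.sleep(0.1)  # back off briefly before restarting
    # A real cluster escalates at this point: Nimbus notices the missing
    # heartbeats and reassigns the worker's slot to another machine.
    return "gave up"
```

The key design point mirrored here is the separation of concerns: local restarts are cheap and handled on the machine itself, and only persistent failure escalates to cluster-level reassignment.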
2) What happens when a node dies?
The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines.
In other words, when a node goes down, the tasks running on it time out and Nimbus reassigns them to other machines.
3) What happens when Nimbus or Supervisor daemons die?
The Nimbus and Supervisor daemons are designed to be fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk). As described in Setting up a Storm cluster, the Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart like nothing happened.
Nimbus and Supervisor are designed to be stateless (all state is kept in Zookeeper or on disk) and fail-fast, so that they can be restarted quickly after a crash. Therefore, when running Storm it is best to have a supervision program (such as daemontools or monit) responsible for restarting a dead Nimbus or Supervisor.
In Storm, a brief outage of Nimbus or a Supervisor has essentially no impact on the workers; this is very different from Hadoop, where a JobTracker failure has serious consequences.
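As a concrete illustration of running the daemons under supervision, here is a hypothetical config for supervisord (an alternative to the daemontools/monit tools named above); the install path is an assumption:

```ini
; Hypothetical supervisord stanza -- the storm install path is an assumption.
[program:nimbus]
command=/opt/storm/bin/storm nimbus
autorestart=true
; wait this long before considering the start successful
startsecs=10
```

Because the daemons keep all state in Zookeeper or on disk, a blind `autorestart` like this is safe: the restarted process simply reads its state back and continues as if nothing happened.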
4) Is Nimbus a single point of failure?
This question asks whether Nimbus is a SPOF. When Nimbus dies, worker processes keep running, and the supervisor can restart dead workers on its own; that process does not involve Nimbus at all. However, if a worker repeatedly fails to restart on its own machine, a dead Nimbus cannot reassign that worker to another machine.
So Nimbus can be considered a SPOF, but its failure does not have consequences as severe as a JobTracker failure in Hadoop.
Storm provides mechanisms to guarantee data processing even if nodes die or messages are lost. See Guaranteeing message processing for the details.
As for how Storm guarantees data reliability: when a node dies or a message is lost, the message is replayed (retried). See: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
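The replay mechanism that the linked page describes can be sketched as follows. This is an illustrative model, not Storm's implementation: the class name, methods, and timeout value are all assumptions for the example (in real Storm, spouts receive `ack`/`fail` callbacks and the replay timeout is a topology setting):

```python
class ReplayTracker:
    """Illustrative at-least-once delivery: track in-flight messages and
    replay any that are not acknowledged within a timeout."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.pending = {}  # msg_id -> (payload, emit_time)

    def emit(self, msg_id, payload, now):
        # remember every emitted message until it is fully acknowledged
        self.pending[msg_id] = (payload, now)

    def ack(self, msg_id):
        # downstream processing succeeded: forget the message
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # explicit failure: hand the payload back for immediate replay
        payload, _ = self.pending.pop(msg_id)
        return payload

    def timed_out(self, now):
        # messages with no ack within the timeout are candidates for replay
        return [m for m, (_, t) in self.pending.items()
                if now - t > self.timeout]
```

Note that this gives at-least-once semantics: a message whose ack was merely delayed may be processed twice, which is exactly the trade-off Storm's guaranteed-processing mechanism makes.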