Apache Helix (1)

能修谨

2023-12-01

A cluster management framework.

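What pain point does Helix solve?
Helix is a free, open-source infrastructure component for cluster management.
Specifically:
Cluster management: it manages clusters that host partitioned, replicated resources.
Features: (1) detection and handling of hardware and software failures; (2) automatic load balancing by placing resources on nodes according to node capacity, access patterns, and partition size; (3) rebalancing when the cluster expands; (4) node lifecycle operations: add, start, stop, enable, disable; (5) monitoring of cluster health and SLA violations.
Configuration: configuration is managed centrally, so nodes do not need to be configured individually.
Service discovery: it provides a service discovery mechanism for routing requests.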

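Why can't other technologies solve this pain point?
Other technologies are fragmented. Helix integrates a number of open-source building blocks into a single system.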

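How does it do this?
It uses ZooKeeper to pass tasks and state in both directions between the nodes and the cluster manager (CM). Each node has its own queues. A node writes its state to ZooKeeper, and the CM reads it. The CM writes tasks to that node's task queue, and the node reads them.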
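
The two-way exchange described here can be sketched with a plain in-memory object standing in for ZooKeeper. This is only an illustration of the data flow, not Helix's or ZooKeeper's API: the class FakeCoordinationStore, its method names, and the task tuple layout are all hypothetical.

```python
from collections import defaultdict, deque


class FakeCoordinationStore:
    """In-memory stand-in for ZooKeeper (hypothetical, for illustration only)."""

    def __init__(self):
        self.node_states = {}                  # node -> last reported state
        self.task_queues = defaultdict(deque)  # node -> pending tasks (FIFO)

    # Direction 1: node -> cluster manager
    def report_state(self, node, state):
        self.node_states[node] = state

    def read_states(self):
        return dict(self.node_states)

    # Direction 2: cluster manager -> node
    def push_task(self, node, task):
        self.task_queues[node].append(task)

    def pop_task(self, node):
        queue = self.task_queues[node]
        return queue.popleft() if queue else None


store = FakeCoordinationStore()
store.report_state("node2", {"partition1.2": "OFFLINE"})        # node writes its state
seen_by_cm = store.read_states()                                 # CM reads it
store.push_task("node2", ("partition1.2", "OFFLINE", "SLAVE"))  # CM writes a task
task_for_node2 = store.pop_task("node2")                         # node reads the task
```

Neither side calls the other directly in this sketch. Both only read from and write to the shared store, which is the role the text assigns to ZooKeeper.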

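Key concepts: resources, for example a database. A resource is divided into partitions, and each partition can have multiple copies, called replicas. A replica's state can be master, slave, leader, standby, online, offline, and so on.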

Roles: participant (a node), spectator (an observer that only pulls the view), and controller (which controls the nodes). They communicate with one another through ZooKeeper.

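State machine:
For a read-only partition, all replicas are equivalent, so online/offline is enough.
For a read-write partition, each replica is a master, a slave, or offline. The state machine requires a replica to go from offline to slave before it can become master, and the reverse also holds: a master must first become a slave before it can go offline.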
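
The two state machines can be written down as small transition tables. The sketch below is an illustration in plain Python, not Helix's state model classes: the table names and the helper transition_path are made up here, and the path is found with an ordinary breadth-first search.

```python
from collections import deque

# Legal single-step transitions for a read-write partition replica.
MASTER_SLAVE = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}

# Legal single-step transitions for a read-only partition replica.
ONLINE_OFFLINE = {
    "OFFLINE": ["ONLINE"],
    "ONLINE": ["OFFLINE"],
}


def transition_path(model, start, goal):
    """Return the shortest list of states leading from start to goal (BFS)."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in model[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no path from {start} to {goal}")


to_master = transition_path(MASTER_SLAVE, "OFFLINE", "MASTER")
to_offline = transition_path(MASTER_SLAVE, "MASTER", "OFFLINE")
```

Because there is no direct edge between OFFLINE and MASTER in the table, any path between them has to pass through SLAVE, which is the constraint described above.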

Ideal state / current state / external view
The desired global state / the current state of a single node / the aggregate of all nodes' current states

Number of partitions: 64
Number of replicas: 3
Number of nodes: 8
Use 1.1 to denote partition 1, replica 1.
For example, the ideal state:

partition1.1, master, node1
partition1.2, slave, node2
partition1.3, slave, node3
partition2.1, master, node4
partition2.2, slave, node5
// 192 entries in total (64 × 3)
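
To make the count concrete, here is a small sketch that generates such a table. The placement rule is an assumption chosen only because it reproduces the node numbers in the sample rows (replicas are dealt out to nodes round-robin, and replica 1 of each partition is the master). It is not Helix's actual rebalancing algorithm, and build_ideal_state is a hypothetical helper.

```python
NUM_PARTITIONS = 64
NUM_REPLICAS = 3
NUM_NODES = 8


def build_ideal_state(partitions, replicas, nodes):
    """Map 'partitionP.R' -> (state, node) using round-robin placement (assumed)."""
    ideal = {}
    slot = 0  # running replica counter, used to pick the next node
    for p in range(1, partitions + 1):
        for r in range(1, replicas + 1):
            state = "MASTER" if r == 1 else "SLAVE"  # assumption: replica 1 is master
            node = f"node{slot % nodes + 1}"
            ideal[f"partition{p}.{r}"] = (state, node)
            slot += 1
    return ideal


ideal_state = build_ideal_state(NUM_PARTITIONS, NUM_REPLICAS, NUM_NODES)

# Count how many replicas each node ends up hosting.
replicas_per_node = {}
for _state, node in ideal_state.values():
    replicas_per_node[node] = replicas_per_node.get(node, 0) + 1
```

With 192 replicas spread over 8 nodes, this rule puts 24 replicas on each node, and the three replicas of any one partition always land on three different nodes.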

Current state
node2

partition1.2, slave

External view, which ideally should be identical to the ideal state:

partition1.1, master, node1
partition1.2, slave, node2
partition1.3, slave, node3
partition2.1, master, node4
partition2.2, slave, node5
// 192 entries in total (64 × 3)
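
Since the external view is the aggregate of every node's current state, it can be sketched as a simple merge of per-node maps. The function build_external_view and the dictionary layout are illustrative assumptions, not Helix data structures.

```python
def build_external_view(current_states):
    """Merge {node: {replica: state}} into {replica: (state, node)}."""
    view = {}
    for node, replicas in current_states.items():
        for replica, state in replicas.items():
            view[replica] = (state, node)
    return view


# Per-node current states, as each node would report them.
current_states = {
    "node1": {"partition1.1": "MASTER"},
    "node2": {"partition1.2": "SLAVE"},
    "node3": {"partition1.3": "SLAVE"},
}

external_view = build_external_view(current_states)
```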

The controller's job is to define the ideal state, compute the difference between it and the external view, work out a transition plan, and then push tasks to the nodes.
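
One round of that loop might look like the sketch below. It is a simplification under stated assumptions, not the Helix controller: replicas missing from the external view are treated as offline, each replica is assumed to stay on the node named in the ideal state, and only one step of the offline/slave/master state machine is planned per round. NEXT_STEP and plan_transitions are hypothetical names.

```python
# Next single step toward a target state, keyed by (current, target).
# Offline <-> master always goes through slave (assumed MasterSlave model).
NEXT_STEP = {
    ("OFFLINE", "SLAVE"): "SLAVE",
    ("OFFLINE", "MASTER"): "SLAVE",
    ("SLAVE", "MASTER"): "MASTER",
    ("SLAVE", "OFFLINE"): "OFFLINE",
    ("MASTER", "SLAVE"): "SLAVE",
    ("MASTER", "OFFLINE"): "SLAVE",
}


def plan_transitions(ideal_state, external_view):
    """Return [(node, replica, from_state, to_state)] for one controller round."""
    tasks = []
    for replica, (target, node) in sorted(ideal_state.items()):
        # A replica absent from the external view is assumed to be offline.
        current, _ = external_view.get(replica, ("OFFLINE", node))
        if current != target:
            tasks.append((node, replica, current, NEXT_STEP[(current, target)]))
    return tasks


ideal_state = {
    "partition1.1": ("MASTER", "node1"),
    "partition1.2": ("SLAVE", "node2"),
    "partition1.3": ("SLAVE", "node3"),
}
external_view = {
    "partition1.1": ("SLAVE", "node1"),   # not yet promoted
    "partition1.2": ("SLAVE", "node2"),   # already matches
    # partition1.3 has not been reported yet
}

plan = plan_transitions(ideal_state, external_view)
```

Each tuple in plan is a task the controller would push to the named node. Running the same function again after the nodes report their new states yields an empty plan once the external view matches the ideal state.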
