In our experience, a cluster can be well balanced overall while one table's regions are badly concentrated on a few region servers. For example, one table had 839 regions (380 at table creation time), 202 of which sat on a single server.
It would be desirable for the load balancer to distribute the regions of specified tables evenly across the cluster. Each such table has many times more regions than there are servers in the cluster.
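To make the kind of imbalance described above concrete, here is a minimal sketch (plain Java, not HBase balancer code; the class name, the skew threshold, and the region-to-server map are all illustrative assumptions) that counts how many regions of one table each server hosts and flags servers holding far more than their fair share:

```java
import java.util.HashMap;
import java.util.Map;

public class TableSkewCheck {
    /**
     * Counts how many regions of a single table each server hosts and prints
     * servers whose count exceeds the ideal per-server share by skewFactor.
     * regionToServer: region name -> hosting server, for one table only.
     */
    static void reportSkew(Map<String, String> regionToServer, double skewFactor) {
        Map<String, Integer> perServer = new HashMap<>();
        for (String server : regionToServer.values()) {
            perServer.merge(server, 1, Integer::sum);
        }
        if (perServer.isEmpty()) {
            return;
        }
        double ideal = (double) regionToServer.size() / perServer.size();
        for (Map.Entry<String, Integer> e : perServer.entrySet()) {
            if (e.getValue() > ideal * skewFactor) {
                System.out.printf("%s hosts %d regions (ideal ~%.1f)%n",
                        e.getKey(), e.getValue(), ideal);
            }
        }
    }
}
```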
Jonathan Gray's first comment was:
On cluster startup in 0.90, regions are assigned in one of two ways. By default, it will attempt to retain the previous assignment of the cluster. The other option, which I've also used, is round-robin, which evenly distributes each table.
That plus the change to do round-robin on table create should probably cover per-table distribution fairly well.
I think the next step for the load balancer is a major effort to switch to something with more of a cost-based approach. Ideally you don't need an even distribution of each table; you want an even distribution of load. If there is one hot table, it will get evenly balanced anyway.
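As a rough illustration of the per-table round-robin idea mentioned above (this is not the AssignmentManager code, just a sketch with hypothetical names), each table's regions are simply dealt out to servers in turn, so no single server accumulates a disproportionate share of that table:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinAssign {
    /**
     * Deals the regions of one table out to servers in turn, so each server
     * ends up with floor(n/m) or ceil(n/m) regions of that table.
     */
    static Map<String, List<String>> assign(List<String> regions, List<String> servers) {
        Map<String, List<String>> plan = new HashMap<>();
        for (String s : servers) {
            plan.put(s, new ArrayList<>());
        }
        for (int i = 0; i < regions.size(); i++) {
            String server = servers.get(i % servers.size());
            plan.get(server).add(regions.get(i));
        }
        return plan;
    }
}
```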
Have you guys considered using a consistent hashing method to choose which server a region belongs to? You would create ~50 buckets for each server by hashing serverName_port_bucketNum, and then hash the start key of each region into the buckets.
He listed a few benefits and a few drawbacks of this approach (the full lists are in the JIRA thread linked below); a rough sketch of the bucket scheme follows.
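The "~50 buckets per server" scheme described in the comment can be pictured as a hash ring. The sketch below is purely illustrative and is not anything HBase implements; the use of MD5, the 8-byte hash, and the class name are assumptions, while hashing serverName_port_bucketNum and the region start key comes from the comment itself:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashSketch {
    // Ring of virtual buckets: hash -> server that owns the bucket.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Adds a server under several virtual buckets: hash(serverName_port_bucketNum). */
    void addServer(String serverNameAndPort, int buckets) {
        for (int i = 0; i < buckets; i++) {
            ring.put(hash(serverNameAndPort + "_" + i), serverNameAndPort);
        }
    }

    /** Maps a region's start key to the first bucket at or after its hash, wrapping around. */
    String serverFor(byte[] regionStartKey) {
        long h = hash(new String(regionStartKey, StandardCharsets.UTF_8));
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    /** First 8 bytes of an MD5 digest, interpreted as a long. */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The appeal is that adding or removing a server only remaps the regions in its buckets; Jonathan Gray's reply below argues this benefit does not apply to HBase, since regions are not physically tied to their servers.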
I think consistent hashing would be a major step backwards for us and unnecessary because there is no cost of moving bits around in HBase. The primary benefit of consistent hashing is that it reduces the amount of data you have to physically move around. Because of our use of HDFS, we never have to move physical data around.
In your benefit list, we are already implementing almost all of these features, or if not, it is possible in the current architecture. In addition, our architecture is extremely flexible and we can do all kinds of interesting load balancing techniques related to actual load profiles not just #s of shards/buckets as we do today or as would be done with consistent hashing.
The fact that split regions open back up on the same server is actually an optimization in many cases: it reduces the amount of time the regions are offline, and when they come back online and do a compaction to drop references, all the files are more likely to be on the local DataNode rather than remote. In some cases, like time-series, you may want the splits to move to different servers. I could imagine some configurable logic in there to ensure the bottom half goes to a different server (or maybe the top half would actually be more efficient to move away, since most of the time you'll write more to the bottom half and thus want the data locality / quick turnaround). There's likely going to be a bit of split rework in 0.92 to make it more like the ZK-based regions-in-transition.
As far as binding regions to servers between cluster restarts, this is already implemented and on by default in 0.90.
Consistent hashing also requires a fixed keyspace (right?) and that's a mismatch for HBase's flexibility in this regard.
For further discussion, see: https://issues.apache.org/jira/browse/HBASE-3373
A discussion of region assignment in HBase (in Chinese): http://www.spnguru.com/?p=246