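Generally speaking, indexing a database requires a separate data structure to hold the index data. When building an index for hbase, one option is to create a separate index table: a query first looks up the index table, then uses the result to query the data table.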
[img]http://dl2.iteye.com/upload/attachment/0099/4712/041675af-eed4-3f4d-9007-7367504ca6e5.png[/img]
In this figure, the index table is on the left and the data table is on the right.
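The two-step lookup can be sketched roughly as follows. This is not hbase client code: two in-memory `TreeMap`s stand in for the index table and the data table, and the class name, the `city` column and the `value|rowkey` index key layout are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexTableLookupSketch {
    // Stand-in for the data table: rowkey -> value of the indexed column.
    private final TreeMap<String, String> dataTable = new TreeMap<>();
    // Stand-in for the index table: "value|rowkey" -> rowkey (hypothetical layout).
    private final TreeMap<String, String> indexTable = new TreeMap<>();

    void put(String rowKey, String city) {
        dataTable.put(rowKey, city);
        indexTable.put(city + "|" + rowKey, rowKey);
    }

    List<String> findRowsByCity(String city) {
        List<String> rows = new ArrayList<>();
        String prefix = city + "|";
        // Step 1: prefix scan over the index table.
        for (Map.Entry<String, String> e : indexTable.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the range of index entries for this value
            }
            // Step 2: use the rowkey from the index to look up the data table.
            String rowKey = e.getValue();
            if (dataTable.containsKey(rowKey)) {
                rows.add(rowKey);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        IndexTableLookupSketch t = new IndexTableLookupSketch();
        t.put("row1", "beijing");
        t.put("row2", "shanghai");
        t.put("row3", "beijing");
        System.out.println(t.findRowsByCity("beijing")); // prints [row1, row3]
    }
}
```

In a real cluster each of the two steps is a separate request, possibly to a different server, which is where the locality problem discussed next comes from.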
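For a distributed database like hbase, however, the biggest problem is keeping the index table and the data table local to each other. Load balancing, table splits and similar events can easily spread the index-table data and the data-table data across different region servers. In the figure below, for example, the data table and the index table have ended up on different region servers.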
[img]http://dl2.iteye.com/upload/attachment/0099/4719/8d7d9d6c-6041-33a5-85d8-c16dae6c3e60.png[/img]
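To solve this problem, the [url=https://issues.apache.org/jira/browse/HBASE-2037]ihbase[/url] project came into being. Its main idea is to build the index at the region level rather than at the table level.
Its solution replaces the regular region implementation with IdxRegion, which builds the index for the region at flush time: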
@Override
protected void internalPreFlashcacheCommit() throws IOException {
    rebuildIndexes();
    super.internalPreFlashcacheCommit();
}
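To make the "index per region, rebuilt at flush" idea concrete, here is a toy model. It is not the IdxRegion code: the class, its fields and the exact ordering inside `flush()` are assumptions for illustration, and it says nothing about how ihbase treats rows that have not been flushed yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RegionIndexSketch {
    // Rows already flushed, sorted by rowkey: rowkey -> column value.
    private final TreeMap<String, String> flushed = new TreeMap<>();
    // Recent writes that have not been flushed yet.
    private final TreeMap<String, String> memstore = new TreeMap<>();
    // The index belongs to this region only: column value -> sorted rowkeys.
    private TreeMap<String, List<String>> index = new TreeMap<>();

    void put(String rowKey, String value) {
        memstore.put(rowKey, value);
    }

    void flush() {
        flushed.putAll(memstore);
        memstore.clear();
        rebuildIndex(); // the index is rebuilt as part of every flush
    }

    private void rebuildIndex() {
        TreeMap<String, List<String>> fresh = new TreeMap<>();
        // Iterating in rowkey order keeps each rowkey list sorted.
        for (Map.Entry<String, String> e : flushed.entrySet()) {
            fresh.computeIfAbsent(e.getValue(), v -> new ArrayList<>()).add(e.getKey());
        }
        index = fresh;
    }

    List<String> indexedRowKeys(String value) {
        return index.getOrDefault(value, new ArrayList<>());
    }

    public static void main(String[] args) {
        RegionIndexSketch region = new RegionIndexSketch();
        region.put("r2", "a");
        region.put("r1", "a");
        region.put("r3", "b");
        System.out.println(region.indexedRowKeys("a")); // prints []
        region.flush();
        System.out.println(region.indexedRowKeys("a")); // prints [r1, r2]
    }
}
```

Because the index is a field of the region object, it always sits on the same server as the rows it describes, which is the locality property a separate index table cannot guarantee.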
At scan time, it provides a special scanner:
@Override
protected InternalScanner instantiateInternalScanner(Scan scan,
        List<KeyValueScanner> additionalScanners) throws IOException {
    Expression expression = IdxScan.getExpression(scan);
    if (scan == null || expression == null) {
        totalNonIndexedScans.incrementAndGet();
        return super.instantiateInternalScanner(scan, additionalScanners);
    } else {
        totalIndexedScans.incrementAndGet();
        // Grab a new search context
        IdxSearchContext searchContext = indexManager.newSearchContext();
        // use the expression evaluator to determine the final set of ints
        IntSet matchedExpression = expressionEvaluator.evaluate(searchContext,
                expression);
        if (LOG.isDebugEnabled()) {
            LOG.debug(String.format("%s rows matched the index expression",
                    matchedExpression.size()));
        }
        return new IdxRegionScanner(scan, searchContext, matchedExpression);
    }
}
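ihbase keeps an in-memory index for each region. During a scan it first searches the index, which supplies the matching rowkeys in order; the regular scan can then use each of those rowkeys to move forward and seek purposefully instead of reading every row.
When IdxRegionScanner scans, it uses the index to construct a keyProvider; each call to next then positions the scanner using the rowkey supplied by the keyProvider: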
@Override
public boolean next(List<KeyValue> outResults) throws IOException {
    // Seek to the next key value
    seekNext();
    boolean result = super.next(outResults);
    // part of the code omitted
    return result;
}
The seekNext method takes the next rowkey from the keyProvider and then jumps to that rowkey:
protected void seekNext() throws IOException {
    KeyValue keyValue;
    do {
        keyValue = keyProvider.next();
        if (keyValue == null) {
            // out of results keys, nothing more to process
            super.getStoreHeap().close();
            return;
        } else if (lastKeyValue == null) {
            // first key returned from the key provider
            break;
        } else {
            // it's possible that the super nextInternal method progressed past the
            // keyProvider's next key. We need to keep calling next on the keyProvider
            // until the key returned is after the last key returned from the
            // next(List<KeyValue>) method.
            // determine which of the two keys is less than the other
            // when the keyValue is greater than the lastKeyValue then we're good
            int comparisonResult = comparator.compareRows(keyValue, lastKeyValue);
            if (comparisonResult > 0) {
                break;
            }
        }
    } while (true);
    // seek the store heap to the next key
    // (this is what makes the scanner faster)
    getStoreHeap().seek(keyValue);
}
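The same seek-then-read loop can be reproduced on plain Java collections, as a rough sketch. A `TreeMap` stands in for the store heap, `ceilingEntry` stands in for `seek`, and a sorted list of rowkeys stands in for the keyProvider; none of these names come from ihbase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SeekingScanSketch {
    // store: rows sorted by rowkey; matchedKeys: sorted rowkeys supplied by the index.
    static List<String> scan(TreeMap<String, String> store, List<String> matchedKeys) {
        List<String> results = new ArrayList<>();
        Iterator<String> keyProvider = matchedKeys.iterator();
        String lastKey = null;
        while (true) {
            // Like seekNext: take keys from the provider until one lies
            // after the last row that was returned.
            String target = null;
            while (keyProvider.hasNext()) {
                String candidate = keyProvider.next();
                if (lastKey == null || candidate.compareTo(lastKey) > 0) {
                    target = candidate;
                    break;
                }
            }
            if (target == null) {
                break; // the provider has no more keys
            }
            // Seek: jump straight to the first row at or after the target key.
            Map.Entry<String, String> row = store.ceilingEntry(target);
            if (row == null) {
                break; // no rows left in the store
            }
            results.add(row.getKey() + "=" + row.getValue());
            lastKey = row.getKey();
        }
        return results;
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = new TreeMap<>();
        String[] values = {"a", "b", "c", "d", "e", "f"};
        for (int i = 0; i < values.length; i++) {
            store.put("r" + (i + 1), values[i]);
        }
        // Only r2 and r5 are visited; r1, r3, r4 and r6 are never read.
        System.out.println(scan(store, Arrays.asList("r2", "r5"))); // prints [r2=b, r5=e]
    }
}
```

The inner loop mirrors the comparison against lastKeyValue above: if a seek lands beyond one of the provider's keys, that key is dropped rather than causing a backward seek.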
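My feeling is that the problem with this implementation is its high memory usage. I also don't know whether the index and the data can stay consistent if load balancing moves the region to another region server.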