groupBy does not itself repartition the stream; it turns the input stream into a grouped stream, so a subsequent aggregation such as aggregate runs per group.
1. groupBy can precede partitionAggregate. In that case partitionAggregate performs a grouped aggregation over the data within each partition.
2. groupBy can precede aggregate. In that case all tuples of a batch are sent to a single partition, and the tuples in that partition are then aggregated per group.
Example code for each case follows.
1.
FixedBatchSpout spout = new FixedBatchSpout(new Fields("user", "score"), 3,
        new Values("nickt1", 4),
        new Values("nickt2", 7),
        new Values("nickt1", 8),
        new Values("nickt1", 9),
        new Values("nickt5", 7),
        new Values("nickt5", 11),
        new Values("nickt7", 5)
);
spout.setCycle(false);
TridentTopology topology = new TridentTopology();
topology.newStream("spout1", spout)
        .shuffle()
        .each(new Fields("user", "score"), new Debug("shuffle print:"))
        .parallelismHint(3)
        .groupBy(new Fields("user"))
        // State is assumed to be a simple holder class, e.g. class State { int count = 0; }
        .partitionAggregate(new Fields("score"), new BaseAggregator<State>() {
            // Called once before each batch is processed
            // (also called for empty batches)
            @Override
            public State init(Object batchId, TridentCollector collector) {
                return new State();
            }
            // Called once for each tuple in the batch
            @Override
            public void aggregate(State state, TridentTuple tuple, TridentCollector collector) {
                state.count = tuple.getInteger(0) + state.count;
            }
            // Called after all tuples in the batch have been processed
            @Override
            public void complete(State state, TridentCollector collector) {
                collector.emit(new Values(state.count));
            }
        }, new Fields("sum"))
        .each(new Fields("user", "sum"), new BaseFunction() {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                System.out.println("sum:" + tuple.toString());
            }
        }, new Fields());
Output:
[partition2-Thread-58-b-0-executor[35 35]]> DEBUG(shuffle print:): [nickt1, 8]
sum:[nickt1, 8]
[partition0-Thread-72-b-0-executor[33 33]]> DEBUG(shuffle print:): [nickt2, 7]
sum:[nickt2, 7]
[partition1-Thread-136-b-0-executor[34 34]]> DEBUG(shuffle print:): [nickt1, 4]
sum:[nickt1, 4]
[partition0-Thread-72-b-0-executor[33 33]]> DEBUG(shuffle print:): [nickt1, 9]
[partition0-Thread-72-b-0-executor[33 33]]> DEBUG(shuffle print:): [nickt5, 7]
[partition0-Thread-72-b-0-executor[33 33]]> DEBUG(shuffle print:): [nickt5, 11]
sum:[nickt1, 9]
sum:[nickt5, 18] //second batch: because shuffle partitioning is used, [nickt1, 9], [nickt5, 7], and [nickt5, 11] all happened to land on partition0; grouping that partition by user and summing gives 9 for nickt1 and 18 for nickt5
[partition0-Thread-72-b-0-executor[33 33]]> DEBUG(shuffle print:): [nickt7, 5]
sum:[nickt7, 5]
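The grouped per-partition sum in the trace above can be checked with a small plain-Java sketch. This is not Trident API; the tuple layout and the `sumByUser` helper are illustrative stand-ins for what groupBy + partitionAggregate computes inside one partition:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupedPartitionSum {
    // Simulates groupBy("user") followed by partitionAggregate within a single
    // partition: tuples are grouped by user and each group is summed independently.
    static Map<String, Integer> sumByUser(Object[][] partition) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Object[] tuple : partition) {
            String user = (String) tuple[0];
            int score = (Integer) tuple[1];
            sums.merge(user, score, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // The tuples that landed on partition0 in the second batch above
        Object[][] partition0 = {
            {"nickt1", 9}, {"nickt5", 7}, {"nickt5", 11}
        };
        System.out.println(sumByUser(partition0)); // {nickt1=9, nickt5=18}
    }
}
```

This matches the trace: nickt1 sums to 9 and nickt5 to 7 + 11 = 18 within partition0.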
2.
FixedBatchSpout spout = new FixedBatchSpout(new Fields("user", "score"), 3,
        new Values("nickt1", 4),
        new Values("nickt2", 7),
        new Values("nickt1", 8),
        new Values("nickt1", 9),
        new Values("nickt5", 7),
        new Values("nickt5", 11),
        new Values("nickt7", 5)
);
spout.setCycle(false);
TridentTopology topology = new TridentTopology();
topology.newStream("spout1", spout)
        .shuffle()
        .each(new Fields("user", "score"), new Debug("shuffle print:"))
        .parallelismHint(5)
        .groupBy(new Fields("user"))
        .aggregate(new Fields("score"), new CombinerAggregator<Integer>() {
            // Called once for each tuple in the partition
            @Override
            public Integer init(TridentTuple tuple) {
                return tuple.getInteger(0);
            }
            // Merges partial results: on the first call, val1 is the value
            // returned by zero(); on later calls it is the previous combine()
            // result. val2 is the value returned by init() for each tuple.
            @Override
            public Integer combine(Integer val1, Integer val2) {
                return val1 + val2;
            }
            // Also called when the partition contains no tuples
            @Override
            public Integer zero() {
                return 0;
            }
        }, new Fields("sum"))
        .each(new Fields("user", "sum"), new Debug("sum print:"))
        .parallelismHint(5);
Output:
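The init/combine/zero contract used above can be sketched in plain Java. The `Combiner` interface and the `aggregate` helper below are hypothetical stand-ins for Trident's `CombinerAggregator`, meant only to show the call order for one group's tuples:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {
    // Hypothetical stand-in for Trident's CombinerAggregator contract
    interface Combiner<T> {
        T init(int tupleValue); // called once per input tuple
        T combine(T a, T b);    // folds two partial results together
        T zero();               // identity value, used when a group is empty
    }

    // Folds one group's tuple values using the contract above:
    // start from zero(), then combine in each tuple's init() value.
    static <T> T aggregate(List<Integer> scores, Combiner<T> c) {
        T acc = c.zero();
        for (int s : scores) {
            acc = c.combine(acc, c.init(s));
        }
        return acc;
    }

    public static void main(String[] args) {
        Combiner<Integer> sum = new Combiner<Integer>() {
            public Integer init(int v) { return v; }
            public Integer combine(Integer a, Integer b) { return a + b; }
            public Integer zero() { return 0; }
        };
        // nickt5's group from the sample data: 7 + 11
        System.out.println(aggregate(Arrays.asList(7, 11), sum)); // 18
        // zero() covers the empty-group case
        System.out.println(aggregate(Arrays.<Integer>asList(), sum)); // 0
    }
}
```

Because combine is associative here, Trident can compute partial sums per partition before repartitioning, which is the efficiency advantage of a CombinerAggregator over a plain aggregator.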