Question:

Sorting a DataStream with Apache Flink

姚新霁

2023-03-14

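I am learning Flink and started with a simple word count using a DataStream. To extend the processing a little, I filter the output so that it only shows words that were found 3 or more times.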

    DataStream<Tuple2<String, Integer>> dataStream = env
            .socketTextStream("localhost", 9000)
            .flatMap(new Splitter())
            .keyBy(0)
            .timeWindow(Time.seconds(5))
            .apply(new MyWindowFunction())
            .sum(1)
            .filter(word -> word.f1 >= 3);

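I want to create a WindowFunction that sorts the output by the count found for each word. The WindowFunction I am trying to implement does not compile at all. I am struggling to define the apply method and the type parameters of the WindowFunction interface.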

public static class MyWindowFunction implements WindowFunction<
        Tuple2<String, Integer>, // input type
        Tuple2<String, Integer>, // output type
        Tuple2<String, Integer>, // key type
        TimeWindow> {

    void apply(Tuple2<String, Integer> key, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) {

        String word = ((Tuple2<String, Integer>)key).f0;
        Integer count = ((Tuple2<String, Integer>)key).f1;

        .........
        out.collect(new Tuple2<>(word, count));
    }
}

2 answers

姚星腾

2023-03-14

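As long as the Splitter class (which should be a FlatMapFunction) emits Tuple2 records, the .sum(1) method will do everything you need (there is no need to use apply()).

The .sum(1) call then does the aggregation for you. If you need something different from what sum() does, you would typically use .reduce(new MyCustomReduceFunction()), since that is the most efficient and scalable approach: it does not need to buffer a lot of data in memory.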
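
To illustrate why a reduce-style aggregation does not need to buffer elements, here is a minimal plain-Java sketch with no Flink dependency. It keeps one accumulated value per word and folds each incoming element into it. The class name IncrementalReduceSketch, the SUM operator and the reducePerKey helper are hypothetical names used only for this illustration; a BinaryOperator stands in for a Flink ReduceFunction.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class IncrementalReduceSketch {

    // Stand-in for a ReduceFunction (assumption): combines two counts into one.
    static final BinaryOperator<Integer> SUM = (a, b) -> a + b;

    // Folds (word, count) pairs one at a time, keeping a single accumulated
    // value per word instead of buffering every element.
    static Map<String, Integer> reducePerKey(List<Map.Entry<String, Integer>> elements,
                                             BinaryOperator<Integer> reducer) {
        Map<String, Integer> accumulators = new HashMap<>();
        for (Map.Entry<String, Integer> e : elements) {
            accumulators.merge(e.getKey(), e.getValue(), reducer);
        }
        return accumulators;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("swim", 1), Map.entry("soccer", 1), Map.entry("swim", 1));
        // Prints the per-word totals (map iteration order is unspecified).
        System.out.println(reducePerKey(input, SUM));
    }
}
```

However many elements arrive, the memory held per word stays at one integer, which is the property the answer refers to.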

余弘新

2023-03-14

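I am updating this answer to use Flink 1.12.0. To sort the elements of the stream, I had to use a KeyedProcessFunction after counting the stream with a ReduceFunction. Then I had to set the parallelism of the last transformation to 1 so that the order of the elements sorted by the KeyedProcessFunction is not changed. The sequence I am using is socketTextStream -> flatMap -> keyBy -> reduce -> keyBy -> process -> print().setParallelism(1):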

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketWindowWordCountJava {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.socketTextStream("localhost", 9000)
                .flatMap(new SplitterFlatMap())
                .keyBy(new WordKeySelector())
                .reduce(new SumReducer())
                .keyBy(new WordKeySelector())
                .process(new SortKeyedProcessFunction(3 * 1000))
                .print().setParallelism(1);
        String executionPlan = env.getExecutionPlan();
        System.out.println("ExecutionPlan ........................ ");
        System.out.println(executionPlan);
        System.out.println("........................ ");
        env.execute("Window WordCount sorted");
    }
}

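The UDF I use to sort the stream is SortKeyedProcessFunction, which extends KeyedProcessFunction. It keeps the events in a ValueState<List<Event>> (listState), registers a processing-time timer in processElement, and emits the sorted list in onTimer: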

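import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class SortKeyedProcessFunction extends KeyedProcessFunction<String, Tuple2<String, Integer>, Event> {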
        private static final long serialVersionUID = 7289761960983988878L;
        // delay after which an alert flag is thrown
        private final long timeOut;
        // state to remember the last timer set
        private ValueState<List<Event>> listState = null;
        private ValueState<Long> lastTime = null;

        public SortKeyedProcessFunction(long timeOut) {
            this.timeOut = timeOut;
        }

        @Override
        public void open(Configuration conf) {
            // set up the state handles for the buffered events and the last timer
            ValueStateDescriptor<List<Event>> descriptor = new ValueStateDescriptor<>(
                    // state name
                    "sorted-events",
                    // type information of state
                    TypeInformation.of(new TypeHint<List<Event>>() {
                    }));
            listState = getRuntimeContext().getState(descriptor);

            ValueStateDescriptor<Long> descriptorLastTime = new ValueStateDescriptor<Long>(
                    "lastTime",
                    TypeInformation.of(new TypeHint<Long>() {
                    }));

            lastTime = getRuntimeContext().getState(descriptorLastTime);
        }

        @Override
        public void processElement(Tuple2<String, Integer> value, Context context, Collector<Event> collector) throws Exception {
            // get current time and compute timeout time
            long currentTime = context.timerService().currentProcessingTime();
            long timeoutTime = currentTime + timeOut;
            // register timer for timeout time
            context.timerService().registerProcessingTimeTimer(timeoutTime);

            List<Event> queue = listState.value();
            if (queue == null) {
                queue = new ArrayList<Event>();
            }
            queue.add(new Event(value.f0, value.f1));
            lastTime.update(timeoutTime);
            listState.update(queue);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
            // System.out.println("onTimer: " + timestamp);
            // check if this was the last timer we registered
            System.out.println("timestamp: " + timestamp);
            List<Event> queue = listState.value();
            Long current = lastTime.value();

            if (timestamp == current.longValue()) {
                Collections.sort(queue);

                queue.forEach( e -> {
                    out.collect(e);
                });
                queue.clear();
                listState.clear();
            }
        }
    }

class Event implements Comparable<Event> {
    String value;
    Integer qtd;
    public Event(String value, Integer qtd) {
        this.value = value;
        this.qtd = qtd;
    }
    public String getValue() { return value; }
    public Integer getQtd() { return qtd; }
    @Override
    public String toString() {
        return "Event{" +"value='" + value + '\'' +", qtd=" + qtd +'}';
    }
    @Override
    public int compareTo(Event event) {
        return this.getValue().compareTo(event.getValue());
    }
}
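
The buffer-then-flush logic of processElement and onTimer can be tried without a Flink cluster. The following is a minimal plain-Java sketch under stated assumptions: one instance stands in for the state of a single key, plain fields replace ValueState, and the caller passes timestamps in by hand instead of using a timer service. The class name BufferSortFlushSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BufferSortFlushSketch {
    private final long timeOut;
    // Stand-ins (assumption) for the listState and lastTime value states.
    private final List<String> buffer = new ArrayList<>();
    private long lastTimer = -1L;

    BufferSortFlushSketch(long timeOut) {
        this.timeOut = timeOut;
    }

    // Mirrors processElement: buffer the element and remember the newest timer.
    // Returns the timestamp of the timer that would be registered.
    long processElement(String word, long now) {
        buffer.add(word);
        lastTimer = now + timeOut;
        return lastTimer;
    }

    // Mirrors onTimer: only the most recently registered timer flushes the buffer.
    List<String> onTimer(long timestamp) {
        if (timestamp != lastTimer) {
            return Collections.emptyList();
        }
        List<String> sorted = new ArrayList<>(buffer);
        Collections.sort(sorted);
        buffer.clear();
        return sorted;
    }

    public static void main(String[] args) {
        BufferSortFlushSketch f = new BufferSortFlushSketch(3000L);
        long t1 = f.processElement("swim", 0L);
        long t2 = f.processElement("soccer", 1000L);
        System.out.println(f.onTimer(t1)); // stale timer: prints []
        System.out.println(f.onTimer(t2)); // newest timer: prints [soccer, swim]
    }
}
```

Comparing the firing timestamp with the last registered one means that a burst of elements produces a single sorted flush once the input has been quiet for the timeout period.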

So when I run $ nc -lk 9000 and type words into the console, I see them in order in the output:

...
Event{value='soccer', qtd=7}
Event{value='swim', qtd=5}

...
Event{value='basketball', qtd=9}
Event{value='soccer', qtd=8}
Event{value='swim', qtd=6}
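
The output above is ordered by word because Event.compareTo compares the value field. If the ordering should follow the count instead, as the question asks, a Comparator could be passed to the sort. The following plain-Java sketch assumes a hypothetical WordCount class as a stand-in for Event; the names SortByCountSketch, BY_COUNT_DESC and sortedByCount are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByCountSketch {

    // Hypothetical stand-in for the Event class, reduced to its two fields.
    static final class WordCount {
        final String value;
        final int qtd;

        WordCount(String value, int qtd) {
            this.value = value;
            this.qtd = qtd;
        }

        @Override
        public String toString() {
            return value + "=" + qtd;
        }
    }

    // Highest count first; ties are broken alphabetically by word.
    static final Comparator<WordCount> BY_COUNT_DESC =
            Comparator.comparingInt((WordCount w) -> w.qtd).reversed()
                    .thenComparing(w -> w.value);

    // Returns a sorted copy and leaves the input list untouched.
    static List<WordCount> sortedByCount(List<WordCount> in) {
        List<WordCount> copy = new ArrayList<>(in);
        copy.sort(BY_COUNT_DESC);
        return copy;
    }

    public static void main(String[] args) {
        List<WordCount> counts = List.of(
                new WordCount("swim", 6),
                new WordCount("basketball", 9),
                new WordCount("soccer", 8));
        System.out.println(sortedByCount(counts)); // prints [basketball=9, soccer=8, swim=6]
    }
}
```

In the answer's code the equivalent change would be to sort the buffered list with such a comparator in place of the natural ordering.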

The other UDFs are used by the remaining transformations of the streaming program; they are included here for completeness.

// Splits each line on spaces and emits (word, 1) for every word.
public class SplitterFlatMap implements FlatMapFunction<String, Tuple2<String, Integer>> {
    private static final long serialVersionUID = 3121588720675797629L;

    @Override
    public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
        for (String word : sentence.split(" ")) {
            out.collect(Tuple2.of(word, 1));
        }
    }
}

// Uses the word (field f0) as the key.
public class WordKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
    @Override
    public String getKey(Tuple2<String, Integer> value) throws Exception {
        return value.f0;
    }
}

// Adds up the counts of two records that share the same word.
public class SumReducer implements ReduceFunction<Tuple2<String, Integer>> {
    @Override
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> event1, Tuple2<String, Integer> event2) throws Exception {
        return Tuple2.of(event1.f0, event1.f1 + event2.f1);
    }
}
