
Testing Flink Real-Time Streaming Series (2): Setting Up a DataGen Data-Generation Node Server (Hadoop + HiBench)

谷星文
2023-12-01

I. Install and Run Hadoop on the Server Node

For installing and running single-node Hadoop, see: Setting Up Hadoop (v2.7.1): Single-Node Pseudo-Distributed Mode, 2-Node Cluster, and 5-Node Cluster.
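For quick reference, the Hadoop daemons can be brought up with the standard Hadoop scripts; a minimal sketch, assuming the install path /home/yjiang2/hadoop-2.7.1 used in the configuration below:

$ cd /home/yjiang2/hadoop-2.7.1
$ sbin/start-dfs.sh     # starts NameNode, DataNode and SecondaryNameNode
$ sbin/start-yarn.sh    # starts ResourceManager and NodeManager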

II. Install and Run HiBench on the Server Node

Download HiBench-7.0, extract it, change into the HiBench-7.0 directory, and edit the relevant configuration files under conf/:
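One possible way to fetch and build it (a sketch only: it assumes Maven and a JDK are installed and uses the GitHub v7.0 tag archive; the -Pflinkbench profile is an assumption, so check HiBench's own build documentation for the exact flags matching your Flink/Kafka versions):

$ wget https://github.com/Intel-bigdata/HiBench/archive/v7.0.tar.gz
$ tar -xzf v7.0.tar.gz
$ cd HiBench-7.0
$ mvn -Pflinkbench clean package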

1. Edit the hadoop.conf configuration file

The key parameters here are hibench.hadoop.home and hibench.hdfs.master:

# Hadoop home
#hibench.hadoop.home     /PATH/TO/YOUR/HADOOP/ROOT
hibench.hadoop.home      /home/yjiang2/hadoop-2.7.1

# The path of hadoop executable
#hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
hibench.hadoop.executable     /home/yjiang2/hadoop-2.7.1/bin/hadoop

# Hadoop configuration directory
#hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
hibench.hadoop.configure.dir   /home/yjiang2/hadoop-2.7.1/etc/hadoop

# The root HDFS path to store HiBench data
hibench.hdfs.master       hdfs://localhost:9000


# Hadoop release provider. Supported value: apache, cdh5, hdp
hibench.hadoop.release    apache
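
To confirm that hibench.hdfs.master points at a reachable NameNode, a quick sanity check using the hadoop binary configured above:

$ /home/yjiang2/hadoop-2.7.1/bin/hadoop fs -ls hdfs://localhost:9000/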

2. Edit the hibench.conf configuration file

The key parameters here are hibench.streambench.zkHost and hibench.streambench.kafka.brokerList: hibench.streambench.zkHost lists the ZooKeeper cluster nodes, and hibench.streambench.kafka.brokerList lists the Kafka broker nodes. For setting up those servers, see Testing Flink Real-Time Streaming Series (1): Setting Up a ZK + Kafka Cluster.
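Before going further, it is worth verifying that the ZooKeeper and Kafka endpoints are reachable. A minimal check, assuming a Kafka 0.8.x installation as in Part 1 of this series (KAFKA_HOME below is a placeholder for your Kafka install path):

$ $KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper 1.17.1.45:2181,1.17.1.115:2181,10.110.169.76:2181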

# Data scale profile. Available values: tiny, small, large, huge, gigantic and bigdata.
# The definitions of these profiles can be found in the workload's conf file, e.g. conf/workloads/micro/wordcount.conf
hibench.scale.profile                 ud
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         40

# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     20


#======================================================
# Report files
#======================================================
# default report formats
hibench.report.formats          "%-12s %-10s %-8s %-20s %-20s %-20s %-20s\n"

# default report dir path
hibench.report.dir              ${hibench.home}/report

# default report file name
hibench.report.name             hibench.report

# input/output format settings. Available formats: Text, Sequence.
sparkbench.inputformat         Sequence
sparkbench.outputformat        Sequence

# hibench config folder
hibench.configure.dir           ${hibench.home}/conf

# default hibench HDFS root
hibench.hdfs.data.dir           ${hibench.hdfs.master}/HiBench

# path of hibench jars
hibench.hibench.datatool.dir              ${hibench.home}/autogen/target/autogen-7.0-jar-with-dependencies.jar
hibench.common.jar                      ${hibench.home}/common/target/hibench-common-7.0-jar-with-dependencies.jar
hibench.sparkbench.jar                  ${hibench.home}/sparkbench/assembly/target/sparkbench-assembly-7.0-dist.jar
hibench.streambench.stormbench.jar      ${hibench.home}/stormbench/streaming/target/stormbench-streaming-7.0.jar
hibench.streambench.gearpump.jar        ${hibench.home}/gearpumpbench/streaming/target/gearpumpbench-streaming-7.0-jar-with-dependencies.jar
hibench.streambench.flinkbench.jar      ${hibench.home}/flinkbench/streaming/target/flinkbench-streaming-7.0-jar-with-dependencies.jar

hibench.streambench.flink.bufferTimeout 1000
hibench.streambench.flink.checkpointDuration    1000
#======================================================
# workload home/input/output path
#======================================================
hibench.hive.home               ${hibench.home}/hadoopbench/sql/target/${hibench.hive.release}
hibench.hive.release            apache-hive-0.14.0-bin
hibench.hivebench.template.dir  ${hibench.home}/hadoopbench/sql/hive_template
hibench.bayes.dir.name.input    ${hibench.workload.dir.name.input}
hibench.bayes.dir.name.output   ${hibench.workload.dir.name.output}

hibench.mahout.release.apache   apache-mahout-distribution-0.11.0
hibench.mahout.release.hdp      apache-mahout-distribution-0.11.0
hibench.mahout.release.cdh5         mahout-0.9-cdh5.1.0
hibench.mahout.release                ${hibench.mahout.release.${hibench.hadoop.release}}
hibench.mahout.home                       ${hibench.home}/hadoopbench/mahout/target/${hibench.mahout.release}

hibench.masters.hostnames
hibench.slaves.hostnames

hibench.workload.input
hibench.workload.output
hibench.workload.dir.name.input         Input
hibench.workload.dir.name.output        Output

hibench.nutch.dir.name.input    ${hibench.workload.dir.name.input}
hibench.nutch.dir.name.output   ${hibench.workload.dir.name.output}
hibench.nutch.nutchindexing.dir ${hibench.home}/hadoopbench/nutchindexing/
hibench.nutch.release           nutch-1.2
hibench.nutch.home              ${hibench.home}/hadoopbench/nutchindexing/target/${hibench.nutch.release}

hibench.dfsioe.dir.name.input   ${hibench.workload.dir.name.input}
hibench.dfsioe.dir.name.output  ${hibench.workload.dir.name.output}


#======================================================
# Streaming General
#======================================================
# Indicates whether to run in debug mode for correctness verification (default: false)
hibench.streambench.debugMode false
hibench.streambench.sampleProbability 0.1
hibench.streambench.fixWindowDuration            10000
hibench.streambench.fixWindowSlideStep           10000


#======================================================
# Kafka for streaming benchmarks
#======================================================
hibench.streambench.kafka.home                  #/home/axel/FlinkTest/kafka_2.11-0.8.2.2
# zookeeper host:port of kafka cluster, host1:port1,host2:port2...
hibench.streambench.zkHost                      1.17.1.45:2181,1.17.1.115:2181,10.110.169.76:2181
# Kafka broker list, in the form host:port,host:port,...
hibench.streambench.kafka.brokerList            1.17.1.45:9092,1.17.1.115:9092,10.110.169.76:9092
# number of partitions of generated topic (default 20)
hibench.streambench.kafka.topicPartitions       20
# consumer group of the consumer for kafka (default: HiBench)
hibench.streambench.kafka.consumerGroup HiBench
# Set the starting offset of kafkaConsumer (default: largest)
hibench.streambench.kafka.offsetReset largest


#======================================================
# Data generator for streaming benchmarks
#======================================================
# Interval span in milliseconds (default: 50)
hibench.streambench.datagen.intervalSpan         10
# Number of records to generate per interval span (default: 5)
hibench.streambench.datagen.recordsPerInterval   100
# fixed length of record (default: 200)
hibench.streambench.datagen.recordLength         200
# Number of KafkaProducer running on different thread (default: 1)
hibench.streambench.datagen.producerNumber       16
# Total round count of data send (default: -1 means infinity)
hibench.streambench.datagen.totalRounds          -1
# Number of total records that will be generated (default: -1 means infinity)
hibench.streambench.datagen.totalRecords        -1
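# With intervalSpan = 10 ms and recordsPerInterval = 100, each producer thread
# emits about 100 / 0.010 s = 10,000 records/s; with producerNumber = 16 and
# recordLength = 200 bytes, the aggregate load is roughly 160,000 records/s,
# i.e. ~32 MB/s pushed into Kafka.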
# default path to store seed files (default: ${hibench.hdfs.data.dir}/Streaming)
hibench.streambench.datagen.dir                         ${hibench.hdfs.data.dir}/Streaming
# default path settings for generated data1 & data2
hibench.streambench.datagen.data1.name                  Seed
hibench.streambench.datagen.data1.dir                   ${hibench.streambench.datagen.dir}/${hibench.streambench.datagen.data1.name}
hibench.streambench.datagen.data2_cluster.dir           ${hibench.streambench.datagen.dir}/Kmeans/Cluster
hibench.streambench.datagen.data2_samples.dir           ${hibench.streambench.datagen.dir}/Kmeans/Samples

#======================================================
# MetricsReader for streaming benchmarks
#======================================================
# Number of sample records for `MetricsReader` (default: 5000000)
hibench.streambench.metricsReader.sampleNum      5000000
# Number of thread for `MetricsReader` (default: 20)
hibench.streambench.metricsReader.threadNum      20
# The dir where the benchmark reports are stored (default: ${hibench.home}/report)
hibench.streambench.metricsReader.outputDir      ${hibench.home}/report

III. Run the HiBench Data-Generation Scripts (Using the repartition Test Case as an Example)

1. Check the running Hadoop processes with jps

$ jps
155943 DataNode
157675 NodeManager
39738 Jps
156508 SecondaryNameNode
157148 ResourceManager
155580 NameNode

2. First, run the script that generates the seed data. For the repartition case, run:

$ cd HiBench-7.0
$ ./bin/workloads/streaming/repartition/prepare/genSeedDataset.sh
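
Given the paths configured above (hibench.hdfs.data.dir = hdfs://localhost:9000/HiBench and hibench.streambench.datagen.data1.dir ending in /Streaming/Seed), the seed files can be verified with a quick listing, assuming those defaults are unchanged:

$ /home/yjiang2/hadoop-2.7.1/bin/hadoop fs -ls /HiBench/Streaming/Seed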

3. Then run the script that generates the data:

$ ./bin/workloads/streaming/repartition/prepare/dataGen.sh
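
To confirm that records are actually flowing into Kafka, list the topics and tail the generated one; a sketch using the Kafka 0.8-era console tools (KAFKA_HOME is a placeholder, and the exact topic name depends on your workload configuration):

$ $KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper 1.17.1.45:2181
$ $KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper 1.17.1.45:2181 --topic <generated_topic>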

4. To increase the data-generation workload, create a new script, e.g. dataGenP.sh, which launches several dataGen.sh instances in parallel (the instance count is passed as the first argument):

$ cat ./bin/workloads/streaming/repartition/prepare/dataGenP.sh
#!/bin/bash

# Number of parallel dataGen instances to launch (defaults to 4)
proc_num=${1:-4}
echo "Starting ${proc_num} dataGen instances"
for i in $(seq 1 "${proc_num}")
do
    /home/yjiang2/HiBench-7.0/bin/workloads/streaming/repartition/prepare/dataGen.sh &
done
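
Make the script executable and pass the desired number of parallel generators, for example:

$ chmod +x ./bin/workloads/streaming/repartition/prepare/dataGenP.sh
$ ./bin/workloads/streaming/repartition/prepare/dataGenP.sh 8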

You can then watch the real-time data send/receive status with the dstat command.
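For example, to sample CPU and network throughput once per second (assuming dstat is installed):

$ dstat -cn 1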

5. To stop dataGen midway, run the following command:

ps -ef | grep DataGenerator | grep -v grep | awk '{print $2}' | xargs kill -9
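
An equivalent and simpler alternative, if pkill is available:

$ pkill -9 -f DataGenerator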
