I. Introduction
This post analyzes the Hive benchmark SQL in HiBench (scan, join, and aggregation), based on HiBench 6.0.
Hadoop version: v2.7.1
Spark version: v2.1
II. Questions to answer
1. Do I need to install Hive manually to run HiBench's SQL benchmarks?
No. HiBench automatically downloads the Hive release named in its pom.xml.
2. Which Hive version does HiBench 6.0 use?
Version 0.14.0. The Hive URL shown in HiBench/hadoopbench/sql/pom.xml is:
<properties>
<repo>http://archive.apache.org</repo>
<file>dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz</file>
</properties>
Concatenated, that is http://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz. So if you want to benchmark a different Hive version with HiBench, change this Hive URL (Section III below walks through the upgrade to 1.2.1).
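The two properties are simply joined with a slash; a tiny Scala sketch of the assembly (the variable names are mine, mirroring the pom.xml entries above):

// How <repo> and <file> from hadoopbench/sql/pom.xml combine into the download URL.
val repo = "http://archive.apache.org"
val file = "dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz"
val hiveUrl = s"$repo/$file"
// hiveUrl == "http://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz"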
3. HiBench's Spark SQL bench code, which loads a SQL script file and runs it:
package com.intel.hibench.sparkbench.sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/*
 * ported from HiBench's hive bench
 */
object ScalaSparkSQLBench {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        "Usage: ScalaSparkSQLBench <workload name> <SQL script file>"
      )
      System.exit(1)
    }
    val workload_name = args(0)  // used only as the Spark application name
    val sql_file = args(1)       // path to the .hive script to execute

    val sparkConf = new SparkConf().setAppName(workload_name)
    val sc = new SparkContext(sparkConf)
    val hc = new HiveContext(sc)

    // Read the whole script, split it into statements on ';',
    // and run each non-empty statement through the HiveContext.
    val _sql = scala.io.Source.fromFile(sql_file).mkString
    _sql.split(';').foreach { x =>
      if (x.trim.nonEmpty)
        hc.sql(x)
    }

    sc.stop()
  }
}
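One caveat about the split(';') above: the splitting is purely textual, so a semicolon inside a SQL string literal would also split the statement. A small sketch with a hypothetical script (not from HiBench) illustrates this; it is harmless here because the HiBench .hive scripts shown below contain no quoted semicolons:

object SplitDemo extends App {
  // Hypothetical script text, only to illustrate the splitting rule.
  val script = "CREATE TABLE t (s STRING);\nINSERT INTO TABLE t VALUES ('a;b')"
  // Same pipeline as the bench code: split on ';', drop empty fragments.
  script.split(';').map(_.trim).filter(_.nonEmpty).foreach(println)
  // Prints 3 fragments rather than 2 statements: the quoted ';' in 'a;b' also splits.
}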
4. What exactly are the SQL statements in HiBench's Hive benchmarks (scan/join/aggregation)?
a) scan: SELECTs the entire contents of the generated input table uservisits and inserts them into the output table uservisits_copy. It runs as a single job with one map-only stage and no reduce tasks.
run.sh ultimately invokes hive -f HiBench/report/scan/hadoop/rankings_uservisits_scan.hive, whose contents are:
DROP TABLE IF EXISTS uservisits;
CREATE EXTERNAL TABLE uservisits (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Scan/Input/uservisits';
DROP TABLE IF EXISTS uservisits_copy;
CREATE EXTERNAL TABLE uservisits_copy (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Scan/Output/uservisits_copy';
INSERT OVERWRITE TABLE uservisits_copy SELECT * FROM uservisits;
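For comparison, here is a DataFrame sketch of the same copy on Spark 2.1 (illustration only; HiBench actually feeds the .hive text to hc.sql as shown in the bench code above). Because SELECT * needs no shuffle, the MapReduce plan indeed needs only map tasks:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scan-sketch").enableHiveSupport().getOrCreate()
// Assumes both external tables were already created, e.g. by the script above.
spark.table("uservisits")
  .write.mode("overwrite")
  .insertInto("uservisits_copy")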
b) join: joins the input table rankings with the input table uservisits (registered below under the name uservisits_copy) and writes the result into a new output table rankings_uservisits_join.
DROP TABLE IF EXISTS rankings;
CREATE EXTERNAL TABLE rankings (pageURL STRING, pageRank INT, avgDuration INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Join/Input/rankings';
DROP TABLE IF EXISTS uservisits_copy;
CREATE EXTERNAL TABLE uservisits_copy (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Join/Input/uservisits';
DROP TABLE IF EXISTS rankings_uservisits_join;
CREATE EXTERNAL TABLE rankings_uservisits_join ( sourceIP STRING, avgPageRank DOUBLE, totalRevenue DOUBLE) STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Join/Output/rankings_uservisits_join';
INSERT OVERWRITE TABLE rankings_uservisits_join SELECT sourceIP, avg(pageRank), sum(adRevenue) as totalRevenue FROM rankings R JOIN (SELECT sourceIP, destURL, adRevenue FROM uservisits_copy UV WHERE (datediff(UV.visitDate, '1999-01-01')>=0 AND datediff(UV.visitDate, '2000-01-01')<=0)) NUV ON (R.pageURL = NUV.destURL) group by sourceIP order by totalRevenue DESC;
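The same query sketched with the DataFrame API (again illustration only, reusing the spark session from the scan sketch above):

import org.apache.spark.sql.functions._

// Visits in [1999-01-01, 2000-01-01], as in the WHERE clause above.
val visits = spark.table("uservisits_copy")
  .where(datediff(col("visitDate"), lit("1999-01-01")) >= 0 &&
         datediff(col("visitDate"), lit("2000-01-01")) <= 0)
  .select("sourceIP", "destURL", "adRevenue")
val rankings = spark.table("rankings")
rankings.join(visits, rankings("pageURL") === visits("destURL"))
  .groupBy("sourceIP")
  .agg(avg("pageRank").as("avgPageRank"), sum("adRevenue").as("totalRevenue"))
  .orderBy(desc("totalRevenue"))
  .write.mode("overwrite")
  .insertInto("rankings_uservisits_join")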
After the Hive job completes, inspect the output directory in HDFS:
$ hdfs dfs -ls /HiBench/Join/Output/
Found 1 items
drwxr-xr-x - user supergroup 0 2018-08-21 17:08 /HiBench/Join/Output/rankings_uservisits_join
[user@cluster1 ~]$ hdfs dfs -ls /HiBench/Join/Output/rankings_uservisits_join/
Found 1 items
-rw-r--r-- 1 user supergroup 1000523704 2018-08-21 17:08 /HiBench/Join/Output/rankings_uservisits_join/000000_0
[user@cluster1 ~]$ hdfs dfs -du -s -h /HiBench/Join/Output/rankings_uservisits_join/
954.2 M /HiBench/Join/Output/rankings_uservisits_join
c) aggregation: computes the total ad revenue (sumAdRevenue) per sourceIP from the input table uservisits and writes the result to the output table uservisits_aggre. It runs as a single job with a full map stage and a full reduce stage.
DROP TABLE IF EXISTS uservisits;
CREATE EXTERNAL TABLE uservisits (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Aggregation/Input/uservisits';
DROP TABLE IF EXISTS uservisits_aggre;
CREATE EXTERNAL TABLE uservisits_aggre ( sourceIP STRING, sumAdRevenue DOUBLE) STORED AS SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Aggregation/Output/uservisits_aggre';
INSERT OVERWRITE TABLE uservisits_aggre SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;
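And a DataFrame sketch of the aggregation (illustration only, same spark session as above); the groupBy is what forces the shuffle, i.e. the reduce stage of the MapReduce plan:

import org.apache.spark.sql.functions.sum

spark.table("uservisits")
  .groupBy("sourceIP")
  .agg(sum("adRevenue").as("sumAdRevenue"))
  .write.mode("overwrite")
  .insertInto("uservisits_aggre")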
After the Hive job completes, inspect HDFS (each reduce task writes one output file, hence the 69 files listed below):
$ hdfs dfs -du -s -h /HiBench/Aggregation/Input/
89.4 G /HiBench/Aggregation/Input
$ hdfs dfs -du -s -h /HiBench/Aggregation/Output
16.4 G /HiBench/Aggregation/Output
$ hdfs dfs -ls /HiBench/Aggregation/Output/uservisits_aggre
Found 69 items
255501195 2018-08-21 17:27 /HiBench/Aggregation/Output/uservisits_aggre/000000_0
...
255501195 2018-08-21 17:27 /HiBench/Aggregation/Output/uservisits_aggre/000068_0
III. Upgrading the Hive version in HiBench from 0.14.0 to 1.2.1
Why upgrade? The Hive 0.14.0 source contains no SparkCompiler at all, so how can HiBench run its Spark SQL workloads? Because those workloads do not go through Hive at all (worth repeating: no Hive is involved); they run through Spark SQL, via the ScalaSparkSQLBench code shown above.
TaskCompilerFactory.java in v0.14.0:
/**
 * Returns the appropriate compiler to translate the operator tree
 * into executable units.
 */
public static TaskCompiler getCompiler(HiveConf conf, ParseContext parseContext) {
  if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
    return new TezCompiler();
  } else {
    return new MapReduceCompiler();
  }
}
TaskCompilerFactory.java in v1.2.1 adds a Spark branch:
public static TaskCompiler getCompiler(HiveConf conf, ParseContext parseContext) {
  if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
    return new TezCompiler();
  } else if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("spark")) {
    return new SparkCompiler();
  } else {
    return new MapReduceCompiler();
  }
}
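The practical consequence: after the upgrade, Hive on Spark is selected purely by the hive.execution.engine value that this factory inspects. A minimal sketch (assuming hive-exec 1.2.1 on the classpath) of flipping that knob through the same HiveConf API:

import org.apache.hadoop.hive.conf.HiveConf

object EngineKnob extends App {
  val conf = new HiveConf()
  // Default is "mr" -> MapReduceCompiler; "spark" routes to SparkCompiler,
  // "tez" to TezCompiler, exactly as getCompiler above shows.
  conf.setVar(HiveConf.ConfVars.HIVE_EXECUTION_ENGINE, "spark")
  println(HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE)) // prints "spark"
}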
How to upgrade HiBench's Hive from v0.14.0 to v1.2.1:
1. Edit the Maven config file HiBench/hadoopbench/sql/pom.xml:
<properties>
<repo>http://archive.apache.org</repo>
<file>dist/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz</file>
</properties>
2. Rebuild HiBench: sudo mvn -Phadoopbench -Dspark=2.1 -Dscala=2.11 clean package
If a rebuild of an already-built tree fails and Maven suggests resuming with mvn <goals> -rf :xxx (for example :mahout), resume from that module:
sudo mvn -Phadoopbench -Dspark=2.1 -Dscala=2.11 clean package -rf :mahout
3. Update the HiBench config file HiBench/conf/hibench.conf:
hibench.hive.home ${hibench.home}/hadoopbench/sql/target/${hibench.hive.release}
hibench.hive.release apache-hive-1.2.1-bin
hibench.hivebench.template.dir ${hibench.home}/hadoopbench/sql/hive_template
IV. Related source code
The data-generation sources live under HiBench/autogen/src/main/java/HiBench: HiveData.java, DataGen.java, and DataOptions.java.
V. Errors encountered at runtime
1. Running the HiBench scan workload:
~/HiBench/bin/workloads/sql/scan/prepare/prepare.sh fails with:
19/01/04 13:47:12 INFO mapreduce.Job: Task Id : attempt_1546410718870_0023_r_000019_0, Status : FAILED
Error: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.io.SequenceFile.CompressionType.block
at java.lang.Enum.valueOf(Enum.java:238)
at org.apache.hadoop.io.SequenceFile$CompressionType.valueOf(SequenceFile.java:222)
Fix: comment out the file-output compression settings in Hadoop/etc/hadoop/mapred-site.xml:
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.type</name>
<value>block</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
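The root cause: SequenceFile.CompressionType is a Java enum whose constants are NONE, RECORD, and BLOCK, and Enum.valueOf is case-sensitive, so the lower-case value block above cannot be resolved. A quick sketch of the failure (assuming hadoop-common on the classpath); keeping the properties but spelling the value as upper-case BLOCK should therefore also work as a fix:

import org.apache.hadoop.io.SequenceFile.CompressionType

object EnumCaseDemo extends App {
  println(CompressionType.valueOf("BLOCK")) // ok: prints BLOCK
  println(CompressionType.valueOf("block")) // IllegalArgumentException, as in the job log
}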