Configuring Apache Tez in Hive to Run MR Jobs

王英奕
2023-12-01

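Preface

  • Hive: 2.3.0
  • Hadoop: 2.7.7
  • JDK: 1.8.0_221
  • Tez: 0.9.1
  • Apache Tez is configured here only for the MR jobs that Hive executes, not as a cluster-wide Hadoop setting, and the pre-built binary tarball is used
  • Hadoop-Tez compatibility: Apache Tez 0.9.0 uses some Hadoop 2.7.0 development packages, so on Hadoop 2.7.x it is advisable to use Tez 0.9.0 or later to avoid compatibility problems. For Hadoop 2.6.x, the official recommendation is Tez 0.8.3 or later
  • Hive-Tez version compatibility: Hive-Tez Compatibility
  • Install/Deploy Instructions for Tez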
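
The compatibility note above can be turned into a small lookup. This is only a sketch: the function name `min_tez_for_hadoop` is hypothetical, and it covers just the two Hadoop lines mentioned in the note, reporting anything else as unknown.

```shell
# Hypothetical helper: map a Hadoop version to the minimum Tez version
# named in the compatibility note above. Versions outside 2.6.x / 2.7.x
# are not covered by that note, so the function reports "unknown".
min_tez_for_hadoop() {
    case "$1" in
        2.7.*) echo "0.9.0" ;;
        2.6.*) echo "0.8.3" ;;
        *)     echo "unknown" ;;
    esac
}

min_tez_for_hadoop "2.7.7"
```

For the Hadoop 2.7.7 cluster used in this article, the call reports 0.9.0, which the Tez 0.9.1 tarball satisfies.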

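Configuration steps

1) Download the pre-built Tez tarball and extract it

Download locations:

Apache Tez downloads for all versions: Apache TEZ Releases

Alternative download location: Apache Tez

Extract and rename: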

tar -xzvf apache-tez-0.9.1-bin.tar.gz -C /opt/module/
mv /opt/module/apache-tez-0.9.1-bin /opt/module/tez-0.9.1

Note: Tez (client) must be installed on the same node as Hive (client)
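
Before moving on, it can help to confirm that the extracted directory contains what the later steps use. The sketch below assumes the layout this article relies on (a `lib/` directory and `share/tez.tar.gz`); the function name `check_tez_layout` is hypothetical.

```shell
# Hypothetical helper: confirm the extracted directory has the pieces
# the later steps rely on (lib/ for step 2, share/tez.tar.gz for step 3).
check_tez_layout() {
    tez_home="$1"
    [ -d "${tez_home}/lib" ] || { echo "missing ${tez_home}/lib"; return 1; }
    [ -f "${tez_home}/share/tez.tar.gz" ] || { echo "missing ${tez_home}/share/tez.tar.gz"; return 1; }
    echo "layout ok: ${tez_home}"
}

# Report instead of aborting, so the snippet is safe to paste anywhere.
check_tez_layout /opt/module/tez-0.9.1 || echo "check the extraction step"
```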

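2) Replace the Hadoop jars under tez/lib

This step avoids jar version conflicts. All of these jars are later added to HADOOP_CLASSPATH, and if the mismatched versions are left in place, jobs that Hive runs with the MR engine fail with version-conflict errors

Delete the Hadoop jars under tez-0.9.1/lib:

rm /opt/module/tez-0.9.1/lib/hadoop-mapreduce-client-core-2.7.0.jar
rm /opt/module/tez-0.9.1/lib/hadoop-mapreduce-client-common-2.7.0.jar

Copy the corresponding jars from the cluster's Hadoop installation into tez-0.9.1/lib (in testing, skipping this copy also worked):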

cp /opt/module/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.7.jar /opt/module/tez-0.9.1/lib
cp /opt/module/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.7.7.jar /opt/module/tez-0.9.1/lib
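
To see which Hadoop jars in the Tez lib directory still differ from the cluster version, a check along these lines may be useful. It is a sketch only: `list_mismatched_hadoop_jars` is a hypothetical name, and it assumes jar files are named `hadoop-<module>-<version>.jar`.

```shell
# Hypothetical helper: print hadoop-*.jar files in a lib directory whose
# version suffix differs from the cluster's Hadoop version.
list_mismatched_hadoop_jars() {
    lib_dir="$1"
    hadoop_version="$2"
    for jar in "${lib_dir}"/hadoop-*.jar; do
        [ -e "${jar}" ] || continue          # glob matched nothing
        case "${jar}" in
            *-"${hadoop_version}".jar) ;;    # matches the cluster version
            *) basename "${jar}" ;;
        esac
    done
}

list_mismatched_hadoop_jars /opt/module/tez-0.9.1/lib 2.7.7
```

Any file names it prints are candidates for the delete-and-copy procedure described in this step.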

3) Upload the tarball tez/share/tez.tar.gz to HDFS and set its permissions

hadoop fs -rm -r -f /apps/tez-0.9.1
hadoop fs -mkdir -p /apps/tez-0.9.1
hadoop fs -put -f /opt/module/tez-0.9.1/share/tez.tar.gz /apps/tez-0.9.1
hadoop fs -chmod -R 777 /apps

PS: If you build Tez from its Maven source instead, upload the tarball tez-dist/target/tez-x.y.z-SNAPSHOT.tar.gz to HDFS
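
After the upload, the presence of the tarball can be checked with `hadoop fs -test -e`, which returns 0 when the path exists. The wrapper name `verify_tez_upload` is hypothetical, and the check only runs where a `hadoop` client is on the PATH.

```shell
# Hypothetical helper: succeed only if the tarball is present in HDFS.
# `hadoop fs -test -e` returns 0 when the path exists.
verify_tez_upload() {
    hdfs_path="$1"
    if hadoop fs -test -e "${hdfs_path}"; then
        echo "found ${hdfs_path}"
    else
        echo "not found: ${hdfs_path}"
        return 1
    fi
}

# Only run the check where the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
    verify_tez_upload /apps/tez-0.9.1/tez.tar.gz || echo "re-run the upload step"
fi
```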

4) Create tez-site.xml in the hive/conf directory

Create tez-site.xml under hive/conf and set the following parameters

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Location of the jars Tez depends on: the HDFS path of the uploaded Tez tarball -->
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/apps/tez-0.9.1/tez.tar.gz</value>
        <description>
            String value to a file path.
            The location of the Tez libraries which will be localized for DAGs.
        </description>
        <type>string</type>
    </property>

    <!-- Whether to use the Hadoop libraries deployed on the cluster. If false, the Hadoop dependencies contained in tez.lib.uris are used -->
    <property>
        <name>tez.use.cluster.hadoop-libs</name>
        <value>false</value>
        <description>
            Boolean value.
            Specify whether hadoop libraries required to run Tez should be the ones deployed on the cluster.
            This is disabled by default - with the expectation being that tez.lib.uris has a complete
            tez-deployment which contains the hadoop libraries.
        </description>
        <type>boolean</type>
    </property>

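    <!-- This setting takes effect only when tez.am.launch.cmd-opts is not set.
    It is the fraction of a container's memory that a Tez job may use as JVM heap.
    Lower it when YARN containers have little memory, and raise it when they have more. -->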
    <property>
        <name>tez.container.max.java.heap.fraction</name>
        <value>0.2</value>
        <description>
            Double value. Tez automatically determines the Xmx for the JVMs used to run
            Tez tasks and app masters. This feature is enabled if the user has not
            specified Xmx or Xms values in the launch command opts. Doing automatic Xmx
            calculation is preferred because Tez can determine the best value based on
            actual allocation of memory to tasks the cluster. The value if used as a
            fraction that is applied to the memory allocated Factor to size Xmx based
            on container memory size. Value should be greater than 0 and less than 1.

            Set this value to -1 to allow Tez to use different default max heap fraction
            for different container memory size. Current policy is to use 0.7 for container
            smaller than 4GB and use 0.8 for larger container.
        </description>
        <type>float</type>
    </property>

    <!-- Memory for the Tez ApplicationMaster, in MB -->
    <!-- The host has only 1.5 GB of memory available; the value here is left at the default -->
    <!-- Default: 1024 -->
    <property>
        <name>tez.am.resource.memory.mb</name>
        <value>1024</value>
        <description>
            Int value. The amount of memory in MB to be used by the AppMaster
        </description>
        <type>integer</type>
    </property>

    <!-- Memory for each Tez task, in MB -->
    <!-- Reduced because the host has only 1.5 GB of memory available -->
    <!-- Default: 1024 -->
    <property>
        <name>tez.task.resource.memory.mb</name>
        <value>512</value>
        <description>
            Int value. The amount of memory in MB to be used by tasks. This applies to 
            all tasks across all vertices. Setting it to the same value for all tasks 
            is helpful for container reuse and thus good for performance typically.
        </description>
        <type>integer</type>
    </property>

</configuration>
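
To get a feel for what tez.container.max.java.heap.fraction means in practice, the heap size can be estimated as container memory multiplied by the fraction. The sketch below assumes the product is truncated to whole megabytes, which may not match Tez's exact rounding; `estimate_heap_mb` is a hypothetical name.

```shell
# Hypothetical helper: estimate the -Xmx (in MB) that results from a
# container size and tez.container.max.java.heap.fraction.
# Assumption: the result is truncated to a whole number of MB.
estimate_heap_mb() {
    awk -v mem="$1" -v frac="$2" 'BEGIN { printf "%d\n", mem * frac }'
}

estimate_heap_mb 512 0.2     # task container from the config above
estimate_heap_mb 1024 0.2    # ApplicationMaster container
```

With the values configured above, this prints 102 and 204, so a fraction of 0.2 leaves each JVM only a small heap.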

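5) Set the Tez environment variables on the Hive client node and extend HADOOP_CLASSPATH

Set the variables at the end of conf/hive-env.sh under the Hive installation directory, so that they are loaded automatically every time Hive starts.

  • TEZ_CONF_DIR: the directory that contains the Tez configuration file tez-site.xml
  • TEZ_JARS: the directory where the Tez tarball was extracted
  • HADOOP_CLASSPATH: the classpath Hadoop uses at runtime

# Tez classpath
# TEZ_CONF_DIR must be the directory that contains tez-site.xml, because a
# classpath entry cannot be a single XML file. Copy the file from step 4 here,
# or point this variable at hive/conf instead.
TEZ_CONF_DIR=/opt/module/tez-0.9.1/conf
TEZ_JARS=/opt/module/tez-0.9.1
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*"
# Extra jars, such as the hadoop-lzo dependencies, can be supplied through
# HIVE_AUX_JARS_PATH (a directory or a comma-separated list of jars).
# Here all extra jars are placed under /opt/libs
export HIVE_AUX_JARS_PATH=/opt/libs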
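
When HADOOP_CLASSPATH is empty beforehand, simple concatenation leaves a leading ":" in the result. The sketch below shows one way to avoid that with the `${var:+...}` expansion; `build_tez_classpath` is a hypothetical helper, not part of Hive or Tez.

```shell
# Hypothetical helper: append the Tez entries to an existing classpath.
# "${existing:+${existing}:}" expands to "<existing>:" only when
# <existing> is non-empty, which avoids a leading ":" (an empty entry).
build_tez_classpath() {
    existing="$1"
    conf_dir="$2"
    tez_home="$3"
    printf '%s\n' "${existing:+${existing}:}${conf_dir}:${tez_home}/*:${tez_home}/lib/*"
}

build_tez_classpath "" /opt/module/tez-0.9.1/conf /opt/module/tez-0.9.1
```

In hive-env.sh the result could be assigned with `export HADOOP_CLASSPATH="$(build_tez_classpath "${HADOOP_CLASSPATH}" "${TEZ_CONF_DIR}" "${TEZ_JARS}")"`.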

6) Start or restart Hadoop

7) Start the Hive CLI (or start the hiveserver service and beeline) and set Hive's execution engine

hive> set hive.execution.engine=tez;

8) Run a test query and check the output

hive> SELECT deptno, avg(sal) as avg_sal FROM emp group by deptno;
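
The same test can also be run non-interactively. This sketch uses the Hive CLI's `--hiveconf` and `-e` options to select Tez for a single invocation; the wrapper name `run_on_tez` is hypothetical, and the `emp` table is the one assumed by the query above.

```shell
# Hypothetical wrapper: run one query with Tez selected for that
# invocation only, using the Hive CLI's --hiveconf and -e options.
run_on_tez() {
    hive --hiveconf hive.execution.engine=tez -e "$1"
}

# Only run the query where the hive client is installed.
if command -v hive >/dev/null 2>&1; then
    run_on_tez "SELECT deptno, avg(sal) AS avg_sal FROM emp GROUP BY deptno;"
fi
```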

9) Make Tez the default engine for Hive's MR jobs (optional)

Set the parameter hive.execution.engine to tez in hive/conf/hive-site.xml, so that Hive uses Tez to execute MR jobs by default:

    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
        <description>
            Expects one of [mr, tez, spark].
            Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
            remains the default engine for historical reasons, it is itself a historical engine
            and is deprecated in Hive 2 line. It may be removed without further warning.
        </description>
    </property>
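
To confirm what hive-site.xml currently sets, the value can be read back from the file. This is a rough sketch rather than a real XML parser: `get_xml_property` is a hypothetical helper that assumes `<name>` and `<value>` each sit on their own line, and the fallback path `/opt/module/hive/conf` is an assumption to be replaced with the real hive/conf location.

```shell
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop-style XML file. Assumes <name> and <value> each sit on
# their own line, as in the snippets in this article.
get_xml_property() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
            print; exit
        }
    ' "$1"
}

# The fallback path is an assumption; substitute the real hive/conf location.
conf_file="${HIVE_CONF_DIR:-/opt/module/hive/conf}/hive-site.xml"
if [ -f "${conf_file}" ]; then
    get_xml_property "${conf_file}" hive.execution.engine
fi
```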

10) Switch back to MR for job execution (optional)

hive> set hive.execution.engine=mr;

End~
