
Hive (2): Compression (2) ---- Compression configuration and testing in HDFS/Hive

黎奇略
2023-12-01
  • Configuration

  1. In core-site.xml, list the compression codecs the cluster should support
<property>
	<name>io.compression.codecs</name>
	<value>
	org.apache.hadoop.io.compress.GzipCodec,
	org.apache.hadoop.io.compress.DefaultCodec,
	org.apache.hadoop.io.compress.BZip2Codec,
	com.hadoop.compression.lzo.LzoCodec,
	com.hadoop.compression.lzo.LzopCodec,
	org.apache.hadoop.io.compress.Lz4Codec,
	org.apache.hadoop.io.compress.SnappyCodec
	</value>
</property>
  2. Then, in mapred-site.xml, configure the compression actually used
<!-- enable output compression -->
<property>
	<name>mapreduce.output.fileoutputformat.compress</name>
	<value>true</value>
</property>
<!-- compression codec -->
<property>
	<name>mapreduce.output.fileoutputformat.compress.codec</name>
	<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
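The codec class configured here also determines the suffix of the files the job writes. A small illustrative Python mapping of the conventional extensions (assumed from common usage and from the listings later in this post, not read from the Hadoop source):

```python
# Conventional default file extensions produced by each codec class
# (illustrative mapping; verify against your Hadoop distribution).
CODEC_EXTENSIONS = {
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.DefaultCodec": ".deflate",
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.Lz4Codec": ".lz4",
    "org.apache.hadoop.io.compress.SnappyCodec": ".snappy",
    "com.hadoop.compression.lzo.LzoCodec": ".lzo_deflate",
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
}

def output_name(part, codec):
    """Name a reducer output file would get under the given codec class."""
    return part + CODEC_EXTENSIONS[codec]

print(output_name("part-r-00000",
                  "org.apache.hadoop.io.compress.BZip2Codec"))  # part-r-00000.bz2
```

This is why the BZip2 wordcount test below produces `part-r-00000.bz2`, and why switching between LzoCodec and LzopCodec changes the suffix between `.lzo_deflate` and `.lzo`.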
  • Compression configuration in HDFS

    1. Input compression

    The compression format of the files in HDFS

    2. Intermediate compression

Old: deprecated property; New: its replacement

Property | Description | Default
mapred.compress.map.output (old); mapreduce.map.output.compress (new) | Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. | false
mapred.map.output.compression.codec (old); mapreduce.map.output.compress.codec (new) | If the map outputs are compressed, how should they be compressed? | org.apache.hadoop.io.compress.DefaultCodec
  • Example
<!-- enable map output compression -->
<property>
	<name>mapreduce.map.output.compress</name>
	<value>true</value>
</property>
<!-- compression codec -->
<property>
	<name>mapreduce.map.output.compress.codec</name>
	<value>org.apache.hadoop.io.compress.SnappyCodec</value>
	<description>
		This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress*
	</description>
</property>

    3. Final (job output) compression

Property | Description | Default
mapred.output.compress (old); mapreduce.output.fileoutputformat.compress (new) | Should the job outputs be compressed? | false
mapred.output.compression.codec (old); mapreduce.output.fileoutputformat.compress.codec (new) | If the job outputs are compressed, how should they be compressed? | org.apache.hadoop.io.compress.DefaultCodec
  • Example:
<!-- enable output compression -->
<property>
	 <name>mapreduce.output.fileoutputformat.compress</name>
	<value>true</value> 
</property>
<!-- compression codec -->
<property> 
	<name>mapreduce.output.fileoutputformat.compress.codec</name> 
	<value>org.apache.hadoop.io.compress.BZip2Codec</value> 
</property> 
  • Compression configuration in Hive

    Official docs: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-hive-site.xmlandhive-default.xml.template

  • Output compression
  1. Enable switches

Stage | Property | Description | Default
Final compression | hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not | false
Intermediate compression | hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false
  2. Codec configuration: same as for HDFS above

Reference blog: https://blog.csdn.net/yu0_zhang0/article/details/79524842

  • Testing

Test data download: https://blog.csdn.net/huonan_123/article/details/84784811

I. HDFS demo test

  1. BZip2 compression
  • Configure core-site.xml

List every compression codec the project will use

<property>
	<name>io.compression.codecs</name>
	<value>
		org.apache.hadoop.io.compress.GzipCodec,
		org.apache.hadoop.io.compress.DefaultCodec,
		org.apache.hadoop.io.compress.BZip2Codec
	</value>
</property>
  • Configure mapred-site.xml

Enable output compression and specify the output codec

<property>
	<name>mapreduce.output.fileoutputformat.compress</name>
	<value>true</value>
</property>
<property>
	<name>mapreduce.output.fileoutputformat.compress.codec</name>
	<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
  • Prepare the data
[hadoop@hadoop001 data]$ hadoop fs -mkdir -p /user/hadoop/test/hive/input
18/11/30 18:17:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[hadoop@hadoop001 data]$ hadoop fs -put WC2.txt /user/hadoop/test/hive/input

[hadoop@hadoop001 data]$ hadoop fs -text test/hive/input/*
18/11/30 18:57:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello world
hello java
hello spark
hello spark

  • Run the job
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar  wordcount test/hive/input test/hive/output2

  • Results
[hadoop@hadoop001 data]$ hadoop fs -ls test/hive/output2
18/11/30 18:48:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-11-30 18:48 test/hive/output2/_SUCCESS
-rw-r--r--   1 hadoop supergroup         66 2018-11-30 18:48 test/hive/output2/part-r-00000.bz2

[hadoop@hadoop001 data]$ hadoop fs -text test/hive/output2/*
18/11/30 18:49:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/30 18:49:14 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
hello	4
java	1
spark	2
world	1
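The same flow can be sketched without a cluster. The standard-library snippet below compresses the sample input with bz2, decompresses it transparently the way `hadoop fs -text` does for a `.bz2` file, and word-counts it, reproducing the result above (illustrative only; this is not how the MapReduce job itself runs):

```python
import bz2
from collections import Counter

lines = ["hello world", "hello java", "hello spark", "hello spark"]

# Write the input as bz2-compressed bytes (stand-in for the .bz2 file in HDFS).
compressed = bz2.compress("\n".join(lines).encode("utf-8"))

# `hadoop fs -text` detects the .bz2 suffix and decompresses on the fly;
# bz2.decompress plays that role here.
text = bz2.decompress(compressed).decode("utf-8")

# Word count, like the hadoop-mapreduce-examples wordcount job.
counts = Counter(word for line in text.splitlines() for word in line.split())
print(sorted(counts.items()))
# [('hello', 4), ('java', 1), ('spark', 2), ('world', 1)]
```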

  2. LZO format
  • Compress to .lzo format with the lzop command
  • Prerequisite: the LZO compression libraries are installed
[hadoop@hadoop001 hive-test-data]$ lzop page_views.dat

[hadoop@hadoop001 hive-test-data]$ ll
total 213096
-rw-rw-r-- 1 hadoop hadoop       140 Dec  5 13:04 lzo_names.txt
-rw-r--r-- 1 hadoop hadoop 190149930 Dec  5 11:29 page_views(181M).dat
-rw-r--r-- 1 hadoop hadoop  19014993 Nov 29 09:39 page_views.dat
-rw-r--r-- 1 hadoop hadoop   9029650 Nov 29 09:39 page_views.dat.lzo
-rw-rw-r-- 1 hadoop hadoop        51 Nov 29 09:47 wc.txt

  • Upload to HDFS
[hadoop@hadoop001 hive-test-data]$ hadoop fs -put page_views.dat.lzo  /user/hive/warehouse/ruozedata_test.db/lzo/

  • Create the index
[hadoop@hadoop001 hive-test-data]$ hadoop jar /home/hadoop/app/hadoop-lzo-master/build/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/ruozedata_test.db/lzo

  • Inspect

page_views.dat.lzo is the compressed file
page_views.dat.lzo.index is the index

[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/lzo
-rw-r--r--   1 hadoop supergroup      8.6 M 2018-12-05 14:26 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo
-rw-r--r--   1 hadoop supergroup        584 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo.index
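The `.lzo.index` file records the byte offsets at which independent LZO blocks start; that is what lets the input format cut one `.lzo` file into several splits. A toy sketch of the idea with made-up offsets (the real index is a sequence of 8-byte big-endian offsets, and `snap_splits` is a hypothetical helper, not part of hadoop-lzo):

```python
import bisect

def snap_splits(file_len, desired_split, block_offsets):
    """Snap naive split boundaries forward to the next LZO block start,
    roughly what an indexed LZO input format does when forming splits."""
    splits, start = [], 0
    while start < file_len:
        end = start + desired_split
        if end < file_len:
            # Move the boundary to the next indexed block start.
            i = bisect.bisect_left(block_offsets, end)
            end = block_offsets[i] if i < len(block_offsets) else file_len
        else:
            end = file_len
        splits.append((start, end))
        start = end
    return splits

# Hypothetical index: a block starts every 64,000 bytes in a 300,000-byte file.
offsets = list(range(0, 300_000, 64_000))
print(snap_splits(300_000, 128_000, offsets))
# [(0, 128000), (128000, 256000), (256000, 300000)]
```

Without the index, no safe block boundaries are known, so the whole file must go to a single mapper, which is exactly what the mapper-count comparison later in this post shows.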

II. Hive demo test

  1. BZip2 compression
  • Configuration
hive (default)> set hive.exec.compress.output;
hive.exec.compress.output=false
hive (default)> set hive.exec.compress.output=true;
hive (default)> set hive.exec.compress.output;
hive.exec.compress.output=true

hive (default)> set mapreduce.output.fileoutputformat.compress.codec;
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

  • Load the data
hive (default)> create table page_views_BZip2 row format delimited fields terminated by '\t' as select * from page_views;
  • Inspect
[hadoop@hadoop001 conf]$ hadoop fs -ls /user/hive/warehouse/page_views_bzip2
18/11/30 19:32:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x   1 hadoop supergroup    3722768 2018-11-30 19:31 /user/hive/warehouse/page_views_bzip2/000000_0.bz2

  2. LZO compression
Method 1
  • Prepare the data
[hadoop@hadoop001 hive-test-data]$ lzop page_views.dat

[hadoop@hadoop001 hive-test-data]$ ll
total 213096
-rw-r--r-- 1 hadoop hadoop  19014993 Nov 29 09:39 page_views.dat
-rw-r--r-- 1 hadoop hadoop   9029650 Nov 29 09:39 page_views.dat.lzo
-rw-rw-r-- 1 hadoop hadoop        51 Nov 29 09:47 wc.txt
  • Create the Hive table
  • Note: specify the storage format (STORED AS):

    INPUTFORMAT
    'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
CREATE TABLE page_views_LzoCodec(
  `track_time` string, 
  `url` string, 
  `session_id` string, 
  `referer` string, 
  `ip` string, 
  `end_user_id` string, 
  `city_id` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  • Load into Hive
load data local inpath '/home/hadoop/data/hive-test-data/page_views.dat.lzo' overwrite into table page_views_LzoCodec ;

  • Inspect
[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec
-rwxr-xr-x   1 hadoop supergroup      8.6 M 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec/page_views.dat.lzo

Method 2

  • Configuration
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
  • Create the table
CREATE TABLE page_views_LzoCodec2(
  `track_time` string, 
  `url` string, 
  `session_id` string, 
  `referer` string, 
  `ip` string, 
  `end_user_id` string, 
  `city_id` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  • Run the job
insert overwrite table page_views_LzoCodec2 select * from page_views_LzoCodec;

  • Inspect
[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo
-rw-r--r--   1 hadoop supergroup      8.6 M 2018-12-05 14:26 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo
-rw-r--r--   1 hadoop supergroup        584 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo.index
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views
-rwxr-xr-x   1 hadoop supergroup     18.1 M 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views/page_views.dat
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec
-rwxr-xr-x   1 hadoop supergroup      8.6 M 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec/page_views.dat.lzo
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2
-rwxr-xr-x   1 hadoop supergroup      8.6 M 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2/000000_0.lzo_deflate
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 15:49 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec3
-rwxr-xr-x   1 hadoop supergroup      8.6 M 2018-12-05 15:49 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec3/000000_0.lzo


Note:

The file suffixes are .lzo_deflate and .lzo:
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
produces files ending in .lzo_deflate;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
produces files ending in .lzo.

Method 3 (create an index)

Important:

mapred.min.split.size and mapred.max.split.size
mapred.min.split.size is the minimum size of each map split, and a split cannot exceed mapred.max.split.size or the block size, so the split size is Math.max(minSize, Math.min(maxSize, blockSize))
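The formula can be exercised directly. A minimal sketch showing how the min/max bounds move the split size (and hence the mapper count) around the block size; the sizes are examples, not values read from any cluster:

```python
def split_size(min_size, max_size, block_size):
    # FileInputFormat's rule: Math.max(minSize, Math.min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024

# With the defaults (min = 1 byte, max = Long.MAX_VALUE),
# the split size is simply the block size:
print(split_size(1, 2**63 - 1, 128 * MB) // MB)        # 128

# Raising the minimum forces larger splits (fewer mappers):
print(split_size(256 * MB, 2**63 - 1, 128 * MB) // MB)  # 256

# Lowering the maximum forces smaller splits (more mappers):
print(split_size(1, 32 * MB, 128 * MB) // MB)           # 32
```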

Overview
page_views_lzocodec_181m_index_test stores the test data in LZO format
page_views_lzocodec_181m_index_no is loaded from it without building an LZO index first
page_views_lzocodec_181m_index_yes is loaded from it after building an LZO index
  • Create the table
CREATE TABLE page_views_lzocodec_181m_index_test(
  `track_time` string, 
  `url` string, 
  `session_id` string, 
  `referer` string, 
  `ip` string, 
  `end_user_id` string, 
  `city_id` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';


insert overwrite table page_views_lzocodec_181m_index_test select * from page_views_181;

  • Load without creating an index
CREATE TABLE page_views_lzocodec_181m_index_no(
  `track_time` string, 
  `url` string, 
  `session_id` string, 
  `referer` string, 
  `ip` string, 
  `end_user_id` string, 
  `city_id` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

 insert overwrite table page_views_lzocodec_181m_index_no select * from page_views_lzocodec_181m_index_test;
  • Check the map count of the MapReduce job (number of mappers: 1)
hive (ruozedata_test)>  insert overwrite table page_views_lzocodec_181m_index_no select * from page_views_lzocodec_181m_index_test;
Query ID = hadoop_20181205150404_1a865910-88bb-4b53-b47d-da6d6716a692
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1543985301484_0019, Tracking URL = http://hadoop001:8088/proxy/application_1543985301484_0019/
Kill Command = /home/hadoop/app/compile/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1543985301484_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
  • Create the index
hadoop jar /home/hadoop/app/hadoop-lzo-master/build/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test
  • Load into the table
CREATE TABLE page_views_lzocodec_181m_index_yes(
  `track_time` string, 
  `url` string, 
  `session_id` string, 
  `referer` string, 
  `ip` string, 
  `end_user_id` string, 
  `city_id` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

insert overwrite table page_views_lzocodec_181m_index_yes select * from page_views_lzocodec_181m_index_test;


  • Check the map count of the MapReduce job (number of mappers: 2)
hive (ruozedata_test)> 
                     >  insert overwrite table page_views_lzocodec_181m_index_yes select * from page_views_lzocodec_181m_index_test;
Query ID = hadoop_20181205150404_1a865910-88bb-4b53-b47d-da6d6716a692
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1543985301484_0020, Tracking URL = http://hadoop001:8088/proxy/application_1543985301484_0020/
Kill Command = /home/hadoop/app/compile/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1543985301484_0020
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
  • Inspect HDFS

With the LZO index in place, the table's data is split into two files

[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo
-rw-r--r--   1 hadoop supergroup      8.6 M 2018-12-05 14:26 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo
-rw-r--r--   1 hadoop supergroup        584 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo.index
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views
-rwxr-xr-x   1 hadoop supergroup     18.1 M 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views/page_views.dat
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 16:17 /user/hive/warehouse/ruozedata_test.db/page_views_181
-rwxr-xr-x   1 hadoop supergroup    181.3 M 2018-12-05 16:17 /user/hive/warehouse/ruozedata_test.db/page_views_181/page_views(181M).dat
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec
-rwxr-xr-x   1 hadoop supergroup      8.6 M 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec/page_views.dat.lzo
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2
-rwxr-xr-x   1 hadoop supergroup      8.6 M 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2/000000_0.lzo_deflate
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 17:18 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_no
-rwxr-xr-x   1 hadoop supergroup     85.7 M 2018-12-05 17:18 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_no/000000_0.lzo
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 17:22 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test
-rwxr-xr-x   1 hadoop supergroup     85.7 M 2018-12-05 17:16 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test/000000_0.lzo
-rw-r--r--   1 hadoop supergroup      6.1 K 2018-12-05 17:22 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test/000000_0.lzo.index
drwxr-xr-x   - hadoop supergroup          0 2018-12-05 17:23 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_yes
-rwxr-xr-x   1 hadoop supergroup     42.9 M 2018-12-05 17:23 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_yes/000000_0.lzo
-rwxr-xr-x   1 hadoop supergroup     42.7 M 2018-12-05 17:23 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_yes/000001_0.lzo


Summary:

Compression in Hadoop applies in three places: input, intermediate data, and output.
Overall flow: hdfs ==> map ==> shuffle ==> reduce
  • Use Compressed Map Input:
    When a MapReduce job reads large files from HDFS, use compression and pick a splittable format (Bzip2, LZO) so the data can be processed in parallel; this improves efficiency and reduces disk-read time. Also choose a suitable storage format such as Sequence Files, RC, or ORC.
  • Compress Intermediate Data:
    Map output becomes the reducers' input and passes through the shuffle: it is written to a circular in-memory buffer and then spilled to local disk. Compressing it reduces the space the spill files occupy and speeds up data transfer; prefer fast codecs such as Snappy and LZO.
  • Compress Reducer Output:
    For archival, or when chaining MapReduce jobs (one job's output is the next job's input), compression reduces storage and speeds up transfer. For archival, use a high-ratio codec (Gzip, Bzip2); if the output feeds another job, consider whether it needs to be splittable when choosing the codec.
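To make the speed-versus-ratio trade-off concrete, a small standard-library experiment (the numbers vary with the data and machine; this only illustrates the comparison between two of the codecs above, not HDFS behavior):

```python
import bz2
import gzip
import time

# Repetitive text, loosely modeled on the page_views test data.
data = ("hello spark\thello hadoop\n" * 50_000).encode("utf-8")

for name, compress in [("gzip", gzip.compress), ("bz2", bz2.compress)]:
    t0 = time.perf_counter()
    out = compress(data)
    # Report size reduction and elapsed time for each codec.
    print(f"{name}: {len(data)} -> {len(out)} bytes "
          f"({time.perf_counter() - t0:.3f}s)")
```

On most data, bzip2 compresses more tightly but more slowly than gzip, which is why the summary above recommends fast codecs (Snappy, LZO) for intermediate data and high-ratio codecs (Gzip, Bzip2) for archival output.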