<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.Lz4Codec,
org.apache.hadoop.io.compress.SnappyCodec
</value>
</property>
<!-- Whether to enable compression -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<!-- Compression codec -->
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
Old: the deprecated property; new: the property that replaces it.
Property | Description | Default |
---|---|---|
mapred.compress.map.output (old); mapreduce.map.output.compress (new) | Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. | false |
mapred.map.output.compression.codec (old); mapreduce.map.output.compress.codec (new) | If the map outputs are compressed, how should they be compressed? | org.apache.hadoop.io.compress.DefaultCodec |
<!-- Whether to enable compression -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<!-- Compression codec -->
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>
This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress*
</description>
</property>
Name | Description | Default |
---|---|---|
mapred.output.compress (old); mapreduce.output.fileoutputformat.compress (new) | Should the job outputs be compressed? | false |
mapred.output.compression.codec (old); mapreduce.output.fileoutputformat.compress.codec (new) | If the job outputs are compressed, how should they be compressed? | org.apache.hadoop.io.compress.DefaultCodec |
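The old-to-new renames listed in the tables above can be captured in a small lookup. The following is only a sketch: `DEPRECATED_KEYS` and `modernize` are hypothetical names introduced for illustration, and the mapping covers just the four properties mentioned in this post.

```python
# Deprecated (old) -> replacement (new) property names, copied from the tables above.
DEPRECATED_KEYS = {
    "mapred.compress.map.output": "mapreduce.map.output.compress",
    "mapred.map.output.compression.codec": "mapreduce.map.output.compress.codec",
    "mapred.output.compress": "mapreduce.output.fileoutputformat.compress",
    "mapred.output.compression.codec": "mapreduce.output.fileoutputformat.compress.codec",
}


def modernize(settings):
    """Return a copy of `settings` with deprecated keys renamed (hypothetical helper)."""
    return {DEPRECATED_KEYS.get(key, key): value for key, value in settings.items()}


if __name__ == "__main__":
    old_style = {
        "mapred.output.compress": "true",
        "mapred.output.compression.codec": "org.apache.hadoop.io.compress.BZip2Codec",
    }
    for key, value in modernize(old_style).items():
        print(f"{key}={value}")
```

Keys that are not in the lookup pass through unchanged, so the helper can be run over a mixed set of settings.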
<!-- Whether to enable compression -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<!-- Compression codec -->
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
Compression stage | Name | Description | Default |
---|---|---|---|
Final output | hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not. | false |
Intermediate output | hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false |
Reference blog: https://blog.csdn.net/yu0_zhang0/article/details/79524842
Test data download: https://blog.csdn.net/huonan_123/article/details/84784811
1. HDFS demo test
List every compression codec the project will use:
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec
</value>
</property>
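The value of io.compression.codecs is a comma-separated list of codec class names. As a quick sanity check on a config fragment before deploying it, the list can be pulled out with a few lines of Python. This is a minimal sketch, not Hadoop's own parser: `parse_codecs` is a hypothetical helper, and it simply trims whitespace and ignores empty entries such as the one left by a trailing comma.

```python
import xml.etree.ElementTree as ET

# A <property> fragment shaped like the one above (trailing comma included on purpose).
PROPERTY_XML = """
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
</value>
</property>
"""


def parse_codecs(property_xml):
    """Return the codec class names listed in a <property> fragment (hypothetical helper)."""
    prop = ET.fromstring(property_xml.strip())
    value = prop.findtext("value", default="")
    # Split on commas, trim whitespace and newlines, and skip empty entries.
    return [name.strip() for name in value.split(",") if name.strip()]


if __name__ == "__main__":
    for codec in parse_codecs(PROPERTY_XML):
        print(codec)
```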
Enable output compression and specify the output codec:
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
[hadoop@hadoop001 data]$ hadoop fs -mkdir -p /user/hadoop/test/hive/input
18/11/30 18:17:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 data]$ hadoop fs -put WC2.txt /user/hadoop/test/hive/input
[hadoop@hadoop001 data]$ hadoop fs -text test/hive/input/*
18/11/30 18:57:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello world
hello java
hello spark
hello spark
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount test/hive/input test/hive/output2
[hadoop@hadoop001 data]$ hadoop fs -ls test/hive/output2
18/11/30 18:48:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-11-30 18:48 test/hive/output2/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 66 2018-11-30 18:48 test/hive/output2/part-r-00000.bz2
[hadoop@hadoop001 data]$ hadoop fs -text test/hive/output2/*
18/11/30 18:49:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/30 18:49:14 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
hello 4
java 1
spark 2
world 1
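To double-check that the decompressed result is what wordcount should produce for this input, the same count can be reproduced locally. This is only a sketch in plain Python, assuming the four lines shown by `hadoop fs -text` above are the entire content of WC2.txt; `word_count` is a hypothetical helper.

```python
from collections import Counter

# The four input lines printed by `hadoop fs -text test/hive/input/*` above.
SAMPLE = """hello world
hello java
hello spark
hello spark"""


def word_count(text):
    """Count whitespace-separated words and return (word, count) pairs sorted by word."""
    return sorted(Counter(text.split()).items())


if __name__ == "__main__":
    for word, count in word_count(SAMPLE):
        print(word, count)
```

The pairs come out as hello 4, java 1, spark 2, world 1, matching the content read back from part-r-00000.bz2.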
[hadoop@hadoop001 hive-test-data]$ lzop page_views.dat
[hadoop@hadoop001 hive-test-data]$ ll
total 213096
-rw-rw-r-- 1 hadoop hadoop 140 Dec 5 13:04 lzo_names.txt
-rw-r--r-- 1 hadoop hadoop 190149930 Dec 5 11:29 page_views(181M).dat
-rw-r--r-- 1 hadoop hadoop 19014993 Nov 29 09:39 page_views.dat
-rw-r--r-- 1 hadoop hadoop 9029650 Nov 29 09:39 page_views.dat.lzo
-rw-rw-r-- 1 hadoop hadoop 51 Nov 29 09:47 wc.txt
[hadoop@hadoop001 hive-test-data]$ hadoop fs -put page_views.dat.lzo /user/hive/warehouse/ruozedata_test.db/lzo/
[hadoop@hadoop001 hive-test-data]$ hadoop jar /home/hadoop/app/hadoop-lzo-master/build/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/ruozedata_test.db/lzo
page_views.dat.lzo is the compressed file.
page_views.dat.lzo.index is the index.
[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/lzo
-rw-r--r-- 1 hadoop supergroup 8.6 M 2018-12-05 14:26 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo
-rw-r--r-- 1 hadoop supergroup 584 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo.index
2. Hive demo test
hive (default)> set hive.exec.compress.output;
hive.exec.compress.output=false
hive (default)> set hive.exec.compress.output=true;
hive (default)> set hive.exec.compress.output;
hive.exec.compress.output=true
hive (default)>
hive (default)> set hive.exec.compress.output;
hive.exec.compress.output=false
hive (default)> set hive.exec.compress.output=true;
hive (default)> set mapreduce.output.fileoutputformat.compress.codec;
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
hive (default)> create table page_views_BZip2 row format delimited fields terminated by '\t' as select * from page_views;
[hadoop@hadoop001 conf]$ hadoop fs -ls /user/hive/warehouse/page_views_bzip2
18/11/30 19:32:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 3722768 2018-11-30 19:31 /user/hive/warehouse/page_views_bzip2/000000_0.bz2
[hadoop@hadoop001 hive-test-data]$ lzop page_views.dat
[hadoop@hadoop001 hive-test-data]$ ll
total 213096
-rw-r--r-- 1 hadoop hadoop 19014993 Nov 29 09:39 page_views.dat
-rw-r--r-- 1 hadoop hadoop 9029650 Nov 29 09:39 page_views.dat.lzo
-rw-rw-r-- 1 hadoop hadoop 51 Nov 29 09:47 wc.txt
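The byte counts in the listings above allow a rough comparison of the two codecs. The sketch below assumes that the page_views table used for the BZip2 test holds the same 19,014,993-byte page_views.dat that lzop compressed; `compression_ratio` is a hypothetical helper.

```python
# Sizes in bytes, copied from the listings above.
ORIGINAL_BYTES = 19014993  # page_views.dat
LZO_BYTES = 9029650        # page_views.dat.lzo
BZIP2_BYTES = 3722768      # page_views_bzip2/000000_0.bz2


def compression_ratio(compressed_bytes, original_bytes):
    """Compressed size as a fraction of the original size (smaller is better)."""
    return compressed_bytes / original_bytes


if __name__ == "__main__":
    for name, size in (("lzo", LZO_BYTES), ("bzip2", BZIP2_BYTES)):
        print(f"{name}: {compression_ratio(size, ORIGINAL_BYTES):.1%}")
```

Under that assumption the LZO file is about 47.5% of the original size and the BZip2 file about 19.6%.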
CREATE TABLE page_views_LzoCodec(
`track_time` string,
`url` string,
`session_id` string,
`referer` string,
`ip` string,
`end_user_id` string,
`city_id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
load data local inpath '/home/hadoop/data/hive-test-data/page_views.dat.lzo' overwrite into table page_views_LzoCodec ;
[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec
-rwxr-xr-x 1 hadoop supergroup 8.6 M 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec/page_views.dat.lzo
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
CREATE TABLE page_views_LzoCodec2(
`track_time` string,
`url` string,
`session_id` string,
`referer` string,
`ip` string,
`end_user_id` string,
`city_id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
insert overwrite table page_views_LzoCodec2 select * from page_views_LzoCodec;
[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/
drwxr-xr-x - hadoop supergroup 0 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo
-rw-r--r-- 1 hadoop supergroup 8.6 M 2018-12-05 14:26 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo
-rw-r--r-- 1 hadoop supergroup 584 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo.index
drwxr-xr-x - hadoop supergroup 0 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views
-rwxr-xr-x 1 hadoop supergroup 18.1 M 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views/page_views.dat
drwxr-xr-x - hadoop supergroup 0 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec
-rwxr-xr-x 1 hadoop supergroup 8.6 M 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec/page_views.dat.lzo
drwxr-xr-x - hadoop supergroup 0 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2
-rwxr-xr-x 1 hadoop supergroup 8.6 M 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2/000000_0.lzo_deflate
drwxr-xr-x - hadoop supergroup 0 2018-12-05 15:49 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec3
-rwxr-xr-x 1 hadoop supergroup 8.6 M 2018-12-05 15:49 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec3/000000_0.lzo
The output file suffix is either .lzo_deflate or .lzo, depending on the codec.
With
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
the output files end in .lzo_deflate. With
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
the output files end in .lzo.
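The relationship between codec and file suffix seen in this post can be written down as a lookup. This is a sketch limited to the three suffixes that actually appear in the listings here (.bz2, .lzo_deflate, .lzo); `CODEC_SUFFIXES` and `codec_for_file` are hypothetical names, not part of Hadoop or hadoop-lzo.

```python
# Codec class -> output file suffix, as observed in the listings in this post.
CODEC_SUFFIXES = {
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "com.hadoop.compression.lzo.LzoCodec": ".lzo_deflate",
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
}


def codec_for_file(file_name):
    """Return the codec class whose suffix matches file_name, or None (hypothetical helper)."""
    for codec, suffix in CODEC_SUFFIXES.items():
        if file_name.endswith(suffix):
            return codec
    return None


if __name__ == "__main__":
    for name in ("000000_0.lzo_deflate", "000000_0.lzo", "part-r-00000.bz2", "page_views.dat"):
        print(name, "->", codec_for_file(name))
```

A file with no known suffix, such as page_views.dat, maps to None, which corresponds to an uncompressed file in this sketch.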
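Important:
mapred.min.split.size and mapred.max.split.size
mapred.min.split.size is the minimum size of the input split handed to each map. The split size is also capped by mapred.max.split.size and by the block size, so the split size is Math.max(minSize, Math.min(maxSize, blockSize)).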
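The formula above can be tried out with a few values. The following sketch mirrors it in Python; the 128 MB block size and the other numbers are illustrative assumptions, not settings taken from this cluster.

```python
MB = 1024 * 1024


def compute_split_size(min_size, max_size, block_size):
    """Mirror of Math.max(minSize, Math.min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


if __name__ == "__main__":
    block = 128 * MB  # assumed block size, for illustration only

    # min small, max large: the split size equals the block size.
    print(compute_split_size(1, 1024 * MB, block) // MB)

    # max below the block size: splits shrink, so more maps are launched.
    print(compute_split_size(1, 64 * MB, block) // MB)

    # min above the block size: splits grow, so fewer maps are launched.
    print(compute_split_size(256 * MB, 1024 * MB, block) // MB)
```

In other words, lowering the maximum below the block size increases the number of maps, and raising the minimum above the block size reduces it.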
Table | Description |
---|---|
page_views_lzocodec_181m_index_test | Stores the test data in LZO format |
page_views_lzocodec_181m_index_no | Populated from page_views_lzocodec_181m_index_test before that table was indexed |
page_views_lzocodec_181m_index_yes | Populated from page_views_lzocodec_181m_index_test after that table was indexed |
CREATE TABLE page_views_lzocodec_181m_index_test(
`track_time` string,
`url` string,
`session_id` string,
`referer` string,
`ip` string,
`end_user_id` string,
`city_id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
insert overwrite table page_views_lzocodec_181m_index_test select * from page_views_181;
CREATE TABLE page_views_lzocodec_181m_index_no(
`track_time` string,
`url` string,
`session_id` string,
`referer` string,
`ip` string,
`end_user_id` string,
`city_id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
insert overwrite table page_views_lzocodec_181m_index_no select * from page_views_lzocodec_181m_index_test;
hive (ruozedata_test)> insert overwrite table page_views_lzocodec_181m_index_no select * from page_views_lzocodec_181m_index_test;
Query ID = hadoop_20181205150404_1a865910-88bb-4b53-b47d-da6d6716a692
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1543985301484_0019, Tracking URL = http://hadoop001:8088/proxy/application_1543985301484_0019/
Kill Command = /home/hadoop/app/compile/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1543985301484_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
hadoop jar /home/hadoop/app/hadoop-lzo-master/build/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test
CREATE TABLE page_views_lzocodec_181m_index_yes(
`track_time` string,
`url` string,
`session_id` string,
`referer` string,
`ip` string,
`end_user_id` string,
`city_id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
insert overwrite table page_views_lzocodec_181m_index_yes select * from page_views_lzocodec_181m_index_test;
hive (ruozedata_test)>
> insert overwrite table page_views_lzocodec_181m_index_yes select * from page_views_lzocodec_181m_index_test;
Query ID = hadoop_20181205150404_1a865910-88bb-4b53-b47d-da6d6716a692
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1543985301484_0020, Tracking URL = http://hadoop001:8088/proxy/application_1543985301484_0020/
Kill Command = /home/hadoop/app/compile/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1543985301484_0020
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
With the index in place, the LZO input is read by two mappers and the result is stored as two files:
[hadoop@hadoop001 hive-test-data]$ hadoop fs -ls -h -R /user/hive/warehouse/ruozedata_test.db/
drwxr-xr-x - hadoop supergroup 0 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo
-rw-r--r-- 1 hadoop supergroup 8.6 M 2018-12-05 14:26 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo
-rw-r--r-- 1 hadoop supergroup 584 2018-12-05 14:27 /user/hive/warehouse/ruozedata_test.db/lzo/page_views.dat.lzo.index
drwxr-xr-x - hadoop supergroup 0 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views
-rwxr-xr-x 1 hadoop supergroup 18.1 M 2018-12-05 14:09 /user/hive/warehouse/ruozedata_test.db/page_views/page_views.dat
drwxr-xr-x - hadoop supergroup 0 2018-12-05 16:17 /user/hive/warehouse/ruozedata_test.db/page_views_181
-rwxr-xr-x 1 hadoop supergroup 181.3 M 2018-12-05 16:17 /user/hive/warehouse/ruozedata_test.db/page_views_181/page_views(181M).dat
drwxr-xr-x - hadoop supergroup 0 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec
-rwxr-xr-x 1 hadoop supergroup 8.6 M 2018-12-05 15:08 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec/page_views.dat.lzo
drwxr-xr-x - hadoop supergroup 0 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2
-rwxr-xr-x 1 hadoop supergroup 8.6 M 2018-12-05 15:36 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec2/000000_0.lzo_deflate
drwxr-xr-x - hadoop supergroup 0 2018-12-05 17:18 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_no
-rwxr-xr-x 1 hadoop supergroup 85.7 M 2018-12-05 17:18 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_no/000000_0.lzo
drwxr-xr-x - hadoop supergroup 0 2018-12-05 17:22 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test
-rwxr-xr-x 1 hadoop supergroup 85.7 M 2018-12-05 17:16 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test/000000_0.lzo
-rw-r--r-- 1 hadoop supergroup 6.1 K 2018-12-05 17:22 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_test/000000_0.lzo.index
drwxr-xr-x - hadoop supergroup 0 2018-12-05 17:23 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_yes
-rwxr-xr-x 1 hadoop supergroup 42.9 M 2018-12-05 17:23 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_yes/000000_0.lzo
-rwxr-xr-x 1 hadoop supergroup 42.7 M 2018-12-05 17:23 /user/hive/warehouse/ruozedata_test.db/page_views_lzocodec_181m_index_yes/000001_0.lzo