Official documentation; I recommend reading these two articles in full:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_noncm_installation.html
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_config_options.html
Before settling on the RPM install, I spent a day trying to build and install from source. Since neither the official documentation nor the web has much to say about source installs, I got it to compile but never got it to run. I suspect Cloudera does this on purpose.
The RPM install is fairly straightforward overall; the steps are as follows:
1. Create the impala user and group
groupadd impala
useradd -g impala impala
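A quick sanity check, assuming the commands above were run as root:
id impala    # should report an impala user whose primary group is impala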
2. Configure the CDH yum repository. The repo below is the CDH 5 repo I had been using, with the baseurl and gpgkey changed:
[root@xxxx catalog]# cat /etc/yum.repos.d/cloudera-cdh6.repo
[cloudera-cdh6]
# Packages for Cloudera's Distribution for Hadoop, Version 6, on RedHat or CentOS 6 x86_64
name=Cloudera's Distribution for Hadoop, Version 6
baseurl=https://archive.cloudera.com/cdh6/6.2.0/redhat6/yum/
gpgkey = https://archive.cloudera.com/cdh6/6.2.0/redhat6/yum/RPM-GPG-KEY-cloudera
gpgcheck = 1
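After dropping in the repo file, refresh the yum metadata so the new repository is picked up; these are standard yum commands, nothing Impala-specific:
sudo yum clean all
sudo yum makecache
sudo yum repolist | grep -i cloudera    # the cloudera-cdh6 repo should show up here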
3. Install impala-server, impala-catalog, and impala-state-store
$ sudo yum install impala # Binaries for daemons
$ sudo yum install impala-server # Service start/stop script
$ sudo yum install impala-state-store # Service start/stop script
$ sudo yum install impala-catalog # Service start/stop script
$ sudo yum install impala-shell
Installing Impala pulls in a surprising number of dependencies, roughly 1 GB in total; from a quick look, just about every package under https://archive.cloudera.com/cdh6/6.2.0/redhat6/yum/ ends up installed.
Once installation finishes, try restarting the three daemons above on the same machine (see the sketch below); if they come up cleanly, the installation succeeded.
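Something like the following, using the SysV service names installed by the RPM packages above (statestore first, then catalog, then impalad):
sudo service impala-state-store restart
sudo service impala-catalog restart
sudo service impala-server restart
sudo service impala-server status    # repeat for the other two services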
4. Create the /etc/impala/conf directory and copy hive-site.xml, hdfs-site.xml, and core-site.xml into it; if you have other dependencies such as HBase, copy their config files in as well (a copy sketch follows the listing below).
[root@xxx conf.dist]# cd /etc/impala/conf
[root@xxx conf]# ls -l
total 180
-rw-r--r--. 1 root root 3707 May 1 07:58 core-site.xml
-rw-r--r--. 1 root root 3102 May 1 07:58 hadoop-env.sh
-rw-r--r--. 1 root root 3288 May 1 07:58 hdfs-site.xml
-rw-r--r--. 1 root root 168441 May 1 05:10 hive-site.xml
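For reference, the copy looked roughly like this; the source paths /etc/hadoop/conf and /data/hive/conf are assumptions and depend on where your Hadoop and Hive configs actually live:
mkdir -p /etc/impala/conf
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/   # assumed Hadoop conf dir
cp /data/hive/conf/hive-site.xml /etc/impala/conf/                                   # HIVE_HOME is /data/hive here (see step 5)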
5. Edit the Impala configuration file /etc/default/impala
The RPM install automatically creates /etc/default/impala, /usr/lib/impala/lib, and other directories. The configuration below is not the stock default; I added a few flags such as -mem_limit=30% and -max_log_files=10.
[root@xxxx conf]# cat /etc/default/impala
IMPALA_CATALOG_SERVICE_HOST=10.40.2.175
IMPALA_STATE_STORE_HOST=10.40.2.175
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_ARGS=" -log_dir=${IMPALA_LOG_DIR} \
-catalog_service_port=26000 \
-max_log_files=10 \
-enable_webserver=true \
-mem_limit=10%"
IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-state_store_num_server_worker_threads=4 \
-max_log_files=10"
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} \
-max_log_files=10 \
-max_result_cache_size=100000 \
-abort_on_config_error=true \
-mem_limit=30% "
ENABLE_CORE_DUMPS=false
LIBHDFS_OPTS=-Djava.library.path=/usr/lib/impala/lib
MYSQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java.jar
IMPALA_BIN=/usr/lib/impala/sbin
IMPALA_HOME=/usr/lib/impala
HIVE_HOME=/data/hive
#HBASE_HOME=/usr/lib/hbase
IMPALA_CONF_DIR=/etc/impala/conf
HADOOP_CONF_DIR=/etc/impala/conf
HIVE_CONF_DIR=/etc/impala/conf
#HBASE_CONF_DIR=/etc/impala/conf
Then restart the services and verify they came back up (a check sketch follows).
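One way to verify is to check that the daemons are listening on the ports configured above plus the default web UI and impala-shell ports (21000 shows up in the next step), and glance at the logs under IMPALA_LOG_DIR; a minimal check, assuming the defaults:
netstat -lntp | grep -E ':(21000|22000|24000|25000|25010|25020|26000) '   # impalad / statestored / catalogd ports
ls -lt /var/log/impala | head                                             # most recent daemon logs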
6. Test impala-shell
[root@xxxx conf]# impala-shell
Starting Impala Shell without Kerberos authentication
Opened TCP connection to xxxxxxx:21000
Connected to xxxxxxx:21000
Server version: impalad version 3.2.0-cdh6.2.0 RELEASE (build edc19942b4debdbfd485fbd26098eef435003f5d)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v3.2.0-cdh6.2.0 (edc1994) built on Thu Mar 14 00:14:36 PDT 2019)
The '-B' command line flag turns off pretty-printing for query results. Use this
flag to remove formatting from results you want to save for later, or to benchmark
Impala.
***********************************************************************************
[xxxxxxx:21000] default> show databases;
Query: show databases
+------------------+----------------------------------------------+
| name | comment |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default | Default Hive database |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.01s
[xxxxxxxx:21000] default> use default;
Query: use default
[xxxxxxx:21000] default> select count(*) from jlwang2;
Query: select count(*) from jlwang2
Query submitted at: 2019-05-01 08:37:39 (Coordinator: http://xxxxxx:25000)
Query progress can be monitored at: http://xxxxxx:25000/query_plan?query_id=ec4131fa1d7e78a1:e9ebcc500000000
+----------+
| count(*) |
+----------+
| 0 |
+----------+
Fetched 1 row(s) in 5.64s
[xxxxx:21000] default>
7. With the steps above, Impala is installed. A client machine only needs yum install impala-shell, and impala-shell itself has no extra dependencies, so any machine can act as an Impala client by installing just that one package.
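For example, from any client machine (the hostname below is a placeholder), -i points impala-shell at an impalad and -q runs a single statement non-interactively:
impala-shell -i xxxxxxx:21000 -q "show databases"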
As mentioned earlier, the Impala daemons pull in about 1 GB of dependencies. After installation, files show up under /usr/bin, /usr/lib, and other directories, many of them commands; for example, Impala depends on Hive, so a hive command appears in /usr/bin.
This part is a real headache. The official documentation recommends running an impalad on every DataNode (because of short-circuit reads, which let Impala read HDFS block data directly rather than going through the DataNode service), so every machine ends up with this pile of extra files, and they can interfere with what is already there. For example, on a machine that already runs Hive, installing Impala drops in another hive command:
[root@xxx conf]# hive
Error: Could not find or load main class org.apache.hadoop.util.VersionInfo
Unable to determine Hadoop version information.
'hadoop version' returned:
Error: Could not find or load main class org.apache.hadoop.util.VersionInfo
[root@xxx conf]# hadoop version
Error: Could not find or load main class org.apache.hadoop.util.VersionInfo
[root@xxx conf]# which hive
/usr/bin/hive
With an RPM install there does not seem to be any way around this.
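If you want to confirm which package dropped in the conflicting binary, and what else it owns, rpm can tell you; a quick check based on the output above:
rpm -qf /usr/bin/hive                       # which package owns the conflicting hive command
rpm -ql $(rpm -qf /usr/bin/hive) | head     # a sample of everything that package installed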
8. Use the Impala web ports to view information about the impalad, statestored, and catalogd services:
impalad:     http://xxxxxx:25000
statestored: http://xxxxxx:25010
catalogd:    http://xxxxxx:25020
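A quick command-line check that the web UIs are up; any 200 response from the root page is enough (ports as above):
curl -s -o /dev/null -w '%{http_code}\n' http://xxxxxx:25000/    # impalad web UI
curl -s -o /dev/null -w '%{http_code}\n' http://xxxxxx:25010/    # statestored web UI
curl -s -o /dev/null -w '%{http_code}\n' http://xxxxxx:25020/    # catalogd web UI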