Data analysts commonly use JupyterLab for early-stage data exploration, but a vanilla installation ships with only a plain Python kernel and cannot reach the data stored in the big data cluster. The problem to solve, then, is how to give analysts access to that cluster data from JupyterLab.
The Jupyter ecosystem already provides kernels for this purpose: the long-established sparkmagic (https://github.com/jupyter-incubator/sparkmagic) and the newer open source project Apache Toree (https://toree.incubator.apache.org/) can both provide access to cluster data through Spark and PySpark.
After investigating both sparkmagic and Toree and weighing the trade-offs, we decided to use sparkmagic to add Spark and PySpark support to JupyterLab. sparkmagic does not talk to Spark directly: it sends code over REST to an Apache Livy server, which in turn submits it to the cluster, so Livy has to be installed and configured first.
Prerequisites:
1. Install Apache Livy
First, download the required version from https://livy.incubator.apache.org/download/ (the latest release is recommended).
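On a headless server the archive can also be fetched directly from the command line; the URL below follows the Apache archive layout for the 0.7.1-incubating release used in this guide and is only an example, so adjust it to the version you actually pick:
wget https://archive.apache.org/dist/incubator/livy/0.7.1-incubating/apache-livy-0.7.1-incubating-bin.zip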
mkdir -p /usr/local/livy
unzip apache-livy-0.7.1-incubating-bin.zip -d /usr/local/livy
2. Configure Apache Livy
cd /usr/local/livy
cp conf/livy.conf.template conf/livy.conf
cp conf/livy-env.sh.template conf/livy-env.sh
# copy hive-site.xml from the Hive installation so Livy can read Hive tables
cp hive-site.xml /usr/local/livy/conf
# directories for Livy logs and PID files
mkdir -p /usr/lib/bigdata/livy/log
mkdir -p /usr/lib/bigdata/livy/run
Edit conf/livy.conf:
# address the Livy server binds to
livy.server.host = 127.0.0.1
# Livy server port
livy.server.port = 8998
# master used when submitting jobs to the cluster (here YARN)
livy.spark.master = yarn
# deploy mode of submitted applications
livy.spark.deploy-mode = cluster
# how long a session may stay inactive before Livy times it out
livy.server.session.timeout = 1h
# enable HiveContext in sessions so Hive tables can be queried
livy.repl.enable-hive-context = true
Edit conf/livy-env.sh:
export JAVA_HOME=/usr/lib/bigdata/openjdk8
export HADOOP_HOME=/usr/lib/bigdata/hadoop
export SPARK_CONF_DIR=/usr/lib/bigdata/spark/conf
export SPARK_HOME=/usr/lib/bigdata/spark
export HADOOP_CONF_DIR=/usr/lib/bigdata/hadoop/etc/hadoop
export LIVY_LOG_DIR=/usr/lib/bigdata/livy/log
export LIVY_PID_DIR=/usr/lib/bigdata/livy/run
export LIVY_SERVER_JAVA_OPTS="-Xmx2g"
After Livy is installed, start the Livy service on the cluster that users need to access:
./bin/livy-server start
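Before moving on, it is worth checking that Livy is actually listening on the host and port configured above; a quick smoke test against its REST API is shown below, and a fresh server should answer with an empty session list such as {"from":0,"total":0,"sessions":[]}:
curl http://127.0.0.1:8998/sessions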
3. Install sparkmagic
pip install sparkmagic
# enable the ipywidgets notebook extension that sparkmagic relies on
jupyter nbextension enable --py --sys-prefix widgetsnbextension
# install the JupyterLab widget extension so the Spark and PySpark magics can render their widgets
jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
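To confirm that the widget extension was built into JupyterLab, list the installed lab extensions; @jupyter-widgets/jupyterlab-manager should appear as enabled:
jupyter labextension list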
# register the wrapper kernels shipped with sparkmagic; run these from the directory
# reported by `pip show sparkmagic` (the Location field), where the sparkmagic package lives
# Spark (Scala) kernel
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
# PySpark kernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
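As a sanity check, the new kernels should now be visible to Jupyter; enabling the sparkmagic server extension is an optional extra step from the sparkmagic install notes:
# both sparkkernel and pysparkkernel should appear in this list
jupyter kernelspec list
# optional: enable the sparkmagic server extension
jupyter serverextension enable --py sparkmagic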
Edit /root/.sparkmagic/config.json (sparkmagic reads ~/.sparkmagic/config.json of the user running JupyterLab) so that the kernels point at the Livy endpoint:
{
  "kernel_python_credentials" : {
    "username": "",
    "password": "",
    "url": "http://127.0.0.1:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "",
    "password": "",
    "url": "http://127.0.0.1:8998",
    "auth": "None"
  },
  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },
  "authenticators": {
    "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
    "None": "sparkmagic.auth.customauth.Authenticator",
    "Basic_Access": "sparkmagic.auth.basic.Basic"
  },
  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 60,
  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  "ignore_ssl_errors": false,
  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },
  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  "heartbeat_refresh_seconds": 30,
  "livy_server_heartbeat_timeout_seconds": 0,
  "heartbeat_retry_seconds": 10,
  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {},
  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}
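With config.json in place, the endpoint it points at can be exercised end to end. The sketch below assumes the unauthenticated setup configured above; it creates a throwaway PySpark session through Livy's REST API and then deletes it (substitute the id returned by the POST call):
# create a pyspark session and note the "id" field in the response
curl -s -X POST -H "Content-Type: application/json" -d '{"kind": "pyspark"}' http://127.0.0.1:8998/sessions
# poll until the session state becomes idle
curl -s http://127.0.0.1:8998/sessions
# clean up, replacing 0 with the id returned above
curl -s -X DELETE http://127.0.0.1:8998/sessions/0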
With this, the installation is complete: users can access data in the various big data clusters through the Spark and PySpark kernels.
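A typical way to bring the environment up is shown below; the flags are only illustrative and should be adapted to your deployment (--allow-root is needed only if JupyterLab really runs as root, as the /root/.sparkmagic path above suggests):
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
The Spark and PySpark kernels then appear in the JupyterLab launcher next to the default Python kernel. The first cell executed in such a notebook makes sparkmagic request a session from Livy on YARN, which can take up to the livy_session_startup_timeout_seconds configured above.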