Two approaches came to mind. Either way, event logging has to be persisted first, via sparkConf:
sparkConf:
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "hdfs://hdfscluster:port/spark-test/spark-events"
If you run into Kerberos-related problems at this point, see the later section for how to deal with them. Even with the event logs persisted, the web UI that Spark serves from the driver shuts down once the driver finishes, so the job can no longer be inspected afterwards. My first crude workaround was to add a sleep before spark.stop(): while the driver sleeps the web UI stays up, and in the Kubernetes cluster there is a svc for the spark-pi driver, but it has no NodePort. To reach that svc from outside the cluster you can either add an ingress with a port-80 mapping whose backend points at the svc, or point the ingress-controller's backend directly at it, which is what I did. The annoying part is that every SparkApplication creates a new svc of its own, so there is no single place to see them all. There should also be a simpler option: exposing an in-cluster svc directly with kubectl port-forward. I haven't tried it yet, so I'll note it as a follow-up.
TODO: try kubectl port-forward for accessing an in-cluster svc from outside the cluster.
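Roughly what that would look like (untested; spark-pi-ui-svc stands in for whatever the driver UI svc is actually named, and 4040 is the default Spark UI port):

kubectl port-forward svc/spark-pi-ui-svc 4040:4040
# then open http://localhost:4040 while the driver is still running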
It's not practical to keep every SparkApplication around just to look at its UI, so a Spark history server is still needed. There are a couple of ways to set one up; I went with building a custom image.
The Dockerfile:
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v2.4.4
FROM ${SPARK_IMAGE}
RUN chmod 777 /opt/spark/sbin/start-history-server.sh
RUN ls -l /opt/spark/sbin/start-history-server.sh
# patched spark-daemon.sh (works around the ps -p / nohup issues on the BusyBox base image, see below)
COPY spark-daemon.sh /opt/spark/sbin/spark-daemon.sh
RUN chmod 777 /opt/spark/sbin/spark-daemon.sh
# wrapper entrypoint that starts the history server and keeps the container alive
COPY run.sh /opt/run.sh
RUN chmod 777 /opt/run.sh
# Hadoop client and Kerberos config baked into the image (could also come from a configMap, see below)
RUN mkdir -p /etc/hadoop/conf
RUN chmod 777 /etc/hadoop/conf
COPY core-site.xml /etc/hadoop/conf/core-site.xml
COPY hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml
COPY user.keytab /etc/hadoop/conf/user.keytab
COPY krb5.conf /etc/hadoop/conf/krb5.conf
RUN chmod 777 /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/user.keytab /etc/hadoop/conf/krb5.conf
ENTRYPOINT ["/opt/run.sh"]
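Building and running it locally to test (the image name and tag are just placeholders):

docker build -t spark-history-server:v1.0 .
docker run spark-history-server:v1.0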
After docker run, the following error appears:
ps: unrecognized option: p
BusyBox v1.29.3 (2019-01-24 07:45:07 UTC) multi-call binary.
Usage: ps [-o COL1,COL2=HEADER]
Show list of processes
-o COL1,COL2=HEADER Select columns for display
To track down the cause I used another crude trick: instead of having the container run start-history-server.sh directly, I wrapped it in a script of my own that loops with a sleep, so I could then docker exec -it xxxxxxx bash into the container and investigate:
#!/bin/bash
sh /opt/spark/sbin/start-history-server.sh "hdfs://xxxxxxxxxx:xxxx/spark-test/spark-events"
while [ 1 == 1 ]
do
  cat /opt/spark/logs/*
  sleep 60
done
It turns out that spark-daemon.sh, which the start script calls, uses ps -p, and that is exactly what BusyBox's ps rejects. So spark-daemon.sh has to be patched to replace ps -p with plain ps, and the patched copy is swapped in when the docker image is built. While doing that I noticed another problem in the script:
execute_command() {
  if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then
      nohup -- "$@" >> $log 2>&1 < /dev/null &
      newpid="$!"
I wasn't sure what the -- was for (it is just the conventional end-of-options marker for nohup), so I dropped execute_command altogether and start the process directly:
case "$mode" in
(class)
"${SPARK_HOME}"/bin/spark-class "$command" "$@"
At this point a Kerberos problem shows up:
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-7c7f7db06bdc.out
Spark Command: /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Xmx1g org.apache.spark.deploy.history.HistoryServer hdfs://10.120.16.127:25000/spark-test/spark-events
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/17 06:38:21 INFO HistoryServer: Started daemon with process name: 334@7c7f7db06bdc
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for TERM
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for HUP
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for INT
20/01/17 06:38:21 WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:276)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:312)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:392)
at org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:392)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:413)
at org.apache.spark.deploy.history.HistoryServer$.initSecurity(HistoryServer.scala:342)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:289)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
... 9 more
Caused by: KrbException: Cannot locate default realm
at sun.security.krb5.Config.getDefaultRealm(Config.java:1029)
... 15 more
As with most common Kerberos problems, the following need to be taken care of:
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf"
export HADOOP_CONF_DIR=/etc/hadoop/conf
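These exports can sit at the top of the run.sh wrapper (or in spark-env.sh), before start-history-server.sh is invoked; roughly:

#!/bin/bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf"
sh /opt/spark/sbin/start-history-server.sh "hdfs://xxxxx:xxxx/spark-test/spark-events"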
After another docker run it starts up normally:
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-spark-history-server-5ccf5dbd4d-f7f8l.out
Spark Command: /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://10.120.16.127:25000/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf -Xmx1g org.apache.spark.deploy.history.HistoryServer hdfs://10.120.16.127:25000/spark-test/spark-events
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/17 07:11:13 INFO HistoryServer: Started daemon with process name: 13@spark-history-server-5ccf5dbd4d-f7f8l
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for TERM
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for HUP
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for INT
20/01/17 07:11:13 WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
20/01/17 07:11:14 INFO SparkHadoopUtil: Attempting to login to Kerberos using principal: ossuser/hadoop@HADOOP.COM and keytab: /etc/hadoop/conf/user.keytab
20/01/17 07:11:15 INFO UserGroupInformation: Login successful for user ossuser/hadoop@HADOOP.COM using keytab file /etc/hadoop/conf/user.keytab
20/01/17 07:11:15 INFO SecurityManager: Changing view acls to: root,ossuser
20/01/17 07:11:15 INFO SecurityManager: Changing modify acls to: root,ossuser
20/01/17 07:11:15 INFO SecurityManager: Changing view acls groups to:
20/01/17 07:11:15 INFO SecurityManager: Changing modify acls groups to:
20/01/17 07:11:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ossuser); groups with view permissions: Set(); users with modify permissions: Set(root, ossuser); groups with modify permissions: Set()
20/01/17 07:11:15 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
20/01/17 07:11:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/17 07:11:15 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
20/01/17 07:11:20 INFO Utils: Successfully started service on port 18080.
20/01/17 07:11:20 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://spark-history-server-5ccf5dbd4d-f7f8l:18080
20/01/17 07:11:20 INFO FsHistoryProvider: Parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-8cc1feeddb5f4e54b87613613752eac1 for listing data...
20/01/17 07:11:27 INFO FsHistoryProvider: Finished parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-8cc1feeddb5f4e54b87613613752eac1
20/01/17 07:11:27 INFO FsHistoryProvider: Parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-2b356130dc12418aa526bf56328fe840 for listing data...
Next up: deploy this image as a Deployment, create a svc, and make it reachable from outside the cluster.
Here I took a shortcut and baked the Kerberos/Hadoop configuration into the image. Just like with the sparkApplication, it could instead be mounted as a volume from a configMap, with the environment variables coming from a configMap or set directly in the deployment (a sketch follows the Deployment yaml below), but I didn't see the point of repeating work I'd already done.
Deployment yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: spark-history-server
    spec:
      imagePullSecrets:
        - name: dockersecret
      containers:
        - name: spark-history-server
          image: xxxxx/spark-history-server:v1.0
          imagePullPolicy: Always
          ports:
            - containerPort: 18080
              protocol: TCP
          resources:
            requests:
              cpu: 2
              memory: 4Gi
            limits:
              cpu: 4
              memory: 8Gi
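For reference, the configMap variant mentioned above would look roughly like this inside the pod spec (a sketch; the hadoop-conf configMap name is hypothetical and would need to be created from the xml/krb5 files first, and the keytab really belongs in a Secret rather than a configMap):

      containers:
        - name: spark-history-server
          env:
            - name: HADOOP_CONF_DIR
              value: /etc/hadoop/conf
          volumeMounts:
            - name: hadoop-conf
              mountPath: /etc/hadoop/conf
      volumes:
        - name: hadoop-conf
          configMap:
            name: hadoop-conf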
svc yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server
spec:
  type: NodePort
  ports:
    - name: http
      port: 18080
      targetPort: 18080
      nodePort: 30118
  selector:
    k8s-app: spark-history-server
After deploying:
[root@linux100-99-81-13 spark_history]# kubectl get pod
NAME READY STATUS RESTARTS AGE
spark-history-server-5ccf5dbd4d-f7f8l 1/1 Running 0 178m
[root@linux100-99-81-13 spark_history]# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 51d
spark-history-server NodePort 10.111.132.22 <none> 18080:30118/TCP 169m
The history page can now be viewed in a browser at node-ip:nodeport.
NodePort is a fairly crude way in; a nicer approach is to route to the spark history server svc through an ingress, so create a ClusterIP-type svc:
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server-cluster
  namespace: default
spec:
  type: ClusterIP
  ports:
    - port: 5601
      protocol: TCP
      targetPort: 18080
  selector:
    k8s-app: spark-history-server
Then add an ingress that forwards URLs under the /sparkHistory path to port 5601 of the spark-history-server-cluster svc:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/ssl-redirect: "false"
  name: spark-history-ingress
  namespace: default
spec:
  rules:
    - http:
        paths:
          - backend:
              serviceName: spark-history-server-cluster
              servicePort: 5601
            path: /sparkHistory
This still has problems, though, and the ingress needs more tuning: some resources requested by the history server's pages don't carry the sparkHistory prefix in their URLs, so the pages load incompletely, for example this one: http://100.99.65.73/static/historypage-common.js
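One direction that might fix this (not tried here) is to tell the history server which external prefix it sits behind via spark.ui.proxyBase, so the UI itself generates /sparkHistory-prefixed links, e.g. adding to SPARK_HISTORY_OPTS:

export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.ui.proxyBase=/sparkHistory"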
Getting the ClusterIP svc working behind the ingress took a bit of fiddling: with both the ingress and the svc created, refreshing the page kept returning 503. It turned out the svc was misconfigured and its selector matched no pod, so there was nothing to route traffic to, hence the error. The correct selector is:

selector:
  k8s-app: spark-history-server

(I had initially set it to spark-history-server-cluster.)
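A quick way to catch this kind of mistake is to check whether the svc has any endpoints at all:

kubectl get endpoints spark-history-server-cluster
# an empty ENDPOINTS column means the selector matches no pods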
With that, the flow of viewing historical applications through a spark history server is complete. The only remaining gap is that I haven't worked out the ingress configuration yet, so for now NodePort it is.
After the cluster came up I noticed a svc called littering-woodpecker-webhook; my gut feeling was that it might be some kind of spark-operator management UI, or a REST entry point exposed by spark-operator.
spark-operator littering-woodpecker-webhook ClusterIP 10.101.113.106 <none> 443/TCP 49d
I tried creating an ingress pointing at that svc, but it seems a plain GET request is not a valid call against it.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/ssl-redirect: "false"
  name: spark-operator-ingress
  namespace: spark-operator
spec:
  rules:
    - http:
        paths:
          - backend:
              serviceName: littering-woodpecker-webhook
              servicePort: 443
            path: /sparkOperator
In the browser dev tools (F12) the response is HTTP 400 Bad Request.
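Most likely this svc is the operator's mutating admission webhook endpoint rather than a UI; it expects AdmissionReview POSTs from the API server, which would explain the 400 on a browser GET. One way to check (I didn't dig further):

kubectl get mutatingwebhookconfigurations
kubectl -n spark-operator describe svc littering-woodpecker-webhook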