Spark on K8S (spark-on-kubernetes-operator) FAQ (Part 2)

何涵育
2023-12-01

Common problems encountered during the Spark demo (Part 2)

How to persist the logs of the Spark executors/driver

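I can think of two approaches:

  1. Have the driver and executors ship their logs to a central logging system (for example ES) while they run. This needs code changes and a new environment, so I will look into it later.
  2. Use Spark's own persistence settings and write to an HDFS path (this is also how it is usually done on YARN).

I went with option 2 this time. Note that these settings persist the Spark event log, which is what the web UI and the history server are built from, rather than the stdout logs of the driver/executors. The following settings are available: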
  #sparkConf:
    #"spark.eventLog.enabled": "true"
    #"spark.eventLog.dir": "hdfs://hdfscluster:port/spark-test/spark-events"

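If you run into Kerberos problems, see the later sections. Once the logs are persisted there is still a gap: the web UI that Spark serves shuts down when the driver finishes, so the execution of the Spark job can no longer be inspected. As a crude workaround I added a sleep before spark.stop(), which keeps the web UI from shutting down.

In the Kubernetes cluster there is a svc for the spark-pi driver, but it has no nodeport. To reach that svc from outside the cluster you can either add an ingress that maps port 80 and uses the svc as its backend, or point the ingress-controller's backend directly at the svc. I used the latter. The annoying part is that every sparkApplication creates a new svc, so there is no single page where everything can be viewed.

There should also be a simpler option: kubectl port-forward lets a client outside the cluster reach a svc directly. I have not tried it yet, so I am noting it as an exercise for later.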

Exercise: try using kubectl port-forward to reach an in-cluster SVC from outside.
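
For the exercise above, a minimal sketch of what the port-forward could look like. The service name spark-pi-ui-svc and the namespace default are assumptions for illustration (check kubectl get svc for the real name); 4040 is Spark's default UI port. The function name and the DRY_RUN switch are made up here so the command can be inspected before it is run.

```shell
#!/bin/sh
# Sketch: forward a local port to the Spark driver UI service.
# ASSUMPTIONS (not from the original notes): service name "spark-pi-ui-svc",
# namespace "default". 4040 is Spark's default UI port.
forward_driver_ui() {
  svc="${1:-spark-pi-ui-svc}"
  ns="${2:-default}"
  local_port="${3:-4040}"
  if [ -n "${DRY_RUN:-}" ]; then
    # Only print the command so it can be checked before running it.
    echo "kubectl port-forward -n $ns svc/$svc ${local_port}:4040"
  else
    # Blocks until interrupted; the UI is then at http://localhost:<local_port>
    kubectl port-forward -n "$ns" "svc/$svc" "${local_port}:4040"
  fi
}
```

By default kubectl port-forward listens on 127.0.0.1 only, so the UI is reachable from the machine running kubectl. To open it to other hosts, kubectl accepts --address 0.0.0.0. The forward only works while the driver pod is alive, so it does not replace a history server.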

How to configure the Spark history server

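Keeping every sparkApplication around is not realistic, so a Spark history server is still needed. Two options:

  1. Build a dedicated history server image on top of the spark v2.4.4 image and run it inside Kubernetes
  2. Deploy a Spark history server outside the cluster

The Dockerfile is as follows: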

# Base image; override with --build-arg SPARK_IMAGE=... if needed
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v2.4.4

FROM ${SPARK_IMAGE}
RUN chmod 777 /opt/spark/sbin/start-history-server.sh
RUN ls -l /opt/spark/sbin/start-history-server.sh

COPY spark-daemon.sh /opt/spark/sbin/spark-daemon.sh
RUN chmod 777 /opt/spark/sbin/spark-daemon.sh
COPY run.sh /opt/run.sh
RUN chmod 777 /opt/run.sh

RUN mkdir -p /etc/hadoop/conf
RUN chmod 777 /etc/hadoop/conf

COPY core-site.xml /etc/hadoop/conf/core-site.xml
COPY hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml
COPY user.keytab /etc/hadoop/conf/user.keytab
COPY krb5.conf /etc/hadoop/conf/krb5.conf

RUN chmod 777 /etc/hadoop/conf/core-site.xml
RUN chmod 777 /etc/hadoop/conf/hdfs-site.xml
RUN chmod 777 /etc/hadoop/conf/user.keytab
RUN chmod 777 /etc/hadoop/conf/krb5.conf

ENTRYPOINT ["/opt/run.sh"]

After docker run, the following error appears:

ps: unrecognized option: p
BusyBox v1.29.3 (2019-01-24 07:45:07 UTC) multi-call binary.

Usage: ps [-o COL1,COL2=HEADER]

Show list of processes

        -o COL1,COL2=HEADER     Select columns for display

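To track down the cause I used a crude trick: instead of running start-history-server.sh directly when the container starts, I wrapped it in a script of my own that sleeps in a loop, then used docker exec -it xxxxxxx bash to get into the container and investigate: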

#!/bin/bash
sh /opt/spark/sbin/start-history-server.sh "hdfs://xxxxxxxxxx:xxxx/spark-test/spark-events"
while [ 1 == 1 ]
do
        cat /opt/spark/logs/*
        sleep 60
done

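It turned out that spark-daemon.sh, which the start script calls, uses ps -p, and that usage fails outright with the BusyBox ps in this image. So spark-daemon.sh has to be modified: replace ps -p with ps, then copy the modified script into the image at docker build time. Along the way I found another problem in the script: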

execute_command() {
  if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then
      nohup -- "$@" >> $log 2>&1 < /dev/null &
      newpid="$!"

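I did not understand what the -- was for at the time (by convention it marks the end of options, so that whatever follows is treated as the command and not as an option to nohup). I simply dropped it, skipping execute_command and starting the process directly: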

  case "$mode" in
    (class)
      "${SPARK_HOME}"/bin/spark-class "$command" "$@"

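Replacing ps -p with a bare ps gets rid of the BusyBox error, but a bare ps is no longer limited to a single PID. A minimal alternative sketch, assuming all that is needed is to know whether a given PID is still alive: signal 0 performs the existence check without delivering anything. This checks liveness only; it does not confirm that the process is a java process. The function name is made up for illustration.

```shell
#!/bin/sh
# pid_is_running PID
# Succeeds when a process with that PID exists and may be signalled by us.
# "kill -0" sends no signal; it only performs the existence/permission check.
pid_is_running() {
  kill -0 "$1" 2>/dev/null
}

# Example: the current shell is certainly running.
if pid_is_running "$$"; then
  echo "pid $$ is running"
fi
```

In spark-daemon.sh this kind of check could stand in for the places where the PID read from the pid file is tested, at the cost of losing the "is it java" comparison.
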
At this point a Kerberos problem shows up:

starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-7c7f7db06bdc.out
Spark Command: /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Xmx1g org.apache.spark.deploy.history.HistoryServer hdfs://10.120.16.127:25000/spark-test/spark-events
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/17 06:38:21 INFO HistoryServer: Started daemon with process name: 334@7c7f7db06bdc
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for TERM
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for HUP
20/01/17 06:38:21 INFO SignalUtils: Registered signal handler for INT
20/01/17 06:38:21 WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
        at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:276)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:312)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:53)
        at org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:392)
        at org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:392)
        at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:413)
        at org.apache.spark.deploy.history.HistoryServer$.initSecurity(HistoryServer.scala:342)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:289)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
        at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
        ... 9 more
Caused by: KrbException: Cannot locate default realm
        at sun.security.krb5.Config.getDefaultRealm(Config.java:1029)
        ... 15 more

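As with most Kerberos problems, two things need to be taken care of:

  1. Pass the Kerberos authentication parameters to the Spark history server; fortunately it supports configuring them
  2. Specify the path to krb5.conf

Both are done by setting two environment variables (adding them in the Dockerfile had no effect, so I put them in run.sh):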
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://xxxxx:xxxx/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf"
export HADOOP_CONF_DIR=/etc/hadoop/conf
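
Putting the pieces together, a run.sh could look roughly like the sketch below. It is not the script actually used here. The EVENT_LOG_DIR variable name is made up, and the hdfs address is the same redacted placeholder as above. Setting SPARK_NO_DAEMONIZE is an assumption based on the execute_command excerpt shown earlier: when that variable is set, the nohup branch is skipped, which should keep the server in the foreground so the container needs no sleep loop.

```shell
#!/bin/sh
# Sketch of a run.sh for the history server image (assumptions noted inline).
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
# Hypothetical variable; the address is the redacted placeholder from the notes.
EVENT_LOG_DIR="${EVENT_LOG_DIR:-hdfs://xxxxx:xxxx/spark-test/spark-events}"

# Build the option list one flag at a time so it stays reviewable.
SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$EVENT_LOG_DIR"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.kerberos.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf"
export SPARK_HISTORY_OPTS
export HADOOP_CONF_DIR=/etc/hadoop/conf

# ASSUMPTION: with this set, spark-daemon.sh runs the command in the foreground.
export SPARK_NO_DAEMONIZE=true

if [ -x "$SPARK_HOME/sbin/start-history-server.sh" ]; then
  # exec replaces the shell so the server receives container signals directly.
  exec "$SPARK_HOME/sbin/start-history-server.sh"
else
  echo "start-history-server.sh not found under $SPARK_HOME/sbin; nothing started" >&2
fi
```

Because the log directory is passed through spark.history.fs.logDirectory, the positional argument that triggers the deprecation warning in the log above is no longer needed.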

After another docker run it starts normally:

starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-spark-history-server-5ccf5dbd4d-f7f8l.out
Spark Command: /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://10.120.16.127:25000/spark-test/spark-events -Dspark.history.kerberos.principal=ossuser/hadoop@HADOOP.COM -Dspark.history.kerberos.keytab=/etc/hadoop/conf/user.keytab -Dspark.history.kerberos.enabled=true -Djava.security.krb5.conf=/etc/hadoop/conf/krb5.conf -Xmx1g org.apache.spark.deploy.history.HistoryServer hdfs://10.120.16.127:25000/spark-test/spark-events
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/17 07:11:13 INFO HistoryServer: Started daemon with process name: 13@spark-history-server-5ccf5dbd4d-f7f8l
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for TERM
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for HUP
20/01/17 07:11:13 INFO SignalUtils: Registered signal handler for INT
20/01/17 07:11:13 WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
20/01/17 07:11:14 INFO SparkHadoopUtil: Attempting to login to Kerberos using principal: ossuser/hadoop@HADOOP.COM and keytab: /etc/hadoop/conf/user.keytab
20/01/17 07:11:15 INFO UserGroupInformation: Login successful for user ossuser/hadoop@HADOOP.COM using keytab file /etc/hadoop/conf/user.keytab
20/01/17 07:11:15 INFO SecurityManager: Changing view acls to: root,ossuser
20/01/17 07:11:15 INFO SecurityManager: Changing modify acls to: root,ossuser
20/01/17 07:11:15 INFO SecurityManager: Changing view acls groups to:
20/01/17 07:11:15 INFO SecurityManager: Changing modify acls groups to:
20/01/17 07:11:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, ossuser); groups with view permissions: Set(); users  with modify permissions: Set(root, ossuser); groups with modify permissions: Set()
20/01/17 07:11:15 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
20/01/17 07:11:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/17 07:11:15 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
20/01/17 07:11:20 INFO Utils: Successfully started service on port 18080.
20/01/17 07:11:20 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://spark-history-server-5ccf5dbd4d-f7f8l:18080
20/01/17 07:11:20 INFO FsHistoryProvider: Parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-8cc1feeddb5f4e54b87613613752eac1 for listing data...

20/01/17 07:11:27 INFO FsHistoryProvider: Finished parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-8cc1feeddb5f4e54b87613613752eac1
20/01/17 07:11:27 INFO FsHistoryProvider: Parsing hdfs://xxxxx:xxxx/spark-test/spark-events/spark-2b356130dc12418aa526bf56328fe840 for listing data...

The next step is to deploy this image as a deployment and create a svc so that it can be reached from outside the cluster.

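Here I took a shortcut and baked the kerberos/hadoop configuration into the image. It could also be done the same way as for sparkApplication, by mounting a configMap as a volume; the environment variables could likewise come from a configMap or be set directly in the deployment. I did not bother repeating that work here.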

Deployment yaml

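# Note: extensions/v1beta1 Deployments were removed in Kubernetes 1.16.
# On newer clusters use apps/v1 and add spec.selector.matchLabels.
apiVersion: extensions/v1beta1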
kind: Deployment
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: spark-history-server
    spec:
      imagePullSecrets:
      - name: dockersecret
      containers:
      - name: spark-history-server
        image: xxxxx/spark-history-server:v1.0
        imagePullPolicy: Always
        ports:
        - containerPort: 18080
          protocol: TCP
        resources:
          requests:
            cpu: 2
            memory: 4Gi
          limits:
            cpu: 4
            memory: 8Gi

svc yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server
spec:
  type: NodePort
  ports:
  - name: http
    port: 18080
    targetPort: 18080
    nodePort: 30118
  selector:
    k8s-app: spark-history-server

After deployment:

[root@linux100-99-81-13 spark_history]# kubectl get pod
NAME                                    READY   STATUS    RESTARTS   AGE
spark-history-server-5ccf5dbd4d-f7f8l   1/1     Running   0          178m
[root@linux100-99-81-13 spark_history]# kubectl get svc
NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
kubernetes                     ClusterIP   10.96.0.1       <none>        443/TCP           51d
spark-history-server           NodePort    10.111.132.22   <none>        18080:30118/TCP   169m

The history page is then reachable in a browser at ip:nodeport.
Using a nodeport is a bit rough. A better way is to route to the spark history server svc through an ingress, so create a ClusterIP svc:

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: spark-history-server
  name: spark-history-server-cluster
  namespace: default
spec:
  type: ClusterIP
  ports:
  - port: 5601
    protocol: TCP
    targetPort: 18080
  selector:
    k8s-app: spark-history-server

Then add an ingress that forwards every url under the path /sparkHistory to port 5601 of the spark-history-server-cluster svc:

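# Note: extensions/v1beta1 Ingress was removed in Kubernetes 1.22.
# Newer clusters need networking.k8s.io/v1, whose backend syntax differs.
apiVersion: extensions/v1beta1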
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/ssl-redirect: "false"
  name: spark-history-ingress
  namespace: default
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: spark-history-server-cluster
          servicePort: 5601
        path: /sparkHistory

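This still has problems for now, and the ingress needs further tuning: some of the history server's resources are requested without the sparkHistory prefix, so the page loads incompletely. One example is this resource: http://100.99.65.73/static/historypage-common.js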
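
One thing that might be worth trying for the missing-prefix problem, offered as an untested assumption and not something verified in this setup: Spark reads a spark.ui.proxyBase property and uses it as the root for the links it generates. The sketch below appends it to the SPARK_HISTORY_OPTS set in run.sh, so that resources such as /static/historypage-common.js would be requested under /sparkHistory and match the ingress path.

```shell
#!/bin/sh
# ASSUMPTION (untested here): spark.ui.proxyBase makes the history server
# prefix its generated links with the ingress path.
PROXY_BASE="/sparkHistory"   # must match the "path" in the ingress rule

# Append to whatever run.sh has already put into SPARK_HISTORY_OPTS.
SPARK_HISTORY_OPTS="${SPARK_HISTORY_OPTS:-} -Dspark.ui.proxyBase=${PROXY_BASE}"
export SPARK_HISTORY_OPTS
```

The ingress would still have to strip the prefix before passing requests on, which is what the rewrite-target annotation is for. The exact annotation name depends on the ingress controller in use.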

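Getting access through the ingress to the ClusterIP svc took some effort. The ingress and the svc were both created, but refreshing the page (F5) always returned 503. It turned out the svc was misconfigured: its selector matched no pod, so requests had nowhere to go.
The selector should be k8s-app: spark-history-server (I had initially set it to spark-history-server-cluster).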

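That completes the process of setting up a spark history server to view past applications. The one shortcoming is that I have not yet worked out how to configure the ingress properly, so for now I use the nodeport.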

What is the xxxxx-webhook in the spark-operator namespace for

After the cluster came up I noticed a svc named littering-woodpecker-webhook. My guess was that it is a management UI for spark-operator, or a REST entry point that spark-operator exposes.

spark-operator   littering-woodpecker-webhook   ClusterIP   10.101.113.106   <none>        443/TCP           49d

I tried creating an ingress pointing at this svc, but a plain GET request does not seem to be valid.

# Note: extensions/v1beta1 Ingress was removed in Kubernetes 1.22.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/ssl-redirect: "false"
  name: spark-operator-ingress
  namespace: spark-operator
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: littering-woodpecker-webhook
          servicePort: 443
        path: /sparkOperator

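In the browser's F12 developer tools the response is HTTP 400 Bad Request. As far as I know, that fits the svc being the operator's mutating admission webhook and not a UI: it is meant to be called by the Kubernetes API server with admission requests, not with a plain browser GET.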
