spark-shell \
--master yarn \
--driver-memory 1G \
--conf spark.executor.memory=1G \
--conf spark.executor.cores=2 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.memory.pinnedPool.size=1G \
--conf spark.locality.wait=0s \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files ${SPARK_RAPIDS_DIR}/getGpusResources.sh \
--jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}
When scheduling GPUs through YARN, submitting a Spark job with spark-rapids kept failing. I tried many approaches before finding the root cause; at the time, neither Google nor Baidu turned up any blog post on this.
The error was as follows (IPs, paths, etc. replaced with xxx):
[2021-02-27 17:46:46.784]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/xxx/spark-archive-3x.zip/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/xxx/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
. | org.apache.spark.internal.Logging.logWarning(Logging.scala:69)
2021-02-27 17:46:47,483 | INFO | [Reporter] | Will request 1 executor container(s), each with 2 core(s) and 2048 MB memory (including 1024 MB of overhead) with custom resources: <memory:2048, vCores:2, yarn.io/gpu: 1> | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,484 | INFO | [Reporter] | Submitted 1 unlocalized container requests. | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,691 | INFO | [Reporter] | Launching container container_e120_1614074444286_0025_01_000006 on host xxx for executor with ID 3 | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,692 | INFO | [Reporter] | Received 1 containers from YARN, launching executors on 1 of them. | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,692 | INFO | [Reporter] | Completed container container_e120_1614074444286_0025_01_000003 on host: xxx (state: COMPLETE, exit status: 1) | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,692 | WARN | [Reporter] | Container from a bad node: container_e120_1614074444286_0025_01_000003 on host: xxx. Exit status: 1. Diagnostics: [2021-02-27 17:46:47.489]Exception from container-launch.
Container id: container_e120_1614074444286_0025_01_000003
Exit code: 1
Exception message: Launch container failed
From these logs I guessed that YARN failed to initialize the container because it could not obtain resources, yet the machines had plenty of capacity.
So I opened the YARN web UI, resubmitted the job, and kept refreshing the container log, which eventually surfaced the following:
ERROR | [dispatcher-Executor] | Could not load cudf jni library... | ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:91) java.io.IOException: Error loading dependencies
Caused by: java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: /xxx/container_e120_1614074444286_0020_01_000003/tmp/cudf_base5528084761645380967.so: libnvrtc.so.11.0: cannot open shared object file: No such file or directory
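Before digging into permissions, a quick way to see which shared objects a native library needs, and whether the dynamic linker can resolve each one, is ldd: any "not found" line pinpoints the missing dependency. A minimal sketch (using /bin/ls as a stand-in target, since the extracted cudf_base*.so lives in a transient container tmp directory):

```shell
# ldd prints each shared-library dependency and where the dynamic linker
# resolves it; "not found" entries are what cause UnsatisfiedLinkError.
# /bin/ls is a stand-in here; on the failing node you would point ldd at
# the extracted cudf_base*.so from the container's tmp directory.
ldd /bin/ls
```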
On the GPU machine, that file does exist, at /usr/local/cuda-11.0/targets/x86_64-linux/lib/. Since the file was present, the error strongly suggested a permissions problem. Its permissions were 700, and the job is submitted by a non-root user, so I changed them to 755 and resubmitted. Still failed.
The error comes from the jar cudf-0.17-cuda11.jar, in the class NativeDepsLoader.
To rule out the code itself, I wrote a small class that calls its loader method directly on the machine:
public class loadso2 {
    public static void main(String[] args) throws Exception {
        // Trigger loading of the cuDF native libraries and their dependencies
        ai.rapids.cudf.NativeDepsLoader.libraryLoaded();
    }
}
Compile it (note: the cuDF jar goes on the classpath via -cp; -p is the module path and takes its own argument):
javac -cp cudf-0.17-cuda11.jar loadso2.java
Run it:
java -Xbootclasspath/a:cudf-0.17-cuda11.jar:slf4j-api-1.7.30.jar loadso2
The test ran without errors, so the loader code itself has no bug.
I then tried setting the LD_LIBRARY_PATH and CUDA_HOME environment variables; still no luck.
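For reference, one standard way to propagate such variables into the YARN containers is Spark's environment confs, added to the spark-shell command at the top (the conf keys below are real Spark settings; the path assumes a default CUDA 11.0 install):

```shell
# Hypothetical extra flags for the spark-shell invocation above:
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64 \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64 \
```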
Someone then mentioned ldconfig. So I added /usr/local/cuda-11.0/lib64 to the file /etc/ld.so.conf, ran ldconfig, and finally ran ldconfig -p to confirm that libnvrtc.so.11.0 was now in the cache. Resubmitted the job, and it still failed!
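The ldconfig steps above, as one shell sketch (must be run as root; the path assumes a default CUDA 11.0 install):

```shell
# Register the CUDA library directory with the dynamic linker cache.
echo '/usr/local/cuda-11.0/lib64' >> /etc/ld.so.conf   # add the search path
ldconfig                                               # rebuild /etc/ld.so.cache
ldconfig -p | grep libnvrtc                            # confirm libnvrtc.so.11.0 is listed
```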
Finally, I switched to the user that submits the Spark jobs and tried to access the .so file directly: permission denied (even though the file itself was already 777?!).
Walking up the directory tree level by level, I found that /usr/local/cuda-11.0 itself was not accessible to that user. After changing that directory's permissions to 755, the job finally succeeded.
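That directory-by-directory check is easy to script: a user needs the execute (x) bit on every ancestor directory to reach a file, so printing the permissions of the whole chain makes the blocking directory stand out. A small sketch (walk_perms is a hypothetical helper, not part of any tool used above):

```shell
# Print the permissions of a path and of every ancestor directory; a
# directory missing the 'x' (traverse) bit blocks access to everything below.
walk_perms() {
  p=$1
  while [ "$p" != "/" ]; do
    ls -ld "$p"
    p=$(dirname "$p")
  done
  ls -ld /
}

walk_perms /usr/local/cuda-11.0   # run on the GPU node
```

On systems with util-linux installed, namei -l /usr/local/cuda-11.0/lib64/libnvrtc.so.11.0 performs the same walk in a single command.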
Lessons learned: I picked up ldconfig, and this is yet another reminder to avoid doing everything as root on servers...