spark-shell \
--master yarn \
--driver-memory 1G \
--conf spark.executor.memory=1G \
--conf spark.executor.cores=2 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.memory.pinnedPool.size=1G \
--conf spark.locality.wait=0s \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files ${SPARK_RAPIDS_DIR}/getGpusResources.sh \
--jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}
When scheduling GPUs through YARN, submitting a Spark job with spark-rapids kept failing. I tried many approaches before finding the root cause; at the time, neither Google nor Baidu turned up any blog post on this.
The error was as follows (IPs, paths, etc. replaced with xxx):
[2021-02-27 17:46:46.784]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/xxx/spark-archive-3x.zip/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/xxx/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
. | org.apache.spark.internal.Logging.logWarning(Logging.scala:69)
2021-02-27 17:46:47,483 | INFO | [Reporter] | Will request 1 executor container(s), each with 2 core(s) and 2048 MB memory (including 1024 MB of overhead) with custom resources: <memory:2048, vCores:2, yarn.io/gpu: 1> | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,484 | INFO | [Reporter] | Submitted 1 unlocalized container requests. | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,691 | INFO | [Reporter] | Launching container container_e120_1614074444286_0025_01_000006 on host xxx for executor with ID 3 | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,692 | INFO | [Reporter] | Received 1 containers from YARN, launching executors on 1 of them. | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,692 | INFO | [Reporter] | Completed container container_e120_1614074444286_0025_01_000003 on host: xxx (state: COMPLETE, exit status: 1) | org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
2021-02-27 17:46:47,692 | WARN | [Reporter] | Container from a bad node: container_e120_1614074444286_0025_01_000003 on host: xxx. Exit status: 1. Diagnostics: [2021-02-27 17:46:47.489]Exception from container-launch.
Container id: container_e120_1614074444286_0025_01_000003
Exit code: 1
Exception message: Launch container failed
From these logs I guessed that YARN failed to initialize the container because it could not obtain resources, yet the machines had plenty of capacity.
So I opened the YARN web UI, resubmitted the job, and kept refreshing the container log, which eventually surfaced the following:
ERROR | [dispatcher-Executor] | Could not load cudf jni library... | ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:91) java.io.IOException: Error loading dependencies
Caused by: java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: /xxx/container_e120_1614074444286_0020_01_000003/tmp/cudf_base5528084761645380967.so: libnvrtc.so.11.0: cannot open shared object file: No such file or directory
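Before digging into permissions, a quick way to see which shared objects a native library needs, and whether the dynamic linker can resolve each one, is ldd: any "not found" line pinpoints the missing dependency. A minimal sketch (using /bin/ls as a stand-in target, since the extracted cudf_base*.so lives in a transient container tmp directory):

```shell
# ldd prints each shared-library dependency and where the dynamic linker
# resolves it; "not found" entries are what cause UnsatisfiedLinkError.
# /bin/ls is a stand-in here; on the failing node you would point ldd at
# the extracted cudf_base*.so from the container's tmp directory.
ldd /bin/ls
```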
On the GPU machine, that file does exist, at /usr/local/cuda-11.0/targets/x86_64-linux/lib/. Since the file was present, the error strongly suggested a permissions problem. Its permissions were 700, and the job is submitted by a non-root user, so I changed them to 755 and resubmitted. Still failed.
The error comes from the jar cudf-0.17-cuda11.jar, in the class NativeDepsLoader.
To rule out the code itself, I wrote a small class that calls its loader method directly on the machine:
public class loadso2 {
    public static void main(String[] args) throws Exception {
        // Trigger loading of the cuDF native libraries and their dependencies
        ai.rapids.cudf.NativeDepsLoader.libraryLoaded();
    }
}
Compile it (note: the cuDF jar goes on the classpath via -cp; -p is the module path and takes its own argument):
javac -cp cudf-0.17-cuda11.jar loadso2.java
Run it:
java -Xbootclasspath/a:cudf-0.17-cuda11.jar:slf4j-api-1.7.30.jar loadso2
The test ran without errors, so the loader code itself has no bug.
I then tried setting the LD_LIBRARY_PATH and CUDA_HOME environment variables; still no luck.
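For reference, one standard way to propagate such variables into the YARN containers is Spark's environment confs, added to the spark-shell command at the top (the conf keys below are real Spark settings; the path assumes a default CUDA 11.0 install):

```shell
# Hypothetical extra flags for the spark-shell invocation above:
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64 \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64 \
```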
Someone then mentioned ldconfig. So I added /usr/local/cuda-11.0/lib64 to the file /etc/ld.so.conf, ran ldconfig, and finally ran ldconfig -p to confirm that libnvrtc.so.11.0 was now in the cache. Resubmitted the job, and it still failed!
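The ldconfig steps above, as one shell sketch (must be run as root; the path assumes a default CUDA 11.0 install):

```shell
# Register the CUDA library directory with the dynamic linker cache.
echo '/usr/local/cuda-11.0/lib64' >> /etc/ld.so.conf   # add the search path
ldconfig                                               # rebuild /etc/ld.so.cache
ldconfig -p | grep libnvrtc                            # confirm libnvrtc.so.11.0 is listed
```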
Finally, I switched to the user that submits the Spark jobs and tried to access the .so file directly: permission denied (even though the file itself was already 777?!).
Walking up the directory tree level by level, I found that /usr/local/cuda-11.0 itself was not accessible to that user. After changing that directory's permissions to 755, the job finally succeeded.
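That directory-by-directory check is easy to script: a user needs the execute (x) bit on every ancestor directory to reach a file, so printing the permissions of the whole chain makes the blocking directory stand out. A small sketch (walk_perms is a hypothetical helper, not part of any tool used above):

```shell
# Print the permissions of a path and of every ancestor directory; a
# directory missing the 'x' (traverse) bit blocks access to everything below.
walk_perms() {
  p=$1
  while [ "$p" != "/" ]; do
    ls -ld "$p"
    p=$(dirname "$p")
  done
  ls -ld /
}

walk_perms /usr/local/cuda-11.0   # run on the GPU node
```

On systems with util-linux installed, namei -l /usr/local/cuda-11.0/lib64/libnvrtc.so.11.0 performs the same walk in a single command.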
Lessons learned: I picked up ldconfig, and this is yet another reminder to avoid doing everything as root on servers...