
A Summary of Problems When Training DL4J on a GPU

金阳曜
2023-12-01

1. Conclusions first

1.1 Configurations tested and confirmed to work

Configuration 1:

(1) GPU: GTX1050Ti

(2) System environment: Windows 10, cuda=9.2
(3) pom dependencies: cuda=9.2, nd4j=1.0.0-beta6

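Configuration 2:

(1) GPU: RTX3080

(2) System environment: Windows 10, cuda=11.2 or cuda=11.6
(3) pom dependencies: cuda=11.2, nd4j=1.0.0-M1.1. (Do not use 1.0.0-M1 here: it fails with the error shown below, a bug that no longer occurs in M1.1. Do not use 1.0.0-M2 either: although nd4j-cuda-11.2-platform is available up to 1.0.0-M2, deeplearning4j-cuda-11.2 is only available up to 1.0.0-M1.1.)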

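Note: this shows that when the CUDA major version (the number before the first dot in the version) is the same, the CUDA minor version installed on the system and the one used in pom.xml may differ.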
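The note above can be sketched as a small pre-flight check. This is only an illustration of the author's observation, not a guarantee from ND4J; the class `CudaVersionCheck` and its methods are hypothetical names, not part of ND4J or DL4J.

```java
/** Hypothetical helper, not part of ND4J or DL4J. */
final class CudaVersionCheck {

    private CudaVersionCheck() {}

    /** Returns the major version: the digits before the first dot, e.g. 11 for "11.6". */
    static int major(String version) {
        String trimmed = version.trim();
        int dot = trimmed.indexOf('.');
        String head = dot < 0 ? trimmed : trimmed.substring(0, dot);
        return Integer.parseInt(head);
    }

    /** True when the installed CUDA and the CUDA named in pom.xml share a major version. */
    static boolean sameMajor(String systemCuda, String pomCuda) {
        return major(systemCuda) == major(pomCuda);
    }
}
```

For example, `CudaVersionCheck.sameMajor("11.6", "11.2")` is true, which matches working Configuration 2, while `CudaVersionCheck.sameMajor("11.6", "10.2")` is false, which matches failing case (2) in section 1.2.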

1.2 Configurations that fail

(1) System environment cuda=11.2, with cuda=11.2 and nd4j=1.0.0-M1 in pom.xml

or system environment cuda=11.6, with cuda=11.2 and nd4j=1.0.0-M1 in pom.xml

Both combinations (the first was tested on a laptop) produce the following error log:

[main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
[main] ERROR org.deeplearning4j.common.config.DL4JClassLoading - Cannot create instance of class 'org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper'.
java.lang.NoSuchMethodException: org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.<init>(java.lang.Class, [Ljava.lang.Object;)
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.getDeclaredConstructor(Class.java:2178)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:103)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:89)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:74)
	at org.deeplearning4j.nn.layers.HelperUtils.createHelper(HelperUtils.java:57)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.initializeHelper(LSTM.java:53)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.<init>(LSTM.java:49)
	at org.deeplearning4j.nn.conf.layers.LSTM.instantiate(LSTM.java:78)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:714)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:604)
	at zj.rnn.effectiveness.train.wordvector.TestWordVector.main(TestWordVector.java:89)
Exception in thread "main" java.lang.RuntimeException: java.lang.NoSuchMethodException: org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.<init>(java.lang.Class, [Ljava.lang.Object;)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:108)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:89)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:74)
	at org.deeplearning4j.nn.layers.HelperUtils.createHelper(HelperUtils.java:57)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.initializeHelper(LSTM.java:53)
	at org.deeplearning4j.nn.layers.recurrent.LSTM.<init>(LSTM.java:49)
	at org.deeplearning4j.nn.conf.layers.LSTM.instantiate(LSTM.java:78)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:714)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:604)
	at zj.rnn.effectiveness.train.wordvector.TestWordVector.main(TestWordVector.java:89)
Caused by: java.lang.NoSuchMethodException: org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.<init>(java.lang.Class, [Ljava.lang.Object;)
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.getDeclaredConstructor(Class.java:2178)
	at org.deeplearning4j.common.config.DL4JClassLoading.createNewInstance(DL4JClassLoading.java:103)
	... 9 more

Process finished with exit code 1

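(2) System environment cuda=11.6, with cuda=10.2 and nd4j=1.0.0-beta7 in pom.xml

This error is caused by the CUDA and cuDNN versions installed on the system not matching those in pom.xml. Another explanation that has been suggested is that the RTX3080's compute capability is too high for CUDA 10.2.

Fix: upgrade to cuda=11.2 and nd4j=1.0.0-M1.1.

Error log for this configuration: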


[main] WARN org.nd4j.linalg.factory.Nd4jBackend - Skipped [JCublasBackend] backend (unavailable): java.lang.UnsatisfiedLinkError: C:\Users\A\.javacpp\cache\rnn-effective-0.0.1-bin.jar\org\bytedeco\cuda\windows-x86_64\jnicudart.dll: Can't find dependent libraries
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable$Builder.<init>(InMemoryLookupTable.java:637)
        at org.deeplearning4j.models.sequencevectors.SequenceVectors$Builder.presetTables(SequenceVectors.java:941)
        at org.deeplearning4j.models.word2vec.Word2Vec$Builder.build(Word2Vec.java:615)
        at zj.rnn.effectiveness.util.PrepareWordVector.trainWordVector(PrepareWordVector.java:133)
        at zj.rnn.effectiveness.train.wordvector.RnnClassifyWithTrainWordVector.main(RnnClassifyWithTrainWordVector.java:64)
Caused by: java.lang.RuntimeException: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5094)
        at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
        ... 5 more
Caused by: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
        at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:221)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5091)
        ... 6 more

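(3) System environment cuda=10.2, with cuda=10.2 and nd4j=1.0.0-beta7 in pom.xml

The error occurs when reading the word vectors, even though they were saved and loaded with the matching pair of methods. Switching to the newer cuda=11.2 and nd4j=1.0.0-M1.1 resolved all of these problems.

The code that trains and saves the word vectors: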
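import java.io.File;
import java.io.FileNotFoundException;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

        // 1. Train the word vectors
        SentenceIterator iter = null;
        try {
            iter = new BasicLineIterator(hanLpFilePath);
            TokenizerFactory t = new DefaultTokenizerFactory();
            Word2Vec vec = new Word2Vec.Builder().minWordFrequency(3) // minimum number of times a word must occur in the text (the whole training corpus, independent of the window size); for short texts, set it so that a single occurrence is enough
                    .epochs(5) // number of training epochs
                    .layerSize(wordVectorSize) // dimensionality of each word vector
                    .seed(42).windowSize(8) // context window size: each word considers the 8 words before and the 8 words after it; unrelated to the minimum word frequency
                    .iterate(iter).tokenizerFactory(t).build();
            vec.fit();
            // Save the word vectors
            WordVectorSerializer.writeWord2VecModel(vec, vectorPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        // 2. Read the word vectors back
        WordVectors wordVectors = WordVectorSerializer.readWord2VecModel(new File(vectorPath));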

[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [16]; Memory: [7.1GB];
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 10.2.89
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3080]; cc: [8.6]; Total memory: [10736893952]
[main] ERROR org.deeplearning4j.models.embeddings.loader.WordVectorSerializer - Cannot read binary model
U             syn0.txt\???[q??????χH??B     &??Rw?L?#,?#E??O?ZUk)q?7s?9???CZ?j??9????????k??9?????Zf???3??s??Yu?}V?{??U???~??[??g???m?y????m??????Y??z???z??_????r?~????[W?{?V????7?=G??L?????m?~{?]?????SN)k?>&???e???)s???Vj[?6}?,z????}?y[ie?~??zic???\K??G??????????/??N?E?X{???????????:???\????????Z??T????????f/?\???n|s??????????o?1?.???j??7k?1?V?????+u7?3???z?z?^J??q?v?/??j??u???;?E?(??U??V???/K+Z?,K???t?o{??E?d?it??g??7'*7u??G:??m?V??j?v??;??,?~??1"
        at java.lang.NumberFormatException.forInputString(Unknown Source)
        at java.lang.Integer.parseInt(Unknown Source)
        at java.lang.Integer.parseInt(Unknown Source)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readBinaryModel(WordVectorSerializer.java:278)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2444)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2426)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2413)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2372)
        at maotiao.train.wordvector.rnn.RnnClassifyWordVector.main(RnnClassifyWordVector.java:79)
[main] ERROR org.deeplearning4j.models.embeddings.loader.WordVectorSerializer - Unable to guess input file format
java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2447)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readAsBinaryNoLineBreaks(WordVectorSerializer.java:2426)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2413)
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2372)
        at maotiao.train.wordvector.rnn.RnnClassifyWordVector.main(RnnClassifyWordVector.java:79)
Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2416) 
        at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2372) 
        at maotiao.train.wordvector.rnn.RnnClassifyWordVector.main(RnnClassifyWordVector.java:79)
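When the loader reports "Unable to guess input file format", one quick sanity check is whether the saved model file is intact at all. The `syn0.txt` entry name visible in the garbled log text suggests the file is a ZIP archive; that is an assumption drawn from the log, not something this article verifies. A minimal sketch using only the JDK follows; `ModelFileCheck` and its methods are hypothetical names, not DL4J APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical diagnostic helper, not part of DL4J. */
final class ModelFileCheck {

    private ModelFileCheck() {}

    /** True when the bytes start with the ZIP local-file-header signature 'P' 'K' 0x03 0x04. */
    static boolean hasZipSignature(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }

    /** Reads the first four bytes of the file and checks them for the ZIP signature. */
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] head = new byte[4];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        return read == head.length && hasZipSignature(head);
    }
}
```

If `ModelFileCheck.looksLikeZip(Paths.get(vectorPath))` returns true, the file on disk at least has the expected container signature, which points toward the library version rather than a truncated or corrupted file. It does not prove the archive contents are valid.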

2. Matching CUDA versions to GPUs

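For how GPUs and CUDA versions correspond, see the article "英伟达显卡、cuda、cudnn、tensorflow-gpu、torch-gpu版本对应关系" (version correspondence between NVIDIA GPUs, CUDA, cuDNN, tensorflow-gpu, and torch-gpu).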

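Note: the mappings on the official site all give the highest matching version. For example, the highest CUDA version matching the RTX3080 is 11.7, so any cuda <= 11.7 can be installed. However, a version below 11 may not match the GPU's compute capability (CC, as listed by NVIDIA for its supported GPUs), and this can also cause errors during model training.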

The author installed cuda11.6, cuda11.2, and cuda10.2 side by side on the RTX3080 desktop, and cuda9.2 and cuda9.0 side by side on the GTX1050Ti machine.
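The rule of thumb in this section (for the RTX3080, stay at or below 11.7 but not below 11) can be written as a numeric range check. The bounds used below are the author's figures from this article, not values taken from NVIDIA documentation, and `CudaRange` is a hypothetical helper. Versions are compared numerically because a plain string comparison would sort "9.2" after "10.2".

```java
/** Hypothetical helper that compares "major.minor" CUDA version strings numerically. */
final class CudaRange {

    private CudaRange() {}

    /** Splits "11.2" into {11, 2}; a missing minor part is treated as 0. */
    private static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return new int[] {major, minor};
    }

    /** Negative, zero, or positive when a is lower than, equal to, or higher than b. */
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        if (x[0] != y[0]) {
            return Integer.compare(x[0], y[0]);
        }
        return Integer.compare(x[1], y[1]);
    }

    /** True when min <= version <= max. */
    static boolean within(String version, String min, String max) {
        return compare(version, min) >= 0 && compare(version, max) <= 0;
    }
}
```

With the article's bounds for the RTX3080, `CudaRange.within("11.2", "11.0", "11.7")` is true and `CudaRange.within("10.2", "11.0", "11.7")` is false.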

3. Dependencies required for DL4J to train on a GPU

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>maotiao-classify-gpu</artifactId>

    <properties>
         <!-- GPU environment 3 -->
       <!--  <nd4j.version>1.0.0-beta6</nd4j.version>
         <dl4j.version>1.0.0-beta6</dl4j.version>
         <cuda.version>9.2</cuda.version>-->
        <!-- GPU environment 2 -->
       <!--  <nd4j.version>1.0.0-beta6</nd4j.version>
         <dl4j.version>1.0.0-beta6</dl4j.version>
         <cuda.version>10.2</cuda.version>-->
        <!-- GPU environment 1 -->
        <nd4j.version>1.0.0-M1.1</nd4j.version>
        <dl4j.version>1.0.0-M1.1</dl4j.version>
        <cuda.version>11.2</cuda.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.7.1</version>
        </dependency>
        <!-- Read .xls Excel files -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.13</version>
        </dependency>
        <!-- Read .xlsx Excel files -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.13</version>
        </dependency>
        <!-- End of Excel-reading dependencies -->


        <!-- CPU dependencies: start -->
        <!--<dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-native-platform</artifactId>
            <version>${nd4j.version}</version>
        </dependency>-->
        <!-- CPU dependencies: end -->

        <!-- GPU dependencies: start -->
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-${cuda.version}-platform</artifactId>
            <version>${nd4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-cuda-${cuda.version}</artifactId>
            <version>${dl4j.version}</version>
        </dependency>
        <!-- GPU dependencies: end -->

        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-core</artifactId>
            <version>${dl4j.version}</version>
        </dependency>

        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-nlp</artifactId>
            <version>${dl4j.version}</version>
        </dependency>
    </dependencies>

    <version>0.0.1</version>
    <groupId>com.tianque</groupId>

    <build>
        <finalName>${project.artifactId}</finalName>
        <plugins>
            <!-- Resource-copying plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>2.7</version>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Java compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
      
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.4.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <shadedArtifactAttached>true</shadedArtifactAttached>
                    <shadedClassifierName>bin</shadedClassifierName>
                    <createDependencyReducedPom>true</createDependencyReducedPom>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>org/datanucleus/**</exclude>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>

                </configuration>

                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

    </build>

</project>
