Question:

How do I resolve an "out of memory" error when running Keras on HPC (Argon)?

严昀
2023-03-14

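I have a ConvLSTM neural network written in Keras. I submitted the same code to two queues on the cluster, one GPU and one CPU. The job runs on the CPU queue, but on the GPU queue I get an error. Here is one line copied from the error file: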

"W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.12MiB.  Current allocation summary follows."

Error file:

Using TensorFlow backend.
2018-04-05 17:39:59.059431: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-04-05 17:40:00.220946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:81:00.0
totalMemory: 15.90GiB freeMemory: 332.94MiB
2018-04-05 17:40:00.221266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:81:00.0, compute capability: 6.0)
/opt/apps/python/2.7.14_openmpi-2.1.2_parallel_studio-2017.4/lib/python2.7/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype uint8 was converted to float64 by MinMaxScaler.
  warnings.warn(msg, DataConversionWarning)
2018-04-05 17:40:50.577736: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.12MiB.  Current allocation summary follows.
2018-04-05 17:40:50.578144: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (256):   Total Chunks: 296, Chunks in use: 294. 74.0KiB allocated for chunks. 73.5KiB in use in bin. 9.3KiB client-requested in use in bin.
2018-04-05 17:40:50.578167: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (512):   Total Chunks: 39, Chunks in use: 39. 22.0KiB allocated for chunks. 22.0KiB in use in bin. 16.1KiB client-requested in use in bin.
2018-04-05 17:40:50.578179: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578192: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2048):  Total Chunks: 14, Chunks in use: 14. 36.8KiB allocated for chunks. 36.8KiB in use in bin. 34.5KiB client-requested in use in bin.
2018-04-05 17:40:50.578203: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578216: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (8192):  Total Chunks: 62, Chunks in use: 61. 882.2KiB allocated for chunks. 869.2KiB in use in bin. 857.8KiB client-requested in use in bin.
2018-04-05 17:40:50.578228: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (16384):     Total Chunks: 13, Chunks in use: 12. 223.0KiB allocated for chunks. 198.8KiB in use in bin. 190.1KiB client-requested in use in bin.
2018-04-05 17:40:50.578239: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (32768):     Total Chunks: 46, Chunks in use: 46. 2.53MiB allocated for chunks. 2.53MiB in use in bin. 2.53MiB client-requested in use in bin.
2018-04-05 17:40:50.578251: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (65536):     Total Chunks: 168, Chunks in use: 168. 13.19MiB allocated for chunks. 13.19MiB in use in bin. 13.10MiB client-requested in use in bin.
2018-04-05 17:40:50.578263: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (131072):    Total Chunks: 1, Chunks in use: 1. 135.8KiB allocated for chunks. 135.8KiB in use in bin. 80.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578276: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (262144):    Total Chunks: 243, Chunks in use: 243. 76.74MiB allocated for chunks. 76.74MiB in use in bin. 75.94MiB client-requested in use in bin.
2018-04-05 17:40:50.578287: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (524288):    Total Chunks: 3, Chunks in use: 3. 1.64MiB allocated for chunks. 1.64MiB in use in bin. 960.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578297: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1048576):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578309: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2097152):   Total Chunks: 4, Chunks in use: 4. 12.50MiB allocated for chunks. 12.50MiB in use in bin. 12.50MiB client-requested in use in bin.
2018-04-05 17:40:50.578336: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578348: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (8388608):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578358: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (16777216):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578367: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578376: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (67108864):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578386: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578395: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578406: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin for 3.12MiB was 2.00MiB, Chunk State: 
2018-04-05 17:40:50.578417: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000000 of size 1280
2018-04-05 17:40:50.578426: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000500 of size 256
2018-04-05 17:40:50.578433: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000600 of size 256
2018-04-05 17:40:50.578440: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000700 of size 57600
2018-04-05 17:40:50.578448: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00e800 of size 512
2018-04-05 17:40:50.578456: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ea00 of size 768
2018-04-05 17:40:50.578464: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ed00 of size 256
2018-04-05 17:40:50.578471: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ee00 of size 256
2018-04-05 17:40:50.578478: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ef00 of size 256
2018-04-05 17:40:50.578485: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f000 of size 256
2018-04-05 17:40:50.578493: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f100 of size 256
2018-04-05 17:40:50.578500: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f200 of size 256
2018-04-05 17:40:50.578507: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f300 of size 256
2018-04-05 17:40:50.578514: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f400 of size 256
2018-04-05 17:40:50.578522: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f500 of size 256
2018-04-05 17:40:50.578529: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f600 of size 57600
2018-04-05 17:40:50.578536: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01d700 of size 512
2018-04-05 17:40:50.578544: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01d900 of size 3072
2018-04-05 17:40:50.578551: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01e500 of size 57600
2018-04-05 17:40:50.578559: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02c600 of size 512
2018-04-05 17:40:50.578571: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02c800 of size 768
2018-04-05 17:40:50.578579: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cb00 of size 256
2018-04-05 17:40:50.578586: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cc00 of size 256
2018-04-05 17:40:50.578593: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cd00 of size 256
2018-04-05 17:40:50.578600: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02ce00 of size 256
2018-04-05 17:40:50.578607: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cf00 of size 256
2018-04-05 17:40:50.578614: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d000 of size 256
2018-04-05 17:40:50.578622: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d100 of size 256
2018-04-05 17:40:50.578629: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d200 of size 14592
2018-04-05 17:40:50.578637: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030b00 of size 256
2018-04-05 17:40:50.578644: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030c00 of size 256
2018-04-05 17:40:50.578652: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030d00 of size 256
2018-04-05 17:40:50.578659: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030e00 of size 256
2018-04-05 17:40:50.578666: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030f00 of size 256
2018-04-05 17:40:50.578673: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031000 of size 256
2018-04-05 17:40:50.578681: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031100 of size 256
2018-04-05 17:40:50.578688: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031200 of size 256
2018-04-05 17:40:50.578695: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031300 of size 512
2018-04-05 17:40:50.578702: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031500 of size 14592
2018-04-05 17:40:50.578709: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c034e00 of size 256
2018-04-05 17:40:50.578717: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c034f00 of size 256
2018-04-05 17:40:50.578724: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035000 of size 256
2018-04-05 17:40:50.578731: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035100 of size 256
2018-04-05 17:40:50.578738: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035200 of size 256
2018-04-05 17:40:50.578746: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035300 of size 256
2018-04-05 17:40:50.578753: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035400 of size 256
2018-04-05 17:40:50.578760: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035500 of size 256
2018-04-05 17:40:50.578767: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035600 of size 512
2018-04-05 17:40:50.578775: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035800 of size 23296
2018-04-05 17:40:50.578782: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c03b300 of size 57600
2018-04-05 17:40:50.578789: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c049400 of size 512
2018-04-05 17:40:50.578797: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c049600 of size 57600
2018-04-05 17:40:50.578804: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c057700 of size 57600
2018-04-05 17:40:50.578811: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065800 of size 256
2018-04-05 17:40:50.578823: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065900 of size 256
2018-04-05 17:40:50.578830: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065a00 of size 256
2018-04-05 17:40:50.578838: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065b00 of size 256
2018-04-05 17:40:50.578845: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065c00 of size 256
2018-04-05 17:40:50.578852: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065d00 of size 256
2018-04-05 17:40:50.578859: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065e00 of size 256
2018-04-05 17:40:50.578867: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065f00 of size 256
2018-04-05 17:40:50.578874: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c066000 of size 512
2018-04-05 17:40:50.578881: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c066200 of size 14592
2018-04-05 17:40:50.578888: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c069b00 of size 256
2018-04-05 17:40:50.578896: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c069c00 of size 256

1 answer

冯星剑
2023-03-14

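TensorFlow on a CPU loads the data into host memory (RAM), whereas TensorFlow on a GPU has to load it into GPU memory. That is most likely the cause of your error. You can try reducing the batch size.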
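As a rough illustration of the batch-size suggestion, the sketch below reads the `totalMemory` / `freeMemory` line from the error file in the question and estimates an upper bound on the batch size that would fit in the free GPU memory. The log reports only 332.94MiB free out of 15.90GiB when TensorFlow starts, which may mean that something else is already occupying the device; that is a guess, not something the log confirms. The sample shape `(10, 64, 64, 1)`, the helper names, and the `overhead_factor` multiplier (a stand-in for activations, gradients, and workspace) are all assumptions made for illustration and are not taken from the original code.

```python
import re

# Device line copied from the error file in the question.
DEVICE_LINE = "totalMemory: 15.90GiB freeMemory: 332.94MiB"

UNIT_BYTES = {"KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def parse_memory(line):
    """Parse 'totalMemory' and 'freeMemory' (in bytes) from a TensorFlow device line."""
    found = re.findall(r"(totalMemory|freeMemory): ([\d.]+)(KiB|MiB|GiB)", line)
    return {name: float(value) * UNIT_BYTES[unit] for name, value, unit in found}


def bytes_per_sample(sample_shape, bytes_per_value=4):
    """Bytes needed to store one input sample (4 bytes per value means float32)."""
    count = 1
    for dim in sample_shape:
        count *= dim
    return count * bytes_per_value


def largest_batch(budget_bytes, sample_shape, bytes_per_value=4, overhead_factor=10):
    """Largest batch whose inputs, multiplied by overhead_factor, fit in the budget.

    overhead_factor is a hypothetical multiplier; the real overhead depends
    on the model and has to be found by experiment.
    """
    per_sample = bytes_per_sample(sample_shape, bytes_per_value) * overhead_factor
    return int(budget_bytes // per_sample)


memory = parse_memory(DEVICE_LINE)
free_fraction = memory["freeMemory"] / memory["totalMemory"]

# Hypothetical ConvLSTM sample: 10 frames of 64x64 single-channel images.
sample_shape = (10, 64, 64, 1)
batch_size = largest_batch(memory["freeMemory"], sample_shape)

print("free fraction of GPU memory: {:.1%}".format(free_fraction))
print("estimated upper bound on batch_size:", batch_size)
```

The resulting number is only a starting point: pass a value at or below it as `batch_size` to `model.fit` and lower it further if the allocator still reports that it ran out of memory. If the free fraction stays this small even before the model is built, it is worth checking whether another job is using the same GPU, since reducing the batch size alone may not be enough in that case.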
