Python: How to read a huge text file into memory

龚招

2023-03-14

Question:

I am using Python 2.6 on a Mac Mini with 1GB of RAM. I want to read a huge text file:

$ ls -l links.csv; file links.csv; tail links.csv 
-rw-r--r--  1 user  user  469904280 30 Nov 22:42 links.csv
links.csv: ASCII text, with CRLF line terminators
4757187,59883
4757187,99822
4757187,66546
4757187,638452
4757187,4627959
4757187,312826
4757187,6143
4757187,6141
4757187,3081726
4757187,58197

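So each line of the file consists of a tuple of two comma-separated integer values. I want to read in the whole file and sort it by the second column. I know I could do the sorting without reading the entire file into memory. But I thought that for a 500MB file I should still be able to do it in memory, since I have 1GB available.

However, when I try to read the file, Python seems to allocate far more memory than the file needs on disk. So even with 1GB of RAM I cannot read the 500MB file into memory. The Python code I use to read the file and print some information about memory consumption is: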

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

infile=open("links.csv", "r")

edges=[]
count=0
#count the total number of lines in the file
for line in infile:
 count=count+1

total=count
print "Total number of lines: ",total

infile.seek(0)
count=0
for line in infile:
 edge=tuple(map(int,line.strip().split(",")))
 edges.append(edge)
 count=count+1
 # for every million lines print memory consumption
 if count%1000000==0:
  print "Position: ", edge
  print "Read ",float(count)/float(total)*100,"%."
  mem=sys.getsizeof(edges)
  for edge in edges:
   mem=mem+sys.getsizeof(edge)
   for node in edge:
    mem=mem+sys.getsizeof(node)

  print "Memory (Bytes): ", mem

The output I get is:

Total number of lines:  30609720
Position:  (9745, 2994)
Read  3.26693612356 %.
Memory (Bytes):  64348736
Position:  (38857, 103574)
Read  6.53387224712 %.
Memory (Bytes):  128816320
Position:  (83609, 63498)
Read  9.80080837067 %.
Memory (Bytes):  192553000
Position:  (139692, 1078610)
Read  13.0677444942 %.
Memory (Bytes):  257873392
Position:  (205067, 153705)
Read  16.3346806178 %.
Memory (Bytes):  320107588
Position:  (283371, 253064)
Read  19.6016167413 %.
Memory (Bytes):  385448716
Position:  (354601, 377328)
Read  22.8685528649 %.
Memory (Bytes):  448629828
Position:  (441109, 3024112)
Read  26.1354889885 %.
Memory (Bytes):  512208580

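After reading only 25% of the 500MB file, Python already consumes 500MB. So storing the file's contents as a list of int tuples does not seem to be very memory efficient. Is there a better way to read my 500MB file into 1GB of memory?

Answer:

There is a recipe on this page for sorting files larger than RAM, though you would have to adapt it for your case involving CSV-format data. There are also links to additional resources there.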
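
To illustrate the general idea of sorting a file that does not fit in memory, here is a minimal sketch of an external merge sort keyed on the second CSV column. It is not the recipe from the page mentioned above; the names `external_sort` and `chunk_lines` are made up for this example. It is written for Python 3.5+ (unlike the Python 2.6 code in the question), assumes every non-blank line looks like `int,int`, and writes the result with plain `\n` line endings.

```python
import heapq
import itertools
import os
import tempfile


def _second_column(line):
    # int() tolerates the trailing newline left on each line.
    return int(line.split(",")[1])


def external_sort(src_path, dst_path, chunk_lines=1000000):
    """Sort a two-column integer CSV by its second column.

    Hypothetical helper: reads chunk_lines lines at a time, sorts each
    chunk in memory, writes it to a temporary "run" file, then merges
    all runs lazily with heapq.merge.
    """
    run_paths = []
    try:
        with open(src_path, "r", newline="") as src:
            while True:
                raw = list(itertools.islice(src, chunk_lines))
                if not raw:
                    break
                # Drop blank lines and normalise CRLF / missing newline to "\n".
                chunk = [ln.rstrip("\r\n") + "\n" for ln in raw if ln.strip()]
                if not chunk:
                    continue
                chunk.sort(key=_second_column)
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w", newline="") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        runs = [open(p, "r", newline="") for p in run_paths]
        try:
            with open(dst_path, "w", newline="") as dst:
                # heapq.merge keeps only one pending line per run in memory.
                dst.writelines(heapq.merge(*runs, key=_second_column))
        finally:
            for run in runs:
                run.close()
    finally:
        for path in run_paths:
            os.remove(path)
```

A call such as `external_sort("links.csv", "links_sorted.csv")` would keep at most `chunk_lines` lines in memory at once. With about 30 million lines and the default chunk size that means roughly 31 run files open during the merge; a smaller chunk size trades memory for more open files.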

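Edit: True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than the available RAM. For one thing, your own program does not get the entire 1GB (OS overhead and so on). For another, even if you stored the data in the most compact form available in pure Python (two lists of integers, assuming a 32-bit machine and so on), you would be using 934MB for those 30M pairs of integers.

Using numpy you can also do the job, using only about 250MB. Loading this way is not particularly fast, since you have to count the lines and pre-allocate the array, but it may be the fastest actual sort given that it is in memory: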
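
As a rough illustration of why the object-based representation costs so much more than the raw data, the sketch below compares a list of `(int, int)` tuples with two packed arrays from the standard-library `array` module. The helper names `tuple_list_bytes` and `packed_bytes` are invented for this example, the code targets Python 3, and it assumes the platform's `unsigned int` is at least 32 bits. The exact numbers depend on the interpreter version and build, so none are quoted here.

```python
import sys
from array import array


def tuple_list_bytes(pairs):
    """Approximate bytes for a list of (int, int) tuples.

    Counts the list, every tuple and every int, like the question's
    measurement loop (small cached ints are therefore overcounted).
    """
    total = sys.getsizeof(pairs)
    for pair in pairs:
        total += sys.getsizeof(pair)
        total += sum(sys.getsizeof(n) for n in pair)
    return total


def packed_bytes(pairs):
    """Bytes used when each column is stored in a packed unsigned-int array."""
    first = array("I", (p[0] for p in pairs))
    second = array("I", (p[1] for p in pairs))
    return sys.getsizeof(first) + sys.getsizeof(second)


if __name__ == "__main__":
    # Synthetic pairs in roughly the same value range as links.csv.
    sample = [(4000000 + i, 1000 + 7 * i) for i in range(100000)]
    n = len(sample)
    print("list of tuples: %.1f bytes per pair" % (tuple_list_bytes(sample) / n))
    print("packed arrays:  %.1f bytes per pair" % (packed_bytes(sample) / n))
```

The packed arrays store only the raw integer values, which is essentially what the numpy record array in this answer does as well.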

import time
import numpy as np
import csv

start = time.time()
def elapsed():
    return time.time() - start

# count data rows, to preallocate array
f = open('links.csv', 'rb')
def count(f):
    while 1:
        block = f.read(65536)
        if not block:
            break
        yield block.count(',')

linecount = sum(count(f))
print '\n%.3fs: file has %s rows' % (elapsed(), linecount)

# pre-allocate array and load data into array
m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])
f.seek(0)
f = csv.reader(open('links.csv', 'rb'))
for i, row in enumerate(f):
    m[i] = int(row[0]), int(row[1])

print '%.3fs: loaded' % elapsed()
# sort in-place
m.sort(order='b')

print '%.3fs: sorted' % elapsed()

Output on my machine, with a sample file similar to what you showed:

6.139s: file has 33253213 lines
238.130s: read into memory
517.669s: sorted

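The default in numpy is Quicksort. The ndarray.sort() routine (which sorts in place) can also take the keyword argument kind="mergesort" or kind="heapsort", but it appears that neither of these is capable of sorting a record array. Incidentally, I used a record array because it was the only way I could see to sort the columns together, as opposed to the default, which would sort them independently (totally messing up your data).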
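
The difference between sorting records together and sorting columns independently can be seen on a three-row stand-in for the data. This is only a small sketch for Python 3 with a reasonably recent numpy; the variable names are made up for the example, and the `argsort` variant is an alternative not used in the answer's code (its fancy-indexing step creates a copy, so it needs extra memory).

```python
import numpy as np

# Tiny stand-in for links.csv: (first column, second column) pairs.
rows = [(1, 30), (2, 10), (3, 20)]

# Record array with the same dtype as above: records move as a unit.
m = np.array(rows, dtype=[('a', np.uint32), ('b', np.uint32)])
m.sort(order='b')  # in place, keyed on field 'b'

# Plain 2-D array: sorting along axis 0 sorts each column on its own,
# so the original pairs are torn apart.
plain = np.array(rows, dtype=np.uint32)
scrambled = np.sort(plain, axis=0)

# Alternative for a plain 2-D array: reorder whole rows using the
# argsort of the second column (this builds a new array).
by_second = plain[plain[:, 1].argsort(kind='stable')]

print(m.tolist())          # records ordered by 'b', pairs intact
print(scrambled.tolist())  # columns sorted independently
print(by_second.tolist())  # rows ordered by column 1, pairs intact
```

Here `scrambled` contains the row `[1, 10]`, a pair that never existed in the input, which is exactly the kind of corruption the record array avoids.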
