Improving the speed of reading and converting data from a binary file?

司徒钱青

2023-03-14

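Question:

I know there have been earlier questions about file reading, binary data handling and integer conversion with struct, so I am asking here about a piece of code that I think takes too much time. The file being read is a multichannel recording of data samples (short integers), with the channels' data interleaved in intervals (hence the nested for statements). The code is as follows: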

# channel_content is a dictionary; channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        # decode one 16-bit sample per read and append it to this channel
        channel_content[channel]['recording'].extend(
            [struct.unpack("h", f.read(2))[0]
             for iteration in xrange(int(channel_content[channel]['nsamples']))])

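With this code, on a dual-core machine with 2Mb RAM, I get 2.2 seconds per megabyte read, and my files typically have 20+ Mb, which causes a very annoying delay (especially considering that another benchmark shareware program I am trying to mirror loads the file much faster).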

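What I would like to know:

Whether there is some violation of "good practice": badly arranged loops, repeated operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
Whether this reading speed is normal in general, or normal for Python.
Whether creating a C++ compiled extension would be likely to improve performance, and whether it would be a recommended approach.
(Of course) whether anyone can suggest some modification to this code, preferably based on previous experience with similar operations.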

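Thanks for reading

(I have already posted a few questions about this job of mine. I hope they are all conceptually unrelated, and I also hope I am not being too repetitive.)

Edit: channel_names is a list, so I made the correction suggested by @eumiro (removed the mistyped brackets).

Edit:
I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I gladly thank everyone who answered.

Final form, after calling array.fromfile() once and then alternately extending one array per channel by slicing the big array: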

fullsamples = array('h')
# f.tell() is a byte offset, so subtract it before dividing by the item size
fullsamples.fromfile(f, (os.path.getsize(f.filename) - f.tell()) // fullsamples.itemsize)
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
                                                fullsamples[position:position+samples])
        position += samples

The speed improvement was very impressive compared with reading the file a bit at a time or using struct in any form.
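For readers who want to try the slicing approach outside the original class, here is a minimal self-contained sketch in Python 3. The function name `split_channels`, its parameters, and the tiny in-memory "file" are all hypothetical; the layout assumed is the one described in the question (each record stores every channel's block back to back, in a fixed channel order).

```python
import io
from array import array

def split_channels(f, nrecs, nsamples):
    """Read nrecs records of 16-bit samples from f and split them per channel.

    nsamples maps channel name -> samples per record (hypothetical layout:
    within a record, the channels' blocks follow each other in dict order).
    """
    samples_per_record = sum(nsamples.values())
    full = array('h')
    # One bulk read and decode instead of one f.read(2) per sample.
    full.fromfile(f, nrecs * samples_per_record)
    recordings = {name: array('h') for name in nsamples}
    position = 0
    for _ in range(nrecs):
        for name, count in nsamples.items():
            recordings[name].extend(full[position:position + count])
            position += count
    return recordings

# Tiny in-memory "file": 2 records, channel A has 2 samples/record, B has 1.
demo = io.BytesIO(array('h', [1, 2, 10, 3, 4, 20]).tobytes())
result = split_channels(demo, nrecs=2, nsamples={'A': 2, 'B': 1})
print(result['A'].tolist(), result['B'].tolist())
```

The sketch relies on dictionaries preserving insertion order (Python 3.7+); with an older interpreter, a list of (name, count) pairs would be the safer choice.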

Answer:

You could use array to read your data:

import array
import os

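fn = 'data.bin'
a = array.array('h')  # 'h' = signed short, native byte order
# read every 16-bit item in the file with a single call
with open(fn, 'rb') as f:
    a.fromfile(f, os.path.getsize(fn) // a.itemsize)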

It is 40 times faster than the struct.unpack approach from @samplebias's answer.
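One detail worth checking when switching to array: it always decodes in the machine's native byte order. The question does not say which byte order the recording files use, so the sketch below simply assumes little-endian data (a hypothetical choice) and writes a small temporary file to show the round trip, swapping bytes only on big-endian hosts.

```python
import os
import sys
import tempfile
from array import array

values = [0, 1, -1, 32767, -32768]

# Write the samples as little-endian 16-bit integers (assumed file format).
out = array('h', values)
if sys.byteorder != 'little':
    out.byteswap()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(out.tobytes())
    path = tmp.name

try:
    a = array('h')
    with open(path, 'rb') as f:
        a.fromfile(f, os.path.getsize(path) // a.itemsize)
    # array decodes in native byte order, so swap on big-endian hosts.
    if sys.byteorder != 'little':
        a.byteswap()
finally:
    os.remove(path)

print(a.tolist())
```

If the files turn out to be big-endian instead, the two `sys.byteorder` checks would be inverted.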

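If the files are only 20-30M, why not read the entire file, decode all the numbers in a single call to unpack, and then distribute them among your channels by iterating over the result: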

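import struct

with open('data.bin', 'rb') as f:
    data = f.read()
# one 'h' per 16-bit value; the count must be parenthesized before '%'
values = struct.unpack('%dh' % (len(data) // 2), data)
del data
# iterate over channels, and assign from values using indices/slices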

A quick test showed that this gives a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
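The speedups quoted here depend on the machine and the Python version, so it may be worth measuring on your own setup. The following is a rough sketch of such a comparison in Python 3; the function names and the synthetic data are made up for illustration, and because everything stays in memory it measures decoding cost only, not disk I/O.

```python
import io
import struct
import timeit
from array import array

N = 100_000  # number of 16-bit samples in the synthetic test data
raw = array('h', [i % 1000 for i in range(N)]).tobytes()

def per_sample():
    # One read and one unpack call per sample, as in the question.
    f = io.BytesIO(raw)
    return [struct.unpack('h', f.read(2))[0] for _ in range(N)]

def single_unpack():
    # Decode the whole buffer with a single struct.unpack call.
    return list(struct.unpack('%dh' % (len(raw) // 2), raw))

def with_array():
    # Let the array module decode the whole buffer.
    a = array('h')
    a.frombytes(raw)
    return a.tolist()

for fn in (per_sample, single_unpack, with_array):
    seconds = timeit.timeit(fn, number=3)
    print('%-14s %.4f s' % (fn.__name__, seconds))
```

All three functions return the same list of integers, so the timings compare equivalent work; only the number of Python-level calls differs.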
