Generating an MD5 checksum of a file
Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files.)
Reference: https://stackoom.com/question/EOm1/生成文件的MD-校验和
I'm clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear -- anyway, specifically to answer @Nemo's question from Omnifarious's answer:
I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:
$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f /tmp/test.data.300k
real 0m0.043s
user 0m0.032s
sys 0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400
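Timings like the ones above can be gathered with the standard library's timeit module. A minimal, self-contained sketch (using 1 MB of generated data rather than the original 11 MB test file):

```python
import hashlib
import os
import tempfile
import timeit

def md5sum_read(filename, blocksize=65536):
    # Buffered-read MD5, same shape as the md5sum_read in the listing above.
    h = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

# Generate a throwaway 1 MB test file and time ten checksum passes over it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 20))
    name = tmp.name
try:
    print(timeit.timeit(lambda: md5sum_read(name), number=10))
finally:
    os.unlink(name)
```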
So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:
import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()
Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine's usually got an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation...
EDIT: Sorry for the long delay, haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:
import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff
Note that this must start off with the empty string, as Adler sums do indeed differ when starting from zero versus their sum for b"", which is 1 -- CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
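Those two properties are easy to confirm interactively, along with the running-checksum chaining that the loop above relies on:

```python
import zlib

# Empty-input seeds differ between the two checksums:
print(zlib.adler32(b""))  # 1
print(zlib.crc32(b""))    # 0

# Chained updates must pass the running checksum back in; the result
# matches hashing the concatenated data in one call.
c = zlib.adler32(b"hello")
c = zlib.adler32(b" world", c)
print(c & 0xffffffff == zlib.adler32(b"hello world") & 0xffffffff)  # True
```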
There is a way that's pretty memory inefficient.
single file:
import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())
list of files:
[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
Recall, though, that MD5 is known to be broken and should not be used for any purpose, since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat-out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:
[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
If you only want 128 bits worth of digest you can do .digest()[:16].
This will give you a list of tuples, each tuple containing the name of its file and its hash.
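The original question also asked about checking checksums, which that tuple list makes straightforward: recompute and compare against previously recorded digests. A minimal sketch (the verify helper and the expected mapping are hypothetical names, not from the answer above):

```python
import hashlib

def file_digest(fname, algo=hashlib.sha256):
    # Whole-file read, matching the memory-inefficient approach above.
    with open(fname, 'rb') as f:
        return algo(f.read()).hexdigest()

def verify(expected):
    # expected maps file name -> previously recorded hex digest.
    # Returns a dict of file name -> True/False match result.
    return {fname: file_digest(fname) == digest
            for fname, digest in expected.items()}
```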
Again, I strongly question your use of MD5. You should at least be using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters. It's not that hard.
Here is a way that is more complex, but memory efficient:
import hashlib
def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
for fname in fnamelst]
And, again, since MD5 is broken and should not really ever be used anymore:
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
for fname in fnamelst]
Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.
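As a quick illustration of the truncation: a SHA-256 digest is 32 bytes (256 bits), so the first 16 bytes give the 128-bit prefix:

```python
import hashlib

full = hashlib.sha256(b"example data").digest()
truncated = full[:16]  # first 128 bits of the 256-bit digest
print(len(full), len(truncated))  # 32 16
```

Note that a truncated SHA-256 is still not interchangeable with an MD5 digest; it is only the same length.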
You can use hashlib.md5(). Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the md5 function:
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
Note: hash_md5.hexdigest() will return the hex string representation for the digest; if you just need the packed bytes use return hash_md5.digest(), so you don't have to convert back.
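The two forms carry the same information; the hex string is just the byte-wise hex encoding of the packed digest:

```python
import hashlib

h = hashlib.md5(b"spam")
# bytes.hex() on the packed digest reproduces hexdigest() exactly.
print(h.digest().hex() == h.hexdigest())  # True
```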
import hashlib
import pathlib

hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
I think relying on the invoke package and the md5sum binary is a bit more convenient than subprocess or the md5 package:
import invoke

def get_file_hash(path):
    return invoke.Context().run("md5sum {}".format(path), hide=True).stdout.split(" ")[0]
This of course assumes you have invoke and md5sum installed.
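For comparison, a dependency-free sketch using only the standard library's subprocess module; it still assumes a md5sum binary is on PATH, so it is no more portable than the invoke version:

```python
import subprocess

def get_file_hash(path):
    # md5sum prints "<hex digest>  <path>"; take the first field.
    out = subprocess.check_output(["md5sum", path])
    return out.decode().split()[0]
```

Passing the path as a list element (rather than formatting it into a shell string) also avoids shell-quoting problems with file names containing spaces.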