Why is "new_file += line + string" so much faster than "new_file = new_file + line + string"?

利俊迈

2023-03-14

Question:

Our code takes 10 minutes to siphon through 68,000 records when we use:

new_file = new_file + line + string

However, when we do the following it takes just 1 second:

new_file += line + string

Here is the code:

import time
import cmdbre

fname = "STAGE050.csv"
regions = cmdbre.regions
start_time = time.time()
with open(fname) as f:
        content = f.readlines()
        new_file_content = ""
        new_file = open("CMDB_STAGE060.csv", "w")
        row_region = ""
        i = 0
        for line in content:
                if (i==0):
                        new_file_content = line.strip() + "~region" + "\n"
                else:
                        country = line.split("~")[13]
                        try:
                                row_region = regions[country]
                        except KeyError:
                                row_region = "Undetermined"
                        new_file_content += line.strip() + "~" + row_region + "\n"
                print (row_region)
                i = i + 1
        new_file.write(new_file_content)
        new_file.close()
        end_time = time.time()
        print("total time: " + str(end_time - start_time))

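All the code I have ever written in Python uses the first option. This is just basic string manipulation: we read input from a file, process it, and write it to a new file. I am 100% certain that the first method takes roughly 600 times longer to run than the second, but why?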

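The file being processed is a csv that uses ~ instead of a comma. All we are doing is taking this csv, which has a column for country, and adding a column for the country's region, e.g. LAC, EMEA, NA, etc.
cmdbre.regions is just a dictionary with all ~200 countries as keys and each country's region as the value.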

Once I changed to the append string operation, the loop completed in 1 second instead of 10 minutes. There are 68,000 records in the csv.

Answer:

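CPython (the reference interpreter) has an optimization for in-place string concatenation (when the string being appended to has no other references). It can't apply this optimization as reliably when doing +, only when doing += (+ involves two live references, the assignment target and the operand, and the former isn't involved in the + operation itself, so it's harder to optimize).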

You shouldn't rely on this, though, per PEP 8:

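Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.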

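Update based on question edits: Yes, you broke the optimization. You concatenated many strings, not just one, and Python evaluates left to right, so it must do the leftmost concatenation first. Thus: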

new_file_content += line.strip() + "~" + row_region + "\n"

is not at all the same as:

new_file_content = new_file_content + line.strip() + "~" + row_region + "\n"

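because the former concatenates all the new pieces together, then appends them to the accumulator string all at once, while the latter must evaluate each addition from left to right, producing temporaries that don't involve new_file_content itself. Adding parentheses for clarity, it's as if you wrote: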

new_file_content = (((new_file_content + line.strip()) + "~") + row_region) + "\n"

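Because the interpreter doesn't know the types until it reaches each operation, it can't assume all of these are strings, so the optimization doesn't kick in: the only addition that involves new_file_content is the first one, and its result is a temporary rather than the assignment target.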
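The difference between the three spellings can be measured with a rough timing sketch like the one below. The function names and the synthetic `pieces` data are made up for illustration and are not from the question; absolute timings depend on the interpreter version, allocator, and row length, so none are quoted here.

```python
import timeit

def chained_plus(pieces):
    acc = ""
    for a, b in pieces:
        # Evaluated left to right, starting with (acc + a): the first
        # addition builds a new string from acc on every pass.
        acc = acc + a + "~" + b + "\n"
    return acc

def grouped_plus(pieces):
    acc = ""
    for a, b in pieces:
        # The new pieces are joined first; acc takes part in one addition
        # whose result is stored straight back into acc.
        acc = acc + (a + "~" + b + "\n")
    return acc

def inplace_add(pieces):
    acc = ""
    for a, b in pieces:
        # Same shape as above, spelled with +=.
        acc += a + "~" + b + "\n"
    return acc

# Synthetic stand-in for the CSV rows (hypothetical data).
pieces = [("row%d" % i, "EMEA") for i in range(10000)]

for fn in (chained_plus, grouped_plus, inplace_add):
    elapsed = timeit.timeit(lambda: fn(pieces), number=1)
    print(f"{fn.__name__:>13}: {elapsed:.4f}s")
```

All three functions build the same string; only the amount of copying differs. The gap should widen as the rows get longer or more numerous.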

If you changed the second bit of code to:

new_file_content = new_file_content + (line.strip() + "~" + row_region + "\n")

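or to the following, which is slightly slower but still many times faster than your slow code because it keeps the CPython optimization: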

new_file_content = new_file_content + line.strip()
new_file_content = new_file_content + "~"
new_file_content = new_file_content + row_region
new_file_content = new_file_content + "\n"

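so that the accumulation is obvious to CPython, you'd fix the performance problem. But frankly, you should just use += any time you're performing a logical append operation like this. += exists for a reason, and it provides useful information to both the maintainer and the interpreter. Beyond that, it's good practice as far as DRY goes; why name the variable twice when you don't need to?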

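Of course, per the PEP8 guidelines, even using += here is bad form. In most languages with immutable strings (including most non-CPython Python interpreters), repeated string concatenation is a form of Schlemiel the Painter's Algorithm, which causes serious performance problems. The correct solution is to build a list of strings, then join them all at once, e.g.: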

    new_file_content = []
    for i, line in enumerate(content):
        if i==0:
            # In local tests, += anonymoustuple runs faster than
            # concatenating short strings and then calling append
            # Python caches small tuples, so creating them is cheap,
            # and using syntax over function calls is also optimized more heavily
            new_file_content += (line.strip(), "~region\n")
        else:
            country = line.split("~")[13]
            try:
                row_region = regions[country]
            except KeyError:
                row_region = "Undetermined"
            new_file_content += (line.strip(), "~", row_region, "\n")

    # Finished accumulating, make final string all at once
    new_file_content = "".join(new_file_content)

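This is usually faster even when the CPython string concatenation optimization applies, and it will be reliably fast on non-CPython Python interpreters as well, because it uses a mutable list to accumulate results efficiently, then lets ''.join precompute the total length of the string, allocate the final string once (instead of resizing it incrementally along the way), and populate it exactly once.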
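The list-plus-tuple accumulation used above can be tried in isolation. This is a minimal sketch with made-up sample strings: += on a list accepts any iterable and extends the list in place, whereas plain + between a list and a tuple is an error.

```python
parts = []

# list += tuple extends the list item by item, in place.
parts += ("id~name", "~region\n")
parts += ("1~Peru", "~", "LAC", "\n")
print(len(parts))  # 6

# Plain + does not accept a tuple on the right-hand side of a list.
try:
    parts = parts + ("x",)
except TypeError as exc:
    print("TypeError:", exc)

# One allocation for the final string, filled once.
result = "".join(parts)
print(repr(result))  # 'id~name~region\n1~Peru~LAC\n'
```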

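Side note: for your specific case, you shouldn't be accumulating or concatenating at all. You have an input file and an output file, and you can process them line by line. Every time you would append to or accumulate file contents, just write them out instead (I've also cleaned up the code a bit for PEP8 compliance and other minor style improvements while I was at it):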

start_time = time.monotonic()  # You're on Py3, monotonic is more reliable for timing

# Use with statements for both input and output files
with open(fname) as f, open("CMDB_STAGE060.csv", "w") as new_file:
    # Iterate input file directly; readlines just means higher peak memory use
    # Maintaining your own counter is silly when enumerate exists
    for i, line in enumerate(f):
        if not i:
            # Write to file directly, don't store
            new_file.write(line.strip() + "~region\n")
        else:
            country = line.split("~")[13]
            # .get exists to avoid try/except when you have a simple, constant default
            row_region = regions.get(country, "Undetermined")
            # Write to file directly, don't store
            new_file.write(line.strip() + "~" + row_region + "\n")
end_time = time.monotonic()
# Print will stringify arguments and separate by spaces for you
print("total time:", end_time - start_time)
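To try the streaming approach without the real CSV, the same loop can be wrapped in a function that takes any two file-like objects. This is only a sketch: `add_region_column`, `COUNTRY_COLUMN`, and the sample data are hypothetical names introduced here, and the line is stripped before splitting because in this sample the country happens to be the last column.

```python
import io

COUNTRY_COLUMN = 13  # zero-based column index taken from the question's code

def add_region_column(src, dst, regions, default="Undetermined"):
    """Copy ~-separated rows from src to dst, appending a region column."""
    for i, line in enumerate(src):
        row = line.strip()
        if i == 0:
            # Header row: just name the new column.
            dst.write(row + "~region\n")
        else:
            country = row.split("~")[COUNTRY_COLUMN]
            dst.write(row + "~" + regions.get(country, default) + "\n")

# Made-up sample data standing in for cmdbre.regions and STAGE050.csv.
sample_regions = {"Peru": "LAC", "France": "EMEA"}
header = "~".join(f"col{n}" for n in range(14))
rows = ["~".join(["x"] * 13 + [c]) for c in ("Peru", "France", "Atlantis")]
src = io.StringIO("\n".join([header] + rows) + "\n")
dst = io.StringIO()

add_region_column(src, dst, sample_regions)
print(dst.getvalue())
```

Because the function only calls iteration and write, it should work unchanged with the objects returned by open().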

Implementation details deep dive

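For those curious about the implementation details, the CPython string concatenation optimization is implemented in the bytecode interpreter, not on the str type itself (technically, PyUnicode_Append performs the mutation optimization, but it needs help from the interpreter to fix up reference counts so it knows the optimization is safe to use; without interpreter help, only C extension modules would ever benefit from it).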

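When the interpreter detects that both operands are the Python level str type (at the C layer, in Python 3, it's still referred to as PyUnicode, a legacy of the 2.x days that wasn't worth changing), it calls a special unicode_concatenate function, which checks whether the very next instruction is one of three basic STORE_* instructions. If it is, and the target is the same as the left operand, it clears the target reference so that PyUnicode_Append sees only a single reference to the operand, allowing it to invoke the optimized code path for a str with a single reference.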

This means that not only can you break the optimization by doing

a = a + b + c

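you can also break it any time the variable in question is not a top-level (global, nested, or local) name. If you're operating on an object attribute, a list index, a dict value, etc., even += won't help you: the interpreter doesn't see a "simple STORE", so it doesn't clear the target reference, and all of these get the ultraslow, not-in-place behavior: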

foo.x += mystr
foo[0] += mystr
foo['x'] += mystr
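One way to see which store instruction follows the addition is the standard dis module. The sketch below uses hypothetical helper functions and only lists the STORE_* opcodes; it does not demonstrate the optimization itself, and the other opcode names (for example the one used for the addition) differ between CPython versions.

```python
import dis

def append_local(acc, s):
    acc += s          # target is a plain local name

def append_attr(obj, s):
    obj.x += s        # target is an attribute

def append_item(seq, s):
    seq[0] += s       # target is a subscript (list index or dict key)

def store_ops(func):
    """Return the names of the STORE_* instructions compiled for func."""
    return [ins.opname for ins in dis.get_instructions(func)
            if ins.opname.startswith("STORE_")]

for f in (append_local, append_attr, append_item):
    print(f.__name__, store_ops(f))
```

On CPython this should report STORE_FAST for the local name, and STORE_ATTR and STORE_SUBSCR for the other two, which are the stores described above as not being "simple".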

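It's also specific to the str type; in Python 2, the optimization doesn't help with unicode objects, in Python 3, it doesn't help with bytes objects, and in neither version will it optimize for subclasses of str. Those always take the slow path.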

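Basically, the optimization is there to be as nice as possible in the simplest common cases for people new to Python, but it isn't going to go to serious trouble for even moderately more complex cases. This just reinforces the PEP8 recommendation: depending on the implementation details of your interpreter is a bad idea when you could run faster on every interpreter, for any store target, by doing the right thing and using str.join.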
