When to commit data in ZODB

水铭晨

2023-03-14

Question:

I am trying to process the data generated by the following piece of code:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)  # placeholder for the scoring step
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

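Since the dictionary is huge (10000 keys x 10000 lists with 3 elements each), it is difficult to keep it in memory. I am looking for a solution that stores each key: value pair (in the form of a list) as soon as it is generated. It was suggested here, in "Writing and reading a dictionary in a specific format (Python)", to use ZODB in combination with BTree.

Bear with me if this is too naive. My question is: when should transaction.commit() be called to commit the data? If I call it at the end of the inner loop, the resulting file is extremely large (I am not sure why). Here is a snippet: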

import transaction
from BTrees.IOBTree import IOBTree
from persistent.list import PersistentList
from ZODB import DB
from ZODB.FileStorage import FileStorage

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
btree_container = IOBTree()
root[0] = btree_container
for nodes in G.nodes():
    btree_container[nodes] = PersistentList()  # I was losing data prior to doing this

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)  # placeholder for the scoring step
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
        transaction.commit()  # commits after every single append

What if I call it outside both loops? Something like:

    ......
       ......
          score = SomeOperation(Gvalue, Hvalue)
          btree_container.setdefault(Gnodes,[]).append([Hnodes, score, -1 ])
    transaction.commit()

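Will all the data be held in memory until I call transaction.commit()? Again, I am not sure why, but this results in a smaller file on disk.

I want to minimize the data being held in memory. Any guidance would be appreciated!

Answer: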

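Your goal is to make your process manageable within memory constraints. To be able to do this with ZODB as a tool, you need to understand how ZODB transactions work and how to use them.

Why your ZODB grows so large

First of all, you need to understand what a transaction commit does here, which also explains why your Data.fs is getting so large.

ZODB writes data out per transaction: any persistent object that has changed gets written to disk. The important detail here is "persistent object that has changed"; ZODB works in units of persistent objects.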

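Not every Python value is a persistent object. If I define a plain Python class, it will not be persistent, nor are any of the built-in Python types such as int or list. On the other hand, any class you define that inherits from persistent.Persistent is a persistent object. The BTrees set of classes, as well as the PersistentList class you use in your code, do inherit from Persistent.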

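Now, on a transaction commit, any persistent object that has changed is written to disk as part of that transaction. So any PersistentList object that has been appended to is written to disk in its entirety. BTrees handle this a little more efficiently; they store Buckets, themselves persistent, that in turn hold the actually stored objects. So when you create new nodes, it is a Bucket that gets written to the transaction, not the whole BTree structure. Note that because the items held in the tree are themselves persistent objects, only references to them are stored in the Bucket records.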

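Now, ZODB writes transaction data by appending it to the Data.fs file, and it does not remove old data automatically. It can construct the current state of the database by finding the most recent version of a given object in the store. This is why your Data.fs is growing so much: you are writing out new versions of larger and larger PersistentList instances as transactions are committed.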
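To get a feel for the scale, here is a back-of-the-envelope model in plain Python. It is a toy model only, under the assumption that every commit appends one full copy of the list at its current length; the function name items_written is made up for this illustration, and real ZODB records carry additional overhead:

```python
def items_written(appends, commit_every):
    """Count list items written for one list that grows by `appends` items.

    Toy model (an assumption, not ZODB's real record format): every commit
    appends a full copy of the list, at its current length, to the file.
    """
    written = 0
    for length in range(1, appends + 1):
        # Commit every `commit_every` appends, and once more at the very end.
        if length % commit_every == 0 or length == appends:
            written += length
    return written


per_append = items_written(10_000, commit_every=1)      # commit in the inner loop
per_list = items_written(10_000, commit_every=10_000)   # commit once per finished list
print(per_append, per_list, per_append / per_list)
```

In this model, committing after every append writes 1 + 2 + ... + 10000 items for a single list, roughly 5000 times more than writing the finished list once, and you have 10000 such lists.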

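Removing old data is called packing, which is similar to the VACUUM command in PostgreSQL and other relational databases. Simply call the .pack() method on the db variable to remove all old revisions, or use the t and days parameters of that method to limit how much history is retained: t is a time.time() timestamp (seconds since the epoch) before which revisions may be packed, and days is the number of days of history to keep, counted back from the current time, or from t if specified. Packing should reduce your data file considerably, as the partial lists in older transactions are removed. Do note that packing is an expensive operation and thus can take a while, depending on the size of your dataset.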

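Using transactions to manage memory

You are trying to build a very large dataset by using persistence to work around memory constraints, and you are using transactions to try to flush things to disk. Normally, however, a transaction commit signals that you have finished constructing your dataset, something you can then use as one atomic whole.

What you need to use here is a savepoint. Savepoints are essentially subtransactions: a point during the whole transaction where you can ask for data to be temporarily stored on disk. They are made permanent when you commit the transaction. To create a savepoint, call the .savepoint method on the transaction: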

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)  # placeholder for the scoring step
        btree_container.setdefault(Gnodes, PersistentList()).append(
            [Hnodes, score, -1])
    transaction.savepoint(True)  # optimistic savepoint once per finished Gnodes list
transaction.commit()             # a single commit after the whole dataset is built

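In the above example I set the optimistic flag to True, meaning: I do not intend to roll back to this savepoint. Some storages do not support rolling back, and signalling that you do not need it makes your code work in such situations.

Also note that the transaction.commit() happens once the whole dataset has been processed, which is what a commit is supposed to achieve.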

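One thing a savepoint does is call for a garbage collection of the ZODB caches, which means that any data not currently in use is removed from memory.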

Note the ‘not currently in use’ part there; if any of your code holds on to
large values in a variable the data cannot be cleared from memory. As far as I
can determine from the code you’ve shown us, this looks fine. But I do not
know how your operations work or how you generate the nodes; be careful to
avoid building complete lists in memory there when an iterator will do, or
build large dictionaries where all your lists of lists are referenced, for
example.

You can experiment a little as to where you create your savepoints; you could
create one every time you’ve processed one Hnodes value, or only when done
with a Gnodes loop like I’ve done above. You are constructing a list per
Gnodes value, so it would be kept in memory while looping over all the
H.nodes() anyway, and flushing to disk would probably only make sense once
you’ve completed constructing it in full.

If, however, you find that you need to clear memory more often, you should
consider using either a BTrees.OOBTree.TreeSet class or a
BTrees.IOBTree.BTree class instead of a PersistentList to break up your
data into more persistent objects. A TreeSet is ordered but not (easily)
indexable, while a BTree could be used as a list by using simple
incrementing index keys:

for i, Hnodes in enumerate(H.nodes()):
    ...
    btree_container.setdefault(Gnodes, IOBTree())[i] = [Hnodes, score, -1]
    if i % 100 == 0:
        transaction.savepoint(True)

The above code uses a BTree instead of a PersistentList and creates a
savepoint every 100 Hnodes values processed. Because the BTree uses buckets,
which are persistent objects in themselves, the whole structure can be flushed
to a savepoint more easily without having to stay in memory for all H.nodes()
to be processed.
