Question:

Reading and writing a large text file in Python is too slow

印振国

2023-03-14

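This code scans a large 5.1 GB text file and checks for words that occur fewer than 100 times. It then rewrites the 5.1 GB to an output text file, replacing those words with <unk>. The main problem is that creating output.txt takes a very long time. I suspect that the way write_text() opens the dataset file and the output file is causing the problem.

The goal behind this script: I have a prebuilt vocab and a text. The text may contain new words that are not in my vocab, so I want to add them to it. But I only want to add new words that are relevant (appearing more than 100 times). New words that appear fewer than 100 times in the text are one-offs and unimportant, so I want to change them to "<unk>".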


from collections import Counter

extra_words = []
new_words = []
add_words = []


def get_vocab():
    vocab = set()
    with open('vocab.txt', 'r', encoding='utf-8') as rd:
        lines = rd.readlines()

    for line in lines:
        tokens = line.split(' ')
        word = tokens[0]
        vocab.add(word)

    return vocab


def _count(text):

    vocab = get_vocab()

    with open(text, 'r', encoding='utf-8') as fd:

        for line in fd.readlines():

            for token in line.split():

                if token not in vocab:
                    extra_words.append(token)

    word_count = Counter(extra_words)

    # add del word_count[punctuation] to remove it from list

    #del word_count['"']

    for word in word_count:

        if word_count[word] < 100:
            new_words.append(word)

        else:
            add_words.append(word)

    write_text()

    #return len(new_words), word_count.most_common()[0]


def write_text():

    with open('dataset', 'r', encoding='utf-8') as fd:

        f = fd.readlines()

    with open('output.txt', 'w', encoding='utf-8') as rd:
        new_text = []
        for line in f:
            new_line = []
            for token in line.split():

                

                if token in new_words:

                    new_line.append('<unk>')

                else:

                    new_line.append(token)

            new_text.append(' '.join(new_line))
        print('\n'.join(new_text), file=rd)
            #print(' '.join(new_line), file=rd)


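def add_vocab():
    # Count the existing entries before touching the file.
    ln = len(get_vocab())

    # Append ('a') rather than overwrite ('w') so the prebuilt vocab is kept.
    with open('vocab.txt', 'a', encoding='utf-8') as fd:
        # add_words holds plain strings, so enumerate() supplies the index.
        for idx, word in enumerate(add_words):
            # print() already ends the line; no extra '\n' is needed.
            print(f'{word} {ln + idx + 1}', file=fd)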


print(_count('dataset'))
add_vocab()

1 answer

赵佐

2023-03-14

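I tested this with the complete works of Shakespeare. You still have a fair amount of work to do on case and punctuation. It handled 100 copies of his works (about 500 MB) in roughly 15 seconds for me. If yours takes much longer than that, you may want to examine your code. Note that I used a simplified version of your vocabulary file, since I did not follow what you wanted to see in it. The version I used is just one word per line.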

import collections

def get_vocabulary(path):
    with open(path, 'r', encoding='utf-8') as file_in:
        tokens = [line.strip("\n") for line in file_in]
    return set(tokens)

def get_interesting_word_counts(path, vocabulary):
    word_counts = collections.Counter()
    with open(path, 'r', encoding='utf-8') as file_in:
        for line in file_in:
            word_counts.update([token for token in line.split() if token not in vocabulary])
    return word_counts

def get_cleaned_text(path, vocabulary, uncommon_words):
    with open(path, 'r', encoding='utf-8') as file_in:
        for line in file_in:
            #line_out = " ".join(["<unk>" if token in uncommon_words else token for token in line.strip("\n").split()])
            line_out = " ".join([
                token if token in vocabulary or token not in uncommon_words else "<unk>"
                for token in line.strip("\n").split()
            ])
            yield "{}\n".format(line_out)

vocabulary = get_vocabulary("vocabulary.txt")
word_counts = get_interesting_word_counts("shakespeare.txt", vocabulary)

## --------------------------------------
## Add frequent but missing words to vocabulary
## --------------------------------------
common_words = set([item[0] for item in word_counts.items() if item[1] >= 100])
with open('vocabulary.txt', 'a', encoding='utf-8') as file_out:
    for word in common_words:
        file_out.write("{}\n".format(word))
## --------------------------------------

## --------------------------------------
## Rewrite the text censoring uncommon words
## --------------------------------------
uncommon_words = set([item[0] for item in word_counts.items() if item[1] < 100])
cleaned_text = get_cleaned_text("shakespeare.txt", vocabulary, uncommon_words)
with open('shakespeare_out.txt', 'w', encoding='utf-8') as file_out:
    file_out.writelines(cleaned_text)
## --------------------------------------
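
As a side note on why the version in the question is likely slow: write_text() tests `token in new_words` against a list, which is a linear scan for every token, and readlines() loads the whole file into memory before any work starts. The code above uses sets and iterates over the file line by line instead. The sketch below only illustrates the membership cost; the data sizes are made up and the names `rare_list`, `rare_set`, `tokens`, and `replace_rare` are hypothetical, so it is not a measurement of the original 5.1 GB job.

```python
import timeit

# Hypothetical stand-ins for the question's new_words list and a stream of tokens.
rare_list = [f"word{i}" for i in range(5000)]
rare_set = set(rare_list)
tokens = [f"word{i}" for i in range(0, 10000, 5)]  # 2000 tokens, half of them rare


def replace_rare(tokens, rare_words):
    """Return tokens with every member of rare_words replaced by <unk>."""
    return ["<unk>" if token in rare_words else token for token in tokens]


# Both containers give the same answer; only the lookup cost differs.
assert replace_rare(tokens, rare_list) == replace_rare(tokens, rare_set)

# Time one pass with each container (list lookups scan, set lookups hash).
list_seconds = timeit.timeit(lambda: replace_rare(tokens, rare_list), number=1)
set_seconds = timeit.timeit(lambda: replace_rare(tokens, rare_set), number=1)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```

On a real corpus the gap grows with the number of rare words, because each list lookup has to walk the list while a set lookup stays roughly constant.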

You can get the text I used here: http://www.gutenberg.org/ebooks/100

The source begins:

The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare

The generated file begins:

<unk> <unk> <unk> <unk> of The <unk> <unk> of <unk> <unk> by <unk> <unk>

The updated vocabulary file begins:

as
run
he’s
this.
there’s
like
you.
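
Entries such as `this.` and `you.` show the case and punctuation work that remains: the same word is counted separately depending on what is attached to it. A minimal sketch of one possible normalization follows, assuming it is acceptable to lowercase every token and strip ASCII punctuation from its ends. The helpers `normalize_token` and `tokenize` are hypothetical and not part of the code above.

```python
import string


def normalize_token(token):
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return token.strip(string.punctuation).lower()


def tokenize(line):
    """Split on whitespace, normalize, and drop tokens that were only punctuation."""
    normalized = (normalize_token(token) for token in line.split())
    return [token for token in normalized if token]


# Interior characters (such as the typographic apostrophe in he’s) are left alone.
print(tokenize('This. is "like" you. -- he’s'))
```

If something like this is used, the same function would need to be applied when counting and when rewriting, and the `<unk>` marker should be inserted after normalization, since stripping punctuation would otherwise remove its angle brackets.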
