Question:

Is there a way to remove duplicated and consecutive words/phrases in a string?

公良渝

2023-03-14

Is there a way to remove duplicated and consecutive words/phrases in a string? E.g.

[in]: foo foo bar foo bar

[out]: foo bar foo bar

I have tried this:

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu'

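What happens when it gets more complicated and I want to remove phrases (let's say a phrase can consist of up to 5 words)? How can that be done? E.g.

[in]: foo bar foo bar foo bar

[out]: foo bar

Another example:

[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .

[out]: this is a sentence where phrases duplicate . sentence are not phrases .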

3 answers

陶永望

2023-03-14

txt1 = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
txt2 =  'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'

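def remove_duplicate_words(txt):
    # Keep only the first occurrence of each word, preserving order.
    # Note: this drops every repeated word, not just consecutive repeats.
    result = []
    for word in txt.split():
        if word not in result:
            result.append(word)
    return ' '.join(result)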

Output:

In [7]: remove_duplicate_words(txt1)
Out[7]: 'this is a foo bar black sheep , have you any wool woo yes sir three bag wu'

In [8]: remove_duplicate_words(txt2)
Out[8]: 'this is a sentence where phrases duplicate'
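Note that the function above removes every later occurrence of a word, so ", yes sir yes sir" loses its second "yes sir" and the second comma even though no single word is repeated back to back. If only consecutive repeats should go, a minimal sketch could compare each word with the last one kept. The name remove_consecutive_duplicate_words is hypothetical and not part of the answer above:

```python
def remove_consecutive_duplicate_words(txt):
    # Hypothetical variant: drop a word only when it equals the word
    # immediately before it.
    result = []
    for word in txt.split():
        if not result or word != result[-1]:
            result.append(word)
    return ' '.join(result)


sample = 'yes sir yes sir three bag bag full'
cleaned = remove_consecutive_duplicate_words(sample)
print(cleaned)  # yes sir yes sir three bag full
```

This handles single words only; repeated multi-word phrases still need one of the approaches in the other answers.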

澹台俊晖

2023-03-14

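I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups repeated, consecutive items from that list into tuples of (item_value, iterator_of_values). Use it here like:

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'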
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
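To make the (item_value, iterator_of_values) pairs more concrete, here is a small sketch that lists each run groupby finds together with its length. The sample sentence is made up for illustration:

```python
from itertools import groupby

words = 'foo bar bar bar black sheep sheep'.split()

# groupby yields one (key, group) pair per run of equal, adjacent items;
# the group iterator must be consumed before moving to the next pair.
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)
# [('foo', 1), ('bar', 3), ('black', 1), ('sheep', 2)]
```

Taking only the keys, as the one-liner above does, keeps one word per run.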

So let's extend that with a function that returns a list with its consecutive duplicate values removed:

from itertools import chain, groupby

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))
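Because dedupe flattens its result with chain, it appears to be meant for a sequence of word chunks (lists of words) rather than a flat list of strings. The sketch below, using made-up inputs, shows both cases:

```python
from itertools import chain, groupby


def dedupe(lst):
    # Same definition as above: keep one chunk per run of equal,
    # adjacent chunks, then flatten the chunks into a single list.
    return list(chain(*[item[0] for item in groupby(lst)]))


chunks = [['foo', 'bar'], ['foo', 'bar'], ['baz']]
flat = dedupe(chunks)
print(flat)  # ['foo', 'bar', 'baz']

# Passing plain strings flattens them into characters, which is
# probably not what a caller wants.
split_up = dedupe(['ab', 'ab', 'c'])
print(split_up)  # ['a', 'b', 'c']
```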

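That's great for one-word phrases, but it doesn't help with longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over our original phrase: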

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
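As a quick illustration of what stride produces, the sketch below repeats the generator and runs it on a made-up five-word list. With a non-zero offset, the words before the offset come out as one leading chunk, and the final chunk may be shorter than length:

```python
def stride(lst, offset, length):
    # Same generator as above: an optional leading chunk of `offset`
    # words, then consecutive chunks of `length` words.
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return


words = ['a', 'b', 'c', 'd', 'e']
pairs_from_0 = list(stride(words, 0, 2))
pairs_from_1 = list(stride(words, 1, 2))
print(pairs_from_0)  # [['a', 'b'], ['c', 'd'], ['e']]
print(pairs_from_1)  # [['a'], ['b', 'c'], ['d', 'e']]
```

Trying every offset below the chunk length is what lets a repeated phrase line up with the chunk boundaries wherever it starts.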

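Now we're cooking! OK, so our strategy is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 and then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've handled five-word duplicates: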

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)  # prints True

公孙琛

2023-03-14

You can use the re module for that.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'

Edit: an addition for the last example. To handle it, you have to keep calling re.sub for as long as duplicate phrases remain. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
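The loop above could be wrapped in a reusable function. The sketch below is one possible way to do it, assuming the same pattern; the names REPEAT and collapse_repeats are hypothetical. It uses re.subn, which returns the new string together with the number of substitutions, so the pattern only runs once per pass:

```python
import re

# Same pattern as in the answer: a phrase starting at a word boundary,
# followed by one or more whitespace-separated copies of itself.
REPEAT = re.compile(r'\b(.+)(\s+\1\b)+')


def collapse_repeats(text):
    # Keep substituting until a pass makes no replacement. Every
    # successful pass shortens the string, so the loop terminates.
    while True:
        text, count = REPEAT.subn(r'\1', text)
        if count == 0:
            return text


s = ('this is a sentence sentence sentence this is a sentence '
     'where phrases phrases duplicate where phrases duplicate')
result = collapse_repeats(s)
print(result)  # this is a sentence where phrases duplicate
```

Unlike the itertools answer, this pattern places no limit on phrase length, so it may collapse repeats longer than five words as well.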
