Question:

How to find the longest repeated sequence in a string

林博厚

2023-03-14

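I have a long string in a text file (a DNA sequence, more than 20,000 characters), and I am trying to find the longest sequence in it that is repeated at least three times. What is the best way to do this?

The only existing topics I can find are about finding repeats across two or more separate strings, but how do you do it with a single long string?

2 answers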

谭裕

2023-03-14

Case 1: substrings of the given length cannot overlap

str = "ababaeabadefgdefaba"

n = 3
n.times.
  flat_map { |i| str[i..-1].each_char.each_cons(n).to_a }.
  uniq.
  each_with_object({}) do |a,h|
    r = /#{a.join('')}/
    h[a.join('')] = str.scan(r).size
  end.max_by { |_,v| v }
  #=> ["aba", 3]

Case 2: substrings of the given length may overlap

Only the line that defines the regular expression needs to change:

n = 3
n.times.
  flat_map { |i| str[i..-1].each_char.each_cons(n).to_a }.
  uniq.
  each_with_object({}) do |a,h|
    r = /#{a.first}(?=#{a.drop(1).join('')})/
    h[a.join('')] = str.scan(r).size
  end.max_by { |_,v| v }
  #=> ["aba", 4]
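For the overlapping case, the same counts can also be obtained without a regular expression by counting each window that each_cons produces. This is a minimal sketch, not part of the original answer; the helper name overlapping_counts is hypothetical.

```ruby
# Hypothetical helper: counts every substring of length n in str,
# treating overlapping occurrences as separate hits.
def overlapping_counts(str, n)
  str.each_char.each_cons(n).each_with_object(Hash.new(0)) do |chars, counts|
    counts[chars.join] += 1
  end
end

str = "ababaeabadefgdefaba"
p overlapping_counts(str, 3).max_by { |_, v| v }
  #=> ["aba", 4]
```

Each window is visited once, so this avoids rescanning the whole string for every distinct substring.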

n = 3
b = n.times
  #=> #<Enumerator: 3:times> 
c = b.flat_map { |i| str[i..-1].each_char.each_cons(n).to_a }
  #=> [["a", "b", "a"], ["b", "a", "b"], ["a", "b", "a"], ["b", "a", "e"],
  #    ["a", "e", "a"], ["e", "a", "b"], ["a", "b", "a"], ["b", "a", "d"],
  #    ["a", "d", "e"], ["d", "e", "f"], ["e", "f", "g"], ["f", "g", "d"],
  #    ["g", "d", "e"], ["d", "e", "f"], ["e", "f", "a"], ["f", "a", "b"],
  #    ["a", "b", "a"], ["b", "a", "b"], ["a", "b", "a"], ["b", "a", "e"],
  #    ["a", "e", "a"], ["e", "a", "b"], ["a", "b", "a"], ["b", "a", "d"],
  #    ["a", "d", "e"], ["d", "e", "f"], ["e", "f", "g"], ["f", "g", "d"],
  #    ["g", "d", "e"], ["d", "e", "f"], ["e", "f", "a"], ["f", "a", "b"],
  #    ["a", "b", "a"], ["a", "b", "a"], ["b", "a", "e"], ["a", "e", "a"],
  #    ["e", "a", "b"], ["a", "b", "a"], ["b", "a", "d"], ["a", "d", "e"],
  #    ["d", "e", "f"], ["e", "f", "g"], ["f", "g", "d"], ["g", "d", "e"],
  #    ["d", "e", "f"], ["e", "f", "a"], ["f", "a", "b"], ["a", "b", "a"]]
d = c.uniq
  #=> [["a", "b", "a"], ["b", "a", "b"], ["b", "a", "e"], ["a", "e", "a"],
  #    ["e", "a", "b"], ["b", "a", "d"], ["a", "d", "e"], ["d", "e", "f"], 
  #    ["e", "f", "g"], ["f", "g", "d"], ["g", "d", "e"], ["e", "f", "a"],
  #    ["f", "a", "b"]] 
e = d.each_with_object({}) do |a,h|
      r = /#{a.first}(?=#{a.drop(1).join('')})/
      puts "  str.scan(#{r.inspect}) = #{str.scan(r)}" if a == d.first
      h[a.join('')] = str.scan(r).size
      puts "  h[#{a.join('')}] = #{h[a.join('')]}" if a == d.first
    end
  #=>   str.scan(/a(?=ba)/) = ["a", "a", "a", "a"]
  #=>   h[aba] = 4
  #=> {"aba"=>4, "bab"=>1, "bae"=>1, "aea"=>1, "eab"=>1, "bad"=>1, "ade"=>1,
  #    "def"=>2, "efg"=>1, "fgd"=>1, "gde"=>1, "efa"=>1, "fab"=>1}
e.max_by { |_,v| v }
  #=> ["aba", 4]

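When e is computed, for the first element of d passed to the block, the block variable a equals ["a", "b", "a"], and the regex /a(?=ba)/ in str.scan(/a(?=ba)/) matches every a in str that is immediately followed by ba. (?=ba) is a positive lookahead (it is not part of the match).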
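To see the effect of the lookahead in isolation, the two counting styles can be compared side by side. This is only an illustrative sketch: the helper names count_non_overlapping and count_overlapping are made up here, and Regexp.escape is an addition so that the substring is always treated literally.

```ruby
# Plain pattern: each match consumes the whole substring, so
# occurrences that overlap an earlier match are skipped.
def count_non_overlapping(str, sub)
  str.scan(/#{Regexp.escape(sub)}/).size
end

# First character plus a lookahead for the rest: only one character is
# consumed per match, so overlapping occurrences are counted too.
# Assumes sub is not empty.
def count_overlapping(str, sub)
  head = Regexp.escape(sub[0])
  tail = Regexp.escape(sub[1..-1])
  str.scan(/#{head}(?=#{tail})/).size
end

str = "ababaeabadefgdefaba"
p count_non_overlapping(str, "aba")  #=> 3
p count_overlapping(str, "aba")      #=> 4
```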

壤驷阳波

2023-03-14

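If I understand correctly, you want to solve the "longest repeated substring problem": https://en.wikipedia.org/wiki/Longest_repeated_substring_problem

Take a look at http://rubyquiz.com/quiz153.html

This gem may help you solve the problem: https://github.com/luikore/triez

CTRL F: "Solve the longest common substring problem:"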
