Question:

How do I find the longest substring that occurs in every element of an array?

宰父跃

2023-03-14

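I have collected a number of posts from several authors. Each author has a unique signature or link that appears in all of their texts.

Example for Author1: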

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1: @jhsad.sadas.com

Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2:

This is the
 *author's* signature.

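Please note that there is no reliable identifying character (or position) that marks the start or end of a signature. It could be a URL, a Twitter handle, or any kind of plain text, of any length, containing any sequence of characters, and it may occur at the start, the end, or the middle of the string.

I am looking for a method that extracts the longest substring present in all $texts elements for a single author.

For the purposes of this task, assume that every author has a signature substring in every post/text.

Idea: I am considering converting the words to vectors and finding the similarities between the texts. We could use cosine similarity to find the signature. I think the solution must be along these lines.

mickmackusa's commented code captures the essence of what I want, but I would like to see whether there are other ways to achieve the desired result.

2 answers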

韶英达

2023-03-14

You can use preg_match() with a regular expression to achieve this.

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";

preg_match("/\@[^\s]+/", $str, $match);

var_dump($match); //Will output the signature
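
Note that the pattern above assumes the signature is a single token that starts with `@`, which the question does not guarantee. As a rough sketch of how the same idea could be applied to a whole set of texts, the hypothetical helper `commonHandles()` below collects the `@`-prefixed tokens from each text and keeps only those found in all of them. The function name and the sample data are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the @-prefixed tokens that occur in every text.
function commonHandles(array $texts): array
{
    $common = null;
    foreach ($texts as $text) {
        // Collect every run of non-whitespace characters that starts with "@".
        preg_match_all('/@\S+/', $text, $matches);
        $found = array_unique($matches[0]);
        // The first text seeds the candidate list; later texts narrow it down.
        $common = $common === null ? $found : array_intersect($common, $found);
    }
    return array_values($common ?? []);
}

// Illustrative sample data (assumed, not taken from the question).
$sample = [
    'aaa @jhsad.sadas.com bbb',
    'ccc @other.example ddd @jhsad.sadas.com',
    '@jhsad.sadas.com eee',
];
var_export(commonHandles($sample));
```

For a signature that is free text, such as the Author2 example, this returns an empty array, so it only covers the handle-style case.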

公冶子琪

2023-03-14

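Here is my thinking:

1. Sort an author's collection of posts by string length (ascending) so that you work from smaller texts to larger texts.
2. Split each post's text on one or more whitespace characters, so that only wholly non-whitespace substrings are handled during processing.
3. Find the matching substrings that occur in each subsequent post against an ever-narrowing array of substrings (the overlaps).
4. Group the consecutive matching substrings by analyzing their index values.
5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing whitespace, of course).
6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned index 0.
7. Print the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.

Code: (Demo)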

$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

$posts['Author2'] = ['This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
        'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
        'Finally, this is unwanted stuff. This is the
 *author\'s* signature.'];

foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    
    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b);  // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY);  // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY));  // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "\n";
    
    // batch consecutive substrings
    $group = null;
    $consecutives = [];  // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i;
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    
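    $potential_signatures = [];  // reset for each author
    foreach ($consecutives as $words) {
        // quote each word so regex metacharacters and the "/" delimiter are treated literally
        $quoted = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $words);
        // match potential signatures in first text for measurement:
        if (preg_match_all('/' . implode('\s+', $quoted) . '/', $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]);  // keep matches from every group
        }
    }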
    usort($potential_signatures, function($a,$b){
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    
    echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}

Output:

Author: Author1
array (
  0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
  1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
  2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
  11 => '@jhsad.sadas.com',
)
array (
  11 => 
  array (
    0 => '@jhsad.sadas.com',
  ),
)
Assumed Signature: @jhsad.sadas.com

Author: Author2
array (
  0 => 'Finally, this is unwanted stuff. This is the
 *author\'s* signature.',
  1 => 'This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
  2 => 'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
)
array (
  2 => 'is',
  5 => 'This',
  6 => 'is',
  7 => 'the',
  8 => '*author\'s*',
  9 => 'signature.',
)
array (
  2 => 
  array (
    0 => 'is',
  ),
  5 => 
  array (
    0 => 'This',
    1 => 'is',
    2 => 'the',
    3 => '*author\'s*',
    4 => 'signature.',
  ),
)
Assumed Signature: This is the
 *author's* signature.
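
Since the question asks for other approaches, here is a minimal sketch of a character-level alternative: a brute-force search for the longest substring shared by every text. It is not the method used above. The helper name `longestCommonSubstring()` is hypothetical, and the sketch assumes the texts are short enough that a brute-force scan is acceptable. It compares bytes exactly, so any difference in whitespace inside a signature shortens the match.

```php
<?php
// Hypothetical helper: longest substring present in every text (brute force).
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // The shortest text bounds the answer, so take candidates from it.
    usort($texts, function ($a, $b) {
        return strlen($a) <=> strlen($b);
    });
    $shortest = array_shift($texts);
    $length = strlen($shortest);

    // Try candidate lengths from longest to shortest; the first hit wins.
    for ($size = $length; $size > 0; --$size) {
        for ($start = 0; $start + $size <= $length; ++$start) {
            $candidate = substr($shortest, $start, $size);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2;  // missing from one text: move on to the next candidate
                }
            }
            return $candidate;
        }
    }
    return '';
}

// Illustrative sample data (assumed, not taken from the question).
$sample = ['abcXYZdef', 'XYZ12', '99XYZ'];
echo longestCommonSubstring($sample), "\n";
```

Unlike the word-based approach, this can return a fragment that begins or ends in the middle of a word, so the result may need trimming before it is treated as a signature.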
