
Regular expression to match the company name in many variations of copyright notices

龚振
2023-03-14
Question

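I'm on a tight schedule and want to come up with a Python regular expression that matches the company name in many possible variations of copyright notices, for example: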

Copyright © 2019 Apple Inc. All rights reserved.  
© 2019 Quid, Inc. All Rights Reserved.  
© 2009 Database Designs  
© 2019 Rediker Software, All Rights Reserved  
©2019 EVOSUS, INC. ALL RIGHTS RESERVED  
© 2019 Walmart. All Rights Reserved.  
© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.  
Copyright © 1978-2019 Berkshire Hathaway Inc.  
© 2019 McKesson Corporation  
© 2019 UnitedHealth Group. All rights reserved.  
© Copyright 1999 - 2019 CVS Health  
Copyright 2019 General Motors. All Rights Reserved.  
© 2019 Ford Motor Company  
©2019 AT&T Intellectual Property. All rights reserved.  
© 2019 GENERAL ELECTRIC  
Copyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.  
© 2019 Verizon  
© 2019 Fannie Mae  
Copyright © 2018 Jonas Construction Software Inc. All rights reserved.  
All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved  
© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121  
© 2019 JPMorgan Chase & Co.  
Copyright © 1995 - 2018 Boeing. All Rights Reserved.  
© 2019 Bank of America Corporation. All rights reserved.  
© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801  
©2019 Cardinal Health. All rights reserved.

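My knowledge of regular expressions is very basic, and for now it isn't enough to come up with a good solution quickly.

As I see it, at least for these examples, the requirements for capturing the company name correctly are as follows: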

If there's a '©' or 'Copyright' in the sentence:
    After '©' or 'Copyright', look for a year, e.g. '2019', or a year range, e.g. '1995 - 2018' or '2003-2019' (spaces inside the range must be handled as well):
        If there's a dot somewhere after this year/year range, capture the text up to the dot. E.g. in 'Copyright © 1978-2019 Berkshire Hathaway Inc.' capture 'Berkshire Hathaway Inc'
        If there's no dot but there's the sentence 'All rights reserved', capture from the year/year range up to that sentence, ignoring any non-alphanumeric characters that precede it, such as spaces and commas. E.g. from '© 2019 Rediker Software, All Rights Reserved' capture 'Rediker Software'
        If there's neither a dot nor the sentence 'All rights reserved', capture from the year/year range to the end. E.g. from '© 2019 Verizon' capture 'Verizon'

Any suggestions for a good regular expression?


Answer:

You could consider using the regular expression

(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)

See the regex demo. The pattern relies on case-insensitive matching: either keep the inline (?i) prefix or pass the re.I flag.
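As a small illustration of why the case-insensitive flag matters, the sketch below runs the pattern on one of the sample lines three ways. The names BASE, with_flag, with_inline and case_sensitive are made up for this example.

```python
import re

# The answer's pattern without the inline (?i) prefix.
BASE = (
    r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)"
)

line = "© 2019 Rediker Software, All Rights Reserved"

# Case-insensitive via the flag argument.
with_flag = re.search(BASE, line, re.I).group(1)

# Case-insensitive via the inline modifier at the start of the pattern.
with_inline = re.search("(?i)" + BASE, line).group(1)

# Case-sensitive: "Rights" does not match "rights", so the first branch
# fails and the final ".*" branch captures the rest of the line.
case_sensitive = re.search(BASE, line).group(1)

print(with_flag)       # Rediker Software
print(with_inline)     # Rediker Software
print(case_sensitive)  # Rediker Software, All Rights Reserved
```

Without the flag, the "All rights reserved" lookahead only matches that exact capitalization, so differently cased notices fall through to the later branches.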

Details

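  • (?:©(?:\s*Copyright)?|Copyright(?:\s*©)?) - either
    • ©(?:\s*Copyright)? - a © char followed by an optional substring of 0+ whitespaces and then Copyright
    • | - or
    • Copyright(?:\s*©)? - Copyright followed by an optional substring of 0+ whitespaces and a © char
  • \s* - 0+ whitespaces
  • \d+ - 1+ digits (use \d{4} if the year always has 4 digits)
  • (?:\s*-\s*\d+)? - an optional sequence of a - enclosed with 0+ whitespaces, followed by 1+ digits (use \d{4} if the year always has 4 digits)
  • \s* - 0+ whitespaces
  • (.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*) - capturing group 1: any of the following alternatives, tried in this order:
    • .*?(?=\W*All\s+rights\s+reserved) - any 0+ chars other than line break chars, as few as possible, up to the point that is followed by 0+ non-word chars and the string All rights reserved
    • [^.]*(?=\.) - 0+ chars other than ., as many as possible, up to but not including a . (the Python demo uses [^.\n]* so that this branch cannot run past the end of a line)
    • .* - the rest of the line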
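The breakdown above can also be written straight into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of the same regex (using the [^.\n] variant from the demo); the name COPYRIGHT_RX and the sample list are made up for illustration.

```python
import re

COPYRIGHT_RX = re.compile(
    r"""
    (?: © (?:\s*Copyright)?      # the © char, optionally followed by Copyright
      | Copyright (?:\s*©)?      # or Copyright, optionally followed by ©
    )
    \s* \d+                      # first year
    (?: \s* - \s* \d+ )?         # optional hyphen and second year
    \s*
    (                            # group 1: the company name
        .*? (?= \W* All \s+ rights \s+ reserved )  # text before All rights reserved
      | [^.\n]* (?= \. )         # or text before the first dot on the line
      | .*                       # or the rest of the line
    )
    """,
    re.I | re.X,
)

samples = [
    "Copyright © 1978-2019 Berkshire Hathaway Inc.",
    "© 2019 Rediker Software, All Rights Reserved",
    "© Copyright 1999 - 2019 CVS Health",
    "© 2019 Verizon",
]
results = [COPYRIGHT_RX.search(s).group(1) for s in samples]
print(results)
```

The order of the three alternatives in group 1 matters: the regex engine takes the first branch that succeeds, so the "All rights reserved" branch is tried before the dot branch, and the catch-all comes last.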

Python demo:

import re
s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved."
rx = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)"
for m in re.findall(rx, s, re.I):
    print(m)

Output:

Apple Inc
Quid, Inc
Database Designs 
Rediker Software
EVOSUS, INC
Walmart
Exxon Mobil Corporation
Berkshire Hathaway Inc
McKesson Corporation
UnitedHealth Group
CVS Health
General Motors
Ford Motor Company
AT&T Intellectual Property
GENERAL ELECTRIC
AmerisourceBergen Corporation
Verizon
Fannie Mae
Jonas Construction Software Inc
Kroger | The Kroger Co
Express Scripts Holding Company
JPMorgan Chase & Co
Boeing
Bank of America Corporation
Wells Fargo
Cardinal Health
Quid, Inc
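Some captured names in the output keep trailing whitespace (for example "Database Designs "), because the final .* branch runs to the end of the line and . also matches "\r". A minimal sketch of one way to tidy this up is to process the text line by line and strip each result. The helper name extract_company is hypothetical, and the pattern assumes four-digit years (the \d{4} variant suggested in the details).

```python
import re

# Same pattern as the demo, but with \d{4} for the years (an assumption).
RX = re.compile(
    r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d{4}(?:\s*-\s*\d{4})?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)",
    re.I,
)


def extract_company(line):
    """Return the company name found in one copyright line, or None."""
    m = RX.search(line)
    if m is None:
        return None
    # Strip trailing spaces or a stray carriage return left by the .* branch.
    return m.group(1).strip()


text = (
    "© 2009 Database Designs \r\n"
    "© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n"
    "No notice here"
)
names = [extract_company(line) for line in text.splitlines()]
print(names)  # ['Database Designs', 'Wells Fargo', None]
```

Splitting with str.splitlines() also removes the "\r\n" line endings before matching, so no branch can look past the end of its own line.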

