(可能它甚至在更早的时候就已修复,但我不知道确切是哪个版本。nhahtdh的答案中有关类似问题的错误报告建议使用Java
9)。
TL; DR (修正前):
为什么[^\\D2]
,[^[^0-9]2]
,[^2[^0-9]]
得到不同的结果在Java中?
用于测试的代码。您现在可以跳过它。
String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };
System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
test.matches(regexes[0]), test.matches(regexes[1]),
test.matches(regexes[2]), test.matches(regexes[3]),
test.matches(regexes[4]), test.matches(regexes[5]));
可以说我需要正则表达式,它将接受以下字符
2
。所以,这样的正则表达式应该表示除了每个字符0
,1
,3
,4
,…, 9
。我至少可以在两个方面,这将是的总和写 的一切是不是数字 与
2 :
[[^0-9]2]
[\\D2]
这两个正则表达式均按预期工作
match , [[^0-9]2] , [\D2]
--------------------------
x , true , true
1 , false , false
2 , true , true
3 , false , false
^ , true , true
[ , true , true
] , true , true
现在说我要撤消接受的字符。 (所以我想接受除2以外的所有数字)我可以创建正则表达式,它明确包含所有接受的字符,例如
[013-9]
或尝试通过将其包装在另一个[^...]
类似的文件中来否定两个先前描述的正则表达式
[^\\D2]
[^[^0-9]2]
甚至
[^2[^0-9]]
但令我惊讶的是,只有前两个版本能按预期工作
match | [[^0-9]2] , [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]]
------+--------------------+-------------------------------------------
x | true , true | false , false , true , true
1 | false , false | true , true , false , true
2 | true , true | false , false , false , false
3 | false , false | true , true , false , true
^ | true , true | false , false , true , true
[ | true , true | false , false , true , true
] | true , true | false , false , true , true
所以我的问题是为什么[^[^0-9]2]
或[^2[^0-9]]
不这样做[^\D2]
?我可以以某种方式更正这些正则表达式,以便[^0-9]
在它们内部使用吗?
Oracle的Pattern
类的实现的字符类解析代码中存在一些奇怪的巫毒,如果从Oracle网站下载JRE / JDK或使用OpenJDK,则JDO /
JDK会附带这些巫毒。我还没有检查其他JVM(尤其是GNU
Classpath)实现如何解析问题中的正则表达式。
从这一点出发,对Pattern
类及其内部工作的任何引用都严格限于Oracle的实现(参考实现)。
Pattern
如问题所示,阅读和理解类如何解析嵌套的否定将花费一些时间。但是,我编写了一个程序1从Pattern
对象中提取信息(使用Reflection
API),以查看编译结果。下面的输出来自在Java
HotSpot Client VM版本1.7.0_51上运行我的程序。
1:目前,该程序令人尴尬。完成并重构后,我将使用链接更新此帖子。
[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
这没什么奇怪的。
[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
上面接下来的2种情况被编译为与相同的程序[^0-9]
,这是 违反直觉的 。
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
如问题中所述,在上述2种情况下没有什么奇怪的。
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0030][U+0031]
01
Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
如问题中所述,这两个案例按预期工作。但是,请注意引擎如何对第一个字符类(\D
)进行补充,并将集差异应用于由剩余字符组成的字符类。
[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
正如Keppil在评论中进行的测试所证实的那样,上面的输出显示上述所有3个正则表达式都编译到了同一程序中!
[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
取而代之的NOT(UNION(2, NOT(0-9))
是0-13-9
,UNION(NOT(2), NOT(0-9))
等于NOT(2)
。
[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
由于相同的错误,正则表达式可[^2[^[^0-9]]]
编译到相同的程序[^2[^0-9]]
。
有一个未解决的错误,似乎具有相同的性质:JDK-6609854。
下面是Pattern
在进一步阅读之前应该知道的类的实现细节:
Pattern
类将a编译String
成一个节点链,每个节点负责一个明确的小责任,并将工作委托给链中的下一个节点。Node
class是所有节点的基类。CharProperty
class是所有与字符类相关Node
的的基类。BitClass
class是CharProperty
使用boolean[]
数组加快对Latin-1字符(代码点<= 255)的匹配的类的子类。它有一个add
方法,允许在编译过程中添加字符。CharProperty.complement
,Pattern.union
,Pattern.intersection
是组对应的操作方法。他们所做的是不言自明的。Pattern.setDifference
是不对称集差。在查看完整的CharProperty clazz(boolean consume)
方法代码(负责解析字符类的方法)之前,让我们看一下代码的极为简化的版本以了解代码流程:
private CharProperty clazz(boolean consume) {
// [Declaration and initialization of local variables - OMITTED]
BitClass bits = new BitClass();
int ch = next();
for (;;) {
switch (ch) {
case '^':
// Negates if first char in a class, otherwise literal
if (firstInClass) {
// [CODE OMITTED]
ch = next();
continue;
} else {
// ^ not first in class, treat as literal
break;
}
case '[':
// [CODE OMITTED]
ch = peek();
continue;
case '&':
// [CODE OMITTED]
continue;
case 0:
// [CODE OMITTED]
// Unclosed character class is checked here
break;
case ']':
// [CODE OMITTED]
// The only return statement in this method
// is in this case
break;
default:
// [CODE OMITTED]
break;
}
node = range(bits);
// [CODE OMITTED]
ch = peek();
}
}
代码基本上会读取输入(输入String
转换为代码点的以 null结束
int[]
的输入),直到命中]
或到达String的末尾(未封闭字符类)为止。
该代码在该模块内部有点混乱continue
和break
混在一起switch
。但是,只要您意识到它continue
属于外for
循环并break
属于switch
块,代码就很容易理解:
continue
将永远不会在switch
语句后执行代码。break
可以在switch
语句之后执行代码(如果return
尚未执行)。通过上面的观察,我们可以看到,每当 发现一个非特殊字符并且应将其包括在字符类中时 ,我们将在switch
语句之后执行代码,这node = range(bits);
是第一个语句。
如果检查源代码,则该方法CharProperty range(BitClass bits)
将解析“字符类中的单个字符或字符范围”。该方法要么返回BitClass
传入的相同对象(添加新字符),要么返回CharProperty
类的新实例。
接下来,让我们看一下完整的代码版本(&&
省略解析字符类交集的部分):
private CharProperty clazz(boolean consume) {
CharProperty prev = null;
CharProperty node = null;
BitClass bits = new BitClass();
boolean include = true;
boolean firstInClass = true;
int ch = next();
for (;;) {
switch (ch) {
case '^':
// Negates if first char in a class, otherwise literal
if (firstInClass) {
if (temp[cursor-1] != '[')
break;
ch = next();
include = !include;
continue;
} else {
// ^ not first in class, treat as literal
break;
}
case '[':
firstInClass = false;
node = clazz(true);
if (prev == null)
prev = node;
else
prev = union(prev, node);
ch = peek();
continue;
case '&':
// [CODE OMITTED]
// There are interesting things (bugs) here,
// but it is not relevant to the discussion.
continue;
case 0:
firstInClass = false;
if (cursor >= patternLength)
throw error("Unclosed character class");
break;
case ']':
firstInClass = false;
if (prev != null) {
if (consume)
next();
return prev;
}
break;
default:
firstInClass = false;
break;
}
node = range(bits);
if (include) {
if (prev == null) {
prev = node;
} else {
if (prev != node)
prev = union(prev, node);
}
} else {
if (prev == null) {
prev = node.complement();
} else {
if (prev != node)
prev = setDifference(prev, node);
}
}
ch = peek();
}
}
纵观代码case '[':
中的switch
语句和之后的代码switch
语句:
node
变量存储解析的结果 单元 (一个独立的字符,字符范围,一个速记字符类,一POSIX / Unicode字符类或嵌套字符类)prev
变量存储编译的结果,到目前为止,始终更新我们编译一个之后 单元 在node
。由于boolean include
记录字符类是否被否定的局部变量永远不会传递给任何方法调用,因此只能在此方法中对其进行操作。唯一include
的读取和处理的地方是switch
语句之后。
这是我之前问题的后续。我意识到我需要更具体地说明我的regex案例,以获得适用于我的案例的答案。 我已经与这个正则表达式斗争了很长一段时间(也使用我上一个问题的答案),我似乎无法构建我需要的东西。 我需要将所有字符串中出现的两个重复出现的单引号替换为(因此字符串内部意味着单引号)。这是因为在一种语言(语法)中,字符串中的引号使用<code>‘转义。 这里有一个例子(实际的例子可以包含用< code
本文向大家介绍解释Java正则表达式中的字符类,包括了解释Java正则表达式中的字符类的使用技巧和注意事项,需要的朋友参考一下 Java正则表达式中的字符类使用方括号“ []”定义,该子表达式与指定字符或一组可能的字符中的单个字符匹配。 例如,正则表达式[abc]匹配单个字符a或b或c。同样,“ [az]”匹配从a到z的单个字符。 以下是字符Java正则表达式类的其他变体: 否定-字符类的否定变体
我正在比较以字符串形式交付的版本编号。 我需要检测某个版本是否为beta版本,我想检查该版本的最后一部分是否包含beta、build或b,我正在尝试使用regex执行此操作。 到目前为止,它可以工作,但一旦版本部分有一个空间,它就不行了。 也许有人有更好的主意? 迄今为止我的正则表达式: 字符串示例: v1。23 Beta125-- v1。25.458 beta 129-- 示例代码:
问题内容: 确定字符串是否仅由单个重复字符组成的正则表达式模式是什么? 例如 “ aaaaaaa” =真 “ aaabbbb” =假 “ $$$$$$$” =真 此问题检查字符串是否仅包含重复字符(例如“aabb”),但是我需要确定它是否为 单个 重复字符。 问题答案: 您可以尝试使用反向参考 演示 模式说明: 后向引用与捕获组先前匹配的文本匹配。反引用(反斜杠一个)引用第一个捕获组。匹配与第一个
我有一个包含数千行的文本文件。这里有一个例子 我试图提取'nt60'、'nt50'末尾的字符串。 问题是会包含行尾字符() 我想使用正则表达式搜索来匹配从 (') 开始的字符串,但我不知道我应该用什么来匹配 。 有人能帮忙吗?
问题内容: 有没有办法使用正则表达式来匹配重复的字符集?例如: 我知道那是错的。但是有什么可以匹配这种效果的吗? 更新: 您可以使用嵌套捕获组吗?像什么? 问题答案: 将要重复的正则表达式放在括号中。例如,如果您要重复5次: 或者,如果您想要任意数量的重复(0或更多): 或一个或多个重复: 编辑 以回应更新 正则表达式中的括号有两个作用:它们将正则表达式中的一系列项目组合在一起,以便您可以将运算符