Regex look-behind without an obvious maximum length in Java

申屠秦斩

2023-03-14

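Question:

I always thought that a look-behind assertion in Java's regex API (and in many other languages, for that matter) must have an obvious length. So the STAR and PLUS quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm some of my assumptions:

"[…] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. […]"

http://www.regular-expressions.info/lookaround.html

Using curly braces works as long as the total maximum length of the characters matched inside the look-behind is smaller than or equal to Integer.MAX_VALUE. So these regexes are valid: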
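The finite repetition described in the quote can be tried out with a small upper bound. The following is only a minimal sketch: the class name BoundedLookBehindDemo, the helper finds, and the caps 20 and 2 are illustrative assumptions, not part of the original question.

```java
import java.util.regex.Pattern;

class BoundedLookBehindDemo {

    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "CaaaaaaaaaaaaaaaBaaaa";

        // The cap of 20 is an arbitrary assumption; it only has to cover the
        // longest run of 'a' that can appear between 'C' and 'B'.
        String regex = "(?<=Ca{0,20})B";

        System.out.println(finds(regex, input));
        System.out.println(input.replaceAll(regex, "FOO"));

        // With a cap that is too small, the look-behind cannot reach back to 'C'.
        System.out.println(finds("(?<=Ca{0,2})B", "CaaaB"));
    }
}
```

Because the look-behind has an explicit, small maximum length, this form does not depend on how a particular JDK treats the star and plus quantifiers.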

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B"
"(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"

But these are not:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"

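However, I don't understand the following:

When I run a test using the * and + quantifiers inside a look-behind, all goes well (see the output of Test 1 and Test 2).

But when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see the output of Test 3).

Making the greedy * from Test 3 reluctant has no effect; it still breaks (see Test 4).

Here is the test harness: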

public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind       : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit      : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println("    "+testFind(regex, input));
            System.out.println("    "+testReplaceAll(regex, input));
            System.out.println("    "+testSplit(regex, input));
            System.out.println();
        }
    }
}

Output:

********************** Test 1 **********************
    testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 2 **********************
    testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 3 **********************
    testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^

********************** Test 4 **********************
    testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^

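My question may be obvious, but I'll ask it anyway: can someone explain why Tests 1 and 2 succeed while Tests 3 and 4 fail? I would have expected all of them to fail, not half of them to work and half of them to fail.

Answer:

Glancing at the source code of Pattern.java reveals that '*' and '+' are implemented as instances of Curly (the object created for curly-brace quantifiers). So,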

a*

is implemented as

a{0,0x7FFFFFFF}

and

a+

is implemented as

a{1,0x7FFFFFFF}

That is why you see exactly the same behaviour for the curly-brace and star quantifiers.
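The equivalence can be checked outside a look-behind by comparing a* with its explicit curly-brace spelling (0x7FFFFFFF is Integer.MAX_VALUE, written in decimal inside the regex). The sketch below also includes a hypothetical model of the length limit. It is an assumption, not code taken from the JDK: if a prefix length and the quantifier maximum were summed in plain int arithmetic, a bare a* would still fit, while Ca* would overflow, which is at least consistent with the Integer.MAX_VALUE boundaries observed in the question. The names CurlyEquivalenceDemo, sameResult and modelMaxLength are illustrative.

```java
import java.util.regex.Pattern;

class CurlyEquivalenceDemo {

    // Compares the star quantifier with its explicit curly-brace spelling.
    static boolean sameResult(String input) {
        boolean star = Pattern.matches("a*", input);
        boolean curly = Pattern.matches("a{0," + Integer.MAX_VALUE + "}", input);
        return star == curly;
    }

    // Hypothetical model, NOT the JDK's code: sums a fixed prefix length and a
    // quantifier's maximum repeat count in plain int arithmetic, which wraps
    // around to a negative number on overflow.
    static int modelMaxLength(int prefixLength, int maxRepeat) {
        return prefixLength + maxRepeat;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"", "aaaa", "aab"}) {
            System.out.println("\"" + input + "\" same result: " + sameResult(input));
        }

        // a* alone: the modeled maximum still fits in an int.
        System.out.println(modelMaxLength(0, Integer.MAX_VALUE));

        // Ca*: one extra character pushes the modeled maximum past
        // Integer.MAX_VALUE, so the sum wraps around and becomes negative.
        System.out.println(modelMaxLength(1, Integer.MAX_VALUE));
    }
}
```

Under this assumed model, a negative sum would be the signal that the look-behind has no usable maximum length, which matches Tests 3 and 4 failing while Tests 1 and 2 pass on the JDK versions discussed in the question. Other JDK releases may treat these patterns differently.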
