数千个文件中的模式匹配

墨安阳

2023-03-14

问题内容：

我有一个类似welcome1|welcome2|changeme… 的正则表达式模式，我需要搜索成千上万个文件（大小从1KB到24
MB不等）以成千上万个文件（介于100到8000之间）。

我想知道是否有比我尝试过的模式匹配更快的方法。

环境：

杰克1.8
Windows 10
Unix4j库

这是我到目前为止尝试过的

try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                                    .filter(FilePredicates.isFileAndNotDirectory())) {

        List<String> obviousStringsList = Strings_PASSWORDS.stream()
                                                .map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this

        Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));

        GrepOptions options = new GrepOptions.Default(GrepOption.count,
                                                        GrepOption.ignoreCase,
                                                        GrepOption.lineNumber,
                                                        GrepOption.matchingFiles);
        Instant startTime = Instant.now();

        final List<Path> filesWithObviousStringss = stream
                .filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
                .collect(Collectors.toList());

        System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}

我明白了Time taken = 60 seconds，这让我觉得我做错了什么。

我对流使用了不同的方法，平均每种方法需要大约一分钟的时间来处理当前的6660个文件文件夹。

mysys2 / mingw64上的Grep持续约15秒，exec('grep...')而node.js中的Grep 持续约12秒。

我选择Unix4j是因为它提供了Java本机grep和干净的代码。

可惜的是，有没有一种方法可以在Java中产生更好的结果呢？

问题答案：

本机工具可以更快地处理此类文本文件的主要原因是它们假定一个特定的字符集，尤其是当它具有基于ASCII的8位编码，而Java执行字节到字符的转换时，其抽象能力可以支持任意字符集。

当我们类似地假设具有上述属性的单个字符集时，我们可以使用低级工具，这些工具可能会大大提高性能。

对于此类操作，我们定义了以下辅助方法：

private static char[] getTable(Charset cs) {
    if(cs.newEncoder().maxBytesPerChar() != 1f)
        throw new UnsupportedOperationException("Not an 8 bit charset");
    byte[] raw = new byte[256];
    IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
    char[] table = new char[256];
    cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
      .decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
    for(int i = 0; i < 128; i++)
        if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
    return table;
}

和

private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
    try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
        long actualSize = fch.size();
        int size = (int)actualSize;
        if(size != actualSize) throw new UnsupportedOperationException("file too large");
        MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
        final class MappedCharSequence implements CharSequence {
            final int start, size;
            MappedCharSequence(int start, int size) {
                this.start = start;
                this.size = size;
            }
            public int length() {
                return size;
            }
            public char charAt(int index) {
                if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
                byte b = mbb.get(start + index);
                return b<0? table[b+256]: (char)b;
            }
            public CharSequence subSequence(int start, int end) {
                int newSize = end - start;
                if(start<0 || end < start || end-start > size)
                    throw new IndexOutOfBoundsException();
                return new MappedCharSequence(start + this.start, newSize);
            }
            public String toString() {
                return new StringBuilder(size).append(this).toString();
            }
        }
        return new MappedCharSequence(0, size);
    }
}

这允许将文件映射到虚拟内存并将其直接投影到CharSequence，而无需复制操作，前提是可以使用一个简单的表进行映射，并且对于基于ASCII的字符集，大多数字符甚至都不需要一个表查找，因为它们的数值与Unicode代码点相同。

使用这些方法，您可以将操作实现为

// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
    Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
    long startTime = System.nanoTime();
    final List<Path> filesWithObviousStringss = stream//.parallel()
            .filter(path -> {
                try {
                    return pattern.matcher(mapAsciiBasedText(path, table)).find();
                } catch(IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            })
            .collect(Collectors.toList());
    System.out.println("Time taken = "
        + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

它的运行速度比普通文本转换快得多，但仍支持并行执行。

除了需要基于ASCII的单字节编码外，还有一个限制是该代码不支持大于2
GiB的文件。尽管可以扩展解决方案以支持更大的文件，但除非确实需要，否则我不会添加这种复杂性。

数千个文件中的模式匹配

相关阅读

相关文章

相关问答

相关工具

相关文档