Apache PDFBox: java.lang.OutOfMemoryError when using split(PDDocument document)

酆英达

2023-03-14

Question:

I am trying to split a 300-page document using the Apache PDFBox API v2.0.2. When I try to split the PDF file into single pages with the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I get the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

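This indicates that the GC is spending a large amount of time cleaning the heap, which is not justified by the amount of memory it reclaims.

There are many JVM tuning options that can work around this situation, but all of them treat the symptom rather than the real problem.

One last note: I am using JDK 6, so the new Java 8 Consumer is not an option in my case.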

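Edit:

This is not a duplicate of http://codingdict.com/questions/159530, for the following reasons:

 1. I do not have the size problem mentioned in that topic. I am slicing a 270-page, 13.8 MB PDF file, and after slicing, each slice averages 80 KB, for a total size of 30.7 MB.
 2. The split call itself throws the exception, before it returns any of the split parts.

I found that the split goes through as long as I do not pass in the whole document at once but instead pass it in "batches" of 20-30 pages each, which does the job.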

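Answer:

PDFBox stores the parts produced by the split operation on the heap as objects of type PDDocument, which fills the heap quickly. Even if close() is called after every round of the loop, the GC still cannot reclaim heap space as fast as it is being filled.

One option is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):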

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int totalPages = document.getNumberOfPages();
        int batchSize = 50;
        int batch = 1;
        // Splitter page numbers are 1-based and inclusive, so each batch
        // covers pages start..end and the next batch begins at end + 1.
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch++ + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the source document (JDK 6 has no try-with-resources).
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is the application's own settings class, not part of PDFBox.
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    String title = document.getDocumentInformation().getTitle();

    for (int index = 0; index < splittedDocuments.size(); index++) {
        // Name each file by its page number in the source document so that
        // files from different batches cannot overwrite one another.
        File pdfFile = new File(outputPath, title + "-page-" + (start + index) + ".pdf");
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            splittedDocument.save(pdfFile);
        } finally {
            // Release each single-page document as soon as it is written.
            splittedDocument.close();
        }
    }
}
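
Because Splitter's setStartPage and setEndPage are 1-based and inclusive, consecutive batches should not share a boundary page. As a minimal sketch of that arithmetic alone, assuming a hypothetical helper class PageBatches (not part of PDFBox) and written without Java 7+ syntax so that it should compile on JDK 6, the ranges could be computed like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes page ranges for batching.
final class PageBatches {

    // Returns {start, end} pairs that are 1-based and inclusive,
    // covering pages 1..totalPages with at most batchSize pages each.
    static List<int[]> ranges(int totalPages, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        // 270 pages in batches of 50: five full batches and one of 20 pages.
        for (int[] range : ranges(270, 50)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

Each pair could then be passed to a method such as split(document, start, end). With 270 pages and a batch size of 50, the last range is 251-270 rather than a full batch, and a page count that is an exact multiple of the batch size produces no empty trailing batch.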
