Question:

Splitting a large PDF file with PDFBox produces large result files

鲍鸿波

2023-03-14

Here is the code:

 PDDocument documentoPdf = 
        PDDocument.loadNonSeq(new File("myFile.pdf"), 
                           new RandomAccessFile(new File("./tmp/temp"), "rw"));

    int numPages = documentoPdf.getNumberOfPages();
    List pages = documentoPdf.getDocumentCatalog().getAllPages();

    int previusQR = 0;
    for(int i =0; i<numPages; i++){
       PDPage page = (PDPage) pages.get(i);
       BufferedImage firstPageImage =    
           page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);

       String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);

       if(qrText != null && i!=0){
         PDDocument outputDocument = new PDDocument();
         for(int j = previusQR; j<i; j++){
           outputDocument.importPage((PDPage)pages.get(j));
         }
         File f = new File("./splitting_files/"+previusQR+".pdf");
         outputDocument.save(f);
         outputDocument.close();
         // start the next chunk at this page
         previusQR = i;
       }
    }
    // close the source document only after every page has been processed
    documentoPdf.close();

I also tried the following code to store the new file:

PDDocument outputDocument = new PDDocument();

for(int j = previusQR; j<i; j++){
 PDStream src = ((PDPage)pages.get(j)).getContents();
 PDStream streamD = new PDStream(outputDocument);
 streamD.addCompression();

 PDPage newPage = new PDPage(new   
           COSDictionary(((PDPage)pages.get(j)).getCOSDictionary()));
 newPage.setContents(streamD);

 byte[] buf = new byte[10240];
 int amountRead = 0;
 InputStream is = null;
 OutputStream os = null;
 is = src.createInputStream();
 os = streamD.createOutputStream();
 while((amountRead = is.read(buf,0,10240)) > -1) {
    os.write(buf, 0, amountRead);
  }

 outputDocument.addPage(newPage);
}

File f = new File("./splitting_files/"+previusQR+".pdf");

outputDocument.save(f);
outputDocument.close();

But the files this code creates are missing some content, and each one is the same size as the original file.

Thanks!

1 answer

苏富

2023-03-14

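Thanks! Tilman, you were right: the PDFSplit command produces smaller files. I checked the PDFSplit code and found that it removes the page references held by links and other annotations, so that unneeded resources are not carried over into the split files.

Code extracted from the Splitter class: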

private void processAnnotations(PDPage imported) throws IOException
    {
        List<PDAnnotation> annotations = imported.getAnnotations();
        for (PDAnnotation annotation : annotations)
        {
            if (annotation instanceof PDAnnotationLink)
            {
                PDAnnotationLink link = (PDAnnotationLink)annotation;   
                PDDestination destination = link.getDestination();
                if (destination == null && link.getAction() != null)
                {
                    PDAction action = link.getAction();
                    if (action instanceof PDActionGoTo)
                    {
                        destination = ((PDActionGoTo)action).getDestination();
                    }
                }
                if (destination instanceof PDPageDestination)
                {
                    // TODO preserve links to pages within the splitted result  
                    ((PDPageDestination) destination).setPage(null);
                }
            }
            else
            {
                // TODO preserve links to pages within the splitted result  
                annotation.setPage(null);
            }
        }
    }

So in the end my code looks like this:

PDDocument documentoPdf =
        PDDocument.loadNonSeq(new File("docs_compuestos/50.pdf"),
                new RandomAccessFile(new File("./tmp/t"), "rw"));

int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();

int previusQR = 0;
for(int i =0; i<numPages; i++){
    PDPage firstPage = (PDPage) pages.get(i);
    String qrText ="";

    BufferedImage firstPageImage = firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);
    firstPage =null;

    try {
        qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
    } catch (NotFoundException e) {
        e.printStackTrace();
    } finally {
        firstPageImage = null;
    }

    if(i != 0 && qrText!=null){
        PDDocument outputDocument = new PDDocument();
        outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
        outputDocument.getDocumentCatalog().setViewerPreferences(
                documentoPdf.getDocumentCatalog().getViewerPreferences());

        for(int j = previusQR; j<i; j++){
            PDPage importedPage = outputDocument.importPage((PDPage)pages.get(j));

            importedPage.setCropBox( ((PDPage)pages.get(j)).findCropBox() );
            importedPage.setMediaBox( ((PDPage)pages.get(j)).findMediaBox() );
            // only the resources of the page will be copied
            importedPage.setResources( ((PDPage)pages.get(j)).getResources() );
            importedPage.setRotation( ((PDPage)pages.get(j)).findRotation() );

            processAnnotations(importedPage);
        }

        File f = new File("./splitting_files/"+previusQR+".pdf");

        previusQR = i;

        outputDocument.save(f);
        outputDocument.close();
    }
}
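
One thing to check in the loop as posted: a file is only written when a later page with a QR code is found, so the pages after the last QR page do not appear to be saved. The sketch below isolates the range calculation in plain Java, without PDFBox, to show one way of including that final chunk. The class name SplitRanges, the method computeRanges and the hasQr array are hypothetical names introduced for illustration; hasQr[i] stands in for "a QR code was read on page i".

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRanges {

    /**
     * Returns half-open page ranges {start, end}.
     * A chunk starts at page 0 and at every later page whose flag is true.
     * hasQr is a hypothetical stand-in for the QR detection result per page.
     */
    static List<int[]> computeRanges(boolean[] hasQr) {
        List<int[]> ranges = new ArrayList<>();
        int previous = 0;
        for (int i = 1; i < hasQr.length; i++) {
            if (hasQr[i]) {
                // a QR page closes the chunk that started at 'previous'
                ranges.add(new int[] {previous, i});
                previous = i;
            }
        }
        // the tail after the last QR page also needs its own file
        if (hasQr.length > 0) {
            ranges.add(new int[] {previous, hasQr.length});
        }
        return ranges;
    }

    public static void main(String[] args) {
        boolean[] hasQr = {true, false, false, true, false, true};
        for (int[] r : computeRanges(hasQr)) {
            System.out.println("pages " + r[0] + " to " + (r[1] - 1));
        }
    }
}
```

Each returned range could then drive the inner import loop (j from start up to, but not including, end), so that the detection pass and the writing pass stay separate.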
