Question:

How to densely merge PDF files using PDFBox 2 without whitespace near page breaks?

乜安志
2023-03-14

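We have been using the iText-based PdfVeryDenseMergeTool, which we found in the question "How to remove whitespace on merge", to merge multiple PDF files into a single PDF file. The tool merges PDFs without leaving any whitespace in between, and where possible an individual PDF is also split across pages.

We would like to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PDFDenseMerge tool that can merge PDFs like this:

Individual PDFs:

Densely merged PDF:

We are looking for something like this (which the iText-based PdfVeryDenseMergeTool already does, but we would like to achieve it with PDFBox 2):

While attempting the port, we found that PdfVeryDenseMergeTool uses PageVerticalAnalyzer, which extends an iText PDF render listener and performs some action every time text, an image, or an arc is drawn in the PDF. All of that rendering information is then used to split a single PDF across multiple pages. We looked for a similar render listener in PDFBox 2, but the PDFRenderer class we found only offers methods for rendering to images. So we are not sure how to port PageVerticalAnalyzer to PDFBox.

If anyone can suggest a way forward, we would really appreciate the help.

Thanks a lot!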

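Edit, 7 February 2020

We are currently extending PDFGraphicsStreamEngine from PDFBox to create a custom rendering engine that tracks the coordinates of images, text lines, and arcs as they are drawn. This custom engine is meant to be the port of PageVerticalAnalyzer. After that, we hope to be able to port PdfVeryDenseMergeTool to PDFBox.

Edit, 8 February 2020

Here is a very simple port of PageVerticalAnalyzer that handles images and text. I am a PDFBox newbie, so my logic for handling images may be shaky. This is the basic approach:

Text: for each glyph printed, get the bottom y and set topY = bottom + charHeight, then mark these top/bottom points.

Images: for each call to drawImage(), there seem to be two ways to find out where it is drawn. The first uses the coordinates from the last call to appendRectangle(); the second uses the last sequence of moveTo(), multiple lineTo(), and closePath() calls. I give priority to the latter. If I cannot find any path (I found one in one PDF; in another I only found appendRectangle() before drawImage()), I use the former. If neither exists, I do not know what to do. Here is how I assume PDFBox marks an image using moveTo()/lineTo()/closePath():

Here is my current implementation: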

import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;


public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine
{
    /**
     * This is a port of iText based PageVerticalAnalyzer found here
     * https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/merge/PageVerticalAnalyzer.java
     *
     * @param page PDF Page
     */
    protected PageVerticalAnalyzer(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        File file = new File("q2.pdf");

        try (PDDocument doc = PDDocument.load(file))
        {
            PDPage page = doc.getPage(0);
            PageVerticalAnalyzer engine = new PageVerticalAnalyzer(page);
            engine.run();

            System.out.println(engine.verticalFlips);
        }
    }

    /**
     * Runs the engine on the current page.
     *
     * @throws IOException If there is an IO error while drawing the page.
     */
    public void run() throws IOException
    {
        processPage(getPage());

        for (PDAnnotation annotation : getPage().getAnnotations())
        {
            showAnnotation(annotation);
        }
    }

    // All path related stuff

    @Override
    public void clip(int windingRule) throws IOException
    {
        System.out.println("clip");
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        System.out.printf("moveTo %.2f %.2f%n", x, y);
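        // (Float) null would throw a NullPointerException when unboxed;
        // start with bottom == top and let closePath() set the bottom later
        lastPathBottomTop = new float[] {y, y};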
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        System.out.printf("lineTo %.2f %.2f%n", x, y);
        lastLineTo = new float[] {x, y};
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        System.out.printf("curveTo %.2f %.2f, %.2f %.2f, %.2f %.2f%n", x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        // if you want to build paths, you'll need to keep track of this like PageDrawer does
        return new Point2D.Float(0, 0);
    }

    @Override
    public void closePath() throws IOException
    {
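        System.out.println("closePath");
        // guard against closePath() without a preceding moveTo()/lineTo()
        if (lastPathBottomTop != null && lastLineTo != null)
        {
            lastPathBottomTop[0] = lastLineTo[1];
        }
        lastLineTo = null;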
    }

    @Override
    public void endPath() throws IOException
    {
        System.out.println("endPath");
    }

    @Override
    public void strokePath() throws IOException
    {
        System.out.println("strokePath");
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        System.out.println("fillPath");
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        System.out.println("fillAndStrokePath");
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException
    {
        System.out.println("shadingFill " + shadingName.toString());
    }

    // Rectangle related stuff

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        System.out.printf("appendRectangle %.2f %.2f, %.2f %.2f, %.2f %.2f, %.2f %.2f%n",
                p0.getX(), p0.getY(), p1.getX(), p1.getY(),
                p2.getX(), p2.getY(), p3.getX(), p3.getY());

        lastRectBottomTop = new float[] {(float) p0.getY(), (float) p3.getY()};
    }

    // Image drawing

    @Override
    public void drawImage(PDImage pdImage) throws IOException
    {
        System.out.println("drawImage");
        if (lastPathBottomTop != null) {
            addVerticalUseSection(lastPathBottomTop[0], lastPathBottomTop[1]);  
        } else if (lastRectBottomTop != null ){
            addVerticalUseSection(lastRectBottomTop[0], lastRectBottomTop[1]);
        } else {
            throw new Error("Drawing image without last reference!");
        }

        lastPathBottomTop = null;
        lastRectBottomTop = null;

    }

    // All text related stuff

    @Override
    public void showTextString(byte[] string) throws IOException
    {
        System.out.print("showTextString \"");
        super.showTextString(string);
        System.out.println("\"");
    }

    @Override
    public void showTextStrings(COSArray array) throws IOException
    {
        System.out.print("showTextStrings \"");
        super.showTextStrings(array);
        System.out.println("\"");
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                             Vector displacement) throws IOException
    {
        // print the actual character that is being rendered 
        System.out.print(unicode);

        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);

        // The text rendering matrix maps glyph space to page space; its
        // translation part is the glyph origin on the baseline.
        //System.out.println(textRenderingMatrix.toString());

        // y of the bottom of the char (strictly, of its baseline).
        // getValue(0, 7) only worked because Matrix keeps its 3x3 values in one
        // flat array indexed as row * 3 + column; getTranslateY() reads the same entry.
        float yBottom = textRenderingMatrix.getTranslateY();

        // height of the char, approximated by the vertical scale of the matrix
        // (roughly the font size); getValue(0, 0) is the horizontal scale instead
        float yTop =  yBottom + textRenderingMatrix.getScaleY();

        addVerticalUseSection(yBottom, yTop);
    }

    // Keeping track of bottom/top point pairs
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
    private float[] lastRectBottomTop;
    private float[] lastPathBottomTop;
    private float[] lastLineTo;

}

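I am looking for answers to the following questions:

  • How can I improve this implementation?
  • How do I handle the other things I have not handled yet, such as curves?

1 answer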

史良哲
2023-03-14

This answer shares the same issues as the original iText version.

One can port PageVerticalAnalyzer from iText to PDFBox as follows:

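import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDCIDFontType2;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDSimpleFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
import org.apache.pdfbox.pdmodel.font.PDType3Font;
import org.apache.pdfbox.pdmodel.font.PDVectorFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine {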
    protected PageVerticalAnalyzer(PDPage page) {
        super(page);
    }

    public List<Float> getVerticalFlips() {
        return verticalFlips;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            addVerticalUseSection(rect.getMinY(), rect.getMaxY());
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        Section section = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                Point2D.Float point = ctm.transformPoint(x, y);
                if (section == null)
                    section = new Section(point.y);
                else
                    section.extendTo(point.y);
            }
        }
        addVerticalUseSection(section.from, section.to);
    }

    //
    // Paths
    //
    @Override
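    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        subPath = null;
        Section section = new Section(p0.getY());
        section.extendTo(p1.getY()).extendTo(p2.getY()).extendTo(p3.getY());
        // register the rectangle with the current path; otherwise the section
        // computed above is discarded and stroked or filled rectangles are never counted
        path.add(section);
        currentPoint = p0;
    }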

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        subPath = new Section(y);
        path.add(subPath);
        currentPoint = new Point2D.Float(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        if (subPath == null) {
            subPath = new Section(y);
            path.add(subPath);
        } else
            subPath.extendTo(y);
        currentPoint = new Point2D.Float(x, y);
    }

    /**
     * Beware! This is incorrect! The control points may be outside
     * the vertically used range 
     */
    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        if (subPath == null) {
            subPath = new Section(y1);
            path.add(subPath);
        } else
            subPath.extendTo(y1);
        subPath.extendTo(y2).extendTo(y3);
        currentPoint = new Point2D.Float(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
        subPath = null;
    }

    @Override
    public void strokePath() throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
        // TODO Auto-generated method stub
    }

    Point2D currentPoint = null;

    List<Section> path = new ArrayList<Section>();
    Section subPath = null;

    static class Section {
        Section(double value) {
            this((float)value);
        }

        Section(float value) {
            from = value;
            to = value;
        }

        Section extendTo(double value) {
            return extendTo((float)value);
        }

        Section extendTo(float value) {
            if (value < from)
                from = value;
            else if (value > to)
                to = value;
            return this;
        }

        private float from;
        private float to;
    }

    void addVerticalUseSection(double from, double to) {
        addVerticalUseSection((float)from, (float)to);
    }

    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
}

(PageVerticalAnalyzer.java)

The implementation is actually similar to the BoundingBoxFinder in this answer. As there, I borrowed from the PDFBox example DrawPrintTextLocations to determine the text outlines.
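To make the result of the analyzer easier to read, here is a minimal standalone sketch of how the verticalFlips list behaves. It copies only the addVerticalUseSection logic from the class above; the class name VerticalFlipsDemo and the sample coordinates are made up for illustration. The list stays sorted, entries at even indices start a vertically used band, and entries at odd indices end it.

```java
import java.util.ArrayList;
import java.util.List;

class VerticalFlipsDemo {
    // Sorted y coordinates: even index = start of a used band, odd index = its end.
    final List<Float> verticalFlips = new ArrayList<>();

    // Same merging logic as addVerticalUseSection in PageVerticalAnalyzer.
    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        int i = 0, j = 0;
        for (; i < verticalFlips.size(); i++) {
            if (verticalFlips.get(i) < from)
                continue;
            for (j = i; j < verticalFlips.size(); j++) {
                if (verticalFlips.get(j) < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i % 2 == 0;
        boolean toOutsideInterval = j % 2 == 0;

        // drop flips swallowed by the new section, then insert its own ends
        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    public static void main(String[] args) {
        VerticalFlipsDemo demo = new VerticalFlipsDemo();
        demo.addVerticalUseSection(700f, 720f); // e.g. a line of text
        demo.addVerticalUseSection(500f, 600f); // e.g. an image
        demo.addVerticalUseSection(590f, 650f); // overlaps the image band
        System.out.println(demo.verticalFlips); // [500.0, 650.0, 700.0, 720.0]
    }
}
```

The overlapping sections 500..600 and 590..650 collapse into one band 500..650, so only the gap between 650 and 700 remains as removable whitespace.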

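Also, there is an issue in the curve handling that corresponds to the one in the original iText 5 PageVerticalAnalyzer. As in that answer, the control points are treated as if they were located on the actual curve, while in general they are not and may lie far outside the vertical range the curve actually uses. One could use the corresponding AWT classes instead of the path handling implemented here, but that might not be possible on platforms like Android.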
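One way to avoid the control point problem without AWT is to compute the real vertical extent of each cubic Bézier segment from its derivative. The following is only a sketch of that idea, not part of the answer's code; the class and method names (CubicBezierYRange, yRange, evaluate) are hypothetical. It takes the y coordinates of the current point and of the three curveTo points, finds the parameters where dy/dt is zero, and keeps those inside the open interval (0, 1).

```java
class CubicBezierYRange {
    /** Returns {minY, maxY} of the cubic Bezier curve with y coordinates y0..y3. */
    static double[] yRange(double y0, double y1, double y2, double y3) {
        // the end points always lie on the curve
        double min = Math.min(y0, y3);
        double max = Math.max(y0, y3);

        // dy/dt divided by 3 is qa*t^2 + qb*t + qc
        double a = y1 - y0, b = y2 - y1, c = y3 - y2;
        double qa = a - 2 * b + c;
        double qb = 2 * (b - a);
        double qc = a;

        double[] roots;
        if (Math.abs(qa) < 1e-12) {
            // derivative is linear (or constant)
            roots = Math.abs(qb) < 1e-12 ? new double[0] : new double[] { -qc / qb };
        } else {
            double disc = qb * qb - 4 * qa * qc;
            if (disc < 0) {
                roots = new double[0];
            } else {
                double sqrt = Math.sqrt(disc);
                roots = new double[] { (-qb + sqrt) / (2 * qa), (-qb - sqrt) / (2 * qa) };
            }
        }

        // only extrema strictly inside the segment matter
        for (double t : roots) {
            if (t > 0 && t < 1) {
                double y = evaluate(y0, y1, y2, y3, t);
                min = Math.min(min, y);
                max = Math.max(max, y);
            }
        }
        return new double[] { min, max };
    }

    /** Evaluates the y coordinate of the curve at parameter t. */
    static double evaluate(double y0, double y1, double y2, double y3, double t) {
        double u = 1 - t;
        return u * u * u * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t * t * t * y3;
    }

    public static void main(String[] args) {
        // control points reach y = 100, but the curve itself only reaches y = 75
        double[] range = yRange(0, 100, 100, 0);
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```

In curveTo one could then extend the current Section with the two returned values instead of with y1 and y2, using the y coordinate of currentPoint as y0. That wiring is an assumption and untested against real PDFs.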

As there, this class ignores annotations, but so does the iText 5 version. This class also ignores clip paths...

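import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.multipdf.LayerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.util.Matrix;

public class PdfVeryDenseMergeTool {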
    public PdfVeryDenseMergeTool(PDRectangle size, float top, float bottom, float gap)
    {
        this.pageSize = size;
        this.topMargin = top;
        this.bottomMargin = bottom;
        this.gap = gap;
    }

    public void merge(OutputStream outputStream, Iterable<PDDocument> inputs) throws IOException
    {
        try
        {
            openDocument();
            for (PDDocument input: inputs)
            {
                merge(input);
            }
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.save(outputStream);
        }
        finally
        {
            closeDocument();
        }
        
    }

    void openDocument() throws IOException
    {
        document = new PDDocument();
        newPage();
    }

    void closeDocument() throws IOException
    {
        try
        {
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.close();
        }
        finally
        {
            this.document = null;
            this.yPosition = 0;
        }
    }
    
    void newPage() throws IOException
    {
        if (currentContents != null) {
            currentContents.close();
            currentContents = null;
        }
        currentPage = new PDPage(pageSize);
        document.addPage(currentPage);
        yPosition = pageSize.getUpperRightY() - topMargin;
        currentContents = new PDPageContentStream(document, currentPage);
    }

    void merge(PDDocument input) throws IOException
    {
        for (PDPage page : input.getPages())
        {
            merge(input, page);
        }
    }

    void merge(PDDocument sourceDoc, PDPage page) throws IOException
    {
        PDRectangle pageSizeToImport = page.getCropBox();

        PageVerticalAnalyzer analyzer = new PageVerticalAnalyzer(page);
        analyzer.processPage(page);
        List<Float> verticalFlips = analyzer.getVerticalFlips();
        if (verticalFlips.size() < 2)
            return;

        LayerUtility layerUtility = new LayerUtility(document);
        PDFormXObject form = layerUtility.importPageAsForm(sourceDoc, page);

        int startFlip = verticalFlips.size() - 1;
        boolean first = true;
        while (startFlip > 0)
        {
            if (!first)
                newPage();

            float freeSpace = yPosition - pageSize.getLowerLeftY() - bottomMargin;
            int endFlip = startFlip + 1;
            while ((endFlip > 1) && (verticalFlips.get(startFlip) - verticalFlips.get(endFlip - 2) < freeSpace))
                endFlip -=2;
            if (endFlip < startFlip)
            {
                float height = verticalFlips.get(startFlip) - verticalFlips.get(endFlip);

                currentContents.saveGraphicsState();
                currentContents.addRect(0, yPosition - height, pageSizeToImport.getWidth(), height);
                currentContents.clip();
                Matrix matrix = Matrix.getTranslateInstance(0, (float)(yPosition - (verticalFlips.get(startFlip) - pageSizeToImport.getLowerLeftY())));
                currentContents.transform(matrix);
                currentContents.drawForm(form);
                currentContents.restoreGraphicsState();

                yPosition -= height + gap;
                startFlip = endFlip - 1;
            }
            else if (!first) 
                throw new IllegalArgumentException(String.format("Page %s content sections too large.", page));
            first = false;
        }
    }

    PDDocument document = null;
    PDPage currentPage = null;
    PDPageContentStream currentContents = null;
    float yPosition = 0; 

    final PDRectangle pageSize;
    final float topMargin;
    final float bottomMargin;
    final float gap;
}

(PdfVeryDenseMergeTool.java)

This is essentially a straightforward port of the iText 5 PdfVeryDenseMergeTool, nothing special.
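The splitting loop in merge(PDDocument, PDPage) is easier to follow without the PDF plumbing. The sketch below reproduces only its index arithmetic on a plain list of flips; the class name DenseSplitSketch, the method planChunks, and its two free-space parameters are hypothetical stand-ins for yPosition, the margins, and newPage(). It walks the flips from the top of the source page downwards and greedily collects as many used bands as fit into the free space of the current target page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DenseSplitSketch {
    /**
     * Plans the chunks {bottomY, topY} a source page is cut into.
     * firstFreeSpace is the room left on the current target page,
     * pageFreeSpace the room on every fresh target page.
     */
    static List<float[]> planChunks(List<Float> flips, float firstFreeSpace, float pageFreeSpace) {
        List<float[]> chunks = new ArrayList<>();
        int startFlip = flips.size() - 1;
        float freeSpace = firstFreeSpace;
        boolean first = true;
        while (startFlip > 0) {
            if (!first)
                freeSpace = pageFreeSpace; // stands in for newPage()

            // extend the chunk downwards band by band while it still fits
            int endFlip = startFlip + 1;
            while (endFlip > 1 && flips.get(startFlip) - flips.get(endFlip - 2) < freeSpace)
                endFlip -= 2;

            if (endFlip < startFlip) {
                chunks.add(new float[] { flips.get(endFlip), flips.get(startFlip) });
                startFlip = endFlip - 1;
            } else if (!first) {
                // even an empty page cannot hold the next band
                throw new IllegalArgumentException("Content section too large for a page.");
            }
            first = false;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Float> flips = Arrays.asList(100f, 200f, 300f, 450f, 500f, 700f);
        for (float[] chunk : planChunks(flips, 250f, 400f))
            System.out.println(chunk[0] + " .. " + chunk[1]);
    }
}
```

With 250 units left on the current page only the top band 500..700 fits; the remaining two bands span 100..450 and go together onto the next page. As in the tool above, a single band taller than a fresh page is rejected with an exception rather than scaled.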

One simply creates a PdfVeryDenseMergeTool instance with the format information and then starts the merge using PDDocument instances as sources:

PDDocument document1 = ...;
...
PDDocument documentN = ...;

PdfVeryDenseMergeTool tool = new PdfVeryDenseMergeTool(PDRectangle.A4, 30, 30, 10);
tool.merge(new FileOutputStream(RESULT_FILE), Arrays.asList(document1, ..., documentN));

(DenseMerging test testVeryDenseMerging)
