Question:

PDFBox gets wrong text positions in a specific PDF document

滕无尘

2023-03-14

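Background

I have been developing a program that takes a PDF, highlights some words (via PDFBox markup annotations) and saves the new PDF.

To do this, I extend the PDFTextStripper class and override the writeString() method to get the TextPositions of each word (box), so that I know exactly where the text sits in the PDF document (the TextPosition objects give me the coordinates of each word box). Based on that, I draw a rectangle that highlights the word I want.

Problem

It works for all the documents I have tried so far except one, where the positions I get from the TextPositions seem to be wrong, leading to wrong highlights.

This is the original document:
https://pdfhost.io/v/b1Mcpoy~s_Thomson.pdf

This is the document with a highlight on the first word box that writeString() gives me with setSortByPosition(false), which is MicroRNA:
https://pdfhost.io/v/V6INb4Xet_Thomson.pdf
It should have highlighted MicroRNA, but it highlighted a blank area above it (the pink HL rectangle).

This is the document with a highlight on the first word box that writeString() gives me with setSortByPosition(true), which is Original:
https://pdfhost.io/v/Lndh.j6ji_Thomson.pdf
It should have highlighted Original, but it highlighted a blank area at the very beginning of the PDF document (the pink HL rectangle).

I suppose this PDF contains something that makes it hard for PDFBox to find the correct positions, or this may be a bug in PDFBox.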

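Tech specs:

PDFBox 2.0.17
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB, x86_64

Coordinate values

For example, for the first and last characters of the MicroRNA word box, the TextPosition coordinates that writeString() gives me are:

Letter M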

endX = 59.533783
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 35.886597
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
unicode = M
direction = -1.0

Letter A

endX = 146.34933
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 129.18181
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
fontSizePt = 23
unicode = A
direction = -1.0
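
A quick consistency check on the values above: pageHeight - endY = 781.96533 - 682.696, which is about 99.2693 and matches the reported y to within float rounding. This suggests that, for this page, y is measured from the top of the page while endY is measured from the bottom. The sketch below only reproduces that arithmetic; the class and method names are hypothetical, and it is an observation about these numbers rather than a statement of documented PDFBox behavior.

```java
// Hypothetical helper: checks that y and endY above describe the same
// baseline, one measured from the page top and one from the page bottom.
final class TextPositionCheck {

    /** Converts a y value measured from the page bottom to one measured from the top. */
    static double fromBottomToTop(double pageHeight, double yFromBottom) {
        return pageHeight - yFromBottom;
    }

    public static void main(String[] args) {
        double pageHeight = 781.96533; // pageHeight reported for the letter M
        double endY = 682.696;         // endY reported for the letter M
        double y = 99.26935;           // y reported for the letter M

        double derived = fromBottomToTop(pageHeight, endY);
        System.out.printf("derived y = %.5f, reported y = %.5f%n", derived, y);
        System.out.println("match within 0.001: " + (Math.abs(derived - y) < 1e-3));
    }
}
```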

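These coordinates result in the wrong HL annotation I shared above, whereas for all the other PDF documents I have tested (and I have tested many) the highlight is very precise. I'm clueless here, and I'm no expert in PDF positioning. I tried the PDFBox debugger tool, but I couldn't read its output properly. Any help would be greatly appreciated. Let me know if I can provide more evidence. Thanks.

Edit

Please note that text extraction itself works fine.

My code

First, I build an array of coordinates from a few values of the TextPosition objects of the first and last characters I want to HL: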

private void extractHLCoordinates(TextPosition firstPosition, TextPosition lastPosition, int pageNumber) {
    double firstPositionX = firstPosition.getX();
    double firstPositionY = firstPosition.getY();
    double lastPositionEndX = lastPosition.getEndX();
    double lastPositionY = lastPosition.getY();

    double height = firstPosition.getHeight();
    double width = firstPosition.getWidth();
    int rotation = firstPosition.getRotation();

    double[] wordCoordinates = {firstPositionX, firstPositionY, lastPositionEndX, lastPositionY, pageNumber, 
    height, width, rotation};

    
    ...
}

Now it's time to draw, based on the extracted coordinates:

for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {

    PDPage page = pdDocument.getPage(pageIndex);
    List<PDAnnotation> annotations = page.getAnnotations();

    int rotation;
    double pageHeight = page.getMediaBox().getHeight();
    double pageWidth  = page.getMediaBox().getWidth();
    
    // each CoordinatePoint obj holds the double array with the 
    // coordinates of each word I want to HL - see the previous method
    for (CoordinatePoint coordinate : coordinates) {
        double[] wordCoordinates = coordinate.getCoordinates();
        
        int pageNumber = (int) wordCoordinates[4];

        // if the current coordinates are not related to the current page, 
        //ignore them
        if ((int) pageNumber == (pageIndex + 1)) {
            // getting rotation of the page: portrait, landscape...
            rotation = (int) wordCoordinates[7];

            double firstPositionX = wordCoordinates[0];
            double firstPositionY = wordCoordinates[1];
            double lastPositionEndX = wordCoordinates[2];
            double lastPositionY = wordCoordinates[3];
            double height = wordCoordinates[5];

            double minX;
            double maxX;
            double minY;
            double maxY;
            
            if (rotation == 90) {

                double width = wordCoordinates[6];
                width = (pageHeight * width) / pageWidth;

                //defining coordinates of a rectangle
                maxX = firstPositionY;
                minX = firstPositionY - height;
                minY = firstPositionX;
                maxY = firstPositionX + width;
            } else {
                minX = firstPositionX;
                maxX = lastPositionEndX;
                minY = pageHeight - firstPositionY;
                maxY = pageHeight - lastPositionY + height;
            }
                    
            // Finally I draw the Rectangle
            PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);

            PDRectangle pdRectangle = new PDRectangle();
            pdRectangle.setLowerLeftX((float) minX);
            pdRectangle.setLowerLeftY((float) minY);
            pdRectangle.setUpperRightX((float) maxX);
            pdRectangle.setUpperRightY((float) ((float) maxY + height));

            txtMark.setRectangle(pdRectangle);

            // And the QuadPoints
            float[] quads = new float[8];
            quads[0] = pdRectangle.getLowerLeftX();  // x1
            quads[1] = pdRectangle.getUpperRightY() - 2; // y1
            quads[2] = pdRectangle.getUpperRightX(); // x2
            quads[3] = quads[1]; // y2
            quads[4] = quads[0];  // x3
            quads[5] = pdRectangle.getLowerLeftY() - 2; // y3
            quads[6] = quads[2]; // x4
            quads[7] = quads[5]; // y4

            txtMark.setQuadPoints(quads);
            ...
        }
    }
}
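
The quad-point layout used above (upper-left, upper-right, lower-left, lower-right) can be sketched in isolation as follows. This is a minimal illustration with plain floats instead of PDRectangle; the class and method names are hypothetical, and it deliberately leaves out the "- 2" vertical offset that the code above applies.

```java
import java.util.Arrays;

// Hypothetical helper: builds the 8-element quad-point array for one
// axis-aligned rectangle, in the same corner order as the code above.
final class QuadPointsSketch {

    /** Returns {x1, y1, x2, y2, x3, y3, x4, y4}: upper-left, upper-right, lower-left, lower-right. */
    static float[] quadPoints(float llx, float lly, float urx, float ury) {
        return new float[] {
            llx, ury, // x1, y1: upper-left
            urx, ury, // x2, y2: upper-right
            llx, lly, // x3, y3: lower-left
            urx, lly  // x4, y4: lower-right
        };
    }

    public static void main(String[] args) {
        // Example rectangle with made-up coordinates.
        float[] quads = quadPoints(10f, 20f, 110f, 40f);
        System.out.println(Arrays.toString(quads));
    }
}
```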

1 answer

裴嘉许

2023-03-14

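Your quad-point coordinates are calculated relative to the CropBox, but they need to be relative to the MediaBox. For this document the CropBox is smaller than the MediaBox, so the highlights end up in the wrong position. Adjust x by CropBox.LLX - MediaBox.LLX and y by MediaBox.URY - CropBox.URY, and the highlight will be in the right position.
The adjustment above applies to pages with rotation = 0. If rotation != 0, further adjustment may be needed, depending on how PDFBox returns the coordinates (I am not very familiar with the PDFBox API).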
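
As a rough sketch of that adjustment for rotation = 0, with made-up box sizes: the Box class below is a hypothetical stand-in for a page box (it is not a PDFBox type), and the numbers are illustrative only. It assumes, as in the question's code, that the incoming y is measured from the top of the page and is flipped against the MediaBox height.

```java
// Hypothetical sketch of the CropBox/MediaBox offset for unrotated pages.
final class CropBoxOffsetSketch {

    /** Hypothetical stand-in for a page box: lower-left and upper-right corners. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /** Shifts x by CropBox.LLX - MediaBox.LLX. */
    static double adjustX(double x, Box media, Box crop) {
        return x + (crop.llx - media.llx);
    }

    /** Flips y against the MediaBox height, then subtracts MediaBox.URY - CropBox.URY. */
    static double adjustY(double yFromTop, Box media, Box crop) {
        double mediaHeight = media.ury - media.lly;
        return mediaHeight - yFromTop - (media.ury - crop.ury);
    }

    public static void main(String[] args) {
        Box media = new Box(0, 0, 612, 792);  // made-up MediaBox
        Box crop = new Box(10, 20, 600, 780); // made-up, smaller CropBox

        System.out.println("x: " + adjustX(35, media, crop));
        System.out.println("y: " + adjustY(99, media, crop));
    }
}
```

With these example boxes the highlight moves 10 units right and 12 units down compared with the unadjusted values, which is the kind of shift the answer describes.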

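Edit by OP

Posting the changes I made to my code here so it may help others. Note that I haven't tried anything for rotation = 90 yet. Once I have that, I'll update it here.

Before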

...
if (rotation == 90) {

    double width = wordCoordinates[6];
    width = (pageHeight * width) / pageWidth;

    //defining coordinates of a rectangle
    maxX = firstPositionY;
    minX = firstPositionY - height;
    minY = firstPositionX;
    maxY = firstPositionX + width;
} else {
    minX = firstPositionX;
    maxX = lastPositionEndX;
    minY = pageHeight - firstPositionY;
    maxY = pageHeight - lastPositionY + height;
}
...

After

...

PDRectangle mediaBox = page.getMediaBox();
PDRectangle cropBox = page.getCropBox();

if (rotation == 90) {

    double width = wordCoordinates[6];
    width = (pageHeight * width) / pageWidth;

    //defining coordinates of a rectangle
    maxX = firstPositionY;
    minX = firstPositionY - height;
    minY = firstPositionX;
    maxY = firstPositionX + width;
} else {
    minX = firstPositionX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
    maxX = lastPositionEndX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
    minY = pageHeight - firstPositionY - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
    maxY = pageHeight - lastPositionY + height - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
}
...
