Question:

How do I reconcile these text positions and line positions with PDFBox?

厍晋鹏

2023-03-14

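The y coordinates I get for the lines of a table seem to be out of step with the coordinates of the text. Some transformation appears to be going on, but I can't find it. If possible, I'd like to solve this within the PDFGraphicsStreamEngine extension shown below, without going back to the drawing board and using the other input streams available in PDFBox.

I extended PDFTextStripper to get the position of every text glyph on the page: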

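// Package names as in PDFBox 2.x
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class MyPDFTextStripper extends PDFTextStripper {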

    private List<TextPosition> tps;

    public MyPDFTextStripper() throws IOException {
        tps = new ArrayList<>();
    }

    @Override
    protected void writeString
            (String text,
             List<TextPosition> textPositions)
            throws IOException {
        tps.addAll(textPositions);
    }

    List<TextPosition> getTps() {
        return tps;
    }
}

I also extended PDFGraphicsStreamEngine to extract every line on the page as a Line2D:

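import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.pdmodel.PDPage;

public class LineCatcher extends PDFGraphicsStreamEngine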
{
    private final GeneralPath linePath = new GeneralPath();
    private List<Line2D> lines;

    LineCatcher(PDPage page)
    {
        super(page);
        lines = new ArrayList<>();
    }

    List<Line2D> getLines() {
        return lines;
    }

    @Override
    public void strokePath() throws IOException
    {
        Rectangle2D rect = linePath.getBounds2D();
        Line2D line = new Line2D.Double(rect.getX(), rect.getY(),
                rect.getX() + rect.getWidth(),
                rect.getY() + rect.getHeight());
        lines.add(line);
        linePath.reset();
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {linePath.moveTo(x, y);}
    @Override
    public void lineTo(float x, float y) throws IOException
    {linePath.lineTo(x, y);}
    @Override
    public Point2D getCurrentPoint() throws IOException
    {return linePath.getCurrentPoint();}

    //all other overridden methods can be left empty for the purposes of this problem.
}

I wrote a simple program to demonstrate the problem:

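import java.awt.geom.Line2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.TextPosition;

public class PageAnalysis {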
    public static void main(String[] args) {
        try (PDDocument doc = PDDocument.load(new File("onePage.pdf"))) {
            PDPage page = doc.getPage(0);

            MyPDFTextStripper ts = new MyPDFTextStripper();
            ts.getText(doc);
            List<TextPosition> tps = ts.getTps();

            System.out.println("Y coordinates in text:");
            Set<Integer> ySet = new HashSet<>();
            for (TextPosition tp: tps) {
                ySet.add((int)tp.getY());
            }
            List<Integer> yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();


            System.out.println("Y coordinates in lines:");
            LineCatcher lineCatcher = new LineCatcher(page);
            lineCatcher.processPage(page);
            List<Line2D> lines = lineCatcher.getLines();
            ySet = new HashSet<>();
            for (Line2D line: lines) {
                ySet.add((int)line.getY2());
            }
            yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output of this program is:

Y coordinates in text:
66  79  106 118 141 153 171 189 207 225 243 261 279 297 315 333 351 370 388 406 424 442 460 478 496 514 780 
Y coordinates in lines:
322 340 358 376 394 412 430 448 466 484 502 520 538 556 574 593 611 629 647 665 683 713

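The last number in the text list corresponds to the y coordinate of the page number at the bottom of the page. I can't work out what is happening to the y coordinates of the lines, although they appear to have been transformed (the media box here is the same as for the text, and it is consistent with the text positions). The current transformation matrix also has 1.0 for yScaling.

1 answer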

商嘉木

2023-03-14

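Indeed, PDFTextStripper has the bad habit of transforming coordinates into a very un-PDF'ish coordinate system: one with its origin in the upper left corner of the page and y coordinates that increase downwards. The lines collected by LineCatcher, by contrast, keep their y coordinates increasing upwards, so the two lists cannot be compared directly.

Thus, for a TextPosition tp you should not use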

tp.getY()

but instead

tp.getTextMatrix().getTranslateY()

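Unfortunately, although these coordinates are closer to the actual PDF default user space coordinate system, they may still be translated: the origin is shifted so that it sits in the lower left corner of the crop box.

Thus, you really need something like this: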

tp.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()

where cropBox is retrieved as

PDRectangle cropBox = doc.getPage(n).getCropBox();

where n is the zero-based index of the page containing the content.
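
The arithmetic above can be sketched in plain Java. This is only an illustration, not PDFBox code: the class CoordinateSketch and its method names are made up here, and the float arguments stand in for the values you would read from tp and cropBox. The second method rests on an additional assumption that is not part of the answer, namely that for an unrotated page tp.getY() is measured downwards from the top edge of the crop box. The page dimensions in main are hypothetical as well, since the question does not state the page size.

```java
// A minimal sketch of the coordinate arithmetic only; it does not call PDFBox.
// The float arguments stand in for values you would read from PDFBox:
//   textMatrixY                <- tp.getTextMatrix().getTranslateY()
//   stripperY                  <- tp.getY()
//   cropLowerLeftY, cropHeight <- cropBox.getLowerLeftY(), cropBox.getHeight()
class CoordinateSketch {

    // The answer's formula: undo the shift to the crop box's lower left corner.
    static float userSpaceY(float textMatrixY, float cropLowerLeftY) {
        return textMatrixY + cropLowerLeftY;
    }

    // ASSUMPTION (not from the answer): for an unrotated page, tp.getY() is
    // measured downwards from the top edge of the crop box, so flipping it
    // against the crop box height should give the same user space value.
    static float userSpaceYFromStripperY(float stripperY,
                                         float cropLowerLeftY,
                                         float cropHeight) {
        return cropLowerLeftY + cropHeight - stripperY;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL page: crop box starting at y = 0 with height 842 (A4).
        float cropLowerLeftY = 0f;
        float cropHeight = 842f;

        // 171 is one of the text y values printed in the question.
        float flipped = userSpaceYFromStripperY(171f, cropLowerLeftY, cropHeight);
        System.out.println("text y in user space: " + flipped);

        // A crop box that does not start at 0 shifts the result accordingly.
        System.out.println("shifted crop box: " + userSpaceY(650f, 20f));
    }
}
```

Under these assumed page dimensions, the text y value 171 maps to 671, a few units above the line y value 665 in the question's output, which would be consistent with a text baseline sitting just above a table rule. Check the real crop box of your page before relying on such numbers.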
