Question:

Text coordinates when stripping text with PDFBox

邢飞鸿

2023-03-14

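I am trying to extract text together with its coordinates from a PDF file using PDFBox.

I combined several methods and pieces of information found on the internet (Stack Overflow included), but the coordinates I get do not seem to be right. For example, when I use them to draw a rectangle on top of the text, the rectangle is painted somewhere else.

This is my code (please don't judge the style, it was written quickly just to test):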

TextLine.java

    import java.util.List;
    import org.apache.pdfbox.text.TextPosition;

    /**
     *
     * @author samue
     */
    public class TextLine {
        public List<TextPosition> textPositions = null;
        public String text = "";
    }

myStripper.java

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */

    /**
     *
     * @author samue
     */
    public class myStripper extends PDFTextStripper {
        public myStripper() throws IOException
        {
        }

        @Override
        protected void startPage(PDPage page) throws IOException
        {
            startOfLine = true;
            super.startPage(page);
        }

        @Override
        protected void writeLineSeparator() throws IOException
        {
            startOfLine = true;
            super.writeLineSeparator();
        }

        @Override
        public String getText(PDDocument doc) throws IOException
        {
            lines = new ArrayList<TextLine>();
            return super.getText(doc);
        }

        @Override
        protected void writeWordSeparator() throws IOException
        {
            TextLine tmpline = null;

            tmpline = lines.get(lines.size() - 1);
            tmpline.text += getWordSeparator();

            super.writeWordSeparator();
        }


        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextLine tmpline = null;

            if (startOfLine) {
                tmpline = new TextLine();
                tmpline.text = text;
                tmpline.textPositions = textPositions;
                lines.add(tmpline);
            } else {
                tmpline = lines.get(lines.size() - 1);
                tmpline.text += text;
                tmpline.textPositions.addAll(textPositions);
            }

            if (startOfLine)
            {
                startOfLine = false;
            }
            super.writeString(text, textPositions);
        }

        boolean startOfLine = true;
        public ArrayList<TextLine> lines = null;

    }

Click event handler on the AWT button

    private void jButton1MouseClicked(java.awt.event.MouseEvent evt) {
        try {
            File file = new File("C:\\Users\\samue\\Desktop\\mwb_I_201711.pdf");
            PDDocument doc = PDDocument.load(file);

            myStripper stripper = new myStripper();

            stripper.setStartPage(1); // fix it to first page just to test it
            stripper.setEndPage(1);
            stripper.getText(doc);

            TextLine line = stripper.lines.get(1); // the line i want to paint on

            float minx = -1;
            float maxx = -1;

            // Find the leftmost and rightmost glyph origins on the line.
            for (TextPosition pos : line.textPositions) {
                if (pos == null)
                    continue;

                if (minx == -1 || pos.getTextMatrix().getTranslateX() < minx) {
                    minx = pos.getTextMatrix().getTranslateX();
                }
                if (maxx == -1 || pos.getTextMatrix().getTranslateX() > maxx) {
                    maxx = pos.getTextMatrix().getTranslateX();
                }
            }

            TextPosition firstPosition = line.textPositions.get(0);
            TextPosition lastPosition = line.textPositions.get(line.textPositions.size() - 1);

            float x = minx;
            float y = firstPosition.getTextMatrix().getTranslateY();
            float w = (maxx - minx) + lastPosition.getWidth();
            float h = lastPosition.getHeightDir();

            PDPageContentStream contentStream = new PDPageContentStream(doc, doc.getPage(0), PDPageContentStream.AppendMode.APPEND, false);

            contentStream.setNonStrokingColor(Color.RED);
            contentStream.addRect(x, y, w, h);
            contentStream.fill();
            contentStream.close();

            File fileout = new File("C:\\Users\\samue\\Desktop\\pdfbox.pdf");
            doc.save(fileout);
            doc.close();
        } catch (Exception ex) {
            // Report failures instead of silently swallowing them.
            ex.printStackTrace();
        }
    }

Any suggestions? What am I doing wrong?

2 answers

宣瀚

2023-03-14

The following code works for me:

    // Definition of font baseline, ascent, descent: https://en.wikipedia.org/wiki/Ascender_(typography)
    //
    // The origin of the text coordinate system is the top-left corner where Y increases downward.
    // TextPosition.getX(), getY() return the baseline.
    TextPosition firstLetter = textPositions.get(0);
    TextPosition lastLetter = textPositions.get(textPositions.size() - 1);

    // Looking at LegacyPDFStreamEngine.showGlyph(), ascender and descender heights are calculated like
    // CapHeight: https://stackoverflow.com/a/42021225/14731
    float ascent = firstLetter.getFont().getFontDescriptor().getAscent() / 1000 * firstLetter.getFontSize();
    Point topLeft = new Point(firstLetter.getX(), firstLetter.getY() - ascent);

    float descent = lastLetter.getFont().getFontDescriptor().getDescent() / 1000 * lastLetter.getFontSize();
    // Descent is negative, so we need to negate it to move downward.
    Point bottomRight = new Point(lastLetter.getX() + lastLetter.getWidth(),
        lastLetter.getY() - descent);


In other words, we are creating a bounding box that spans from the font's ascent down to its descent.
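As a rough illustration of that arithmetic, here is a minimal sketch without any PDFBox dependency. The class `GlyphBox`, its method names, and the sample metrics in `main` are assumptions made up for this example; the only thing taken from the code above is the rule that font descriptor values are in 1/1000 units and that Y grows downward.

```java
/** Sketch: vertical extent of a text line from its baseline, ascent and descent. */
final class GlyphBox {
    /** Converts a font descriptor value (1/1000 units) to text units. */
    static float scale(float fontUnits, float fontSize) {
        return fontUnits / 1000f * fontSize;
    }

    /** Top edge when Y grows downward: baseline minus the scaled ascent. */
    static float top(float baselineY, float ascentUnits, float fontSize) {
        return baselineY - scale(ascentUnits, fontSize);
    }

    /** Bottom edge: descent is negative, so subtracting it moves downward. */
    static float bottom(float baselineY, float descentUnits, float fontSize) {
        return baselineY - scale(descentUnits, fontSize);
    }

    public static void main(String[] args) {
        // Hypothetical metrics: baseline y = 100, ascent 750, descent -250, 12 pt font.
        System.out.println(top(100f, 750f, 12f));     // 91.0
        System.out.println(bottom(100f, -250f, 12f)); // 103.0
    }
}
```

With these made-up numbers the box spans 12 units vertically, from 9 above the baseline to 3 below it.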

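If you want to render these coordinates with the origin in the bottom-left corner, see https://stackoverflow.com/a/28114320/14731 for more details. You need to apply a transformation like this:

    contents.transform(new Matrix(1, 0, 0, -1, 0, page.getHeight()));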
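To see what that matrix does to a single point, here is a small sketch using `java.awt.geom.AffineTransform` from the JDK as a stand-in for the PDFBox `Matrix`; both take the six values in the same order. The class `FlipY`, the method name, and the page height of 792 are assumptions for illustration only.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Sketch: converting a top-left-origin point to bottom-left-origin coordinates. */
final class FlipY {
    static Point2D toBottomLeftOrigin(double x, double y, double pageHeight) {
        // Same six values as Matrix(1, 0, 0, -1, 0, pageHeight):
        // x' = x and y' = pageHeight - y.
        AffineTransform flip = new AffineTransform(1, 0, 0, -1, 0, pageHeight);
        return flip.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Hypothetical page height of 792 units (US Letter).
        Point2D p = toBottomLeftOrigin(100, 50, 792);
        System.out.println(p.getX() + ", " + p.getY()); // 100.0, 742.0
    }
}
```

A point 50 units below the top edge ends up 742 units above the bottom edge, which is the mirror image the transformation is meant to produce.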

潘向明

2023-03-14

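This is just another case of excessive PDFTextStripper coordinate normalization. Just like you, I thought that by using TextPosition.getTextMatrix() (instead of getX() and getY()) one would get the actual coordinates. But no, even these matrix values have to be corrected (at least in PDFBox 2.0.x, I haven't checked 1.8.x), because the matrix is multiplied by a translation that makes the lower-left corner of the crop box the origin.

Thus, in your case (where the lower-left corner of the crop box is not the origin), you have to correct the values, e.g. by replacing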

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();

with

        PDRectangle cropBox = doc.getPage(0).getCropBox();

        float x = minx + cropBox.getLowerLeftX();
        float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();

With this change the rectangle is drawn over the intended line instead of somewhere else on the page.
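For clarity, the correction boils down to adding the crop box offset back onto each coordinate. A minimal sketch of that arithmetic without PDFBox follows; the class `CropBoxOffset`, the method `toUserSpace`, and the sample numbers are assumptions for illustration and not part of the PDFBox API.

```java
/** Sketch: undoing the crop-box translation applied to the text matrix values. */
final class CropBoxOffset {
    /**
     * Shifts a coordinate that is relative to the crop box's lower-left corner
     * back into the page's default user space.
     */
    static float[] toUserSpace(float x, float y, float cropLowerLeftX, float cropLowerLeftY) {
        return new float[] { x + cropLowerLeftX, y + cropLowerLeftY };
    }

    public static void main(String[] args) {
        // Hypothetical values: text matrix reports (72, 700), crop box starts at (36, 18).
        float[] corrected = toUserSpace(72f, 700f, 36f, 18f);
        System.out.println(corrected[0] + ", " + corrected[1]); // 108.0, 718.0
    }
}
```

If the crop box starts at (0, 0) the offsets are zero and nothing changes, which would explain why the problem only shows up with some documents.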

Obviously, though, you will also have to adjust the height somewhat. This is due to the way PDFTextStripper determines the text height:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;

(from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PDFTextStripper)

While the font bounding box usually is indeed too large, half of it is often not enough.
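To get a feel for the difference, the sketch below compares the "half the bounding box" height with a height derived from ascent and descent, as the other answer does. It ignores any further scaling by the text matrix. The class `TextHeight`, its method names, and the font metrics in `main` are assumptions chosen for illustration; real fonts will give different numbers.

```java
/** Sketch: two ways to estimate the height of a line of text. */
final class TextHeight {
    /** Half of the font bounding box height, scaled from 1/1000 units to the font size. */
    static float halfBBoxHeight(float bboxHeightUnits, float fontSize) {
        return bboxHeightUnits / 2f / 1000f * fontSize;
    }

    /** Full extent from ascent to descent (descent is negative). */
    static float ascentToDescentHeight(float ascentUnits, float descentUnits, float fontSize) {
        return (ascentUnits - descentUnits) / 1000f * fontSize;
    }

    public static void main(String[] args) {
        // Hypothetical metrics: bbox height 1100, ascent 750, descent -250, 12 pt font.
        System.out.println("half bbox:         " + halfBBoxHeight(1100f, 12f));
        System.out.println("ascent to descent: " + ascentToDescentHeight(750f, -250f, 12f));
    }
}
```

For these made-up metrics the half-bbox estimate is roughly 6.6 units while the ascent-to-descent extent is 12, so a rectangle based on the former would cover only about half of the line.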
