Question:

Determining the coordinates of words in a document using PDFBox

班宏毅

2023-03-14

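I am using PDFBox to extract the coordinates of words/strings in a PDF document, and so far I have been able to determine the positions of individual characters.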

package printtextlocations;

import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

import java.io.IOException;
import java.util.List;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override /* this is questionable, not sure if needed... */
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}

This produces a series of lines containing the position of each character, including spaces, like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

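where "P" is the character. I have not found a function in PDFBox for finding words, and I am not familiar enough with Java to accurately join these characters back into words for searching, even though spaces are included. Has anyone else run into something similar, and if so, how did you handle it? I really only need the coordinates of the first character of a word, so that part can be simplified, but I do not know how to match a string against this kind of output.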

3 answers

毋举

2023-03-14

Take a look at this; I think it is what you need.

https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-java/

Here is the code:

import java.io.File;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PrintTextLocations extends PDFTextStripper {

public static StringBuilder tWord = new StringBuilder();
public static String seek;
public static String[] seekA;
public static List wordList = new ArrayList();
public static boolean is1stChar = true;
public static boolean lineMatch;
public static int pageNo = 1;
public static double lastYVal;

public PrintTextLocations()
        throws IOException {
    super.setSortByPosition(true);
}

public static void main(String[] args)
        throws Exception {
    PDDocument document = null;
    seekA = args[1].split(",");
    seek = args[1];
    try {
        File input = new File(args[0]);
        document = PDDocument.load(input);
        if (document.isEncrypted()) {
            try {
                document.decrypt("");
            } catch (InvalidPasswordException e) {
                System.err.println("Error: Document is encrypted with a password.");
                System.exit(1);
            }
        }
        PrintTextLocations printer = new PrintTextLocations();
        List allPages = document.getDocumentCatalog().getAllPages();

        for (int i = 0; i < allPages.size(); i++) {
            PDPage page = (PDPage) allPages.get(i);
            PDStream contents = page.getContents();

            if (contents != null) {
                printer.processStream(page, page.findResources(), page.getContents().getStream());
            }
            pageNo += 1;
        }
    } finally {
        if (document != null) {
            System.out.println(wordList);
            document.close();
        }
    }
}

@Override
protected void processTextPosition(TextPosition text) {
    String tChar = text.getCharacter();
    System.out.println("String[" + text.getXDirAdj() + ","
            + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
            + text.getXScale() + " height=" + text.getHeightDir() + " space="
            + text.getWidthOfSpace() + " width="
            + text.getWidthDirAdj() + "]" + text.getCharacter());
    String REGEX = "[,.\\[\\](:;!?)/]";
    char c = tChar.charAt(0);
    lineMatch = matchCharLine(text);
    if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
        if ((!is1stChar) && (lineMatch == true)) {
            appendChar(tChar);
        } else if (is1stChar == true) {
            setWordCoord(text, tChar);
        }
    } else {
        endWord();
    }
}

protected void appendChar(String tChar) {
    tWord.append(tChar);
    is1stChar = false;
}

protected void setWordCoord(TextPosition text, String tChar) {
    tWord.append("(").append(pageNo).append(")[").append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ").append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ").append(tChar);
    is1stChar = false;
}

protected void endWord() {
    String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
    String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
    if (!"".equals(sWord)) {
        if (Arrays.asList(seekA).contains(sWord)) {
            wordList.add(newWord);
        } else if ("SHOWMETHEMONEY".equals(seek)) {
            wordList.add(newWord);
        }
    }
    tWord.delete(0, tWord.length());
    is1stChar = true;
}

protected boolean matchCharLine(TextPosition text) {
    Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
    if (yVal.doubleValue() == lastYVal) {
        return true;
    }
    lastYVal = yVal.doubleValue();
    endWord();
    return false;
}

protected Double roundVal(Float yVal) {
    // Round to one decimal place arithmetically. The original round trip through
    // DecimalFormat and new Double(String) fails to parse in locales that use a
    // decimal comma.
    return Math.round(yVal * 10.0) / 10.0;
}
}

Dependencies:

PDFBox, FontBox, Apache Commons Logging.

You can run it from the command line by typing:

javac PrintTextLocations.java 
sudo java PrintTextLocations file.pdf WORD1,WORD2,....

The output looks like:

[(1)[190.3 : 286.8] WORD1, (1)[283.3 : 286.8] WORD2, ...]

蒋星雨

2023-03-14

Based on the original idea, here is a version of the text search for PDFBox 2. The code itself is rough but simple. It should get you started quickly.

import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Set;
import lu.abac.pdfclient.data.PDFTextLocation;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PrintTextLocator extends PDFTextStripper {

    private final Set<PDFTextLocation> locations;

    public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
        super.setSortByPosition(true);
        this.document = document;
        this.locations = locations;
        this.output = new Writer() {
            @Override
            public void write(char[] cbuf, int off, int len) throws IOException {
            }
            @Override
            public void flush() throws IOException {
            }

            @Override
            public void close() throws IOException {
            }
        };
    }

    public Set<PDFTextLocation> doSearch() throws IOException {

        processPages(document.getDocumentCatalog().getPages());
        return locations;
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        super.writeString(text);

        String searchText = text.toLowerCase();
        for (PDFTextLocation textLoc:locations) {
            int start = searchText.indexOf(textLoc.getText().toLowerCase());
            if (start!=-1) {
                // found
                TextPosition pos = textPositions.get(start);
                textLoc.setFound(true);
                textLoc.setPage(getCurrentPageNo());
                textLoc.setX(pos.getXDirAdj());
                textLoc.setY(pos.getYDirAdj());
            }
        }

    }


}
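
The answer imports lu.abac.pdfclient.data.PDFTextLocation but does not show it. As a rough guide, here is a minimal sketch of what such a class might look like. It is a hypothetical stand-in, not the author's class: it models only the members the code above calls (getText, setFound, setPage, setX, setY), and the equals/hashCode choice is an assumption made so that a Set holds one entry per search term. To compile against the answer's code, it would need to be public and placed in the imported package, or the import adjusted.

```java
import java.util.Objects;

// Hypothetical stand-in for the PDFTextLocation class that the answer imports
// but does not show. Only the members used by PrintTextLocator are modeled.
class PDFTextLocation {
    private final String text; // the string to search for
    private boolean found;     // set once the text has been located
    private int page;          // page number reported by getCurrentPageNo()
    private float x;           // XDirAdj of the first matching character
    private float y;           // YDirAdj of the first matching character

    PDFTextLocation(String text) {
        this.text = Objects.requireNonNull(text, "text");
    }

    String getText() { return text; }

    boolean isFound() { return found; }
    void setFound(boolean found) { this.found = found; }

    int getPage() { return page; }
    void setPage(int page) { this.page = page; }

    float getX() { return x; }
    void setX(float x) { this.x = x; }

    float getY() { return y; }
    void setY(float y) { this.y = y; }

    // Assumption: two locations are equal when they search for the same text,
    // so a Set<PDFTextLocation> keeps one entry per search term.
    @Override
    public boolean equals(Object other) {
        return other instanceof PDFTextLocation
                && text.equals(((PDFTextLocation) other).text);
    }

    @Override
    public int hashCode() { return text.hashCode(); }

    @Override
    public String toString() {
        return text + (found ? " (" + page + ")[" + x + " : " + y + "]" : " (not found)");
    }
}
```

With a class like this, the search would be driven by filling a Set with one PDFTextLocation per term, passing it to the PrintTextLocator constructor together with the loaded PDDocument, and calling doSearch(). Note that the answer's writeString overwrites the stored position on every match, so each term ends up with the last location found.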

邵阳辉

2023-03-14

PDFBox has no feature that extracts words automatically. I am currently extracting data and collecting it into blocks; here is my process:

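I loop over the list of glyphs and analyze each glyph's coordinates. If two consecutive glyphs overlap vertically (that is, the top of the current glyph lies between the top and bottom of the previous glyph, or the bottom of the current glyph lies between the top and bottom of the previous glyph), I add the current glyph to the same line.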

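At this point I have extracted the document's separate lines (be careful: if your document has multiple columns, "line" here means all glyphs that overlap vertically, i.e. the text of every column at the same vertical coordinate).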
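
The line-grouping step described above might be sketched as follows. This is only an illustration under stated assumptions: Glyph is a hypothetical box type (not a PDFBox class) whose top and bottom would have to be filled in from the TextPosition values, and y is assumed to grow downward so that top <= bottom.

```java
import java.util.ArrayList;
import java.util.List;

class LineGrouper {

    /** Hypothetical glyph box; y is assumed to grow downward, so top <= bottom. */
    static final class Glyph {
        final String ch;
        final float top;
        final float bottom;

        Glyph(String ch, float top, float bottom) {
            this.ch = ch;
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** The overlap rule from the answer: current top or bottom lies within the previous glyph. */
    static boolean sameLine(Glyph prev, Glyph cur) {
        boolean topInside = cur.top >= prev.top && cur.top <= prev.bottom;
        boolean bottomInside = cur.bottom >= prev.top && cur.bottom <= prev.bottom;
        return topInside || bottomInside;
    }

    /** Splits glyphs (in reading order) into lines wherever the overlap rule fails. */
    static List<List<Glyph>> groupIntoLines(List<Glyph> glyphs) {
        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && !sameLine(prev, g)) {
                lines.add(current);          // close the finished line
                current = new ArrayList<>(); // start a new one
            }
            current.add(g);
            prev = g;
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        return lines;
    }
}
```

Taken literally, the rule treats a glyph that completely encloses the previous one (taller on both sides) as a new line, since neither of its edges falls inside the previous box; a symmetric interval-overlap test would avoid that if it matters for your documents.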

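You can then compare the left coordinate of the current glyph with the right coordinate of the previous glyph to decide whether they belong to the same word (the PDFTextStripper class provides a getSpacingTolerance() method that gives you, based on trial and error, the value of a "normal" space). If the difference between the left and right coordinates is smaller than that value, the two glyphs belong to the same word.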

I applied this approach in my work and it works well.
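
The word-splitting step might look like the sketch below, which also records the coordinates of each word's first character, as the question asks. Glyph and Word are hypothetical types, not PDFBox classes, and maxGap is an assumed threshold supplied by the caller. With PDFBox, one candidate would be a value derived from TextPosition.getWidthOfSpace() and the stripper's spacing tolerance, but that mapping is an assumption to check against your PDFBox version.

```java
import java.util.ArrayList;
import java.util.List;

class WordGrouper {

    /** Hypothetical glyph: one character with its left/right edges and baseline y. */
    static final class Glyph {
        final String ch;
        final float left;
        final float right;
        final float y;

        Glyph(String ch, float left, float right, float y) {
            this.ch = ch;
            this.left = left;
            this.right = right;
            this.y = y;
        }
    }

    /** A word together with the coordinates of its first character. */
    static final class Word {
        final String text;
        final float x;
        final float y;

        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Splits one line of glyphs into words at whitespace or at gaps of at least maxGap. */
    static List<Word> groupIntoWords(List<Glyph> line, float maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        Glyph first = null; // first glyph of the word being built
        Glyph prev = null;  // previous glyph on the line
        for (Glyph g : line) {
            boolean isSpace = g.ch.trim().isEmpty();
            boolean gapTooWide = prev != null && (g.left - prev.right) >= maxGap;
            if ((isSpace || gapTooWide) && first != null) {
                words.add(new Word(text.toString(), first.left, first.y));
                text.setLength(0);
                first = null;
            }
            if (!isSpace) {
                if (first == null) {
                    first = g; // remember where the word starts
                }
                text.append(g.ch);
            }
            prev = g;
        }
        if (first != null) {
            words.add(new Word(text.toString(), first.left, first.y));
        }
        return words;
    }
}
```

Searching then reduces to scanning the returned list for a Word whose text equals the search term and reading its x and y.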
