Unable to extract values at specific coordinates from a PDF using Java Apache PDFBox



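I am using the Apache PDFBox client for data extraction.

To get the x, y, height, and width coordinates from the PDF, I am using the PDF-XChange tool, with units in millimetres. When I pass these values in a rectangle, the extracted value comes back empty.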

// Imports for PDFBox 2.x
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public String getTextUsingPositionsUsingPdf(String pdfLocation, int pageNumber, double x, double y, double width,
        double height) throws IOException {
    String extractedText = "";
    PDDocument document = null;
    try {
        if (pdfLocation.endsWith(".pdf")) {
            // PDDocument.load opens an existing PDF document.
            document = PDDocument.load(new File(pdfLocation));
            int documentPageCount = document.getNumberOfPages();

            // getPage takes a 0-based index (index 0 is the first page), so
            // pageNumber + 1 selects the page after the one at index pageNumber.
            int pageIndex = pageNumber + 1;
            if (pageIndex < 0 || pageIndex >= documentPageCount) {
                System.out.println("No data return");
                return extractedText;
            }
            PDPage page = document.getPage(pageIndex);

            // Create a rectangle from the x axis, y axis, width and height.
            Rectangle2D rect = new Rectangle2D.Double(x, y, width, height);
            String regionName = "region1";

            // Register the named region, then extract the text inside it.
            // getTextForRegion has nothing to return until extractRegions has run.
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.addRegion(regionName, rect);
            stripper.extractRegions(page);
            extractedText = stripper.getTextForRegion(regionName);
            System.out.println("Region is " + extractedText);
        } else {
            System.out.println("No data return");
        }
    } catch (IOException e) {
        System.out.println("The file could not be read: " + e.getMessage());
    } finally {
        if (document != null) {
            document.close();
        }
    }
    // Return the extracted text so that it can be used for assertions.
    return extractedText;
}




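I have used this PDF: tutorialspoint.com/uipath/uipath_tutorial.pdf. I am trying to find the text "a part of contest" at x = 55.6, y = 168.8, width = 210.0, height = 297.0 (all in mm), but I am getting an empty value.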


System.out.println("Extracting like Venkatachalam Neelakantan from uipath_tutorial.pdf\n");
float MM_TO_UNITS = 1/(10*2.54f)*72;
String text = getTextUsingPositionsUsingPdf("src/test/resources/mkl/testarea/pdfbox2/extract/uipath_tutorial.pdf",
        0, 55.6 * MM_TO_UNITS, 168.8 * MM_TO_UNITS, 210.0 * MM_TO_UNITS, 297.0 * MM_TO_UNITS);
System.out.printf("\n---\nResult:\n%s\n", text);



 part of contents of this e-book in any manner without written consent 

te the contents of our website and tutorials as timely and as precisely as 
, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. 
guarantee regarding the accuracy, timeliness or completeness of our 
tents including this tutorial. If you discover any errors on our website or 
ease notify us at contact@tutorialspoint.com 


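Assuming you were actually looking for "a part of contents" rather than "a part of contest", only the "a" is missing. When measuring, you probably looked for the start of the visible letter drawing, but the actual glyph origin is slightly before that. If you choose a slightly smaller x, e.g. 54.6 mm, you will also get the "a".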
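To illustrate the "slightly smaller x" idea, here is a minimal sketch that moves the left edge of a region outwards by a small margin while keeping its right edge fixed. The class `RegionPadding`, the method `extendLeft`, and the 1 mm margin are hypothetical names and values chosen for illustration; they are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: widens a region to the left so that
// glyphs whose origin lies slightly before the measured x are still included.
final class RegionPadding {
    // Returns a copy of the region whose left edge is moved left by `margin`
    // while the right edge, top and height stay the same.
    static Rectangle2D extendLeft(Rectangle2D region, double margin) {
        return new Rectangle2D.Double(
                region.getX() - margin,
                region.getY(),
                region.getWidth() + margin,
                region.getHeight());
    }

    public static void main(String[] args) {
        double mmToUnits = 72.0 / 25.4; // assumes default user space, 72 units per inch
        Rectangle2D measured = new Rectangle2D.Double(
                55.6 * mmToUnits, 168.8 * mmToUnits, 210.0 * mmToUnits, 297.0 * mmToUnits);
        // A margin of 1 mm is an assumption; it matches the 55.6 -> 54.6 example.
        Rectangle2D padded = extendLeft(measured, 1.0 * mmToUnits);
        System.out.println("measured x = " + measured.getX());
        System.out.println("padded x   = " + padded.getX());
    }
}
```

The padded rectangle can then be passed to `addRegion` in place of the measured one. How large the margin should be depends on the font and on how the coordinates were measured.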


If you wonder about the MM_TO_UNITS constant, have a look at this answer.
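As a quick worked example of that conversion: one inch is 25.4 mm, and the default PDF user space has 72 units per inch, so one millimetre is 72 / 25.4 units (about 2.83). The sketch below assumes the page uses that default user unit; the class and method names are made up for illustration.

```java
// Hypothetical helper for illustration: converts millimetres to PDF user-space units.
final class MmToPdfUnits {
    // 1 inch = 25.4 mm; default PDF user space has 72 units per inch.
    static final double UNITS_PER_MM = 72.0 / 25.4;

    static double mmToUnits(double mm) {
        return mm * UNITS_PER_MM;
    }

    public static void main(String[] args) {
        // The measurements from the question, converted one by one.
        System.out.printf("x      = %.2f units%n", mmToUnits(55.6));
        System.out.printf("y      = %.2f units%n", mmToUnits(168.8));
        System.out.printf("width  = %.2f units%n", mmToUnits(210.0));
        System.out.printf("height = %.2f units%n", mmToUnits(297.0));
    }
}
```

This is the same factor as `1/(10*2.54f)*72` in the test code above, written in double precision.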

