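I'm using PDFBox to extract text from documents by extending PDFTextStripper. I've noticed that some of these documents contain invisible characters that are being extracted. I'd like to filter those invisible characters out.
I see there are already some Stack Overflow posts about this, e.g.:
I tried subclassing the PDFVisibleTextStripper class found here:
However, I found that this filtered out text that is actually visible. I used it as a drop-in replacement for PDFTextStripper.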
package com.example.foo;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

public class ExtractChars extends PDFVisibleTextStripper {
    Processor processor;

    public static void extract(PDDocument document, Processor processor) throws IOException {
        ExtractChars instance = new ExtractChars();
        instance.processor = processor;
        instance.setSortByPosition(true);
        // PDFTextStripper page numbers are 1-based.
        instance.setStartPage(1);
        instance.setEndPage(document.getNumberOfPages());
        // The written text is discarded; characters are reported via writeString.
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        Writer streamWriter = new OutputStreamWriter(stream);
        instance.writeText(document, streamWriter);
    }

    ExtractChars() throws IOException {}

    @Override
    protected void writeString(String _string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            float height = text.getHeightDir();
            String character = text.getUnicode();
            int pageIndex = getCurrentPageNo() - 1;
            float left = text.getXDirAdj();
            float right = left + text.getWidthDirAdj();
            float bottom = text.getYDirAdj();
            float top = bottom - height;
            BoundingBox box = new BoundingBox(pageIndex, left, right, top, bottom);
            this.processor.process(character, box);
        }
    }

    public interface Processor {
        void process(String character, BoundingBox box);
    }
}
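I don't know whether something in my subclass needs to change to make this work. If it helps, I can provide a PDF that demonstrates the behaviour, although it contains sensitive content, so I'd need to remove that first.
Instead, I've created a minimal example (below) that demonstrates the "invisible text" behaviour I'm seeing. It contains the list item "24." followed by an "a." that is not displayed, yet can be highlighted in PDF viewers such as macOS Preview and copy-pasted.
This "a." is currently being extracted by PDFTextStripper, which I don't want. I don't really understand why this happens. My guess is that it has something to do with clipping, but I'd appreciate it if someone could explain what is going on.
My end goal is to filter these characters out, so suggestions for the simplest way to handle this particular case would be appreciated. I don't think I need all of the general-purpose methods in PDFVisibleTextStripper.
Thanks very much!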
%PDF-1.3
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 612 792]
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 4 0 R
/Contents 6 0 R
/MediaBox [0 0 612 792]
>>
endobj
4 0 obj
<<
/Font <<
/TT2 5 0 R
>>
>>
endobj
5 0 obj
<<
/BaseFont
/OXRDVC+Helvetica
/Subtype /TrueType
/Type /Font
>>
endobj
6 0 obj
<<
>>
stream
q 0 54 612 648 re W n /Cs1 cs 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm Q
q 48 93.30545 516 569.4218 re W n /Cs1 cs 1 1 1 sc 48 93.30545 516 569.4218 re f 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 66.86 589.28 Tm /TT2 1 Tf (24. ) Tj ET Q
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 96.86 40.39 Tm /TT2 1 Tf (a. ) Tj ET Q
endstream
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF
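I figured out what is happening. The PDF contains a clipping rectangle that does not include the "a.". I tried using PDFVisibleTextStripper, but it removed text in other documents that was actually visible.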
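The clipping can be checked by hand from the sample content stream. The sketch below is plain java.awt.geom arithmetic, not PDFBox code: it intersects the two "re W n" rectangles, maps each text origin (the translation part of Tm) through "1 0 0 0.8181818 0 54 cm", and tests whether the resulting baseline point lies inside the clip. The class and method names are made up for illustration. The "a." baseline lands at roughly y = 87.05, below the clip's lower edge at 93.30545; assuming Helvetica's nominal ascent of about 0.72 em, the glyphs (about 4.7 units tall at this size) stay below that edge, so nothing is painted.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Worked arithmetic for the sample content stream (hypothetical helper class).
final class ClipArithmetic {
    // Intersection of the two nested "re W n" clipping rectangles.
    static Rectangle2D effectiveClip() {
        Rectangle2D outer = new Rectangle2D.Double(0, 54, 612, 648);
        Rectangle2D inner = new Rectangle2D.Double(48, 93.30545, 516, 569.4218);
        return outer.createIntersection(inner);
    }

    // Maps a Tm translation (tx, ty) through "1 0 0 0.8181818 0 54 cm".
    static Point2D baseline(double tx, double ty) {
        AffineTransform cm = new AffineTransform(1, 0, 0, 0.8181818, 0, 54);
        return cm.transform(new Point2D.Double(tx, ty), null);
    }

    public static void main(String[] args) {
        Rectangle2D clip = effectiveClip();
        System.out.println("clip y-range: " + clip.getMinY() + " .. " + clip.getMaxY());
        System.out.println("\"24.\" baseline inside clip: "
                + clip.contains(baseline(66.86, 589.28)));
        System.out.println("\"a.\" baseline inside clip: "
                + clip.contains(baseline(96.86, 40.39)));
    }
}
```

PDFTextStripper does not evaluate the clipping path, which is why it still reports the "a.".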
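In the end, I wrote a class that inherits from PageDrawer and overrides the showGlyph method to get access to the characters being drawn on the page. That method checks whether the character's bounding box lies outside getGraphicsState().getCurrentClippingPath().getBounds2D().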
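One caveat with getBounds2D(): it reduces the clipping path to its bounding rectangle, so a glyph sitting in a "hole" of a non-rectangular clip would still count as visible. The sketch below, using only java.awt.geom, contrasts the two tests on an L-shaped clip. It assumes, as in PDFBox 2.x, that getCurrentClippingPath() returns a java.awt.geom.Area; the class and method names are hypothetical and not part of the code in this answer.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Compares a bounding-rectangle clip test with a test against the clip shape.
final class ClipShapeCheck {
    // Coarse test: only the clip's bounding rectangle is considered.
    static boolean visibleByBounds(Area clip, Rectangle2D glyphBox) {
        return clip.getBounds2D().intersects(glyphBox);
    }

    // Finer test: the glyph box must overlap the clip shape itself.
    static boolean visibleByShape(Area clip, Rectangle2D glyphBox) {
        return clip.intersects(glyphBox);
    }

    // An L-shaped clip: a horizontal bar plus a vertical bar.
    static Area lShapedClip() {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 20));
        clip.add(new Area(new Rectangle2D.Double(0, 0, 20, 100)));
        return clip;
    }

    public static void main(String[] args) {
        Area clip = lShapedClip();
        // This box is inside the bounding rectangle but outside the L shape.
        Rectangle2D inNotch = new Rectangle2D.Double(60, 60, 10, 10);
        System.out.println("by bounds: " + visibleByBounds(clip, inNotch));
        System.out.println("by shape:  " + visibleByShape(clip, inNotch));
    }
}
```

For the rectangular clips in the sample PDF both tests agree, so the simpler bounds check is enough there.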
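Unfortunately, this means I'm no longer using PDFTextStripper, so I had to reimplement some of its behaviour, such as sorting characters by position (I was using setSortByPosition(true)). Computing the correct bounding box for a character from the font size and displacement was also a bit tricky.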
ExtractChars.java
package com.example.foo;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import org.apache.pdfbox.util.Vector;

import java.awt.geom.*;
import java.io.*;

// This class effectively renders the PDF document in order to extract its
// text. It intercepts the showGlyph function provided by PageDrawer. We used to
// use PDFTextStripper but that has no way to exclude clipped characters.
public class ExtractChars extends PageDrawerHelper {
    // Skip erroneous characters smaller than this height. This might never happen
    // but there are places in the code that divide by height, so guard against it.
    static final float MIN_CHARACTER_HEIGHT = 0.01f;

    Processor processor;

    ExtractChars(PageDrawerParameters params, float pageHeight, int pageIndex, Processor processor) throws IOException {
        super(params, pageHeight, pageIndex);
        this.processor = processor;
    }

    // We can't move this method up to the superclass because the Renderer is
    // different each time. It needs to build an instance of the current class.
    public static void extract(PDDocument document, Processor processor) throws IOException {
        Renderer renderer = new Renderer(document);
        renderer.processor = processor;
        for (int i = 0; i < document.getNumberOfPages(); i += 1) {
            PDPage page = document.getPage(i);
            renderer.pageHeight = page.getMediaBox().getHeight();
            renderer.pageIndex = i;
            renderer.renderImage(i);
        }
    }

    @Override
    public void showGlyph(Matrix matrix, PDFont font, int _code, String unicode, Vector displacement) throws IOException {
        if (unicode == null) { return; }

        // Get the width and height of the character relative to font size.
        // The height does not change but the width does, e.g. 'M' is wider than 'I'.
        float width = displacement.getX();
        float height = fontHeight(font) / 2;
        BoundingBox charBox = clippedBoundingBox(matrix, width, height);

        // Skip the character if it is outside the clipping region and not visible.
        if (charBox == null) { return; }
        float boxHeight = charBox.bottom - charBox.top;
        if (boxHeight < MIN_CHARACTER_HEIGHT) { return; }

        // We need the text direction so we can sort text in separate buckets based on this.
        int direction = textDirection(matrix);
        processor.process(unicode, charBox, direction);
    }

    // https://stackoverflow.com/questions/17171815/get-the-font-height-of-a-character-in-pdfbox#answer-17202929
    float fontHeight(PDFont font) {
        return font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000;
    }

    int textDirection(Matrix matrix) {
        float a = matrix.getValue(0, 0);
        float b = matrix.getValue(0, 1);
        float c = matrix.getValue(1, 0);
        float d = matrix.getValue(1, 1);

        // This logic is copied from:
        // https://github.com/atsuoishimoto/pdfbox-ja/blob/master/src/main/java/org/apache/pdfbox/util/TextPosition.java
        if ((a > 0) && (Math.abs(b) < d) && (Math.abs(c) < a) && (d > 0)) {
            return 0;
        } else if ((a < 0) && (Math.abs(b) < Math.abs(d)) && (Math.abs(c) < Math.abs(a)) && (d < 0)) {
            return 180;
        } else if ((Math.abs(a) < Math.abs(c)) && (b > 0) && (c < 0) && (Math.abs(d) < b)) {
            return 90;
        } else if ((Math.abs(a) < c) && (b < 0) && (c > 0) && (Math.abs(d) < Math.abs(b))) {
            return 270;
        }
        return 0;
    }

    // We can't construct an instance of ExtractChars directly because its
    // constructor requires PageDrawerParameters which is private to the package.
    // Instead, make an instance via a renderer and forward the fields to it.
    static class Renderer extends PDFRenderer {
        Processor processor;
        float pageHeight;
        int pageIndex;

        Renderer(PDDocument document) {
            super(document);
        }

        protected PageDrawer createPageDrawer(PageDrawerParameters params) throws IOException {
            return new ExtractChars(params, pageHeight, pageIndex, processor);
        }
    }

    public interface Processor {
        void process(String character, BoundingBox box, int direction);
    }
}
PageDrawerHelper.java
package com.example.foo;

import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;

import java.awt.geom.*;
import java.io.*;

// This class provides utility methods to subclasses, mostly so they can check
// if the current content is being clipped and therefore should be skipped.
//
// We shouldn't really use inheritance for sharing code but this has the
// advantage of being able to call some methods of the PageDrawer superclass.
public class PageDrawerHelper extends PageDrawer {
    float pageHeight;
    int pageIndex;

    PageDrawerHelper(PageDrawerParameters params, float pageHeight, int pageIndex) throws IOException {
        super(params);
        this.pageHeight = pageHeight;
        this.pageIndex = pageIndex;
    }

    // Gets the bounding box for a matrix by transforming corner points and taking
    // the min/max values in the x- and y-directions. This ensures rotation and skew
    // are taken into account. This method can return null if content is clipped.
    BoundingBox clippedBoundingBox(Matrix matrix, float width, float height) {
        Point2D p0 = matrix.transformPoint(0, 0);
        Point2D p1 = matrix.transformPoint(0, height);
        Point2D p2 = matrix.transformPoint(width, 0);
        Point2D p3 = matrix.transformPoint(width, height);
        BoundingBox contentBox = boundingBox(p0, p1, p2, p3);
        BoundingBox clippedBox = applyClipping(contentBox);
        return clippedBox;
    }

    // Converts from PDF coordinates (origin at bottom left) to page coordinates
    // with the origin at the top left, hence the subtraction from pageHeight.
    BoundingBox boundingBox(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        Point2D topLeft = topLeft(p0, p1, p2, p3);
        Point2D botRight = botRight(p0, p1, p2, p3);
        float left = (float)topLeft.getX();
        float right = (float)botRight.getX();
        float top = pageHeight - (float)botRight.getY();
        float bottom = pageHeight - (float)topLeft.getY();
        return new BoundingBox(pageIndex, left, right, top, bottom);
    }

    Point2D topLeft(Point2D... points) {
        double minX = points[0].getX();
        double minY = points[0].getY();
        for (int i = 1; i < points.length; i += 1) {
            minX = Math.min(minX, points[i].getX());
            minY = Math.min(minY, points[i].getY());
        }
        return new Point2D.Double(minX, minY);
    }

    Point2D botRight(Point2D... points) {
        double maxX = points[0].getX();
        double maxY = points[0].getY();
        for (int i = 1; i < points.length; i += 1) {
            maxX = Math.max(maxX, points[i].getX());
            maxY = Math.max(maxY, points[i].getY());
        }
        return new Point2D.Double(maxX, maxY);
    }

    // Intersects the box with the bounds of the current clipping path.
    // Returns null when the intersection is empty, i.e. nothing is visible.
    BoundingBox applyClipping(BoundingBox box) {
        Rectangle2D clip = getGraphicsState().getCurrentClippingPath().getBounds2D();
        float clipLeft = (float)clip.getMinX();
        float clipRight = (float)clip.getMaxX();
        float clipTop = pageHeight - (float)clip.getMaxY();
        float clipBottom = pageHeight - (float)clip.getMinY();
        float left = Math.max(box.left, clipLeft);
        float right = Math.min(box.right, clipRight);
        float top = Math.max(box.top, clipTop);
        float bottom = Math.min(box.bottom, clipBottom);
        if (left >= right || top >= bottom) {
            return null;
        } else {
            return new BoundingBox(pageIndex, left, right, top, bottom);
        }
    }
}
CharacterSorter.java
package com.example.foo;

import java.util.*;

public class CharacterSorter {
    ArrayList<String> characters;
    ArrayList<BoundingBox> boxes;
    ArrayList<Integer> directions;

    public CharacterSorter(ArrayList<String> characters, ArrayList<BoundingBox> boxes, ArrayList<Integer> directions) {
        this.characters = characters;
        this.boxes = boxes;
        this.directions = directions;
    }

    // Sorts the three parallel lists in place, keeping them aligned.
    public void sortByDirectionThenPosition() {
        ArrayList<Tuple> tuples = new ArrayList<>();
        for (int i = 0; i < characters.size(); i += 1) {
            tuples.add(new Tuple(characters.get(i), boxes.get(i), directions.get(i)));
        }
        Collections.sort(tuples);
        characters.clear();
        boxes.clear();
        directions.clear();
        for (Tuple tuple : tuples) {
            characters.add(tuple.character);
            boxes.add(tuple.box);
            directions.add(tuple.direction);
        }
    }

    // This helper class wraps the three fields associated with a single character
    // and provides a comparator function which mimics how PDFTextStripper orders
    // its characters when #setSortByPosition(true) is set.
    class Tuple implements Comparable<Tuple> {
        String character;
        BoundingBox box;
        Integer direction;

        Tuple(String character, BoundingBox box, Integer direction) {
            this.character = character;
            this.box = box;
            this.direction = direction;
        }

        @Override
        public int compareTo(Tuple other) {
            int primary = Integer.compare(box.pageIndex, other.box.pageIndex);
            if (primary != 0) { return primary; }

            // The remainder of this logic is copied and adapted from:
            // https://github.com/apache/pdfbox/blob/a78f4a2ea058181e5ed05d6367ba7556948331b8/pdfbox/src/main/java/org/apache/pdfbox/text/TextPositionComparator.java#L29-L70

            // Only compare text that is in the same direction.
            int secondary = Integer.compare(direction, other.direction);
            if (secondary != 0) { return secondary; }

            // Get the text direction adjusted coordinates.
            float x1 = box.left;
            float x2 = other.box.left;
            float pos1YBottom = box.bottom;
            float pos2YBottom = other.box.bottom;

            // Note that the coordinates have been adjusted so (0, 0) is in upper left.
            float pos1YTop = pos1YBottom - (box.bottom - box.top);
            float pos2YTop = pos2YBottom - (other.box.bottom - other.box.top);
            float yDifference = Math.abs(pos1YBottom - pos2YBottom);

            // We will do a simple tolerance comparison.
            if (yDifference < .1 ||
                pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
                pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
            {
                return Float.compare(x1, x2);
            } else if (pos1YBottom < pos2YBottom) {
                return -1;
            } else {
                return 1;
            }
        }
    }
}
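The code above relies on a BoundingBox class that is not shown in this post. A minimal sketch of what it might look like follows; the field names and the constructor argument order are inferred from how the class is used above (new BoundingBox(pageIndex, left, right, top, bottom), box.left, box.pageIndex and so on), while the width() and height() helpers are additions for illustration only. The package declaration is omitted here; in the project it would sit in com.example.foo alongside the other files.

```java
// Hypothetical value class holding a character's position on a page.
// Coordinates use a top-left origin, so top < bottom for a visible box.
final class BoundingBox {
    final int pageIndex;
    final float left;
    final float right;
    final float top;
    final float bottom;

    BoundingBox(int pageIndex, float left, float right, float top, float bottom) {
        this.pageIndex = pageIndex;
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
    }

    // Convenience accessors (not required by the code above).
    float width() {
        return right - left;
    }

    float height() {
        return bottom - top;
    }

    @Override
    public String toString() {
        return "BoundingBox[page=" + pageIndex + ", left=" + left + ", right=" + right
                + ", top=" + top + ", bottom=" + bottom + "]";
    }

    public static void main(String[] args) {
        BoundingBox box = new BoundingBox(0, 10f, 30f, 100f, 112f);
        System.out.println(box + " width=" + box.width() + " height=" + box.height());
    }
}
```

Keeping the fields final makes each box immutable, which fits applyClipping returning a new instance rather than modifying its argument.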