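Link to PDF
When I try to extract text from the above PDF, I get a mix of text that is invisible in the evince viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, such as the "S" in "FALCONS" and many of the "½" characters. I believe this is caused by interference from the invisible text, because when I highlight the PDF in the viewer, the invisible text can be seen overlapping the visible text.
Is there a way to remove the invisible text? Or is there another solution?
Code: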
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class App {
    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;
        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            // only extract the first page
            textStripper.setEndPage(1);
            text = textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }
        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;
        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
        System.out.println(text);
    }
}
Output (the desired text is in bold):
145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
选择项
队数
投注金额
逆向测试仪
标记框如图所示
表示主队
职业橄榄球-2012年11月15日,星期四
1账单★NFL PM8:25 2海豚7–½6–½
职业足球-2012年11月18日,星期日
3个红人★PM1:00 4个老鹰10–½3–½
5包PM1:00 6里昂★10–½3–½
7个好友★PM1:00 8个基数17–½3+½
9个海盗PM1:00 10个裤子★7–½6–½
11牛郎★PM1:00 12布朗14–½+½
13 RAMS★PM1:00 14 JETS10–½3–½
15爱国者★PM4:25 16螺栓17–½3+½
17德克萨斯州★PM1:00 18美洲虎23–½9+½
19孟加拉人下午1:00 20奶酪★10–½3–½
21圣徒下午4:05 22袭击者★12–½1–½
23 BRONCOS★PM4:25 24充电器14–½+½
25乌鸦NBC PM8:30 26钢★7–½6–½
职业足球-2012年11月19日,星期一
27 49ERS★ESPN PM8:40 28 BEARS10–½3–½
1,000
145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
选择项
队数
投注金额
逆向测试仪
标记框为how
表示主队
职业橄榄球-2012年11月15日,星期四
1账单★NFL PM8:25 2海豚7–½6–½
职业足球-2012年11月18日,星期日
3个红人★PM1:00 4个老鹰10–½3–½
5包PM1:00 6里昂★10–½3–½
7个好友★PM1:00 8个基数17–½3+½
9个海盗PM1:00 10个裤子★7–½6–½
11牛郎★PM1:00 12布朗14–½+½
13 RAMS★PM1:00 14 JETS10–½3–½
15爱国者★PM4:25 16螺栓17–½3+½
17德克萨斯州★PM1:00 18美洲虎23–½9+½
19孟加拉人下午1:00 20奶酪★10–½3–½
21圣徒下午4:05 22袭击者★12–½1–½
23 BRONCOS★PM4:25 24充电器14–½+½
25乌鸦NBC PM8:30 26钢RS★7–½6–½
职业足球-2012年11月19日,星期一
27 49ERS★ESPN PM8:40 28 BEARS10–½3–½
1,000
145
143
159
14
160
41
15715 156154150 153149 152148 51147
142
158
50
146
选择
队数
投注金额
方舟
表示主队
职业橄榄球-2012年11月15日,星期四
1账单★NFL PM8:25 2海豚7–½6–½
职业橄榄球-2012年11月18日,星期日
3红发★PM1:0 4鹰10–½3–½
5包PM1:0 6里昂★10–½3–½
7个好友★PM1:0 8个基数17–½3+½
9 BU CANEERS PM1:0 10裤子★7–½6–½
11牛郎★PM1:0 12布朗14–½+½
13 RAMS★PM1:0 14 JETS10–½3–½
15爱国者★PM4:25 16螺栓17–½3+½
17德州★PM1:0 18 JAGUARS23–½9+½
19孟加拉PM1:0 20奶酪★10–½3–½
21圣徒下午4:05 22袭击者★12–½1–½
23 BRONCOS★PM4:25 24充电器14–½+½
25乌鸦NBC PM8:30 26钢★7–½6–½
职业足球-2012年11月19日,星期一
27 49ERS★ESPN PM8:40 28 BEARS10–½3–½
1,0
显示的标记框
ODENOTES家庭团队
职业橄榄球-2016年9月8日,星期四
**1裤子nbc-10½8:30p 2 BRONCOS-3½
职业足球-2016年9月11日星期日
猎鹰-9 1:00p 4海盗-4½
5北欧海盗-9½1:00p 6泰坦队-4½
7鹰-10½1:00p 8布朗-3½
9孟加拉虎-9½1:00p 10箭-4½
11点圣骑士-7½1:00p 12点-6½
13 CHIEFS-14½1:00p 14充电器+½
15乌鸦-10½1:00p 16帐单-3½
17德克萨斯州-14 1:00p 18熊+½
19包-12 1:00p 20美洲虎-1½
21海鹰-17½4:05p 22海豚+3½
23牛仔男孩-7½4:25p 24裤-6½
25克拉-10½4:25p 26里昂-3½
27 DINnbc-14½8:30p 28爱国者+½
职业橄榄球-2016年9月12日,星期一
29钢espn-10½7:10p 30 REDSKINS-3½
31 RAMs espn-9 10:20p 32 49ERS-4½**
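The invisible text in the OP's sample PDF is mostly made invisible in two ways: by clip paths (the text lies outside the clip area) and by filled paths (the fill hides the text underneath). Therefore, we have to take path-related instructions into account during text extraction in order to ignore that invisible text.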
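To illustrate the two mechanisms just described, here is a minimal sketch that uses only java.awt.geom types and no PDFBox classes. The class InvisibleTextDemo and its two helper methods are hypothetical names introduced for illustration; they only show the geometric tests, not how PDFBox reports clip or fill operations.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

/** Hypothetical demo class; not part of PDFBox. */
final class InvisibleTextDemo {

    /** A glyph origin outside the current clip area is never painted. */
    static boolean hiddenByClip(Area clip, double x, double y) {
        // a null clip is treated as "no clipping at all"
        return clip != null && !clip.contains(x, y);
    }

    /** A glyph origin inside a path that is filled afterwards gets painted over. */
    static boolean hiddenByFill(GeneralPath filledLater, double x, double y) {
        return filledLater.contains(x, y);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 50));

        // rectangle from (10,10) to (40,30) that is filled after the text was drawn
        GeneralPath cover = new GeneralPath(GeneralPath.WIND_NON_ZERO);
        cover.moveTo(10, 10);
        cover.lineTo(40, 10);
        cover.lineTo(40, 30);
        cover.lineTo(10, 30);
        cover.closePath();

        System.out.println(hiddenByClip(clip, 120, 20)); // origin outside the clip area
        System.out.println(hiddenByFill(cover, 20, 20)); // origin under the later fill
        System.out.println(hiddenByClip(clip, 60, 20) || hiddenByFill(cover, 60, 20)); // neither
    }
}
```

In a text stripper, the first test would have to run when a character is reported, and the second one each time a fill instruction is encountered, against the characters collected so far.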
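Unfortunately, no callbacks for these instructions are declared in PDFTextStripper, in its parent class LegacyPDFStreamEngine, or in PDFStreamEngine.
They are declared, however, in the other main PDFStreamEngine subclass, PDFGraphicsStreamEngine, and implemented in PageDrawer.
To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, for example: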
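// Imports assume PDFBox 2.0.x (the version family that has PDDocument.load).
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {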
public PDFVisibleTextStripper() throws IOException {
addOperator(new AppendRectangleToPath());
addOperator(new ClipEvenOddRule());
addOperator(new ClipNonZeroRule());
addOperator(new ClosePath());
addOperator(new CurveTo());
addOperator(new CurveToReplicateFinalPoint());
addOperator(new CurveToReplicateInitialPoint());
addOperator(new EndPath());
addOperator(new FillEvenOddAndStrokePath());
addOperator(new FillEvenOddRule());
addOperator(new FillNonZeroAndStrokePath());
addOperator(new FillNonZeroRule());
addOperator(new LineTo());
addOperator(new MoveTo());
addOperator(new StrokePath());
}
@Override
protected void processTextPosition(TextPosition text) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
PDGraphicsState gs = getGraphicsState();
Area area = gs.getCurrentClippingPath();
if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
super.processTextPosition(text);
}
private GeneralPath linePath = new GeneralPath();
void deleteCharsInPath() {
for (List<TextPosition> list : charactersByArticle) {
List<TextPosition> toRemove = new ArrayList<>();
for (TextPosition text : list) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
toRemove.add(text);
}
}
if (toRemove.size() != 0) {
System.out.println(toRemove.size());
list.removeAll(toRemove);
}
}
}
public final class AppendRectangleToPath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 4) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x = (COSNumber) operands.get(0);
COSNumber y = (COSNumber) operands.get(1);
COSNumber w = (COSNumber) operands.get(2);
COSNumber h = (COSNumber) operands.get(3);
float x1 = x.floatValue();
float y1 = y.floatValue();
// create a pair of coordinates for the transformation
float x2 = w.floatValue() + x1;
float y2 = h.floatValue() + y1;
Point2D p0 = context.transformedPoint(x1, y1);
Point2D p1 = context.transformedPoint(x2, y1);
Point2D p2 = context.transformedPoint(x2, y2);
Point2D p3 = context.transformedPoint(x1, y2);
// to ensure that the path is created in the right direction, we have to create
// it by combining single lines instead of creating a simple rectangle
linePath.moveTo((float) p0.getX(), (float) p0.getY());
linePath.lineTo((float) p1.getX(), (float) p1.getY());
linePath.lineTo((float) p2.getX(), (float) p2.getY());
linePath.lineTo((float) p3.getX(), (float) p3.getY());
// close the subpath instead of adding the last line so that a possible set line
// cap style isn't taken into account at the "beginning" of the rectangle
linePath.closePath();
}
@Override
public String getName() {
return "re";
}
}
public final class StrokePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.reset();
}
@Override
public String getName() {
return "S";
}
}
public final class FillEvenOddRule extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "f*";
}
}
public class FillNonZeroRule extends OperatorProcessor {
@Override
public final void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "f";
}
}
public final class FillEvenOddAndStrokePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "B*";
}
}
public class FillNonZeroAndStrokePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "B";
}
}
public final class ClipEvenOddRule extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
getGraphicsState().intersectClippingPath(linePath);
}
@Override
public String getName() {
return "W*";
}
}
public class ClipNonZeroRule extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
getGraphicsState().intersectClippingPath(linePath);
}
@Override
public String getName() {
return "W";
}
}
public final class MoveTo extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 2) {
throw new MissingOperandException(operator, operands);
}
COSBase base0 = operands.get(0);
if (!(base0 instanceof COSNumber)) {
return;
}
COSBase base1 = operands.get(1);
if (!(base1 instanceof COSNumber)) {
return;
}
COSNumber x = (COSNumber) base0;
COSNumber y = (COSNumber) base1;
Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
linePath.moveTo(pos.x, pos.y);
}
@Override
public String getName() {
return "m";
}
}
public class LineTo extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 2) {
throw new MissingOperandException(operator, operands);
}
COSBase base0 = operands.get(0);
if (!(base0 instanceof COSNumber)) {
return;
}
COSBase base1 = operands.get(1);
if (!(base1 instanceof COSNumber)) {
return;
}
// append straight line segment from the current point to the point
COSNumber x = (COSNumber) base0;
COSNumber y = (COSNumber) base1;
Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
linePath.lineTo(pos.x, pos.y);
}
@Override
public String getName() {
return "l";
}
}
public class CurveTo extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 6) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x1 = (COSNumber) operands.get(0);
COSNumber y1 = (COSNumber) operands.get(1);
COSNumber x2 = (COSNumber) operands.get(2);
COSNumber y2 = (COSNumber) operands.get(3);
COSNumber x3 = (COSNumber) operands.get(4);
COSNumber y3 = (COSNumber) operands.get(5);
Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
}
@Override
public String getName() {
return "c";
}
}
public final class CurveToReplicateFinalPoint extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 4) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x1 = (COSNumber) operands.get(0);
COSNumber y1 = (COSNumber) operands.get(1);
COSNumber x3 = (COSNumber) operands.get(2);
COSNumber y3 = (COSNumber) operands.get(3);
Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
}
@Override
public String getName() {
return "y";
}
}
public class CurveToReplicateInitialPoint extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 4) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x2 = (COSNumber) operands.get(0);
COSNumber y2 = (COSNumber) operands.get(1);
COSNumber x3 = (COSNumber) operands.get(2);
COSNumber y3 = (COSNumber) operands.get(3);
Point2D currentPoint = linePath.getCurrentPoint();
Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
}
@Override
public String getName() {
return "v";
}
}
public final class ClosePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.closePath();
}
@Override
public String getName() {
return "h";
}
}
public final class EndPath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.reset();
}
@Override
public String getName() {
return "n";
}
}
}
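(PDFVisibleTextStripper)
Make sure that you use the inner operator classes in the PDFVisibleTextStripper constructor, not the classes of the same name used by PageDrawer. To be sure, simply use the code from the link under the code.
This reduces the output to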
REVERSE tEaSER caRd
500
elections
er of Teams
t Bet
1,000
MARK BOX AS SHOWN
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS - 3½
PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
3 FALCONS - 9½ 1:00p 4 BUCCANEERS - 4½
5 VIKINGS - 9½ 1:00p 6 TITANS - 4½
7 EAGLES - 10½ 1:00p 8 BROWNS - 3½
9 BENGALS - 9½ 1:00p 10 JETS - 4½
11 SAINTS - 7½ 1:00p 12 RAIDERS - 6½
13 CHIEFS - 14½ 1:00p 14 CHARGERS + ½
15 RAVENS - 10½ 1:00p 16 BILLS - 3½
17 TEXANS - 14½ 1:00p 18 BEARS + ½
19 PACKERS - 12½ 1:00p 20 JAGUARS - 1½
21 SEAHAWKS - 17½ 4:05p 22 DOLPHINS + 3½
23 COWBOYS - 7½ 4:25p 24 GIANTS - 6½
25 COLTS - 10½ 4:25p 26 LIONS - 3½
27 CARDINALS nbc - 14½ 8:30p 28 PATRIOTS + ½
PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
29 STEELERS espn - 10½ 7:10p 30 REDSKINS - 3½
31 RAMS espn - 9½ 10:20p 32 49ERS - 4½
which discards most of the unwanted data.
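In the context of a related question it became clear that the way the end of the character baseline (Vector end) is calculated in processTextPosition and deleteCharsInPath implicitly assumes horizontal text without page rotation. If the criterion for "visibility" is relaxed, though, so that a character counts as visible whenever the start of its baseline is visible, the code calculating Vector end is no longer needed, and the code also works for rotated pages.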
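The difference between the two criteria can be sketched with plain java.awt.geom types. This is only an illustration under stated assumptions: BaselineStartCheckDemo, visibleStrict, and visibleRelaxed are hypothetical names, and the coordinates stand in for what the text matrix would yield for text rotated by 90 degrees.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

/** Hypothetical demo class; not part of PDFBox. */
final class BaselineStartCheckDemo {

    /** Strict check as in the code above: start and a horizontally offset end must be inside. */
    static boolean visibleStrict(Area clip, double startX, double startY, double width) {
        return clip == null
                || (clip.contains(startX, startY) && clip.contains(startX + width, startY));
    }

    /** Relaxed check: only the start of the baseline has to be inside the clip area. */
    static boolean visibleRelaxed(Area clip, double startX, double startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // tall, narrow clip area, as one might get for a vertical line of text
        Area clip = new Area(new Rectangle2D.Double(0, 0, 20, 100));

        // a glyph running upwards from (10, 10) with an advance of 30 units;
        // its real end is (10, 40), but the strict check tests (40, 10)
        System.out.println(visibleStrict(clip, 10, 10, 30));
        System.out.println(visibleRelaxed(clip, 10, 10));
    }
}
```

The price of the relaxed check is that a glyph whose start is visible but whose remainder is clipped away is still kept.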
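In the context of another related question it became clear that, due to floating-point calculation errors, the coordinates of a glyph origin lying exactly on the border of the clip path may end up just outside the clip path. Switching to a "fat point coordinate check" turned out to be an acceptable workaround.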
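One plausible reading of such a "fat point" check, sketched with java.awt.geom only: instead of testing a single point, test whether a tiny square around the point overlaps the clip area. The class FatPointCheckDemo, the method names, and the tolerance value are assumptions made for illustration; the tolerance actually used for the workaround is not given here.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

/** Hypothetical demo class; the tolerance is an assumed value. */
final class FatPointCheckDemo {

    /** Exact check: fails for points that rounding errors push just across the border. */
    static boolean containsExact(Area clip, double x, double y) {
        return clip.contains(x, y);
    }

    /** "Fat point" check: treat the point as a small square and test for overlap. */
    static boolean containsFat(Area clip, double x, double y, double halfSize) {
        return clip.intersects(x - halfSize, y - halfSize, 2 * halfSize, 2 * halfSize);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(10, 10, 100, 20));

        // nominally x = 10, i.e. exactly on the left border, but slightly off
        double x = 9.9999;
        double y = 20;

        System.out.println(containsExact(clip, x, y));     // the exact test misses the glyph
        System.out.println(containsFat(clip, x, y, 0.01)); // the fat test keeps it
        System.out.println(containsFat(clip, 0, 0, 0.01)); // far-away points stay excluded
    }
}
```

A larger tolerance makes the check more forgiving but also lets in glyphs that really start slightly outside the clip area, so the value is a trade-off.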