Extracting Text with PDFBox

寿子轩

2023-12-01

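For the past couple of days I have been writing my own desktop search program. I will write up the problems I run into along the way, so that I don't forget them and make the same mistakes again.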

First, text extraction with PDFBox. My initial extraction code looked like this:

FileInputStream is = new FileInputStream(file);
PDFParser parser = new PDFParser(is);
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
// The PDDocument created inline here is never closed.
String docText = stripper.getText(new PDDocument(cosDoc));

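Written this way, every PDF I extracted text from produced the following exception, which complains that the document was never closed: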

java.lang.Throwable: Warning: You did not close the PDF Document

Constructing the document as follows does not produce the exception:

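FileInputStream is = new FileInputStream(file);
PDFTextStripper stripper = new PDFTextStripper();
// Load the document directly instead of going through PDFParser.
PDDocument pdfDocument = PDDocument.load(is);
StringWriter writer = new StringWriter();
stripper.writeText(pdfDocument, writer);
String docText = writer.getBuffer().toString();
// Close the document once the text has been extracted.
pdfDocument.close();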
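Since the message is about a document that was not closed, it is probably worth making sure close() runs even when extraction itself fails. Below is a minimal sketch of that try/finally pattern using only the standard library. `CloseSketch`, `FakeDocument` and `extractText` are hypothetical names made up for illustration; `FakeDocument` merely stands in for PDDocument and is not a PDFBox class.

```java
import java.io.Closeable;
import java.io.IOException;

class CloseSketch {

    /** Hypothetical stand-in for PDDocument: it only records whether close() ran. */
    static class FakeDocument implements Closeable {
        final boolean failOnRead;
        boolean closed = false;

        FakeDocument(boolean failOnRead) {
            this.failOnRead = failOnRead;
        }

        /** Pretends to extract text; fails on demand to simulate a broken file. */
        String text() throws IOException {
            if (failOnRead) {
                throw new IOException("simulated extraction failure");
            }
            return "some text";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Extracts the text and always closes the document, even if extraction throws. */
    static String extractText(FakeDocument doc) throws IOException {
        try {
            return doc.text();
        } finally {
            doc.close();
        }
    }

    public static void main(String[] args) throws IOException {
        FakeDocument doc = new FakeDocument(false);
        System.out.println(extractText(doc));
        System.out.println("closed: " + doc.closed);
    }
}
```

With real PDFBox code the same shape would apply: load the document, do the extraction inside the try block, and call close() on the document in finally.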