Parsing PDF Content with PDFBox

邓驰

2023-12-01

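First, something to lighten the mood:

The Ace Interviewee
Interviewer: Which language are you familiar with?
Candidate: Java.
Interviewer: Do you know what a class (类, lèi) is?
Candidate: I'm a down-to-earth person and a hard worker. I don't know the meaning of tired (累, lèi).
Interviewer: Do you know what a package (包, bāo) is?
Candidate: I'm a simple person. I don't usually carry a bag (包, bāo), so the company needn't provide one.
Interviewer: Do you know what an interface (接口, jiēkǒu) is?
Candidate: I take my work seriously. I never look for excuses (借口, jièkǒu) to slack off.
Interviewer: Do you know what inheritance (继承) is?
Candidate: I'm an orphan. I have nothing to inherit.
Interviewer: Do you know what an object (对象, which also means "partner") is?
Candidate: I do, but I work hard and I'm ambitious, so I'm not planning to look for a partner just yet.
Interviewer: Do you know polymorphism (多态, duōtài)?
Candidate: I do, and I'm quite conservative. I think it is immoral to make the woman you love have an abortion (堕胎, duòtāi) for the sake of your own momentary pleasure!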

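Parsing PDF content with PDFBox: in the code below, the getText method takes a String parameter that specifies the location of the PDF file to extract. This location can be a URL or a local file. The method then creates the PDFTextStripper class provided by PDFBox and sets a few properties for the extraction (such as the start page and whether to sort by position). Finally, it extracts the text and writes it to a file.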

public void getText(String file) throws Exception {
    // Whether to sort the text by position
    boolean sort = false;
    // PDF file name
    String pdfFile = file;
    // Output text file name
    String textFile = null;
    // Encoding
    String encoding = "UTF-8";
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // Writer that produces the text file
    Writer output = null;
    // The PDF document held in memory
    PDDocument document = null;
    try {
        try {
            // First try to load the file as a URL; if that throws,
            // fall back to the local file system
            URL url = new URL(pdfFile);
            document = PDDocument.load(url);
            // Get the PDF's file name
            String fileName = url.getFile();
            // Name the generated .txt file after the original PDF
            if (fileName.length() > 4) {
                File outputFile = new File(
                        fileName.substring(0, fileName.length() - 4) + ".txt");
                textFile = outputFile.getName();
            }
        } catch (MalformedURLException e) {
            // Not a valid URL, so load from the file system instead
            document = PDDocument.load(pdfFile);
            if (pdfFile.length() > 4) {
                textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
            }
        }
        // Without an output name there is nothing to write to
        if (textFile == null) {
            throw new IllegalArgumentException("Cannot derive a .txt name from: " + pdfFile);
        }
        // Output stream writer, writes to textFile
        output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
        // PDFTextStripper extracts the text
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort by position
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        // Call writeText to extract the text and write it out
        stripper.writeText(document, output);
    } finally {
        if (output != null) {
            // Close the output stream
            output.close();
        }
        if (document != null) {
            // Close the PDF document
            document.close();
        }
    }
}

Add a main function:

public static void main(String[] args) {
    PdfboxTest test = new PdfboxTest();
    try {
        // Extract the content of index.pdf on drive C
        test.getText("C:/index.pdf");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
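
The naming rule used by the method above (strip the last four characters and append .txt, taking only the last path segment when the location is a URL) can be tried out without PDFBox. The following is a minimal sketch using only the JDK; the class TextFileNames and the method deriveTextFileName are hypothetical names introduced for illustration and are not part of PDFBox or of the original code.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of PDFBox: mirrors the idea of the
// output-name logic in the extraction method above.
final class TextFileNames {

    private TextFileNames() {
    }

    // Returns the .txt name for a PDF location, or null when the name
    // is too short to carry a three-letter extension such as ".pdf".
    static String deriveTextFileName(String pdfLocation) {
        String name;
        try {
            // Valid URL: keep only the last path segment
            URL url = new URL(pdfLocation);
            String path = url.getFile();
            name = path.substring(path.lastIndexOf('/') + 1);
        } catch (MalformedURLException e) {
            // Not a URL: treat the string as a local path and keep it whole
            name = pdfLocation;
        }
        if (name.length() <= 4) {
            return null;
        }
        return name.substring(0, name.length() - 4) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(deriveTextFileName("http://example.com/docs/index.pdf"));
        System.out.println(deriveTextFileName("C:/index.pdf"));
    }
}
```

For "C:/index.pdf" the URL constructor throws MalformedURLException, because "c" is not a known protocol, so the local-path branch is taken and the result is "C:/index.txt". The example.com address is only a placeholder.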

Here are the imports as well, to save you the trouble:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;

// Package names of the pre-Apache PDFBox releases (0.7.x);
// Apache releases use org.apache.pdfbox.* instead
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

A quick review of the File class:

public File(String pathname)
Creates a new File instance by converting the given pathname string into an abstract pathname. If the given string is the empty string, then the result is the empty abstract pathname.

Parameters:
pathname - A pathname string

public File(URI uri)
Creates a new File instance by converting the given file: URI into an abstract pathname.
The exact form of a file: URI is system-dependent, hence the transformation performed by this constructor is also system-dependent.

For a given abstract pathname f it is guaranteed that

new File(f.toURI()).equals(f.getAbsoluteFile())
so long as the original abstract pathname, the URI, and the new abstract pathname are all created in (possibly different invocations of) the same Java virtual machine. This relationship typically does not hold, however, when a file: URI that is created in a virtual machine on one operating system is converted into an abstract pathname in a virtual machine on a different operating system.

Parameters:
uri - An absolute, hierarchical URI with a scheme equal to "file", a non-empty path component, and undefined authority, query, and fragment components
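
The round-trip guarantee quoted above, and the requirement that the URI use the "file" scheme, can be checked with a few lines of plain JDK code. This is a minimal sketch; the class FileUriRoundTrip and its two methods are hypothetical names chosen for illustration, and the sohu.com address is only a sample non-file URI.

```java
import java.io.File;
import java.net.URI;

// Hypothetical demo class for the two File constructors described above.
final class FileUriRoundTrip {

    private FileUriRoundTrip() {
    }

    // Checks the documented guarantee:
    // new File(f.toURI()).equals(f.getAbsoluteFile())
    static boolean roundTrips(File f) {
        URI uri = f.toURI();
        return new File(uri).equals(f.getAbsoluteFile());
    }

    // Returns true if File(URI) accepts the URI, false if it rejects it
    // with an IllegalArgumentException (for example, a non-file scheme).
    static boolean acceptsAsFile(URI uri) {
        try {
            new File(uri);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File relative = new File("index.pdf");
        // toURI() resolves the relative name against the working directory
        System.out.println(relative.toURI());
        System.out.println(roundTrips(relative));
        System.out.println(acceptsAsFile(URI.create("http://www.sohu.com/index.pdf")));
    }
}
```

Neither method touches the file system beyond resolving the working directory, so index.pdf does not need to exist for the check to run.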

Many people say xpdf is better than PDFBox, but personally I still find PDFBox the more practical of the two!

OK!

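One more note: a URL (Uniform Resource Locator) is the address of a WWW page. From left to right it consists of the following parts:

　　·Internet resource type (scheme): indicates the protocol the WWW client uses to access the resource. For example, "http://" denotes a WWW server, "ftp://" an FTP server, "gopher://" a Gopher server, and "news:" a Usenet newsgroup.

　　·Server address (host): the domain name of the server that hosts the WWW page.

　　·Port (port): sometimes (but not always), access to a resource requires the port number on which the server provides it.

　　·Path (path): the location of a resource on the server. It is structured like a DOS path, usually directory/subdirectory/filename, but uses forward slashes. Like the port, the path is not always required.

　　The parts are arranged as scheme://host:port/path; for example, http://www.sohu.com/domain/HXWZ is a typical URL.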
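
The four parts listed above can be read back from an address with the JDK's java.net.URI class. The following is a minimal sketch; the class UrlParts and the method describe are hypothetical names used for illustration, and the port 8080 in the second call is an assumption added to show a URL that carries a port.

```java
import java.net.URI;

// Hypothetical demo class: splits an address into scheme, host, port, path.
final class UrlParts {

    private UrlParts() {
    }

    // Formats the components of the address; getPort() returns -1
    // when the address does not specify a port.
    static String describe(String address) {
        URI uri = URI.create(address);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", port=" + uri.getPort()
                + ", path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints: scheme=http, host=www.sohu.com, port=-1, path=/domain/HXWZ
        System.out.println(describe("http://www.sohu.com/domain/HXWZ"));
        // prints: scheme=http, host=www.sohu.com, port=8080, path=/domain/HXWZ
        System.out.println(describe("http://www.sohu.com:8080/domain/HXWZ"));
    }
}
```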
