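The invoices are in PDF format. My initial idea was to extract the PDF content and convert it to text. After some research, I found three suitable Python packages: PDFMiner, pdfminer3k, and Pdfminer.six.
Official description of PDFMiner: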
PDFMiner is a text extraction tool for PDF documents.
Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.
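In other words, PDFMiner is a tool for extracting text from PDF documents. As of 2020 it is largely unmaintained, and users are pointed to its active fork, Pdfminer.six.
Official description of pdfminer3k: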
pdfminer3k is a Python 3 port of pdfminer. PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows to obtain the exact location of texts in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis.
We had to forked this because original package was removed from pypi.
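In other words, pdfminer3k is a Python 3 port that was forked because the original package had been removed from PyPI. It extracts information from PDF documents, focusing on getting and analyzing text data, and it includes a PDF converter that can turn PDF files into other text formats such as HTML.
Official description of Pdfminer.six: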
Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the sourcecode of the PDF. It can also be used to get the exact location, font or color of the text.
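In other words, Pdfminer.six is a community-maintained fork of PDFMiner that extracts text information from PDF documents. Its distinguishing feature is that it reads text directly from the PDF's source code, and it can also report the location, font, and color of the text.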
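As a rough illustration of the Pdfminer.six features described above, here is a minimal sketch that extracts the plain text of a PDF and lists each text block with its position. It assumes Pdfminer.six is the installed variant. The helper names `pdf_to_text` and `iter_text_boxes` are made up for this post, and the import is done inside the functions only so the file still loads when the package is missing.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return all text in the PDF as one string (assumes pdfminer.six)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")
    # Imported lazily so this module can be loaded without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for every text block in the PDF.

    bbox is the (x0, y0, x1, y1) tuple reported by pdfminer.six,
    measured in PDF points from the bottom-left corner of the page.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(str(path)), start=1):
        for element in page_layout:
            # Only text containers carry text; figures and lines are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()
```

With an invoice saved as, say, `invoice.pdf` (a placeholder file name), `print(pdf_to_text("invoice.pdf"))` would dump the text, and looping over `iter_text_boxes("invoice.pdf")` would show where each block sits on the page, which may help when a field such as the total always appears in the same area.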
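The three tools offer essentially the same functionality: all of them extract text from PDF documents. PDFMiner came first, and pdfminer3k and Pdfminer.six are both forks of it. When installing in PyCharm, install only one of them, because many of their classes share the same names and the packages overwrite and conflict with one another.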
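To check whether more than one of these packages has ended up in the same interpreter, something like the following sketch could be run in the project's environment. It only uses the standard library (`importlib.metadata`, Python 3.8+). The distribution names in `CANDIDATES` are my assumption of how the three packages are registered on PyPI, and `installed_variants` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed PyPI distribution names of the three packages discussed above.
CANDIDATES = ("pdfminer", "pdfminer3k", "pdfminer.six")


def installed_variants(candidates=CANDIDATES):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            continue
    return found


if __name__ == "__main__":
    variants = installed_variants()
    if len(variants) > 1:
        print("More than one PDFMiner variant is installed:", variants)
        print("Consider uninstalling all but one to avoid conflicts.")
    else:
        print("Installed variants:", variants)
```

If more than one entry is reported, removing the extras (for example through PyCharm's package manager) and reinstalling the one you want should leave a clean copy.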