[Solved] Reading the contents of a PDF file with Python

东郭宏深

2023-12-01

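Foreword

Writing started: 2021-07-01 10:10:50

As the title says. There are many methods floating around online, but some of them do not work well, so here is one that does.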

Environment

Windows 10
conda
Python 3.8

Solution

I tried three approaches in total. The code for each is below:

import os

# Test file: the CCF recommended list of international conferences and journals (2019)
pdf_path = os.path.join("E:\\input", "中国计算机学会推荐国际学术会议和期刊目录-2019.pdf")

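# Approach 1: tika
# If tika is not installed, run `conda install tika` or `pip install tika`.
# tika-python talks to a local Apache Tika server, so a Java runtime must be available.
from tika import parser
file_data = parser.from_file(pdf_path)
text = file_data['content']  # text of the whole document
print(text)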

# Approach 2: pdfplumber
# If pdfplumber is not installed, run `conda install pdfplumber` or `pip install pdfplumber`.
import pdfplumber
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]  # only the first page is read here
    print(first_page.extract_text())

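# Approach 3: PyPDF2
# If PyPDF2 is not installed, run `conda install pypdf2` or `pip install pypdf2`.
# These are the PyPDF2 1.x names; PyPDF2 3.0 and later replaced them with
# PdfReader, reader.pages[0] and page.extract_text().
from PyPDF2 import PdfFileReader
with open(pdf_path, "rb") as open_file:  # the file is closed automatically
    reader = PdfFileReader(open_file)  # renamed from `input`, which shadows the builtin
    page = reader.getPage(0)  # only the first page is read here
    page_content = page.extractText()
print(page_content)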
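
Approaches 2 and 3 above only read the first page. A minimal sketch of reading every page with pdfplumber might look like the following. The helper names `join_pages` and `read_all_pages` and the page marker format are my own choices for illustration, not part of pdfplumber.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, putting a marker line before each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # extract_text() can return None for a page without a text layer
        parts.append(f"--- page {number} ---\n{text or ''}")
    return "\n".join(parts)


def read_all_pages(pdf_path: str) -> str:
    """Extract text from every page of the PDF (hypothetical helper)."""
    import pdfplumber  # imported here so join_pages works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed inside the `with`, while the file is open
        return join_pages(page.extract_text() for page in pdf.pages)
```

Calling `read_all_pages(pdf_path)` with the path defined earlier should give the whole document as one string.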

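Of the three approaches, I think approach 1 (tika) is the best:

The extracted text is complete.
It can read table data.

Approach 2 (pdfplumber) comes second:

The extracted text is complete,
but it does not parse table data very well.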
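
For tables, pdfplumber also has a dedicated `page.extract_tables()` call, which returns each table as a list of rows of cell strings (or `None` for empty cells). I did not compare it in the test above, so treat this as a rough sketch only. The helpers `table_to_tsv` and `extract_tables_as_tsv` are hypothetical names made up for this example.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def table_to_tsv(table: Table) -> str:
    """Render one extracted table as tab-separated lines."""
    lines = []
    for row in table:
        # Empty cells come back as None; line breaks inside a cell become spaces
        cells = [(cell or "").replace("\n", " ") for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)


def extract_tables_as_tsv(pdf_path: str, page_index: int = 0) -> List[str]:
    """Return every table found on one page as a TSV string (hypothetical helper)."""
    import pdfplumber  # imported here so table_to_tsv works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return [table_to_tsv(table) for table in page.extract_tables()]
```

Whether this does better than plain `extract_text()` depends on how the tables are drawn in the PDF.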

Approach 3 (PyPDF2) is the worst:

The extracted text is incomplete.
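
Given this ranking, one option is to try the libraries in order of preference and keep the first result that is not empty. This is only a sketch under that assumption. `first_non_empty` and the two wrapper functions are names I made up for illustration, and any failure (including a library that is not installed) simply moves on to the next extractor.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_tika(pdf_path: str) -> Optional[str]:
    from tika import parser  # lazy import: only needed if this extractor runs

    return parser.from_file(pdf_path)["content"]


def extract_with_pdfplumber(pdf_path: str) -> Optional[str]:
    import pdfplumber  # lazy import: only needed if this extractor runs

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def first_non_empty(pdf_path: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Return the first non-blank text produced by the extractors, in order."""
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            # Missing library, unreadable file, etc.: try the next extractor
            continue
        if text and text.strip():
            return text
    return None
```

Usage would be something like `first_non_empty(pdf_path, [extract_with_tika, extract_with_pdfplumber])`, which returns `None` if every extractor fails or produces only whitespace.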

Summary

That is all.

Writing finished: 2021-07-01 10:25:58

References

Extract text from PDF File using Python https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/
How to Extract Text from PDF: Learn to use Python to extract text from PDFs https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7 (I found this article quite interesting. I think it is of a high standard, and it is well written.)
PDF To Text Python – Extract Text From PDF Documents Using PyPDF2 Module https://www.simplifiedpython.net/pdf-to-text-python-extract-text-from-pdf-documents-using-pypdf2-module/
PDF Text Extraction in Python https://towardsdatascience.com/pdf-text-extraction-in-python-5b6ab9e92dd
The PdfFileReader Class https://pythonhosted.org/PyPDF2/PdfFileReader.html
How to extract text from a PDF file? https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file
How can I read pdf in python? [duplicate] https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
HOW TO READ PDF FILES WITH PYTHON http://theautomatic.net/2020/01/21/how-to-read-pdf-files-with-python/

I consulted a lot of references, but few of them were useful.
