PyPDF2: reading and saving PDF files
PyPDF2 is a pure-Python library for working with PDF files. We can use the PyPDF2 module to read and manipulate existing PDF files. It is not meant for authoring new PDF content from scratch.
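The PyPDF2 features covered in this article are: reading document metadata, extracting page text, rotating pages, merging files, splitting files, and extracting images.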
We can use pip to install the PyPDF2 module.
$ pip install PyPDF2
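Let’s look at some examples of working with PDF files using the PyPDF2 module. The examples use the PyPDF2 1.x class names (PdfFileReader, PdfFileWriter, PdfFileMerger); PyPDF2 3.0 and later replaced them with PdfReader, PdfWriter, and PdfMerger.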
We can get the number of pages in the PDF file. We can also read metadata such as the author, the creator application, and the creation date.
import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    print(f'Number of Pages in PDF File is {pdf_reader.getNumPages()}')
    print(f'PDF Metadata is {pdf_reader.documentInfo}')
    print(f'PDF File Author is {pdf_reader.documentInfo["/Author"]}')
    print(f'PDF File Creator is {pdf_reader.documentInfo["/Creator"]}')
Sample Output:
Number of Pages in PDF File is 2
PDF Metadata is {'/Author': 'Microsoft Office User', '/Creator': 'Microsoft Word', '/CreationDate': "D:20191009091859+00'00'", '/ModDate': "D:20191009091859+00'00'"}
PDF File Author is Microsoft Office User
PDF File Creator is Microsoft Word
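The /CreationDate and /ModDate values in the sample output are PDF date strings, not Python datetime objects. As a minimal sketch, the hypothetical helper parse_pdf_date below (it is not part of PyPDF2) converts the full form shown above into a timezone-aware datetime. It assumes all six date and time fields are present and treats a missing offset as UTC; shorter date forms are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper, not a PyPDF2 function.
# Matches "D:YYYYMMDDHHmmSS" followed by an optional offset such as +00'00'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Parse a full-length PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, tz_hour, tz_minute = groups[6:]
    # Assumption: a missing offset (or "Z") is treated as UTC.
    offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_minute or 0))
    if sign == "-":
        offset = -offset
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))


created = parse_pdf_date("D:20191009091859+00'00'")
print(created.isoformat())  # prints 2019-10-09T09:18:59+00:00
```

The string passed in the last lines is the /CreationDate value from the sample output above.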
We can use the getNumPages() function to get the number of pages in the PDF file. An alternative is to use the numPages attribute.
We can also extract the text of a page with the extractText() method of the page object.
import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # printing first page contents
    pdf_page = pdf_reader.getPage(0)
    print(pdf_page.extractText())

    # reading all the pages content one by one
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        print(pdf_page.extractText())
The getPage(int) method of PdfFileReader returns a PyPDF2.pdf.PageObject instance.
PyPDF2 supports many kinds of page-by-page manipulation. For example, we can rotate a page clockwise or counter-clockwise by an angle that is a multiple of 90 degrees.
import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()

    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(90)  # rotateCounterClockwise()
        pdf_writer.addPage(pdf_page)

    with open('Python_Tutorial_rotated.pdf', 'wb') as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)
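A PDF page stores its rotation as an angle in multiples of 90 degrees. The sketch below is plain arithmetic, not PyPDF2 code: the hypothetical function new_rotation shows one way a clockwise or counter-clockwise turn could be combined with an existing rotation and normalized to 0, 90, 180, or 270. Whether PyPDF2 normalizes the stored value in the same way is not something this sketch claims.

```python
def new_rotation(current, angle, clockwise=True):
    """Return the rotation after turning a page by `angle` degrees.

    Hypothetical helper for illustration only; `current` is the rotation the
    page already has, and both values must be multiples of 90.
    """
    if current % 90 != 0 or angle % 90 != 0:
        raise ValueError("rotation angles must be multiples of 90 degrees")
    delta = angle if clockwise else -angle
    # Python's % always returns a non-negative result for a positive modulus,
    # so counter-clockwise turns wrap around correctly.
    return (current + delta) % 360


print(new_rotation(0, 90))                   # prints 90
print(new_rotation(270, 90))                 # prints 0
print(new_rotation(0, 90, clockwise=False))  # prints 270
```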
We can merge multiple PDF files into a single file with PdfFileMerger.
import PyPDF2

pdf_merger = PyPDF2.PdfFileMerger()
pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']

for pdf_file_name in pdf_files_list:
    # each source file is closed as soon as this with block ends
    with open(pdf_file_name, 'rb') as pdf_file:
        pdf_merger.append(pdf_file)

with open('Python_Tutorial_merged.pdf', 'wb') as pdf_file_merged:
    pdf_merger.write(pdf_file_merged)
The code above looks correct, but it produced an empty PDF file. The reason is that each source PDF file was closed before the final write() call that creates the merged PDF file.
It’s a bug in the version of PyPDF2 that was the latest when this article was written. You can read about it in this GitHub issue.
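The general failure mode, reading from a source only after it has been closed, can be reproduced without PyPDF2. In the sketch below, LazyJoiner is a hypothetical stand-in that stores references in append() and reads them only in write(); it is not how PdfFileMerger is implemented. Note that this stand-in fails with an exception, whereas the article reports that PyPDF2 produced an empty file.

```python
import io


class LazyJoiner:
    """Hypothetical stand-in for a merger that defers reading until write()."""

    def __init__(self):
        self._sources = []

    def append(self, stream):
        # Only a reference is stored; no bytes are read yet.
        self._sources.append(stream)

    def write(self, out):
        # The sources are read here, so they must still be open.
        for stream in self._sources:
            out.write(stream.read())


joiner = LazyJoiner()
with io.BytesIO(b"page data") as source:
    joiner.append(source)
# The with block has already closed `source` before write() runs.

try:
    joiner.write(io.BytesIO())
    outcome = "written"
except ValueError as error:
    # Reading a closed stream raises ValueError.
    outcome = f"failed: {error}"

print(outcome)
```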
An alternative approach is to use the contextlib module to keep the source files open until the write operation is done.
import contextlib
import PyPDF2

pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']

with contextlib.ExitStack() as stack:
    pdf_merger = PyPDF2.PdfFileMerger()
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdf_files_list]
    for f in files:
        pdf_merger.append(f)
    with open('Python_Tutorial_merged_contextlib.pdf', 'wb') as f:
        pdf_merger.write(f)
You can read more about it at this StackOverflow Question.
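To see what contextlib.ExitStack does on its own, here is a small sketch that uses only the standard library. The two throwaway files and their names are made up for the demonstration; the point is that every file registered with enter_context() stays open until the ExitStack block ends.

```python
import contextlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small throwaway files (hypothetical names) to open later.
    paths = []
    for name in ("first.bin", "second.bin"):
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as handle:
            handle.write(name.encode("ascii"))
        paths.append(path)

    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(path, "rb")) for path in paths]
        # All registered files are still open at this point.
        open_inside = [not f.closed for f in files]
        combined = b"".join(f.read() for f in files)

    # Leaving the ExitStack block closed every registered file.
    closed_after = [f.closed for f in files]

print(open_inside)   # prints [True, True]
print(closed_after)  # prints [True, True]
print(combined)      # prints b'first.binsecond.bin'
```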
We can also split a PDF file into separate files, one per page.
import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    for i in range(pdf_reader.numPages):
        # a new writer per page, so each output file holds exactly one page
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(i))
        output_file_name = f'Python_Tutorial_{i}.pdf'
        with open(output_file_name, 'wb') as output_file:
            pdf_writer.write(output_file)
Python_Tutorial.pdf has 2 pages, so the output files are named Python_Tutorial_0.pdf and Python_Tutorial_1.pdf.
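The split example hard-codes the output name prefix. As a small sketch, the hypothetical helper page_file_names below derives the same names from the input file name and the page count, so the naming scheme follows whichever file is being split. It only builds strings and does not touch any PDF.

```python
from pathlib import Path


def page_file_names(source, page_count):
    """Return one output name per page, e.g. name_0.pdf, name_1.pdf.

    Hypothetical helper for illustration; it mirrors the zero-based
    numbering used in the split example.
    """
    source = Path(source)
    return [
        str(source.with_name(f"{source.stem}_{index}{source.suffix}"))
        for index in range(page_count)
    ]


print(page_file_names("Python_Tutorial.pdf", 2))
# prints ['Python_Tutorial_0.pdf', 'Python_Tutorial_1.pdf']
```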
We can use PyPDF2 along with Pillow (Python Imaging Library) to extract images from the PDF pages and save them as image files.
First of all, you will have to install the Pillow module using the following command.
$ pip install Pillow
Here is a simple program to extract images from the first page of the PDF file. It can be extended to extract the images from every page by looping over all the pages.
import PyPDF2
from PIL import Image

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # extracting images from the 1st page
    page0 = pdf_reader.getPage(0)

    if '/XObject' in page0['/Resources']:
        xObject = page0['/Resources']['/XObject'].getObject()

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"

                if '/Filter' in xObject[obj]:
                    if xObject[obj]['/Filter'] == '/FlateDecode':
                        img = Image.frombytes(mode, size, data)
                        img.save(obj[1:] + ".png")
                    elif xObject[obj]['/Filter'] == '/DCTDecode':
                        img = open(obj[1:] + ".jpg", "wb")
                        img.write(data)
                        img.close()
                    elif xObject[obj]['/Filter'] == '/JPXDecode':
                        img = open(obj[1:] + ".jp2", "wb")
                        img.write(data)
                        img.close()
                    elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                        img = open(obj[1:] + ".tiff", "wb")
                        img.write(data)
                        img.close()
                else:
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
    else:
        print("No image found.")
My sample PDF file has a PNG image on the first page and the program saved it with an “image20.png” filename.
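The if/elif chain in the program maps each /Filter value to a file extension. As a sketch of the same mapping in table form, the hypothetical helper image_file_name below returns the output file name for an image object name and its filter. It handles only the four filters used in the program, falls back to .png when there is no /Filter entry, and returns None for any other filter, which mirrors the program silently skipping such images.

```python
# Extension chosen for each /Filter value handled by the program above.
FILTER_EXTENSIONS = {
    "/FlateDecode": ".png",
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/CCITTFaxDecode": ".tiff",
}


def image_file_name(object_name, image_filter=None):
    """Return the output file name for an image XObject, or None if unhandled.

    Hypothetical helper for illustration. `object_name` is the XObject key,
    such as '/image20'; `image_filter` is its /Filter value, or None when the
    image has no /Filter entry.
    """
    base_name = object_name[1:]  # drop the leading '/', as obj[1:] does above
    if image_filter is None:
        return base_name + ".png"
    extension = FILTER_EXTENSIONS.get(image_filter)
    if extension is None:
        return None
    return base_name + extension


print(image_file_name("/image20", "/FlateDecode"))  # prints image20.png
print(image_file_name("/image20", "/DCTDecode"))    # prints image20.jpg
print(image_file_name("/image20"))                  # prints image20.png
```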
Translated from: https://www.journaldev.com/33281/pypdf2-python-library-for-pdf-files