PyPDF2: A Python Library for PDF File Operations

窦哲彦

2023-12-01

PyPDF2 is a pure-Python library for working with PDF files. We can use the PyPDF2 module to work with existing PDF files, but we can't use it to create a new PDF file from scratch.

PyPDF2 Features

Some of the most useful features of the PyPDF2 module are:

Reading PDF metadata such as the number of pages, author, creator, and the creation and last-modified times.
Extracting the content of a PDF file page by page.
Merging multiple PDF files.
Rotating PDF pages by an angle.
Scaling PDF pages.
Extracting images from PDF pages and saving them as image files using the Pillow library.

Installing PyPDF2 Module

We can use pip to install the PyPDF2 module.

$ pip install PyPDF2
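The examples in this article use the camelCase names (PdfFileReader, getPage(), extractText() and so on). To the best of my knowledge, PyPDF2 3.0 and later dropped these names in favor of PdfReader, reader.pages and page.extract_text(), so it can help to confirm which version pip installed. The sketch below uses only the standard library; the helper names installed_version and uses_legacy_api are hypothetical and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def uses_legacy_api(version):
    """Assumption: major versions below 3 still accept the camelCase names."""
    major = int(version.split(".")[0])
    return major < 3


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed")
else:
    print(f"PyPDF2 {version}, legacy API expected: {uses_legacy_api(version)}")
```

If the installed version turns out to be too new for the examples, pinning an older release with pip is one option.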

PyPDF2 Examples

Let's look at some examples of working with PDF files using the PyPDF2 module.

1. Extracting PDF Metadata

We can get the number of pages in the PDF file. We can also get information about the PDF author, the creator application, and the creation date.

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    print(f'Number of Pages in PDF File is {pdf_reader.getNumPages()}')
    print(f'PDF Metadata is {pdf_reader.documentInfo}')
    print(f'PDF File Author is {pdf_reader.documentInfo["/Author"]}')
    print(f'PDF File Creator is {pdf_reader.documentInfo["/Creator"]}')

Sample Output:

Number of Pages in PDF File is 2
PDF Metadata is {'/Author': 'Microsoft Office User', '/Creator': 'Microsoft Word', '/CreationDate': "D:20191009091859+00'00'", '/ModDate': "D:20191009091859+00'00'"}
PDF File Author is Microsoft Office User
PDF File Creator is Microsoft Word

Recommended Readings: Python with Statement and Python f-strings

The PDF file should be opened in binary mode. That's why the file opening mode is passed as 'rb'.
The PdfFileReader class is used to read the PDF file.
The documentInfo attribute is a dictionary that contains the metadata of the PDF file.
We can get the number of pages in the PDF file using the getNumPages() method. An alternative is to use the numPages attribute.
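The /CreationDate and /ModDate values in the sample output are plain strings in the PDF date format (D:YYYYMMDDHHmmSS followed by an optional UTC offset). The following is a minimal sketch of turning such a string into a datetime object using only the standard library. The parse_pdf_date helper is hypothetical, not a PyPDF2 function, and it assumes the string follows the format shown above.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (timezone-aware if an offset is given)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"Not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, tz_hours, tz_minutes = match.groups()

    tz = None
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        tz = timezone(offset if sign == "+" else -offset)

    # Missing fields fall back to the first month/day and to midnight
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )


created = parse_pdf_date("D:20191009091859+00'00'")
print(created.isoformat())
```

In the metadata example, the input would be pdf_reader.documentInfo["/CreationDate"], provided the PDF actually contains that key.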

2. Extracting Text of PDF Pages

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # printing first page contents
    pdf_page = pdf_reader.getPage(0)
    print(pdf_page.extractText())

    # reading all the pages content one by one
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        print(pdf_page.extractText())

The PdfFileReader getPage(int) method returns a PyPDF2.pdf.PageObject instance.
We can call the extractText() method on the page object to get the text content of the page.
The extractText() method does not return any binary data such as images.
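The page-by-page loop above can be wrapped in a small helper that collects the text of every page into one string. This is only a sketch: extract_all_text is a hypothetical helper, it relies on the same numPages, getPage() and extractText() names used in this article, and the FakePage and FakeReader classes are stand-ins so the block runs without a real PDF file. Pages that hold only images are assumed to yield an empty string.

```python
def extract_all_text(pdf_reader, separator="\f"):
    """Join the text of every page, separated by a form feed by default."""
    pages_text = []
    for page_num in range(pdf_reader.numPages):
        text = pdf_reader.getPage(page_num).extractText()
        # Treat a missing result as an empty page
        pages_text.append(text or "")
    return separator.join(pages_text)


class FakePage:
    """Stand-in for a page object (illustration only)."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Stand-in for PdfFileReader (illustration only)."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, page_num):
        return self._pages[page_num]


print(extract_all_text(FakeReader(["first page", "", "third page"]), separator="\n---\n"))
```

With a real file, the pdf_reader created inside the with block would be passed in place of FakeReader, and the call must happen while the file is still open.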

3. Rotate PDF File Pages

PyPDF2 supports many kinds of page-by-page manipulation. We can rotate a page clockwise or counter-clockwise by an angle.

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()

    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(90)  # rotateCounterClockwise()

        pdf_writer.addPage(pdf_page)

    with open('Python_Tutorial_rotated.pdf', 'wb') as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)

The PdfFileWriter class is used to write a PDF file built from the source PDF.
We use the rotateClockwise(90) method to rotate each page clockwise by 90 degrees.
We add the rotated pages to the PdfFileWriter instance.
Finally, the write() method of the PdfFileWriter is used to produce the rotated PDF file.

The PdfFileWriter can write PDF files from pages of existing PDF files. We can't use it to create a PDF file from text data.
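PDF page rotation is stored in steps of 90 degrees, so an angle such as 45 cannot be applied to a page. The sketch below validates an angle and expresses a counter-clockwise rotation as the equivalent clockwise one before it is passed to rotateClockwise(). The clockwise_angle helper is hypothetical and not part of PyPDF2.

```python
def clockwise_angle(angle, counter_clockwise=False):
    """Return the equivalent clockwise rotation in the range 0 to 270."""
    if angle % 90 != 0:
        raise ValueError("PDF page rotation must be a multiple of 90 degrees")
    if counter_clockwise:
        angle = -angle
    # Python's % always returns a non-negative result for a positive modulus
    return angle % 360


print(clockwise_angle(90))                          # quarter turn clockwise
print(clockwise_angle(90, counter_clockwise=True))  # same as 270 clockwise
print(clockwise_angle(450))                         # full turn plus a quarter
```

In the rotation example above, the call would then read pdf_page.rotateClockwise(clockwise_angle(90)).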

4. Merge PDF Files

import PyPDF2

pdf_merger = PyPDF2.PdfFileMerger()
pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']

for pdf_file_name in pdf_files_list:
    with open(pdf_file_name, 'rb') as pdf_file:
        pdf_merger.append(pdf_file)

with open('Python_Tutorial_merged.pdf', 'wb') as pdf_file_merged:
    pdf_merger.write(pdf_file_merged)

The code above looks like it should merge the PDF files, but it produced an empty PDF file. The reason is that the source PDF files were closed before the write that creates the merged PDF file actually happened.

This is a bug in the latest version of PyPDF2 at the time of writing, and it has been reported in a GitHub issue.

An alternative approach is to use the contextlib module to keep the source files open until the write operation is done.

import contextlib
import PyPDF2

pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']

with contextlib.ExitStack() as stack:
    pdf_merger = PyPDF2.PdfFileMerger()
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdf_files_list]
    for f in files:
        pdf_merger.append(f)
    with open('Python_Tutorial_merged_contextlib.pdf', 'wb') as f:
        pdf_merger.write(f)

This approach is discussed in more detail in a StackOverflow question.
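To see why ExitStack helps here, the following sketch shows the same pattern without PyPDF2: every file registered with enter_context() stays open until the with block ends, and all of them are closed together afterwards. The file names and the open_all helper are made up for the illustration.

```python
import contextlib
import os
import tempfile


def open_all(paths, stack):
    """Open every path in binary mode and register it with the ExitStack."""
    return [stack.enter_context(open(path, "rb")) for path in paths]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small throwaway files to stand in for the source PDFs
    paths = []
    for name in ("a.bin", "b.bin"):
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(name.encode())
        paths.append(path)

    with contextlib.ExitStack() as stack:
        files = open_all(paths, stack)
        # All files are still open here, so a deferred read would succeed
        open_inside = [f.closed for f in files]
        combined = b"".join(f.read() for f in files)

    # Leaving the ExitStack block closed every registered file
    closed_after = [f.closed for f in files]

print(open_inside, closed_after, combined)
```

The merger's write() call plays the role of the deferred read, which is why it has to sit inside the ExitStack block.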

5. Split PDF Files into Single-Page Files

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    for i in range(pdf_reader.numPages):
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(i))
        output_file_name = f'Python_Tutorial_{i}.pdf'
        with open(output_file_name, 'wb') as output_file:
            pdf_writer.write(output_file)

Python_Tutorial.pdf has 2 pages, so the output files are named Python_Tutorial_0.pdf and Python_Tutorial_1.pdf.
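For larger documents it can be convenient to derive the output names from the source file name and to zero-pad the page number so the files sort in page order. This is a small sketch with a hypothetical helper, page_file_names, which returns file names only (without any directory part) and does not touch PyPDF2.

```python
from pathlib import Path


def page_file_names(source, page_count, start=0):
    """Build one output file name per page, e.g. report_01.pdf ... report_12.pdf."""
    source = Path(source)
    # Pad to the width of the largest page number that will be used
    width = len(str(start + page_count - 1)) if page_count else 1
    return [
        f"{source.stem}_{number:0{width}d}{source.suffix}"
        for number in range(start, start + page_count)
    ]


print(page_file_names("Python_Tutorial.pdf", 2))
print(page_file_names("docs/report.pdf", 12, start=1))
```

With start=0 and a two-page file this reproduces the names used in the example above.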

6. Extracting Images from PDF Files

We can use PyPDF2 along with Pillow (the maintained fork of the Python Imaging Library) to extract images from PDF pages and save them as image files.

First of all, install the Pillow module using the following command.

$ pip install Pillow

Here is a simple program to extract images from the first page of the PDF file. It can easily be extended to extract all the images from the PDF file.

import PyPDF2
from PIL import Image

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # extracting images from the 1st page
    page0 = pdf_reader.getPage(0)

    if '/XObject' in page0['/Resources']:
        xObject = page0['/Resources']['/XObject'].getObject()

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"

                if '/Filter' in xObject[obj]:
                    if xObject[obj]['/Filter'] == '/FlateDecode':
                        img = Image.frombytes(mode, size, data)
                        img.save(obj[1:] + ".png")
                    elif xObject[obj]['/Filter'] == '/DCTDecode':
                        # JPEG data: write the bytes out as they are
                        with open(obj[1:] + ".jpg", "wb") as img:
                            img.write(data)
                    elif xObject[obj]['/Filter'] == '/JPXDecode':
                        # JPEG 2000 data: write the bytes out as they are
                        with open(obj[1:] + ".jp2", "wb") as img:
                            img.write(data)
                    elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                        with open(obj[1:] + ".tiff", "wb") as img:
                            img.write(data)
                else:
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
    else:
        print("No image found.")

My sample PDF file has a PNG image on the first page, and the program saved it as image20.png.
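The chain of if/elif checks in the program above can also be written as two lookup tables, one for the Pillow mode and one for the output file extension. The sketch below is a possible refactoring, not PyPDF2 functionality: pillow_mode and output_name are hypothetical helpers, the /DeviceGray and /DeviceCMYK entries are additions that the original program does not handle, and anything unrecognized falls back to "P" as in the original.

```python
# Assumed mapping from PDF device color spaces to Pillow mode strings
COLOR_SPACE_TO_MODE = {
    "/DeviceRGB": "RGB",
    "/DeviceGray": "L",
    "/DeviceCMYK": "CMYK",
}

# Same filter-to-extension choices as the program above
FILTER_TO_EXTENSION = {
    "/FlateDecode": ".png",
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/CCITTFaxDecode": ".tiff",
}


def pillow_mode(color_space):
    """Pick a Pillow mode; array-valued or unknown color spaces fall back to 'P'."""
    if isinstance(color_space, str):
        return COLOR_SPACE_TO_MODE.get(color_space, "P")
    return "P"


def output_name(xobject_name, pdf_filter=None):
    """Return the output file name, or None when the filter is not handled."""
    if pdf_filter is None:
        # No filter: the program above saves the raw pixels as PNG
        return xobject_name[1:] + ".png"
    extension = FILTER_TO_EXTENSION.get(pdf_filter)
    if extension is None:
        return None
    return xobject_name[1:] + extension


print(pillow_mode("/DeviceRGB"), output_name("/image20", "/FlateDecode"))
```

Inside the loop, the inputs would be xObject[obj]['/ColorSpace'], obj and xObject[obj].get('/Filter'), assuming those keys are present for the image.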

References

Translated from: https://www.journaldev.com/33281/pypdf2-python-library-for-pdf-files