Benchmarking PDF-to-Image Conversion Speed with PyMuPDF

苏嘉志

2023-12-01

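This code reads a PDF file and converts each page image to np.array format so that paddleocr can read it, and it benchmarks how fast the conversion is.
Required packages: paddleocr, pyinstrument, pymupdf, memory_profiler
After a reply from the PyMuPDF developers I learned a more efficient method: pix.samples_mv gives direct access to the pixmap's memory (it "is a memoryview to that internal area (without copying)", see the GitHub link). The speedup is considerable: the conversion drops from the millisecond range to the microsecond range, roughly 3000 times faster than going through pix.samples.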
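The zero-copy idea can be sketched as a small helper. This is an illustration, not code from the original tests: the name pixmap_to_array is hypothetical, and a SimpleNamespace backed by a bytearray stands in for a real fitz.Pixmap so that the sketch runs without a PDF. It reads the channel count from pix.n instead of hard-coding 3, and it shows that the resulting array shares memory with the buffer, which is why an explicit copy is the safer choice if the array has to outlive the pixmap.

```python
from types import SimpleNamespace

import numpy as np


def pixmap_to_array(pix):
    """Wrap a pixmap's pixel buffer as a (height, width, channels) uint8 array.

    No pixel data is copied: the array is a view onto pix.samples_mv.
    """
    # pix.n is the number of bytes per pixel (3 for RGB, 4 for RGBA),
    # so the channel count does not have to be hard-coded.
    flat = np.frombuffer(pix.samples_mv, dtype=np.uint8)
    return flat.reshape(pix.height, pix.width, pix.n)


# Hypothetical stand-in for a fitz.Pixmap: a 2x3 RGB image in a bytearray.
buf = bytearray(range(2 * 3 * 3))
fake_pix = SimpleNamespace(width=3, height=2, n=3, samples_mv=memoryview(buf))

arr = pixmap_to_array(fake_pix)   # view, shares memory with buf
safe = arr.copy()                 # independent copy that owns its data

buf[0] = 255                      # change the underlying buffer
print(arr.shape)                  # shape is (height, width, channels)
print(arr[0, 0, 0], safe[0, 0, 0])  # the view sees the change, the copy does not
```

With a real document the call would look like pixmap_to_array(page.get_pixmap(dpi=300)), assuming the pixmap object is kept alive for as long as the view is in use.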

Test results:

images = []
pixs = [page.get_pixmap(dpi=300) for page in doc]
%timeit [np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]

    5.22 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]

    15.7 ms ± 77.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit [np.array(Image.frombytes("RGB", (pix.width, pix.height), pix.samples), dtype=np.uint8) for pix in pixs]

    105 ms ± 4.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG")))) for pix in pixs]

    179 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR) for pix in pixs]

    182 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR) for pix in pixs]

    394 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
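As a quick check of the "roughly 3000 times" figure, the mean times reported above can be divided by the pix.samples_mv time. The sketch below only does arithmetic on those published numbers; the dictionary keys are shorthand labels chosen here, not identifiers from the tests.

```python
# Mean times copied from the %timeit output above, converted to seconds.
timings = {
    "frombuffer(pix.samples_mv)": 5.22e-6,
    "frombuffer(pix.samples)": 15.7e-3,
    "np.array(Image.frombytes)": 105e-3,
    "Image.open(pil_tobytes JPEG)": 179e-3,
    "cv2.imdecode(pil_tobytes JPEG)": 182e-3,
    "cv2.imdecode(pix.tobytes)": 394e-3,
}

baseline = timings["frombuffer(pix.samples_mv)"]

# How many times slower each method is than the memoryview approach.
slowdowns = {name: t / baseline for name, t in timings.items()}

for name, factor in slowdowns.items():
    print(f"{name:32s} {factor:>10,.0f}x")
```

Keep in mind that the pix.samples_mv line only creates views, so it measures the cost of wrapping existing memory rather than of moving pixel data.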

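The test machine is an i7-8700 and the PDF is an arbitrarily chosen 110 KB file; larger files will be somewhat slower. If the dpi passed to get_pixmap is set too high, the generated images become very large.
Memory consumption also drops somewhat: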

%load_ext memory_profiler
%memit images = [np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
   peak memory: 346.82 MiB, increment: 0.07 MiB
%memit [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
   peak memory: 396.67 MiB, increment: 49.83 MiB
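To put the earlier remark about get_pixmap and dpi into numbers, here is a rough estimate of the raw size of one rendered page. It is a sketch under assumptions: the helper name pixmap_bytes is made up for this example, the page is taken to be A4 (595 x 842 points; the test PDF's page size is not stated), and the rounding may differ slightly from what PyMuPDF does internally.

```python
import math


def pixmap_bytes(width_pt, height_pt, dpi, channels=3):
    """Estimate the raw size in bytes of one rendered page."""
    zoom = dpi / 72  # PDF user space has 72 points per inch
    width_px = math.ceil(width_pt * zoom)
    height_px = math.ceil(height_pt * zoom)
    return width_px * height_px * channels


# Assumed A4 page size in points.
A4 = (595, 842)

for dpi in (72, 150, 300, 600):
    size_mib = pixmap_bytes(*A4, dpi) / 2**20
    print(f"dpi={dpi:4d}: about {size_mib:6.1f} MiB per RGB page")
```

Because both dimensions scale with dpi, doubling the dpi roughly quadruples the memory per page.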

The content below can be skipped: it was written earlier and the tests are not rigorous.

import io

import cv2
import fitz
import numpy as np
from PIL import Image
from paddleocr import PaddleOCR
from pyinstrument import Profiler
from memory_profiler import profile

pdf_file = "./测试文档.pdf"
doc = fitz.open(pdf_file)
ocr = PaddleOCR(use_angle_cls=True, use_gpu=False,lang="ch")

# Decorator that profiles a function's run time
def test(func):
    def _call():
        profiler = Profiler()
        profiler.start()

        func()
        profiler.stop()
        print(profiler.output_text(unicode=True, color=True))
    return _call


@test
# @profile
def test1():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)  # dpi=300 proved a suitable size in testing; larger values make the image too large
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        image = np.array(image, dtype=np.uint8)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]  # confirm that images can be read by ocr

@test
# @profile
def test2():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]

@test
# @profile
def test3():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]

@test
# @profile
def test4():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]

@test
# @profile
def test5():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG"))))
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]

@test
# @profile
def test2_Comprehensions():
    images = []
    pixs = [page.get_pixmap(dpi=300) for page in doc]
    images = [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
    # A list comprehension can improve efficiency


test1()
test2()
test3()
test4()
test5()
test2_Comprehensions()
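Each test above is profiled over a single run that also includes get_pixmap, so rendering time is mixed into every result. One possible way to tighten this is to render the pixmaps once beforehand and time only the conversion over repeated runs. The sketch below uses the standard library's timeit.repeat; the bench helper and the stand-in workload (copying a buffer versus wrapping it in a memoryview) are illustrative and not part of the original script.

```python
import timeit


def bench(func, repeat=5, number=10):
    """Return the best per-call time in seconds over several timed batches."""
    batches = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(batches) / number


# Stand-in workload: replace these lambdas with the conversion under test,
# applied to pixmaps that were rendered once before timing starts.
payload = bytearray(1_000_000)

copy_time = bench(lambda: bytes(payload))       # copies 1 MB on every call
view_time = bench(lambda: memoryview(payload))  # wraps the buffer, no copy

print(f"copy: {copy_time * 1e6:.2f} µs per call")
print(f"view: {view_time * 1e6:.2f} µs per call")
```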

Timing results:

  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:10:09  Samples:  112
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.391     CPU time: 0.391
/   _/                      v4.1.1

Program: I:/ocr_test/性能测试.py

0.381 _call  性能测试.py:17
├─ 0.374 test1  性能测试.py:27
│  ├─ 0.128 get_pixmap  fitz\utils.py:812
│  │     [7 frames hidden]  fitz, <built-in>
│  │        0.125 DisplayList_get_pixmap  <built-in>:0
│  ├─ 0.126 __array__  PIL\Image.py:705
│  │     [17 frames hidden]  PIL, <built-in>
│  ├─ 0.056 frombytes  PIL\Image.py:2788
│  │     [7 frames hidden]  PIL, <built-in>
│  ├─ 0.038 array  <built-in>:0
│  │     [2 frames hidden]  <built-in>
│  ├─ 0.023 samples  fitz\fitz.py:7468
│  │     [2 frames hidden]  fitz
│  └─ 0.005 [self]  
└─ 0.007 [self]  



  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:10:10  Samples:  15
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.141     CPU time: 0.156
/   _/                      v4.1.1

Program: I:/ocr_test/性能测试.py

0.141 _call  性能测试.py:17
├─ 0.136 test2  性能测试.py:38
│  ├─ 0.109 get_pixmap  fitz\utils.py:812
│  │     [4 frames hidden]  fitz, <built-in>
│  │        0.109 DisplayList_get_pixmap  <built-in>:0
│  ├─ 0.024 samples  fitz\fitz.py:7468
│  │     [2 frames hidden]  fitz
│  └─ 0.003 __del__  fitz\fitz.py:7494
│        [3 frames hidden]  fitz, <built-in>
└─ 0.004 [self]  



  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:10:10  Samples:  56
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.611     CPU time: 0.609
/   _/                      v4.1.1

Program: I:/ocr_test/性能测试.py

0.607 _call  性能测试.py:17
└─ 0.607 test3  性能测试.py:48
   ├─ 0.296 imdecode  <built-in>:0
   │     [2 frames hidden]  <built-in>
   ├─ 0.196 pil_tobytes  fitz\fitz.py:7279
   │     [33 frames hidden]  fitz, PIL, <built-in>, ntpath, generi...
   └─ 0.113 get_pixmap  fitz\utils.py:812
         [4 frames hidden]  fitz, <built-in>



  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:10:10  Samples:  21
 /_//_/// /_\ / //_// / //_'/ //     Duration: 1.545     CPU time: 1.531
/   _/                      v4.1.1

性能测试.py

1.540 _call  性能测试.py:17
└─ 1.535 test4  性能测试.py:58
   ├─ 1.120 tobytes  fitz\fitz.py:7146
   │     [4 frames hidden]  fitz, <built-in>
   │        1.120 Pixmap__tobytes  <built-in>:0
   ├─ 0.306 imdecode  <built-in>:0
   │     [2 frames hidden]  <built-in>
   └─ 0.109 get_pixmap  fitz\utils.py:812
         [4 frames hidden]  fitz, <built-in>



  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:10:12  Samples:  146
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.561     CPU time: 0.562
/   _/                      v4.1.1

Program: I:/ocr_test/性能测试.py

0.567 _call  性能测试.py:17
└─ 0.567 test5  性能测试.py:68
   ├─ 0.232 __array__  PIL\Image.py:705
   │     [13 frames hidden]  PIL, <built-in>
   ├─ 0.185 pil_tobytes  fitz\fitz.py:7279
   │     [24 frames hidden]  fitz, PIL, <built-in>, ntpath, generi...
   ├─ 0.111 get_pixmap  fitz\utils.py:812
   │     [4 frames hidden]  fitz, <built-in>
   ├─ 0.032 array  <built-in>:0
   │     [2 frames hidden]  <built-in>
   └─ 0.006 [self]  



  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:10:12  Samples:  16
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.136     CPU time: 0.141
/   _/                      v4.1.1

Program: I:/ocr_test/性能测试.py

0.143 _call  性能测试.py:17
├─ 0.131 test2_Comprehensions  性能测试.py:78
│  ├─ 0.109 <listcomp>  性能测试.py:82
│  │  └─ 0.109 get_pixmap  fitz\utils.py:812
│  │        [4 frames hidden]  fitz, <built-in>
│  │           0.109 DisplayList_get_pixmap  <built-in>:0
│  └─ 0.023 <listcomp>  性能测试.py:83
│     └─ 0.023 samples  fitz\fitz.py:7468
│           [2 frames hidden]  fitz
├─ 0.006 __del__  fitz\fitz.py:7494
│     [3 frames hidden]  fitz, <built-in>
└─ 0.006 [self]

Memory results:

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    30    290.2 MiB    290.2 MiB           1   @profile
    31                                         def test1():
    32    290.2 MiB      0.0 MiB           1       images = []
    33    449.1 MiB      0.1 MiB           6       for page in doc:
    34    424.0 MiB     29.5 MiB           5           pix = page.get_pixmap(dpi=300)  # dpi=300 proved a suitable size in testing; larger values make the image too large
    35    457.2 MiB    166.1 MiB           5           image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    36    449.1 MiB    -36.8 MiB           5           image = np.array(image, dtype=np.uint8)
    37    449.1 MiB      0.0 MiB           5           images.append(image)


Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    299.7 MiB    299.7 MiB           1   @profile
    42                                         def test2():
    43    299.7 MiB      0.0 MiB           1       images = []
    44    449.3 MiB      0.0 MiB           6       for page in doc:
    45    424.4 MiB     25.1 MiB           5           pix = page.get_pixmap(dpi=300)
    46    449.3 MiB    124.5 MiB           5           image = np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
    47    449.3 MiB      0.0 MiB           5           images.append(image)


Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    51    299.9 MiB    299.9 MiB           1   @profile
    52                                         def test3():
    53    299.9 MiB      0.0 MiB           1       images = []
    54    716.0 MiB      0.0 MiB           6       for page in doc:
    55    657.8 MiB    124.6 MiB           5           pix = page.get_pixmap(dpi=300)
    56    716.0 MiB    291.5 MiB           5           image = cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR)
    57    716.0 MiB      0.0 MiB           5           images.append(image)


Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    61    300.9 MiB    300.9 MiB           1   @profile
    62                                         def test4():
    63    300.9 MiB      0.0 MiB           1       images = []
    64    450.3 MiB      0.0 MiB           6       for page in doc:
    65    425.6 MiB     24.8 MiB           5           pix = page.get_pixmap(dpi=300)
    66    450.3 MiB    124.6 MiB           5           image = cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR)
    67    450.3 MiB      0.0 MiB           5           images.append(image)


Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    300.9 MiB    300.9 MiB           1   @profile
    72                                         def test5():
    73    300.9 MiB      0.0 MiB           1       images = []
    74    716.3 MiB      0.0 MiB           6       for page in doc:
    75    657.1 MiB    124.7 MiB           5           pix = page.get_pixmap(dpi=300)
    76    716.3 MiB    290.7 MiB           5           image = np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG"))))
    77    716.3 MiB      0.0 MiB           5           images.append(image)


Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    81    301.2 MiB    301.2 MiB           1   @profile
    82                                         def test2_Comprehensions():
    83    301.2 MiB      0.0 MiB           1       imaegs = []
    84    425.9 MiB    124.7 MiB           6       def a(pix):
    85    450.8 MiB    124.5 MiB           5           return np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
    86    425.9 MiB   -124.5 MiB           8       images = [a(page.get_pixmap(dpi=300)) for page in doc]

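As the results show, test4 is the slowest (about 1.5 s), test3 uses the most memory (about 716 MiB, with test5 essentially tied), and test2 is the best, with the fastest speed (about 0.14 s) and the lowest memory usage. In theory a list comprehension can speed things up further and reduce memory usage, but the speed gain is limited.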

During testing I also found that with Image.open the image can be too large, which raises PIL.Image.DecompressionBombError.
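Pillow's check is based on the pixel count: an image with more pixels than Image.MAX_IMAGE_PIXELS triggers a DecompressionBombWarning, and one with more than twice that limit raises DecompressionBombError. The sketch below mirrors those thresholds in plain Python so it runs without Pillow; the function name bomb_check is hypothetical, and the default limit is written out from Pillow's definition.

```python
# Pillow's default limit, as defined in PIL/Image.py (89,478,485 pixels).
MAX_IMAGE_PIXELS = int(1024 * 1024 * 1024 // 4 // 3)


def bomb_check(width, height, limit=MAX_IMAGE_PIXELS):
    """Classify an image size the way Pillow's decompression bomb check does."""
    pixels = width * height
    if pixels > 2 * limit:
        return "error"    # Pillow raises DecompressionBombError
    if pixels > limit:
        return "warning"  # Pillow emits DecompressionBombWarning
    return "ok"


# Pixel sizes of an assumed A4 page rendered at 300, 1200 and 2400 dpi.
for size in [(2480, 3509), (9917, 14034), (19834, 28067)]:
    print(size, bomb_check(*size))
```

For trusted input the limit can be raised by assigning a larger value to Image.MAX_IMAGE_PIXELS, or disabled by setting it to None. Skipping the JPEG round trip, as the buffer-based methods do, avoids Image.open entirely.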
