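Read a PDF file and convert each page image to np.array format so that PaddleOCR can read it. This code benchmarks the conversion speed.
Requirements: paddleocr, pyinstrument, pymupdf, memory_profiler
After a reply from the PyMuPDF developer, I got a more efficient method: pix.samples_mv gives direct access to the pixel memory (which is a memoryview to that internal area (without copying)) (GitHub link). The speedup is substantial: compared with the earlier approach, the time drops from the ms range to the µs range, roughly 3000 times faster.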
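The difference between an accessor that copies and one that returns a memoryview can be illustrated with the standard library alone. The sketch below is not PyMuPDF code: `backing` is a hypothetical stand-in for a pixmap's internal pixel buffer, `bytes(backing)` stands in for a copying accessor, and `memoryview(backing)` for a no-copy accessor.

```python
# Hypothetical stand-in for a pixmap's internal pixel buffer (not PyMuPDF itself).
backing = bytearray(12)  # e.g. 2 x 2 RGB pixels, all zero

copied = bytes(backing)      # copying accessor: allocates a new, detached bytes object
view = memoryview(backing)   # no-copy accessor: shares memory with the buffer

backing[0] = 255             # change the underlying buffer afterwards

copy_sees_change = copied[0] == 255  # the detached copy still holds the old value
view_sees_change = view[0] == 255    # the view reads the same memory as the buffer
print(copy_sees_change, view_sees_change)
```

Because a view shares memory with its owner, the owner has to stay alive for as long as the view (or an array built on it) is in use.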
Test results:
images = []
pixs = [page.get_pixmap(dpi=300) for page in doc]
%timeit [np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
5.22 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
15.7 ms ± 77.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit [np.array(Image.frombytes("RGB", (pix.width, pix.height), pix.samples), dtype=np.uint8) for pix in pixs]
105 ms ± 4.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG")))) for pix in pixs]
179 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR) for pix in pixs]
182 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR) for pix in pixs]
394 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
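To put the numbers above side by side, here is a small sketch that divides each reported mean by the fastest one. The values are copied by hand from the %timeit output; the dictionary keys are just labels I chose for the six variants.

```python
# Mean times copied from the %timeit output above, in seconds per loop.
timings = {
    "frombuffer(pix.samples_mv)": 5.22e-6,
    "frombuffer(pix.samples)": 15.7e-3,
    "Image.frombytes + np.array": 105e-3,
    "pil_tobytes(JPEG) + Image.open": 179e-3,
    "pil_tobytes(JPEG) + cv2.imdecode": 182e-3,
    "tobytes() + cv2.imdecode": 394e-3,
}

fastest = min(timings.values())
# How many times slower each variant is than the fastest one.
slowdowns = {name: t / fastest for name, t in timings.items()}

for name, factor in sorted(slowdowns.items(), key=lambda kv: kv[1]):
    print(f"{name:34s} {factor:>10,.0f}x")
```

Note that `pixs` is built before timing, so rendering with get_pixmap is excluded from every row; only the conversion step is compared.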
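The test platform is an i7-8700 and the PDF is an arbitrary 110 KB file; larger files will be somewhat slower. If the dpi passed to get_pixmap is set too high, the generated images become very large.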
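A rough way to see how quickly the raw image grows with dpi is to do the arithmetic by hand. This is a sketch under assumptions: `rgb_pixmap_bytes` is a hypothetical helper, the page is assumed to be A4 (595 x 842 points; the test PDF's page size is not stated), and the pixel dimensions are simply rounded, which may differ slightly from what PyMuPDF produces.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 72 per inch


def rgb_pixmap_bytes(width_pt, height_pt, dpi, channels=3):
    """Estimate the raw size in bytes of a page rendered at `dpi` (hypothetical helper)."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px * height_px * channels


# Assumption: an A4 page of 595 x 842 points.
for dpi in (72, 150, 300, 600):
    size = rgb_pixmap_bytes(595, 842, dpi)
    print(f"dpi={dpi}: {size / 2**20:.1f} MiB")
```

The size grows with the square of the dpi, so doubling the dpi roughly quadruples the memory per page. Under the A4 assumption a page at dpi=300 comes to about 25 MiB of raw RGB data.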
Memory consumption is also somewhat lower:
%load_ext memory_profiler
%memit images = [np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
peak memory: 346.82 MiB, increment: 0.07 MiB
%memit [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
peak memory: 396.67 MiB, increment: 49.83 MiB
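The gap between the two increments is what one would expect if one accessor copies the data and the other only wraps it. The sketch below reproduces that pattern with the standard library only, as an analogy rather than a measurement of PyMuPDF: `buffers` is a hypothetical stand-in for the pixel data of five pages, and `peak_allocation` is a helper name I made up.

```python
import tracemalloc


def peak_allocation(build, buffers):
    """Return the peak number of bytes allocated while build(buffers) runs."""
    tracemalloc.start()
    result = build(buffers)  # keep the result alive until the peak is read
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del result
    return peak


# Hypothetical stand-in for five pages of pixel data, 1 MB each.
buffers = [bytearray(1_000_000) for _ in range(5)]

copy_peak = peak_allocation(lambda bs: [bytes(b) for b in bs], buffers)
view_peak = peak_allocation(lambda bs: [memoryview(b) for b in bs], buffers)

print(f"copies: {copy_peak / 2**20:.2f} MiB, views: {view_peak / 2**20:.4f} MiB")
```

The copying variant allocates about as much as the data itself, while the view variant allocates only a few small wrapper objects.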
Everything below can be skipped. It was written earlier, and the testing was not rigorous.
import io
import cv2
import fitz
import numpy as np
from PIL import Image
from paddleocr import PaddleOCR
from pyinstrument import Profiler
from memory_profiler import profile
pdf_file = "./测试文档.pdf"
doc = fitz.open(pdf_file)
ocr = PaddleOCR(use_angle_cls=True, use_gpu=False,lang="ch")
# Measure how long a function takes
def test(func):
    def _call():
        profiler = Profiler()
        profiler.start()
        func()
        profiler.stop()
        print(profiler.output_text(unicode=True, color=True))
    return _call
@test
# @profile
def test1():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)  # dpi=300 was found by testing to be a suitable size; higher values make the image too large
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        image = np.array(image, dtype=np.uint8)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]  # confirm that images can be read by ocr
@test
# @profile
def test2():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]
@test
# @profile
def test3():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]
@test
# @profile
def test4():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'), cv2.IMREAD_COLOR)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]
@test
# @profile
def test5():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG"))))
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]
@test
# @profile
def test2_Comprehensions():
    images = []
    pixs = [page.get_pixmap(dpi=300) for page in doc]
    images = [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
    # A list comprehension can improve efficiency
test1()
test2()
test3()
test4()
test5()
test2_Comprehensions()
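The `test` decorator above takes no arguments and discards the wrapped function's return value. If only total wall time is needed rather than a pyinstrument call tree, a stdlib-only variant might look like the sketch below. `timed` and `build_list` are hypothetical names, not part of the original script.

```python
import functools
import time


def timed(func):
    """Print the wall-clock time of each call and pass the result through."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so the timing is always reported.
            wrapper.last_elapsed = time.perf_counter() - start
            print(f"{func.__name__}: {wrapper.last_elapsed * 1000:.3f} ms")
    return wrapper


@timed
def build_list(n):
    """Hypothetical workload standing in for one of the conversion functions."""
    return [i * i for i in range(n)]


squares = build_list(1000)
```

Returning the result makes it possible to time a conversion and still pass the resulting images on to OCR in the same run.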
Timing results:
_ ._ __/__ _ _ _ _ _/_ Recorded: 11:10:09 Samples: 112
/_//_/// /_\ / //_// / //_'/ // Duration: 0.391 CPU time: 0.391
/ _/ v4.1.1
Program: I:/ocr_test/性能测试.py
0.381 _call 性能测试.py:17
├─ 0.374 test1 性能测试.py:27
│ ├─ 0.128 get_pixmap fitz\utils.py:812
│ │ [7 frames hidden] fitz, <built-in>
│ │ 0.125 DisplayList_get_pixmap <built-in>:0
│ ├─ 0.126 __array__ PIL\Image.py:705
│ │ [17 frames hidden] PIL, <built-in>
│ ├─ 0.056 frombytes PIL\Image.py:2788
│ │ [7 frames hidden] PIL, <built-in>
│ ├─ 0.038 array <built-in>:0
│ │ [2 frames hidden] <built-in>
│ ├─ 0.023 samples fitz\fitz.py:7468
│ │ [2 frames hidden] fitz
│ └─ 0.005 [self]
└─ 0.007 [self]
_ ._ __/__ _ _ _ _ _/_ Recorded: 11:10:10 Samples: 15
/_//_/// /_\ / //_// / //_'/ // Duration: 0.141 CPU time: 0.156
/ _/ v4.1.1
Program: I:/ocr_test/性能测试.py
0.141 _call 性能测试.py:17
├─ 0.136 test2 性能测试.py:38
│ ├─ 0.109 get_pixmap fitz\utils.py:812
│ │ [4 frames hidden] fitz, <built-in>
│ │ 0.109 DisplayList_get_pixmap <built-in>:0
│ ├─ 0.024 samples fitz\fitz.py:7468
│ │ [2 frames hidden] fitz
│ └─ 0.003 __del__ fitz\fitz.py:7494
│ [3 frames hidden] fitz, <built-in>
└─ 0.004 [self]
_ ._ __/__ _ _ _ _ _/_ Recorded: 11:10:10 Samples: 56
/_//_/// /_\ / //_// / //_'/ // Duration: 0.611 CPU time: 0.609
/ _/ v4.1.1
Program: I:/ocr_test/性能测试.py
0.607 _call 性能测试.py:17
└─ 0.607 test3 性能测试.py:48
├─ 0.296 imdecode <built-in>:0
│ [2 frames hidden] <built-in>
├─ 0.196 pil_tobytes fitz\fitz.py:7279
│ [33 frames hidden] fitz, PIL, <built-in>, ntpath, generi...
└─ 0.113 get_pixmap fitz\utils.py:812
[4 frames hidden] fitz, <built-in>
_ ._ __/__ _ _ _ _ _/_ Recorded: 11:10:10 Samples: 21
/_//_/// /_\ / //_// / //_'/ // Duration: 1.545 CPU time: 1.531
/ _/ v4.1.1
性能测试.py
1.540 _call 性能测试.py:17
└─ 1.535 test4 性能测试.py:58
├─ 1.120 tobytes fitz\fitz.py:7146
│ [4 frames hidden] fitz, <built-in>
│ 1.120 Pixmap__tobytes <built-in>:0
├─ 0.306 imdecode <built-in>:0
│ [2 frames hidden] <built-in>
└─ 0.109 get_pixmap fitz\utils.py:812
[4 frames hidden] fitz, <built-in>
_ ._ __/__ _ _ _ _ _/_ Recorded: 11:10:12 Samples: 146
/_//_/// /_\ / //_// / //_'/ // Duration: 0.561 CPU time: 0.562
/ _/ v4.1.1
Program: I:/ocr_test/性能测试.py
0.567 _call 性能测试.py:17
└─ 0.567 test5 性能测试.py:68
├─ 0.232 __array__ PIL\Image.py:705
│ [13 frames hidden] PIL, <built-in>
├─ 0.185 pil_tobytes fitz\fitz.py:7279
│ [24 frames hidden] fitz, PIL, <built-in>, ntpath, generi...
├─ 0.111 get_pixmap fitz\utils.py:812
│ [4 frames hidden] fitz, <built-in>
├─ 0.032 array <built-in>:0
│ [2 frames hidden] <built-in>
└─ 0.006 [self]
_ ._ __/__ _ _ _ _ _/_ Recorded: 11:10:12 Samples: 16
/_//_/// /_\ / //_// / //_'/ // Duration: 0.136 CPU time: 0.141
/ _/ v4.1.1
Program: I:/ocr_test/性能测试.py
0.143 _call 性能测试.py:17
├─ 0.131 test2_Comprehensions 性能测试.py:78
│ ├─ 0.109 <listcomp> 性能测试.py:82
│ │ └─ 0.109 get_pixmap fitz\utils.py:812
│ │ [4 frames hidden] fitz, <built-in>
│ │ 0.109 DisplayList_get_pixmap <built-in>:0
│ └─ 0.023 <listcomp> 性能测试.py:83
│ └─ 0.023 samples fitz\fitz.py:7468
│ [2 frames hidden] fitz
├─ 0.006 __del__ fitz\fitz.py:7494
│ [3 frames hidden] fitz, <built-in>
└─ 0.006 [self]
Memory results:
Filename: I:\ocr_test\性能测试.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
30 290.2 MiB 290.2 MiB 1 @profile
31 def test1():
32 290.2 MiB 0.0 MiB 1 images = []
33 449.1 MiB 0.1 MiB 6 for page in doc:
34 424.0 MiB 29.5 MiB 5 pix = page.get_pixmap(dpi=300) # dpi=300 was found by testing to be a suitable size; higher values make the image too large
35 457.2 MiB 166.1 MiB 5 image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
36 449.1 MiB -36.8 MiB 5 image = np.array(image, dtype=np.uint8)
37 449.1 MiB 0.0 MiB 5 images.append(image)
Filename: I:\ocr_test\性能测试.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
41 299.7 MiB 299.7 MiB 1 @profile
42 def test2():
43 299.7 MiB 0.0 MiB 1 images = []
44 449.3 MiB 0.0 MiB 6 for page in doc:
45 424.4 MiB 25.1 MiB 5 pix = page.get_pixmap(dpi=300)
46 449.3 MiB 124.5 MiB 5 image = np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
47 449.3 MiB 0.0 MiB 5 images.append(image)
Filename: I:\ocr_test\性能测试.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
51 299.9 MiB 299.9 MiB 1 @profile
52 def test3():
53 299.9 MiB 0.0 MiB 1 images = []
54 716.0 MiB 0.0 MiB 6 for page in doc:
55 657.8 MiB 124.6 MiB 5 pix = page.get_pixmap(dpi=300)
56 716.0 MiB 291.5 MiB 5 image = cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR)
57 716.0 MiB 0.0 MiB 5 images.append(image)
Filename: I:\ocr_test\性能测试.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
61 300.9 MiB 300.9 MiB 1 @profile
62 def test4():
63 300.9 MiB 0.0 MiB 1 images = []
64 450.3 MiB 0.0 MiB 6 for page in doc:
65 425.6 MiB 24.8 MiB 5 pix = page.get_pixmap(dpi=300)
66 450.3 MiB 124.6 MiB 5 image = cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR)
67 450.3 MiB 0.0 MiB 5 images.append(image)
Filename: I:\ocr_test\性能测试.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
71 300.9 MiB 300.9 MiB 1 @profile
72 def test5():
73 300.9 MiB 0.0 MiB 1 images = []
74 716.3 MiB 0.0 MiB 6 for page in doc:
75 657.1 MiB 124.7 MiB 5 pix = page.get_pixmap(dpi=300)
76 716.3 MiB 290.7 MiB 5 image = np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG"))))
77 716.3 MiB 0.0 MiB 5 images.append(image)
Filename: I:\ocr_test\性能测试.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
81 301.2 MiB 301.2 MiB 1 @profile
82 def test2_Comprehensions():
83 301.2 MiB 0.0 MiB 1 imaegs = []
84 425.9 MiB 124.7 MiB 6 def a(pix):
85 450.8 MiB 124.5 MiB 5 return np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
86 425.9 MiB -124.5 MiB 8 images = [a(page.get_pixmap(dpi=300)) for page in doc]
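As the results show, test4 is the slowest, test3 and test5 use the most memory (about 716 MiB peak each), and test2 is the best: it is the fastest and its memory use is among the lowest. A list comprehension can in theory save a little more time and memory, but the speed gain is limited.
During testing I also found that Image.open may raise PIL.Image.DecompressionBombError when the image is too large.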
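To estimate in advance whether a rendered page is likely to trip this check, the pixel count can be compared against the limit. The sketch below is stdlib-only and rests on my understanding of Pillow's defaults, which should be treated as an assumption: a limit of 1024 * 1024 * 1024 // 4 // 3 pixels, a warning above the limit, and an error above twice the limit. `bomb_check` is a hypothetical helper, not a Pillow function.

```python
# Assumption: mirrors Pillow's default Image.MAX_IMAGE_PIXELS.
ASSUMED_MAX_PIXELS = 1024 * 1024 * 1024 // 4 // 3


def bomb_check(width, height, max_pixels=ASSUMED_MAX_PIXELS):
    """Classify an image size as 'ok', 'warning' or 'error' (hypothetical helper)."""
    pixels = width * height
    if pixels > 2 * max_pixels:
        return "error"    # assumed threshold for DecompressionBombError
    if pixels > max_pixels:
        return "warning"  # assumed threshold for DecompressionBombWarning
    return "ok"


for size in [(2479, 3508), (10000, 10000), (20000, 10000)]:
    print(size, bomb_check(*size))
```

For trusted input, Pillow exposes the limit as `Image.MAX_IMAGE_PIXELS`, which can be raised. The approaches that build the array directly from pix.samples or pix.samples_mv never call Image.open and so do not go through this check.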