Configuring Python textract on Windows

钦良弼

2023-12-01

The Python textract package helps you extract text from images and many kinds of documents.

Test environment:

1. Windows 7 / Windows 10 (64-bit)

2. Python 3.7 (64-bit)


Installing textract

pip install textract

Textract dependencies

After pip install textract alone, textract can already extract text from docx, xlsx, and pptx files.

If you want textract to support OCR (optical character recognition), you also need to install Tesseract:

Windows installer:

After running tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe, add the installation path (C:\Program Files\Tesseract-OCR) to the Path system variable.

If you want to support Chinese, download the trained data from:

and put the downloaded file chi_sim.traineddata in C:\Program Files\Tesseract-OCR\tessdata
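Before running OCR, it can help to confirm that the language file actually landed in the tessdata folder. The following is a minimal sketch, not part of textract or Tesseract: the helper name installed_languages is made up for illustration, and the default folder is simply the path assumed in the text above.

```python
from pathlib import Path

# Assumed default location from the steps above; adjust it if Tesseract
# was installed somewhere else.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def installed_languages(tessdata_dir=TESSDATA_DIR):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    if not tessdata_dir.is_dir():
        # Folder missing: nothing is installed (or the path is wrong).
        return []
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))


if __name__ == "__main__":
    langs = installed_languages()
    print("chi_sim installed:", "chi_sim" in langs)
```

If chi_sim is not reported as installed, the file is most likely in the wrong folder or was saved under a different name.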

Test with the following commands:

import textract

text=textract.process('./test_image/1.tif',method='tesseract',language='chi_sim')

print(text)
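In Python 3, textract.process returns bytes, so print(text) shows a b'...' literal with escaped Chinese characters rather than readable text. A small sketch of decoding the result follows; the helper names decode_output and extract_text are hypothetical, and UTF-8 is assumed because it is textract's default output encoding.

```python
def decode_output(raw, encoding="utf-8"):
    """Turn the bytes returned by textract.process into a stripped str."""
    return raw.decode(encoding, errors="replace").strip()


def extract_text(path, **kwargs):
    """Run textract on path and return a str; kwargs are passed through unchanged."""
    # Imported lazily so decode_output can be used without textract installed.
    import textract
    return decode_output(textract.process(path, **kwargs))


if __name__ == "__main__":
    # Same bytes that textract would return for a short Chinese line.
    print(decode_output("中文\n".encode("utf-8")))
```

With these helpers, the test above could be written as extract_text('./test_image/1.tif', method='tesseract', language='chi_sim').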

If you want textract to support PDF, you need to download the pdftotext component from:

The latest version at the time of writing is: poppler-0.68.0_x86

Unzip it to get the folder poppler-0.68.0, and put that folder in C:\Program Files (x86)\

Add the path (C:\Program Files (x86)\poppler-0.68.0\bin) to the Path system variable.
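Changes to Path only take effect in newly opened terminals, so it is worth checking that the external tools are actually visible to Python. The sketch below uses only the standard library; the function name find_tools is made up for illustration, and the executable names are assumed to match the tools discussed in this post.

```python
import shutil


def find_tools(names=("tesseract", "pdftotext", "antiword")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, location in find_tools().items():
        print(f"{name}: {location or 'NOT FOUND on PATH'}")
```

Any tool reported as not found usually means the Path entry is missing, misspelled, or the terminal was opened before the variable was changed.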

Test with the following commands:

import textract

text=textract.process('./test_image/1.pdf')

print(text)

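Support for .doc files relies on antiword. The latest version at the time of writing is: antiword-0_37-windows.zip

Unzip it to get the folder antiword, which must be placed directly under C:\ (the path appears to be hard-coded in the application).

Add the path (C:\antiword) to the Path system variable.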

Test with the following commands:

import subprocess

text= subprocess.check_output(['antiword', '-m', 'utf-8.txt', './test_image/1.doc'])

print(text)
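The call above raises an exception if antiword is not on Path or exits with an error, and it prints raw bytes. A slightly more defensive sketch follows; read_doc is a hypothetical helper, and it reuses the same antiword arguments as the test above.

```python
import subprocess


def read_doc(path, antiword="antiword"):
    """Return the text of a .doc file, or None if antiword is missing or fails."""
    try:
        raw = subprocess.check_output(
            [antiword, "-m", "utf-8.txt", path],
            stderr=subprocess.STDOUT,  # keep antiword's error text with the output
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        # OSError covers a missing executable; CalledProcessError covers
        # a non-zero exit status from antiword itself.
        print(f"antiword failed: {exc}")
        return None
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(read_doc("./test_image/1.doc"))
```

Returning None instead of raising keeps a batch job running when a single file cannot be read; whether that is the right trade-off depends on the use case.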