Configuring Python textract on Windows

钦良弼
2023-12-01

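Python textract helps you extract text from images and many kinds of documents.

Test environment:

1. win7_64/win10_64

2. python3.7_64

3. test_image (the folder holding the sample files used below)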

 


Installing textract

pip install textract

Textract dependencies

A plain pip install textract is enough to extract text from docx, xlsx, and pptx files. Other formats need the external tools described below.

  1. If you want textract to support OCR (optical character recognition), you need to install Tesseract:

https://github.com/tesseract-ocr/tesseract/wiki

The Windows installer is available from:

https://github.com/UB-Mannheim/tesseract/wiki

After running tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe, add the installation path (C:\Program Files\Tesseract-OCR) to the Path system variable.

To support Chinese, download the trained data from:

https://github.com/tesseract-ocr/tessdata

and put the downloaded file chi_sim.traineddata in C:\Program Files\Tesseract-OCR\tessdata

Test with the following commands:

import textract
text = textract.process('./test_image/1.tif', method='tesseract', language='chi_sim')
print(text)
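If the OCR test fails, it can help to confirm that tesseract is reachable through Path and that chi_sim is among the installed languages. The following is a minimal sketch, not part of textract itself: the helper names parse_langs and tesseract_langs are made up for illustration, and it assumes the tesseract executable accepts the --list-langs option and prints one language code per line after a "List of available languages" header.

```python
import shutil
import subprocess


def parse_langs(output):
    """Return the language codes found in `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def tesseract_langs():
    """Return the installed languages, or None if tesseract is not on Path."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Assumption: some versions print the list to stderr instead of stdout,
    # so fall back to stderr when stdout is empty.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = tesseract_langs()
    if langs is None:
        print("tesseract was not found on Path")
    elif "chi_sim" not in langs:
        print("chi_sim.traineddata does not seem to be installed")
    else:
        print("tesseract and chi_sim look ready")
```

Note that a new Path entry only takes effect in command prompts and Python sessions opened after the change.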

 

  2. If you want textract to support PDF, you need to download the pdftotext component from:

http://blog.alivate.com.au/poppler-windows/

The latest version at the time of writing is poppler-0.68.0_x86.

Unzip it to get the folder poppler-0.68.0, and put that folder in C:\Program Files (x86)\

Add the path C:\Program Files (x86)\poppler-0.68.0\bin to the Path system variable.

 

Test with the following commands:

import textract
text = textract.process('./test_image/1.pdf')
print(text)
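textract.process returns bytes rather than str, so print(text) shows a b'...' literal, and Chinese characters appear as escape sequences. A small sketch of decoding the result is below. The names to_text and extract_text are hypothetical helpers, and the sketch assumes the output is UTF-8, which is textract's default output encoding.

```python
def to_text(raw, encoding="utf-8"):
    """Decode the bytes returned by textract.process into a str."""
    # errors="replace" substitutes a marker for any byte that cannot be decoded
    # instead of raising an exception.
    return raw.decode(encoding, errors="replace")


def extract_text(path, **kwargs):
    """Run textract on path and return a str (hypothetical convenience wrapper)."""
    # Imported here so that to_text can be used without textract installed.
    import textract
    return to_text(textract.process(path, **kwargs))
```

With these helpers, print(extract_text('./test_image/1.pdf')) prints readable text, and the OCR case can pass method='tesseract' and language='chi_sim' as keyword arguments.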

 

  3. If you want to support doc, you need to download the antiword component from:

https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml

The latest version at the time of writing is antiword-0_37-windows.zip.

Unzip it to get the folder antiword, which must be placed directly under C:\ (the path seems to be hard-coded in the application).

Add the path C:\antiword to the Path system variable.

Test with the following commands, which call antiword directly:

import subprocess
text = subprocess.check_output(['antiword', '-m', 'utf-8.txt', './test_image/1.doc'])
print(text)
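The three setups above can be summarised as a small pre-flight check that reports which external tool a file needs and whether it is reachable through Path. This is only a sketch based on the file types covered in this post: REQUIRED_TOOL, required_tool, and missing_tool are illustrative names, not part of textract, and the table is not a complete list of the formats textract handles.

```python
import os
import shutil

# External command-line tool each file type needs, as described in this post.
# None means a plain `pip install textract` is enough.
REQUIRED_TOOL = {
    ".docx": None,
    ".xlsx": None,
    ".pptx": None,
    ".tif": "tesseract",
    ".pdf": "pdftotext",
    ".doc": "antiword",
}


def required_tool(path):
    """Return the external tool needed for path, or None if none is needed."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in REQUIRED_TOOL:
        raise ValueError("file type not covered by this sketch: " + repr(ext))
    return REQUIRED_TOOL[ext]


def missing_tool(path):
    """Return the tool that path needs but that is not on Path, else None."""
    tool = required_tool(path)
    if tool is not None and shutil.which(tool) is None:
        return tool
    return None
```

For example, missing_tool('./test_image/1.pdf') returns 'pdftotext' when the poppler bin folder has not been added to Path, and None once it has.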
