The Python library textract helps you extract text from images and many kinds of documents.
Test environment:
1. Windows 7 / Windows 10 (64-bit)
2. Python 3.7 (64-bit)
Installing textract
pip install textract
Textract dependencies
Running pip install textract is enough to extract text from docx, xlsx, and pptx files. Other formats (images, pdf, doc) need the external tools listed in the rest of this section.
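As a quick reference, the split between "works out of the box" and "needs an external tool" can be sketched as a small lookup. The table only reflects the file types tested in this post; the name required_tool is hypothetical, and pdftotext is assumed to be the poppler program that textract calls for PDF files.

```python
import os

# Mapping based on this post: which external tool each file type relies on.
# None means `pip install textract` alone is enough.
EXTERNAL_TOOL_BY_EXTENSION = {
    ".docx": None,
    ".xlsx": None,
    ".pptx": None,
    ".tif": "tesseract",   # OCR for images
    ".pdf": "pdftotext",   # assumed: the poppler program used for PDF
    ".doc": "antiword",
}


def required_tool(path):
    """Return the external tool needed for `path`, or None if none is needed.

    Raises KeyError for file types this post does not cover.
    """
    extension = os.path.splitext(path)[1].lower()
    return EXTERNAL_TOOL_BY_EXTENSION[extension]
```

For example, required_tool('./test_image/1.doc') gives 'antiword', so antiword has to be installed before that file can be processed.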
For images (OCR), install Tesseract:
https://github.com/tesseract-ocr/tesseract/wiki
Windows installer:
https://github.com/UB-Mannheim/tesseract/wiki
After installing tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe, add the installation path (C:\Program Files\Tesseract-OCR) to the system Path variable.
To support Chinese, download the trained data from:
https://github.com/tesseract-ocr/tessdata
and put the downloaded file chi_sim.traineddata in C:\Program Files\Tesseract-OCR\tessdata
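Before running OCR, it can save time to confirm both steps above actually took effect. The following is a minimal sketch, not part of textract: the function name tesseract_problems is hypothetical, and the tessdata folder is passed in because your install path may differ from the one used here.

```python
import os
import shutil


def tesseract_problems(tessdata_dir, language="chi_sim"):
    """Return a list of setup problems; an empty list means none were found."""
    problems = []
    # shutil.which searches the directories listed in the Path variable
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    # The language model must sit inside the tessdata folder
    trained_data = os.path.join(tessdata_dir, language + ".traineddata")
    if not os.path.isfile(trained_data):
        problems.append("missing " + trained_data)
    return problems
```

Calling tesseract_problems(r'C:\Program Files\Tesseract-OCR\tessdata') in a new console should return an empty list once the Path variable and chi_sim.traineddata are in place. A console opened before the Path change will not see it.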
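Test with the commands below:
import textract
# Run OCR on the image with Tesseract, using the Simplified Chinese model
text = textract.process('./test_image/1.tif', method='tesseract', language='chi_sim')
# process() returns bytes, so decode before printing
print(text.decode('utf-8'))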
For PDF files, install poppler for Windows:
http://blog.alivate.com.au/poppler-windows/
The latest version at the time of writing is poppler-0.68.0_x86.
Unzip it to get the folder poppler-0.68.0, and put that folder in C:\Program Files (x86)\
Add the path (C:\Program Files (x86)\poppler-0.68.0\bin) to the system Path variable.
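Test with the commands below:
import textract
# PDF extraction relies on the poppler tools added to Path above
text = textract.process('./test_image/1.pdf')
# process() returns bytes, so decode before printing
print(text.decode('utf-8'))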
For .doc files, install antiword:
https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml
The latest version at the time of writing is antiword-0_37-windows.zip.
Unzip it to get the folder antiword. It must be placed directly under C:\ (the path seems to be set inside the application).
Add the path (C:\antiword) to the system Path variable.
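Test with the commands below:
import subprocess
# -m utf-8.txt selects antiword's UTF-8 character mapping file
text = subprocess.check_output(['antiword', '-m', 'utf-8.txt', './test_image/1.doc'])
# check_output() returns bytes, so decode before printing
print(text.decode('utf-8'))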
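Most failures with the steps above come from a tool that is installed but not reachable through Path. A small check along these lines can point to the missing one. This is only a sketch: missing_tools is a hypothetical helper, and the default tool names assume the executables are called tesseract, pdftotext (from poppler), and antiword.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftotext", "antiword")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(name + " was not found on PATH")
```

If the script prints nothing, all three executables are visible to Python. Open a new console after editing the Path variable, since a console that was already open keeps the old value.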