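OCRmyPDF performs OCR with Tesseract, the best open-source OCR engine available.
OCRmyPDF is a Python 3 package that adds an OCR text layer to PDF files.
OCRmyPDF is the most feature-rich and thoroughly tested tool for OCR-converting PDFs.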
1) macOS
2) Ubuntu 16.04 LTS
3) Arch Linux
4) Windows
In addition, OCRmyPDF provides a Docker image that can be pulled and used directly.
CentOS version:
[root@bc22c4e1 ~]# cat /etc/issue
CentOS release 6.9 (Final)
1) Python >= 3.5
[root@bc22c4e1 ~]# python -V
Python 3.5.0
2) pip >= 9.0.1
[root@bc22c4e1 ~]# pip -V
pip 9.0.1 from /usr/local/python3/lib/python3.5/site-packages (python 3.5)
3) Python 3 can import sqlite3 successfully
4) Basic setup
*CentOS/RHEL 6.x*
# yum install gcc python-devel python-setuptools
# easy_install pip
# pip install fabric
5) Other dependencies
pdftotext is provided by poppler-utils:
yum install poppler-utils
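The prerequisites above can be checked in one pass before starting the build. The following is only a rough sketch: the function names and the `PYTHON` override are my own, not part of OCRmyPDF, and it only reports what it finds without installing anything.

```shell
# Report whether each prerequisite from the list above is present.
# PYTHON is a hypothetical override; it defaults to python3.
PYTHON="${PYTHON:-python3}"

# have CMD: succeed if CMD is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# report LABEL STATUS: print one aligned result line.
report() { printf '%-12s %s\n' "$1" "$2"; }

check_prereqs() {
    if have "$PYTHON"; then
        report "python" "$("$PYTHON" -V 2>&1)"
        # sqlite3 is only importable if Python was built with SQLite support.
        if "$PYTHON" -c 'import sqlite3' 2>/dev/null; then
            report "sqlite3" "ok"
        else
            report "sqlite3" "MISSING"
        fi
    else
        report "python" "MISSING"
    fi
    if have pip; then
        report "pip" "$(pip -V 2>&1)"
    else
        report "pip" "MISSING"
    fi
    if have pdftotext; then
        report "pdftotext" "ok"
    else
        report "pdftotext" "MISSING (yum install poppler-utils)"
    fi
    return 0
}

check_prereqs
```

Any line marked MISSING points back to the matching item in the list.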
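Step 1: Clone the source from git.
git clone -b master https://github.com/jbarlow83/OCRmyPDF.git
Step 2: Create a virtual environment.
python3 -m venv venv
Step 3: Activate the virtual environment.
source venv/bin/activate
Step 4: Run the install.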
cd OCRmyPDF
pip3 install .
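One easy mistake in these steps is running `pip3 install .` without the virtual environment active, which installs into the system Python instead. A small guard like the one below can catch that. It is a sketch under one assumption: it relies on the `VIRTUAL_ENV` variable that the `activate` script exports, and the function names are made up for illustration.

```shell
# venv_active: succeed if an activate script has exported VIRTUAL_ENV.
venv_active() { [ -n "${VIRTUAL_ENV:-}" ]; }

# safe_install DIR: run "pip3 install ." in DIR, but only inside an
# active virtual environment.
safe_install() {
    if ! venv_active; then
        echo "no active virtual environment; run: source venv/bin/activate" >&2
        return 1
    fi
    (cd "$1" && pip3 install .)
}

# Usage, from the directory that contains the clone:
#   safe_install OCRmyPDF
```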
The error is as follows:
Running setup.py install for ocrmypdf ... error
Complete output from command /usr/local/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-lio4mtqk-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-qnapqha6-record/install-record.txt --single-version-externally-managed --compile:
Checking for tesseract >= 3.04...
Found tesseract 3.04.00
Checking for gs >= 9.15..
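The two version checks in this log can be reproduced by hand, which is quicker than rerunning the whole install after each fix. The sketch below assumes GNU `sort -V` is available for version ordering, and that `tesseract --version` prints the version as the second word of its first line, as in the "tesseract 3.04.00" output above. The helper names are hypothetical.

```shell
# version_ge HAVE WANT: succeed if HAVE >= WANT (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# tool_version NAME: print the installed version, or nothing if absent.
tool_version() {
    command -v "$1" >/dev/null 2>&1 || return 0
    case "$1" in
        tesseract) tesseract --version 2>&1 | awk 'NR==1 {print $2}' ;;
        gs)        gs --version 2>/dev/null ;;
    esac
}

# check_min NAME HAVE WANT: print a one-line verdict.
check_min() {
    if [ -z "$2" ]; then
        echo "$1: not found (need >= $3)"
    elif version_ge "$2" "$3"; then
        echo "$1 $2: ok (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
    fi
}

# Minimums taken from the setup.py output above.
check_min tesseract "$(tool_version tesseract)" 3.04
check_min gs "$(tool_version gs)" 9.15
```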
Solution: build Ghostscript from source. The commands below fetch 9.14, as in the referenced answer; since the check above requires gs >= 9.15, substitute a release that meets that minimum.
curl -O http://downloads.ghostscript.com/public/ghostscript-9.14.tar.gz &&
tar -xzf ghostscript-9.14.tar.gz &&
cd ghostscript-9.14 &&
./configure &&
make install &&
make so &&
cp sobin/libgs.so.9.14 /usr/lib &&
ln -s /usr/lib/libgs.so.9.14 /usr/lib/libgs.so &&
mkdir -p /etc/ld.so.conf.d/ &&
echo "/usr/lib" > /etc/ld.so.conf.d/libgs.conf &&
ldconfig &&
echo "Ghostscript installation finished" &&
gs --version
Reference: https://unix.stackexchange.com/questions/79025/install-ghostscript-v-9-05-or-newer-on-centos
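After a source build it is worth confirming that the shared library actually landed in the dynamic linker cache. This is a minimal sketch, assuming `ldconfig -p` is available (it often lives in /sbin, so the function extends PATH in a subshell). `lib_registered` is a name I made up.

```shell
# lib_registered NAME: succeed if the linker cache lists a library
# whose entry contains NAME. The body runs in a subshell so the PATH
# change does not leak into the caller.
lib_registered() (
    PATH="$PATH:/sbin:/usr/sbin"
    ldconfig -p 2>/dev/null | grep -qF -- "$1"
)

if lib_registered libgs.so; then
    echo "libgs is registered with the dynamic linker"
else
    echo "libgs is not in the linker cache; re-run ldconfig as root"
fi
```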
Solution:
Step 1: Download unpaper 6.1.
# cd /var/bin && wget https://www.flameeyes.eu/files/unpaper-6.1.tar.xz && tar -xvf unpaper-6.1.tar.xz
Step 2: Build and install unpaper 6.1.
# cd unpaper-6.1 && ./configure && make && make install
Reference: https://github.com/Flameeyes/unpaper/issues/44
Solution:
Build and install qpdf from source.
./configure
make
make install
Reference: https://github.com/qpdf/qpdf
configure: error: Package requirements (libavformat libavcodec libavutil) were not met:
No package 'libavformat' found
No package 'libavcodec' found
No package 'libavutil' found
Or, printed on a single line:
No package 'libavformat' found No package 'libavcodec' found No package 'libavutil' found
Solution:
Step 1: Install the dependencies.
yum install libvorbis yasm freetype zlib bzip2 faac lame speex libvpx libogg libtheora x264 XviD openjpeg15 opencore-amr
Step 2: Download, build, and install libav.
wget https://www.libav.org/releases/libav-10.5.tar.gz
tar xvf libav-10.5.tar.gz
cd libav-10.5
./configure --extra-cflags=-I/opt/local/include --extra-ldflags=-L/opt/local/lib --enable-gpl --enable-version3 --enable-libvpx
make
make install
Reference: https://superuser.com/questions/850808/how-to-install-libav-tools-in-centos-6
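The "Package requirements ... were not met" message comes from a pkg-config lookup, so the same lookup can be repeated directly once libav is installed. The sketch below rests on an assumption: that a default-prefix source build puts its .pc files under /usr/local/lib/pkgconfig, a directory pkg-config may not search on its own. `missing_pkgs` is a hypothetical helper name.

```shell
# Assumption: the source build installed its .pc files here.
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"

# missing_pkgs NAME...: print each package that pkg-config cannot find.
# If pkg-config itself is absent, every name is reported as missing.
missing_pkgs() {
    for pkg in "$@"; do
        pkg-config --exists "$pkg" 2>/dev/null || echo "$pkg"
    done
}

missing="$(missing_pkgs libavformat libavcodec libavutil)"
if [ -z "$missing" ]; then
    echo "libav packages found"
else
    # Unquoted on purpose so the names print on one line.
    echo "still missing:" $missing
fi
```

If the three packages are reported as found, rerunning the failing ./configure in the same shell should get past this error.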
Solution:
yum search ffi | grep python
yum install python-cffi
yum install libffi-devel
pip install --upgrade cffi
Reference: https://github.com/Kozea/cairocffi/issues/14
ages (from reportlab>=3.3.0->ocrmypdf==5.2.post0+g3a7c341.d20170710)
Requirement already satisfied: pycparser in /home/centos001/lib/python3.5/site-packages (from cffi>=1.9.1->ocrmypdf==5.2.post0+g3a7c341.d20170710)
Installing collected packages: ocrmypdf
Running setup.py install for ocrmypdf ... done
Successfully installed ocrmypdf-5.2.post0+g3a7c341.d20170710
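With the install reported as successful, a quick smoke test is to OCR one scanned PDF and print a few lines of the resulting text layer with pdftotext. This is only a sketch: `ocr_pdf` is a wrapper name I made up, and scan.pdf is a placeholder for any scanned input file.

```shell
# ocr_pdf IN OUT: add an OCR text layer to IN, write the result to OUT,
# then print the first lines of the recognized text.
ocr_pdf() {
    command -v ocrmypdf >/dev/null 2>&1 || {
        echo "ocrmypdf is not on PATH" >&2
        return 127
    }
    [ -f "$1" ] || {
        echo "input not found: $1" >&2
        return 2
    }
    ocrmypdf "$1" "$2" || return
    # pdftotext comes from poppler-utils; "-" sends the text to stdout.
    pdftotext "$2" - | head -n 5
}

# Usage (scan.pdf is a placeholder name):
#   ocr_pdf scan.pdf scan_ocr.pdf
```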
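Reference for installing python3 and pip3: http://www.jianshu.com/p/6199b5c26725
Reference for installing sqlite3: http://www.cnblogs.com/greentomlee/p/6561509.html
Real knowledge comes from practice. When problems come up, work through them one at a time until every one is solved. (This took two days.)
August 13, 2017, 13:48, at home by the bedside
Author: 铭毅天下
Please credit the source when reposting. Original article:
http://blog.csdn.net/laoyang360/article/details/77141977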