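While researching options for an OCR project, I came across gosseract, an open-source tool whose recognition quality is quite good.
I set up the environment step by step, starting with installing tesseract (a dependency of gosseract) on macOS:
$ brew install tesseract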
$ tesseract -v
tesseract 4.1.3
leptonica-1.82.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
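The first installation went smoothly.
As business requirements grew, I needed to train languages and therefore needed the training tools, so I decided to uninstall and reinstall: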
$ brew install --with-training-tools tesseract
Usage: brew install [options] formula|cask [...]
Install a formula or cask. Additional options specific to a formula may be
appended to the command.
...
Error: invalid option: --with-training-tools
The message indicates that this install option is no longer supported, so I built from source instead:
Install dependencies
# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc
Download and extract
https://github.com/tesseract-ocr/tesseract/releases
Build and install
cd tesseract-5.1.0
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
make training
sudo make training-install
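The configure line above hard-codes /usr/local, which is Homebrew's default prefix on Intel Macs; on Apple Silicon the default is /opt/homebrew. As a minimal sketch, the same PKG_CONFIG_PATH could be assembled from `brew --prefix` instead. The helper name pkgconfig_path_for is made up for illustration, and the formula names are the ones used in the configure line above.

```shell
# Hypothetical helper: turn a list of install prefixes into a
# colon-separated PKG_CONFIG_PATH value.
pkgconfig_path_for() {
  result=""
  for prefix in "$@"; do
    # Append ":" only when something is already in the result.
    result="${result:+$result:}$prefix/lib/pkgconfig"
  done
  printf '%s\n' "$result"
}

# Only ask Homebrew for prefixes if brew is actually on PATH.
if command -v brew >/dev/null 2>&1; then
  PKG_CONFIG_PATH="$(pkgconfig_path_for \
    "$(brew --prefix icu4c)" \
    "$(brew --prefix libarchive)" \
    "$(brew --prefix libffi)")"
  # Print the configure command rather than running it.
  echo "../configure PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
fi
```

The script only prints the configure command, so it can be inspected before being run from the build directory.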
Problem:
After the installation finished, building the project failed with:
2022/03/31 15:32:10 ERROR ▶ 0004 Failed to build the application: # ocr
/usr/local/go/pkg/tool/darwin_amd64/link: running clang++ failed: exit status 1
Undefined symbols for architecture x86_64:
"tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool)", referenced from:
Init(void*, char*, char*) in 000023.o
_Init in 000023.o
_GetDataPath in 000023.o
"tesseract::TessBaseAPI::Recognize(ETEXT_DESC*)", referenced from:
_GetBoundingBoxesVerbose in 000023.o
_GetBoundingBoxes in 000023.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
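From the error output alone it was not obvious that this was a version problem. Only after several rounds of uninstalling and reinstalling did I find that the installed version (5.1.0) was too new; after reinstalling 4.1.3 the service compiled normally. In hindsight, the GenericVector<STRING> parameters in the undefined Init symbol are a hint: that looks like the 4.x signature, so the bindings were presumably compiled against 4.x headers while the linker found the 5.x library.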
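Since the linker error gives no direct hint about the version, a quick check before building can save the trial and error. The following is a minimal sketch, assuming the first line of `tesseract -v` has the form "tesseract X.Y.Z" as in the output shown earlier; tess_major and check_tess_major are hypothetical helper names.

```shell
# Hypothetical helper: extract the major version from `tesseract -v` output.
# Only the first line is inspected, e.g. "tesseract 4.1.3".
tess_major() {
  printf '%s\n' "$1" | sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

# Hypothetical check: succeed only if the major version equals the expected one.
check_tess_major() {
  expected="$1"
  output="$2"
  actual="$(tess_major "$output")"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Only query the real binary if it is on PATH.
if command -v tesseract >/dev/null 2>&1; then
  # Merge stderr into stdout in case the version is printed there.
  if check_tess_major 4 "$(tesseract -v 2>&1)"; then
    echo "tesseract 4.x found"
  else
    echo "warning: tesseract is not 4.x; the Go build may fail to link" >&2
  fi
fi
```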
To uninstall, either delete the installed files manually or run:
brew uninstall tesseract
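Note that `brew uninstall` only removes what Homebrew itself installed; files placed under /usr/local by `sudo make install` stay behind. Autotools-generated Makefiles normally provide an `uninstall` target, so a source install can usually be undone from its build directory. A minimal sketch, assuming the build directory still exists; uninstall_source_build is a hypothetical helper, and whether the target also removes the training tools is not verified here.

```shell
# Hypothetical helper: undo a source install by running `make uninstall`
# in the directory where `make install` was run.
uninstall_source_build() {
  build_dir="$1"
  if [ ! -f "$build_dir/Makefile" ]; then
    echo "no Makefile in '$build_dir'; cannot run make uninstall" >&2
    return 1
  fi
  # sudo is needed because the files were installed with `sudo make install`.
  (cd "$build_dir" && sudo make uninstall)
}

# Example (path assumed from the build steps above):
# uninstall_source_build tesseract-5.1.0/build
```

Running this before `brew install tesseract` should leave fewer conflicting files for `brew link` to trip over.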
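However, reinstalling tesseract afterwards ran into a series of problems. The first attempt, a pip-style version pin, is not recognized by Homebrew: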
$ brew install tesseract==4.1.3
Warning: No available formula with the name "tesseract==4.1.3". Did you mean tesseract?
==> Searching for similarly named formulae...
This similarly named formula was found:
tesseract
To install it, run:
brew install tesseract
==> Searching for a previously deleted formula (in the last month)...
Error: No previously deleted formula found.
==> Searching taps on GitHub...
Error: No formulae found in taps.
$ brew install tesseract
==> Downloading https://ghcr.io/v2/homebrew/core/tesseract/manifests/4.1.3
Already downloaded: /Users/liumeng/Library/Caches/Homebrew/downloads/9597a8ae2cb676cd25c79cf252f4eb8759b9cf3d472c57f7c764e086c5f8f6e2--tesseract-4.1.3.bottle_manifest.json
==> Downloading https://ghcr.io/v2/homebrew/core/tesseract/blobs/sha256:1b67091dce98b42c6c561981a01738fe01c19ac69a1dc4de6d8e43fe885177f0
Already downloaded: /Users/liumeng/Library/Caches/Homebrew/downloads/cf8d3fbb1aea1cc629c6873a25b11d732c90ff23bfa4c44ba23d0ce5c24e907a--tesseract--4.1.3.big_sur.bottle.tar.gz
==> Pouring tesseract--4.1.3.big_sur.bottle.tar.gz
Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink include/tesseract/apitypes.h
/usr/local/include/tesseract is not writable.
You can try again using:
brew link tesseract
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
==> Summary
/usr/local/Cellar/tesseract/4.1.3: 65 files, 29.7MB
According to the error message, the next step is:
$ brew link tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink include/tesseract/apitypes.h
/usr/local/include/tesseract is not writable.
The directory left over from the earlier install has to be deleted first:
$ sudo rm -rf /usr/local/include/tesseract
Then try again:
$ brew link tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink share/tessdata/configs/alto
Target /usr/local/share/tessdata/configs/alto
already exists. You may want to remove it:
rm '/usr/local/share/tessdata/configs/alto'
To force the link and overwrite all conflicting files:
brew link --overwrite tesseract
To list all files that would be deleted:
brew link --overwrite --dry-run tesseract
The output offers three options.
I proceeded as follows:
$ sudo rm -rf /usr/local/share/tessdata/configs/alto
$ brew link --overwrite --dry-run tesseract
Would remove:
/usr/local/share/tessdata/configs/ambigs.train
...
/usr/local/lib/libtesseract.dylib -> /usr/local/lib/libtesseract.5.dylib
/usr/local/lib/pkgconfig/tesseract.pc
$ tesseract -v
zsh: command not found: tesseract
$ brew install tesseract
Updating Homebrew...
==> Auto-updated Homebrew!
Updated 1 tap (homebrew/cask).
==> Updated Casks
Updated 7 casks.
Warning: tesseract 4.1.3 is already installed, it's just not linked.
To link this version, run:
brew link tesseract
$ brew link --overwrite tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink share/tessdata/configs/alto
/usr/local/share/tessdata/configs is not writable.
Keep deleting:
$ sudo rm -rf /usr/local/share/tessdata/configs
$ brew link --overwrite tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink share/tessdata/tessconfigs/batch
/usr/local/share/tessdata/tessconfigs is not writable.
$ sudo rm -rf /usr/local/share/tessdata/tessconfigs
$ brew link --overwrite tesseract
Linking /usr/local/Cellar/tesseract/4.1.3... 12 symlinks created.
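Each retry above uncovered one more leftover path. Files that Homebrew links into its prefix are symlinks into the Cellar, so regular files under these directories are candidates for leftovers of the earlier `sudo make install`. A minimal sketch that lists them up front, so they can be reviewed before anything is deleted; list_real_files is a hypothetical helper, and the directories are the ones named in the errors above.

```shell
# Hypothetical helper: list regular files (not symlinks) under the given
# directories. Directories that do not exist are skipped silently.
list_real_files() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      # -type f matches regular files only; symlinks are not followed.
      find "$dir" -type f
    fi
  done
}

# Directories taken from the `brew link` errors in this post.
list_real_files \
  /usr/local/include/tesseract \
  /usr/local/share/tessdata/configs \
  /usr/local/share/tessdata/tessconfigs
```

An empty result suggests nothing is left to conflict in those directories; anything listed should be checked by hand before removal.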
Verify:
$ tesseract -v
tesseract 4.1.3
leptonica-1.82.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
The project builds normally again. Done!