The build instructions for Linux also apply to other UNIX-like operating systems.
Dependencies
A compiler for C and C++: GCC or Clang
GNU Autotools: autoconf, automake, libtool
autoconf-archive
pkg-config
Leptonica
libpng, libjpeg, libtiff
Ubuntu
If they are not already installed, install the following packages (Ubuntu 16.04/14.04):
sudo apt-get install g++  # or: clang
sudo apt-get install autoconf automake libtool
sudo apt-get install autoconf-archive
sudo apt-get install pkg-config
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
If you plan to build the training tools, you also need the following libraries:
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
Leptonica
You also need to install Leptonica. Ensure that the development headers for Leptonica are installed before compiling Tesseract.
Tesseract versions and the minimum version of Leptonica required:
Tesseract   Leptonica   Ubuntu
4.00        1.74.2      Must build from source
3.05        1.74.0      Must build from source
3.04        1.71        Ubuntu 16.04
3.03        1.70        Ubuntu 14.04
3.02        1.69        Ubuntu 12.04
3.01        1.67
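The table above can be turned into a small lookup. The following is a minimal Python sketch, not part of the Tesseract tooling: the names `parse_version`, `min_leptonica`, and `is_new_enough` are hypothetical, and the version pairs are copied from the table. The installed Leptonica version is assumed to be obtained separately, for example from the package manager.

```python
# Minimum Leptonica version per Tesseract release, copied from the table above.
MIN_LEPTONICA = {
    "4.00": "1.74.2",
    "3.05": "1.74.0",
    "3.04": "1.71",
    "3.03": "1.70",
    "3.02": "1.69",
    "3.01": "1.67",
}


def parse_version(text):
    """Turn '1.74.2' into (1, 74, 2); pad short versions so '1.74' == '1.74.0'."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def min_leptonica(tesseract_version):
    """Return the minimum Leptonica version string for a Tesseract release."""
    return MIN_LEPTONICA[tesseract_version]


def is_new_enough(installed, tesseract_version):
    """True if the installed Leptonica meets the minimum for that release."""
    required = parse_version(min_leptonica(tesseract_version))
    return parse_version(installed) >= required


if __name__ == "__main__":
    # 1.74.4 is the Leptonica version built from source in this post.
    print(is_new_enough("1.74.4", "4.00"))
    # 1.71 (the Ubuntu 16.04 row) is older than the 4.00 minimum.
    print(is_new_enough("1.71", "4.00"))
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where "1.9" would sort after "1.74".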
One option is to install the distro’s Leptonica package:
sudo apt-get install libleptonica-dev
If you are using an older Linux release, however, the packaged Leptonica may be too old, and you will need to build it from source.
The sources are at https://github.com/DanBloomberg/leptonica. Build instructions are given in the Leptonica README.
Leptonica 1.74.4
git clone --recursive https://github.com/DanBloomberg/leptonica
cd leptonica
./autobuild
./configure
./make-for-auto
make
sudo make install
Tesseract 4.0
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract
cd tesseract
./autogen.sh
./configure --enable-debug
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
sudo make install
sudo ldconfig
Build the training tools if you like:
make training
sudo make training-install
Test Tesseract
$ tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfiles...]
$ tesseract 1.jpg 1 -l chi_sim    # outputbase is "1"; Tesseract writes 1.txt
Error (when the language data file cannot be found):
Error opening data file /usr/local/share/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Fix: download the language data
https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
https://github.com/tesseract-ocr/tessdata #These language data files only work with Tesseract 4.0
sudo mv eng.traineddata /usr/share/tesseract-ocr/tessdata
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata
or
sudo mv /usr/local/share/tessdata /usr/local/share/tessdata.bak
sudo ln -s /usr/share/tesseract-ocr/tessdata /usr/local/share/
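The error above means Tesseract did not find `eng.traineddata` in the directory it searched. Before moving files or creating symlinks, it can help to check which candidate directory actually contains the file. A minimal Python sketch, assuming the two directories used in this post; the helper name `find_traineddata` is hypothetical, and the exact directory that `TESSDATA_PREFIX` should point to can differ between Tesseract versions:

```python
import os


def find_traineddata(lang, directories):
    """Return the first path to <lang>.traineddata found in directories, else None."""
    for directory in directories:
        candidate = os.path.join(directory, lang + ".traineddata")
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Candidate locations: the environment variable first (if set),
    # then the two directories mentioned in this post.
    candidates = []
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        candidates.append(prefix)
    candidates += [
        "/usr/share/tesseract-ocr/tessdata",
        "/usr/local/share/tessdata",
    ]
    print(find_traineddata("eng", candidates))
```

If this prints `None`, the language file is in none of the listed directories and still needs to be downloaded.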
Install the Python interface
sudo pip install pytesseract
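For example, to recognize Chinese text and digits:
import pytesseract
from PIL import Image
config = '--psm 7 digits'  # page segmentation mode 7 (single text line) plus the "digits" config file
ss = pytesseract.image_to_string(Image.open('1.jpg'), lang='chi_sim', config=config)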
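Modify the config file
When the digits argument is used to recognize digits and you also want letters to be recognized, edit the config file in the system Tesseract data directory: /usr/share/tesseract-ocr/tessdata/configs/digits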
Default:
tessedit_char_whitelist 0123456789-.
To also recognize uppercase letters:
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-.
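Editing the shared digits file changes the behavior for everyone who uses that file. An alternative is to pass the whitelist per call with Tesseract's `-c VAR=VALUE` option through the pytesseract `config` string. A minimal sketch: the helper name `whitelist_config` is hypothetical, and whether the whitelist is honored may depend on the Tesseract version and engine mode, so verify it on your own build.

```python
import string


def whitelist_config(chars, psm=7):
    """Build a pytesseract config string that restricts output to chars.

    chars must not contain whitespace, because the config string is
    split into separate command-line arguments before tesseract runs.
    """
    if any(c.isspace() for c in chars):
        raise ValueError("whitelist must not contain whitespace")
    return "--psm {} -c tessedit_char_whitelist={}".format(psm, chars)


# Digits, '-' and '.', matching the default digits config file.
DIGITS_ONLY = string.digits + "-."
# Uppercase letters plus the digit set, matching the modified file.
LETTERS_AND_DIGITS = string.ascii_uppercase + DIGITS_ONLY

if __name__ == "__main__":
    print(whitelist_config(LETTERS_AND_DIGITS))
```

The result would be passed as `config=whitelist_config(LETTERS_AND_DIGITS)` to `pytesseract.image_to_string`, leaving the system config file untouched.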
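PS: Recognition is poor when the image is too large; scaling it down to roughly the size of a fingernail actually gives better results.
Links: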
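The observation above suggests shrinking large images before OCR. A minimal sketch using Pillow, which pytesseract already relies on; the helper names and the 300-pixel limit are assumptions for illustration, not tuned values, and the best size depends on the height of the text in the image:

```python
def scaled_size(width, height, max_side):
    """Return (width, height) scaled down so the longer side is at most max_side.

    Sizes that already fit are returned unchanged; the aspect ratio is kept.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    factor = max_side / float(longest)
    new_width = max(1, int(round(width * factor)))
    new_height = max(1, int(round(height * factor)))
    return new_width, new_height


def shrink_for_ocr(image, max_side=300):
    """Downscale a PIL image before passing it to pytesseract."""
    from PIL import Image  # imported here so scaled_size works without Pillow

    new_size = scaled_size(image.width, image.height, max_side)
    if new_size == (image.width, image.height):
        return image
    return image.resize(new_size, Image.LANCZOS)
```

A call such as `pytesseract.image_to_string(shrink_for_ocr(image), lang='chi_sim')` would then run OCR on the reduced image; trying a few values of `max_side` is the simplest way to find what works for a given set of pictures.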
https://www.youtube.com/watch?v=vOdnt2h1U8U
https://lengerrong.blogspot.com/2017/03/how-to-build-latest-tesseract-leptonica.html
https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation
https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux
https://lucacerone.net/2017/install-tesseract-3-0-5-in-ubuntu-16-04/