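A Quick Guide to the Open-Source OCR Engine Tesseract
Introduction
OCR
Optical character recognition (OCR) is the process of scanning text documents and then analyzing the resulting image files to extract the text and layout information. OCR used to be a specialist technology, used mostly by people working in the printing industry. With the rapid development of artificial intelligence, it is now widely applied in everyday business scenarios to make workflows more efficient, for example document scanning, ID card recognition, and extracting text from images.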
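Tesseract
The Tesseract OCR engine was originally developed at HP Labs starting in 1985, and by 1995 it ranked among the three most accurate OCR engines in the industry. Soon afterwards, however, HP decided to exit the OCR business, and Tesseract was shelved.
Years later, HP concluded that rather than leaving Tesseract on the shelf it would be better to contribute it to the open-source community and give it a new life. In 2005 Tesseract was taken over by the Information Science Research Institute at the University of Nevada, Las Vegas, which asked Google to help improve it, fix bugs, and optimize it.
Note that Chinese is supported only from Tesseract 3.0 onward. According to the official documentation, version 4.0 (available as an alpha since around January 2017) significantly improves recognition accuracy, at the cost of higher resource consumption.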
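Use cases
One scenario I have experienced myself: opening an account used to mean bringing your ID card and making photocopies of everything as physical records. With OCR and related AI techniques, a phone camera can read the ID card details and store the personal information directly, which makes opening an account quick and simple.
Another common scenario is turning handwritten articles into electronic form. This used to require dedicated scanning hardware. Now that OCR has reached mainstream applications, a handheld device is enough to scan paper documents. [I have tried the document-scanning feature in Youdao Cloud Notes, and it works quite well.]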
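Deployment and installation
Tesseract depends on Leptonica, an open-source, pedagogically oriented library that is commonly used as a low-level foundation for image processing and image analysis.
1. Installing with yum
The EPEL repository for CentOS 7 ships a Tesseract 3.05 package that can be installed and used directly. At the time of writing, the latest release in the official Tesseract GitHub project is only 3.05.01.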
$ sudo yum install epel-release -y
# Inspect the EPEL repository definition
$ cat /etc/yum.repos.d/epel.repo
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=$basearch
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch/debug
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-debug-7&arch=$basearch
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
#baseurl=http://download.fedoraproject.org/pub/epel/7/SRPMS
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-source-7&arch=$basearch
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
# Install tesseract and its related library dependencies
$ sudo yum install tesseract tesseract-devel -y
# Install the Chinese language packs (Simplified and Traditional)
$ sudo yum install -y tesseract-langpack-chi_sim.noarch tesseract-langpack-chi_tra.noarch
# List the default configuration and language data
$ ls /usr/share/tesseract/tessdata
chi_sim.traineddata chi_tra.traineddata
# Only English (eng) is supported by default; after installing the Chinese packs, list the supported languages
$ tesseract --list-langs
List of available languages (3):
eng
chi_sim
chi_tra
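The language check above can also be scripted. The following is a minimal sketch, not part of the original walkthrough: `missing_langs` is a hypothetical helper that reads the `tesseract --list-langs` output on stdin and prints each requested language code that is not listed.

```shell
# missing_langs LANG...
# Reads the output of `tesseract --list-langs` on stdin and prints every
# requested language code that does NOT appear in it, one per line.
missing_langs() {
    ml_listing=$(cat)
    for ml_lang in "$@"; do
        # -x matches the whole line, -F treats the code as a literal string
        if ! printf '%s\n' "$ml_listing" | grep -qxF -- "$ml_lang"; then
            printf '%s\n' "$ml_lang"
        fi
    done
}

# Usage against a real installation (2>&1 in case the list is written to stderr):
#   tesseract --list-langs 2>&1 | missing_langs eng chi_sim chi_tra
```

An empty result means every requested language pack is installed.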
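2. Building the latest version from source (4.00.00alpha)
According to the official site, 4.0 improves considerably on 3.x, most notably in recognition accuracy. No official 4.0 release package is available yet, though, so the following steps build Tesseract 4.00.00alpha by hand from the GitHub source. Because Tesseract depends on Leptonica, that library has to be built first.
Note: the host is again a clean CentOS 7.3.1611 system.
# Install the basic build dependencies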
# sudo yum install gcc git gcc-c++ make automake libtool libpng-devel libjpeg-devel libtiff-devel zlib-devel -y
# Install leptonica-1.74.4 (the tarball unpacks into leptonica-1.74.4/)
# wget http://www.leptonica.org/source/leptonica-1.74.4.tar.gz && tar -zxf leptonica-1.74.4.tar.gz && cd leptonica-1.74.4 && ./configure && make && make install
........
/usr/bin/mkdir -p '/usr/local/lib'
/bin/sh ../libtool --mode=install /usr/bin/install -c liblept.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/liblept.so.5.0.1 /usr/local/lib/liblept.so.5.0.1
libtool: install: (cd /usr/local/lib && { ln -s -f liblept.so.5.0.1 liblept.so.5 || { rm -f liblept.so.5 && ln -s liblept.so.5.0.1 liblept.so.5; }; })
libtool: install: (cd /usr/local/lib && { ln -s -f liblept.so.5.0.1 liblept.so || { rm -f liblept.so && ln -s liblept.so.5.0.1 liblept.so; }; })
libtool: install: /usr/bin/install -c .libs/liblept.lai /usr/local/lib/liblept.la
libtool: install: /usr/bin/install -c .libs/liblept.a /usr/local/lib/liblept.a
libtool: install: chmod 644 /usr/local/lib/liblept.a
libtool: install: ranlib /usr/local/lib/liblept.a
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin" ldconfig -n /usr/local/lib
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/usr/bin/mkdir -p '/usr/local/include/leptonica'
/usr/bin/mkdir -p '/usr/local/bin'
/bin/sh ../libtool --mode=install /usr/bin/install -c convertfilestopdf convertfilestops convertformat convertsegfilestopdf convertsegfilestops converttopdf converttops fileinfo printimage printsplitimage printtiff splitimage2pdf xtractprotos '/usr/local/bin'
......
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/export/servers/leptonica-1.74.4/prog'
make[1]: Leaving directory `/export/servers/leptonica-1.74.4/prog'
make[1]: Entering directory `/export/servers/leptonica-1.74.4'
make[2]: Entering directory `/export/servers/leptonica-1.74.4'
make[2]: Nothing to be done for `install-exec-am'.
/usr/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m 644 lept.pc '/usr/local/lib/pkgconfig'
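# The output above shows that Leptonica built and installed successfully, and indicates where the shared libraries and headers were placed. Note that the shared libraries in /usr/local/lib must be made visible to the dynamic linker before use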
# ldconfig -n /usr/local/lib
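The libtool notice above also mentions adding the library directory to the linker configuration. As a hedged sketch of that option (it is not what the original walkthrough does), the hypothetical helper `register_libdir` below writes the directory into a drop-in file; the file name `local-ocr-libs.conf` is made up for illustration, and `/etc/ld.so.conf.d` is assumed to be the drop-in directory, as on CentOS 7.

```shell
# register_libdir LIBDIR [CONF_DIR]
# Appends LIBDIR to a drop-in file under CONF_DIR (default /etc/ld.so.conf.d)
# unless it is already listed there.
register_libdir() {
    rl_libdir=$1
    rl_confdir=${2:-/etc/ld.so.conf.d}
    rl_file="$rl_confdir/local-ocr-libs.conf"   # hypothetical file name
    # Skip the write if the directory is already registered
    if [ -f "$rl_file" ] && grep -qxF -- "$rl_libdir" "$rl_file"; then
        return 0
    fi
    printf '%s\n' "$rl_libdir" >> "$rl_file"
}

# As root, register the directory and rebuild the linker cache:
#   register_libdir /usr/local/lib && ldconfig
#   ldconfig -p | grep liblept    # should now list liblept.so.5
```

Unlike `ldconfig -n`, which only processes the directory given on the command line, this keeps the setting across reboots and cache rebuilds.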
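# Install tesseract. The project tags each version on GitHub, so check out the tag for the version you want before building
# (biaoge is just a local branch name; a fresh git checkout has no configure script, so run ./autogen.sh first)
# git clone https://github.com/tesseract-ocr/tesseract.git && cd tesseract && git checkout -b biaoge 4.00.00alpha && ./autogen.sh && ./configure && make && make install
/usr/bin/mkdir -p '/usr/local/include/tesseract'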
/usr/bin/mkdir -p '/usr/local/lib'
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin" ldconfig -n /usr/local/lib
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/usr/bin/mkdir -p '/usr/local/bin'
/bin/sh ../libtool --mode=install /usr/bin/install -c tesseract '/usr/local/bin'
libtool: install: /usr/bin/install -c .libs/tesseract /usr/local/bin/tesseract
/usr/bin/mkdir -p '/usr/local/include/tesseract'
/usr/bin/install -c -m 644 apitypes.h baseapi.h capi.h renderer.h '/usr/local/include/tesseract'
/usr/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m 644 tesseract.pc '/usr/local/lib/pkgconfig'
/usr/bin/mkdir -p '/usr/local/share/tessdata/configs'
/usr/bin/install -c -m 644 inter makebox box.train unlv ambigs.train api_config kannada box.train.stderr quiet logfile digits hocr tsv linebox pdf rebox strokewidth bigram txt '/usr/local/share/tessdata/configs'
/usr/bin/mkdir -p '/usr/local/share/tessdata/tessconfigs'
/usr/bin/install -c -m 644 batch batch.nochop nobatch matdemo segdemo msdemo '/usr/local/share/tessdata/tessconfigs'
/usr/bin/mkdir -p '/usr/local/share/tessdata'
/usr/bin/install -c -m 644 pdf.ttf '/usr/local/share/tessdata'
/usr/bin/mkdir -p '/usr/local/share/man/man1'
/usr/bin/install -c -m 644 cntraining.1 combine_tessdata.1 mftraining.1 tesseract.1 unicharset_extractor.1 wordlist2dawg.1 ambiguous_words.1 shapeclustering.1 dawg2wordlist.1 '/usr/local/share/man/man1'
/usr/bin/mkdir -p '/usr/local/share/man/man5'
/usr/bin/install -c -m 644 unicharambigs.5 unicharset.5 '/usr/local/share/man/man5'
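# The output above likewise shows that tesseract was installed under /usr/local (libraries in /usr/local/lib), and lists the related configuration paths and notes. As before, the shared libraries must be made visible to the dynamic linker, otherwise programs may fail to find them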
# ldconfig -n /usr/local/lib
# Installation succeeded; the tesseract command is now available (in /usr/local/bin/ by default)
# tesseract --version
tesseract 4.00.00alpha
leptonica-1.74.4
# List the currently supported languages
# tesseract --list-langs
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
# The message is clear: the TESSDATA_PREFIX environment variable has to be set
# Set the environment variable and download the language data
# export TESSDATA_PREFIX=/usr/local/share/
# cd $TESSDATA_PREFIX/tessdata
# for font in chi_sim chi_tra eng ;do wget https://github.com/tesseract-ocr/tessdata/raw/master/$font.traineddata;done
# List the supported languages again
# tesseract --list-langs
List of available languages (3):
chi_sim
eng
chi_tra
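Before running `tesseract --list-langs`, it can help to confirm that the downloads actually landed where the engine looks. The sketch below assumes the layout used in this walkthrough, where `TESSDATA_PREFIX` is the parent of the `tessdata` directory (other Tesseract versions may expect a different layout, so check yours); `check_tessdata` is a hypothetical helper name.

```shell
# check_tessdata PREFIX LANG...
# Prints the path of every missing or empty <PREFIX>/tessdata/<LANG>.traineddata
# file and returns non-zero if at least one is missing.
check_tessdata() {
    ct_dir="${1%/}/tessdata"    # tolerate a trailing slash in PREFIX
    shift
    ct_status=0
    for ct_lang in "$@"; do
        ct_file="$ct_dir/$ct_lang.traineddata"
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$ct_file" ]; then
            echo "missing: $ct_file"
            ct_status=1
        fi
    done
    return "$ct_status"
}

# Usage with the prefix exported above:
#   check_tessdata "$TESSDATA_PREFIX" eng chi_sim chi_tra
```

The non-empty test also catches a download that was interrupted before any data was written.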
At this point the tesseract command works normally, so the build of the Tesseract environment can be considered complete. Next, let's look at how to actually use it.
Basic usage
$ tesseract TB1L8SKhwnH8KJjSspcSuv3QFXa.jpg test -l chi_sim
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
$ cat test.txt
走心特惠
不玩套路
As the output shows, reasonably regular typefaces are recognized correctly.
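The same `tesseract imagename outputbase -l lang` invocation extends naturally to a whole directory of images. This is a rough sketch rather than anything from the original post: `ocr_dir` and the `OCR_CMD` override are hypothetical names, and the override exists only so the loop can be tried without Tesseract installed.

```shell
# ocr_dir DIR [LANG]
# Runs OCR on every .jpg and .png file in DIR and writes <name>.txt next to
# each image. Set OCR_CMD to substitute another command (e.g. echo for a dry run).
ocr_dir() {
    od_dir=$1
    od_lang=${2:-chi_sim}
    for od_img in "$od_dir"/*.jpg "$od_dir"/*.png; do
        # An unmatched glob stays literal, so skip names that do not exist
        [ -e "$od_img" ] || continue
        # ${od_img%.*} strips the extension; tesseract appends .txt itself
        "${OCR_CMD:-tesseract}" "$od_img" "${od_img%.*}" -l "$od_lang"
    done
}

# Usage:
#   ocr_dir ./images chi_sim
#   OCR_CMD=echo; ocr_dir ./images chi_sim    # print the commands only
```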
Problems you may run into
# Test run
# tesseract 5a0178d2N01754832.jpg test -l chi_sim
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
pixReadStreamJpeg: function not present
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
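# The errors above appear because Leptonica was built without support for some image formats (here JPEG, as the pixReadStreamJpeg message indicates)
# yum install ImageMagick -y
# Rebuild Leptonica (adding the --with-libpng option)
# cd /export/servers/leptonica-1.74 && ./configure --with-libpng && make && make install
# Test again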
$ tesseract TB1L8SKhwnH8KJjSspcSuv3QFXa.jpg test -l chi_sim
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
$ cat test.txt
走心特惠
不玩套路
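One way to spot this problem before running OCR is to look at the version banner. The sketch below rests on an assumption that should be verified on your build: when Leptonica is compiled with image libraries, `tesseract --version` typically lists them (libjpeg, libpng, and so on) after the Leptonica version, whereas the banner shown earlier contains only `leptonica-1.74.4`. `has_image_lib` is a hypothetical helper.

```shell
# has_image_lib NAME
# Reads `tesseract --version` output on stdin and succeeds if NAME
# (for example libjpeg or libpng) appears anywhere in it.
has_image_lib() {
    # -i: case-insensitive, -F: literal string, -q: exit status only
    grep -qiF -- "$1"
}

# Usage:
#   tesseract --version 2>&1 | has_image_lib libjpeg \
#       || echo "Leptonica appears to lack JPEG support"
```

If the check fails, install the matching -devel package (such as libjpeg-devel) before re-running Leptonica's ./configure, since format support is detected at configure time.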
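Bonus
Since the Tesseract project does not yet provide official 4.0 packages, and building the environment is somewhat involved, I have packaged the environment above as a Docker image for anyone to use. Usage is as follows:
Prerequisite: a working Docker environment; any version will do.
# Pull the image (it is fairly large, about 1.6 GB)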
# docker pull xxbandy123/tesseract-ocr4.00.00alpha:17-12-05
# Start the tesseract environment
# docker run -itd xxbandy123/tesseract-ocr4.00.00alpha:17-12-05
2b4df1a1e9d20426aefa32ba79066b561fe66c623986b76634012ac9cae40e64
# Enter the container and try it out
# docker exec -it $(docker ps -a -q -l) bash
[root@2b4df1a1e9d2 /]# tesseract
Usage:
tesseract --help | --help-psm | --version
tesseract --list-langs [--tessdata-dir PATH]
tesseract --print-parameters [options...] [configfile...]
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
# Test OCR inside the container
# wget https://img.alicdn.com/tfs/TB16FK4SXXXXXXUXXXXXXXXXXXX-790-180.jpg
[root@2b4df1a1e9d2 /]# tesseract TB16FK4SXXXXXXUXXXXXXXXXXXX-790-180.jpg test -l chi_sim
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
[root@2b4df1a1e9d2 /]# cat test.txt
一天猫电器全新服务保障
胁 售后无忧
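As this shows, open-source Tesseract cannot yet recognize the text in every image reliably. Using it for real business scenarios would likely require further customization on top of the open-source engine.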
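For occasional use, the Docker image from the Bonus section could also be run as a one-shot command instead of a long-lived container. The following is only a sketch under assumptions: `/data` is an arbitrary mount point chosen here, the image is assumed not to define an entrypoint that interferes with the command, and `docker_ocr` and the `DOCKER` override are hypothetical names.

```shell
# docker_ocr IMAGE_FILE [LANG]
# OCRs IMAGE_FILE (relative to the current directory) in a throwaway
# container; the result is written to <name>.txt in the current directory.
docker_ocr() {
    do_img=$1
    do_lang=${2:-chi_sim}
    do_image=xxbandy123/tesseract-ocr4.00.00alpha:17-12-05
    # --rm removes the container afterwards; -v mounts the current
    # directory at /data (an assumed path); -w makes it the working directory
    "${DOCKER:-docker}" run --rm -v "$PWD:/data" -w /data "$do_image" \
        tesseract "$do_img" "${do_img%.*}" -l "$do_lang"
}

# Usage:
#   docker_ocr TB16FK4SXXXXXXUXXXXXXXXXXXX-790-180.jpg
#   DOCKER=echo; docker_ocr a.jpg    # print the docker arguments only
```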