Web Scraping & Image Recognition

Extracting Text from Images (Mac + Python 3.8 + pytesseract)

2021-07-02  乐观的星辰

1. Install tesseract on Mac (via brew)
a. pip3 install tesseract -- avoid this
pip3 install tesseract
Collecting tesseract
  Downloading https://files.pythonhosted.org/packages/8d/b7/c4fae9af5842f69d9c45bf1195a94aec090628535c102894552a7a7dbe6c/tesseract-0.1.3.tar.gz (45.6MB)
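Pitfall: this PyPI package does not work under Python 3. You have to patch it by commenting out its print statements, reinstall ConfigParser, and edit the places where the package creates its ConfigParser class. Even after all of that, further module dependencies turn out to be missing.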

The right way
brew install tesseract
Pitfall: libtiff fails to install because the bottle download URL is unavailable
Trying a mirror...
==> Downloading https://ghcr.io/v2/homebrew/core/bottles/libtiff-4.2.0.big_sur.bottle.tar.gz
==> Downloading from https://github.com/-/v2/packages/container/package/homebrew%2Fcore%2Fbottles%2Flibtiff-4.2.0.big_sur.bottle.tar.gz
Warning: Transient problem: timeout Will retry in 1 seconds. 3 retries left.
Warning: Transient problem: timeout Will retry in 2 seconds. 2 retries left.
Warning: Transient problem: timeout Will retry in 4 seconds. 1 retries left.
curl: (22) The requested URL returned error: 404

Switch to a manual install of libtiff:
Source: https://github.com/vadz/libtiff
% ./configure
% make
% su
# make install

After that, running brew install tesseract again goes through smoothly.
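
Before moving on, it can help to confirm that the tesseract binary is actually reachable from Python. The following is a minimal sketch using only the standard library; the helper name tesseract_version is made up for this example and is not part of any library.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version text may arrive on stdout or stderr
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", the brew install most likely did not complete, or the brew bin directory is not on PATH for the interpreter you are using.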

2. Python packages
pip3 install Pillow
pip3 install pytesseract
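
A quick way to check that both packages landed in the interpreter you intend to use is to ask Python whether the modules can be found. This is a small sketch; missing_modules is a hypothetical helper name. Note that the import name for the imaging library is PIL, which is provided by the Pillow distribution.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means both imports used by the test script should work
    print(missing_modules(["PIL", "pytesseract"]))
```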

3. Download the language data for text recognition
URL: https://github.com/tesseract-ocr/tessdata
Download: chi_sim.traineddata
Save to: /usr/local/Cellar/tesseract/4.0.0/share/tessdata (replace 4.0.0 with your installed version)
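
A wrong tessdata directory is an easy mistake here, so it may be worth checking that the file ended up where tesseract looks for it. The sketch below only tests for the file on disk; has_language is a hypothetical helper, and the directory you pass in is whatever path matches your own install.

```python
from pathlib import Path


def has_language(tessdata_dir, lang="chi_sim"):
    """Return True if <lang>.traineddata exists inside tessdata_dir."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Path taken from the step above; adjust the version to your install
    print(has_language("/usr/local/Cellar/tesseract/4.0.0/share/tessdata"))
```

Running tesseract --list-langs in a terminal is another way to confirm that chi_sim is picked up.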

4. Test script

# -*- coding: utf-8 -*-

"""
@Project :xxxxx
@Time : 2021/7/1 4:09 PM
@Auth : 肖彬
@File :Image_test_data
@IDE :PyCharm

"""
from PIL import Image
import pytesseract


def image_to_str(image_path):
    # Open the image and run OCR with the Simplified Chinese model
    image = Image.open(image_path)
    words = pytesseract.image_to_string(image, lang='chi_sim')
    # Per-word boxes and confidences as TSV text (default language here)
    aa = pytesseract.image_to_data(image)
    print(words, aa)


if __name__ == '__main__':
    image_path_001 = '/Users/xiaobin/Downloads/image_test/a.png'
    image_path_002 = '/Users/xiaobin/Downloads/image_test/b.jpeg'
    image_to_str(image_path_002)
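
The string returned by image_to_data is tab-separated text with one row per detected element, including conf and text columns. As a rough sketch of how that output could be filtered, the hypothetical helper below keeps only rows that contain text and meet a confidence threshold. It assumes the default string output of image_to_data, not the dict or DataFrame output types.

```python
import csv
import io


def words_with_confidence(tsv_text, min_conf=0.0):
    """Parse image_to_data TSV text into a list of (word, confidence) pairs."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pairs = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip rows whose confidence is not numeric
        # Structural rows (page, block, line) have no text and conf -1
        if text and conf >= min_conf:
            pairs.append((text, conf))
    return pairs
```

Inside image_to_str this could be called as words_with_confidence(aa, min_conf=60) to drop low-confidence words; the threshold of 60 is an arbitrary example value.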
