Python-140 Extract Text from Images

2022-03-23 RashidinAbdu

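Problem: I needed to organize material from some conferences I had attended, but found that I had only saved image files or screenshots of posters. So what to do? On top of that, recognizing English and Chinese characters works somewhat differently, and the extracted text then has to be merged into a single txt file for later sorting. The implementation is as follows: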

Extracting English text: English letters and digits in an image can generally be extracted accurately. The steps are:

First, install Tesseract.
Download the installer through the link on this page:
https://www.simplifiedpython.net/how-to-extract-text-from-image-in-python/
On that page, just click the link labeled "Download tesseract from this link".

After installation, the executable can be found at:
C:\Program Files\Tesseract-OCR\tesseract.exe
Then install the following package:

pip install pytesseract
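
Before running any OCR, it can help to confirm that Python can actually find the Tesseract executable. The following is a minimal sketch and not part of the original steps; the helper name `find_tesseract` and the candidate list are assumptions made for illustration. It checks the PATH first and then falls back to the install location given above.

```python
import shutil
from pathlib import Path

# Assumption: the install location mentioned in this post
DEFAULT_CANDIDATES = (r"C:\Program Files\Tesseract-OCR\tesseract.exe",)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return a path to the tesseract executable, or None if it is not found."""
    # First see whether "tesseract" is already on the PATH
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try each candidate location in order
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None
```

If this returns a path, it can be assigned to `pytesseract.pytesseract.tesseract_cmd`; if it returns None, Tesseract is most likely not installed yet or was installed somewhere else.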

The image is:

image.png

Then run the following code:

# Import modules
from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Create an image object with the PIL library
image = Image.open(r'C:\Users\Administrator\Desktop\捕获.PNG')

# Pass the image to pytesseract; lang='eng' selects the English model
image_to_text = pytesseract.image_to_string(image, lang='eng')

# Print the text
print(image_to_text)

# Write the result to a txt file (the file is closed automatically)
with open(r'C:\Users\Administrator\Desktop\all.txt', 'w', encoding='utf-8') as f:
    print(image_to_text, file=f)

The output is:

image.png

That is, the text in the image was extracted successfully.
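
Since the goal is to process a whole set of saved screenshots, the single-image steps can be wrapped in a loop over a folder. This is a rough sketch under assumptions: the function names `tesseract_ocr` and `ocr_folder_to_txt`, the list of image extensions, and the "=== name ===" separator are all made up for illustration. The OCR step is passed in as a parameter so the loop itself can be tried without Tesseract installed.

```python
from pathlib import Path

# Assumption: these extensions cover the saved screenshots
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def tesseract_ocr(image_path, lang="eng"):
    """OCR one image with pytesseract (needs Pillow, pytesseract and Tesseract)."""
    # Imported here so the folder loop below also works without these packages
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def ocr_folder_to_txt(image_dir, out_path, ocr=tesseract_ocr):
    """Run `ocr` on every image in image_dir and write all text to out_path."""
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for path in images:
            # Label each section with the name of the source image
            out.write(f"=== {path.name} ===\n")
            out.write(ocr(path).strip() + "\n\n")
    return len(images)
```

On Windows, `pytesseract.pytesseract.tesseract_cmd` still has to be set as shown earlier before calling `ocr_folder_to_txt` with the default `ocr`, unless Tesseract is on the PATH.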

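Extracting Chinese text:
  A language pack is required. Download the Simplified Chinese one (chi_sim.traineddata) from the page below, then put the downloaded file directly into the tessdata folder under the installation directory.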

Traineddata Files for Version 4.00 + | tessdoc (tesseract-ocr.github.io)

image.png

For anything else, see: Fixing pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Program Files\Tesseract- - Tencent Cloud Community (tencent.com)
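
For Chinese, the call is the same as for English except for the language code. Below is a minimal sketch, assuming the Simplified Chinese model file is named chi_sim.traineddata and sits in the tessdata folder under the install location given earlier; the helper names `has_language` and `extract_chinese` are hypothetical. Checking for the file first gives a clearer message than the "Error opening data file" error mentioned above.

```python
from pathlib import Path

# Assumption: the tessdata folder under the install location given in this post
TESSDATA_DIR = r"C:\Program Files\Tesseract-OCR\tessdata"


def has_language(lang, tessdata_dir=TESSDATA_DIR):
    """Return True if <lang>.traineddata exists in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def extract_chinese(image_path, tessdata_dir=TESSDATA_DIR):
    """OCR an image as Simplified Chinese, failing early if the model is missing."""
    if not has_language("chi_sim", tessdata_dir):
        raise FileNotFoundError(f"chi_sim.traineddata not found in {tessdata_dir}")
    # Imported here because they need Pillow, pytesseract and Tesseract installed
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang="chi_sim")
```

For images that mix Chinese and English, Tesseract also accepts combined language codes such as `lang="chi_sim+eng"`, which may be worth trying.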

Merging multiple txt files:


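import os

# Folder that contains the txt files to merge
mergefiledir = "D:\\GRAD_COURSES\\Ph.D_Publications\\2021_Publications\\KCTC-Deposition\\1"

filenames = os.listdir(mergefiledir)

# Append every file to one output file, with a newline after each
# (assumes the input files are UTF-8 encoded)
with open('D:\\GRAD_COURSES\\Ph.D_Publications\\2021_Publications\\KCTC-Deposition-Names.txt', 'w', encoding='utf-8') as file:
    for filename in filenames:
        filepath = os.path.join(mergefiledir, filename)
        with open(filepath, encoding='utf-8') as f:
            file.writelines(f)
        file.write('\n')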
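
The merge step can also be written as a reusable function. This is a sketch under assumptions rather than the method used above: the name `merge_txt_files` is hypothetical, the files are assumed to share one encoding (UTF-8 by default), and only files ending in .txt are merged, in name order. It also skips the output file itself in case it is written into the same folder.

```python
from pathlib import Path


def merge_txt_files(source_dir, out_path, encoding="utf-8"):
    """Concatenate every .txt file in source_dir into out_path, in name order."""
    out_path = Path(out_path)
    sources = sorted(
        p for p in Path(source_dir).glob("*.txt")
        # Skip the output file itself if it lives in the same folder
        if p.resolve() != out_path.resolve()
    )
    with open(out_path, "w", encoding=encoding) as out:
        for path in sources:
            out.write(path.read_text(encoding=encoding))
            # Newline after each file so the contents do not run together
            out.write("\n")
    return [p.name for p in sources]
```

The returned list of file names makes it easy to check which files actually went into the merged output.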
