3. How to extract text from PDFs
2019-01-11
BigBigGuy
Using wand, pillow and tesseract
Note: the PDF must have a white background, otherwise the text will not be recognized.
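If a page has a dark background, one possible workaround is to invert the page image before handing it to Tesseract. The snippet below is only a sketch: the helper name `ensure_light_background` and the brightness threshold of 128 are assumptions for illustration, not part of the original post, and real scans may need more careful preprocessing.

```python
from PIL import Image, ImageOps, ImageStat


def ensure_light_background(im, threshold=128):
    """Return a grayscale copy of `im`, inverted if it looks dark overall.

    `threshold` is a hypothetical cut-off on the mean pixel brightness
    (0 = black, 255 = white); tune it for your own documents.
    """
    gray = im.convert("L")
    # Mean brightness of the single grayscale band.
    mean_brightness = ImageStat.Stat(gray).mean[0]
    if mean_brightness < threshold:
        # Mostly dark page: flip it so the background becomes light.
        gray = ImageOps.invert(gray)
    return gray
```

The returned image can be passed to `pytesseract.image_to_string` in place of the original page image.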
This really just converts each PDF page to a JPEG and then runs OCR on it. It simply combines the previous two posts, easy job!
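import io  # the extra import this time: wraps each JPEG blob for PIL

from PIL import Image
import pytesseract
from wand.image import Image as wi

# Read the PDF at 300 DPI and convert its pages to JPEG.
pdf = wi(filename='jun.pdf', resolution=300)
pdfImg = pdf.convert('jpeg')

# Collect one JPEG blob per page.
imgBlobs = []
for img in pdfImg.sequence:
    page = wi(image=img)
    imgBlobs.append(page.make_blob('jpeg'))

# Run OCR on each page; 'chi_sim' selects Simplified Chinese.
extracted_text = []
for imgBlob in imgBlobs:  # renamed so the loop variable no longer shadows the list
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='chi_sim')
    extracted_text.append(text)

# Print the text recognized on the first page.
print(extracted_text[0])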
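The code above only prints the first page. If you want the whole document as one string, a small post-processing step can help. OCR output for Chinese may contain stray spaces between characters and empty lines, so the sketch below strips those and joins the pages. The helper names `clean_page` and `join_pages` are hypothetical, and the regex only covers the basic CJK Unified Ideographs block, which is an assumption about what your text contains.

```python
import re

# Basic CJK Unified Ideographs block (assumption: enough for typical chi_sim output).
_CJK = r"\u4e00-\u9fff"
# Spaces or tabs that sit directly between two CJK characters.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def clean_page(text):
    """Remove spaces between CJK characters and drop empty lines."""
    text = _SPACE_BETWEEN_CJK.sub("", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def join_pages(pages):
    """Clean every page and join them with a blank line in between."""
    return "\n\n".join(clean_page(p) for p in pages)
```

With the list built earlier, `join_pages(extracted_text)` would give the cleaned text of all pages; spaces between Latin words are left alone.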