Reading Text from Electronic Invoice PDFs with Python

2022-02-21  昨日雨疏风骤

This post uses pdfminer.six to read the text of an electronic invoice PDF. (Based on Python 3.7.)

First, install pdfminer.six:

pip install pdfminer.six==20201018
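
To confirm that the install worked before going further, a quick check along the following lines may help. It is only a sketch: the helper name is_installed is made up for illustration, and it relies on the fact that pdfminer.six is imported under the module name pdfminer.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when no top-level module with this name is importable
    return importlib.util.find_spec(module_name) is not None


# pdfminer.six is imported as "pdfminer"
print("pdfminer available:", is_installed("pdfminer"))
```

If this prints False, the pip command above most likely ran against a different Python environment than the one executing the script.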

Once the installation succeeds, the text can be read with code along these lines:

from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager,PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

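def readPdf2(path):
    # Resource manager caches shared resources such as fonts
    rsrcmgr = PDFResourceManager()
    # In-memory buffer that collects the extracted text
    retstr = StringIO()
    # TextConverter writes the laid-out text of each page into retstr
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        # An empty set of page numbers means every page is processed
        for page in PDFPage.get_pages(fp, set()):
            interpreter.process_page(page)
        text = retstr.getvalue()

    device.close()
    retstr.close()
    return text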

text = readPdf2(r"C:\test.pdf")
print(text)
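
The extracted text usually contains blank lines, padding spaces, and form feed characters at page boundaries, so a small cleanup step can make it easier to work with. The following is a minimal sketch, not part of pdfminer.six: the helper name clean_lines and the sample string are made up for illustration, and the sample does not come from a real invoice.

```python
def clean_lines(text):
    """Split extracted text into stripped, non-empty lines."""
    # str.splitlines() treats "\n" and the form feed "\x0c" as line boundaries
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up placeholder text standing in for the string returned by the reader
sample = "Invoice code: 000000000000\n\n   Invoice number: 00000000   \n\x0c"
for line in clean_lines(sample):
    print(line)
```

In practice the same helper could be applied to the string returned by readPdf2 in place of the sample.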