Removing tags from HTML

2021-05-12  隐墨留白


Method 1

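# Strip link, image, and table markup (requires: pip install html2text)
import html2text

# `item` and `content` come from the surrounding scraper code; `content` is the raw HTML string
item['html'] = content
html_txt = html2text.HTML2Text()
html_txt.ignore_links = True   # drop link targets, keep the link text
html_txt.ignore_images = True  # drop images
html_txt.ignore_tables = True  # drop table formatting, keep the cell text
item['content'] = html_txt.handle(content)  # returns Markdown-flavoured plain text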

Method 2

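# Remove JavaScript and style blocks, then replace all remaining tags with empty strings
import re
from lxml import etree
from lxml.html.clean import Cleaner  # newer lxml releases (5.2+) ship this as the separate lxml_html_clean package

cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
# `content` is an lxml element here; serialize it to a str before cleaning
content = cleaner.clean_html(etree.tostring(content, encoding="unicode"))
# re.S lets '.' match newlines, so tags that span several lines are removed too
item['content'] = re.sub(r'<.*?>', '', content, flags=re.S)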
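
If neither third-party library is available, the same idea (drop script and style blocks, then discard every tag) can be sketched with the standard library's html.parser alone. This is not one of the two methods above; the names TextExtractor and strip_tags are made up for illustration, and joining text nodes without a separator is an assumption chosen to mirror replacing tags with empty strings.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping everything inside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes directly, then collapse runs of whitespace
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Calling strip_tags(content) on a raw HTML string returns its text with tags removed. Because text nodes are joined without a separator, adjacent block elements such as two consecutive paragraphs run together unless the markup already has whitespace between them.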