Parsing HTML in Python: lxml
2019-02-19
tafanfly
lxml
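lxml is a Python library for processing XML and HTML. It handles character encodings automatically while parsing, and it has built-in support for XPath 1.0, XSLT 1.0, and custom element classes.
Installation: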
pip install lxml
Using lxml
Sample HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>
<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
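(1) Reading HTML
In the snippets below, test is a string holding the sample HTML above, and test.html is a file with the same content.
- Parse from a string
from lxml import etree
html = etree.HTML(test)
- Parse from a file
from lxml import etree
# The file name must be a string; the HTML parser is needed because the sample is not well-formed XML.
html = etree.parse('test.html', etree.HTMLParser())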
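The two ways of loading HTML can be compared side by side. The following is a minimal sketch, assuming a shortened copy of the sample page (two links instead of four) and a temporary file standing in for test.html; the variable names are illustrative and not from the original post.

```python
import os
import tempfile

from lxml import etree

# Assumption: a shortened copy of the sample page, with only two links.
test = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>
<h1>webpage</h1>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
</body>
</html>"""

# Parsing a string: etree.HTML returns the root <html> element.
root_from_string = etree.HTML(test)

# Parsing a file: etree.parse returns an ElementTree object.
# A temporary file plays the role of test.html here.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(test)
    tree = etree.parse(path, etree.HTMLParser())

root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.tag)
print(len(root_from_string.xpath("//a")), len(tree.xpath("//a")))
```

Note the difference in return type: etree.HTML gives back an element, while etree.parse gives back a tree whose root element is available through getroot(). Both support xpath().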
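(2) Getting tags
To get all a tags in this HTML, several equivalent expressions work; each returns the 4 elements directly.
- //a : selects every a tag anywhere in the document
- /html/body/a : walks the node path from the root down to the a tags
- /descendant::a : selects the a tags among the descendants of the document root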
a_tags = html.xpath('//a')
In [12]: print(a_tags)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_2 = html.xpath('/html/body/a')
In [14]: print(a_tags_2)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_3 = html.xpath('/descendant::a')
In [16]: print(a_tags_3)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
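To check that the three expressions really select the same links, one option is to compare the href values each of them yields. This is a rough sketch, assuming a body-only copy of the sample page; the names expressions, results, and hrefs_per_expression are hypothetical helpers for illustration.

```python
from lxml import etree

# Assumption: a trimmed copy of the sample page containing the four links.
test = """<html><body>
<h1>webpage</h1>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""

html = etree.HTML(test)

# The three expressions from the list above.
expressions = ["//a", "/html/body/a", "/descendant::a"]
results = [html.xpath(expr) for expr in expressions]

# Report how many elements each expression matched.
for expr, elements in zip(expressions, results):
    print(expr, len(elements))

# Compare the selections by the href of each matched element.
hrefs_per_expression = [[a.get("href") for a in elements] for elements in results]
same_selection = all(h == hrefs_per_expression[0] for h in hrefs_per_expression)
print(same_selection)
```

The expressions agree only because every a tag in this page is a direct child of body; on a page with nested links, /html/body/a would miss those that //a and /descendant::a still find.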
(3) Getting tag attributes and text
Append /@href to an expression from (2) to get the attribute values directly, or append /text() to get the text.
a_attribute_2 = html.xpath('/html/body/a/@href')
In [21]: print(a_attribute_2)
['http://www.runoob.com/html/html-tutorial.html', 'http://www.runoob.com/python/python-tutorial.html', 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'http://www.runoob.com/java/java-tutorial.html']
a_text_2 = html.xpath('/html/body/a/text()')
In [31]: print(a_text_2)
['HTML', 'Python', 'C++', 'Java']
Alternatively, take the elements returned in (2) and loop over them one by one.
for tag in a_tags_2:
    print(tag.attrib, tag.text)
{'href': 'http://www.runoob.com/html/html-tutorial.html', 'target': '_blank'} HTML
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'} Python
{'href': 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'target': '_blank'} C++
{'href': 'http://www.runoob.com/java/java-tutorial.html', 'target': '_blank'} Java
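Both approaches can be combined to build a lookup from link text to URL. The sketch below assumes a trimmed copy of the sample page; the links dictionary is an illustrative addition, not part of the original post.

```python
from lxml import etree

# Assumption: a trimmed copy of the sample page containing the four links.
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""

html = etree.HTML(test)

# Approach 1: let XPath return the attribute values and text nodes directly.
hrefs = html.xpath("/html/body/a/@href")
texts = html.xpath("/html/body/a/text()")

# Approach 2: select the elements, then read each one in Python.
# Element.get() returns a single attribute value (or None if it is absent).
links = {a.text: a.get("href") for a in html.xpath("/html/body/a")}

print(texts)
print(links["Python"])
```

The element-based loop is often more convenient when several attributes of the same tag are needed together, since the two XPath result lists in approach 1 have to be matched up by index.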
(4) Filtering tags
- By attribute
python_tag = html.xpath('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')
In [42]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [43]: print(python_tag[0].text)
Python
- By text
python_tag = html.xpath('/html/body/a[text()="Python"]')
In [47]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [48]: print(python_tag[0].text)
Python
- By position
python_tag = html.xpath('/html/body/a[position()=2]')
# Shorthand for the same query (XPath positions start at 1):
# python_tag = html.xpath('/html/body/a[2]')
In [52]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [53]: print(python_tag[0].text)
Python
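The three filters can be run together to confirm they pick out the same link. This is a minimal sketch on a trimmed copy of the sample page; the last() and contains() predicates at the end are standard XPath 1.0 functions added here for illustration and do not appear in the original post.

```python
from lxml import etree

# Assumption: a trimmed copy of the sample page containing the four links.
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""

html = etree.HTML(test)

# The three filters from this section.
by_attr = html.xpath('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')
by_text = html.xpath('/html/body/a[text()="Python"]')
by_pos = html.xpath('/html/body/a[2]')  # positions are 1-based

for name, result in [("attribute", by_attr), ("text", by_text), ("position", by_pos)]:
    print(name, result[0].text, result[0].get("href"))

# Two further XPath 1.0 predicates (not in the original post):
last_tag = html.xpath('/html/body/a[last()]')                   # the final a tag
partial = html.xpath('/html/body/a[contains(@href, "python")]')  # substring match on href

print(last_tag[0].text, [a.text for a in partial])
```

Every xpath() call returns a list, even when a predicate matches a single element, so indexing with [0] on an empty result raises IndexError; checking the list before indexing is safer on pages whose structure is not known in advance.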
For more expressions, see the post "python xpath的学习" (Learning XPath in Python).
Reference: https://www.jianshu.com/p/2ae6d51522c3