Python HTML Parsing: lxml

2019-02-19  tafanfly

lxml

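lxml is a Python library for processing XML and HTML. When parsing, it handles most encoding issues automatically. It also has built-in support for XPath 1.0, XSLT 1.0, and custom element classes.
Installation: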

pip install lxml
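
As a quick check after installing, the sketch below parses a small byte string and reads the text back. The sample markup and the variable names are made up for illustration. The encoding is passed to the parser explicitly here, which is one way to avoid relying on automatic detection when the input carries no charset declaration.

```python
from lxml import etree

# Hypothetical input: UTF-8 encoded bytes, e.g. as read from a file or a socket.
raw = '<html><body><p>你好, lxml</p></body></html>'.encode('utf-8')

# Passing the encoding explicitly removes any guesswork by the parser.
parser = etree.HTMLParser(encoding='utf-8')
root = etree.HTML(raw, parser=parser)

# Text nodes come back as ordinary Python strings.
greeting = root.xpath('//p/text()')[0]
print(greeting)
```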

Using lxml

HTML example

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Study</title>
</head>
<body>

<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a> 
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a> 
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body>
</html>
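(1) Reading HTML

In the snippets below, test is a string containing the HTML example above, and test.html is a file with the same content.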

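from lxml import etree

# Parse from a string; returns the root element
html = etree.HTML(test)

# Or parse from a file; the HTMLParser is needed because the default parser expects XML
html = etree.parse('test.html', etree.HTMLParser())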
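
The two ways of reading HTML can be tried side by side with the following sketch. It uses a shortened copy of the example page and writes it to a temporary file, so the file path and the variable names are assumptions for illustration only. Note that etree.HTML returns the root element, while etree.parse returns an ElementTree object; both offer xpath().

```python
import os
import tempfile
from lxml import etree

# Shortened copy of the example page.
test = """<!DOCTYPE html>
<html>
<head><title>Study</title></head>
<body>
<h1>webpage</h1>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
</body>
</html>"""

# From a string: the result is the root <html> element.
root = etree.HTML(test)

# From a file: write the same content to a temporary file, then parse it.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(test)
    path = f.name
try:
    tree = etree.parse(path, etree.HTMLParser())
finally:
    os.remove(path)

print(root.tag)            # tag of the root element parsed from the string
print(tree.getroot().tag)  # tag of the root element parsed from the file

title_from_string = root.xpath('//title/text()')
title_from_file = tree.xpath('//title/text()')
```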
(2) Getting tags

To get all the a tags, several XPath expressions are equivalent for this HTML; each one returns the 4 elements directly.

a_tags = html.xpath('//a')
In [12]: print(a_tags)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_2 = html.xpath('/html/body/a')
In [14]: print(a_tags_2)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
a_tags_3 = html.xpath('/descendant::a')
In [16]: print(a_tags_3)
[<Element a at 0x7fdd7aea6c20>, <Element a at 0x7fdd7aea6e60>, <Element a at 0x7fdd7aea6a70>, <Element a at 0x7fdd7aea6c68>]
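
A self-contained way to confirm that the three expressions select the same elements is sketched below. The page is a trimmed copy of the example, and the helper name link_texts is hypothetical.

```python
from lxml import etree

# Trimmed copy of the example page.
test = """<html><body>
<h1>webpage</h1>
<p>source link</p>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)


def link_texts(expression):
    """Return the text of every element matched by the XPath expression."""
    return [element.text for element in html.xpath(expression)]


anywhere = link_texts('//a')             # any <a> in the document
absolute = link_texts('/html/body/a')    # <a> children of /html/body
by_axis = link_texts('/descendant::a')   # explicit descendant axis

print(anywhere)
print(anywhere == absolute == by_axis)
```

The short form //a is the most tolerant of layout changes, while the absolute path only matches a elements that are direct children of body.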
(3) Getting tag attributes and text

Using the paths from (2), append /@href to get the attribute values directly, or /text() to get the text.

a_attribute_2 = html.xpath('/html/body/a/@href')

In [21]: print(a_attribute_2)
['http://www.runoob.com/html/html-tutorial.html', 'http://www.runoob.com/python/python-tutorial.html', 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'http://www.runoob.com/java/java-tutorial.html']

a_text_2 = html.xpath('/html/body/a/text()')

In [31]: print(a_text_2)
['HTML', 'Python', 'C++', 'Java']

Alternatively, take the result from (2) and loop over the elements one by one.

for tag in a_tags_2:
    print(tag.attrib, tag.text)

{'href': 'http://www.runoob.com/html/html-tutorial.html', 'target': '_blank'} HTML
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'} Python
{'href': 'http://www.runoob.com/cplusplus/cpp-tutorial.html', 'target': '_blank'} C++
{'href': 'http://www.runoob.com/java/java-tutorial.html', 'target': '_blank'} Java
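
When only one attribute is needed per element, the element's get() method is a convenient alternative to reading the whole attrib mapping. The sketch below builds a text-to-URL dictionary from a trimmed copy of the example page; the variable name links is an assumption for illustration.

```python
from lxml import etree

# Trimmed copy of the example page.
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)

# Map each link text to its href; get() returns None if the attribute is missing.
links = {a.text: a.get('href') for a in html.xpath('/html/body/a')}

for text, href in links.items():
    print(text, href)

# A second argument to get() supplies a default for a missing attribute.
missing = html.xpath('/html/body/a')[0].get('title', 'no title')
```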
(4) Filtering tags
# Filter by attribute value
python_tag = html.xpath('/html/body/a[@href="http://www.runoob.com/python/python-tutorial.html"]')

In [42]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [43]: print(python_tag[0].text)
Python
# Filter by text content
python_tag = html.xpath('/html/body/a[text()="Python"]')

In [47]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [48]: print(python_tag[0].text)
Python
# Filter by position (XPath positions start at 1)
python_tag = html.xpath('/html/body/a[position()=2]')
# python_tag = html.xpath('/html/body/a[2]')

In [52]: print(python_tag[0].attrib)
{'href': 'http://www.runoob.com/python/python-tutorial.html', 'target': '_blank'}
In [53]: print(python_tag[0].text)
Python
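
Beyond exact matches, XPath 1.0 also provides functions such as contains(), starts-with() and last() that can be used inside the brackets. The sketch below applies them to a trimmed copy of the example page; these particular predicates are illustrative choices, not taken from the original examples.

```python
from lxml import etree

# Trimmed copy of the example page.
test = """<html><body>
<a href="http://www.runoob.com/html/html-tutorial.html" target="_blank">HTML</a>
<a href="http://www.runoob.com/python/python-tutorial.html" target="_blank">Python</a>
<a href="http://www.runoob.com/cplusplus/cpp-tutorial.html" target="_blank">C++</a>
<a href="http://www.runoob.com/java/java-tutorial.html" target="_blank">Java</a>
</body></html>"""
html = etree.HTML(test)

# Substring match on an attribute value.
by_substring = html.xpath('/html/body/a[contains(@href, "python")]/text()')

# Prefix match on the element text.
by_prefix = html.xpath('/html/body/a[starts-with(text(), "C")]/text()')

# The last <a> among its siblings.
last_link = html.xpath('/html/body/a[last()]/text()')

# Every <a> after the second one.
after_second = html.xpath('/html/body/a[position() > 2]/text()')

print(by_substring, by_prefix, last_link, after_second)
```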

For more expressions, see "Learning XPath in Python" (python xpath的学习).
Reference: https://www.jianshu.com/p/2ae6d51522c3
