[spider] Extracting Web Page Content with pyquery
2019-01-08
Franckisses
Preparation: before you start, make sure the pyquery library is installed. If it is not, install it with the command below.
pip install pyquery
Create a sample document. The pq name used in every example below is imported here.
from pyquery import PyQuery as pq
html = """
<div id='container'>
<ul id='haha'>
<li class='item-0'>first item</li>
<li class='item-1'><a href='link2.html'>second item</a></li>
<li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
<li class='item-1 active'><a href='link4.html'>fourth item</a></li>
<li class='item-0'><a href='link5.html'>fifth item</a></li>
</ul>
</div>
"""
1. Initialization
Initialization from a URL
doc1 = pq(url='http://www.cn.bing.com')
print(doc1('title'))
Initialization from a file
doc2 = pq(filename='index.html')
print(doc2('title'))
Initialization from a string
doc = pq(html)
2. Basic CSS selectors
(1) Select by id
idElement = doc('#haha')
print(idElement)
Result:
<ul id="haha">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
(2) Select by class attribute
classElement = doc(".item-0")
print(classElement)
Result:
<li class="item-0">first item</li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
(3) Select by the relationship between tags (a space selects descendants at any depth)
sons = doc("#haha li")
print(sons)
Result:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
3. Finding multiple elements
lilist = doc('li').items()
print(type(lilist))
for i in lilist:
    print(i)
# Result
<class 'generator'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
4. Getting attributes
aElement = doc('.item-0.active a')
print(aElement.attr('href'))
print(aElement.attr.href)
# Result
link3.html
link3.html
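5. Getting text
text = doc('.item-0.active a span')
# html() returns the inner HTML of the first matched element
print(text.html())
print("="*20)
# text() returns all of the text under the matched tag, including text inside nested tags
print(text.text())
# Result
third item
====================
third item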
6. Finding tags by their text content
# Find the li tags whose text contains 'second'
print(doc("li:contains('second')"))
Result:
<li class="item-1"><a href="link2.html">second item</a></li>