Web Crawler 3
Using XPath to extract the data you need
W3School XPath tutorial: http://www.w3school.com.cn/xpath/index.asp
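XPath uses path expressions to select nodes or node sets in an XML document. These path expressions look much like the paths used in an ordinary computer file system.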
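To make the file-system analogy concrete, here is a minimal sketch that runs a few path expressions with lxml. The `bookstore` document is a made-up sample, not taken from any real site; only `etree.fromstring` and `xpath` are actual lxml calls.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(xml)

# Absolute path from the root, like /dir/subdir/file in a file system.
titles = root.xpath('/bookstore/book/title/text()')

# "//" selects matching nodes anywhere in the document.
all_prices = root.xpath('//price/text()')

# A predicate in [...] filters nodes by attribute value.
web_titles = root.xpath('//book[@category="web"]/title/text()')

# "@name" selects an attribute instead of an element.
langs = root.xpath('//title/@lang')

print(titles, all_prices, web_titles, langs)
```

The same four building blocks (absolute paths, `//`, predicates, and `@attribute`) are all that the crawler in this post relies on.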
Example: downloading the images posted in the threads of a Baidu Tieba forum
import os
import uuid
from urllib.parse import urljoin

import requests
from lxml import etree

url_base = 'http://tieba.baidu.com/'
url1 = '%sf' % url_base
kw = input('Tieba name: ')
begin_page = int(input('Start page: '))
end_page = int(input('End page: '))

# Make sure the output directory exists before any image is written to it.
os.makedirs('./images', exist_ok=True)

for page in range(begin_page, end_page + 1):
    params = {
        'kw': kw,
        'pn': (page - 1) * 50,  # offset of the first thread on this page
    }
    response = requests.get(url=url1, params=params)
    content1 = response.content
    # with open('./tieba.html', 'wb') as file:
    #     file.write(content1)

    # Data processing: parse the list page and collect the thread links.
    content1 = content1.decode('utf-8')
    html1 = etree.HTML(content1)
    href_list = html1.xpath(
        '(//div[@class="threadlist_title pull_left j_th_tit "]/a'
        '|//div[@class="col2_right j_threadlist_li_right "]/a)/@href')
    for href in href_list:
        # urljoin copes with a leading "/" in href without producing "//".
        url2 = urljoin(url_base, href)
        print(url2)
        response2 = requests.get(url=url2)
        content2 = response2.content
        html2 = etree.HTML(content2)
        src_list = html2.xpath('//div/img[@class="BDE_Image"]/@src')
        for src in src_list:
            # Unique file name plus the extension taken from the image URL.
            file_name = str(uuid.uuid1()) + src[src.rfind('.'):]
            response3 = requests.get(url=src)
            content3 = response3.content
            with open('./images/%s' % file_name, 'wb') as file:
                file.write(content3)
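The XPath expressions in the script can be checked without touching the network by running them against a small hand-written page. The sketch below assumes markup shaped like the class names used above; the sample HTML, the `/p/...` links, and the `example.com` image URLs are invented for illustration, and the real Tieba pages may differ.

```python
from urllib.parse import urljoin
from lxml import etree

# Hand-written stand-in for a thread list page (an assumption, not real Tieba markup).
sample = '''
<html><body>
  <div class="threadlist_title pull_left j_th_tit "><a href="/p/111">first</a></div>
  <div class="col2_right j_threadlist_li_right "><a href="/p/222">second</a></div>
  <div class="unrelated"><a href="/p/333">ignored</a></div>
  <div><img class="BDE_Image" src="http://example.com/a.jpg"/></div>
  <div><img class="avatar" src="http://example.com/b.png"/></div>
</body></html>
'''

doc = etree.HTML(sample)

# "|" is the XPath union operator: links from either kind of div are selected,
# and the trailing /@href step is applied to the combined node set.
hrefs = doc.xpath(
    '(//div[@class="threadlist_title pull_left j_th_tit "]/a'
    '|//div[@class="col2_right j_threadlist_li_right "]/a)/@href')

# Turn the site-relative links into absolute URLs.
thread_urls = [urljoin('http://tieba.baidu.com/', h) for h in hrefs]

# Only <img> elements whose class is exactly "BDE_Image" are matched.
image_srcs = doc.xpath('//div/img[@class="BDE_Image"]/@src')

print(thread_urls)
print(image_srcs)
```

Note that `@class="..."` is an exact string comparison, trailing space included, so a selector like this stops matching as soon as the site changes its class list.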