Crawler 3

2018-03-21  冬gua

Using XPath to extract the data we need

W3School XPath documentation: http://www.w3school.com.cn/xpath/index.asp

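XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.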
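
To make the file-system analogy concrete, here is a minimal sketch using lxml. The XML document and its element names (library, shelf, book) are made up for illustration and do not come from any real site.

```python
from lxml import etree

# Hypothetical XML document, invented for this illustration.
xml = b"""<library>
  <shelf name="fiction">
    <book lang="en"><title>Dune</title></book>
    <book lang="zh"><title>Santi</title></book>
  </shelf>
  <shelf name="science">
    <book lang="en"><title>Cosmos</title></book>
  </shelf>
</library>"""

root = etree.fromstring(xml)

# Absolute path, read step by step from the root, like /dir/subdir/file.
all_titles = root.xpath('/library/shelf/book/title/text()')

# '//' matches at any depth; [@attr="value"] keeps only nodes with that attribute value.
english_titles = root.xpath('//book[@lang="en"]/title/text()')

# '@name' selects an attribute value instead of an element.
shelf_names = root.xpath('//shelf/@name')

print(all_titles)
print(english_titles)
print(shelf_names)
```

The same three building blocks (a path, a predicate in square brackets, and an attribute step) are all that the crawler in the example needs.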

Example:

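import os
import uuid
from urllib.parse import urljoin

import requests
from lxml import etree

url_base = 'http://tieba.baidu.com/'
# Thread-list endpoint: http://tieba.baidu.com/f
url1 = '%sf' % url_base

kw = input('Tieba name: ')
begin_page = int(input('Start page: '))
end_page = int(input('End page: '))

# Make sure the output directory exists before any image is written.
os.makedirs('./images', exist_ok=True)

for page in range(begin_page, end_page + 1):
    params = {
        'kw': kw,
        # pn is the offset of the first thread on the page (50 threads per page).
        'pn': (page - 1) * 50
    }
    response = requests.get(url=url1, params=params)
    content1 = response.content
    # Uncomment to save the raw list page for inspection:
    # with open('./tieba.html', 'wb') as file:
    #     file.write(content1)

    # Data processing
    content1 = content1.decode('utf-8')
    html1 = etree.HTML(content1)
    # @class="..." is an exact string match, so the trailing space in each value matters.
    href_list = html1.xpath(
        '(//div[@class="threadlist_title pull_left j_th_tit "]/a|//div[@class="col2_right j_threadlist_li_right "]/a)/@href')
    for href in href_list:
        # urljoin avoids the double slash that plain concatenation gives when href starts with '/'.
        url2 = urljoin(url_base, href)
        print(url2)
        response2 = requests.get(url=url2)
        content2 = response2.content
        html2 = etree.HTML(content2)
        # Image URLs embedded in the thread's posts.
        src_list = html2.xpath('//div/img[@class="BDE_Image"]/@src')
        for src in src_list:
            # Unique file name plus the extension taken from the image URL.
            file_name = str(uuid.uuid1()) + src[src.rfind('.'):]
            response3 = requests.get(url=src)
            content3 = response3.content
            with open('./images/%s' % file_name, 'wb') as file:
                file.write(content3)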
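
The two thread-list XPath expressions in the script compare the whole class attribute as one string, trailing space included, and join their results with '|'. The sketch below shows how that behaves on a small snippet, next to a looser token-based match. The HTML is hypothetical, only loosely modelled on the class names used in the script, and the token match is an alternative offered for comparison, not what the script does.

```python
from lxml import etree

# Hypothetical markup, loosely modelled on the class names used in the script.
html = etree.HTML("""
<div class="threadlist_title pull_left j_th_tit "><a href="/p/1">first</a></div>
<div class="threadlist_title pull_left j_th_tit"><a href="/p/2">second</a></div>
<div class="col2_right j_threadlist_li_right "><a href="/p/3">third</a></div>
""")

# Exact match: the attribute string must be identical, trailing space included.
exact = html.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/@href')

# Token match: pad the class list with spaces and look for one whole class name.
token = html.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " j_th_tit ")]/a/@href')

# Union of two node sets with '|', as in the script; results come back in document order.
union = html.xpath(
    '(//div[@class="threadlist_title pull_left j_th_tit "]/a'
    '|//div[@class="col2_right j_threadlist_li_right "]/a)/@href')

print(exact)
print(token)
print(union)
```

The exact form silently returns nothing if the site drops the trailing space or reorders the class names, so the token form may be the more forgiving choice when the markup is not under your control.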
