Python Crawler | Scraping Dynamically Loaded Websites

2019-05-23 YocnZhao

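The previous post covered how to scrape a static website (https://www.jianshu.com/p/bbf4386f7527). While scraping, you may find that some sites do not put their content in the HTML at all; instead it is loaded dynamically via AJAX.
Take http://tu.duowan.com/gallery/138916.html#p1 as an example.
Opening the page in a browser, the URL of the original image is easy to find. So we eagerly request the same page with our crawler, only to discover that the URL is not there at all and the relevant part of the HTML is empty. Baffling. Compare the two images below.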

[Image: the page in the browser's F12 developer tools]

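[Image: the HTML returned to the crawler]
Clearly, the HTML we requested contains no image URLs, while the page in the browser does.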
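
The same comparison can be made from code by searching the fetched HTML for image links. The snippet below is only a rough sketch: `find_image_urls` is a hypothetical helper, and both HTML strings are invented stand-ins rather than the real Duowan markup.

```python
import re

# Rough pattern for <img ... src="...">; fine for a sanity check, not a full HTML parser.
IMG_SRC = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)


def find_image_urls(html):
    """Return the src value of every <img> tag found in the given HTML text."""
    return IMG_SRC.findall(html)


# Hypothetical HTML as a crawler might receive it: the gallery container is still empty.
raw_html = '<div id="gallery"></div><script src="/js/gallery.js"></script>'

# Hypothetical markup after the browser has run the scripts and filled the container.
rendered_html = '<div id="gallery"><img src="http://example.com/a.jpg"></div>'

print(find_image_urls(raw_html))
print(find_image_urls(rendered_html))
```

An empty result for the raw HTML, next to a non-empty one for what the browser shows, is a strong hint that the content arrives through a separate request.
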
That is because the site uses AJAX, i.e. XMLHttpRequest. So how do we find the real URLs?

[Image: the XHR request]
Here we can find the URL of the XHR request: http://tu.duowan.com/index.php?r=show/getByGallery/&gid=138916&_=1558600256687. Requesting that link returns JSON.
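
The URL has two variable parts: `gid`, which matches the gallery number in the page address, and `_`, which looks like a millisecond timestamp of the kind AJAX libraries often append to defeat caching. That reading of `_` is an assumption, not something the site documents. A minimal sketch of building such a URL, with `build_gallery_url` as a hypothetical helper name:

```python
import time

# URL template copied from the XHR request; only gid and the "_" value change.
API = "http://tu.duowan.com/index.php?r=show/getByGallery/&gid=%d&_=%d"


def build_gallery_url(gid, now=None):
    """Build the XHR URL for a gallery; `now` is seconds since the epoch."""
    if now is None:
        now = time.time()
    # Assumption: "_" is the current time in milliseconds.
    millis = int(round(now * 1000))
    return API % (gid, millis)


print(build_gallery_url(138916, now=1558600256.687))
```

With the fixed `now` value used here, the call reproduces the URL quoted above.
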

[Image: where the URLs actually live]
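
To make the shape of that JSON concrete, here is a small sketch that pulls out the fields the script in this post reads: a `picInfo` list whose entries carry `source` (the image URL) and `add_intro` (the description). The sample payload is invented for illustration, and `extract_pics` is a hypothetical helper; a real response would presumably contain more fields.

```python
import json

# Hypothetical payload with only the keys used in this post.
sample = '''
{"picInfo": [
  {"source": "http://example.com/1.jpg", "add_intro": "first"},
  {"source": "http://example.com/2.gif", "add_intro": "second"}
]}
'''


def extract_pics(text):
    """Return (source, add_intro) pairs from a getByGallery-style JSON string."""
    data = json.loads(text)
    # Fall back to an empty list if the payload has no picInfo key.
    return [(info["source"], info["add_intro"]) for info in data.get("picInfo", [])]


for source, intro in extract_pics(sample):
    print("%s -> %s" % (intro, source))
```
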
That makes things easy: now that we know where the real URLs live, we can reuse the approach from the previous post.

#!/usr/bin/python
# -*- coding: UTF-8 -*-

# Note: this script targets Python 2 (urllib2, print statements).
import urllib2, urllib, os, time, json


class Pic:
    def __init__(self, url, desc, path):
        self.url = url
        self.desc = desc
        self.path = path


locol = "/Users/y/PythonWorkSpace/DUOWAN/"


def test():
    # Range of gallery IDs to crawl (end_index itself is excluded)
    start_index = 137882
    end_index = 138930
    for i in range(start_index, end_index):
        download_pic(i)
    return


def download_pic(index):
    # Millisecond timestamp, like the "_" parameter in the URL the browser requests
    curr_time = str(int(time.time() * 1000))
    url = "http://tu.duowan.com/index.php?r=show/getByGallery/&gid=%d&_=%s" % (index, curr_time)
    print "Starting task %s" % url
    # Request accepts url, data and headers; no data is sent here, so headers are added with add_header()
    request = urllib2.Request(url)
    request.add_header("User-Agent",
                       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
    request.add_header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7,fr;q=0.6")
    request.add_header("Accept",
                       "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")

    try:
        response = urllib2.urlopen(request)
    except urllib2.URLError as e:
        # urlopen raises for HTTP error statuses and network failures; skip this gallery
        print "Request failed: %s" % e
        return
    if response.code != 200:
        return
    html = response.read()
    if html.strip() == '':
        return
    # The response body is JSON; picInfo holds one entry per image
    data = json.loads(html, encoding="GBK")
    pic_list = []
    pic_info = data[u'picInfo']
    current_dir = locol + str(index) + "/"
    for info in pic_info:
        source = info[u'source']
        desc = info[u'add_intro']
        if source.endswith("gif"):
            suffix = '.gif'
        elif source.endswith("jpg"):
            suffix = '.jpg'
        else:
            # Skip unsupported image types instead of abandoning the whole gallery
            continue
        path = current_dir + desc + suffix
        pic_list.append(Pic(source, desc, path))

    # Create the gallery directory (and any missing parents) once, only if there is something to save
    if pic_list and not os.path.exists(current_dir):
        os.makedirs(current_dir)
    for pic in pic_list:
        print "------------- Downloading ---------------", pic.url, pic.path
        urllib.urlretrieve(pic.url, pic.path)

    print 'Taking a break: sleeping for 3s'
    time.sleep(3)
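
One thing to watch: the script uses each image's description directly as the file name, so a description that is empty, repeated, or contains a path separator could cause trouble when saving. The sketch below shows one possible way to build safer names. `safe_filename` and its naming scheme are hypothetical and not part of the original script, and the sample inputs are made up.

```python
import os
import re


def safe_filename(desc, source, position):
    """Build a file name from a description and the extension of the source URL."""
    # Replace characters that are unsafe in file names on common file systems.
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", desc).strip("_")
    if not name:
        name = "pic"  # fall back when the description is empty
    # Take the extension from the URL path, ignoring any query string.
    ext = os.path.splitext(source.split("?")[0])[1].lower()
    # Prefix the position so identical descriptions do not overwrite each other.
    return "%03d_%s%s" % (position, name, ext)


print(safe_filename("a/b: c", "http://example.com/x/1.JPG?v=2", 1))
print(safe_filename("", "http://example.com/2.gif", 2))
```

Taking the extension from the URL also avoids hard-coding a list of suffixes, at the cost of accepting whatever type the server points to.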

All done, time to call it a day~
