Python Crawler: Wallpaper Goodies

2019-01-12, by 南风拂西洲


I. Preparation

1. Environment: Python 3.x

2. Editor: PyCharm

3. Dependency: requests

pip install requests

4. Target site:

https://tuchong.com/tags/%E7%A7%81%E6%88%BF

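II. Analyzing the Site

(Screenshot: home page)

Steps:

1. Open the site.
2. Press F12 to open the developer tools, click the clear icon to empty the log, then press F5 to refresh so that every request is loaded again.
3. Look at the Response of the first request (the home page). The HTML it returns contains no image elements, which suggests the images are loaded asynchronously via AJAX. Switch to the XHR filter to continue.
4. (Screenshot: data requests)
With the XHR filter on, only three requests remain. Checking each Response shows that the second one is the data API:

https://tuchong.com/rest/tags/%E7%A7%81%E6%88%BF/posts?page=1&count=20&order=weekly&before_timestamp=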
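To make the shape of this endpoint easier to see, here is a small sketch that takes the URL apart with the standard library. It does not contact the site; `API_URL` is simply the string shown above, and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs, unquote

API_URL = ("https://tuchong.com/rest/tags/%E7%A7%81%E6%88%BF/posts"
           "?page=1&count=20&order=weekly&before_timestamp=")

parts = urlsplit(API_URL)

# The tag is the percent-encoded path segment that follows /rest/tags/
tag = unquote(parts.path.split("/")[3])

# keep_blank_values=True keeps the empty before_timestamp parameter
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.netloc)
print(tag)
print(params)
```

The decoded tag is the same one that appears in the address of the page being crawled, so other tags can presumably be substituted in the same position.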

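5. Analyze and test the API:

*page*    the current page number
*count*   in testing this had to stay at the fixed value (20 in the URL above); requests with other values failed
*order*   the listing type: **weekly** for popular, **new** for latest
*before_timestamp* optional; the request works with or without it

The User-Agent header is required (its value can be any browser string):
header={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Host':'tuchong.com',
    'Referer':'https://tuchong.com/tags/%E7%A7%81%E6%88%BF',
}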
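The parameters above can be assembled programmatically instead of by string concatenation. The following is a minimal sketch, assuming the endpoint layout described in this section; `build_api_url` is a hypothetical helper name, not something the site or the original script provides.

```python
from urllib.parse import quote, urlencode

def build_api_url(tag, page, order="weekly", count=20):
    """Return the posts API URL for one page of a tag (hypothetical helper)."""
    # count defaults to 20 because other values failed in the author's tests
    query = urlencode({"page": page, "count": count, "order": order})
    # quote() percent-encodes the UTF-8 bytes of a non-ASCII tag
    return "https://tuchong.com/rest/tags/" + quote(tag) + "/posts?" + query

print(build_api_url("私房", 1))
print(build_api_url("私房", 3, order="new"))
```

`before_timestamp` is left out here because the analysis found it to be optional.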

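6. Analyze the JSON returned by the API

(Screenshots: JSON response)

The home page shows 20 photo sets, and the postList array in the returned JSON holds 20 objects, so this is where the image data should live.
Since the goal is to download the images, the next step is to find the image-related fields.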

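7. Inspect an element on the home page
(Screenshot: inspected image element)
This reveals the image URL:
background-color: rgb(153, 153, 153); background-image: url("//photo.tuchong.com/1433078/l/388106228.webp");
The image URL has the form: photo.tuchong.com/1433078/l/388106228.webp
8. Combining the JSON data with this URL gives the following mapping:

URL part      Description                                                          JSON field
1433078       User ID                                                              author_id
l             Image quality (f was best in testing; feel free to try other letters)
388106228     Photo ID                                                             img_id
.webp         Image format (.png and .jpg also work)

That completes the analysis.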
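The mapping in the table can be sketched as two small functions. This is only an illustration: `SAMPLE` is a hand-written payload with made-up IDs that mimics just the fields used here (`postList`, `author_id`, `images`, `img_id`), the helper names are hypothetical, and the `https` default is an assumption (the page itself uses a protocol-relative URL).

```python
import json

# Hand-written sample; the real response has many more keys and these IDs are invented
SAMPLE = """
{"postList": [
  {"author_id": "1001", "images": [{"img_id": 11}, {"img_id": 12}]},
  {"author_id": "1002", "images": [{"img_id": 21}]}
]}
"""

def build_image_url(author_id, img_id, quality="f", ext="jpg", scheme="https"):
    """Join the four URL parts from the table into one image URL."""
    return f"{scheme}://photo.tuchong.com/{author_id}/{quality}/{img_id}.{ext}"

def image_urls(payload, quality="f", ext="jpg"):
    """Collect one URL per image across every post in the payload."""
    urls = []
    for post in payload["postList"]:
        for image in post["images"]:
            urls.append(build_image_url(post["author_id"], image["img_id"], quality, ext))
    return urls

# Rebuild the URL found while inspecting the page
print(build_image_url(1433078, 388106228, quality="l", ext="webp"))

for url in image_urls(json.loads(SAMPLE)):
    print(url)
```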

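III. Writing the Code

1. Build the API URLs for the first 20 pages

if __name__=="__main__":
    while 1:
        try:
            # range(1, 21) covers pages 1 through 20
            for index in range(1,21):
                url = "https://tuchong.com/rest/tags/%E7%A7%81%E6%88%BF/posts?page="+str(index)+"&count=20&order=weekly"
                RunCrawler(url)
            # All pages finished; leave the retry loop instead of crawling forever
            break
        except Exception as e:
            # On any error, report it, wait 10 seconds, then start over from page 1
            print(e.args)
            time.sleep(10)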
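The loop above retries by starting over after any exception. One possible variation, offered only as a sketch, caps the number of attempts so that a permanent failure eventually surfaces. `run_with_retry` and `flaky` are hypothetical names used for the demonstration; in the crawler, `task` would be something like `lambda: RunCrawler(url)`.

```python
import time

def run_with_retry(task, max_attempts=3, delay=10):
    """Call task() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as e:
            print("attempt", attempt, "failed:", e)
            if attempt == max_attempts:
                raise  # give up and let the caller see the last error
            time.sleep(delay)

# Demonstration with a stand-in task that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "done"

result = run_with_retry(flaky, max_attempts=5, delay=0)
print(result, "after", calls["n"], "calls")
```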

2. Request the data API and clean the data

def RunCrawler(url):
    response = requests.get(url, headers=header)
    data = json.loads(response.text)
    ImageBox=[]
    # Each entry in postList is one user's photo set
    for UserAlbum in data["postList"]:
        AuthorID=str(UserAlbum["author_id"])
        for s_image in UserAlbum["images"]:
            ImageID=str(s_image["img_id"])
            # "f" is the quality letter that tested best
            ImgURL="http://photo.tuchong.com/"+AuthorID+"/f/"+ImageID+".jpg"
            ImageName=AuthorID+"-"+ImageID+".jpg"
            ImageInfo={
                "ImageURL":ImgURL,
                "ImageName":ImageName
            }
            # print(ImgURL)
            ImageBox.append(ImageInfo)

    print("Images on this page: "+str(len(ImageBox)))

    for img in ImageBox:
        # Skip files that were already downloaded
        if os.path.isfile("E:/私房/"+img["ImageName"]):
            print(img["ImageName"]+"--- already exists!")
            continue
        # Pause 1 to 3 seconds between downloads to avoid being blocked
        time.sleep(randint(1,3))
        DownImage(img["ImageURL"],img["ImageName"])

3. Save the image

def DownImage(ImageURL,FileName):
    # Create the target folder on first use
    if not os.path.isdir("E:/私房"):
        os.makedirs("E:/私房")
    # stream=True avoids loading the whole image into memory at once
    response = requests.get(ImageURL, headers=header, stream=True)
    chunk_size = 1024 * 1024  # read the body in 1 MiB chunks
    with open("E:/私房/"+FileName, "wb") as f:
        for data in response.iter_content(chunk_size=chunk_size):
            f.write(data)
        print(FileName+"--- saved!")
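One thing to be aware of: RunCrawler treats any existing file as already downloaded, so a download that is interrupted midway would leave a truncated file that gets skipped on the next run. A possible way around this, sketched here under that assumption, is to write to a temporary `.part` file and rename it only once all chunks are written. `save_chunks` is a hypothetical helper; in the crawler, `chunks` would be `response.iter_content(chunk_size=chunk_size)`.

```python
import os
import tempfile

def save_chunks(chunks, directory, file_name):
    """Write an iterable of byte chunks to directory/file_name via a .part file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, file_name)
    part_path = final_path + ".part"
    with open(part_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # ignore empty chunks
                f.write(chunk)
    # The final name only appears after every chunk has been written
    os.replace(part_path, final_path)
    return final_path

# Demonstration with in-memory chunks and a throwaway directory
demo_dir = tempfile.mkdtemp()
path = save_chunks([b"abc", b"", b"def"], demo_dir, "demo.jpg")
print(os.path.getsize(path))
```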

4. Full source

import requests
import json
import os
import time
from random import randint

# User-Agent is required; Host and Referer mimic a browser visit to the tag page
header={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Host':'tuchong.com',
    'Referer':'https://tuchong.com/tags/%E7%A7%81%E6%88%BF',
}

def DownImage(ImageURL,FileName):
    # Create the target folder on first use
    if not os.path.isdir("E:/私房"):
        os.makedirs("E:/私房")
    # stream=True avoids loading the whole image into memory at once
    response = requests.get(ImageURL, headers=header, stream=True)
    chunk_size = 1024 * 1024  # read the body in 1 MiB chunks
    with open("E:/私房/"+FileName, "wb") as f:
        for data in response.iter_content(chunk_size=chunk_size):
            f.write(data)
        print(FileName+"--- saved!")

def RunCrawler(url):
    response = requests.get(url, headers=header)
    data = json.loads(response.text)
    ImageBox=[]
    # Each entry in postList is one user's photo set
    for UserAlbum in data["postList"]:
        AuthorID=str(UserAlbum["author_id"])
        for s_image in UserAlbum["images"]:
            ImageID=str(s_image["img_id"])
            # "f" is the quality letter that tested best
            ImgURL="http://photo.tuchong.com/"+AuthorID+"/f/"+ImageID+".jpg"
            ImageName=AuthorID+"-"+ImageID+".jpg"
            ImageInfo={
                "ImageURL":ImgURL,
                "ImageName":ImageName
            }
            # print(ImgURL)
            ImageBox.append(ImageInfo)

    print("Images on this page: "+str(len(ImageBox)))

    for img in ImageBox:
        # Skip files that were already downloaded
        if os.path.isfile("E:/私房/"+img["ImageName"]):
            print(img["ImageName"]+"--- already exists!")
            continue
        # Pause 1 to 3 seconds between downloads to avoid being blocked
        time.sleep(randint(1,3))
        DownImage(img["ImageURL"],img["ImageName"])

if __name__=="__main__":
    while 1:
        try:
            # range(1, 21) covers pages 1 through 20
            for index in range(1,21):
                url = "https://tuchong.com/rest/tags/%E7%A7%81%E6%88%BF/posts?page="+str(index)+"&count=20&order=weekly"
                RunCrawler(url)
            # All pages finished; leave the retry loop instead of crawling forever
            break
        except Exception as e:
            # On any error, report it, wait 10 seconds, then start over from page 1
            print(e.args)
            time.sleep(10)

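IV. Summary

Adding the User-Agent, Host, and Referer headers to each request is enough.

Measures against the site's anti-crawling checks:

Request rate: from a single IP, pause 1 to 2 seconds after each image download, otherwise access will be blocked.

With stable proxy IPs, multithreading is worth considering for faster downloads.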
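The multithreading idea can be sketched with the standard library's thread pool. This is an assumption-laden illustration, not part of the original script: `download_all` and `fake_download` are hypothetical names, and the stand-in worker performs no network access. In real use `download` could be DownImage, and a proxy could be supplied through the `proxies` argument of `requests.get`.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(jobs, download, max_workers=4):
    """Run download(url, name) for every (url, name) job on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, url, name) for url, name in jobs]
        # Collect results in submission order; result() re-raises worker errors
        return [future.result() for future in futures]

def fake_download(url, name):
    """Stand-in worker that only reports what it would have downloaded."""
    return name + " <- " + url

jobs = [
    ("http://example.invalid/1.jpg", "1.jpg"),
    ("http://example.invalid/2.jpg", "2.jpg"),
]
results = download_all(jobs, fake_download)
print(results)
```

Keeping `max_workers` small matters here, since the rate limit described above applies per IP.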

Image API holding 6,000 images; each request returns one at random:
http://120.79.205.142/Image/
For learning and exchange only; do not use for commercial purposes.