Python crawler for Baidu Images: the list thumbnails

2019-03-19 leoryzhu

1. Preparation

Language: Python
Browser: Google Chrome
Tool: the requests module

First, enter a search keyword (for example, 明星, "celebrity") on the Baidu Images search page to load the results page.

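Press F12 to open the developer tools and inspect any image in the list to find its address. Copy the image URL from the src attribute and keep it for later:
https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/u=371978350,138525231&fm=26&gp=0.jpg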


Select Network > All and refresh the page. You will see a request of type document whose URL matches the one in the browser's address bar.


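This is the page returned to the browser. Click the request and press Ctrl+F to search for the image URL you just copied. It shows up inside the JS code, which means the image URLs on this page are not part of the static HTML but are generated dynamically by JS. As a result, you cannot get the image URLs by fetching the page with requests.get(url) and parsing its elements. You could still pull them out of the JS code with a regular expression, but I don't recommend that approach.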
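For illustration only, the regular-expression approach mentioned above could look roughly like the sketch below. The `sample_source` string is a made-up stand-in for the page source (the real page embeds a much larger JS object whose exact layout is not shown here), and `extract_thumb_urls` is a hypothetical helper name.

```python
import re

# Hypothetical fragment of the page source. The example.com URLs and the
# surrounding JS layout are assumptions made for this sketch.
sample_source = (
    'var data = [{"thumbURL":"https://example.com/it/u=1,2&fm=26&gp=0.jpg",'
    '"fromURL":"x"},{"thumbURL":"https://example.com/it/u=3,4&fm=26&gp=0.jpg",'
    '"fromURL":"y"}];'
)


def extract_thumb_urls(page_source):
    """Return every string value that follows a "thumbURL" key."""
    # Non-greedy match: stop at the first closing quote after the key.
    return re.findall(r'"thumbURL":"(.*?)"', page_source)


thumb_urls = extract_thumb_urls(sample_source)
print(thumb_urls)
```

The pattern depends on the exact key name and quoting used in the page, so it breaks silently if the page layout changes, which is one reason to prefer the data request described next.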


So where do the image URLs come from? Switch the Network filter from All to XHR and you will see this request:
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E6%98%8E%E6%98%9F&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word=%E6%98%8E%E6%98%9F&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1552975216767=


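This looks like the list data we want, and opening the response confirms it. We can therefore get the data by requesting the URL above. Looking closely at its parameters: queryWord and word are the URL-encoded search keyword (leaving it unencoded also works), rn is the number of images per page (30 by default), and pn is the index of the first image to return, usually a multiple of rn. The other parameters are fixed, so changing only these three (the keyword, rn, and pn) is enough to fetch the list images.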
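As a rough sketch of the parameter handling described above, the helper below takes the captured request URL, keeps every fixed parameter as it is, and overrides only queryWord, word, pn, and rn. The name `build_list_url` and the zero-based `page` argument are assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The request URL captured in the developer tools, copied from the article.
CAPTURED_URL = (
    "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj"
    "&ct=201326592&is=&fp=result&queryWord=%E6%98%8E%E6%98%9F&cl=2&lm=-1"
    "&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright="
    "&word=%E6%98%8E%E6%98%9F&s=&se=&tab=&width=&height=&face=0&istype=2"
    "&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1552975216767="
)


def build_list_url(word, page, rn=30):
    """Build the list URL for a keyword and a zero-based page index."""
    parts = urlsplit(CAPTURED_URL)
    overrides = {
        "queryWord": word,
        "word": word,
        "pn": str(page * rn),  # index of the first image on this page
        "rn": str(rn),         # number of images per page
    }
    # keep_blank_values=True preserves the empty fixed parameters (is=, z=, ...).
    pairs = [
        (key, overrides.get(key, value))
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # urlencode percent-encodes the keyword as UTF-8.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


example_url = build_list_url("明星", page=2)
print(example_url)
```

With rn left at 30, page 0 maps to pn=0, page 1 to pn=30 (the value in the captured URL), and page 2 to pn=60.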

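Notes

Some image URLs fail to load when pasted directly into the browser, which means requests.get(image_url) cannot fetch them either. It turned out that the browser's normal requests carry a Referer header that points to the search page URL.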
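A minimal sketch of attaching the Referer header is shown below. It uses the standard library's urllib.request rather than requests so that it runs without third-party packages; with requests, the equivalent is passing headers={"Referer": search_url} to requests.get. The helper names `build_image_request` and `download_image` are hypothetical, and the search URL is the paged search address for the example keyword.

```python
import urllib.request


def build_image_request(image_url, search_url):
    """Create a request for image_url that carries search_url as Referer."""
    return urllib.request.Request(image_url, headers={"Referer": search_url})


def download_image(image_url, search_url, path, timeout=10):
    """Fetch the image with the Referer header set and write it to path."""
    request = build_image_request(image_url, search_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)


# Building the request does not touch the network; download_image does.
request = build_image_request(
    "https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/"
    "u=371978350,138525231&fm=26&gp=0.jpg",
    "http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8"
    "&word=%E6%98%8E%E6%98%9F&pn=0",
)
print(request.get_header("Referer"))
```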


The Python implementation is as follows:

import os
import re
import urllib.parse

import requests

page_num = 30  # images per page (the rn parameter)
photo_dir = "D:\\data\\pic\\face\\photo"


def getThumbImage(word, word_dir):
    """Download thumbnails by extracting URLs from the paged HTML search page."""
    url = "http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word={0}&pn={1}"
    for num in range(50):
        page_url = url.format(urllib.parse.quote(word), num * page_num)
        print(page_url)
        response = requests.get(page_url)
        # The image URLs sit inside the page's JS code, so use a regex.
        pic_urls = re.findall('"thumbURL":"(.*?)",', response.text, re.S)
        for pic_url in pic_urls:
            name = pic_url.split('/')[-1]
            print(pic_url)
            # Some images cannot be fetched without a Referer header.
            headers = {
                "Referer": page_url,
            }
            html = requests.get(pic_url, headers=headers)
            with open(os.path.join(word_dir, name), 'wb') as f:
                f.write(html.content)


def getThumb2Image(word, word_dir):
    """Download thumbnails listed by the XHR list request."""
    url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={0}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word={0}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={1}&rn=" + str(page_num) + "&gsm=1e&1552975216767="
    for num in range(50):
        # pn is the index of the first image on this page.
        page_url = url.format(urllib.parse.quote(word), num * page_num)
        print(page_url)
        response = requests.get(page_url)
        pic_urls = re.findall('"thumbURL":"(.*?)",', response.text, re.S)
        for pic_url in pic_urls:
            name = pic_url.split('/')[-1]
            print(pic_url)
            # Some images cannot be fetched without a Referer header.
            headers = {
                "Referer": page_url,
            }
            html = requests.get(pic_url, headers=headers)
            with open(os.path.join(word_dir, name), 'wb') as f:
                f.write(html.content)


if __name__ == "__main__":
    word = input("Enter a search keyword (a person's name, a place name, etc.): ")
    word_dir = os.path.join(photo_dir, word)
    # makedirs also creates missing parent directories, unlike mkdir.
    os.makedirs(word_dir, exist_ok=True)
    getThumb2Image(word, word_dir)
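
The code above pulls thumbURL values out of the list response with a regular expression. If the response body is valid JSON, parsing it may be more robust. The sketch below assumes a top-level "data" list whose entries may carry a "thumbURL" key; that layout, the sample body, and the helper name `thumb_urls_from_json` are all assumptions, since the article does not show the response structure.

```python
import json

# Hypothetical response body. The real field layout is an assumption.
sample_body = (
    '{"data": [{"thumbURL": "https://example.com/a.jpg"}, '
    '{"thumbURL": "https://example.com/b.jpg"}, {}]}'
)


def thumb_urls_from_json(body):
    """Return the thumbURL of every entry in the assumed "data" list."""
    payload = json.loads(body)
    # Skip entries without a thumbURL, such as an empty trailing object.
    return [
        item["thumbURL"]
        for item in payload.get("data", [])
        if item.get("thumbURL")
    ]


urls = thumb_urls_from_json(sample_body)
print(urls)
```

If json.loads raises an error on a real response, falling back to the regular expression is a reasonable choice.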
