Instagram Crawler Log (Part 1)
2018-08-26
Hello everyone! I'm an IU fan. One day, while scrolling through IU's Instagram, I got the idea of crawling the photos and videos she has posted. No time like the present, let's get started!
Environment: Python 3.7 on Windows 10.
PS: the full source code is included at the end of this post.
Open the target page and analyze it
First, open IU's profile page: https://www.instagram.com/dlwlrma/
I opened it in Firefox and hit F12 to start inspecting. Using the element picker, you can see that every image link sits under a div with class="v1Nh3 kIKUG _bz0w", as shown in the screenshot. OK, now switch to the debugger, view the page source, and search for that div. It turns out the div does not appear in the page source at all, from which we can conclude that it is loaded dynamically.
[Screenshot: searching for the div in the page source]
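As a quick sanity check (my own addition, not part of the original write-up), you can fetch the raw HTML with requests and confirm that the class name seen in the element inspector never shows up in it, which is exactly what "loaded dynamically" means here. This assumes the page is reachable from your network; if it isn't, reuse the proxy and headers defined further down.
import requests

# the class name from the inspector is not in the raw HTML,
# so the div must be rendered by JavaScript after the page loads
resp = requests.get("https://www.instagram.com/dlwlrma/")
print("v1Nh3" in resp.text)   # expected: False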
So here's the question: all of the data is being loaded into the page's HTML in some form we don't know yet. Let's track down where it actually lives. First, copy the file name of one of the jpg files we saw earlier, for example 39099041_724879754520599_610565124800905216_n.jpg, and search the page source for that name. Haha, nice, found it! This line looks like where the image data gets loaded, so let's copy the whole line out on its own and see what on earth it is.
Huh? It's a window._sharedData object wrapped in a script tag, and even better, it looks like JSON. OK, pull window._sharedData out by itself, pretty-print it, and look at the data. Hey, isn't that nice and tidy?
Now let's write some code! The first small goal is simply to get hold of the window._sharedData data. Easy enough, right?
# -*- coding: utf-8 -*-
from lxml import etree
import requests

headers = {
    "Origin": "https://www.instagram.com/",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Host": "www.instagram.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instagram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

BASE_URL = 'https://www.instagram.com/dlwlrma/'

# local proxy used to reach Instagram
proxy = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}


def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        # collect the text of every <script type="text/javascript"> tag
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        for js_tag in all_js_tags:
            # the one we want starts with "window._sharedData"
            if js_tag.strip().startswith('window._sharedData'):
                print(js_tag)
    except Exception as e:
        print("Something went wrong!!!")
        raise e


if __name__ == '__main__':
    crawler()
Look at the output: we've successfully got the window._sharedData data!
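Before digging into the structure, it can help to pretty-print the payload so the nesting is easy to read. Here is a small sketch of mine (dump_shared_data is a hypothetical helper, not part of the original script) that could be called on js_tag inside the loop above:
import json

def dump_shared_data(js_tag):
    # remove the "window._sharedData = " prefix and the trailing ";" to get pure JSON
    payload = js_tag.strip()[len('window._sharedData = '):].rstrip(';')
    data = json.loads(payload)
    # print only the first part so the console is not flooded
    print(json.dumps(data, indent=2, ensure_ascii=False)[:2000])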
Analyzing window._sharedData shows that the jpg links live under ["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]. OK, process the data once more and grab all the jpg links:
def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        new_imgs_url = []
        for js_tag in all_js_tags:
            if js_tag.strip().startswith('window._sharedData'):
                # drop the trailing ";" and everything before the JSON object,
                # leaving only pure JSON (this needs "import json")
                data = js_tag[:-1].split('= {')[1]
                js_data = json.loads('{' + data)
                edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"][
                    "edge_owner_to_timeline_media"]["edges"]
                # print(edges[0])
                for edge in edges:
                    new_imgs_url.append(edge["node"]["display_url"])
        for i in new_imgs_url:
            print(i)
    except Exception as e:
        print("Something went wrong!!!")
        raise e
[Screenshot: the jpg links printed to the console]
OK, next let's write a download method for the links we got:
def download(imgs_urls, save_img_path):
    for i in imgs_urls:
        print(i)
        # pull the file name (e.g. 39099041_..._n.jpg) out of the URL
        # with the regex name_re defined in the full code below
        img_name = name_re.findall(i)[0]
        img_mp_path = save_img_path + img_name
        print(img_name + " downloading...")
        with open(img_mp_path, 'wb+') as f:
            f.write(requests.get(i, proxies=proxy).content)
        time.sleep(1)
And that's it, the download is done. Go open the folder and have a look at the downloaded images!
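One small caveat: download() assumes the target directory already exists; if it doesn't, open() raises FileNotFoundError. A one-liner of my own (not in the original code) takes care of that before download() is called:
import os

os.makedirs(save_path, exist_ok=True)  # e.g. 'E:/IU/'; does nothing if the folder already exists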
Full source code for this post
# -*- coding: utf-8 -*-
from lxml import etree
import re
import json
import requests
import time

headers = {
    "Origin": "https://www.instagram.com/",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Host": "www.instagram.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instagram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

BASE_URL = 'https://www.instagram.com/dlwlrma/'

# local proxy used to reach Instagram
proxy = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}

# matches file names like 39099041_..._n.jpg (or .mp4) inside the image URL
name_re = re.compile(r'[0-9\_]+[a-zA-Z]+[\.jpgmp4]{4}')
save_path = 'E:/IU/'


def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        new_imgs_url = []
        for js_tag in all_js_tags:
            if js_tag.strip().startswith('window._sharedData'):
                # strip the trailing ";" and everything before the JSON object
                data = js_tag[:-1].split('= {')[1]
                js_data = json.loads('{' + data)
                edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"][
                    "edge_owner_to_timeline_media"]["edges"]
                # print(edges[0])
                for edge in edges:
                    new_imgs_url.append(edge["node"]["display_url"])
        for i in new_imgs_url:
            print(i)
        download(new_imgs_url, save_path)
    except Exception as e:
        print("Something went wrong!!!")
        raise e


def download(imgs_urls, save_img_path):
    for i in imgs_urls:
        print(i)
        img_name = name_re.findall(i)[0]
        img_mp_path = save_img_path + img_name
        print(img_name + " downloading...")
        with open(img_mp_path, 'wb+') as f:
            f.write(requests.get(i, proxies=proxy).content)
        time.sleep(1)


if __name__ == '__main__':
    crawler()
PS: This is my first post on Jianshu. If there's anything wrong or missing, please feel free to point it out.
Of course, how could we settle for crawling just 12 photos of our dear IU? The next post will cover how to fetch more links. See you there!