Instagram Crawler Log (Part 1)
2018-08-26
Hello everyone! I'm an IU fan. One day, while scrolling through IU's Instagram, I got the idea of crawling the photos and videos she has posted. No time like the present, let's get started!
Environment: Python 3.7 on Windows 10.
PS: the full source code is included at the end of this post.
Open the target page and analyze it
First, open IU's profile page: https://www.instagram.com/dlwlrma/
I opened it in Firefox and hit F12 to start inspecting. Using the element picker, you can see that every image link sits under a div with class="v1Nh3 kIKUG _bz0w", as shown in the screenshot. OK, now switch to the debugger, view the page source, and search for that div. It turns out the div does not appear in the page source at all, from which we can conclude that it is loaded dynamically.
[Screenshot: searching for the div in the page source]
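As a quick sanity check (my own addition, not part of the original write-up), you can fetch the raw HTML with requests and confirm that the class name seen in the element inspector never shows up in it, which is exactly what "loaded dynamically" means here. This assumes the page is reachable from your network; if it isn't, reuse the proxy and headers defined further down.
import requests

# the class name from the inspector is not in the raw HTML,
# so the div must be rendered by JavaScript after the page loads
resp = requests.get("https://www.instagram.com/dlwlrma/")
print("v1Nh3" in resp.text)   # expected: False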
So here's the question: all of the data is being loaded into the page's HTML in some form we don't know yet. Let's track down where it actually lives. First, copy the file name of one of the jpg files we saw earlier, for example 39099041_724879754520599_610565124800905216_n.jpg, and search the page source for that name. Haha, nice, found it! This line looks like where the image data gets loaded, so let's copy the whole line out on its own and see what on earth it is.
Huh? It's a window._sharedData object wrapped in a script tag, and even better, it looks like JSON. OK, pull window._sharedData out by itself, pretty-print it, and look at the data. Hey, isn't that nice and tidy?
Now let's write some code! The first small goal is simply to get hold of the window._sharedData data. Easy enough, right?
# -*- coding: utf-8 -*-
from lxml import etree
import requests

headers = {
    "Origin": "https://www.instagram.com/",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Host": "www.instagram.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instagram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

BASE_URL = 'https://www.instagram.com/dlwlrma/'

# local proxy used to reach Instagram
proxy = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}


def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        # collect the text of every <script type="text/javascript"> tag
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        for js_tag in all_js_tags:
            # the one we want starts with "window._sharedData"
            if js_tag.strip().startswith('window._sharedData'):
                print(js_tag)
    except Exception as e:
        print("Something went wrong!!!")
        raise e


if __name__ == '__main__':
    crawler()
Look at the output: we've successfully got the window._sharedData data!
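Before digging into the structure, it can help to pretty-print the payload so the nesting is easy to read. Here is a small sketch of mine (dump_shared_data is a hypothetical helper, not part of the original script) that could be called on js_tag inside the loop above:
import json

def dump_shared_data(js_tag):
    # remove the "window._sharedData = " prefix and the trailing ";" to get pure JSON
    payload = js_tag.strip()[len('window._sharedData = '):].rstrip(';')
    data = json.loads(payload)
    # print only the first part so the console is not flooded
    print(json.dumps(data, indent=2, ensure_ascii=False)[:2000])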
Analyzing window._sharedData shows that the jpg links live under ["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]. OK, process the data once more and grab all the jpg links:
def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        new_imgs_url = []
        for js_tag in all_js_tags:
            if js_tag.strip().startswith('window._sharedData'):
                # drop the trailing ";" and everything before the JSON object,
                # leaving only pure JSON (this needs "import json")
                data = js_tag[:-1].split('= {')[1]
                js_data = json.loads('{' + data)
                edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"][
                    "edge_owner_to_timeline_media"]["edges"]
                # print(edges[0])
                for edge in edges:
                    new_imgs_url.append(edge["node"]["display_url"])
        for i in new_imgs_url:
            print(i)
    except Exception as e:
        print("Something went wrong!!!")
        raise e
[Screenshot: the jpg links printed to the console]
OK, next let's write a download method for the links we got:
def download(imgs_urls, save_img_path):
    for i in imgs_urls:
        print(i)
        # pull the file name (e.g. 39099041_..._n.jpg) out of the URL
        # with the regex name_re defined in the full code below
        img_name = name_re.findall(i)[0]
        img_mp_path = save_img_path + img_name
        print(img_name + " downloading...")
        with open(img_mp_path, 'wb+') as f:
            f.write(requests.get(i, proxies=proxy).content)
        time.sleep(1)
And that's it, the download is done. Go open the folder and have a look at the downloaded images!
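One small caveat: download() assumes the target directory already exists; if it doesn't, open() raises FileNotFoundError. A one-liner of my own (not in the original code) takes care of that before download() is called:
import os

os.makedirs(save_path, exist_ok=True)  # e.g. 'E:/IU/'; does nothing if the folder already exists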
Full source code for this post
# -*- coding: utf-8 -*-
from lxml import etree
import re
import json
import requests
import time

headers = {
    "Origin": "https://www.instagram.com/",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Host": "www.instagram.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instagram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

BASE_URL = 'https://www.instagram.com/dlwlrma/'

# local proxy used to reach Instagram
proxy = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}

# matches file names like 39099041_..._n.jpg (or .mp4) inside the image URL
name_re = re.compile(r'[0-9\_]+[a-zA-Z]+[\.jpgmp4]{4}')
save_path = 'E:/IU/'


def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        new_imgs_url = []
        for js_tag in all_js_tags:
            if js_tag.strip().startswith('window._sharedData'):
                # strip the trailing ";" and everything before the JSON object
                data = js_tag[:-1].split('= {')[1]
                js_data = json.loads('{' + data)
                edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"][
                    "edge_owner_to_timeline_media"]["edges"]
                # print(edges[0])
                for edge in edges:
                    new_imgs_url.append(edge["node"]["display_url"])
        for i in new_imgs_url:
            print(i)
        download(new_imgs_url, save_path)
    except Exception as e:
        print("Something went wrong!!!")
        raise e


def download(imgs_urls, save_img_path):
    for i in imgs_urls:
        print(i)
        img_name = name_re.findall(i)[0]
        img_mp_path = save_img_path + img_name
        print(img_name + " downloading...")
        with open(img_mp_path, 'wb+') as f:
            f.write(requests.get(i, proxies=proxy).content)
        time.sleep(1)


if __name__ == '__main__':
    crawler()
PS: This is my first post on Jianshu. If there's anything wrong or missing, please feel free to point it out.
Of course, how could we settle for crawling just 12 photos of our dear IU? The next post will cover how to fetch more links. See you there!