Python Hands-On Plan Study Notes: week1_4, Scraping Taylor Swift Photos

2016-06-29  luckywoo

Day 3 of learning web scraping: downloading Taylor Swift photos.
The code is as follows:

#!/usr/bin/env python
# coding:utf-8
__author__ = 'lucky'
from bs4 import BeautifulSoup
import requests, urllib.request
import time
# Pages 1-20 of the Taylor Swift inspiration feed
urls = ['http://weheartit.com/inspirations/taylorswift?scrolling=true&page={}'.format(number) for number in range(1, 21)]

header = { 'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',   
           'Cookie':'locale=zh-cn; __whiAnonymousID=cedf556d59434a78a518b36279b59bd4; auth=no; _session=06742c2ee80e676adfa76366d2b522ed; _ga=GA1.2.1879005139.1467165244; _weheartit_anonymous_session=%7B%22page_views%22%3A1%2C%22search_count%22%3A0%2C%22last_searches%22%3A%5B%5D%2C%22last_page_view_at%22%3A1467202282156%7D'}

img_links = []
def get_links(url, data=None):
    wb_data = requests.get(url, headers=header)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    imgs = soup.select('body > div > div > div > a > img')
    if data is None:
        for img in imgs:
            # Collect the image URL from each <img> tag's src attribute
            img_links.append(img.get('src'))

for url in urls:
    get_links(url)  # collect the image links from each page
    print('OK')
    time.sleep(2)   # pause briefly between pages to be polite to the server

folder_path = '/Users/lucky/Life/pic/'  # images are saved as 0.jpg, 1.jpg, ...

for i, img in enumerate(img_links):
    # urlretrieve fetches the URL and writes the bytes straight to disk
    urllib.request.urlretrieve(img, folder_path + str(i) + '.jpg')
    print('Done')
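Before running the full crawl, it helps to sanity-check the CSS selector against a single page. A minimal sketch, reusing only the header and selector already defined in the script above:

# Quick check: fetch page 1 and print the image URLs the selector finds,
# so selector problems surface before the full 20-page crawl.
test_url = 'http://weheartit.com/inspirations/taylorswift?scrolling=true&page=1'
wb_data = requests.get(test_url, headers=header)
soup = BeautifulSoup(wb_data.text, 'lxml')
for img in soup.select('body > div > div > div > a > img'):
    print(img.get('src'))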

The downloaded images look like this:

pictures.png

A single photo:

2.jpg

Summary:

1. Got more practice defining and calling a function to structure the script.
2. Added a User-Agent header to masquerade as a browser, plus cookies, in order to crawl the target pages (see the first sketch below for an alternative way to pass the cookies).
3. Downloading the images used urllib.request, specifically urllib.request.urlretrieve(), which under the hood writes the fetched bytes to a file opened with something like open('filename', 'wb') (see the second sketch below).
4. Gained a better feel for locating page elements in HTML via CSS selectors, and for using Chrome's developer tools.
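A note on point 2: instead of pasting the raw Cookie string into the headers, requests can also take cookies as a dict through its cookies parameter. A minimal sketch, with the key/value pairs pulled from the Cookie string used above:

import requests

# Alternative to a raw 'Cookie' header: pass cookies as a dict.
# These pairs are a subset of the Cookie string in the script above.
cookies = {'locale': 'zh-cn', 'auth': 'no'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
resp = requests.get('http://weheartit.com/inspirations/taylorswift', headers=headers, cookies=cookies)
print(resp.status_code)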
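To make point 3 concrete: urlretrieve() roughly amounts to fetching the bytes and writing them to a file opened in binary write mode. A hand-rolled equivalent using requests (a sketch; download_image is a hypothetical helper, not part of the script above):

import requests

def download_image(url, path):
    # Roughly what urlretrieve does: fetch the raw bytes,
    # then write them out via open(path, 'wb').
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors
    with open(path, 'wb') as f:
        f.write(resp.content)

# Usage: download_image(img_links[0], folder_path + '0.jpg')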
