1-4homework-1

2016-08-13, by OldSix1987

Result

霉霉.png

My code

# -*- coding:utf-8 -*-
__author__ = 'CP6'

from bs4 import BeautifulSoup
import requests
import time

# proxies = {"http": "http://207.62.234.53:8118"}
urls = ['http://weheartit.com/inspirations/taylorswift?scrolling=true&page={}'.format(str(i)) for i in range(1, 5)]
# Proxy keys are lowercase URL schemes ('http', not 'HTTP')
proxies = {'http': 'http://123.56.28.196:8888'}
headers = {
    # The header name is 'User-Agent', with a hyphen
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.21 Safari/537.36'
}


def download(url):
    r = requests.get(url, headers=headers, proxies=proxies)
    if r.status_code != 200:
        return
    # Example URL: http://data.whicdn.com/images/254100583/superthumb.jpg
    # Splitting on '/' puts the image ID at index 4
    filename = url.split('/')[4]
    target = './{}.jpg'.format(filename)
    # Write the raw response bytes to disk
    with open(target, "wb") as fs:
        fs.write(r.content)

    print("%s => %s" % (url, target))

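def main():
    for single_url in urls:
        wb_data = requests.get(single_url, headers=headers, proxies=proxies)
        if wb_data.status_code != 200:
            continue
        soup = BeautifulSoup(wb_data.text, 'lxml')
        # Pause between listing pages to avoid hammering the site
        time.sleep(3)
        # <img> elements that are direct children of <a class="js-entry-detail-link">
        imgs = soup.select('a.js-entry-detail-link > img')
        for img in imgs:
            src = img.get('src')
            if src:  # skip <img> tags without a src attribute
                download(src)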

if __name__ == '__main__':
    main()

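Summary

Difficulty 1: Proxies

Send a browser User-Agent header so the site is less likely to block the scraper, and route requests through a proxy to reach sites outside the firewall.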

proxies = {'http': 'http://123.56.28.196:8888'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.21 Safari/537.36'
}
r = requests.get(url, headers=headers, proxies=proxies)

Proxy source: proxy_pool
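The setup above can be sketched with a requests Session, so the header and proxy are configured once and reused. This is only an illustration: the helper name make_session is hypothetical, and the proxy address is the one quoted in this post, which has probably expired by now. The lowercase 'http' key and the hyphenated 'User-Agent' name are the spellings requests expects.

```python
import requests

# Assumption: this is the proxy address quoted in the post; replace it with
# a live proxy before making real requests.
PROXY = 'http://123.56.28.196:8888'
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/53.0.2785.21 Safari/537.36')


def make_session(proxy=PROXY, user_agent=USER_AGENT):
    """Hypothetical helper: a Session with a browser User-Agent and a proxy."""
    session = requests.Session()
    # The header name is "User-Agent", with a hyphen.
    session.headers.update({'User-Agent': user_agent})
    # Proxy keys are lowercase URL schemes.
    session.proxies.update({'http': proxy, 'https': proxy})
    return session


session = make_session()
print(session.headers['User-Agent'])
print(session.proxies)
```

Nothing is sent over the network here; calling session.get(url) afterwards would use both settings.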

Difficulty 2: Getting the selector for img

Paste_Image.png

imgs = soup.select('a.js-entry-detail-link > img')
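To see what this selector matches, here is a small sketch run against hand-written HTML. The markup is an assumption made up for illustration, not the real page source, and it uses the built-in 'html.parser' so lxml is not required. The `>` combinator keeps only img tags that are direct children of a link with the class js-entry-detail-link.

```python
from bs4 import BeautifulSoup

# Assumption: stand-in markup for illustration, not the site's actual HTML.
html = """
<div>
  <a class="js-entry-detail-link" href="/entry/1"><img src="http://data.whicdn.com/images/254100583/superthumb.jpg"></a>
  <a class="js-entry-detail-link" href="/entry/2"><span><img src="nested.jpg"></span></a>
  <a class="other-link" href="/entry/3"><img src="other.jpg"></a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
imgs = soup.select('a.js-entry-detail-link > img')
srcs = [img.get('src') for img in imgs]
# Only the first image matches: the second is nested inside a <span>,
# and the third sits under a link with a different class.
print(srcs)
```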

Difficulty 3: Getting the image file name

str.split('/') splits the string into a list. For convenience, I simply take element [4] of that list as the file name.

# http://data.whicdn.com/images/254100583/superthumb.jpg

filename = url.split('/')[4]

# url.split('/') gives:
# ['http:', '', 'data.whicdn.com', 'images', '254100583', 'superthumb.jpg']
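The fixed index [4] only works while every URL has the same number of segments. As a hedged alternative, the sketch below parses the URL with the standard library and takes the segment just before the file name. The function name image_id_from_url is hypothetical.

```python
from urllib.parse import urlparse


def image_id_from_url(url):
    """Hypothetical helper: return the path segment just before the file name."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    if len(segments) < 2:
        raise ValueError('unexpected image URL: {}'.format(url))
    return segments[-2]


url = 'http://data.whicdn.com/images/254100583/superthumb.jpg'
by_index = url.split('/')[4]         # the approach used in this post
by_path = image_id_from_url(url)     # position counted from the end of the path
print(by_index, by_path)
```

For the example URL both approaches give the same ID; the second one raises a clear error on an unexpected URL shape instead of silently returning the wrong segment.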

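Difficulty 4: Writing the file, saving the image locally

This just writes the binary body of the response into the target file (the extension is already set to .jpg), which saves the image locally. r.content holds the response body as bytes.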

r = requests.get(url, headers=headers, proxies=proxies)
target = './{}.jpg'.format(filename)
with open(target, "wb") as fs:
    fs.write(r.content)
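The write step can be tried on its own without any network access. In this sketch the bytes are a made-up stand-in for r.content, the helper name save_image is hypothetical, and the file goes into a temporary directory rather than the current one.

```python
import os
import tempfile


def save_image(content, filename, directory='.'):
    """Hypothetical helper: write bytes to <directory>/<filename>.jpg."""
    target = os.path.join(directory, '{}.jpg'.format(filename))
    # 'wb' opens the file in binary mode, so the bytes are written unchanged.
    with open(target, 'wb') as fs:
        fs.write(content)
    return target


# Stand-in bytes; in the scraper this would be r.content.
fake_content = b'\xff\xd8\xff\xe0 not a real image'
out_dir = tempfile.mkdtemp()
path = save_image(fake_content, '254100583', out_dir)

# Read the file back to confirm the bytes survived the round trip.
with open(path, 'rb') as fs:
    round_trip = fs.read()
print(path, round_trip == fake_content)
```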

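Python syntax notes

The strip() and random.choice() functions

s.strip(rm) removes any characters that appear in rm from the beginning and end of string s; rm defaults to whitespace.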

strip.png
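A few quick checks make the behaviour of strip() concrete. The sample strings are made up for illustration; the key point is that the argument is treated as a set of characters, not as a prefix or suffix.

```python
# With no argument, all leading and trailing whitespace is removed.
plain = '\t hello world \n'.strip()

# With an argument, every leading/trailing character found in it is removed.
trimmed = 'xxhixx'.strip('x')

# The argument is a set of characters, so the order inside it does not matter.
as_set = 'abcHELLOcba'.strip('abc')

# Characters in the middle of the string are never touched.
middle = '--a-b--'.strip('-')

# lstrip() and rstrip() trim only one side.
left_only = '  x  '.lstrip()
right_only = '  x  '.rstrip()

print(repr(plain), repr(trimmed), repr(as_set), repr(middle))
print(repr(left_only), repr(right_only))
```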

**random.choice(seq) returns a random element from a sequence**

import random
random.choice(range(10))  # a random integer from 0 to 9
random.choice(range(10, 100, 2))  # a random value from [10, 12, 14, ..., 98]
random.choice("I love python")  # a random character: I, o, v, p, y, ... (or a space)
random.choice(("I love python"))  # same as above; the extra parentheses do not make a tuple
random.choice(["I love python"])  # always "I love python", the only element
# random.choice("I", "love", "python")  # TypeError: choice() takes a single sequence
random.choice(("I", "love", "python"))  # one of "I", "love", "python"
random.choice(["I", "love", "python"])  # one of "I", "love", "python"

Sources for the notes above:
The strip function
The Python random module

requests.get(url, headers=headers, proxies=proxies, timeout=timeout)

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
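To see what params does without sending anything, the request can be prepared and its final URL inspected. The fetch helper below is a hypothetical sketch of using timeout together with error handling; the 5-second default is an arbitrary assumption.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request without sending it, to inspect the encoded query string.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(prepared.url)  # http://httpbin.org/get?key1=value1&key2=value2


def fetch(url, params=None, timeout=5):
    """Hypothetical helper: return the response, or None on any request error."""
    try:
        # timeout is in seconds; without it a stalled server can block forever.
        return requests.get(url, params=params, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection failures and malformed URLs.
        print('request failed: {}'.format(exc))
        return None
```

A call such as fetch('http://httpbin.org/get', params=payload) would then return either a response object or None, so the caller can skip failed pages instead of crashing.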

Further reading: requests usage
