
Scraping the Jandan Girl-Pictures Board with Python

2018-01-24  by 努力努力再努力_y
[Figure: screenshot of the result]

1. Preparation

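2. What Selenium does

Jandan uses anti-scraping measures: the image URLs are obfuscated. You can see them in the browser's developer tools (F12), but BeautifulSoup cannot extract them from the raw HTML. I originally set out to find a way to decode them, and then stumbled on Selenium. Selenium is a web automation testing tool that drives a real browser the way a user would, so the page's scripts run and the image URLs can be read directly from the rendered page.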
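To make the difference concrete, here is a minimal sketch that uses only the standard library to look for `a.view_img_link` anchors in a piece of HTML. The helper names (`ViewImgLinkParser`, `extract_view_img_links`, `rendered_html`) and the sample markup are hypothetical and only for illustration; the real page structure may differ.

```python
from html.parser import HTMLParser


class ViewImgLinkParser(HTMLParser):
    """Collect href values of <a class="view_img_link"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "view_img_link" in classes and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def extract_view_img_links(html):
    """Return the hrefs of all image links found in the given HTML."""
    parser = ViewImgLinkParser()
    parser.feed(html)
    return parser.hrefs


def rendered_html(driver, url):
    """Load url in a Selenium-style driver and return the rendered HTML."""
    driver.get(url)
    return driver.page_source


# Hypothetical sample markup: the raw response carries only an encoded value,
# while the rendered page contains the real link.
raw = '<span class="img-hash">ZW5jb2RlZA==</span>'
rendered = '<a class="view_img_link" href="//example.com/large/a.jpg">view</a>'
print(extract_view_img_links(raw))
print(extract_view_img_links(rendered))
```

Passing `driver.page_source` from a real Chrome session to `extract_view_img_links` should behave like the `rendered` case, assuming the site still emits anchors with that class.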

3. Downloading chromedriver

Mirror (for users in mainland China): https://npm.taobao.org/mirrors/chromedriver/

Official site: https://sites.google.com/a/chromium.org/chromedriver/downloads
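After downloading, it can help to confirm that the driver binary is where the script expects it before launching the browser. The following is a rough sketch; `find_chromedriver` is a hypothetical helper, and the default candidate path simply mirrors the one used in the source code of this post.

```python
import os
import shutil


def find_chromedriver(candidates=(r"D:\chrome\chromedriver.exe",)):
    """Return the first existing candidate path, else whatever is on PATH.

    Returns None when no chromedriver can be found at all.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Fall back to a lookup on the PATH environment variable
    return shutil.which("chromedriver")


driver_path = find_chromedriver()
if driver_path is None:
    print("chromedriver not found; download it from one of the links above")
else:
    print("using chromedriver at", driver_path)
```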

4. Source code

import os

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

Directory = 'ooxx/'
os.makedirs(Directory, exist_ok=True)  # make sure the download folder exists
base_url = "http://jandan.net/ooxx/page-"
path = r"D:\chrome\chromedriver.exe"  # raw string so the backslashes stay literal
# executable_path is the Selenium 3 form; Selenium 4 passes the path via a Service object
driver = webdriver.Chrome(executable_path=path)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}
img_url = []  # image URLs collected across all pages
urls = [base_url + "{}#comments".format(i) for i in range(80, 85)]  # pages 80 to 84

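def getImg():
    # Download every collected image into Directory
    for n, url in enumerate(img_url, start=1):
        print('Image ' + str(n), end='')
        with open(Directory + url[-15:], 'wb') as f:
            # send the browser-like headers defined above
            f.write(requests.get(url, headers=headers).content)
        print('...OK!')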

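def getImgUrl(url):
    driver.get(url)
    data = driver.page_source  # HTML after the browser has run the page's scripts
    soup = BeautifulSoup(data, "html.parser")  # parse the page
    images = soup.select("a.view_img_link")  # locate the image links
    for i in images:
        z = i.get('href')
        if not z or 'gif' in z:
            continue  # skip GIFs and anchors without an href
        http_url = "http:" + z  # the href has no scheme, so prepend one
        img_url.append(http_url)
        print(http_url)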


if __name__ == "__main__":
    try:
        for url in urls:
            getImgUrl(url)
    finally:
        driver.quit()  # close the browser even if a page fails to load
    getImg()
    print("")
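The script builds each image URL by prepending "http:" to the href and names each file with `url[-15:]`, which only works while the hrefs are scheme-less and the filenames have a fixed length. A slightly more defensive sketch, using only the standard library, could look like this; `normalize_img_url`, `filename_from_url`, and the example host are hypothetical and not part of the original project.

```python
import posixpath
from urllib.parse import urljoin, urlparse

BASE = "http://jandan.net/ooxx/"  # page the hrefs were read from


def normalize_img_url(href, base=BASE):
    """Resolve a scheme-less or relative href against the page URL."""
    return urljoin(base, href)


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


url = normalize_img_url("//img.example.com/large/abc123.jpg")
print(url)
print(filename_from_url(url))
```

`urljoin` leaves already-absolute URLs untouched, so the same helper would cover both kinds of href.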

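Project repository: https://github.com/aszt/jiandan-gril

Note: the repository bundles the latest chromedriver version, which supports Chrome v62-64.

PS: Don't hammer Jandan too hard; scraping puts real load on its servers. Once you have practiced here, move on to larger sites.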
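One simple way to go easier on the server is to pause between requests. The sketch below is an assumption about how that could be done, not something in the original script; `throttled` is a hypothetical helper and the two-second delay is an arbitrary choice.

```python
import time


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # wait before every item except the first
        yield item


# Example: iterate over page URLs with a short pause between them
pages = ["http://jandan.net/ooxx/page-{}#comments".format(i) for i in range(80, 82)]
for page in throttled(pages, delay_seconds=0.1):
    print("would fetch", page)
```

In the script above, wrapping `urls` and `img_url` in `throttled(...)` inside the two loops would space out both the page loads and the image downloads.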
