A Flood of Girl Pics Coming Your Way: Scraping Every Image on Mzitu with Python

2018-09-19  DannyWu
(Figure: a sample of the scraped images)

Hi everyone, I'm DannyWu. I only started learning Python web scraping recently, and lately I've been searching for interesting crawlers to practice on. One of the projects I came across scrapes Mzitu. Those write-ups are well done, so I decided to write my own version as a learning exercise. I'm sharing it below; if you know a better way to implement any of it, feel free to discuss in the comments.
My blog: DannyWu博客
WeChat official account: DannyWu博客
My GitHub: DannyWu

1. Installing the required libraries

'''
author:DannyWu
site:www.idannywu.com
'''
pip install requests
pip install bs4
pip install lxml

Note that os, pathlib and multiprocessing are part of Python's standard library, so they don't need to be (and in fact can't be) installed with pip. lxml is listed here because the code below asks BeautifulSoup to use the 'lxml' parser.
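For context, the snippets in the rest of this post assume the following imports at the top of the script (the post doesn't show them separately):

import os
from pathlib import Path
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup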

2. Site analysis

First, open the Mzitu homepage (mzitu.com) and click the "Latest" (最新) menu item. A little inspection shows that "Latest" is simply every photo set on the site ordered by publish date, and its listing pages are mzitu.com/page/1, mzitu.com/page/2, and so on, with the trailing number incrementing. So if we scrape every set listed under "Latest", we've effectively scraped the whole site.
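To make the URL pattern concrete, the listing-page URLs can be generated like this (three pages here is just an example):

base_url = 'http://www.mzitu.com/page/{}/'
listing_urls = [base_url.format(i + 1) for i in range(3)]
# -> ['http://www.mzitu.com/page/1/', 'http://www.mzitu.com/page/2/', 'http://www.mzitu.com/page/3/']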

3. Building the request headers

After falling into this trap myself, I found that the request must carry a Referer header or the site won't serve the images. The header construction is below.

def get_header(referer):
    # Mzitu checks the Referer header, so it is set per request.
    header = {
        'cookie': 'Hm_lvt_dbc355aef238b6c32b43eacbbf161c3c=1536981553; Hm_lpvt_dbc355aef238b6c32b43eacbbf161c3c=1536986863',
        'referer': referer,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    return header
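For example (the gallery URL below is only illustrative), every request passes the page being fetched as its own referer:

url = 'http://www.mzitu.com/12345/'  # hypothetical gallery URL
resp = requests.get(url, headers=get_header(url))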

4. Downloading images

def download_pics(pic_page_url):
    # Each image page is requested with its own URL as the Referer.
    header = get_header(pic_page_url)
    try:
        page_data = requests.get(pic_page_url, headers=header)
        soup_data = BeautifulSoup(page_data.text, 'lxml')
        img = soup_data.select('.main-image p a img')[0]
        img_link = img.get('src')
        img_alt = img.get('alt')
        print("img_link : ", img_link)
        pic_name = img_link.split('/')[-1]
        # Save under mzitu/<gallery title>/, creating the folders if needed.
        save_dir = Path("mzitu") / img_alt
        save_dir.mkdir(parents=True, exist_ok=True)
        pic_path = save_dir / pic_name
        if pic_path.is_file():
            print("######## This image has already been downloaded ########")
            return
        pic_data = requests.get(img_link, headers=header)
        with open(pic_path, 'wb') as f:
            f.write(pic_data.content)
    except Exception as e:
        # A failed page or image request just skips this picture.
        print("Failed to download", pic_page_url, ":", e)
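Called on a single image page (the URL below is hypothetical), this saves that one picture under mzitu/<gallery title>/:

download_pics('http://www.mzitu.com/12345/3')  # hypothetical image-page URL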

5. Getting the links to every image page in one photo set

def get_pic_page_for_one_group(url):
    header = get_header(url)
    # All pictures go under ./mzitu relative to the working directory.
    folder_path = Path(os.getcwd()) / 'mzitu'
    folder_path.mkdir(exist_ok=True)
    pages_link = []
    try:
        web_data = requests.get(url, headers=header)
        soup = BeautifulSoup(web_data.text, 'lxml')
        title = soup.select('.main-title')[0].text
        print("Saving this set to:", folder_path / title)
        # The second-to-last page-navigation entry holds the number of pictures in the set.
        pages_total = int(soup.select('.pagenavi a span')[-2].text)
        print("Number of pictures in this set:", pages_total)
        for i in range(pages_total):
            pages_link.append(url + str(i + 1))
    except Exception as e:
        print("Failed to read the set page", url, ":", e)
    return pages_link
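Given a set URL (hypothetical below), the function returns one URL per picture in the set, which can then be fed to download_pics:

page_urls = get_pic_page_for_one_group('http://www.mzitu.com/12345/')  # hypothetical set URL
# e.g. ['http://www.mzitu.com/12345/1', 'http://www.mzitu.com/12345/2', ...]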

6. Downloading every set on a listing page with multiple processes

def download_pics_for_one_page(url, header, pool_num):
    try:
        web_data = requests.get(url, headers=header).text
        soup = BeautifulSoup(web_data, 'lxml')
        # Each <li> on the listing page links to one photo set.
        pages_url = soup.select('#pins li span a')
        for page_url in pages_url:
            print('=============== Starting download:', page_url.text + " ==============")
            print("Set link:", page_url.get('href'))
            url_list = get_pic_page_for_one_group(page_url.get('href') + "/")
            # One worker pool per set; each process downloads one picture at a time.
            pool = Pool(pool_num)
            pool.map(download_pics, url_list)
            pool.close()
            pool.join()
            print("====================== Download finished ======================")
            print("")
    except Exception as e:
        print("Failed to process listing page", url, ":", e)
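As a standalone call it looks like this (the page number and process count are just example values):

header = get_header('http://www.mzitu.com/')
download_pics_for_one_page('http://www.mzitu.com/page/1/', header, 4)

Creating a fresh Pool for every set keeps the code simple; reusing one pool across sets would avoid repeated process start-up, which might be worth doing for a large crawl.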

7. Downloading every image on the site

if __name__ == '__main__':
    hello = ("                     |----------------------------------|\n"
             "                     | Welcome to the headless, multi-  |\n"
             "                     | process image downloader!        |\n"
             "                     | Target site: mzitu.com           |\n"
             "                     | Author: DannyWu                  |\n"
             "                     |   (mydannywu@gmail.com)          |\n"
             "                     | Blog: www.idannywu.com           |\n"
             "                     | For personal study only; please  |\n"
             "                     | do not use it for any commercial |\n"
             "                     | purpose. If this infringes, con- |\n"
             "                     | tact me and it will be removed.  |\n"
             "                     |----------------------------------|")
    print(hello)
    page_num = int(input('How many listing pages to download: '))
    pool_num = int(input('How many worker processes to start: '))
    start_tip = "                                Image downloader starting...           "
    print(start_tip)
    # The listing pages don't need a specific referer, so the site root is used here.
    header = get_header('http://www.mzitu.com/')
    try:
        base_url = 'http://www.mzitu.com/page/{}/'
        start = "################ Page {} started ################"
        end = "################ Page {} finished ################"
        for i in range(page_num):
            print(start.format(i + 1))
            url = base_url.format(i + 1)
            download_pics_for_one_page(url, header, pool_num)
            print(end.format(i + 1))
    except Exception as e:
        print("Aborted:", e)
    print("")
    print("################## All downloads finished! ##################")

(Screenshot of a finished run: 2018-09-19_212105.png)

And that's everything. The full source code is on my GitHub: DannyWu
Disclaimer: this project is just a small practice exercise from my Python studies. Please do not use it for commercial purposes; I take no responsibility for how it is used. If it infringes on anything, contact me and it will be removed promptly.
If you repost this article, please include a link back to the original; otherwise it will be treated as infringement.
