
Scraping Images with Scrapy

2018-03-13  whong736

Goal: scrape the images from the image site http://hunter-its.com

1. Create a project named beauty

scrapy startproject beauty
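
startproject lays out the standard Scrapy skeleton, roughly like this (the exact files depend on your Scrapy version):

beauty/
    scrapy.cfg
    beauty/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py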

2. cd into the project directory and generate a spider from the basic template

cd beauty

scrapy genspider hunter hunter-its.com
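
The basic template produces a spider skeleton roughly like the following (the boilerplate varies slightly between Scrapy versions); the rest of this post fills in its parse method:

# -*- coding: utf-8 -*-
import scrapy


class HunterSpider(scrapy.Spider):
    name = 'hunter'
    allowed_domains = ['hunter-its.com']
    start_urls = ['http://hunter-its.com/']

    def parse(self, response):
        pass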


3. Open the project in PyCharm and write the item first

Open items.py and define fields for the image name and address:

import scrapy

class BeautyItem(scrapy.Item):

    name = scrapy.Field()
    address = scrapy.Field()
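
A BeautyItem behaves like a dict, which is how the spider fills it in later; a quick sanity check in a Python shell, using placeholder values:

>>> from beauty.items import BeautyItem
>>> item = BeautyItem()
>>> item['name'] = 'example'                      # placeholder value
>>> item['address'] = 'http://example.com/a.jpg'  # placeholder value
>>> item['name']
'example'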


4. Write the spider

Import the BeautyItem class defined above, along with Request:

from beauty.items import BeautyItem
from scrapy.http import Request

Use XPath to select all of the image list nodes:
pics = response.xpath('//div[@class="pic"]/ul/li')
Then loop over the li nodes and extract each image's name and address:

        for pic in pics:
            item = BeautyItem()
            name = pic.xpath('./a/img/@alt').extract()[0]
            address = pic.xpath('./a/img/@src').extract()[0]

            item['name'] = name
            item['address'] = address

            yield item
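
Before putting this into the spider, the selectors can be tried out interactively in scrapy shell (the pic class and the a/img structure are the ones used above):

scrapy shell http://hunter-its.com/m/1.html
>>> pics = response.xpath('//div[@class="pic"]/ul/li')
>>> pics[0].xpath('./a/img/@alt').extract()[0]
>>> pics[0].xpath('./a/img/@src').extract()[0]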

Schedule requests for the remaining pages, reusing parse as the callback. This loop belongs at the method level, after the per-item loop, so it only runs once per page (Scrapy's duplicate filter drops any URL that has already been scheduled):

        for i in range(2, 8):
            url = 'http://hunter-its.com/m/' + str(i) + '.html'
            print(url)
            yield Request(url, callback=self.parse)

Complete code:

# -*- coding: utf-8 -*-
import scrapy
from beauty.items import BeautyItem
from scrapy.http import Request


class HunterSpider(scrapy.Spider):
    name = 'hunter'
    allowed_domains = ['hunter-its.com']
    start_urls = ['http://hunter-its.com/m/1.html']

    def parse(self, response):
        # select all of the image list nodes
        pics = response.xpath('//div[@class="pic"]/ul/li')

        for pic in pics:
            item = BeautyItem()
            name = pic.xpath('./a/img/@alt').extract()[0]
            address = pic.xpath('./a/img/@src').extract()[0]

            item['name'] = name
            item['address'] = address

            yield item

        # queue the remaining pages (2 to 7) with the same parse callback
        for i in range(2, 8):
            url = 'http://hunter-its.com/m/' + str(i) + '.html'
            print(url)
            yield Request(url, callback=self.parse)
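
Hardcoding pages 2 through 7 works for this site, but when the page count isn't known in advance you could follow the site's "next page" link instead. A minimal sketch, assuming a hypothetical //a[@class="next"] link in the markup:

        # follow the next-page link if there is one
        # (the "next" class is an assumption about the markup, not taken from this site)
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)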


5. Write the item pipeline in pipelines.py, importing the requests module to download each image

import requests

class BeautyPipeline(object):
    def process_item(self, item, spider):
        # pretend to be a regular browser
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
        # fetch the image with a GET request
        r = requests.get(url=item['address'], headers=headers, timeout=4)

        print(item['address'])
        # write the image to a local directory (the directory must already exist)
        with open(r'/Users/vincentwen/Downloads/hunter/' + item['name'] + '.jpg', 'wb') as f:
            f.write(r.content)

        return item
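
As an alternative to writing the download code by hand, Scrapy ships with a built-in ImagesPipeline (it needs Pillow installed) that handles downloading and storage for you. A minimal sketch of what that would look like here; it expects the URL field to hold a list, so the spider would set address to a one-element list:

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/Users/vincentwen/Downloads/hunter'  # any writable directory
IMAGES_URLS_FIELD = 'address'                        # item field that holds the URL list

# in the spider:
item['address'] = [address]  # ImagesPipeline expects a list of URLs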


6. Enable the pipeline in settings.py via ITEM_PIPELINES (the number, 0-1000, controls the order in which pipelines run)

ITEM_PIPELINES = {
   'beauty.pipelines.BeautyPipeline': 100,
}

7. Run the spider

scrapy crawl hunter 
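
If you prefer to launch it from PyCharm rather than a terminal, a small runner script at the project root (next to scrapy.cfg) does the same thing; the file name run.py is just a suggestion:

# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('hunter')  # the spider name defined in HunterSpider.name
process.start()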