1. A first look at the Scrapy framework

2019-06-10  思绪太重_飘不动

Using the Scrapy framework

1. Creating a crawler project

1. Create a Scrapy project:
    scrapy startproject project_name(the project's name)

2. Create a spider:
    cd into project_name (switch into the project directory, then create the spider)
    scrapy genspider spider_name(the spider's name) spider.com(domain of the site to crawl)  (concrete commands for the meiju example follow after this list)

3. Write the spider code in spider_name.py

4. Start the project:
    Option 1: cd into the folder containing the spider and run: scrapy runspider spider_name.py
    Option 2: scrapy crawl spider_name
    Option 3: create a start.py and write the following code:
    import scrapy.cmdline  

    # Run a scrapy command programmatically
    def main():
        # Start the spider and show the log
        # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
        # scrapy.cmdline.execute("scrapy crawl movie".split())
        # Start the spider without showing the log
        scrapy.cmdline.execute("scrapy crawl movie --nolog".split())


    if __name__ == '__main__':
        main()
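
For reference, the meiju project used in section 4 below can be created with commands like these (the project name, spider name, and domain are taken from that example; adapt them to your own project):
    scrapy startproject meiju
    cd meiju
    scrapy genspider movie www.meijutt.com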

2. Extracting text content in the spider file

print(type(response))   # show the type of the response
print(response.text)    # show the response body as a string
print(response.body)    # show the response body as bytes
The extract() function pulls out the text content matched by a selector, as a list
The extract_first() function pulls out the first matched text content

1. Scrapy has XPath support built in, so XPath is normally used to parse the content.
2. Prefer extract_first() to pull out text; it does not raise an error when the result is empty (it returns None), as the sketch below shows.
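
A minimal sketch of the difference between the two methods (the spider name and XPath expressions here are only placeholders):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.meijutt.com/new100.html']

    def parse(self, response):
        # extract() returns a list of every matched string (possibly empty)
        all_titles = response.xpath('//a/text()').extract()
        # extract_first() returns the first match, or None when nothing matched,
        # so an empty result does not raise an error
        first_title = response.xpath('//a/text()').extract_first()
        print(len(all_titles), first_title)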

3. Example: scraping shows from a US TV series site

Scrape the latest shows from url = 'https://www.meijutt.com/new100.html'
  

Data to scrape: { show name: name, genre: mjjp, TV station airing it: mjtv, update time: data_time }

4. Full code

# 1. start.py, created manually in the project directory
import scrapy.cmdline


# Run a scrapy command programmatically
def main():
    # Start the spider and show the log
    # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
    # scrapy.cmdline.execute("scrapy crawl movie".split())
    # Start the spider without showing the log
    scrapy.cmdline.execute("scrapy crawl movie --nolog".split())
    # Save the output as a JSON file
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.json --nolog".split())
    # Save the output as an XML file
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.xml --nolog".split())
    # Save the output as a CSV file
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.csv --nolog".split())


if __name__ == '__main__':
    main()
# 2. movie.py (the spider file we created)
# -*- coding: utf-8 -*-
import scrapy
from ..items import MeijuItem


# Inherits from the base class scrapy.Spider
class MovieSpider(scrapy.Spider):
    name = 'movie'  # name of the spider
    allowed_domains = ['www.meijutt.com']   # domains the spider is allowed to crawl
    start_urls = ['https://www.meijutt.com/new100.html']    # list of URLs to start crawling from

    # parse() is defined here to extract the data
    # The response parameter is the server's response, which contains the data we want
    def parse(self, response):
        movie_list = response.xpath('//ul[@class="top-list  fn-clear"]/li')
        for movie in movie_list:
            name = movie.xpath('./h5/a/text()').extract_first()
            mjjp = movie.xpath('./span[@class="mjjq"]/text()').extract_first()
            mjtv = movie.xpath('./span[@class="mjtv"]/text()').extract_first()
            data_time = movie.xpath('./div[@class="lasted-time new100time fn-right"]/text()').extract_first()
            # print(name, mjjp, mjtv, data_time)

            item = MeijuItem()
            item['name'] = name
            item['mjjp'] = mjjp
            item['mjtv'] = mjtv
            item['data_time'] = data_time

            # yield passes each item on to pipelines.py
            yield item
# 3. items.py (the model that defines the scraped data)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Define the model, similar to a Model in Django
class MeijuItem(scrapy.Item):
    # Define the fields for the content being scraped
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # show name
    mjjp = scrapy.Field()  # genre
    mjtv = scrapy.Field()  # TV station
    data_time = scrapy.Field()  # update time
# 4. pipelines.py (the pipeline handles storing the items)
# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


# The pipeline is where the scraped items are processed and stored
# Storing data through a pipeline keeps all storage logic in one place; it can also be
# extended to drop duplicate items (see the DuplicatesPipeline sketch after this class)
class MeijuPipeline(object):
    def __init__(self):
        pass
    
    # Called when the spider starts; the method name is fixed by Scrapy, but you add it yourself when needed
    def open_spider(self, spider):
        print('Start crawling......')
        self.fp = open('movie.txt', 'a', encoding='utf-8')

    # Processes each item passed in; it is called once for every item
    # item: each item yielded from parse() in the spider file
    # spider: the spider object
    def process_item(self, item, spider):
        string = str((item['name'], item['mjjp'], item['mjtv'], item['data_time'])) + '\n'
        self.fp.write(string)
        self.fp.flush()
        return item

    # Called when the spider finishes; the method name is fixed by Scrapy, but you add it yourself when needed
    def close_spider(self, spider):
        print('Crawling finished......')
        self.fp.close()
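
# Note: items are NOT deduplicated automatically. If duplicates matter, a separate
# dedup pipeline can be added; a minimal sketch (hypothetical, keyed on the item's name,
# and it would also need its own entry in ITEM_PIPELINES to be active):
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        # Drop any item whose name was already seen during this crawl
        if item['name'] in self.seen_names:
            raise DropItem('Duplicate item: %s' % item['name'])
        self.seen_names.add(item['name'])
        return item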
# 5. settings.py (most settings in this file are commented out by default; remove the comment marker to use one)
# -*- coding: utf-8 -*-

# Configuration file for the crawler
# Scrapy settings for meiju project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Name of the project
BOT_NAME = 'meiju'

# Where the spider modules are located
SPIDER_MODULES = ['meiju.spiders']
# Where newly generated spiders are placed
NEWSPIDER_MODULE = 'meiju.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the User-Agent; unused by default
#USER_AGENT = 'meiju (+http://www.yourdomain.com)'

# Obey robots.txt rules
# robots.txt is obeyed by default; change to False to ignore it
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Maximum concurrent requests per domain; the default is 16
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middlewares; unused by default
#SPIDER_MIDDLEWARES = {
#    'meiju.middlewares.MeijuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares; unused by default
#DOWNLOADER_MIDDLEWARES = {
#    'meiju.middlewares.MeijuDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the item pipelines; commented out by default, this block must be uncommented to use the pipeline
ITEM_PIPELINES = {
   'meiju.pipelines.MeijuPipeline': 300,
}
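# The number (0-1000) is the pipeline's order: when several pipelines are enabled,
# lower numbers run first.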

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'