1. Getting to know the Scrapy framework
2019-06-10 · 思绪太重_飘不动
Using the Scrapy framework
1. Creating a crawler project
1. Create a Scrapy project:
scrapy startproject project_name  (project_name is the name of the project)
2. Create a spider:
cd project_name  (switch into the project directory first, then create the spider)
scrapy genspider spider_name spider.com  (spider_name is the spider's name, spider.com is the domain of the site to crawl)
3. Write the spider code in spider_name.py
4. Run the project:
Option 1: cd into the folder containing the spider and run: scrapy runspider spider_name.py
Option 2: scrapy crawl spider_name
Option 3: create a start.py and write the following code:
import scrapy.cmdline

# Run a scrapy command programmatically
def main():
    # Start the spider with log output
    # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
    # scrapy.cmdline.execute("scrapy crawl movie".split())
    # Start the spider without log output
    scrapy.cmdline.execute("scrapy crawl movie --nolog".split())

if __name__ == '__main__':
    main()
2. Extracting text content in the spider file
print(type(response))  # show the response type
print(response.text)   # show the response body as a string
print(response.body)   # show the response body as bytes
extract() returns the text content of all matched selectors as a list
extract_first() returns the text content of the first matched selector
1. Scrapy has built-in XPath support, so XPath is the usual way to parse content
2. Prefer extract_first() to extract text: when nothing matches it returns None instead of raising an error
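The safety advantage of extract_first() can be sketched with a standard-library stand-in (xml.etree here is only an illustration; Scrapy's own Selector behaves the same way on the points shown):

```python
# A minimal stdlib stand-in illustrating extract() vs extract_first();
# scrapy's Selector methods behave analogously.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<ul><li>Movie A</li><li>Movie B</li></ul>")

def extract(nodes):
    # like Selector.extract(): return the text of every match as a list
    return [n.text for n in nodes]

def extract_first(nodes, default=None):
    # like Selector.extract_first(): first match, or a default instead of an error
    texts = extract(nodes)
    return texts[0] if texts else default

lis = doc.findall("./li")
print(extract(lis))        # all matches
print(extract_first(lis))  # first match only
print(extract_first(doc.findall("./p")))  # no match: None, not an IndexError
```

With extract()[0] the empty-match case would raise an IndexError and abort the parse; extract_first() just yields None for that field.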
3. Example: crawling shows from a US TV-series site
Crawl the latest shows from url = 'https://www.meijutt.com/new100.html'
Data to collect: { show name: name, category: mjjp, broadcasting network: mjtv, update time: data_time }
4. Full code
# 1. start.py, created manually in the project directory
import scrapy.cmdline

# Run a scrapy command programmatically
def main():
    # Start the spider with log output
    # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
    # scrapy.cmdline.execute("scrapy crawl movie".split())
    # Start the spider without log output
    scrapy.cmdline.execute("scrapy crawl movie --nolog".split())
    # Save the output as a JSON file
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.json --nolog".split())
    # Save the output as an XML file
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.xml --nolog".split())
    # Save the output as a CSV file
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.csv --nolog".split())

if __name__ == '__main__':
    main()
# 2. movie.py (the spider file you created)
# -*- coding: utf-8 -*-
import scrapy
from ..items import MeijuItem

# Spiders inherit from the base class scrapy.Spider
class MovieSpider(scrapy.Spider):
    name = 'movie'  # spider name (the name used by `scrapy crawl movie`)
    allowed_domains = ['www.meijutt.com']  # domains the spider is allowed to crawl
    start_urls = ['https://www.meijutt.com/new100.html']  # initial URLs to crawl

    # parse() is the callback that parses the data
    # response: the server's response, containing the data we want
    def parse(self, response):
        movie_list = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for movie in movie_list:
            name = movie.xpath('./h5/a/text()').extract_first()
            mjjp = movie.xpath('./span[@class="mjjq"]/text()').extract_first()
            mjtv = movie.xpath('./span[@class="mjtv"]/text()').extract_first()
            data_time = movie.xpath('./div[@class="lasted-time new100time fn-right"]/text()').extract_first()
            # print(name, mjjp, mjtv, data_time)
            item = MeijuItem()
            item['name'] = name
            item['mjjp'] = mjjp
            item['mjtv'] = mjtv
            item['data_time'] = data_time
            # yield passes each item on to pipelines.py
            yield item
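parse() is a generator: Scrapy iterates over it and routes every yielded item through the enabled pipelines. A stripped-down, dependency-free illustration of the same pattern, with a plain dict standing in for MeijuItem:

```python
# parse() yields items one at a time; the framework consumes the generator.
def parse(rows):
    for row in rows:
        item = {'name': row}  # stand-in for MeijuItem
        yield item

# Iterating the generator collects every yielded item, in order.
items = list(parse(['Movie A', 'Movie B']))
print(items)
```

Because nothing runs until the generator is iterated, a spider can yield thousands of items without building them all in memory at once.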
# 3. items.py (defines the data model)
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

# Data model, similar to a Model in Django
class MeijuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()       # show name
    mjjp = scrapy.Field()       # category
    mjtv = scrapy.Field()       # broadcasting network
    data_time = scrapy.Field()  # update time
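A scrapy.Item instance behaves like a dict restricted to its declared Fields: assigning an undeclared key raises KeyError, which catches typos early. A standard-library sketch of that behaviour (StrictItem is a hypothetical stand-in; MeijuItem is the real thing):

```python
# Sketch of scrapy.Item's key restriction using a plain dict subclass.
class StrictItem(dict):
    FIELDS = {'name', 'mjjp', 'mjtv', 'data_time'}  # mirrors the declared Fields

    def __setitem__(self, key, value):
        if key not in self.FIELDS:
            raise KeyError(f'{key!r} is not a declared field')
        super().__setitem__(key, value)

item = StrictItem()
item['name'] = 'Movie A'   # fine: declared field
try:
    item['rating'] = 5     # not declared: rejected
except KeyError as exc:
    print(exc)
```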
# 4. pipelines.py (pipelines handle storage)
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Pipelines process and store the scraped items; cleaning, validation and
# deduplication logic also belongs here (note: Scrapy deduplicates requests
# by default, not items - item dedup you implement yourself)
class MeijuPipeline(object):
    def __init__(self):
        pass

    # Called when the spider opens; not generated by default, add it yourself
    def open_spider(self, spider):
        print('Crawl started......')
        self.fp = open('movie.txt', 'a', encoding='utf-8')

    # Handles each incoming item; called once for every item
    # item: each item yielded by parse() in the spider file
    # spider: the spider object
    def process_item(self, item, spider):
        string = str((item['name'], item['mjjp'], item['mjtv'], item['data_time'])) + '\n'
        self.fp.write(string)
        self.fp.flush()
        return item

    # Called when the spider closes; not generated by default, add it yourself
    def close_spider(self, spider):
        print('Crawl finished......')
        self.fp.close()
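Since item deduplication is not automatic, a pipeline can implement it by remembering a key of every item it has seen. A minimal sketch (in real Scrapy code you would raise scrapy.exceptions.DropItem for a repeat; returning None here keeps the sketch dependency-free):

```python
# Hypothetical dedup pipeline: drops items whose 'name' was already seen.
class DedupPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item['name'] in self.seen:
            return None  # real Scrapy code: raise DropItem(...)
        self.seen.add(item['name'])
        return item

pipeline = DedupPipeline()
first = pipeline.process_item({'name': 'Movie A'}, None)    # kept
repeat = pipeline.process_item({'name': 'Movie A'}, None)   # dropped
```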
# 5. settings.py (most settings are commented out by default; uncomment a setting to use it)
# -*- coding: utf-8 -*-
# The crawler's configuration file
# Scrapy settings for meiju project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Name of the project
BOT_NAME = 'meiju'
# Where the spider modules are located
SPIDER_MODULES = ['meiju.spiders']
# Where newly generated spiders are placed
NEWSPIDER_MODULE = 'meiju.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the User-Agent; disabled by default
#USER_AGENT = 'meiju (+http://www.yourdomain.com)'
# Obey robots.txt rules
# Robots compliance is on by default; change to False to ignore robots.txt
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Concurrent requests per domain/IP, 16 by default
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middlewares, disabled by default
#SPIDER_MIDDLEWARES = {
# 'meiju.middlewares.MeijuSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares, disabled by default
#DOWNLOADER_MIDDLEWARES = {
# 'meiju.middlewares.MeijuDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the item pipeline; commented out by default, so it must be uncommented to use pipelines
ITEM_PIPELINES = {
'meiju.pipelines.MeijuPipeline': 300,
}
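When several pipelines are enabled, the number (0-1000) sets the order in which each item passes through them; lower numbers run first. A hypothetical example (DedupPipeline is an assumed custom class name, not something Scrapy generates):

```python
# Items would flow through the 200 pipeline before the 300 pipeline.
ITEM_PIPELINES = {
    'meiju.pipelines.DedupPipeline': 200,  # assumed custom pipeline, runs first
    'meiju.pipelines.MeijuPipeline': 300,
}
```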
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'