Scrapy Basics

2018-12-01  jdzhangxin

Installation

Scrapy dependencies

You can simply search for and install them in PyCharm.
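Alternatively, Scrapy can be installed from the command line with pip (standard pip usage, not specific to PyCharm):

pip install scrapy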

Test whether the installation succeeded

scrapy bench

If the benchmark crawl starts and prints statistics, Scrapy and its dependencies are installed correctly.

Writing the code

import scrapy


class TestScrapy(scrapy.Spider):
    name = 'TestScrapy'
    # Top-100 board pages, paginated 10 entries per page via the offset parameter
    start_urls = ["http://maoyan.com/board/4?offset=%d" % (offset * 10) for offset in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            # Send a User-Agent header so the site does not reject the request
            yield scrapy.Request(url, headers={"User-Agent": "Chrome"})

    def parse(self, response):
        for dd in response.css("dd"):
            yield {
                'index': dd.css("i.board-index::text").get(),
                'image': dd.css("img.board-img::attr(data-src)").get(),
                'title': dd.css("p.name>a::text").get(),
                # [3:] drops the "主演:" ("starring:") prefix
                'actor': dd.css("p.star::text").get().strip()[3:],
                # [5:] drops the "上映时间:" ("release date:") prefix
                'time': dd.css("p.releasetime::text").get()[5:],
                'score': dd.css("i.integer::text").get()
            }

Run from the command line

scrapy runspider <spider file path> -o <output file>
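For example, with the spider above saved as TestScrapy.py, the following writes the scraped items to a JSON file (the output file name is arbitrary):

scrapy runspider TestScrapy.py -o result.json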

Run directly as a Python script

import scrapy
import scrapy.cmdline


class TestScrapy(scrapy.Spider):
    name = 'TestScrapy'
    start_urls = ["http://maoyan.com/board/4?offset=%d" % (offset * 10) for offset in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers={"User-Agent": "Chrome"})

    def parse(self, response):
        for dd in response.css("dd"):
            yield {
                'index': dd.css("i.board-index::text").get(),
                'image': dd.css("img.board-img::attr(data-src)").get(),
                'title': dd.css("p.name>a::text").get(),
                'actor': dd.css("p.star::text").get().strip()[3:],
                'time': dd.css("p.releasetime::text").get()[5:],
                'score': dd.css("i.integer::text").get()
            }

if __name__ == "__main__":
    # Equivalent to typing "scrapy runspider TestScrapy.py" in a terminal
    scrapy.cmdline.execute("scrapy runspider TestScrapy.py".split())
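An alternative to scrapy.cmdline.execute is Scrapy's CrawlerProcess API, which runs the spider in-process (a minimal sketch; it assumes the TestScrapy class above is defined in the same file):

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess({"USER_AGENT": "Chrome"})  # optional settings dict
    process.crawl(TestScrapy)  # schedule the spider
    process.start()            # block until crawling finishes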

Using the framework

(Figure: the Scrapy framework)

Main workflow

  1. Create
  2. Code
  3. Run

1. Create

Type the following in the Terminal:

scrapy startproject <project name>

This automatically creates a project laid out as follows:

.
├── <project name>
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
└── scrapy.cfg
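The spider code in the next step imports from MyScrapy.items, so it assumes the project was created with that name and that the spider skeleton was generated inside the project directory:

scrapy startproject MyScrapy
cd MyScrapy
scrapy genspider maoyan maoyan.com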

2. Code

Basic steps

  1. Define the Item to extract
  2. Write a spider that crawls the site and extracts Items
  3. Write an Item Pipeline to store the extracted Items (i.e., the data)

1. Define the Item to extract

import scrapy


class MaoyanItem(scrapy.Item):
    # Each Field() declares a key the spider is allowed to set on the item
    title = scrapy.Field()
    index = scrapy.Field()
    image = scrapy.Field()
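For reference, an Item instance behaves like a dict that only accepts the declared fields (a quick sketch; the 'year' key is deliberately undeclared):

item = MaoyanItem()
item['title'] = 'Example'  # fine: 'title' was declared above
item['year'] = 2018        # raises KeyError: 'year' is not a declared field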

2. Write a spider that crawls the site and extracts Items

import scrapy
import scrapy.cmdline

from MyScrapy.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4']

    def parse(self, response):
        for dd in response.css("dd"):
            item = MaoyanItem()
            # extract_first() is the older spelling of .get()
            item["index"] = dd.css("i.board-index::text").extract_first()
            item["title"] = dd.css("p.name>a::text").extract_first()
            item["image"] = dd.css("img.board-img::attr(data-src)").extract_first()
            yield item


if __name__ == "__main__":
    # Running this file directly writes the items to maoyan.json
    scrapy.cmdline.execute("scrapy runspider maoyan.py -o maoyan.json".split())

3. Write an Item Pipeline to store the extracted Items (i.e., the data)

Here we simply rely on Scrapy's built-in feed export (the -o option) to save the data; a sketch of what a custom pipeline would look like follows.
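A minimal custom pipeline might look like this (a sketch; the class name MaoyanPipeline and the output file are assumptions, and the class would live in the project's pipelines.py):

import json


class MaoyanPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open('maoyan_pipeline.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

To activate it, register it in settings.py:

ITEM_PIPELINES = {'MyScrapy.pipelines.MaoyanPipeline': 300}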

3. Run

scrapy runspider maoyan.py -o maoyan.json

or

scrapy crawl maoyan -o maoyan.json

4. Notes

The base class scrapy.Spider

No. | Attribute | Description | Purpose
--- | --- | --- | ---
1 | name | Spider name; must be unique within the project | Identifier used to run the spider, e.g. scrapy crawl <spider name>
2 | allowed_domains | Allowed domains | Restricts which sites the spider may crawl
3 | start_urls | URLs to start crawling from | Specifies the pages to crawl

No. | Method | Purpose
--- | --- | ---
1 | parse(response) | Parses the page content
2 | start_requests() | Generates the initial requests

Requests are constructed as scrapy.Request(url, callback=self.parse, headers=..., meta=...). To route a response to a different parser:

yield scrapy.Request(url, callback=other_parse)
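The meta dict is the standard way to attach data to a request; the callback reads it back from response.meta. A small sketch (parse_page and the 'page' key are illustrative names):

def start_requests(self):
    for page in range(10):
        url = "http://maoyan.com/board/4?offset=%d" % (page * 10)
        yield scrapy.Request(url, callback=self.parse_page, meta={'page': page})

def parse_page(self, response):
    page = response.meta['page']  # value attached when the request was built
    self.logger.info("Parsing page %d", page)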

5. Summary

Three important commands

No. | Function | Command | Working directory
--- | --- | --- | ---
1 | Create a project | scrapy startproject <project name> | directory that will hold the project
2 | Create a spider | scrapy genspider <spider name> <domain> | project directory
3 | Run a spider | scrapy crawl <spider name> -o <output file> | project directory

Basic anti-scraping countermeasures

  1. User-agent rotation (a sample headers_pool definition appears after these snippets)

def start_requests(self):  # requires "import random" at module level
    for url in self.start_urls:
        headers = random.choice(headers_pool)  # pick a random headers dict
        yield scrapy.Request(url, callback=self.parse, headers=headers)
  2. IP proxy

def start_requests(self):
    for url in self.start_urls:
        proxy_addr = random.choice(proxy_pool)  # pick a random proxy
        yield scrapy.Request(url, callback=self.parse, meta={'proxy': proxy_addr})  # set the proxy via the 'proxy' meta key

The proxy_addr string has the form 'Scheme://IP:Port'.
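Neither headers_pool nor proxy_pool is defined in the original; both are assumed to be module-level lists you maintain yourself. A minimal sketch (all values are placeholders):

headers_pool = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15"},
]

proxy_pool = [
    "http://127.0.0.1:8080",  # placeholder proxy addresses
    "http://127.0.0.1:8081",
]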
