Scrapy (1): Getting Started

2021-06-06 万事万物

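What is Scrapy

Scrapy is an application framework for crawling websites and extracting structured data; with only a small amount of code, we can scrape pages quickly.
Scrapy is built on Twisted (pronounced ['twistid']), an asynchronous networking framework, which helps speed up downloads.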
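
To make the benefit of asynchronous networking concrete, here is a minimal sketch. It uses Python's standard asyncio rather than Twisted, and `fake_download`, its delay, and the example.com URLs are hypothetical stand-ins for real HTTP requests. The point is only that the waiting time of several downloads overlaps instead of adding up.

```python
import asyncio
import time


async def fake_download(url: str, delay: float) -> str:
    """Hypothetical stand-in for an HTTP request: it just waits `delay` seconds."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def download_all(urls, delay=0.1):
    # Schedule every download at once; the waits overlap instead of queueing.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


if __name__ == "__main__":
    urls = [f"http://example.com/page/{i}" for i in range(5)]
    start = time.perf_counter()
    bodies = asyncio.run(download_all(urls))
    elapsed = time.perf_counter() - start
    # Five 0.1 s waits finish in roughly 0.1 s in total, not 0.5 s.
    print(len(bodies), f"{elapsed:.2f}s")
```

Scrapy applies the same idea through Twisted: while one response is still in flight, other requests can already be sent.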

Official site
 Chinese site

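Installation guide

Supported Python versions

Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).
Scrapy is written in pure Python and depends on a few key Python packages (among others):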

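lxml, an efficient XML and HTML parser
parsel, an HTML/XML data extraction library written on top of lxml
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
twisted, an asynchronous networking framework
cryptography and pyOpenSSL, to deal with various network-level security needs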
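
If you want to see which of these packages are already present in your environment, a small check like the following can help. This is only a sketch: it relies on `importlib.metadata`, which needs Python 3.8+, and the helper name `installed_versions` is made up for illustration.

```python
import sys
from importlib import metadata

# Package names taken from the dependency list above.
KEY_PACKAGES = ("lxml", "parsel", "w3lib", "Twisted", "cryptography", "pyOpenSSL")


def installed_versions(packages=KEY_PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python 3.6+:", sys.version_info >= (3, 6))
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Installing Scrapy with pip normally pulls these dependencies in automatically, so the check is mostly useful for troubleshooting.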

Installing Scrapy

pip install Scrapy

Creating a project

scrapy startproject <project name>

$ scrapy startproject tutorial

Output

# Location of the Scrapy project: D:\project\python\tutorial
New Scrapy project 'tutorial', using template directory 'd:\tool\templates\project', created in:
    D:\project\python\tutorial
# Hint: you can create a spider with    scrapy genspider example example.com
You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

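Directory structure

tutorial/
    scrapy.cfg            # project configuration file

    tutorial/             # the project's Python module; you import your code from here
        __init__.py

        items.py          # item definitions

        middlewares.py    # spider and downloader middlewares, including custom ones

        pipelines.py      # item pipelines

        settings.py       # project settings

        spiders/          # directory where the spiders you create are stored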
            __init__.py
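
As a quick sanity check of the layout above, the following sketch reports which of the expected files are missing under a given directory. The helper name `missing_files` is hypothetical, and the paths assume the project is called tutorial.

```python
from pathlib import Path

# Relative paths from the tree above (for a project named "tutorial").
EXPECTED = [
    "scrapy.cfg",
    "tutorial/__init__.py",
    "tutorial/items.py",
    "tutorial/middlewares.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders/__init__.py",
]


def missing_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the directory that contains scrapy.cfg.
    print(missing_files("."))
```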

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

# Where the project's settings module is specified (settings.py)
[settings]
default = tutorial.settings

# Deployment: Scrapy supports deploying the project to a server or to the local machine.
[deploy]
#url = http://localhost:6800/
project = tutorial
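
Since scrapy.cfg is an INI-style file, it can be read with Python's standard `configparser`. The sketch below parses the same content as the generated file (comments omitted) and pulls out the two values discussed above; `read_scrapy_cfg` is a hypothetical helper name, not part of Scrapy.

```python
import configparser

# Same content as the generated scrapy.cfg, minus the header comments.
SCRAPY_CFG = """\
[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
"""


def read_scrapy_cfg(text):
    """Return (settings module, deploy project) parsed from scrapy.cfg text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("settings", "default"), parser.get("deploy", "project")


if __name__ == "__main__":
    print(read_scrapy_cfg(SCRAPY_CFG))
```

The commented-out url line is skipped by the parser, which is why the [deploy] section only yields the project name here.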

items.py

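Generating a spider

scrapy genspider <spider name> <spider scope>

Spider name: usually named after the site being crawled, e.g. jingdong, taobao, dangdang.
Spider scope: limits which sites the spider may crawl so that it does not wander off to other websites. It is usually a domain name, e.g. taobao.com.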
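
The idea behind the spider scope can be sketched with the standard library. This is a simplified approximation of domain filtering, not Scrapy's actual implementation, and `in_scope` is a hypothetical helper: a URL counts as in scope when its host is one of the allowed domains or a subdomain of one.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_domains):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    print(in_scope("https://www.taobao.com/", ["taobao.com"]))
    print(in_scope("https://example.org/", ["taobao.com"]))
```

Matching on "." + domain rather than a bare suffix keeps look-alike hosts such as nottaobao.com out of scope.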

Enter the project directory

$ cd tutorial/

Generate the spider

$ scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
  tutorial.spiders.quotes

This creates a Python file named quotes.py in the spiders/ directory, with the following content:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # spider name
    allowed_domains = ['quotes.toscrape.com']  # domains the spider is allowed to crawl
    start_urls = ['http://quotes.toscrape.com/']  # first URL to request, i.e. where the crawl starts; generated by Scrapy and usually needs to be changed

    def parse(self, response):
        pass

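Scraping the current page

This requires familiarity with XPath syntax.

Locating elements

First find where each content div sits; an XPath tool in Chrome can help with this.
Inspecting the page shows that every quote is stored in a div with class='quote'.
XPath for these nodes: //div[@class='col-md-8']/div[@class='quote']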
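
To try the expression without running a crawl, here is a small sketch that uses the standard library's ElementTree instead of Scrapy's selectors. The markup in `PAGE` is a hypothetical, heavily simplified stand-in for the real page, and it must be well-formed XML for ElementTree to parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup.
PAGE = """
<html><body>
  <div class="col-md-8">
    <div class="quote"><span class="text">first quote</span></div>
    <div class="quote"><span class="text">second quote</span></div>
  </div>
  <div class="col-md-4"><div class="tags-box">not a quote</div></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())
# ElementTree needs a leading "." for a document-wide search.
quotes = root.findall(".//div[@class='col-md-8']/div[@class='quote']")
texts = [q.find("./span[@class='text']").text for q in quotes]
print(texts)
```

ElementTree supports only a subset of XPath, so inside a spider the full expression is passed to response.xpath() instead.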

The final code is below. The individual parameters are not explained here, since they are outside the scope of this chapter.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        div_list = response.xpath("//div[@class='col-md-8']/div[@class='quote']")
        for div in div_list:
            item = {}
            # Quote text
            item["text"] = div.xpath("./span[@class='text']/text()").extract_first()
            # Author name shown after "by"
            item["by_text"] = div.xpath(".//small[@class='author']/text()").extract_first()
            # href of the <a> tag that follows "by"
            item["by_href"] = div.xpath("./span/a/@href").extract_first()
            # Collect all tags
            tags_list = div.xpath("./div[@class='tags']/a")
            tags_item_list = []
            for tags in tags_list:
                tags_item = {}
                tags_item["href"] = tags.xpath('./@href').extract_first()
                tags_item["text"] = tags.xpath('./text()').extract_first()
                tags_item_list.append(tags_item)
            # Attach the tag information to the item
            item["tags"] = tags_item_list
            print(item)
            # Print a line of dashes as a separator for readability
            print('-' * 20)

Running the spider

$ scrapy crawl quotes

The scraped output looks like this:

$ scrapy crawl quotes 
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/change/page/1/', 'text': 'change'}, {'href': '/tag/deep-thoughts/page/1/', 'text': 'deep-thoughts'}, {'href': '/tag/thinking/page/1/', 'text': 'thinking'}, {'href': '/tag/world/page/1/', 'text': 'world'}]}
--------------------
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'by_text': 'J.K. Rowling', 'by_href': '/author/J-K-Rowling', 'tags': [{'href': '/tag/abilities/page/1/', 'text': 'abilities'}, {'href': '/tag/choices/page/1/', 'text': 'choices'}]}
--------------------
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/live/page/1/', 'text': 'live'}, {'href': '/tag/miracle/page/1/', 'text': 'miracle'}, {'href': '/tag/miracles/page/1/', 'text': 'miracles'}]}
--------------------
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'by_text': 'Jane Austen', 'by_href': '/author/Jane-Austen', 'tags': [{'href': '/tag/aliteracy/page/1/', 'text': 'aliteracy'}, {'href': '/tag/books/page/1/', 'text': 'books'}, {'href': '/tag/classic/page/1/', 'text': 'classic'}, {'href': '/tag/humor/page/1/', 'text': 'humor'}]}
--------------------
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'by_text': 'Marilyn Monroe', 'by_href': '/author/Marilyn-Monroe', 'tags': [{'href': '/tag/be-yourself/page/1/', 'text': 'be-yourself'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}]}
--------------------
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/adulthood/page/1/', 'text': 'adulthood'}, {'href': '/tag/success/page/1/', 'text': 'success'}, {'href': '/tag/value/page/1/', 'text': 'value'}]}
--------------------
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'by_text': 'André Gide', 'by_href': '/author/Andre-Gide', 'tags': [{'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/love/page/1/', 'text': 'love'}]}
--------------------
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'by_text': 'Thomas A. Edison', 'by_href': '/author/Thomas-A-Edison', 'tags': [{'href': '/tag/edison/page/1/', 'text': 'edison'}, {'href': '/tag/failure/page/1/', 'text': 'failure'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/paraphrased/page/1/', 'text': 'paraphrased'}]}
--------------------
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'by_text': 'Eleanor Roosevelt', 'by_href': '/author/Eleanor-Roosevelt', 'tags': [{'href': '/tag/misattributed-eleanor-roosevelt/page/1/', 'text': 'misattributed-eleanor-roosevelt'}]}
--------------------
{'text': '“A day without sunshine is like, you know, night.”', 'by_text': 'Steve Martin', 'by_href': '/author/Steve-Martin', 'tags': [{'href': '/tag/humor/page/1/', 'text': 'humor'}, {'href': '/tag/obvious/page/1/', 'text': 'obvious'}, {'href': '/tag/simile/page/1/', 'text': 'simile'}]}
--------------------
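
Note that the by_href and tag href values in this output are relative paths. To follow them, they first have to be joined with the site's base URL. Below is a minimal sketch using the standard library; `absolutize` is a hypothetical helper, and inside a spider `response.urljoin(href)` does the same job relative to the response URL.

```python
from urllib.parse import urljoin

BASE_URL = "http://quotes.toscrape.com/"  # the spider's start URL


def absolutize(item, base_url=BASE_URL):
    """Return a copy of a scraped item with by_href and tag hrefs made absolute."""
    fixed = dict(item)
    fixed["by_href"] = urljoin(base_url, item["by_href"])
    fixed["tags"] = [
        {**tag, "href": urljoin(base_url, tag["href"])} for tag in item["tags"]
    ]
    return fixed


if __name__ == "__main__":
    sample = {
        "text": "“A day without sunshine is like, you know, night.”",
        "by_text": "Steve Martin",
        "by_href": "/author/Steve-Martin",
        "tags": [{"href": "/tag/humor/page/1/", "text": "humor"}],
    }
    print(absolutize(sample)["by_href"])
```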

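Conclusion

The example above crawls the demo site suggested by the official documentation. This chapter is only an introductory summary based on my own time using crawlers, and is meant as a reference. If you are a complete beginner, I suggest working through a systematic course on Bilibili and then writing up your own notes on crawling. I will keep posting more Scrapy material in later chapters.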
