Scrapy (1): Getting Started

2021-06-06  万事万物

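What is Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. With only a small amount of code, we can build a fast scraper.
Scrapy is built on Twisted (pronounced ['twistid']), an asynchronous networking framework, which speeds up downloads because requests do not block one another.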
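
To get a feel for why an asynchronous framework helps, here is a minimal sketch. It uses Python's standard asyncio rather than Twisted, and the downloads are simulated with asyncio.sleep, so the function names, URLs, and delays are illustrative assumptions and not Scrapy internals.

```python
import asyncio
import time


async def fake_download(url: str, delay: float) -> str:
    """Simulate a network request that takes `delay` seconds (hypothetical)."""
    await asyncio.sleep(delay)  # the event loop runs other tasks while we wait
    return f"downloaded {url}"


async def download_all(urls, delay: float = 0.2):
    """Start every simulated download at once and wait for all of them."""
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


def timed_run(urls, delay: float = 0.2):
    """Return (results, elapsed seconds) for the concurrent run."""
    start = time.perf_counter()
    results = asyncio.run(download_all(urls, delay))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    pages = [f"http://quotes.toscrape.com/page/{n}/" for n in range(1, 6)]
    results, elapsed = timed_run(pages)
    # The five simulated waits overlap instead of running back to back.
    print(len(results), f"{elapsed:.2f}s")
```

Because the waits overlap, the total time stays close to one delay rather than the sum of all five. Real downloads behave similarly as long as the time is spent waiting on the network.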

Official website
Chinese website

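Installation guide

Supported Python versions

Scrapy requires Python 3.6+, either the CPython implementation (default) or PyPy 7.2.0+ (see Alternate Implementations).
Scrapy is written in pure Python and depends on a few key Python packages, among others: lxml, parsel, w3lib, Twisted, cryptography, and pyOpenSSL.

Installing Scrapy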

pip install Scrapy
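
After installing, it can be useful to confirm that the interpreter and the package are what you expect. The following is a small sketch using only the standard library; the helper names are hypothetical, and importlib.metadata requires Python 3.8 or newer.

```python
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum=(3, 6)) -> bool:
    """Check the running interpreter against a minimum (major, minor) version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Scrapy version:", installed_version("Scrapy"))
```

On the command line, scrapy version reports the installed Scrapy version as well.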

Creating a project

scrapy startproject <project name>

$ scrapy startproject tutorial

Output

# Location of the Scrapy project: D:\project\python\tutorial
New Scrapy project 'tutorial', using template directory 'd:\tool\templates\project', created in:
    D:\project\python\tutorial
# Hint: you can create a spider with scrapy genspider example example.com
You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

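Directory structure

tutorial/
    scrapy.cfg            # project configuration file

    tutorial/             # the project's Python module; you import your code from here
        __init__.py

        items.py          # project item definitions

        middlewares.py    # project middlewares, including custom ones

        pipelines.py      # project item pipelines

        settings.py       # project settings

        spiders/          # spiders you create are stored in this directory
            __init__.py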

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

# Points to the module that holds the project settings (settings.py)
[settings]
default = tutorial.settings

# Deployment: Scrapy can deploy the project to a server or to the local machine.
[deploy]
#url = http://localhost:6800/
project = tutorial
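
scrapy.cfg is an INI-style file, so its contents can be inspected with Python's standard configparser. The sketch below parses the same text shown above; the helper name and the returned dictionary keys are assumptions made for illustration, not part of Scrapy.

```python
import configparser

SCRAPY_CFG = """\
[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
"""


def read_cfg(text: str) -> dict:
    """Parse scrapy.cfg-style text and return the values of interest."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "settings_module": parser.get("settings", "default"),
        "deploy_project": parser.get("deploy", "project"),
        # The url line is commented out, so the option is absent.
        "deploy_url": parser.get("deploy", "url", fallback=None),
    }


if __name__ == "__main__":
    print(read_cfg(SCRAPY_CFG))
```

Uncommenting the url line would make the deploy target visible under deploy_url.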

items.py
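
items.py is where the shape of the scraped data is declared. As a rough sketch only, the quote data scraped later in this article could be described with standard dataclasses, which Scrapy 2.2 and later accept as items. The class and field names below are hypothetical, chosen to match the dictionary keys used in the spider further down, and this is not the file that startproject generates.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TagItem:
    """One tag link (hypothetical name, not generated by Scrapy)."""
    href: Optional[str] = None
    text: Optional[str] = None


@dataclass
class QuoteItem:
    """One quote; the field names are illustrative assumptions."""
    text: Optional[str] = None
    by_text: Optional[str] = None
    by_href: Optional[str] = None
    tags: List[TagItem] = field(default_factory=list)


if __name__ == "__main__":
    item = QuoteItem(text="sample", tags=[TagItem("/tag/x/", "x")])
    print(asdict(item))
```

Declaring the fields up front catches misspelled keys early, which a plain dict would silently accept.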

Generating a spider

scrapy genspider <spider name> <allowed domain>

Change into the project directory

$ cd tutorial/

Generate the spider

$ scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
  tutorial.spiders.quotes

This creates a Python file named quotes.py in the spiders/ directory, with the following content:

import scrapy


class QuotesSpider(scrapy.Spider):
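    name = 'quotes'  # spider name
    allowed_domains = ['quotes.toscrape.com']  # domains the spider is allowed to crawl
    # URL of the first request, i.e. where the crawl starts.
    # Scrapy generates this value by default; it usually needs to be changed.
    start_urls = ['http://quotes.toscrape.com/']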

    def parse(self, response):
        pass
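
To make the idea behind allowed_domains concrete, here is a simplified illustration of a domain check written with the standard library. The function name is hypothetical, and Scrapy's own offsite filtering is implemented in its middleware and may differ in detail.

```python
from urllib.parse import urlparse


def in_allowed_domains(url: str, allowed_domains) -> bool:
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


if __name__ == "__main__":
    allowed = ["quotes.toscrape.com"]
    print(in_allowed_domains("http://quotes.toscrape.com/page/2/", allowed))
    print(in_allowed_domains("http://example.com/", allowed))
```

Requests to hosts outside the allowed list are the ones a spider is expected to skip.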

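Scraping the current page

This requires a working knowledge of XPath syntax.

Locating elements

First find the div that holds each quote. An XPath tool in Chrome can help with this.

Inspection shows that every quote is stored in a div with class='quote'.
XPath for these nodes: //div[@class='col-md-8']/div[@class='quote']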
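
The expression can be tried out without running a crawl. The sketch below applies it to a small, invented, well-formed snippet using the standard library's xml.etree.ElementTree. This is an assumption-laden stand-in: ElementTree supports only a subset of XPath (no text() or @href steps, and paths must be relative), whereas Scrapy's selectors support full XPath 1.0.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the page markup (invented sample data).
SAMPLE_HTML = """
<html><body>
  <div class="col-md-8">
    <div class="quote">
      <span class="text">First sample quote</span>
      <span>by <small class="author">Author A</small></span>
    </div>
    <div class="quote">
      <span class="text">Second sample quote</span>
      <span>by <small class="author">Author B</small></span>
    </div>
  </div>
  <div class="col-md-4"><div class="other">sidebar</div></div>
</body></html>
""".strip()


def find_quotes(markup: str):
    """Return (text, author) pairs for every div.quote inside div.col-md-8."""
    root = ET.fromstring(markup)
    # ElementTree needs a relative path, hence the leading "." before "//".
    divs = root.findall(".//div[@class='col-md-8']/div[@class='quote']")
    return [
        (div.find("./span[@class='text']").text,
         div.find(".//small[@class='author']").text)
        for div in divs
    ]


if __name__ == "__main__":
    for text, author in find_quotes(SAMPLE_HTML):
        print(author, ":", text)
```

The sidebar div is ignored because it does not sit under the col-md-8 container.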

The final code is shown below. The individual parameters are not explained here, as that is outside the scope of this chapter.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote sits in a div with class="quote"
        div_list = response.xpath("//div[@class='col-md-8']/div[@class='quote']")
        for div in div_list:
            item = {}
            # Quote text
            item["text"] = div.xpath("./span[@class='text']/text()").extract_first()
            # Author name shown after "by"
            item["by_text"] = div.xpath(".//small[@class='author']/text()").extract_first()
            # href of the <a> tag that follows "by"
            item["by_href"] = div.xpath("./span/a/@href").extract_first()
            # All tag links
            tags_list = div.xpath("./div[@class='tags']/a")
            tags_item_list = []
            for tag in tags_list:
                tags_item = {}
                tags_item["href"] = tag.xpath('./@href').extract_first()
                tags_item["text"] = tag.xpath('./text()').extract_first()
                tags_item_list.append(tags_item)
            # Attach the tag information to the item
            item["tags"] = tags_item_list
            print(item)
            # Print a row of dashes as a visual separator
            print('-' * 20)

Running the spider

$ scrapy crawl quotes

The scraped content is as follows:

$ scrapy crawl quotes 
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/change/page/1/', 'text': 'change'}, {'href': '/tag/deep-thoughts/page/1/', 'text': 'deep-thoughts'}, {'href': '/tag/thinking/page/1/', 'text': 'thinking'}, {'href': '/tag/world/page/1/', 'text': 'world'}]}
--------------------
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'by_text': 'J.K. Rowling', 'by_href': '/author/J-K-Rowling', 'tags': [{'href': '/tag/abilities/page/1/', 'text': 'abilities'}, {'href': '/tag/choices/page/1/', 'text': 'choices'}]}
--------------------
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/live/page/1/', 'text': 'live'}, {'href': '/tag/miracle/page/1/', 'text': 'miracle'}, {'href': '/tag/miracles/page/1/', 'text': 'miracles'}]}
--------------------
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'by_text': 'Jane Austen', 'by_href': '/author/Jane-Austen', 'tags': [{'href': '/tag/aliteracy/page/1/', 'text': 'aliteracy'}, {'href': '/tag/books/page/1/', 'text': 'books'}, {'href': '/tag/classic/page/1/', 'text': 'classic'}, {'href': '/tag/humor/page/1/', 'text': 'humor'}]}
--------------------
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'by_text': 'Marilyn Monroe', 'by_href': '/author/Marilyn-Monroe', 'tags': [{'href': '/tag/be-yourself/page/1/', 'text': 'be-yourself'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}]}
--------------------
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/adulthood/page/1/', 'text': 'adulthood'}, {'href': '/tag/success/page/1/', 'text': 'success'}, {'href': '/tag/value/page/1/', 'text': 'value'}]}
--------------------
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'by_text': 'André Gide', 'by_href': '/author/Andre-Gide', 'tags': [{'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/love/page/1/', 'text': 'love'}]}
--------------------
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'by_text': 'Thomas A. Edison', 'by_href': '/author/Thomas-A-Edison', 'tags': [{'href': '/tag/edison/page/1/', 'text': 'edison'}, {'href': '/tag/failure/page/1/', 'text': 'failure'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/paraphrased/page/1/', 'text': 'paraphrased'}]}
--------------------
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'by_text': 'Eleanor Roosevelt', 'by_href': '/author/Eleanor-Roosevelt', 'tags': [{'href': '/tag/misattributed-eleanor-roosevelt/page/1/', 'text': 'misattributed-eleanor-roosevelt'}]}
--------------------
{'text': '“A day without sunshine is like, you know, night.”', 'by_text': 'Steve Martin', 'by_href': '/author/Steve-Martin', 'tags': [{'href': '/tag/humor/page/1/', 'text': 'humor'}, {'href': '/tag/obvious/page/1/', 'text': 'obvious'}, {'href': '/tag/simile/page/1/', 'text': 'simile'}]}
--------------------
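
The by_href and tag href values in this output are relative paths. If absolute URLs are needed, they can be joined with the site's base URL. The following is a minimal sketch using the standard library's urljoin; the absolutize helper is hypothetical. Inside a spider, response.urljoin serves the same purpose.

```python
from urllib.parse import urljoin

BASE_URL = "http://quotes.toscrape.com/"


def absolutize(item: dict, base_url: str = BASE_URL) -> dict:
    """Return a copy of `item` whose href values are absolute URLs."""
    result = dict(item)
    result["by_href"] = urljoin(base_url, item["by_href"])
    result["tags"] = [
        {**tag, "href": urljoin(base_url, tag["href"])} for tag in item["tags"]
    ]
    return result


if __name__ == "__main__":
    sample = {
        "text": "“It is better to be hated for what you are than to be loved for what you are not.”",
        "by_text": "André Gide",
        "by_href": "/author/Andre-Gide",
        "tags": [
            {"href": "/tag/life/page/1/", "text": "life"},
            {"href": "/tag/love/page/1/", "text": "love"},
        ],
    }
    print(absolutize(sample))
```

The original item is left unchanged, so the relative paths remain available if they are needed later.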

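Conclusion

The example above scrapes the demo site used in the official tutorial. This chapter is only an introductory summary drawn from my own time working with crawlers and is for reference only. If you are a complete beginner, I suggest studying the topic systematically on Bilibili and then writing up your own notes on crawlers. I will cover other Scrapy topics in later posts.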
