Python 3 Crawler Notes 13: A First Look at Scrapy

2019-04-07  X_xxieRiemann

1. Installing Scrapy

On Windows, inside an Anaconda environment, type conda install scrapy in a command window, confirm with y, and wait for the installation to finish. Afterwards, run scrapy version; if it prints a version number, Scrapy is ready to use.


2. Scrapy Commands

Running scrapy -h lists the available commands (the command line is covered in more detail later).

Scrapy 1.3.3 - project: quotetutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

3. Creating a New Project

The target is http://quotes.toscrape.com/, a site built for practicing Scrapy.
Goal: extract each quote's text, author, and tags.

(Screenshot: layout of the quotes.toscrape.com page)

1. In the command window, cd into the folder that will hold the project.
2. Run scrapy startproject <project name>; here the project is created with scrapy startproject quotetutorial.
The command prints two hints, cd quotetutorial and scrapy genspider example example.com (that is, cd <your project folder> followed by scrapy genspider <spider name> <target domain>); follow them to continue.

C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\m1812\quotetutorial

You can start your first spider with:
    cd quotetutorial
    scrapy genspider example example.com

3. cd quotetutorial moves into the newly created folder.

C:\Users\m1812>cd quotetutorial

4. scrapy genspider quotes quotes.toscrape.com generates a spider file named quotes.py; the spider is called quotes and its target domain is quotes.toscrape.com.

C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
  quotetutorial.spiders.quotes
Open the project in PyCharm and take a look at the project layout.
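For reference, a freshly generated project looks roughly like this (the exact set of files can vary slightly between Scrapy versions):

quotetutorial/
    scrapy.cfg                # project deployment/config file
    quotetutorial/
        __init__.py
        items.py              # item definitions (edited in section 5)
        pipelines.py          # item pipelines (edited in section 5)
        settings.py           # project settings
        spiders/
            __init__.py
            quotes.py         # the spider generated above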

4. A First Look at Scrapy

1. Modify the parse function in quotes.py so that it prints the page's HTML. For this page, printing response.text directly raises an encoding error, so the raw bytes in response.body are printed instead. parse is the callback Scrapy invokes once the page has been downloaded; response holds the result returned for the request.
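A minimal sketch of the edited spider (reconstructing the missing screenshot; everything except the print line is what scrapy genspider generated):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # print(response.text) raised an encoding error here (see the note
        # above), so dump the raw bytes instead
        print(response.body)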


2. In the command window, run the spider with scrapy crawl quotes. Besides the page's HTML, Scrapy prints plenty of other log output.
C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bo
otstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-
box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n
                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/l
ogin">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\
n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0The world as we have
 created it is a process of our thinking. It cannot be changed without changing our thinking.\xa1\xb1</span>\n        <span>by <small class="author"
itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n
      Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > \n            \n
<a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n
           \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/"
>world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span cl
ass="text" itemprop="text">\xa1\xb0It is our choices, Harry, that show what we truly are, far more than our abilities.\xa1\xb1</span>\n        <span>
by <small class="author" itemprop="author">J.K. Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div cla
ss="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > \n            \n
<a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n
     \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop
="text">\xa1\xb0There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xa1
\xb1</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\
n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,l
ive,miracle,miracles" /    > \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n
  <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n
     <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n
        \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemp
rop="text">\xa1\xb0The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xa1\xb1</span>\n        <sp
an>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div c
lass="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > \n
\n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a
>\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1
/">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span
class="text" itemprop="text">\xa1\xb0Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring
.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</
a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspi
rational" /    > \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag"
 href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://s
chema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0Try not to become a man of success. Rather become a man of value.\xa1\xb
1</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n
      </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value
" /    > \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/
success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\
n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0It is better to be
 hated for what you are than to be loved for what you are not.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Andr\xa8\xa6
Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta cla
ss="keywords" itemprop="keywords" content="life,love" /    > \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n
  \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemty
pe="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0I have not failed. I&#39;ve just found 10,000 ways that won&
#39;t work.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n        <a href="/author/Thomas-A-Edis
on">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edis
on,failure,inspirational,paraphrased" /    > \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n
    <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational<
/a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div c
lass="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0A woman is like a tea bag; you
never know how strong it is until it&#39;s in hot water.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</
small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta cl
ass="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > \n            \n            <a class="tag" href="/tag/misattribut
ed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemt
ype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0A day without sunshine is like, you know, night.\xa1\xb1</s
pan>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </sp
an>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > \n
          \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" href="/tag/obvious/page/1/">obvi
ous</a>\n            \n            <a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n
   <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&r
arr;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <
h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a
>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirationa
l/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" hre
f="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" h
ref="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22p
x" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size:
 14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="fo
nt-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="
tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a
class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n            </span>\n            \n        \n    </div>\n</div>\n\n    </div>\n
    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goo
dreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class=\'sh-red\'>\x817\xc5
8</span> by <a href="https://scrapinghub.com">Scrapinghub</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)

5. Starting the Crawl

First, look at the HTML structure around the target information:
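Each quote sits in its own div carrying the classes used by the selectors below; this excerpt is reformatted from the response body dumped earlier:

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>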


1. Edit items.py and declare the three fields to extract:
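Since the spider below imports QuotetutorialItem and fills in text, author and tags, items.py ends up roughly like this (the class skeleton is generated by scrapy startproject):

import scrapy


class QuotetutorialItem(scrapy.Item):
    # one Field per piece of information to extract
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()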


2. Edit quotes.py, add the extraction rules, and map them onto the fields defined in items.py in step 1.
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

This uses Scrapy's built-in CSS selectors. In the command window, the shell command lets you test things interactively: scrapy shell "quotes.toscrape.com" (note the double quotes). The session below shows what the CSS selectors above actually pick out and how extract_first() differs from extract().

C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"

2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002B3C7410748>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com>
[s]   response   <200 http://quotes.toscrape.com>
[s]   settings   <scrapy.settings.Settings object at 0x000002B3C74D97B8>
[s]   spider     <DefaultSpider 'default' at 0x2b3c90e5ba8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: 

In [1]: response
Out[1]: <200 http://quotes.toscrape.com>

In [2]: quotes = response.css('.quote')

In [3]: quotes
Out[3]: 
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>]

In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" i
temscope itemtype="h'>

In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" i
temprop="text">“The '>]

In [6]:  quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world a
s we have created it is a pr'>]

In [7]:  quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

In [8]: quotes[0].css('.text').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing ou
r thinking.”</span>']

In [9]:  quotes[0].css('.text::text').extract_first()
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [10]:  quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']

In [11]: exit()

Now run the spider again with scrapy crawl quotes; the terminal output now also contains every scraped item.
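Each item appears in the log as a "Scraped from" debug line, roughly like this (illustrative, trimmed from a full run):

2019-04-05 ... [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': 'Albert Einstein',
 'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}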


3. Single-page scraping works; next comes pagination. The URL moves from http://quotes.toscrape.com/ to /page/2/, /page/3/, and so on, and the relative address of the next page can be read from the href attribute of the Next button.
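In the page source dumped earlier, the pager markup looks like this:

<nav>
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>
</nav>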


Update the code in quotes.py:
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()   # behaves like a dict
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)  # build the absolute url from the relative href
        yield scrapy.Request(url=url, callback=self.parse)  # schedule the next page with the same parse callback

4. Saving the Data
On the command line, scrapy crawl quotes -o quotes.json saves the results as JSON.
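The exported file is a JSON array of items; an entry looks roughly like this (reconstructed from the data scraped above; by default non-ASCII characters such as the curly quotes are escaped):

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
...
]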


Or save as JSON Lines: scrapy crawl quotes -o quotes.jl

Or as CSV: scrapy crawl quotes -o quotes.csv
Or as XML: scrapy crawl quotes -o quotes.xml
Or as pickle: scrapy crawl quotes -o quotes.pickle
Or as marshal: scrapy crawl quotes -o quotes.marshal
Scrapy infers the output format from the file extension.

5. Dropping unwanted items or saving to a database
This is done by editing the code in pipelines.py.

The pipeline below caps the quote text at 50 characters, appending ... to anything longer;
a second pipeline handles saving the items to MongoDB.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.exceptions import DropItem

# Cap the text at 50 characters; anything longer gets an ellipsis
class QuotetutorialPipeline(object):

    def __init__(self):
        self.limit = 50

    # Return the item (possibly modified), or raise DropItem to discard it
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # use the item class name as the MongoDB collection name
        name = item.__class__.__name__
        self.db[name].insert(dict(item))
        return item
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()

You also need to adjust the related settings in settings.py.

Uncomment the ITEM_PIPELINES section in settings.py and register both pipelines; the number attached to each pipeline is its priority, and lower numbers run first.
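The relevant part of settings.py might look like the sketch below. The priority numbers (300/400) and the connection values are placeholders; MONGO_URI and MONGO_DB are the setting names read by MongoPipeline.from_crawler above:

ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'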

Re-run scrapy crawl quotes from the command line.

The saved data can now also be seen in MongoDB.
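To spot-check from a Python shell (assuming MongoDB is running locally with the settings sketched above; MongoPipeline names the collection after the item class, so it is QuotetutorialItem):

import pymongo

client = pymongo.MongoClient('localhost')
db = client['quotetutorial']
print(db['QuotetutorialItem'].find_one())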


Based on the Scrapy tutorial by blogger Cui Qingcai (崔庆才).
