Scrapy - My First Spider and My Blog

2018-07-21, 298 reads, by 小温侯

My first spider

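For my first Scrapy spider I use the first example from the official documentation: crawling http://quotes.toscrape.com. I could not find a Chinese translation of the Scrapy 1.5 documentation, so parts of what follows are my own translation of the official docs. (Ad: if you need something translated, feel free to contact me. I have translated three published English books, two of them on my own, LOL.) The steps are: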

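In CMD, change to the directory where you want to keep the code and run: scrapy startproject myspiders, where myspiders can be whatever project directory name you like.
Scrapy automatically creates a directory named myspiders and initializes some content inside it.
Go into the myspiders/myspiders/spiders directory (the spiders folder of the inner package), create a file named quotestoscrape.py, and add the following code: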

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

After saving, switch back to CMD and run scrapy crawl quotestoscrape from inside the project directory. Before showing the result, let me briefly explain the code:

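First, in my testing the start_requests(self) method is not required: a class attribute named start_urls, holding a list of URLs, works as well. Still, I prefer to follow a standard form. If you do define it, the documentation says it must return an iterable of Request objects (either a list of requests or a generator function). These requests tell the spider which address or addresses to start crawling from, and later requests are generated from them in turn.
Every request downloads some content from the server, and the parse() method handles that content. Its response argument holds the whole page, which you can then process further with other methods.
The yield keyword points to another Python feature: generators. It just occurred to me that I have never written about them, although I know what they are. I will write about them when I get the chance.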
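
Since yield came up, here is a minimal sketch of a generator in plain Python, independent of Scrapy. The function name make_requests and the tuples it yields are made up for illustration; they only mimic the shape of start_requests() above.

```python
def make_requests(urls):
    """Hypothetical stand-in for start_requests(): yields one item per URL."""
    for url in urls:
        # Execution pauses here and resumes when the caller asks for more.
        yield ("GET", url)


requests = make_requests([
    "http://quotes.toscrape.com/",
    "http://quotes.toscrape.com/page/2/",
])

# Calling the function ran none of its body; it only built a generator.
first = next(requests)   # runs the body up to the first yield
rest = list(requests)    # drains whatever is left

print(first)
print(rest)
```

The point is that values are produced one at a time, on demand, which is presumably why Scrapy can start downloading before every request has been created.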

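Running the command prints a large amount of log output, most of it easy to understand. I only quote the part we care about here: the first half is the scraped results and the second half is a summary of statistics: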

....
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider opened
2018-04-19 15:56:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-19 15:56:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-19 15:56:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 908603),
 'item_scraped_count': 10,
 'log_count/DEBUG': 13,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 400951)}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider closed (finished)

When I have time I will write a separate post that goes through this log in detail.
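
As a small taste of reading these statistics, the sketch below copies a few values from the dump above and computes how long the crawl took. Note that the log lines show 15:56 while the stats show 19:56; my assumption is that the stats are recorded in UTC while the log uses local time.

```python
import datetime

# Values copied from the stats dump above.
stats = {
    'start_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 400951),
    'finish_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 908603),
    'item_scraped_count': 10,
    'downloader/request_count': 2,
}

# Subtracting two datetimes gives a timedelta.
elapsed = stats['finish_time'] - stats['start_time']
seconds = elapsed.total_seconds()

print(f"crawl took {seconds:.3f}s for {stats['item_scraped_count']} items")
```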

error: No module named win32api

The final run may fail with an error saying win32api cannot be found. Installing the following module fixes it: pip install pypiwin32.

Processing the response further

If you are new to crawlers, the response.css(), quote.css(), quote.xpath() and extract_first() calls in the code above may look unfamiliar. These are the methods used to process the response further.

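This part needs some HTML/CSS knowledge: you need to know which expression extracts the content you want from the returned page. Because a web page's markup is a tree, a suitable expression can in principle reach anything we want. There are usually two ways to work out the expression: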

The first is the browser's inspect-element mode.
The second is the command-line shell that Scrapy provides.

CSS selectors

In the code above, response.css('div.quote') and quote.css('span.text::text') are both CSS selectors. Opening the element inspector on that page shows the following:

[Image: Python爬虫CSS选择器.jpg]

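As I see it, the process is roughly this: use the tag path shown at the bottom of the inspector to find an approximate position first. For example, quote = response.css('div.quote') lands on the blue box in the image. Then we narrow the selection further. I did not find this spelled out in the Scrapy documentation, but the rules are those of ordinary CSS selectors, and this is what I worked out: separate nested HTML tags with a space (descendant selection); if a tag carries a class, attach it with a dot, as in div.quote; and finally append ::text, a Scrapy extension that selects the element's text content instead of the element with its surrounding <...> tags.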

Throughout this process we can test with the Scrapy shell. In CMD, run: scrapy shell "http://quotes.toscrape.com/". This prints a pile of log output followed by the available objects and shortcuts:

D:\OneDrive\Documents\Python和数据挖掘\code\blogspider>scrapy shell "http://quotes.toscrape.com/"
.............omitted.............
2018-04-19 18:28:19 [scrapy.core.engine] INFO: Spider opened
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000029D0C61AC50>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/>
[s]   response   <200 http://quotes.toscrape.com/>
[s]   settings   <scrapy.settings.Settings object at 0x0000029D0ED439B0>
[s]   spider     <DefaultSpider 'default' at 0x29d0efecc18>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

The object we mostly use is response. With it we can experiment as follows:

# Locate the page title; extract() returns the matched data as strings
>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']

# Locate the author names; this is the most explicit form
>>> response.css("div.quote span small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# It can also be written more simply
>>> response.css("div span small::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Or split into chained calls
>>> response.css("div.quote").css("span").css("small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Only need the first item?
>>> response.css("div.quote").css("span").css("small.author::text")[0].extract()
'Albert Einstein'
>>> response.css("div.quote").css("span").css("small.author::text").extract_first()
'Albert Einstein'

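If you have written CSS for a website before, all of this is easy to follow because the underlying logic is the same, and a little experimenting in the shell is enough to get the hang of it. If you look closely, you will see that css() actually returns a list of selectors, which is convenient when writing code.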

XPath selectors

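The other approach is XPath selectors, such as quote.xpath('span/small/text()') in the code above. According to the documentation, XPath is the foundation of Scrapy selectors; in fact, even CSS selectors are converted to XPath under the hood. Where XPath is more powerful than CSS selectors is that it can also test the content of the selected nodes, for example selecting the link whose text is "Next Page". The official documentation points to three resources on XPath: using XPath with Scrapy Selectors, learn XPath through examples, and how to think in XPath.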
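
To get a feel for XPath without starting Scrapy, here is a minimal sketch using the standard library's xml.etree.ElementTree. This is a substitute, not Scrapy's selector: ElementTree supports only a subset of XPath (there is no text() step, so the text is read from the .text attribute), and the markup below is a made-up, simplified imitation of the quotes page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on one page of quotes.toscrape.com.
PAGE = """
<html><body>
  <div class="quote">
    <span class="text">Quote one</span>
    <span>by <small class="author">Author A</small></span>
  </div>
  <div class="quote">
    <span class="text">Quote two</span>
    <span>by <small class="author">Author B</small></span>
  </div>
  <ul class="pager">
    <li class="next"><a href="/page/2/">Next</a></li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE)

# Structural selection: every <div class="quote"> anywhere in the tree,
# then the relative path span/small inside each one.
authors = [
    small.text
    for quote in root.findall(".//div[@class='quote']")
    for small in quote.findall("span/small")
]

# Content-based selection: the link whose text is exactly "Next".
# ([.='text'] predicates need Python 3.7 or later.)
next_link = root.find(".//a[.='Next']")
next_href = next_link.get("href") if next_link is not None else None

print(authors)
print(next_href)
```

The relative path span/small mirrors quote.xpath('span/small/text()') from the spider, and the second query is the kind of content test that plain CSS selectors cannot express.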

Saving the data

This takes a single command. For example, to write the output of the spider above to a JSON file, I only need to run this in CMD:

scrapy crawl quotestoscrape -o data.json

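-o presumably stands for output; it resembles many Linux commands and is easy to understand. Other file formats work too. The official docs recommend a format called JSON Lines, although I do not yet know what that format is.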

All supported export formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
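
For anyone else who has not met JSON Lines: as far as I understand it, the format stores one JSON object per line instead of one big array, so a file can be appended to and read back line by line. A minimal sketch using only the standard json module, with made-up items:

```python
import json

# Hypothetical scraped items, shaped like the ones yielded by parse().
items = [
    {"text": "Quote one", "author": "Author A"},
    {"text": "Quote two", "author": "Author B"},
]

# 'json' style: the whole result is a single array.
as_json = json.dumps(items)

# 'jsonlines' style: one self-contained JSON object per line.
as_jsonlines = "\n".join(json.dumps(item) for item in items)

# Reading JSON Lines back needs no knowledge of the rest of the file.
parsed = [json.loads(line) for line in as_jsonlines.splitlines()]

print(as_json)
print(as_jsonlines)
```

With Scrapy, this should correspond to choosing the 'jsonlines' (or 'jl') format from the list above instead of 'json'.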

Crawling the next page

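A site like http://quotes.toscrape.com is split into several pages. We can parse the link behind the "Next" button to crawl the next page, then the one after that, and so on until there is no next page. The code is easy to follow, so here it is: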

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
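
The urljoin step matters because the href of the "Next" link is relative. As far as I know, response.urljoin behaves much like urllib.parse.urljoin with the response's own URL as the base. The sketch below simulates the follow-until-no-next-page loop without any network access; FAKE_SITE and crawl are made-up names, and the three-page layout is an assumption for illustration.

```python
from urllib.parse import urljoin

# Hypothetical three-page site: each URL maps to the relative href of its
# "next" link, or None on the last page.
FAKE_SITE = {
    "http://quotes.toscrape.com/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": None,
}


def crawl(start_url):
    """Yield each page URL, following "next" links until none remain."""
    url = start_url
    while url is not None:
        yield url
        next_href = FAKE_SITE[url]
        # Resolve the relative href against the page it was found on.
        url = urljoin(url, next_href) if next_href is not None else None


visited = list(crawl("http://quotes.toscrape.com/"))
print(visited)
```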

Crawling my own blog

Enough talk; time for something practical. I want to crawl the titles and publication dates of all the posts on my own blog. The code is:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'ethanshub'
    start_urls = [
        'https://journal.ethanshub.com/archive',
    ]

    def parse(self, response):
        yearlists = response.css('ul.listing')

        for i in range(len(yearlists)):
            lists = yearlists[i]

            for j in range(len(lists.css("li.listing_item"))//2):
                yield {
                    'date': lists.css("li.listing_item::text")[j*2].extract(),
                    'title': lists.css("li.listing_item a::text")[j].extract(),
                }

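The one thing to watch out for is not to crawl just a single year of posts: find the smallest structure that contains all the posts. After that it is simple logic. Also worth mentioning: my blog runs on Bitcron, the CSS files are rendered on the backend, and I wrote my CSS following its syntax rules. Still, during my analysis I found that lists.css("li.listing_item") picks up an extra blank field for every entry, so the number of date values extracted is always twice the number of title values. Fortunately this also guarantees that the number of date values is even, so a slight adjustment to the code is enough.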
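
A possible alternative to indexing with j*2 is to drop the blank entries before pairing. The sketch below is plain Python on made-up data; it assumes the extracted text nodes alternate between a date and a whitespace-only string, which is my reading of the behavior described above and not something verified against the live page.

```python
# Hypothetical text nodes, assuming each date is followed by a blank node.
date_nodes = ["[2017-12-16]\n", "\n", "[2017-12-15]\n", "\n"]
titles = ["Post B", "Post A"]  # made-up titles

# Keep only nodes that still contain something after stripping whitespace.
dates = [node.strip() for node in date_nodes if node.strip()]

# Pair each date with the title at the same position.
posts = [{"date": d, "title": t} for d, t in zip(dates, titles)]
print(posts)
```

As a side effect, stripping also removes the trailing newline from each date.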

After running scrapy crawl ethanshub -o data.json, the resulting data.json file looks like this:

[
{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"},
{"date": "[2017-12-15]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e00\uff09"},
{"date": "[2017-12-13]\n", "title": "\u7528Python\u5411Kindle\u63a8\u9001\u7535\u5b50\u4e66"},
{"date": "[2017-12-12]\n", "title": "GUI\u7f16\u7a0b\uff0cTkinter\u5e93\u548c\u5e03\u5c40"},
{"date": "[2017-12-12]\n", "title": "Python3\u7684\u6b63\u5219\u8868\u8fbe\u5f0f"},
{"date": "[2017-12-10]\n", "title": "Python\u901f\u89c8[7]"},
{"date": "[2017-12-09]\n", "title": "Python\u901f\u89c8[6]"},
....
{"date": "[2013-09-16]\n", "title": "How to split a string in C"},
{"date": "[2012-11-28]\n", "title": "Common Filters for Wireshark"}
]

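Everything works. Sequences such as \u722c are Unicode escapes for Chinese characters; this is only an encoding matter, so I will not go into it further.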
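
For the curious, a minimal sketch of what is going on, using only the standard json module on one line copied from data.json above. The escapes are a valid way of writing the same characters, and loading the JSON restores them. If I recall correctly, Scrapy also has a FEED_EXPORT_ENCODING setting that can be set to 'utf-8' in settings.py to write the readable form directly, but I have not tried it here.

```python
import json

# One line copied from data.json above (raw string keeps the backslashes).
line = r'{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"}'

# Parsing turns every \uXXXX escape back into the character it stands for.
item = json.loads(line)

# ensure_ascii=False writes the characters themselves instead of \uXXXX.
readable = json.dumps(item, ensure_ascii=False)

# The default escapes every non-ASCII character again.
escaped = json.dumps(item)

print(readable)
```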