Scrapy's Shell Debugging Tool (Part 3)

2018-10-25, by 艾胖胖胖

1. Background and Environment

Environment overview

Operating system: Win10
Python version: Python 3.6
Scrapy version: Scrapy 1.5.1

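2. Overview

About the Scrapy shell
The Scrapy shell is Scrapy's interactive console. It lets you try out and debug your scraping code without starting a spider. It is meant for testing data-extraction code, but because it is also a regular Python console, you can test any Python code in it.
What it is used for
The shell is used to test XPath or CSS expressions: to see how they work and what data they extract from the pages you are scraping. While you write a spider, it lets you test your expressions interactively, so you do not have to rerun the spider after every change.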

3. Usage

Launching the Scrapy shell

scrapy shell https://www.jianshu.com/u/aa614d07fab1

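Available Scrapy objects

crawler    - the current Crawler object.
spider     - the spider that handles this URL, or a default Spider object if no spider is found for the current URL.
request    - the Request object of the last fetched page. You can modify it with replace(), or fetch a new request with the fetch shortcut without leaving the shell.
response   - the Response object of the last fetched page.
settings   - the current Scrapy settings.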

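Available shortcuts

shelp()   - print the list of available objects and shortcuts.
fetch(url[, redirect=True])  - fetch a new response from the given URL and update all related objects accordingly. HTTP 3xx redirects are followed by default; pass redirect=False if you do not want them followed.
fetch(request)  - fetch a new response from the given request and update all related objects accordingly.
view(response)   - open the given response in your local browser. It adds a <base> tag to the response body so that external links (such as images and CSS) display correctly. Note that this creates a temporary file on your machine, and that file is not removed automatically.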
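
To make the behaviour of `view(response)` more concrete, here is a rough standard-library sketch of the two steps described above: adding a `<base>` tag and writing the body to a temporary file that stays on disk. This is not Scrapy's own code; `save_for_browser` and the sample page are hypothetical names used only for illustration, and the sketch stops short of launching a browser.

```python
import os
import tempfile


def save_for_browser(body: bytes, url: str) -> str:
    """Write `body` to a temp .html file with a <base> tag pointing at `url`."""
    base_tag = '<head><base href="{}">'.format(url).encode('utf-8')
    # Only the first <head> is patched; a page without <head> is left as-is.
    patched = body.replace(b'<head>', base_tag, 1)

    # mkstemp creates the file and never deletes it for us.
    fd, path = tempfile.mkstemp(suffix='.html')
    with os.fdopen(fd, 'wb') as handle:
        handle.write(patched)
    return path


page = b'<html><head><title>demo</title></head><body><img src="a.png"></body></html>'
saved_path = save_for_browser(page, 'https://www.jianshu.com/')
print(saved_path)

# Cleaning up is the caller's job, e.g. os.remove(saved_path) when done.
```

With the `<base>` tag in place, a relative link such as `a.png` resolves against the original site rather than the local temp directory, which is why the saved copy can still show images and styles.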

Debugging our selectors

D:\BTW> scrapy shell https://www.jianshu.com/u/aa614d07fab1
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000022DA5B602B0>
[s]   item       {}
[s]   request    <GET https://www.jianshu.com/u/aa614d07fab1>
[s]   response   <200 https://www.jianshu.com/u/aa614d07fab1>
[s]   settings   <scrapy.settings.Settings object at 0x0000022DA6E7F828>
[s]   spider     <DefaultSpider 'default' at 0x22da74074e0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: request.url
Out[1]: 'https://www.jianshu.com/u/aa614d07fab1'

In [3]: request.headers
Out[3]: 
{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
 b'Accept-Encoding': b'gzip, deflate',
 b'Accept-Language': b'zh-CN,zh;q=0.9',
 b'Cache-Control': b'no-cache',
 b'Connection': b'keep-alive',
 b'Pragma': b'no-cache',
 b'Upgrade-Insecure-Requests': b'1',
 b'User-Agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

In [4]: response.xpath('//h1/text()')
Out[4]: []

In [5]: response.selector.re(r'(https://www\.23us\.so/list/\w+\.html)')
Out[5]: []

# Switch to another site
In [6]: fetch('https://www.23us.so/list/1_1.html')

In [7]: request.url
Out[7]: 'https://www.23us.so/list/1_1.html'

In [8]: response.url
Out[8]: 'https://www.23us.so/list/1_1.html'

In [9]: response.status
Out[9]: 200
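
The `response.selector.re(...)` call in the session applies a regular expression to the page and returns every match of the capture group. The standard-library sketch below applies a similar pattern to made-up text to show why the first page yields `[]` while a listing page would not; the sample strings are assumptions, not real page content. The sketch uses a raw string with escaped dots so that `\w` reaches the regex engine unchanged and `.` matches only a literal dot.

```python
import re

LIST_URL = re.compile(r'(https://www\.23us\.so/list/\w+\.html)')

# Hypothetical snippets standing in for two different pages.
profile_page = '<a href="https://www.jianshu.com/u/aa614d07fab1">profile</a>'
listing_page = (
    '<a href="https://www.23us.so/list/1_1.html">1</a>'
    '<a href="https://www.23us.so/list/2_1.html">2</a>'
)

profile_matches = LIST_URL.findall(profile_page)
listing_matches = LIST_URL.findall(listing_page)

print(profile_matches)
print(listing_matches)
```

An empty result therefore does not mean the expression is wrong; it may simply mean the current `response` is not the page the pattern was written for, which is what `fetch()` lets you change without leaving the shell.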
