Scrapy, the Essential Python Distributed Crawler Framework: Building a Search Engine
Common HTTP status codes
![](https://img.haomeiwen.com/i9538421/42cff52b5fd84a63.png)
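The screenshot above lists the common status codes; the ones encountered most often when crawling are, for reference:
- 200: OK, the request succeeded
- 301 / 302: permanent / temporary redirect
- 403: Forbidden, access refused (often an anti-crawling measure)
- 404: Not Found
- 500 / 502 / 503: server-side errors (internal error / bad gateway / service unavailable)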
Simulating Zhihu login with Requests
- First enter a username and password to test the login flow and observe the requests
![](https://img.haomeiwen.com/i9538421/dbd4d21ba3902e6e.png)
![](https://img.haomeiwen.com/i9538421/70571103b016f2bd.png)
![](https://img.haomeiwen.com/i9538421/34ff4a1449e05187.png)
![](https://img.haomeiwen.com/i9538421/908990df135467b0.png)
- Three main requests are sent: one sign_in request and two captcha?lang=en requests
- sign_in: the login request; Request URL: https://www.zhihu.com/api/v3/oauth/sign_in
- captcha?lang=en: fetches the captcha; Request URL: https://www.zhihu.com/api/v3/oauth/captcha?lang=en
- All parameters in the Request Payload:
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="client_id"
c3cef7c66a1843f8b3a9e6a1e3160e20
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="grant_type"
password
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="timestamp"
1528552667566
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="source"
com.zhihu.web
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="signature"
a66dbcd96828a60de1120e4419cd1eb8fa79145f
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="username"
123@163.com
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="password"
123456
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="captcha"
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="lang"
en
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="ref_source"
homepage
------WebKitFormBoundarypgxZVHAByOSf1fnG
Content-Disposition: form-data; name="utm_source"
------WebKitFormBoundarypgxZVHAByOSf1fnG--
Analysis shows that the following parameters of the sign_in request have to be supplied:
- timestamp: the current timestamp (in milliseconds)
- signature: the request signature
- username: the account name
- password: the account password
- captcha: the captcha text
All the other parameters are fixed values; the two that take some work are signature and captcha.
- How to obtain the signature
Searching the JS files referenced in the page source turns up the following snippet, which is what generates the signature:
function (e, t, n) {
"use strict";
function r(e, t) {
var n = Date.now(), r = new a.a("SHA-1", "TEXT");
return r.setHMACKey("d1b964811afb40118a12068ff74a12f4", "TEXT"), r.update(e), r.update(i.a), r.update("com.zhihu.web"), r.update(String(n)), c({
clientId: i.a,
grantType: e,
timestamp: n,
source: "com.zhihu.web",
signature: r.getHMAC("HEX")
}, t)
}
Analysis shows it is just an HMAC-SHA1 digest (keyed with the fixed string above) over the grant type, client id, source and timestamp, which is easy to reproduce in Python:
def get_signature(login_time):
"""
生成 signature
"""
h = hmac.new(key='d1b964811afb40118a12068ff74a12f4'.encode('utf-8'), digestmod=sha1)
client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
grant_type = 'password'
timestamp = login_time
source = 'com.zhihu.web'
h.update((grant_type + client_id + source + timestamp).encode('utf-8'))
return h.hexdigest()
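A quick usage check, assuming the function above (and its hmac/sha1 imports) is already in scope; the timestamp must be the same millisecond string that is later sent as the timestamp form field:
import time

login_time = str(int(time.time() * 1000))  # millisecond timestamp, reused as the 'timestamp' field
signature = get_signature(login_time)
print(signature)  # a 40-character hex SHA-1 HMAC, comparable to the captured 'a66dbcd9...' value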
- How to obtain the captcha
![](https://img.haomeiwen.com/i9538421/bd2ae8d628b07e5c.png)
![](https://img.haomeiwen.com/i9538421/d45cc55a37435313.png)
- Notice that captcha?lang=en is requested twice. Why twice? Sometimes no captcha is needed and the username and password alone complete the login, which is the ideal case; at other times a captcha does appear. The first captcha?lang=en request returns {"show_captcha":true} when a captcha is required; if it returns {"show_captcha":false}, no image captcha is needed and the second request, the one that actually fetches the captcha, never happens.
- One more note: in the Request URL https://www.zhihu.com/api/v3/oauth/captcha?lang=en, the trailing lang=en asks for an English captcha; with lang=cn you get the upside-down-Chinese-character captcha instead, so always send the request that fetches the English captcha.
Fetching the image captcha:
def get_captcha(headers):
"""
获取登录验证码
Args:
headers: 请求头信息
Returns:
captcha: 返回手动输入的验证码
"""
response = session.get('https://www.zhihu.com/api/v3/oauth/captcha?lang=en', headers=headers)
result = re.findall('"show_captcha":(\w+)', response.text)
if result[0] == 'false':
return ''
else:
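        # Note: in the request analysis of "Scrapy login, take 2" further down, the image
        # data (img_base64) is returned by a PUT request to captcha?lang=en; if this GET
        # does not return img_base64, try session.put(...) here instead.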
response = session.get('https://www.zhihu.com/api/v3/oauth/captcha?lang=en', headers=headers)
show_captcha = json.loads(response.text)['img_base64']
with open('captcha.jpg', 'wb') as f:
f.write(base64.b64decode(show_captcha.encode('utf-8')))
image = Image.open('captcha.jpg')
image.show()
image.close()
captcha = input('请输入验证码:')
return captcha
- With these two hurdles cleared, the rest is straightforward.
The complete code is as follows:
import re
import time
import http.cookiejar
import hmac
from hashlib import sha1
import base64
import json
import requests
from PIL import Image
session = requests.Session()
session.cookies = http.cookiejar.LWPCookieJar(filename='cookies.txt')
try:
session.cookies.load(ignore_discard=True)
except Exception as e:
print('cookie 未能加载')
headers = {
'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
'host': 'www.zhihu.com',
'referer': 'https://www.zhihu.com/',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
def is_login():
"""
判断是否已经登录
"""
response = session.get('https://www.zhihu.com', headers=headers, allow_redirects=False)
if response.status_code != 200:
return False
else:
return True
def get_xsrf_d_c0():
"""
获取 _xsrf 和 d_c0 的值
"""
response = session.get('https://www.zhihu.com/signup', headers=headers)
return response.cookies.get('_xsrf'), response.cookies.get('d_c0')
def get_signature(login_time):
"""
生成 signature
"""
h = hmac.new(key='d1b964811afb40118a12068ff74a12f4'.encode('utf-8'), digestmod=sha1)
client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
grant_type = 'password'
timestamp = login_time
source = 'com.zhihu.web'
h.update((grant_type + client_id + source + timestamp).encode('utf-8'))
return h.hexdigest()
def get_captcha(headers):
"""
获取登录验证码
Args:
headers: 请求头信息
Returns:
captcha: 返回手动输入的验证码
"""
response = session.get('https://www.zhihu.com/api/v3/oauth/captcha?lang=en', headers=headers)
result = re.findall('"show_captcha":(\w+)', response.text)
if result[0] == 'false':
return ''
else:
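        # Note: in the request analysis of "Scrapy login, take 2" further down, the image
        # data (img_base64) is returned by a PUT request to captcha?lang=en; if this GET
        # does not return img_base64, try session.put(...) here instead.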
response = session.get('https://www.zhihu.com/api/v3/oauth/captcha?lang=en', headers=headers)
show_captcha = json.loads(response.text)['img_base64']
with open('captcha.jpg', 'wb') as f:
f.write(base64.b64decode(show_captcha.encode('utf-8')))
image = Image.open('captcha.jpg')
image.show()
image.close()
captcha = input('请输入验证码:')
return captcha
def zhihu_login(user, password):
"""
知乎登录
Args:
user: 用户名
password: 密码
Returns:
response: 响应信息
"""
x_xsrftoken, x_udid = get_xsrf_d_c0()
headers.update({
# 'x-udid': x_udid,
'x-xsrftoken': x_xsrftoken,
})
login_time = str(int(time.time() * 1000))
post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
post_data = {
'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
'grant_type': 'password',
'timestamp': login_time,
'source': 'com.zhihu.web',
'signature': get_signature(login_time),
'username': user,
'password': password,
'captcha': get_captcha(headers),
'lang': 'en',
'ref_source': 'homepage',
'utm_source': '',
}
response = session.post(post_url, data=post_data, headers=headers)
session.cookies.save()
return response
def get_index():
"""
请求知乎首页
Returns:
返回首页 HTML 源码
"""
response = session.get('https://www.zhihu.com', headers=headers)
return response.text
if __name__ == '__main__':
USER = 'username' # 用户名
PASSWORD = 'password' # 密码
if not is_login():
print('正在登录...')
zhihu_login(USER, PASSWORD)
html = get_index()
print(html)
else:
print('已登录...')
Replace USER and PASSWORD with your own username and password and you can log in.
Simulating Zhihu login with Scrapy, take 1
- Create the Zhihu spider:
scrapy genspider zhihu www.zhihu.com
![](https://img.haomeiwen.com/i9538421/ddaad6712b3e5900.png)
![](https://img.haomeiwen.com/i9538421/0179e48cef59f370.png)
- Overall approach: the login flow is much the same as the requests-based login above, but in Scrapy the first request normally comes from the start_urls list and is issued automatically when the spider starts, which happens through the start_requests method. So the first step is to override start_requests so that the initial request goes to the Zhihu login page https://www.zhihu.com/signup instead of start_urls; once login succeeds, the URLs in start_urls are requested for the real crawl and the responses are parsed in the parse method.
- Complete code of the Zhihu login spider, zhihu.py:
# ArticleSpider/spiders/zhihu.py
# -*- coding: utf-8 -*-
import time
import re
import hmac
from hashlib import sha1
import base64
import json
from PIL import Image
import scrapy
from ArticleSpider.settings import ZHIHU_USER, ZHIHU_PASSWORD
class ZhihuSpider(scrapy.Spider):
name = 'zhihu_login'
allowed_domains = ['www.zhihu.com']
start_urls = ['https://www.zhihu.com/topic']
headers = {
'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
'HOST': 'www.zhihu.com',
'Referer': 'https://www.zhihu.com/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
def start_requests(self):
"""
重写爬虫入口 start_requests,以完成模拟登录知乎
"""
# 先请求登录页面,从登录页面获取登录所需数据,再进行登录
return [
scrapy.Request('https://www.zhihu.com/signup', headers=self.headers, callback=self.get_captcha1)
]
def get_signature(self, time_str):
"""
生成 signature
"""
h = hmac.new(key='d1b964811afb40118a12068ff74a12f4'.encode('utf-8'), digestmod=sha1)
client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
grant_type = 'password'
timestamp = time_str
source = 'com.zhihu.web'
h.update((grant_type + client_id + source + timestamp).encode('utf-8'))
return h.hexdigest()
def get_captcha1(self, response):
"""
第一次请求获取登录验证码
"""
pattern = re.compile('_xsrf=(.*?);')
cookies = response.headers.getlist('Set-Cookie')
xsrf = re.findall(pattern, str(cookies))
xsrf = xsrf[0] if xsrf else ''
self.headers.update({'x-xsrftoken': xsrf})
time_str = str(int(time.time() * 1000))
signature = self.get_signature(time_str)
return [
scrapy.Request(
url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
headers=self.headers,
callback=self.get_captcha2,
meta={'time_str': time_str, 'signature': signature},
dont_filter=True
)
]
def get_captcha2(self, response):
"""
第二次请求获取登录验证码
如果判断不需要验证码,则直接进行登录
"""
result = re.findall('"show_captcha":(\w+)', response.text)
formdata = {
'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
'grant_type': 'password',
'timestamp': response.meta['time_str'],
'source': 'com.zhihu.web',
'signature': response.meta['signature'],
'username': ZHIHU_USER,
'password': ZHIHU_PASSWORD,
'captcha': '',
'lang': 'en',
'ref_source': 'homepage',
'utm_source': '',
}
if result[0] == 'false':
# 不需要输入验证码,直接进行登录
return [
scrapy.FormRequest( # FormRequest 可以实现 POST 请求完成表单提交
url='https://www.zhihu.com/api/v3/oauth/sign_in',
formdata=formdata,
headers=self.headers,
callback=self.check_login,
dont_filter=True
)
]
else:
# 需要输入验证码,再次请求获取验证码
# 请求回调 parse_captcha 方法进行验证码解析
return [
scrapy.Request(
url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
headers=self.headers,
callback=self.parse_captcha,
meta={'formdata': formdata},
dont_filter=True
)
]
def parse_captcha(self, response):
"""
解析验证码,并将验证码图片存储到本地
需要手动输入验证码后继续执行代码完成登录
"""
show_captcha = json.loads(response.text)['img_base64']
with open('captcha.jpg', 'wb') as f:
f.write(base64.b64decode(show_captcha.encode('utf-8')))
image = Image.open('captcha.jpg')
image.show()
image.close()
captcha = input('请输入验证码:')
formdata = response.meta['formdata']
formdata['captcha'] = captcha
return [
scrapy.FormRequest( # FormRequest 可以实现 POST 请求完成表单提交
url='https://www.zhihu.com/api/v3/oauth/sign_in',
formdata=formdata,
headers=self.headers,
callback=self.check_login,
dont_filter=True
)
]
def check_login(self, response):
"""
验证服务器返回数据,判断是否登录成功
如果登录成功,则进行进一步的数据爬取
"""
if response.status == 200 or response.status == 201:
# 判断登录成功后,开始进行数据爬取,相当于延后了
# start_urls 中的 URL 爬取,先完成上面的登录过程
# 准备工作做完,在这里再开始进行真正的数据爬取
for url in self.start_urls:
# 不写 callback 参数默认回调时调用 parse 方法
yield scrapy.Request(url, headers=self.headers, dont_filter=True)
def parse(self, response):
"""
数据解析
"""
print(response.text)
pass
- The username and password also need to be configured in settings.py:
# ArticleSpider/settings.py
# Zhihu credentials
ZHIHU_USER = 'username'
ZHIHU_PASSWORD = 'password'
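- Run the spider using the name defined on the spider class:
scrapy crawl zhihu_login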
Simulating Zhihu login with Scrapy, take 2
After the simulated login succeeded, the spider still could not access the Zhihu home page https://www.zhihu.com/. At first this looked like a login problem, so the login requests were re-analyzed and the login logic rewritten, but the home page was still inaccessible, while other pages (such as the topic page https://www.zhihu.com/topic) could be reached after login. Since the rewrite was done anyway, the code is posted here, and the logic is clearer this time. One difference observed in practice: the earlier approach never required a captcha to log in, whereas this version requires one every time.
- A complete login request sequence when no captcha is required
![](https://img.haomeiwen.com/i9538421/d0261b7b365dc124.png)
![](https://img.haomeiwen.com/i9538421/27f692855f0b1337.png)
![](https://img.haomeiwen.com/i9538421/59a09ac0a040233d.png)
![](https://img.haomeiwen.com/i9538421/cad9443bc15903bf.png)
![](https://img.haomeiwen.com/i9538421/29bf50408a985c8c.png)
- A complete login request sequence when a captcha is required
![](https://img.haomeiwen.com/i9538421/cf0316f8d9d008c8.png)
![](https://img.haomeiwen.com/i9538421/c301fbd6bce11376.png)
![](https://img.haomeiwen.com/i9538421/8699c4d3a7ab6316.png)
![](https://img.haomeiwen.com/i9538421/081b0e89aaf05a01.png)
![](https://img.haomeiwen.com/i9538421/ba141aac911b5fd0.png)
![](https://img.haomeiwen.com/i9538421/0493a81b3b3cb7cf.png)
![](https://img.haomeiwen.com/i9538421/3177fb8e906fd109.png)
![](https://img.haomeiwen.com/i9538421/5203c0c643d36c89.png)
![](https://img.haomeiwen.com/i9538421/e0b56913a76baef6.png)
![](https://img.haomeiwen.com/i9538421/6828abdf84c82d29.png)
- The complete Zhihu login code:
# ArticleSpider/spiders/zhihu_login.py
# -*- coding: utf-8 -*-
import time
import hmac
from hashlib import sha1
import base64
import json
from PIL import Image
import scrapy
from ArticleSpider.settings import ZHIHU_USER, ZHIHU_PASSWORD
class ZhihuSpider(scrapy.Spider):
name = 'zhihu_login'
allowed_domains = ['www.zhihu.com']
start_urls = ['https://www.zhihu.com/topic']
headers = {
'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
'Host': 'www.zhihu.com',
'Referer': 'https://www.zhihu.com/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
time_str = str(int(time.time() * 1000))
formdata = {
'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
'grant_type': 'password',
'timestamp': time_str,
'source': 'com.zhihu.web',
'signature': '',
'username': ZHIHU_USER,
'password': ZHIHU_PASSWORD,
'captcha': '',
'lang': 'en',
'ref_source': 'homepage',
'utm_source': '',
}
def start_requests(self):
"""
重写爬虫入口 start_requests,以完成模拟登录知乎
"""
# 先请求登录页面,从登录页面获取登录所需数据,再进行登录
return [
scrapy.Request(url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
headers=self.headers,
callback=self.is_show_captcha)
]
def get_signature(self, time_str):
"""
生成 signature
"""
h = hmac.new(key='d1b964811afb40118a12068ff74a12f4'.encode('utf-8'), digestmod=sha1)
client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
grant_type = 'password'
timestamp = time_str
source = 'com.zhihu.web'
h.update((grant_type + client_id + source + timestamp).encode('utf-8'))
return h.hexdigest()
def is_show_captcha(self, response):
"""
判断是否需要验证码登录
"""
show_captcha = json.loads(response.text)['show_captcha']
if show_captcha:
print('需要获取登录验证码')
return [
scrapy.Request(url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
headers=self.headers,
callback=self.parse_captcha,
method='PUT',
dont_filter=True)
]
else:
print('无需获取登录验证码,直接登录')
formdata = self.formdata
formdata['signature'] = self.get_signature(self.time_str)
return [
scrapy.FormRequest( # FormRequest 可以实现 POST 请求完成表单提交
url='https://www.zhihu.com/api/v3/oauth/sign_in',
formdata=formdata,
headers=self.headers,
callback=self.check_login,
dont_filter=True)
]
def parse_captcha(self, response):
"""
解析验证码,并将验证码图片存储到本地
需要手动输入验证码后继续执行代码完成登录
"""
        formdata = {}
        try:
            show_captcha = json.loads(response.text)['img_base64']
            with open('captcha.jpg', 'wb') as f:
                f.write(base64.b64decode(show_captcha.encode('utf-8')))
            image = Image.open('captcha.jpg')
            image.show()
            # image.close()
            captcha = input('请输入验证码:')
            formdata = {'input_text': captcha}
        except ValueError:
            print('img_base64 获取失败')
        except Exception as e:
            print(f'程序出错:{e}')
return [
scrapy.FormRequest( # FormRequest 可以实现 POST 请求完成表单提交
url='https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
formdata=formdata,
headers=self.headers,
callback=self.captcha_login,
dont_filter=True)
]
def captcha_login(self, response):
"""
验证码登录
"""
        captcha_success = False
        try:
            captcha_success = json.loads(response.text)['success']
        except ValueError:
            print('验证码验证失败')
        if captcha_success:
print('验证码验证成功,正在登录...')
formdata = self.formdata
formdata['signature'] = self.get_signature(self.time_str)
print(formdata)
return [
scrapy.FormRequest( # FormRequest 可以实现 POST 请求完成表单提交
url='https://www.zhihu.com/api/v3/oauth/sign_in',
formdata=formdata,
headers=self.headers,
callback=self.check_login,
dont_filter=True)
]
def check_login(self, response):
"""
验证服务器返回数据,判断是否登录成功
如果登录成功,则进行进一步的数据爬取
"""
print(response)
if response.status == 200 or response.status == 201:
# 判断登录成功后,开始进行数据爬取,相当于延后了
# start_urls 中的 URL 爬取,先完成上面的登录过程
# 准备工作做完,在这里再开始进行真正的数据爬取
for url in self.start_urls:
# 不写 callback 参数默认回调时调用 parse 方法
yield scrapy.Request(url, headers=self.headers, dont_filter=True)
def parse(self, response):
"""
数据解析
"""
print(response.text)
Crawling Zhihu questions and answers
- Analyzing the pages
Open any question from the Zhihu front page
![](https://img.haomeiwen.com/i9538421/2e3bba574998ecd8.png)
A link such as https://www.zhihu.com/question/280139065/answer/415927966 shows only a single answer under the question
![](https://img.haomeiwen.com/i9538421/fa51424e4605dec7.png)
Clicking to view all answers changes the URL to https://www.zhihu.com/question/280139065
![](https://img.haomeiwen.com/i9538421/6a7db49692160cf2.png)
To sum up, comparing the two links:
Single answer: https://www.zhihu.com/question/280139065/answer/415927966
All answers: https://www.zhihu.com/question/280139065
So the crawler can target URLs of the form https://www.zhihu.com/question/280139065 (a small ID-extraction sketch follows below)
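A rough sketch of normalizing either URL form to the question page and pulling out the numeric question ID, the same idea the spider below implements with re (the URLs are the example ones above):
import re

urls = [
    'https://www.zhihu.com/question/280139065/answer/415927966',
    'https://www.zhihu.com/question/280139065',
]
for url in urls:
    match = re.match(r'https://www\.zhihu\.com/question/(\d+)', url)
    if match:
        question_id = match.group(1)                      # '280139065'
        question_url = f'https://www.zhihu.com/question/{question_id}'
        print(question_id, question_url)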
- Fields to extract for a question
Information to extract from the question detail page
![](https://img.haomeiwen.com/i9538421/bcd125a699755939.png)
- Fields to extract for an answer
Analyzing the all-answers page reveals an Ajax request; copy the request URL and open it directly
![](https://img.haomeiwen.com/i9538421/1e57ea7512119638.png)
The JSON it returns is exactly the answer data we want to collect (a minimal paging sketch follows the screenshots below)
![](https://img.haomeiwen.com/i9538421/e8de474bddd05e34.png)
![](https://img.haomeiwen.com/i9538421/8e1c102339a43471.png)
![](https://img.haomeiwen.com/i9538421/0177ced2fd9dc4d8.png)
![](https://img.haomeiwen.com/i9538421/5c17533f31c8d531.png)
![](https://img.haomeiwen.com/i9538421/e8b2212ccc215c84.png)
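As a rough illustration of how this answers API pages through results, here is a requests-based sketch. It reuses a trimmed version of the start_answer_url template and the authorization header from the spider below (whether the API still accepts this exact header and include list is an assumption); the paging.is_end / paging.next fields are the ones shown in the JSON above:
import requests

headers = {
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}
# Trimmed version of the spider's start_answer_url template
api_url = ('https://www.zhihu.com/api/v4/questions/{question}/answers'
           '?include=data%5B%2A%5D.content%2Cvoteup_count%2Ccomment_count'
           '&limit={limit}&offset={offset}&sort_by=default')
url = api_url.format(question=280139065, limit=20, offset=0)
while True:
    data_json = requests.get(url, headers=headers).json()
    for answer in data_json['data']:
        print(answer['id'], answer['voteup_count'], answer['comment_count'])
    if data_json['paging']['is_end']:  # no more pages
        break
    url = data_json['paging']['next']  # the API returns the next page URL directly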
- Designing the database tables (a pymysql table-creation sketch follows the screenshots below)
Zhihu question table: zhihu_question
![](https://img.haomeiwen.com/i9538421/8ec0496e49b4050b.png)
Zhihu answer table: zhihu_answer
![](https://img.haomeiwen.com/i9538421/8625b489f1799c33.png)
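The screenshots show the actual schema; as a rough sketch, the tables can be created with pymysql along these lines. The column names are taken from the INSERT statements in items.py further down, while the column types and connection settings are assumptions (zhihu_id is made the primary key so that ON DUPLICATE KEY UPDATE works):
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='root',
                       database='Articles', charset='utf8')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS zhihu_question (
        zhihu_id BIGINT NOT NULL PRIMARY KEY,
        topics VARCHAR(255),
        url VARCHAR(300),
        title VARCHAR(200),
        content LONGTEXT,
        answer_num INT,
        comments_num INT,
        watch_user_num INT,
        click_num INT,
        create_time DATETIME,
        update_time DATETIME,
        crawl_time DATETIME,
        crawl_update_time DATETIME
    ) CHARSET=utf8
''')
cursor.execute('''
    CREATE TABLE IF NOT EXISTS zhihu_answer (
        zhihu_id BIGINT NOT NULL PRIMARY KEY,
        url VARCHAR(300),
        question_id BIGINT,
        author_id VARCHAR(100),
        author_name VARCHAR(100),
        content LONGTEXT,
        praise_num INT,
        comments_num INT,
        create_time DATETIME,
        update_time DATETIME,
        crawl_time DATETIME,
        crawl_updatetime DATETIME
    ) CHARSET=utf8
''')
conn.commit()
cursor.close()
conn.close()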
- Defining the Items:
# ArticleSpider/items.py
class ZhihuQuestionItem(scrapy.Item):
"""
知乎问题 Item
"""
# 问题 id
zhihu_id = scrapy.Field()
# 所属话题
topics = scrapy.Field()
# 问题 URL
url = scrapy.Field()
# 问题标题
title = scrapy.Field(
input_processor=MapCompose(lambda x: x.strip())
)
# 问题内容
content = scrapy.Field()
# 问题回答数量
answer_num = scrapy.Field()
# 问题评论数
comments_num = scrapy.Field()
# 问题关注数
watch_user_num = scrapy.Field()
# 问题点击数(浏览数)
click_num = scrapy.Field()
# 问题创建时间
create_time = scrapy.Field()
# 问题更新时间
update_time = scrapy.Field()
# 问题爬取时间
crawl_time = scrapy.Field()
class ZhihuAnswerItem(scrapy.Item):
"""
知乎问题回答 Item
"""
# 回答 id
zhihu_id = scrapy.Field()
# 回答 URL
url = scrapy.Field()
# 所属问题 id
question_id = scrapy.Field()
# 用户 id
author_id = scrapy.Field()
# 用户名
author_name = scrapy.Field()
# 回答内容
content = scrapy.Field()
# 赞同数
praise_num = scrapy.Field()
# 评论数
comments_num = scrapy.Field()
# 回答创建时间
create_time = scrapy.Field()
# 回答修改时间
update_time = scrapy.Field()
# 回答爬取时间
crawl_time = scrapy.Field()
- Writing the spider:
# ArticleSpider/spiders/zhihu_question.py
# -*- coding: utf-8 -*-
import re
import json
import random
import time
from urllib.parse import urljoin
import scrapy
from scrapy.loader import ItemLoader
from ArticleSpider.items import ZhihuQuestionItem, ZhihuAnswerItem
class ZhihuQuestionSpider(scrapy.Spider):
name = 'zhihu_question'
allowed_domains = ['www.zhihu.com']
# start_urls = ['https://www.zhihu.com/explore'] # 重写了爬虫入口 start_requests,所以 start_urls 也就没必要了
# question 的第一页 answer 请求 URL
start_answer_url = 'https://www.zhihu.com/api/v4/questions/{question}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit={limit}&offset={offset}'
headers = {
'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
'Host': 'www.zhihu.com',
'Referer': 'https://www.zhihu.com/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}
def start_requests(self):
"""
重写爬虫入口 start_requests
"""
url = 'https://www.zhihu.com/node/ExploreAnswerListV2?params='
# offset = 0
for offset in range(4):
time.sleep(random.random())
paramse = {"offset": offset * 5, "type": "day"}
full_url = f'{url}{json.dumps(paramse)}'
yield scrapy.Request(url=full_url, headers=self.headers, callback=self.parse_question)
def parse_question(self, response):
"""
处理 question
这里只是提取了 question 部分字段信息,因为 question 页面是 JS 动态加载的,无法直接爬取
有些字段须在 parse_answer 方法中提取
"""
pattern = '(https://www.zhihu.com/question/\d+)/answer/\d+'
question_links = response.xpath('//a[@class="question_link"]/@href').extract()
question_links = [urljoin(response.url, url) for url in question_links]
question_links = [re.findall(pattern, url)[0] for url in question_links]
question_titles = response.xpath('//a[@class="question_link"]/text()').extract()
for i, question_link in enumerate(question_links):
question_id = re.findall('\d+', question_link)[0]
item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
item_loader.add_value('zhihu_id', question_id)
item_loader.add_value('url', question_link)
item_loader.add_value('title', question_titles[i])
question_item = item_loader.load_item()
yield scrapy.Request(
url=self.start_answer_url.format(question=question_id, limit=20, offset=0),
headers=self.headers,
callback=self.parse_answer
)
yield question_item
def parse_answer(self, response):
"""
处理 question 的 answer
"""
data_json = json.loads(response.text)
totals = data_json.get('paging').get('totals')
is_end = data_json.get('paging').get('is_end')
next_url = data_json.get('paging').get('next')
# 添加 question 字段信息并 yield 到 pipeline
# 通过 UPDATE 方式将字段信息更新到 zhihu_question 表中
# question 字段信息并不完整,还有几个字段无法提取,因为需要在
# js 加载的动态页面中提取,比较麻烦,这里只提取了静态页面中的数据
question_item = ZhihuQuestionItem()
question_item['zhihu_id'] = re.findall('questions/(\d+)/answers?', next_url)
question_item['answer_num'] = totals
question_item['create_time'] = data_json.get('data')[0].get('question').get('created')
question_item['update_time'] = data_json.get('data')[0].get('question').get('updated_time')
yield question_item
# 提取 answer 信息
for data in data_json.get('data'):
answer_item = ZhihuAnswerItem()
answer_item['zhihu_id'] = data.get('id')
answer_item['url'] = data.get('url')
answer_item['question_id'] = data.get('question').get('id')
answer_item['author_id'] = data.get('author').get('id')
answer_item['author_name'] = data.get('author').get('name')
answer_item['content'] = data.get('content')
answer_item['praise_num'] = data.get('voteup_count')
answer_item['comments_num'] = data.get('comment_count')
answer_item['create_time'] = data.get('created_time')
answer_item['update_time'] = data.get('updated_time')
# 将 answer 字段信息 yield 到 pipeline
yield answer_item
if not is_end:
yield scrapy.Request(url=next_url, headers=self.headers, callback=self.parse_answer)
Saving the crawled data to MySQL
There are several ways to store the crawled items in the database through pipelines:
- ① Handle every item in a single pipeline. The previous chapter crawled jobbole articles and wrote a pipeline that saves them to MySQL; now that the Zhihu question and answer items are added, that pipeline can be reworked to handle all three item types and save each to its own MySQL table.
- ② One pipeline per item, i.e. each pipeline handles exactly one item type.
- ③ One pipeline per site: one for the jobbole site, one for Zhihu.
Weighing the pros and cons of the three options:
- Options ② and ③ are much the same. If the project contains many spiders, writing a pipeline per site or per item, each opening its own MySQL connection, is unreasonable and wastes resources.
- Option ① may look less tidy because all the logic sits in one pipeline, but it is the most reasonable choice: it uses the fewest resources and is the easiest to manage.
There is one more scenario: a project with many spiders where some data must go to MySQL and some to MongoDB. When items need to be stored in different databases, one pipeline per database is the natural split.
- Approach 1
In the pipeline, branch on item.__class__.__name__ to determine which Item class the item belongs to (item.__class__.__name__ gives the item's class name) and handle it accordingly; a sketch of this dispatch follows the screenshot below.
The drawback is that this hard-codes the item class names: if a class is ever renamed, the pipeline has to be changed as well, which is easy to miss.
![](https://img.haomeiwen.com/i9538421/f72224861726bd07.png)
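A minimal sketch of that class-name dispatch (the class is hypothetical and the branch bodies are placeholders for the corresponding INSERT logic):
class MySQLDispatchPipeline(object):
    """Illustrative only: route each item to its own INSERT by class name."""
    def process_item(self, item, spider):
        if item.__class__.__name__ == 'JobBoleArticleLoadItem':
            pass  # build and execute the jobbole_article INSERT here
        elif item.__class__.__name__ == 'ZhihuQuestionItem':
            pass  # build and execute the zhihu_question INSERT here
        elif item.__class__.__name__ == 'ZhihuAnswerItem':
            pass  # build and execute the zhihu_answer INSERT here
        return item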
- Approach 2
Another technique, borrowed from Django's Model design
- In Django, database access goes through Models: each table maps to a class, and each class defines its own methods, which hides the concrete SQL statements.
- Likewise, the parts of the pipeline that vary per spider can be moved into the Item definitions: every Item gets a uniform get_insert_sql method that holds its INSERT statement. The pipeline then just calls that method on whatever item it receives, needs no per-spider handling logic, and the structure stays much clearer.
The pipeline file:
# ArticleSpider/pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import pymysql
from twisted.enterprise import adbapi
from scrapy.pipelines.images import ImagesPipeline
class ArticlespiderPipeline(object):
def process_item(self, item, spider):
return item
class ArticleImagePipeline(ImagesPipeline):
def item_completed(self, results, item, info):
if 'front_img_url' in item:
for ok, v in results:
image_file_path = v['path']
item['front_img_path'] = image_file_path
return item
class JsonWithEncodingPipeline(object):
def __init__(self):
self.file = open('article.json', 'a', encoding='utf-8')
def process_item(self, item, spider):
self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
return item
    def close_spider(self, spider):
        self.file.close()
class MySQLPipeline(object):
def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', user='pythonic', password='pythonic', database='Articles', charset='utf8')
self.cursor = self.conn.cursor()
def process_item(self, item, spider):
insert_sql = '''
insert into jobbole_article(title,create_date,url,url_object_id,front_image_url,
front_image_path,praise_nums,comment_nums,fav_nums,tags,content)
values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
'''
self.cursor.execute(insert_sql, (
item['title'], item['create_date'], item['url'], item['url_object_id'],
            item['front_img_url'], item['front_img_path'], item['praise_nums'],
item['comment_nums'], item['fav_nums'], item['tags'], item['content']
))
self.conn.commit()
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
class MySQLTwistedPipeline(object):
def __init__(self, dbpool):
self.dbpool = dbpool
# Scrapy 提供了一个类方法可以直接获取 settings.py 文件中的配置信息
@classmethod
def from_settings(cls, settings):
db_params = dict(
host=settings['MYSQL_HOST'],
user=settings['MYSQL_USER'],
password=settings['MYSQL_PASSWORD'],
database=settings['MYSQL_DATABASE'],
port=settings['MYSQL_PORT'],
charset='utf8',
cursorclass=pymysql.cursors.DictCursor,
)
# Twister 只是提供了一个异步的容器,并没有提供数据库连接,所以连接数据库还是要用 pymysql 进行连接
# adbapi 可以将 MySQL 的操作变为异步
# ConnectionPool 第一个参数是我们连接数据库所使用的 库名,这里是连接 MySQL 用的 pymysql
# 第二个参数就是 pymysql 连接操作数据库所需的参数,这里将参数组装成字典 db_params,当作关键字参数传递进去
dbpool = adbapi.ConnectionPool('pymysql', **db_params)
return cls(dbpool)
    def process_item(self, item, spider):
        # Use Twisted to make the MySQL insert asynchronous.
        # runInteraction returns a query (Deferred) object used for error handling.
        query = self.dbpool.runInteraction(self.do_insert, item)
        # Attach the error handler to the query object.
        # addErrback takes the handler first, followed by the extra arguments it needs;
        # handle_error expects item and spider, so they are passed along here.
        query.addErrback(self.handle_error, item, spider)
        # Return the item so that any later pipelines still receive it.
        return item
    def do_insert(self, cursor, item):
        # Perform the actual insert.
        # No manual commit is needed here; Twisted's adbapi commits automatically.
        # Each item's get_insert_sql method supplies the INSERT statement and its
        # parameters, so this code stays the same for every item type.
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
    def handle_error(self, failure, item, spider):
        # Error handler for exceptions raised during the asynchronous insert.
        # The failure argument is supplied automatically when an exception occurs.
        # item and spider are not strictly required, but having them makes it much
        # easier to pin down which item and spider triggered the error while debugging.
        print(f'出现异常:{failure}')
        print(item.__class__.__name__)
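For MySQLTwistedPipeline.from_settings to work, settings.py must define the MySQL values it reads, and the pipeline has to be enabled in ITEM_PIPELINES. The names below match the code above; the values are placeholders, and SQL_DATETIME_FORMAT / SQL_DATE_FORMAT are the format strings that items.py imports (the exact formats are an assumption):
# ArticleSpider/settings.py (excerpt)
MYSQL_HOST = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_USER = 'user'
MYSQL_PASSWORD = 'password'
MYSQL_DATABASE = 'Articles'

SQL_DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
SQL_DATE_FORMAT = '%Y-%m-%d'

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MySQLTwistedPipeline': 1,
}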
Modify items.py so that the MySQL INSERT statement for each item is wrapped in a method on the Item class itself:
# ArticleSpider/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import re
from datetime import datetime
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from ArticleSpider.settings import SQL_DATETIME_FORMAT, SQL_DATE_FORMAT
# class ArticlespiderItem(scrapy.Item):
# # define the fields for your item here like:
# # name = scrapy.Field()
# pass
# class JobBoleArticleItem(scrapy.Item):
# title = scrapy.Field() # 文章标题
# create_date = scrapy.Field() # 文章发布日期
# url = scrapy.Field() # 文章 URL
# url_object_id = scrapy.Field() # 文章 URL 的 MD5 值
# front_img_url = scrapy.Field() # 文章封面图(文章列表页显示的封面图,通常是文章详情第一张图片)
# front_img_path = scrapy.Field() # 文章封面图存放路径
# praise_nums = scrapy.Field() # 点赞数
# comment_nums = scrapy.Field() # 评论数
# fav_nums = scrapy.Field() # 收藏数
# tags = scrapy.Field() # 文章标签
# content = scrapy.Field() # 文章内容
# def add_jobbole(value):
# return f'{value} --jobbole'
def convert_date(value):
"""
将字符串转换成日期
"""
try:
return datetime.strptime(value, '%Y/%m/%d')
except Exception as e:
return datetime.now().date()
def take_nums(value):
"""
提取数字
"""
re_find = re.findall('.*?(\d+).*', value)
if re_find:
return int(re_find[0])
else:
return 0
def remove_tags_comment(value):
"""
移除 tags 中的评论
"""
return '' if '评论' in value else value
class ArticleItemLoader(ItemLoader):
"""
自定义 ItemLoader,继承自 Scrapy 的 ItemLoader
来改变 ItemLoader 的默认 output_processor 方法
"""
default_output_processor = TakeFirst()
class JobBoleArticleLoadItem(scrapy.Item):
title = scrapy.Field(
# input_processor=MapCompose(add_jobbole, lambda x: x+'[article]'),
# output_processor=TakeFirst()
)
create_date = scrapy.Field(
input_processor=MapCompose(convert_date)
)
url = scrapy.Field()
url_object_id = scrapy.Field()
front_img_url = scrapy.Field(
# 因为 使用 scrapy 自带的下载图片管道图片字段必须是可迭代的
# 所以这里不再使用默认的 output_processor=TakeFirst()
# 而是通过 lambda 表达式返回一个列表,因为本来传进来
# 就是一个列表,所以什么也不需要做,只需要直接将值重新返回
output_processor=MapCompose(lambda x: x)
)
front_img_path = scrapy.Field()
praise_nums = scrapy.Field(
input_processor=MapCompose(take_nums)
)
comment_nums = scrapy.Field(
input_processor=MapCompose(take_nums)
)
fav_nums = scrapy.Field(
input_processor=MapCompose(take_nums)
)
tags = scrapy.Field(
input_processor=MapCompose(remove_tags_comment),
output_processor=Join(',')
)
content = scrapy.Field()
def get_insert_sql(self):
# ON DUPLICATE KEY UPDATE 为 MySQL 独有的 '插入更新' 语法
# 将一条数据插入 MySQL 时,如果这条数据已经存在于数据库中
# 不会报错,而是执行更新操作,更新的字段可以自己指定
insert_sql = '''
INSERT INTO jobbole_article(title,create_date,url,url_object_id,front_image_url,
front_image_path,praise_nums,comment_nums,fav_nums,tags,content)
VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
ON DUPLICATE KEY UPDATE praise_nums=VALUES(praise_nums),comment_nums=VALUES(comment_nums),
            fav_nums=VALUES(fav_nums),content=VALUES(content)
'''
params = (
self['title'], self['create_date'], self['url'], self['url_object_id'],
            self['front_img_url'], self['front_img_path'], self['praise_nums'],
self['comment_nums'], self['fav_nums'], self['tags'], self['content']
)
return insert_sql, params
class ZhihuQuestionItem(scrapy.Item):
"""
知乎问题 Item
"""
# 问题 id
zhihu_id = scrapy.Field()
# 所属话题
topics = scrapy.Field()
# 问题 URL
url = scrapy.Field()
# 问题标题
title = scrapy.Field(
input_processor=MapCompose(lambda x: x.strip())
)
# 问题内容
content = scrapy.Field()
# 问题回答数量
answer_num = scrapy.Field()
# 问题评论数
comments_num = scrapy.Field()
# 问题关注数
watch_user_num = scrapy.Field()
# 问题点击数(浏览数)
click_num = scrapy.Field()
# 问题创建时间
create_time = scrapy.Field()
# 问题更新时间
update_time = scrapy.Field()
# 问题爬取时间
crawl_time = scrapy.Field()
def get_insert_sql(self):
insert_sql = '''
INSERT INTO zhihu_question(zhihu_id,topics,url,title,content,create_time,update_time,
answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
ON DUPLICATE KEY UPDATE create_time=VALUES(create_time),update_time=VALUES(update_time),
answer_num=VALUES(answer_num),crawl_update_time=VALUES(crawl_update_time)
'''
# 因为 zhihu_question 爬虫中使用的是 scrapy 的 ItemLoader
# 而没有使用自定义的 ArticleItemLoader,并且没有在上面定义
# 字段里面使用 output_processor = TakeFirst(),所以现在每个
# 字段的类型都是列表,所以其实也是可以在这个地方插入数据库
# 之前将数据字段类型全部修改成对应于 MySQL 表中的字段类型
zhihu_id = int(self['zhihu_id'][0])
topics = None
url = self['url'][0] if self.get('url') else ''
title = self['title'][0] if self.get('title') else ''
content = None
create_time = datetime.fromtimestamp(self['create_time']).strftime(SQL_DATETIME_FORMAT) if self.get('create_time') else None
update_time = datetime.fromtimestamp(self['update_time']).strftime(SQL_DATETIME_FORMAT) if self.get('update_time') else None
answer_num = self['answer_num'] if self.get('answer_num') else 0
comments_num = 0
watch_user_num = 0
click_num = 0
# 因为 执行 insert_sql 插入数据时,需要传递字符串类型,所以将时间转换成字符串
crawl_time = datetime.now().strftime(SQL_DATETIME_FORMAT)
crawl_update_time = datetime.now().strftime(SQL_DATETIME_FORMAT)
params = (
zhihu_id, topics, url, title, content, create_time, update_time, answer_num,
comments_num, watch_user_num, click_num, crawl_time, crawl_update_time
)
return insert_sql, params
class ZhihuAnswerItem(scrapy.Item):
"""
知乎问题回答 Item
"""
# 回答 id
zhihu_id = scrapy.Field()
# 回答 URL
url = scrapy.Field()
# 所属问题 id
question_id = scrapy.Field()
# 用户 id
author_id = scrapy.Field()
# 用户名
author_name = scrapy.Field()
# 回答内容
content = scrapy.Field()
# 赞同数
praise_num = scrapy.Field()
# 评论数
comments_num = scrapy.Field()
# 回答创建时间
create_time = scrapy.Field()
# 回答修改时间
update_time = scrapy.Field()
# 回答爬取时间
crawl_time = scrapy.Field()
def get_insert_sql(self):
insert_sql = '''
INSERT INTO zhihu_answer(zhihu_id,url,question_id,author_id,author_name,
content,praise_num,comments_num,create_time,update_time,crawl_time,crawl_updatetime)
VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
ON DUPLICATE KEY UPDATE content=VALUES(content),praise_num=VALUES(praise_num),
comments_num=VALUES(comments_num),update_time=VALUES(update_time),
crawl_updatetime=VALUES(crawl_updatetime)
'''
# create_time 和 update_time 从 spider 传递过来是 int 类型
# 先通过 datetime.fromtimestamp() 将 int 转换成 datetime 类型
# 再通过 strftime() 将 datetime 类型转成 str 类型
create_time = datetime.fromtimestamp(self['create_time']).strftime(SQL_DATETIME_FORMAT)
update_time = datetime.fromtimestamp(self['update_time']).strftime(SQL_DATETIME_FORMAT)
crawl_time = datetime.now().strftime(SQL_DATETIME_FORMAT)
crawl_updatetime = datetime.now().strftime(SQL_DATETIME_FORMAT)
params = (
self['zhihu_id'], self['url'], self['question_id'], self['author_id'],
self['author_name'], self['content'], self['praise_num'], self['comments_num'],
create_time, update_time, crawl_time, crawl_updatetime
)
return insert_sql, params