[CP_15] Python Crawler Framework 02: Crawling Consultation Posts with Scrapy
Contents
I. Sending POST Requests with Scrapy
1. Sending a POST request in Scrapy
2. Adding request headers
II. Case Study: Crawling Questions from a Consultation Platform
1. Create the project
2. Generate the main spider script inside the project
3. Define the target fields in items
4. Write the main spider script: rpPlatform.py
5. Post-process items and save data in pipelines
6. Configure the project settings file settings.py
7. Run the main spider script rpPlatform.py
8. Use main.py as a shortcut to launch the crawl command
I. Sending POST Requests with Scrapy
1. Sending a POST request in Scrapy
Create the project: scrapy startproject youdaoTranslate
Generate the main spider script inside the project: scrapy genspider ydTranslate "fanyi.youdao.com"
ydTranslate.py
# -*- coding: utf-8 -*-
import scrapy


class YdtranslateSpider(scrapy.Spider):
    name = 'ydTranslate'
    allowed_domains = ['fanyi.youdao.com']

    def start_requests(self):
        url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
        # Enqueue a POST request carrying the form data
        yield scrapy.FormRequest(
            url=url,
            formdata={
                "i": "测试",
                "from": "AUTO",
                "to": "AUTO",
                "smartresult": "dict",
                "client": "fanyideskweb",
                "salt": "15556827153720",
                "sign": "067431debbc1e4c7666f6f4b1e204747",
                "ts": "1555682715372",
                "bv": "e2a78ed30c66e16a857c5b6486a1d326",
                "doctype": "json",
                "version": "2.1",
                "keyfrom": "fanyi.web",
                "action": "FY_BY_CLICKBUTTION"
            },
            callback=self.parse  # callback function
        )

    def parse(self, response):
        print("------------------")
        print(response.body)
Run the spider: scrapy crawl ydTranslate
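Because the form data requests doctype=json, the interface answers with a JSON string. As a minimal sketch, the parse callback could decode that body instead of printing raw bytes; note that "translateResult" is an assumed key name, so verify it against the actual response:
import json  # add at the top of ydTranslate.py

def parse(self, response):  # replaces the parse method inside YdtranslateSpider
    # Decode the JSON body (doctype=json was requested in the form data)
    data = json.loads(response.body)
    # "translateResult" is an assumed key name; check the real response to confirm
    print(data.get("translateResult"))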
2. Adding request headers
ydTranslate.py
# -*- coding: utf-8 -*-
import scrapy
import random


class YdtranslateSpider(scrapy.Spider):
    name = 'ydTranslate'
    allowed_domains = ['fanyi.youdao.com']

    def start_requests(self):
        url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
        # Pick one User-Agent at random from several candidates
        agent1 = "Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0"
        agent2 = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
        agent3 = "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
        agent4 = "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
        agent5 = "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36"
        agent6 = "Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044207 Mobile Safari/537.36 MicroMessenger/6.7.3.1340(0x26070332) NetType/4G Language/zh_CN Process/tools"
        ls = [agent1, agent2, agent3, agent4, agent5, agent6]
        agent = random.choice(ls)
        # Build the request headers; the value must not carry a "User-Agent," prefix
        header = {"User-Agent": agent}
        # Enqueue a POST request carrying the headers and form data
        yield scrapy.FormRequest(
            url=url,
            headers=header,
            formdata={
                "i": "测试",
                "from": "AUTO",
                "to": "AUTO",
                "smartresult": "dict",
                "client": "fanyideskweb",
                "salt": "15556827153720",
                "sign": "067431debbc1e4c7666f6f4b1e204747",
                "ts": "1555682715372",
                "bv": "e2a78ed30c66e16a857c5b6486a1d326",
                "doctype": "json",
                "version": "2.1",
                "keyfrom": "fanyi.web",
                "action": "FY_BY_CLICKBUTTION"
            },
            callback=self.parse  # callback function
        )

    def parse(self, response):
        print("------------------")
        print(response.body)
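Rotating the User-Agent per request works, but if a single fixed UA is enough it is simpler to configure it once in settings.py. A minimal sketch reusing one of the strings above (USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings):
settings.py
# Applied to every request made by the project
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0"
# Or set default headers for every request:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0",
}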
II. Case Study: Crawling Questions from a Consultation Platform
Goal: use Scrapy to crawl the title, content, and URL of every consultation post within a given page range on a consultation platform.
Page 1: http://wz.sun0769.com/index.php/question/huiyin?page=0
Page 2: http://wz.sun0769.com/index.php/question/huiyin?page=30
Page 3: http://wz.sun0769.com/index.php/question/huiyin?page=60
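The page parameter advances by 30 per page (presumably 30 posts per listing page). As a small sketch, the first few listing URLs could also be generated directly instead of following pagination in the spider:
base = "http://wz.sun0769.com/index.php/question/huiyin?page="
start_urls = [base + str(30 * i) for i in range(3)]  # offsets 0, 30, 60 -> pages 1-3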
1. Create the project
scrapy startproject replyPlatform
2. Generate the main spider script inside the project
scrapy genspider rpPlatform "wz.sun0769.com"
3. Define the target fields in items
Target fields for this crawl: title, content, url
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ReplyplatformItem(scrapy.Item):
    url = scrapy.Field()      # URL of the post
    title = scrapy.Field()    # title of the post
    content = scrapy.Field()  # body text of the post
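For reference, a scrapy.Item behaves like a dict, which is what lets the pipeline below serialize it with str() or dict(). A quick hypothetical check (the value is made up purely for illustration):
item = ReplyplatformItem()
item["title"] = "example title"
print(dict(item))  # {'title': 'example title'}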
4. Write the main spider script: rpPlatform.py
rpPlatform.py
# -*- coding: utf-8 -*-
import scrapy
from replyPlatform.items import ReplyplatformItem


class RpplatformSpider(scrapy.Spider):
    name = 'rpPlatform'
    allowed_domains = ['wz.sun0769.com']
    url = "http://wz.sun0769.com/index.php/question/huiyin?page="
    num = 0
    start_urls = [url + str(num)]

    # Collect the URL of every post on the listing page
    def parse(self, response):
        # Extract the href of each post into a list
        links = response.xpath('//div[@class="newsHead clearfix"]/table//td/a[@class="news14"]/@href').extract()
        # Request each post and handle it with parse_item
        for link in links:
            yield scrapy.Request(link, callback=self.parse_item)
        # Automatic pagination
        if self.num <= 150:
            self.num += 30
            # Request the next listing page
            yield scrapy.Request(self.url + str(self.num), callback=self.parse)

    # Extract the content of each post
    def parse_item(self, response):
        item = ReplyplatformItem()  # create a new item instance
        item["url"] = response.url
        item["title"] = response.xpath('//span[@class="niae2_top"]/text()').extract()[0]
        item["content"] = "".join(response.xpath('//td[@class="txt16_3"]/text()').extract())
        yield item
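Both XPath expressions are tied to this site's markup, so it is worth verifying them before a full crawl. One convenient way is the Scrapy shell; a sketch of such a check:
scrapy shell "http://wz.sun0769.com/index.php/question/huiyin?page=0"
>>> response.xpath('//div[@class="newsHead clearfix"]/table//td/a[@class="news14"]/@href').extract()[:3]
>>> exit()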
5. Post-process items and save data in pipelines
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ReplyplatformPipeline(object):
    def __init__(self):
        self.filename = open("reply.txt", "a", encoding="utf-8")

    def process_item(self, item, spider):
        # Write each returned item to the target file
        result = str(item) + "\n\n"
        self.filename.write(result)
        return item

    def close_spider(self, spider):  # Scrapy calls close_spider when the spider finishes
        self.filename.close()
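Writing str(item) stores a Python-repr style dump. If the file is meant to be parsed again later, an alternative sketch (not used elsewhere in this walkthrough; ReplyplatformJsonPipeline is a hypothetical name) could write one JSON object per line instead:
import json


class ReplyplatformJsonPipeline(object):  # hypothetical alternative pipeline
    def __init__(self):
        self.filename = open("reply.jsonl", "a", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) turns the scrapy.Item into a plain dict; ensure_ascii=False keeps Chinese text readable
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.filename.close()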
6. Configure the project settings file settings.py
settings.py
(1) Disable compliance with the robots.txt rules by commenting out the line:
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
(2) Enable the item pipeline and set its priority:
When several pipelines coexist, a smaller number means higher priority; the default value is 300.
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'replyPlatform.pipelines.ReplyplatformPipeline': 300,
}
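Instead of commenting the line out, the same effect can be achieved by setting the flag explicitly; either way the spider stops honoring robots.txt:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False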
7. Run the main spider script rpPlatform.py
Change into the directory: \scrapyProject\replyPlatform
Run the command: scrapy crawl rpPlatform
Once the crawl finishes, the specified txt file (reply.txt) appears in that directory, holding the scraped data.
8. Use main.py as a shortcut to launch the crawl command
Create a main.py file in the project directory (\scrapyProject\replyPlatform) to serve as a shortcut for launching the crawl command.
main.py
from scrapy import cmdline

cmd = "scrapy crawl rpPlatform"  # the crawl command to run
cmdline.execute(cmd.split())     # execute the command; split() splits on whitespace by default
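With this file in place, the crawl can be started with python main.py from the project directory (the one containing scrapy.cfg), which also makes it easy to run or debug the spider from inside an IDE.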