
[CP_15] Python Crawler Framework 02: Scraping Consultation Questions with Scrapy

2019-04-21  Fighting_001

Contents

I. Sending POST Requests with Scrapy
    1. Sending a POST Request with Scrapy
    2. Adding Request Headers
II. Case Study: Scraping Questions from a Consultation Platform
    1. Create the Project
    2. Generate the Spider Script Inside the Project
    3. Define the Target Fields in items
    4. Write the Spider Script: rpPlatform.py
    5. Post-process Items and Save Data in pipelines
    6. Configure the Project Settings File settings.py
    7. Run the Spider rpPlatform.py
    8. Launch the Crawl Command Quickly with main.py

I. Sending POST Requests with Scrapy

1. Sending a POST Request with Scrapy

Create the project: scrapy startproject youdaoTranslate
Generate the spider script inside the project: scrapy genspider ydTranslate "fanyi.youdao.com"

ydTranslate.py

# -*- coding: utf-8 -*-
import scrapy

class YdtranslateSpider(scrapy.Spider):
    name = 'ydTranslate'
    allowed_domains = ['fanyi.youdao.com']

    def start_requests(self):
        url="http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

        # Enqueue a POST request carrying the form data
        yield scrapy.FormRequest(
            url=url,
            formdata={
                "i":"测试",
                "from":"AUTO",
                "to":"AUTO",
                "smartresult":"dict",
                "client":"fanyideskweb",
                "salt":"15556827153720",
                "sign":"067431debbc1e4c7666f6f4b1e204747",
                "ts":"1555682715372",
                "bv":"e2a78ed30c66e16a857c5b6486a1d326",
                "doctype":"json",
                "version":"2.1",
                "keyfrom":"fanyi.web",
                "action":"FY_BY_CLICKBUTTION"
            },
            callback=self.parse  # callback function
        )

    def parse(self, response):
        print("------------------")
        print(response.body)

Run the spider: scrapy crawl ydTranslate
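
Because doctype is set to json in the form data, the response body should come back as a JSON string. Below is a minimal sketch of a JSON-decoding parse method; the translateResult key is an assumption about how the Youdao web API typically structures its response, not something this tutorial guarantees:

# at the top of ydTranslate.py
import json

# sketch: a JSON-decoding replacement for the parse method
def parse(self, response):
    print("------------------")
    # doctype=json was requested, so the body should be a JSON string
    data = json.loads(response.text)
    # "translateResult" is assumed to hold the translation result;
    # fall back to the whole payload if the key is absent
    print(data.get("translateResult", data))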

2. Adding Request Headers

ydTranslate.py

# -*- coding: utf-8 -*-
import scrapy
import random

class YdtranslateSpider(scrapy.Spider):
    name = 'ydTranslate'
    allowed_domains = ['fanyi.youdao.com']

    def start_requests(self):
        url="http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

        # Pick one User-Agent at random from several candidates
        agent1="Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0"
        agent2="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
        agent3="Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
        agent4="MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
        agent5="Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36"
        agent6="Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044207 Mobile Safari/537.36 MicroMessenger/6.7.3.1340(0x26070332) NetType/4G Language/zh_CN Process/tools"
        ls=[agent1,agent2,agent3,agent4,agent5,agent6]
        agent=random.choice(ls)
        # Build the request headers
        header={"User-Agent":agent}

        # Enqueue a POST request carrying the form data
        yield scrapy.FormRequest(
            url=url,
            headers=header,
            formdata={
                "i":"测试",
                "from":"AUTO",
                "to":"AUTO",
                "smartresult":"dict",
                "client":"fanyideskweb",
                "salt":"15556827153720",
                "sign":"067431debbc1e4c7666f6f4b1e204747",
                "ts":"1555682715372",
                "bv":"e2a78ed30c66e16a857c5b6486a1d326",
                "doctype":"json",
                "version":"2.1",
                "keyfrom":"fanyi.web",
                "action":"FY_BY_CLICKBUTTION"
            },
            callback=self.parse  # callback function
        )

    def parse(self, response):
        print("------------------")
        print(response.body)
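
Instead of hard-coding the User-Agent list inside the spider, the same rotation can be applied to every request in a downloader middleware. Here is a minimal sketch, assuming a USER_AGENT_LIST setting that you define yourself in settings.py; neither the class name nor that setting is built into Scrapy:

# middlewares.py -- sketch of a random User-Agent downloader middleware
import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a custom setting you would add to settings.py
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # overwrite the User-Agent header before the request is sent
        request.headers["User-Agent"] = random.choice(self.user_agents)

It would then be enabled through a DOWNLOADER_MIDDLEWARES entry in settings.py.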

II. Case Study: Scraping Questions from a Consultation Platform

Goal: use Scrapy to crawl the title, content, and URL of the consultation questions posted on a question platform within a given range of pages.

Page 1: http://wz.sun0769.com/index.php/question/huiyin?page=0
Page 2: http://wz.sun0769.com/index.php/question/huiyin?page=30
Page 3: http://wz.sun0769.com/index.php/question/huiyin?page=60
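
The page parameter grows in steps of 30 (each listing page holds 30 records), so the offset for page n is (n - 1) * 30. A quick sketch of how the URLs for a page range could be generated:

base_url = "http://wz.sun0769.com/index.php/question/huiyin?page="

# offset for page n is (n - 1) * 30: page 1 -> 0, page 2 -> 30, page 3 -> 60
page_urls = [base_url + str((n - 1) * 30) for n in range(1, 4)]
print(page_urls)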

1. Create the Project

scrapy startproject replyPlatform

2. Generate the Spider Script Inside the Project

scrapy genspider rpPlatform "wz.sun0769.com"
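
This creates spiders/rpPlatform.py containing a bare skeleton, roughly like the following (the default start_urls points at the domain root and is replaced in step 4):

# -*- coding: utf-8 -*-
import scrapy

class RpplatformSpider(scrapy.Spider):
    name = 'rpPlatform'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/']

    def parse(self, response):
        pass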

3. Define the Target Fields in items

Target fields for this case: title, content, url
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ReplyplatformItem(scrapy.Item):
    url=scrapy.Field()  # URL of each post
    title=scrapy.Field()    # title of each post
    content=scrapy.Field()  # content of each post

4. Write the Spider Script: rpPlatform.py

rpPlatform.py

# -*- coding: utf-8 -*-
import scrapy
from replyPlatform.items import ReplyplatformItem

class RpplatformSpider(scrapy.Spider):
    name = 'rpPlatform'
    allowed_domains = ['wz.sun0769.com']
    url="http://wz.sun0769.com/index.php/question/huiyin?page="
    num=0
    start_urls=[url+str(num)]

    # Collect the URL of each post
    def parse(self, response):
        # Extract the href of each post into a list
        links=response.xpath('//div[@class="newsHead clearfix"]/table//td/a[@class="news14"]/@href').extract()
        # Request each post and handle the response with parse_item
        for link in links:
            yield scrapy.Request(link,callback=self.parse_item)

        # Automatic pagination
        if self.num<=150:
            self.num+=30
            # Request the next listing page
            yield scrapy.Request(self.url+str(self.num),callback=self.parse)

    # Scrape the content of each post
    def parse_item(self,response):
        item=ReplyplatformItem()    # create a new item instance
        item["url"]=response.url
        item["title"]=response.xpath('//span[@class="niae2_top"]/text()').extract()[0]
        item["content"]="".join(response.xpath('//td[@class="txt16_3"]/text()').extract())
        yield item
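
If the extracted href values turn out to be relative paths rather than absolute URLs, they have to be completed before being requested; response.urljoin handles that. A hedged variant of the request loop:

        # variant of the loop in parse(): urljoin turns a relative href such as
        # "/html/question/..." into an absolute URL before the request is sent
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_item)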

5. Post-process Items and Save Data in pipelines

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ReplyplatformPipeline(object):
    
    def __init__(self):
        self.filename=open("reply.txt","a",encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize each returned item and write it to the output file
        result=str(item)+"\n\n"
        self.filename.write(result)
        return item

    def close_spider(self,spider):
        self.filename.close()
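
Writing str(item) stores each item's repr as plain text. If structured output is preferred, a pipeline can instead serialize every item to one JSON object per line; a minimal sketch, where the file name reply.jsonl is only illustrative:

# pipelines.py -- alternative sketch: one JSON object per line
import json

class JsonLinesReplyPipeline(object):
    def open_spider(self, spider):
        self.file = open("reply.jsonl", "a", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) converts the scrapy Item into a plain dict for json.dumps
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

If used, it would be registered in ITEM_PIPELINES in place of (or alongside) ReplyplatformPipeline.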

6. Configure the Project Settings File settings.py

settings.py
(1) Comment out the robots.txt compliance rule:

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

(2) Enable the item pipeline and set its priority:
When several pipelines coexist, a smaller number means a higher priority; the default is 300.

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'replyPlatform.pipelines.ReplyplatformPipeline': 300,
}
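
Other settings can be adjusted in the same file if needed, for example a download delay to throttle requests and a project-wide default User-Agent; the values below are only illustrative:

# settings.py -- optional extras (illustrative values)
DOWNLOAD_DELAY = 1    # pause 1 second between requests
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0"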

7. Run the Spider rpPlatform.py

Change into the directory: \scrapyProject\replyPlatform
Run the command: scrapy crawl rpPlatform

After the crawl finishes, the specified txt file is generated in that directory with the scraped data stored inside.

8. Launch the Crawl Command Quickly with main.py

Create a main.py file in the project directory (\scrapyProject\replyPlatform) as a shortcut for launching the crawl command.
main.py

from scrapy import cmdline

cmd="scrapy crawl rpPlatform"   # 需要执行的爬虫cmd命令
cmdline.execute(cmd.split())    # 执行命令;默认以空格分割切片
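
With main.py in place, running python main.py from the project root has the same effect as typing scrapy crawl rpPlatform in a terminal, which makes it easy to launch or debug the spider directly from an IDE.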