Scrapy Crawler Framework, Part 5 - Other Operations: POST Pagination Requests

2021-01-29  一只酸柠檬精

Implementing POST pagination requests with Scrapy

Scrapy sends GET requests by default. To send a POST request, you need to override the spider's start_requests method.

# Understanding the return value of start_requests
def start_requests(self):
    url = ""
    data = {}
    headers = {}

    yield scrapy.FormRequest(url=url,  # the URL to POST to
                             formdata=data,  # the POST payload, a dict of strings
                             headers=headers,  # custom headers; these can also be set in settings.py
                             callback=self.parse  # the callback that parses the response
                             )
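
A side note: when the POST fields actually come from an HTML form on a search page, scrapy.FormRequest.from_response can pre-fill the form's hidden inputs so that only the visible fields need to be supplied. A minimal sketch, assuming a placeholder search page and field name (neither comes from this article):

# Sketch only: the URL and the "keyword" field are placeholders
import scrapy

class FormPostDemoSpider(scrapy.Spider):
    name = 'form_post_demo'
    start_urls = ['http://example.com/search']  # GET the page that contains the <form>

    def parse(self, response):
        # from_response reads the first <form> on the page, keeps its hidden
        # inputs, and merges in the fields given here before sending the POST
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'keyword': 'test'},
            callback=self.parse_result,
        )

    def parse_result(self, response):
        self.logger.info('POST response length: %s', len(response.text))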

To implement POST pagination in Scrapy, only the start_requests method is rewritten, so the spider source differs only slightly from the GET version; settings.py, items, pipelines and the rest are handled exactly as they are for GET requests (a settings sketch follows below).
To create the project, see Scrapy Crawler Framework, Part 2 - Creating a Scrapy Project.
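
Nothing POST-specific is needed in settings.py; a typical configuration for a project like this might look as follows (illustrative values, not taken from the original project):

# settings.py (illustrative values)
BOT_NAME = 'ktgg'
ROBOTSTXT_OBEY = False   # commonly disabled in tutorials; check the site's policy yourself
DOWNLOAD_DELAY = 1       # throttle to roughly one request per second
USER_AGENT = 'Mozilla/5.0'   # placeholder user agent string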

Reference spider source:

# -*- coding: utf-8 -*-
import re
import scrapy
from ktgg.items import KtggItem


# Page being crawled:
# http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search.jsp?zd=splc
class FyktggSpider(scrapy.Spider):
    name = 'fyktgg'
    allowed_domains = ['hshfy.sh.cn']
    start_urls = ['http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search_content.jsp']

    # Scrapy sends GET requests by default; to send POST requests, override start_requests
    # The method name start_requests is fixed by the framework
    def start_requests(self):
        data = {"yzm": "WFi4",
                "ft": "",
                "ktrqks": "2021-03-13",
                "ktrqjs": "2021-04-13",
                "spc": "",
                "yg": "",
                "bg": "",
                "ah": "",
                "pagesnum": "3"}
        yield scrapy.FormRequest(url=self.start_urls[0], formdata=data, callback=self.parse)

        # Optional: use a proxy IP (see the middleware sketch below the spider)
        # ip = str(json.dumps(IpProxy.getRandomIP())).replace('"', '')
        # proxies = {
        #     'http': 'http://' + str(ip),
        #     'https': 'https://' + str(ip),
        # }
        # yield scrapy.FormRequest(url=self.start_urls[0], formdata=data, callback=self.parse, meta={'proxies':proxies})

    # Parse the current page and request the next one
    def parse(self, response):
        # Read the current page number from the pager
        now_page = response.xpath('//span[@class="current"]/text()').extract()[0].strip()
        print("Crawling page {}:".format(now_page))

        trs = response.xpath('//table[@id="report"]/tbody/tr')[1:]
        for tr in trs:
            # Create a KtggItem for this row
            item = KtggItem()

            item['fy'] = tr.xpath('./td[1]/font/text()').extract()[0].strip()
            item['ft'] = tr.xpath('./td[2]/font/text()').extract()[0].strip()
            item['ktrq'] = tr.xpath('./td[3]/text()').extract()[0].strip()
            item['ah'] = tr.xpath('./td[4]/text()').extract()[0].strip()
            item['ay'] = tr.xpath('./td[5]/text()').extract()[0].strip()
            item['cbbm'] = tr.xpath('./td[6]/div/text()').extract()[0].strip()
            item['spz'] = tr.xpath('./td[7]/div/text()').extract()[0].strip()
            item['yg'] = tr.xpath('./td[8]/text()').extract()[0].strip()
            item['bg'] = tr.xpath('./td[9]/text()').extract()[0].strip()
            # Hand the item off to the pipeline (pipelines.py)
            yield item

        # Request the next page: pull the page number out of the pager's "next" link
        next_page = re.findall(r"\d+", response.xpath('//div[@class="meneame"]/div/a[12]/@href').extract()[0].strip())[0]
        if next_page:
            data = {"yzm": "WFi4",
                    "ft": "",
                    "ktrqks": "2021-03-13",
                    "ktrqjs": "2021-04-13",
                    "spc": "",
                    "yg": "",
                    "bg": "",
                    "ah": "",
                    "pagesnum": "{}".format(next_page)}
            yield scrapy.FormRequest(url=self.start_urls[0], formdata=data, callback=self.parse)
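
The commented-out proxy snippet above only puts the proxy dict into request meta; for it to take effect, a downloader middleware has to read that meta key and set request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware then honours. A minimal sketch of such a middleware, assuming the 'proxies' meta key from the commented-out code and a ktgg.middlewares module path (the article does not show this file):

# middlewares.py (sketch; the 'proxies' meta key follows the commented-out spider code)
class RandomProxyMiddleware:
    def process_request(self, request, spider):
        proxies = request.meta.get('proxies')
        if proxies:
            # HttpProxyMiddleware routes the request through request.meta['proxy']
            request.meta['proxy'] = proxies.get('http')
        return None

# enable it in settings.py (module path is an assumption):
# DOWNLOADER_MIDDLEWARES = {'ktgg.middlewares.RandomProxyMiddleware': 350}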

items

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class KtggItem(scrapy.Item):
    # define the fields for your item here like:
    fy = scrapy.Field()  # court
    ft = scrapy.Field()  # courtroom
    ktrq = scrapy.Field()  # hearing date
    ah = scrapy.Field()  # case number
    ay = scrapy.Field()  # cause of action
    cbbm = scrapy.Field()  # handling department
    spz = scrapy.Field()  # presiding judge
    yg = scrapy.Field()  # plaintiff
    bg = scrapy.Field()  # defendant
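
Each yielded item ends up in the item pipeline. The pipeline used in this series writes to MongoDB/MySQL (see Part 4 below); for completeness, here is a minimal stand-in that appends items to a JSON Lines file (the class name and file name are illustrative):

# pipelines.py (minimal stand-in; the real pipeline is covered in Part 4)
import json

class KtggPipeline:
    def open_spider(self, spider):
        self.file = open('ktgg.jl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) turns the scrapy.Item into a plain dict before serialising
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

# enable it in settings.py:
# ITEM_PIPELINES = {'ktgg.pipelines.KtggPipeline': 300}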

Series links
Scrapy Crawler Framework, Part 1 - Scrapy Architecture
https://www.jianshu.com/p/39b326f9cad6
Scrapy Crawler Framework, Part 2 - Creating a Scrapy Project
https://www.jianshu.com/p/00d99a9628b0
Scrapy Crawler Framework, Part 3 - Data Processing, Persistence, and Problems Encountered
https://www.jianshu.com/p/8824623b551c
Scrapy Crawler Framework, Part 4 - Writing Data to a Database (MongoDB, MySQL)
https://www.jianshu.com/p/573ca74c2277
Reference:
Simple POST requests with Scrapy
