2022-09-19 Web Scraping Diary

2022-09-18, by 会爬虫的小蟒蛇

I worked all day today and didn't learn anything new, which feels like a real loss. So here is a filler blog post to make up for it!

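This site is a good one for folks who are just getting started with web scraping (I won't say it's a bit easy; I'll just say it poses no difficulty at all).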

It's so simple that there isn't much to explain, so let's go straight to the code.

import scrapy


class HeilongjiangfazhanhegaigeweiyuanhuiSpider(scrapy.Spider):
    name = 'HeiLongJiangFaZhanHeGaiGeWeiYuanHui'

    def start_requests(self):
        # Fetch the first listing page; Scrapy sends the response to parse() by default.
        yield scrapy.Request(
            url='http://hlj.tzxm.gov.cn/xzxk/xzxk_list',
        )

    def parse(self, response):
        trs = response.css("#list-content>tr")

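        for tr in trs:
            # The document ID is embedded in the link's onclick attribute.
            onclick = tr.css(".info a::attr(onclick)").extract_first()
            if not onclick:
                # Skip rows without a link instead of failing when slicing None.
                continue
            item = {
                "title_name": tr.css(".info a::text").extract_first(),
                # [9:-2] strips the text wrapped around the ID in the onclick value.
                "title_url": "http://hlj.tzxm.gov.cn/xzxk/xzxk_page?APPROVAL_DOC_ID=" + onclick[9:-2],
                "title_date": tr.css("td:nth-child(4)::text").extract_first(),
                # "content_html": response.css(".deatilContent").extract_first(),
            }
            # Request the detail page and pass the partly filled item along in meta.
            yield scrapy.Request(
                url=item["title_url"],
                callback=self.context_parse,
                meta={
                    "item": item
                }
            )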

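        # The next page is requested by POSTing back to the list URL with values read from the current page.
        sqlprint = response.xpath('//*[@name="sqlprint"]/@value').extract_first()
        page = response.css("#page::attr(value)").extract_first()
        if not trs or page is None or sqlprint is None:
            # No rows or no pager fields: stop here rather than fail on int(None).
            return
        yield scrapy.FormRequest(
            url="http://hlj.tzxm.gov.cn/xzxk/xzxk_list",
            formdata={
                'pagecount': '10',
                'page': page,
                'action': 'nextPage',
                'sqlprint': sqlprint,
                'totalPage': str(int(page)+1)
            },
            callback=self.parse,
        )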

    def context_parse(self, response):
        # Attach the raw HTML of the detail page to the item built in parse().
        item = response.meta["item"]
        item["content_html"] = response.text
        yield item
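
The `[9:-2]` slice in `parse` depends on the exact length of the text wrapped around the document ID in the `onclick` attribute. As a more tolerant alternative, here is a minimal sketch that pulls out the first quoted argument with a regular expression. The helper names `extract_doc_id` and `build_detail_url` and the sample value `viewDoc('12345')` are hypothetical; the real attribute format on the site isn't shown in this post, so treat the pattern as an assumption to check against the actual HTML.

```python
import re
from typing import Optional

# Matches the first quoted argument of a JavaScript-style call, e.g. foo('abc').
_QUOTED_ARG = re.compile(r"""\(\s*['"]([^'"]+)['"]""")

# Detail-page prefix copied from the spider above.
DETAIL_URL = "http://hlj.tzxm.gov.cn/xzxk/xzxk_page?APPROVAL_DOC_ID="


def extract_doc_id(onclick: Optional[str]) -> Optional[str]:
    """Return the first quoted argument in an onclick value, or None if there is none."""
    if not onclick:
        return None
    match = _QUOTED_ARG.search(onclick)
    return match.group(1) if match else None


def build_detail_url(onclick: Optional[str]) -> Optional[str]:
    """Return the detail-page URL for a row, or None when no ID can be found."""
    doc_id = extract_doc_id(onclick)
    return DETAIL_URL + doc_id if doc_id else None
```

In the spider, a row could then be skipped whenever `build_detail_url(...)` returns `None`, so a missing or differently shaped attribute doesn't abort the whole page.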

Some parts aren't written very elegantly; experts, please go easy on me!
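
One spot that could be tidied up is the pagination step, where the form fields are built inline in `parse`. Below is a minimal sketch of moving that into a small helper that also decides when to stop. The name `next_page_formdata` and the `max_pages` safety cap are hypothetical additions; the field names and values are copied from the spider above, and what the site actually expects in each of them is not verified here.

```python
from typing import Dict, Optional


def next_page_formdata(page: Optional[str], sqlprint: Optional[str],
                       max_pages: int = 1000) -> Optional[Dict[str, str]]:
    """Build the POST fields for the next listing page, or return None to stop."""
    if page is None or sqlprint is None:
        # The pager fields are missing, so there is nothing to post.
        return None
    try:
        current = int(page)
    except ValueError:
        # The page field did not hold a number.
        return None
    if current >= max_pages:
        # Hypothetical safety cap so a misbehaving pager cannot loop forever.
        return None
    return {
        "pagecount": "10",
        "page": page,
        "action": "nextPage",
        "sqlprint": sqlprint,
        "totalPage": str(current + 1),
    }
```

Inside `parse`, the result would be passed to `scrapy.FormRequest(formdata=...)` only when it is not `None`.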
