Web scraping: extracting data wrapped in <record> tags

2022-06-03, by 会爬虫的小蟒蛇

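Target URL: http://www.ts.gov.cn/col/col1300641/index.html

The first thing you run into while crawling this site is a 304 response. For the fix, see my previous blog post.

With the 304 out of the way, we can extract the news list.

Problem:

When you try to extract it, you find that the data is plainly there in the page source, yet neither XPath nor CSS selectors can locate it.

Look closely at the source and you will find the following code: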

<script type="text/xml">
    <datastore>
        <nextgroup>
            <![CDATA[<a href="/module/jpage/dataproxy.jsp?page=1&appid=1&appid=1&webid=2326&path=/&columnid=1300641&unitid=4365453&webname=泰顺县人民政府&permissiontype=0"></a>]]>     
        </nextgroup>
        <recordset>
        <record><![CDATA[
<div style="width:100%; height:33px;"><div style="float:left; height:33px; line-height:33px;font-size:14px; margin-left:2px;"><a href='/art/2022/6/2/art_1300641_59043747.html' title='泰顺县环境质量公报2021年' target=_blank >泰顺县环境质量公报2021年</a></div> <div style="float:right; margin-right:3px; height:33px; line-height:33px;font-size:14px; color:#888;">2022-06-02</div></div>]]></record>
        <record><![CDATA[
<div style="width:100%; height:33px;"><div style="float:left; height:33px; line-height:33px;font-size:14px; margin-left:2px;"><a href='/art/2022/6/1/art_1300641_59043730.html' title='温州市生态环境局泰顺分局2022年6月1日拟对浙江乐吹塑料有限公司年产3000吨缠绕膜和5400吨共挤膜建设项目环境影响报告表作出审批意见的公告' target=_blank >温州市生态环境局泰顺分局2022年6月1日拟对浙江乐吹塑料有限公司年产3000吨...</a></div> <div style="float:right; margin-right:3px; height:33px; line-height:33px;font-size:14px; color:#888;">2022-06-01</div></div>]]></record>
        <record><![CDATA[
<div style="width:100%; height:33px;"><div style="float:left; height:33px; line-height:33px;font-size:14px; margin-left:2px;"><a href='/art/2022/5/31/art_1300641_59043707.html' title='关于泰顺雅阳文旅小镇项目（暂）（南区）环境影响报告表的审查意见' target=_blank >关于泰顺雅阳文旅小镇项目（暂）（南区）环境影响报告表的审查意见</a></div> <div style="float:right; margin-right:3px; height:33px; line-height:33px;font-size:14px; color:#888;">2022-05-31</div></div>]]></record>
        <record><![CDATA[
<div style="width:100%; height:33px;"><div style="float:left; height:33px; line-height:33px;font-size:14px; margin-left:2px;"><a href='/art/2022/5/30/art_1300641_59043690.html' title='关于受理《年产30万吨环保型沥青混合料搅拌站技改项目环境影响评价报告表》的公告' target=_blank >关于受理《年产30万吨环保型沥青混合料搅拌站技改项目环境影响评价报告表》的公告</a></div> <div style="float:right; margin-right:3px; height:33px; line-height:33px;font-size:14px; color:#888;">2022-05-30</div></div>]]></record>

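The data we need is first nested inside a <script> tag, then nested again inside <record> tags, and on top of that wrapped in odd "<![CDATA[" markers. An HTML parser treats everything inside <script> as plain text, so selectors never see these entries as elements, which makes them very awkward to handle.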
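To make the cause concrete, here is a minimal sketch of why the selectors come back empty. It uses the standard library's `html.parser` as a stand-in for the parser behind Scrapy's selectors, and `PAGE` is a shortened, made-up sample in the same shape as the source above; the names `LinkCollector` and `collect_links` are illustrative only.

```python
from html.parser import HTMLParser

# Shortened, hypothetical sample in the same shape as the page source.
PAGE = """<html><body>
<a href="/visible.html">visible link</a>
<script type="text/xml"><datastore><recordset>
<record><![CDATA[<div><a href='/art/1.html'>hidden link</a></div>]]></record>
</recordset></datastore></script>
</body></html>"""


class LinkCollector(HTMLParser):
    """Records the href of every <a> element the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def collect_links(markup):
    """Return the hrefs of all <a> elements found in the markup."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# The link outside <script> is reported as an element; the one inside
# <script> is delivered as raw text and never reaches handle_starttag.
print(collect_links(PAGE))
```

Only `/visible.html` is reported. The link inside the script block exists in the source text but not in the parsed tree, which matches what the selectors show.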

Solution:

Let's go straight to the code.

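    # Imports needed at the top of the spider module
    import re
    from copy import deepcopy

    import scrapy
    from lxml import etree

    # Method of the spider class
    def parse(self, response):
        item = {}
        # Strip the <record> and CDATA wrappers, plus newlines so "." in the regex can span the block
        html = response.text.replace('<record>','').replace('</record>','').replace('<![CDATA[','').replace(']]>',"").replace('\n',"")
        # Pull the <datastore>...</datastore> block out of the <script> tag
        reResponse = re.search(r'<datastore>(.*)</datastore>', html).group()
        # Parse the cleaned markup so XPath can see the entries as elements
        html = etree.HTML(reResponse)
        # Each entry is now a <div> directly under <recordset>; [1:] skips the first one, as in the original code
        div_list = html.xpath("//recordset/div")[1:]
        for div in div_list:
            item["title_url"] = response.urljoin(div.xpath("./div/a/@href")[0])
            item["title_name"] = div.xpath("./div/a/text()")[0]
            item["title_date"] = div.xpath("./div[2]/text()")[0]
            # deepcopy gives each request its own copy of the item
            yield scrapy.Request(
                meta={'item': deepcopy(item)},
                url=item["title_url"],
                callback=self.content_parse,
                dont_filter=True)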

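First, we use replace to strip out all the odd wrappers.

Note the last replacement, which removes the newlines: it is there so that the regular expression in the next step can match the data we need.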
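The newline removal matters because, by default, "." in a Python regular expression does not match a newline. The sketch below shows this on a made-up three-line string (`text` is a hypothetical stand-in for the page source), along with `re.DOTALL` as an alternative to stripping newlines.

```python
import re

# Hypothetical multi-line stand-in for the page source after the wrappers are removed.
text = "<datastore>\n<recordset><div>item</div></recordset>\n</datastore>"
pattern = r"<datastore>(.*)</datastore>"

# By default "." does not match a newline, so the multi-line block is missed.
no_flag = re.search(pattern, text)

# Option 1 (the approach used in this post): strip the newlines first.
stripped = re.search(pattern, text.replace("\n", ""))

# Option 2: leave the text alone and let "." match newlines via re.DOTALL.
dotall = re.search(pattern, text, re.DOTALL)

print(no_flag is None)
print(stripped.group(1))
print(repr(dotall.group(1)))
```

Either option works here. Stripping newlines keeps the pattern simple, while `re.DOTALL` leaves the original text untouched.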

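Second, we use a regular expression to reach straight into the script tag and pull out the <datastore> block.

At this point, what we have extracted is clean XML/HTML.

We can now locate the data we need directly with XPath.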
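For trying the idea outside a Scrapy project, here is a rough standalone variant using only the standard library. It is not the code above but a swapped-in approach: a regex pulls out each record's CDATA payload and `html.parser` reads the fragment in place of lxml and XPath. `PAGE` is a shortened, made-up sample, and `RecordParser` and `extract_records` are hypothetical names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Shortened, hypothetical sample in the same shape as the page source.
PAGE = """<script type="text/xml"><datastore><recordset>
<record><![CDATA[
<div><div><a href='/art/2022/6/2/art_1.html' title='Report A' target=_blank >Report A</a></div> <div>2022-06-02</div></div>]]></record>
<record><![CDATA[
<div><div><a href='/art/2022/6/1/art_2.html' title='A very long notice title' target=_blank >A very long...</a></div> <div>2022-06-01</div></div>]]></record>
</recordset></datastore></script>"""

# One match per record; re.S lets "." span the newline after "<![CDATA[".
RECORD_RE = re.compile(r"<record><!\[CDATA\[(.*?)\]\]></record>", re.S)


class RecordParser(HTMLParser):
    """Collects the link attributes and text pieces of one record fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.title = None
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self.href = attr_map.get("href")
            self.title = attr_map.get("title")

    def handle_data(self, data):
        # Keep only non-blank text pieces.
        if data.strip():
            self.texts.append(data.strip())


def extract_records(page, base_url):
    """Return one dict per <record>, with an absolute URL, title and date."""
    items = []
    for fragment in RECORD_RE.findall(page):
        parser = RecordParser()
        parser.feed(fragment)
        parser.close()
        items.append({
            "title_url": urljoin(base_url, parser.href),
            "title_name": parser.title,       # full title from the attribute
            "title_date": parser.texts[-1],   # last text piece is the date
        })
    return items


for entry in extract_records(PAGE, "http://www.ts.gov.cn/col/col1300641/index.html"):
    print(entry)
```

One design choice differs from the spider code: this sketch reads the title from the `title` attribute, because the link text in the page source is truncated with "..." for long titles.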

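Summary:

Regular expressions are not something I reach for often, and they are hard to maintain, but in hostile markup like this they can work wonders.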
