Python Web Crawler

36. Scrapy in Practice - Jianshu Hot Collections to CSV

2018-03-01  橄榄的世界

Target URL: https://www.jianshu.com/recommendations/collections?order_by=hot
Data to scrape: collection name, collection description, number of articles included, number of followers
Scraping method: the Scrapy framework
Storage: CSV file


Press F12 and observe the dynamically loaded URLs; there are 37 pages in total:

https://www.jianshu.com/recommendations/collections?page=1&order_by=hot
https://www.jianshu.com/recommendations/collections?page=2&order_by=hot
https://www.jianshu.com/recommendations/collections?page=3&order_by=hot
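
To confirm that these page-numbered URLs really return the collection markup, you can fetch one directly before writing the spider. A minimal check using the requests library (this snippet is my addition, not part of the original post):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
url = 'https://www.jianshu.com/recommendations/collections?page=2&order_by=hot'
resp = requests.get(url, headers=headers)
print(resp.status_code)                  # expect 200 if the paginated URL is valid
print('collection-wrap' in resp.text)    # expect True if the collection markup is present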

1. items.py

import scrapy

class ZhuantiItem(scrapy.Item):
    # define the fields for your item here:
    name = scrapy.Field()     # collection name
    content = scrapy.Field()  # collection description
    article = scrapy.Field()  # number of articles included
    fans = scrapy.Field()     # number of followers
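
A ZhuantiItem behaves like a dict, which is how the spider fills it in below. A quick illustration (the field values here are made up):

item = ZhuantiItem()
item['name'] = '程序员'    # hypothetical collection name
item['fans'] = '1000'
print(dict(item))          # roughly: {'name': '程序员', 'fans': '1000'}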

2. zhuantispider.py

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from zhuanti.items import ZhuantiItem

class ZhuantiSpider(CrawlSpider):

    name = "zhuanti"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=1&order_by=hot"]

    def parse(self, response):
        selector = Selector(response)
        infos = selector.xpath('//div[@class="collection-wrap"]')
        for info in infos:
            try:
                name = info.xpath('a/h4/text()').extract()[0]
                content = info.xpath('a/p/text()').extract()[0].replace('\n', '')
                article = info.xpath('div[@class="count"]/a/text()').extract()[0]
                fans = info.xpath('div[@class="count"]/text()').extract()[0].strip('· ')

                item = ZhuantiItem()    # create a fresh item for each collection
                item['name'] = name
                item['content'] = content
                item['article'] = article
                item['fans'] = fans
                yield item

            except IndexError:
                pass    # skip entries missing one of the fields

        # Build the hot-collection URLs for pages 2 through 37, request each one,
        # and reuse parse() as the callback
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(2, 38)]
        for url in urls:
            yield Request(url, callback=self.parse)
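
Before running the whole spider, the XPath expressions above can be tried interactively with Scrapy's shell, for example:

scrapy shell "https://www.jianshu.com/recommendations/collections?page=1&order_by=hot"
>>> infos = response.xpath('//div[@class="collection-wrap"]')
>>> infos[0].xpath('a/h4/text()').extract()[0]                      # first collection's name
>>> infos[0].xpath('div[@class="count"]/a/text()').extract()[0]     # its article count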

3. settings.py: this example uses Scrapy's built-in storage (Feed exports), so there is no need to process or store the data in pipelines.py.

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'     # request header
DOWNLOAD_DELAY = 0.5                 # wait 0.5 seconds between requests
FEED_URI = 'file:F:/zhuanti.csv'     # output file (Windows path)
FEED_FORMAT = 'csv'                  # export as a CSV file
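
Note that FEED_URI and FEED_FORMAT were later deprecated; on Scrapy 2.1+ the equivalent configuration uses the FEEDS setting (a sketch, untested against the code above):

FEEDS = {
    'F:/zhuanti.csv': {'format': 'csv'},    # output path and export format in one setting
}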

4. main.py

from scrapy import cmdline
cmdline.execute("scrapy crawl zhuanti".split())
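
Equivalently, you can run scrapy crawl zhuanti from a terminal in the project root, or start the crawl programmatically via Scrapy's CrawlerProcess API, a sketch of which follows:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # picks up settings.py
process.crawl('zhuanti')                          # spider name from zhuantispider.py
process.start()                                   # blocks until the crawl finishes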

Run main.py to start the crawl. Opening the result file in Notepad:



If you want to open the file in Excel, first open it in Notepad++, choose Encoding → "Encode in UTF-8", and save it again, so that Excel detects the encoding correctly.
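
Alternatively, Scrapy can write the byte-order mark itself, so the file opens correctly in Excel without re-saving. Assuming Scrapy 1.2 or later, add one line to settings.py:

FEED_EXPORT_ENCODING = 'utf-8-sig'    # UTF-8 with BOM, which Excel detects automatically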


