Super-detailed steps for scraping itcast teacher info with Scrapy

2019-05-12  writ

Create the project

For a web crawler, the first thing to do is create a Scrapy project from the command line. This article names the project cast:


scrapy startproject cast

cd cast

scrapy genspider ast itcast.cn
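If the commands succeed, genspider leaves a project layout roughly like the following (a sketch of Scrapy's default template; middleware files are omitted):

```
cast/
├── scrapy.cfg          # deploy configuration
└── cast/
    ├── __init__.py
    ├── items.py        # item definitions, edited below
    ├── pipelines.py    # item pipelines, edited below
    ├── settings.py     # project settings
    └── spiders/
        ├── __init__.py
        └── ast.py      # the generated spider, edited below
```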

The concrete steps are shown in the following screenshot:

[Screenshot: terminal session running the commands above]

Edit the generated items.py:


import scrapy


class CastItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    position = scrapy.Field()
    detail = scrapy.Field()

Edit the spider file ast.py:

# -*- coding: utf-8 -*-
import scrapy
from cast.items import CastItem


class AstSpider(scrapy.Spider):
    name = 'ast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        node_list = response.xpath('//div[@class="li_txt"]')
        for node in node_list:
            item = CastItem()
            # Query relative to the current node ('./h3'), not the whole
            # page ('//h3'), so each item gets its own teacher's data.
            # extract_first() returns a str, so no manual encoding is
            # needed under Python 3.
            item['name'] = node.xpath('./h3/text()').extract_first()
            item['position'] = node.xpath('./h4/text()').extract_first()
            item['detail'] = node.xpath('./p/text()').extract_first()
            yield item
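One pitfall worth calling out: inside the loop, the XPath queries must be relative to each node (./h3), not to the whole response (//h3); otherwise every item repeats the first teacher's data. A stdlib-only sketch (ElementTree standing in for Scrapy's selectors, with made-up teacher names) illustrates the difference:

```python
# Minimal demonstration of page-wide vs node-relative queries.
# The names here are placeholders, not real itcast data.
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<div>
  <div class="li_txt"><h3>Alice</h3><h4>Lecturer</h4></div>
  <div class="li_txt"><h3>Bob</h3><h4>Assistant</h4></div>
</div>""")

nodes = page.findall('.//div[@class="li_txt"]')

# Wrong: a page-wide query inside the loop always matches the first <h3>.
wrong = [page.find('.//h3').text for _ in nodes]
# Right: query relative to the current node.
right = [node.find('./h3').text for node in nodes]

print(wrong)  # ['Alice', 'Alice']
print(right)  # ['Alice', 'Bob']
```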

Edit the pipeline file pipelines.py:

import json


class CastPipeline(object):
    def __init__(self):
        # Open the output file once when the spider starts.
        self.f = open("1.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize the item's dict directly; wrapping it in str() first
        # would write a quoted Python repr instead of a JSON object.
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
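Note that json.dumps should receive the item's dict directly; passing str(dict(item)) serializes a Python repr string instead of a JSON object. A quick stdlib-only check (a plain dict standing in for the Scrapy item):

```python
import json

# A plain dict standing in for dict(item).
item = {"name": "张三", "position": "讲师"}

good = json.dumps(item, ensure_ascii=False)       # a JSON object
bad = json.dumps(str(item), ensure_ascii=False)   # a quoted repr string

print(good)  # {"name": "张三", "position": "讲师"}
print(bad)   # "{'name': '张三', 'position': '讲师'}"
```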

Finally, enable the pipeline in settings.py, and the project is done.
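Enabling the pipeline means registering it in settings.py, roughly like this (300 is an arbitrary priority in the 0-1000 range; lower values run earlier):

```python
# settings.py -- register the pipeline class so Scrapy calls it
# for every yielded item.
ITEM_PIPELINES = {
    'cast.pipelines.CastPipeline': 300,
}
```

After that, `scrapy crawl ast` runs the spider and writes the results to 1.json.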
