7. A Quick Introduction to the Python Scrapy Framework

2019-03-11  波罗的海de夏天

Crawl target: 美剧天堂 (meijutt.com, an American TV series listing site)

Project setup steps:
1. cmd: cd PyCharmProject (the directory that will hold the project)
2. cmd: scrapy startproject movie
3. cmd: cd movie
4. cmd: scrapy genspider meiju meijutt.com
5. Open the project in an IDE (PyCharm):
items.py -- defines the storage template used to structure the scraped data

import scrapy

class MovieItem(scrapy.Item):
    # each field the spider fills in is declared as a scrapy.Field()
    name = scrapy.Field()
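In the spider below, the item is filled in with dict-style access. A minimal sketch of that interface, using a plain dict as a stand-in (a real `scrapy.Item` works the same way, except it raises `KeyError` for any key not declared as a `Field`):

```python
# Stand-in for MovieItem; scrapy.Item shares this dict-style interface,
# but only accepts fields declared with scrapy.Field().
item = {}
item['name'] = 'Breaking Bad'  # assign a scraped value
print(item['name'])            # → Breaking Bad
```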

meiju.py -- contains the actual spider code

import scrapy
from movie.items import MovieItem

class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://www.meijutt.com/new100.html']

    # start_urls is shorthand for a start_requests() like this:
    # def start_requests(self):
    #     for url in ['http://www.meijutt.com/new100.html']:
    #         yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
        for each_movie in movies:
            item = MovieItem()
            # extract_first() returns None instead of raising IndexError
            # when a <li> has no matching title attribute
            item['name'] = each_movie.xpath('./h5/a/@title').extract_first()
            yield item
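The selection logic in `parse` can be tried out without running a crawl. Scrapy itself evaluates XPath via parsel/lxml, which accepts full XPath; the sketch below reproduces the same two-step selection (find the list items, then read each `title` attribute) with the standard library's more limited `ElementTree` XPath support, against a hypothetical fragment mirroring the page structure:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like meijutt.com's new100 listing.
html = """
<body>
  <ul class="top-list  fn-clear">
    <li><h5><a title="Show A">Show A</a></h5></li>
    <li><h5><a title="Show B">Show B</a></h5></li>
  </ul>
</body>
"""
root = ET.fromstring(html)
# Same path as the spider: every <li> under the ul, then ./h5/a/@title
names = [li.find('./h5/a').get('title')
         for li in root.findall(".//ul[@class='top-list  fn-clear']/li")]
print(names)  # → ['Show A', 'Show B']
```

Note that the class attribute must match exactly, including the double space in "top-list  fn-clear".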

pipelines.py -- defines how the items are stored: to a file, a database, or elsewhere

class MoviePipeline(object):
    def process_item(self, item, spider):
        # utf-8 avoids UnicodeEncodeError on Chinese titles when the
        # platform default encoding (e.g. gbk on Windows) cannot encode them
        with open("my_meiju.txt", 'a', encoding="utf-8") as fp:
            fp.write(item['name'])
            fp.write('\n------------\n')
        return item  # pass the item on to any later pipelines
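Because the `spider` argument is unused here, `process_item` can be exercised directly, with a plain dict standing in for a `MovieItem`. A sketch with the output path made configurable (a hypothetical variation for testing, not part of the original pipeline):

```python
import os
import tempfile

class MoviePipeline(object):
    """Same logic as above, but writing to a configurable path."""
    def __init__(self, path):
        self.path = path

    def process_item(self, item, spider):
        with open(self.path, 'a', encoding='utf-8') as fp:
            fp.write(item['name'])
            fp.write('\n------------\n')
        return item

path = os.path.join(tempfile.mkdtemp(), 'my_meiju.txt')
pipeline = MoviePipeline(path)
pipeline.process_item({'name': '权力的游戏'}, spider=None)
with open(path, encoding='utf-8') as fp:
    print(fp.read())
```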

settings.py -- the project configuration file; sets the user agent, download delay, pipelines, and so on

ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
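The user agent and crawl delay mentioned above are plain assignments in settings.py. The setting names below are real Scrapy settings, but the values are illustrative:

```python
# Hypothetical additions to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # browser-like UA
DOWNLOAD_DELAY = 1        # seconds to wait between requests
ROBOTSTXT_OBEY = False    # skip robots.txt checks (use responsibly)

# Lower numbers run earlier when multiple pipelines are enabled (0-1000).
ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
```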

6. cmd: cd movie
7. cmd: scrapy crawl meiju (or scrapy crawl meiju --nolog to suppress log output)
