Scraping the Douban Movie Top 250 with Scrapy

2018-10-26  Author: 我的袜子都是洞

Target site: https://movie.douban.com/top250

[Screenshot: Jietu20181026-110711@2x.jpg]

Extract structured items (movie ranking, movie title, movie score, number of ratings):
items.py

import scrapy

class DoubanMovieItem(scrapy.Item):
    # Fields collected for each movie on the Top 250 list
    ranking = scrapy.Field()
    movie_name = scrapy.Field()
    score = scrapy.Field()
    score_num = scrapy.Field()

Spider source:
spider.py

import scrapy
from ..items import DoubanMovieItem

class DoubanSpider(scrapy.Spider):
   name = 'douban'
   start_urls = [
       "https://movie.douban.com/top250",
   ]

   def parse(self, response):
       # Each movie entry sits in its own <div class="item"> block.
       movies = response.xpath("//div[@class='item']")
       for movie in movies:
           # Create a fresh item per movie so yielded items do not share state.
           item = DoubanMovieItem()
           item['ranking'] = movie.xpath("./div/em/text()").extract_first()
           item['movie_name'] = movie.xpath("./div/div/a/span[1]/text()").extract_first()
           item['score'] = movie.xpath("./div/div/div[@class='star']/span[@class='rating_num']/text()").extract_first()
           item['score_num'] = movie.xpath("./div/div/div[@class='star']/span[4]/text()").extract_first()
           yield item

       # Follow the "next" link; on the last page it is absent and the crawl stops.
       next_page = response.xpath("//div[@class='paginator']/span[@class='next']/a/@href").extract_first()
       if next_page is not None:
           # next_page is a relative href; response.follow resolves it against the current URL.
           yield response.follow(next_page, callback=self.parse)
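
A note on pagination: the "next" link is a relative href made up of a query string only, so it has to be resolved against the page URL before it can be requested. Here is a minimal stdlib sketch of that resolution. The href values `?start=25&filter=` and `?start=50&filter=` are assumed examples of what the page returns, and `next_page_url` is a hypothetical helper, not part of the spider. As far as I know, Scrapy's `response.urljoin` resolves links against the response URL in the same way.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a relative next-page href against the current page URL."""
    return urljoin(current_url, href)


# Assumed href values; the real ones come from the paginator on each page.
first = "https://movie.douban.com/top250"
second = next_page_url(first, "?start=25&filter=")
third = next_page_url(second, "?start=50&filter=")

print(second)
print(third)
```

Because a query-only href replaces the query of the base URL and keeps its path, the result is the same whether it is resolved against the first page or against a later page.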

Run result:

[Screenshot: Jietu20181026-111047@2x.jpg]
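
All four fields come out of the XPath queries as strings. If numbers are wanted instead, one option is a small item pipeline. This is only a sketch: `CleanMoviePipeline` is a hypothetical name, the sample values are made up, and it assumes the `score_num` text looks like `123456人评价` (digits followed by a label).

```python
import re


class CleanMoviePipeline:
    """Hypothetical item pipeline: converts scraped strings to numbers."""

    def process_item(self, item, spider):
        # ranking such as "1" becomes the integer 1
        if item.get("ranking"):
            item["ranking"] = int(item["ranking"])
        # score such as "9.0" becomes the float 9.0
        if item.get("score"):
            item["score"] = float(item["score"])
        # score_num is assumed to look like "123456人评价"; keep the digits only
        match = re.search(r"\d+", item.get("score_num") or "")
        item["score_num"] = int(match.group()) if match else None
        return item


# Made-up sample in the same shape as the items the spider yields.
raw = {
    "ranking": "1",
    "movie_name": "Example Movie",
    "score": "9.0",
    "score_num": "123456人评价",
}
cleaned = CleanMoviePipeline().process_item(raw, spider=None)
print(cleaned)
```

To use something like this in a real project, the class would go in pipelines.py and be enabled through the `ITEM_PIPELINES` setting; the exact module path depends on the project name.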