[Python crawler] Scraping and analyzing Beyond lyrics

2018-04-13 GaGLee

Testing and analysis

scrapy shell http://www.lrcgc.com/lyric-263-314689.html
The response status is 200, so the page can be crawled.

Writing the spider

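The target site requires crawling in two directions:

Horizontal: moving from one index page to the next by building the next page's URL. Goal: yield next_url
Vertical: going from an index page into each song's sub-page to get the lyrics. Goal: yield items
Note: decide on the item fields up front. Items and URLs can be yielded from anywhere in parse(), and an item's fields can be extracted and yielded anywhere in parse() or in the son_parse method that parse() hands requests to. This project defines four fields: name, album, and url are extracted and yielded in parse(), while lrc (the lyrics) is obtained in son_parse, which processes the sub-page. Both parse() and son_parse therefore need items = BeyongLrcItem() and yield items.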
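
The two directions can be illustrated without Scrapy. What follows is a minimal pure-Python sketch, not the project's code: `FAKE_SITE`, `parse_index`, `parse_song`, and `crawl` are hypothetical names, and the small in-memory dictionary stands in for real HTTP responses. Each callback yields dicts as items and `(url, callback)` tuples as follow-up requests, and a simple loop routes them roughly the way a crawler engine would.

```python
from collections import deque

# Hypothetical in-memory "site": index pages list song pages and may point to a next index page.
FAKE_SITE = {
    "list-1": {"songs": ["song-a", "song-b"], "next": "list-2"},
    "list-2": {"songs": ["song-c"], "next": None},
    "song-a": {"lrc": "lyrics of a"},
    "song-b": {"lrc": "lyrics of b"},
    "song-c": {"lrc": "lyrics of c"},
}


def parse_index(url, page):
    """Index-page callback: yields an item, then vertical and horizontal requests."""
    yield {"type": "index", "url": url, "songs": page["songs"]}
    for song_url in page["songs"]:
        yield (song_url, parse_song)       # vertical: into a sub-page
    if page["next"] is not None:
        yield (page["next"], parse_index)  # horizontal: to the next index page


def parse_song(url, page):
    """Sub-page callback: its only job is to yield an item."""
    yield {"type": "song", "url": url, "lrc": page["lrc"]}


def crawl(start_url):
    """Route yielded objects: dicts are collected as items, tuples are scheduled."""
    items, visited = [], []
    queue = deque([(start_url, parse_index)])
    while queue:
        url, callback = queue.popleft()
        visited.append(url)
        for result in callback(url, FAKE_SITE[url]):
            if isinstance(result, dict):
                items.append(result)
            else:
                queue.append(result)
    return items, visited


items, visited = crawl("list-1")
print(visited)
```

This toy loop processes requests strictly in queue order. Scrapy schedules requests concurrently, so the order in which real callbacks run is not guaranteed.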

In items.py, define four fields

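    import scrapy

    class BeyongLrcItem(scrapy.Item):
        name = scrapy.Field()   # song titles
        album = scrapy.Field()  # album column
        url = scrapy.Field()    # links to the lyrics pages
        lrc = scrapy.Field()    # lyrics text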

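In spider.py, define two methods: parse() and son_parse()
2.1 Defining parse()
Tip: don't forget to import the item class from items, and mark the project directory as Source Root in the IDE so that the import resolves
Key points: keep both tasks in mind (yield items and yield next_url), and use an if to decide which kind of next_url to yield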

    def parse(self, response):
        items = BeyongLrcItem()
        items["name"] = response.xpath("//div[@class='thread_posts_list']/table/tbody/tr/td[1]/a/text()").extract()
        # There is no text() step here, so this extracts each <td> element as markup
        items["album"] = response.xpath("//div[@class='thread_posts_list']/table/tbody/tr/td[2]").extract()
        items["url"] = response.xpath("//div[@class='thread_posts_list']/table/tbody/tr/td[1]/a/@href").extract()
        now_page_num = response.xpath("//div[@class='pages']/strong/text()").extract()[0]
        # Even for a single number, xpath(...).extract() returns a list; int() cannot take a list, so [0] is needed to pull the element out
        yield items
        print("Extracted name, album, and url for all songs on index page", now_page_num)
        for i in range(len(items["url"]) + 1):
            if i < len(items["url"]):
                # Vertical: request the lyrics page of each song
                print("Entering the lyrics page of song", i + 1)
                next_page = "http://www.lrcgc.com/" + items["url"][i]
                yield scrapy.Request(next_page, callback=self.son_parse)
                print("Done; issuing the next request...")
            else:
                # Horizontal: after the last song, request the next index page
                offset = int(now_page_num) + 1
                print("All songs on this page are done; moving to index page", offset)
                next_page = "http://www.lrcgc.com/songlist-263-" + str(offset) + ".html"
                yield scrapy.Request(next_page, callback=self.parse)
                print('-' * 100)
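
In parse() each field holds a whole list for the page, and song URLs are built by string concatenation. As a hedged alternative, the sketch below pairs the parallel lists into one record per song and resolves each href with the standard library's `urllib.parse.urljoin`. The helper `build_song_records` is a hypothetical name, the sample values are placeholders rather than data from the site, and the sketch assumes the three lists have the same length and order.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.lrcgc.com/"


def build_song_records(names, albums, hrefs, base_url=BASE_URL):
    """Pair the parallel lists from one index page into one dict per song."""
    records = []
    for name, album, href in zip(names, albums, hrefs):
        records.append({
            "name": name.strip(),
            "album": album.strip(),
            # urljoin copes with hrefs that do or do not start with a slash
            "url": urljoin(base_url, href),
        })
    return records


# Placeholder values, not real data from the site
records = build_song_records(
    ["Song A ", "Song B"],
    ["Album X", " Album Y"],
    ["lyric-263-1.html", "/lyric-263-2.html"],
)
for record in records:
    print(record)
```

Plain concatenation would produce a double slash for the second href, while `urljoin` yields a clean absolute URL in both cases.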

Common mistakes:

Task one: yield items, not yield items["name"], items["album"], items["url"]
response.xpath("").extract() returns a list, and int() cannot be applied to a list directly. Add [0], i.e. response.xpath("").extract()[0]
Task two: yield URLs. An if check selects between the two kinds of URL to yield.
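
The list-versus-element mistake can be reproduced in plain Python. The sketch below uses a hypothetical helper, `next_list_url`, which takes a list shaped like the result of .extract() and builds the next index-page URL with the same pattern as parse(). Returning None for an empty list is an assumption about how a missing page number should be handled, not something the original spider does.

```python
def next_list_url(page_texts):
    """Build the next index-page URL from the extracted page-number texts.

    page_texts mimics what .extract() returns: a list of strings, possibly empty.
    """
    if not page_texts:
        # Assumption: no page number found means there is nothing more to follow
        return None
    current = int(page_texts[0].strip())  # take the element first, then convert
    return "http://www.lrcgc.com/songlist-263-" + str(current + 1) + ".html"


try:
    int(["3"])  # converting the list itself fails
except TypeError as exc:
    print("TypeError:", exc)

print(next_list_url(["3"]))
print(next_list_url([]))
```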
2.2 Defining son_parse()
Key point: there is only one task, yield items; there is no next_url to yield

    def son_parse(self, response):
        items = BeyongLrcItem()
        items["lrc"] = response.xpath("//p[@id='J_lyric']/text()").extract()
        yield items
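
son_parse() stores the lyrics as a list of text nodes. If a single clean string is more convenient for later analysis, something like the following sketch could be applied to that list. `clean_lyric_lines` is a hypothetical helper, the time-tag pattern assumes the common LRC `[mm:ss.xx]` form and may not match what the page actually contains, and the sample input is placeholder text.

```python
import re

# Assumption: lines may start with LRC-style time tags such as [00:12.34]
TIME_TAG = re.compile(r"\[\d{1,2}:\d{2}(?:\.\d{1,3})?\]")


def clean_lyric_lines(text_nodes):
    """Join extracted text nodes into one string, dropping time tags and blank lines."""
    lines = []
    for node in text_nodes:
        line = TIME_TAG.sub("", node).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


# Placeholder text, not real lyrics
sample = ["[00:01.00]first line\r\n", "\r\n", "[00:05.50]second line", "  third line  "]
print(clean_lyric_lines(sample))
```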

Done!
