Python + Scrapy: Crawling Luoo (《落网》), a Classy Indie Music Site
2018-01-22
s_nash
Recently I have been learning Scrapy, Python's crawling framework. In this post I use Scrapy to reimplement a Luoo crawler that I wrote earlier.
For a detailed analysis of the target site, see my earlier post python爬虫-爬取高逼格音乐网站《落网》.
First, open a command prompt, change to a suitable directory, and create a new Scrapy project:
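A minimal sketch of that step; I assume the project is named luowang here, to match the LuowangItem and LuowangPipeline classes used below, but any name works:

scrapy startproject luowang
cd luowang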
As shown above, a new Scrapy project has been created. Next, create a new spider file under the spiders directory; the project structure then looks like this:
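Roughly the following layout (again assuming the project name luowang; luo_spider.py is the file we add by hand):

luowang/
    scrapy.cfg
    luowang/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            luo_spider.py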
Now let's walk through the implementation.
Implementation
items.py (scrapy)
The Item module defines what the crawl should extract: it describes the structured records to be pulled out of unstructured page data.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LuowangItem(scrapy.Item):
    index = scrapy.Field()            # issue (vol.) number
    songName = scrapy.Field()         # song title
    songDownloadURL = scrapy.Field()  # song download URL
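A quick illustrative sketch (not part of the project files): a scrapy.Item behaves like a dict whose keys are restricted to the declared fields, so a typo in a field name fails loudly instead of silently creating a new key:

from luowang.items import LuowangItem  # assumes the project is named luowang

item = LuowangItem()
item['index'] = '905'        # OK: declared field
print(item['index'])         # -> '905'
item['artist'] = 'somebody'  # raises KeyError: field 'artist' was never declared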
luo_spider.py (scrapy)
The spiders module is the body of the crawler; this is where the main crawling logic lives.
# -*- coding: utf-8 -*-
import re

from scrapy.spiders import Spider
from scrapy.http import Request

from ..items import LuowangItem


class LuoSpider(Spider):
    name = 'luo'                                     # spider name
    allowed_domains = ['luoo.net']
    start_urls = ['http://www.luoo.net/music/folk']  # seed URL for the crawl

    def parse(self, response):
        vol_list = response.xpath('//div[@class="vol-list"]/div')
        pattern = re.compile('[0-9]+')
        for vol in vol_list:
            item = LuowangItem()
            index_url = vol.xpath('a/@href').extract()[0]
            # Pull the issue number out of the link text, e.g. "vol.905" -> "905"
            index = re.search(pattern, vol.xpath('div/a/text()').extract()[0]).group().lstrip('0')
            item['index'] = index
            # Follow each issue page and scrape the songs it lists
            yield Request(index_url, meta={'item': item}, callback=self.get_songInfos)

        # Follow the "next page" link, if there is one, and parse it the same way
        next_url = response.xpath('//div[@class="paginator"]/a[@class="next"]/@href').extract()
        if next_url:
            yield Request(next_url[0], callback=self.parse)

    def get_songInfos(self, response):
        radio = response.meta['item']['index']  # issue number passed along from parse()
        songinfos = response.xpath('//*[@id="luooPlayerPlaylist"]/ul/li')
        for songinfo in songinfos:
            # Playlist entries look like "01. Song Title"; split on the first
            # dot only, so titles that contain dots stay intact
            text = songinfo.xpath('div/a/text()').extract()[0]
            number, songName = [s.strip() for s in text.split('.', 1)]
            # Build a fresh item per song; reusing one item object across
            # yields is a classic Scrapy footgun
            item = LuowangItem()
            item['index'] = radio
            item['songName'] = songName
            item['songDownloadURL'] = ('http://mp3-cdn2.luoo.net/low/luoo/radio%s/%s.mp3'
                                       % (radio, number))
            yield item
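The XPath expressions are the fragile part of any spider, so it can help to verify them interactively with scrapy shell before running the full crawl, e.g.:

scrapy shell 'http://www.luoo.net/music/folk'
>>> response.xpath('//div[@class="vol-list"]/div/a/@href').extract()[:3]
>>> response.xpath('//div[@class="paginator"]/a[@class="next"]/@href').extract()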
pipelines.py (scrapy)
The pipeline is where each scraped item gets processed; here it downloads the song file.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import urllib2  # Python 2; on Python 3 use urllib.request instead


class LuowangPipeline(object):

    def process_item(self, item, spider):
        songName = item['songName']
        songDownloadURL = item['songDownloadURL']
        try:
            data = urllib2.urlopen(songDownloadURL).read()
        except urllib2.URLError:
            # Dead link: skip this song and move on to the next one
            print("###### link not found, moving on to the next song ########")
            return item
        with open('D:\\test\\song\\%s.mp3' % songName, 'wb') as f:
            f.write(data)
        return item
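As the boilerplate comment says, the pipeline only runs once it is registered in settings.py. A minimal sketch, assuming the project module is named luowang:

# settings.py
ITEM_PIPELINES = {
    'luowang.pipelines.LuowangPipeline': 300,  # priority 0-1000, lower runs first
}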
That completes the Luoo music crawler. Open a command prompt, change into the project directory, and run
scrapy crawl luo
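If you also want a record of the scraped metadata, Scrapy's built-in feed export can dump every yielded item to a file, e.g.:

scrapy crawl luo -o songs.json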
The spider then works through the issues one by one, and the songs land in D:\test\song as .mp3 files.
If this post helped you a little, please give it a like. Thanks!