5. Web Scraping in Practice: Kugou TOP500 Data
2018-02-14
橄榄的世界
Target URL: http://www.kugou.com/yy/rank/home/1-8888.html
Kugou's ranking page has no pagination controls, but if you change the 1 in the URL above to 2, the page still loads normally. Each page lists 22 songs and the chart has 500 in total, so 23 pages cover everything (22 × 23 = 506).
The 23 page URLs follow this pattern:
http://www.kugou.com/yy/rank/home/1-8888.html
http://www.kugou.com/yy/rank/home/2-8888.html
...
http://www.kugou.com/yy/rank/home/23-8888.html
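The pattern above can be turned into a list of request URLs with a format string, the same approach the full script uses:

```python
# Build the 23 ranking-page URLs from the observed pattern.
base = "http://www.kugou.com/yy/rank/home/{}-8888.html"
urls = [base.format(i) for i in range(1, 24)]

print(len(urls))   # 23
print(urls[0])     # http://www.kugou.com/yy/rank/home/1-8888.html
print(urls[-1])    # http://www.kugou.com/yy/rank/home/23-8888.html
```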
Fields scraped: rank, singer, song title, and track duration.
The logic is simple, so here is the full script:
import requests
from bs4 import BeautifulSoup
import time

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'
}

# Fetch one ranking page and print the info for each song on it.
def get_info(url):
    r = requests.get(url, headers=headers)
    print(r.status_code)
    soup = BeautifulSoup(r.text, "lxml")
    ranks = soup.select("span.pc_temp_num")
    titles = soup.select("a.pc_temp_songname")
    times = soup.select("span.pc_temp_time")
    # Each title reads "singer - song"; split only on the first "-"
    # so song names that themselves contain a hyphen stay intact.
    # The loop variable is named song_time so it does not shadow the
    # imported time module.
    for rank, title, song_time in zip(ranks, titles, times):
        singer, song = title.text.split("-", 1)
        data = {
            'rank': rank.text.strip(),
            'singer': singer.strip(),
            'song': song.strip(),
            'time': song_time.text.strip()
        }
        print(data)

if __name__ == "__main__":
    # Build the request URLs for all 23 pages.
    urls = ["http://www.kugou.com/yy/rank/home/{}-8888.html".format(i)
            for i in range(1, 24)]
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause between requests to be polite to the server
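Printing each dict is fine for a quick check; if you want to keep the results, you could write them out with the standard csv module. A minimal sketch (the save_csv helper and the file name are my own additions, not part of the original script):

```python
import csv

def save_csv(rows, path="kugou_top500.csv"):
    # Write a list of song dicts to a CSV file with a header row.
    # utf-8-sig makes the Chinese text display correctly in Excel.
    fieldnames = ["rank", "singer", "song", "time"]
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Example with two rows in the same shape as the data dicts above.
rows = [
    {"rank": "1", "singer": "于文文", "song": "体面", "time": "4:42"},
    {"rank": "2", "singer": "袁娅维", "song": "说散就散", "time": "4:02"},
]
save_csv(rows, "demo.csv")
```

To use it with the script above, collect each `data` dict into a list instead of printing it, then call save_csv once at the end.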
Partial output:
{'rank': '1', 'singer': '于文文', 'song': '体面', 'time': '4:42'}
{'rank': '2', 'singer': '袁娅维', 'song': '说散就散', 'time': '4:02'}
{'rank': '3', 'singer': '新乐尘符', 'song': '123我爱你', 'time': '3:19'}
...
{'rank': '499', 'singer': '薛之谦', 'song': '我好像在哪见过你', 'time': '4:39'}
{'rank': '500', 'singer': '蒋蒋', 'song': '残雪', 'time': '3:59'}