1-5 Saving Douban short comments with pandas
2018-06-13
pnjoe
Common ways to save data
- open
- pandas (recommended)
- csv
- numpy
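The list names the csv module, but this lesson only demonstrates open and pandas. As a rough sketch of the csv option, the snippet below writes a couple of made-up comments to a file. The sample strings and the file name pinglun_demo.csv are placeholders, not data from the lesson.

```python
import csv

# Hypothetical sample data standing in for the scraped comments.
comments = ['A lovely book.', 'Worth reading again, "as an adult".']

# newline='' lets the csv module control line endings itself;
# utf-8-sig adds a BOM so Excel detects the encoding correctly.
with open('pinglun_demo.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['comment'])               # header row
    writer.writerows([c] for c in comments)    # one comment per row
```

The writer takes care of quoting, so a comment that itself contains commas or quotation marks still ends up in a single cell.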
The open function
Recall how we parsed the data with XPath in the previous lesson. The code was:
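    import requests
    from lxml import etree

    url = 'https://book.douban.com/subject/1084336/comments/'
    r = requests.get(url).text      # download the page HTML
    s = etree.HTML(r)               # parse it into an element tree
    # text of every <p> inside <div class="comment">
    file = s.xpath('//div[@class="comment"]/p/text()')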
Saving the data with open. Demo code:

    with open('pinglun.txt', 'w', encoding='utf-8') as f:
        for i in file:
            print(i)        # echo each comment to the console
            f.write(i)      # and write it to the file
Figure: the txt file saved with the open function
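One thing to keep in mind: f.write does not add a line break on its own, so consecutive comments can run together in the txt file. A minimal sketch of one way to handle that follows. The helper names save_lines and load_lines, the sample strings, and the file name pinglun_demo.txt are all made up for illustration.

```python
def save_lines(path, lines):
    """Write each non-empty item on its own line, encoded as UTF-8."""
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            text = line.strip()         # drop surrounding whitespace
            if text:                    # skip blank entries
                f.write(text + '\n')    # add the line break explicitly


def load_lines(path):
    """Read the file back as a list of strings without line breaks."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


# Hypothetical sample data standing in for the scraped comments.
sample = ['  first comment ', '', 'second comment\n']
save_lines('pinglun_demo.txt', sample)
print(load_lines('pinglun_demo.txt'))
```

Reading the file back right after writing it is a cheap way to check that every comment landed on its own line.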
The pandas approach
Saving the data with pandas. Demo code:
    import requests
    from lxml import etree
    import pandas as pd

    url = 'https://book.douban.com/subject/1084336/comments/'
    r = requests.get(url).text
    s = etree.HTML(r)
    file = s.xpath('//div[@class="comment"]/p/text()')

    df = pd.DataFrame(file)         # one row per comment
    df.to_excel('pinglun.xlsx')     # write the table to an Excel file

Figure: the xlsx file saved with pandas
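By default the saved table has a column named 0 and an extra column of row numbers. If that is not wanted, one option is to name the column and pass index=False when saving. This is only a sketch: the sample strings, the column name comment, and the file name pinglun_named.csv are assumptions, and it writes a CSV so that it runs without an Excel engine installed.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped comments.
comments = ['  first comment ', 'second comment']

# Build the table from a dict so the column gets a readable name.
df = pd.DataFrame({'comment': [c.strip() for c in comments]})

# index=False leaves the 0, 1, 2, ... row labels out of the saved file.
df.to_csv('pinglun_named.csv', index=False, encoding='utf-8-sig')
```

to_excel accepts the same index argument, so df.to_excel('pinglun.xlsx', index=False) should behave the same way, provided an Excel writer such as openpyxl is installed.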
Homework:
The Little Prince has 5 pages of short comments. Extend the code to scrape all 5 pages and save the result as a CSV file.
When we open page 2, the address in the browser changes: /hot?p=2 is appended to it. Open page 3 and it becomes /hot?p=3. So we can guess that changing the number at the end is what turns the page. Let's try it:

    import requests
    from lxml import etree
    import pandas as pd

    page = 1
    file = []
    while page < 6:     # pages 1 to 5
        url = 'https://book.douban.com/subject/1084336/comments/hot?p=' + str(page)
        r = requests.get(url).text
        s = etree.HTML(r)
        file += s.xpath('//div[@class="comment"]/p/text()')   # collect this page's comments
        page += 1

    df = pd.DataFrame(file)
    df.to_csv('zuoye.csv', encoding='utf-8-sig')    # utf-8-sig so Excel shows Chinese text correctly
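The homework loop sends five requests back to back with the default requests headers. Some sites reject or throttle that kind of traffic, so a slightly more defensive version might look like the sketch below. The helper names page_urls, fetch_comments and collect, the User-Agent string, the 10-second timeout and the one-second delay are all assumptions for illustration, not part of the original lesson.

```python
import time

BASE_URL = 'https://book.douban.com/subject/1084336/comments/hot?p='


def page_urls(first=1, last=5):
    """Build the URL of every page from first to last, inclusive."""
    return [BASE_URL + str(page) for page in range(first, last + 1)]


def fetch_comments(url):
    """Download one page and return its comment texts."""
    # Imported here so the other helpers can be tried without these packages.
    import requests
    from lxml import etree

    headers = {'User-Agent': 'Mozilla/5.0'}     # assumed browser-like header
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()                        # stop early on an HTTP error
    return etree.HTML(r.text).xpath('//div[@class="comment"]/p/text()')


def collect(urls, fetch, delay=1.0):
    """Call fetch on each URL in turn, pausing between requests."""
    comments = []
    for url in urls:
        comments.extend(fetch(url))
        time.sleep(delay)                       # be polite to the server
    return comments
```

With these in place, collect(page_urls(), fetch_comments) should return one list covering all five pages, which can then go into pd.DataFrame and to_csv as before. Passing the fetch function as an argument also makes it easy to try the loop with a fake function before touching the network.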
Related reading
Usage of Python's with ... as