Learning Python Together

A simple scrape of Douban books, saved to a CSV file

2017-04-30  MingSha

Key points covered:

1. Saving data to a CSV file
2. The difference between requests' text and content
3. Using XPath

The target is Douban's most-followed book rankings, which come in two lists: fiction and non-fiction.
The code was copied from elsewhere and lightly modified.

import requests
from lxml import etree
import csv

# Open the output CSV; newline='' prevents blank rows between records on Windows
# (the file uses the platform's default encoding)
fp = open('d:\\豆瓣.csv', 'wt', newline='')
writer = csv.writer(fp)
writer.writerow(('name', 'days', 'author', 'date', 'publisher', 'price', 'booktype', 'point', 'comment_num'))

url = 'https://book.douban.com/'
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Origin': 'https://book.douban.com',
    'Referer': 'https://book.douban.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}

# Fetch the Douban Books homepage; .content returns raw bytes for lxml to parse
html = requests.get(url, headers=headers).content
sel = etree.HTML(html)

# The "popular books" section links to the two chart pages: fiction and non-fiction
fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span/a/@href')[0]
non_fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span/a/@href')[1]
print(fiction, non_fiction)

colls = []
colls.append(url + fiction)
colls.append(url + non_fiction)

for coll in colls:
    html1 = requests.get(coll).content
    sel = etree.HTML(html1)
    # Each <li> in the chart list contains one book's details
    infos = sel.xpath('//ul[@class="chart-dashed-list"]/li/div[@class="media__body"]')
    print(len(infos))
    for info in infos:
        name = info.xpath('h2/a/text()')[0].strip()
        days = info.xpath('h2/span/text()')[0].strip()
        # The abstract line looks like "author / date / publisher / price / type"
        bookinfo = info.xpath('p[@class="subject-abstract color-gray"]/text()')[0].strip().split("/")
        author = bookinfo[0]
        date = bookinfo[1]
        publisher = bookinfo[2]
        price = bookinfo[3]
        booktype = bookinfo[4]
        point = info.xpath('p[@class="clearfix w250"]/span[2]/text()')[0].strip()
        comment_num = info.xpath('p[@class="clearfix w250"]/span[3]/text()')[0].strip()
        print(name, days, author, date, publisher, price, booktype, point, comment_num)
        writer.writerow((name, days, author, date, publisher, price, booktype, point, comment_num))

fp.close()
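
The XPath expressions do most of the work above. As a quick illustration of how lxml's etree.HTML and xpath behave, here is a minimal, self-contained sketch on a made-up HTML snippet (the markup and variable names are invented for demonstration and are not Douban's actual page structure):

from lxml import etree

snippet = '''
<ul class="chart-dashed-list">
  <li><div class="media__body"><h2><a href="/subject/1/">Book One</a></h2></div></li>
  <li><div class="media__body"><h2><a href="/subject/2/">Book Two</a></h2></div></li>
</ul>
'''

tree = etree.HTML(snippet)                             # parse the fragment into an element tree
nodes = tree.xpath('//li/div[@class="media__body"]')   # absolute query from the document root
for node in nodes:
    title = node.xpath('h2/a/text()')[0]               # relative query; text() grabs the link text
    href = node.xpath('h2/a/@href')[0]                 # @href grabs the attribute value
    print(title, href)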

The main takeaway is how to write and save the CSV file.
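
For the CSV part on its own, a minimal sketch looks like this (the file name and rows are made up, and utf-8-sig is just one encoding choice that makes Chinese text display correctly in Excel):

import csv

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('books_demo.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(('name', 'author', 'point'))           # header row
    writer.writerow(('Some Book', 'Some Author', '8.8'))   # one data row

Using a with statement also closes the file automatically, unlike the explicit fp.close() above.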


Also note:

requests' text returns the response decoded as Unicode (a str).
requests' content returns the raw bytes, i.e. binary data.
In other words, if you want text, use r.text.
If you want an image or another binary file, use r.content.
requests' json() returns the response parsed as JSON.
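
A quick way to see the difference (httpbin.org is used here only as a convenient test endpoint that returns JSON):

import requests

r = requests.get('https://httpbin.org/json')
print(type(r.text))     # <class 'str'>   - decoded Unicode text
print(type(r.content))  # <class 'bytes'> - raw binary body
print(type(r.json()))   # <class 'dict'>  - parsed JSON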

The image-saving code below must use content:

import requests

jpg_url = 'https://img.haomeiwen.com/i2744623/55f59803c7aa7301.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240'
# .content gives the raw bytes, which is exactly what a binary file needs
content = requests.get(jpg_url).content
with open('demo.jpg', 'wb') as fp:
    fp.write(content)
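
For large files, an alternative worth knowing is requests' streaming mode, which avoids holding the whole body in memory at once (the URL below is only a placeholder):

import requests

big_url = 'https://example.com/large-file.zip'  # placeholder URL
r = requests.get(big_url, stream=True)
with open('large-file.zip', 'wb') as fp:
    for chunk in r.iter_content(chunk_size=8192):  # write the body piece by piece
        fp.write(chunk)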