
Week 2 Assignment (2): Jianshu Reading & Douban Books

2017-04-28 · read by 83 · by 谁占了我的一年的称号

This assignment is to scrape Douban Books and the Jianshu reading topic, then do a simple analysis and comparison of the two.

https://book.douban.com/ — Douban Books
http://www.jianshu.com/c/yD9GAd — Jianshu reading topic

From Douban I mainly scraped the "most followed" book charts, which come in two categories: fiction and non-fiction.



From the Jianshu reading topic I scraped the reading notes on the "hot" page, then extracted the recommended book names from the post titles.
I won't say much about the scraping itself; both pages are straightforward.

Douban Books: most-followed charts

import requests
from lxml import etree
import csv

fp = open('d:\\豆瓣.csv', 'wt', newline='')
writer = csv.writer(fp)
writer.writerow(('name', 'day', 'author', 'point', 'comment'))
url = 'https://book.douban.com/'
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Referer': 'https://book.douban.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}
html = requests.get(url, headers=headers).content
sel = etree.HTML(html)
# Links to the fiction and non-fiction "most followed" charts on the homepage
fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span[2]/a/@href')[0]
non_fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span[3]/a/@href')[0]
print(fiction, non_fiction)
colls = [url + fiction, url + non_fiction]
for coll in colls:
    html1 = requests.get(coll, headers=headers).content
    sel = etree.HTML(html1)
    infos = sel.xpath('//ul[@class="chart-dashed-list"]/li/div[2]')
    print(len(infos))
    for info in infos:
        name = info.xpath('h2/a/text()')[0].strip()
        days = info.xpath('h2/span/text()')[0].strip()
        author = info.xpath('p[1]/text()')[0].strip()
        point = info.xpath('p[2]/span[2]/text()')[0].strip()
        comment_num = info.xpath('p[2]/span[3]/text()')[0].strip()
        print(name, days, author, point, comment_num)
        writer.writerow((name, days, author, point, comment_num))
fp.close()
print('done')
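The `point` and `comment` columns come back as raw strings; on the chart page the comment count is rendered as something like "(1053人评价)" (that exact format is my assumption, not shown in the post). For any follow-up analysis it helps to pull the number out; a minimal sketch:

```python
import re

def parse_count(text):
    """Extract the first run of digits from a scraped field, e.g. '(1053人评价)' -> 1053."""
    m = re.search(r'\d+', text)
    return int(m.group()) if m else 0

print(parse_count('(1053人评价)'))  # → 1053
```

The same helper works for the reads/comments/likes columns in the Jianshu CSV, since they follow a similar "number plus label" pattern.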

For the Jianshu reading topic, the main problem was getting banned once I scraped too many pages, so I added a randomly chosen User-Agent per request, which lets the crawler get noticeably further. I also added a delay, but still got banned occasionally, so rotating IPs is next on the list. In the end I pulled down 16,000 rows.
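The "rotating IPs" step planned above could look like the sketch below. The proxy addresses are placeholders (a real pool would come from a proxy provider or a scraped proxy list), so treat this only as the shape of the idea:

```python
import random
import requests

# Placeholder proxy pool -- real entries would come from a proxy provider
PROXY_POOL = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']

def pick_proxy():
    """Choose one proxy at random and format it the way requests expects."""
    proxy = random.choice(PROXY_POOL)
    return {'http': proxy, 'https': proxy}

def fetch(url, headers=None):
    # Each request goes out through a different, randomly chosen proxy
    return requests.get(url, headers=headers, proxies=pick_proxy(), timeout=10)
```

Combined with the random User-Agent and the `time.sleep` delay already in the code, this spreads requests across identities and makes a ban on any single IP less costly.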

import requests
from lxml import etree
import csv
import time
import random

fp = open('d:\\简书读书2.csv', 'wt', newline='',encoding='GB18030')
write = csv.writer(fp)
write.writerow(('作者', '发表时间', '标题', '阅读量', '评论量', '点赞量', '打赏量'))
# User-Agent pool; one is picked at random per request to make blocking less likely
header_list = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
]
for i in range(1, 10000):
    url = 'http://www.jianshu.com/c/yD9GAd?order_by=top&page=%s' % i
    a = random.choice(header_list)
    print(a)
    header={
    'Accept':'text/html, */*; q=0.01',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Cookie':'UM_distinctid=15b37d3ac97314-07c83256049285-1571466f-1fa400-15b37d3ac989d; remember_user_token=W1s0MzI0MzI2XSwiJDJhJDEwJGxXbTNyTXg3UHA5UTFHdGR3NWlWdi4iLCIxNDkyNzY2MDAyLjgzMjI0NzciXQ%3D%3D--377e6cf673717abdbd3e45bdea36f27479473fbe; CNZZDATA1258679142=1667859620-1491287654-https%253A%252F%252Fwww.baidu.com%252F%7C1493188772; _ga=GA1.2.1819733996.1491290271; _session_id=SGNyaVNIV0F6VUVRcXJpN1A5ZHhKaGJidmRiSG1jL3oxYm5qaW93N0tUbXJlMkorWW1CSFhTY2VCWEtjSWZPVjE0RWJwNlBTYVpWS1NoVmVIZ2tOWjJxcHY2LzBJdGRhejFYM0xDZHYwSFBoc09OMkNUOHpHVUdTQTJDQm96VjdXRC9kQVRGRnhTejNVZkF3eTFxWDU5b1J4N0F6ZTllSm5Pd2VnRHVHcUg5RU9NK0dsbnQ1NW5hSTJEN0NYYWtKbzhPSnJDNWt5QlJLOWQxM211d3IrUnJzemJoZzQ0dDk5VXJzZXVHWktVdExvamFtRTBYZ0s0V1B4UE5NTXM3bG1BQ1VlTWcva2dVWFp2S0JNQ0lkQnF2RUxIS3NIaEVIYi9KbTIvVjI3NUExZ3QySEcrQ0lLWDdNV1dFMzBXN2pDRWtwcWFHN016Zk5EVDdkMnQzTm55RSsyWmorRmdkNTh4bkg5aVk5a3BVeFV1ZkJXS1pkY1hHQzFIc3JQb2VEbk9sTXlhcG5WOGFheC9SWDdnRkNDbnM3UWkzcVd5bENxVmp2eDA2VmtObz0tLW03RDBpWHlkVmIzVTJmKyt3c3YrZkE9PQ%3D%3D--5c873af39872ea4dda72db314ed3920cf0328201',
    'Host':'www.jianshu.com',
    'Referer':'http://www.jianshu.com/c/yD9GAd',
    'User-Agent':'%s'%a
    }
    time.sleep(3)
    html = requests.get(url,headers=header).content
    sel = etree.HTML(html)
    infos = sel.xpath('//ul[@class="note-list"]/li/div[@class="content"]')
    for info in infos:
        try:
            author = info.xpath('div[@class="author"]/div/a/text()')
            if len(author) == 0:
                # An empty content block means we have run out of articles
                print('reached the end')
                break
            author = author[0]
            get_time = info.xpath('div[@class="author"]/div/span/@data-shared-at')[0].replace('T', ' ').replace('+08:00', '')
            title = info.xpath('a[@class="title"]/text()')[0]
            read_num = info.xpath('div[@class="meta"]/a[1]/text()')[1][:-1]
            comment_num = info.xpath('div[@class="meta"]/a[2]/text()')[1][:-1]
            point_num = info.xpath('div[@class="meta"]/span[1]/text()')[0]
            reward_num = info.xpath('div[@class="meta"]/span[2]/text()')
            # Articles with no rewards have no second span; default to '0'
            reward_num = reward_num[0] if reward_num else '0'
        except IndexError:
            print('failed to parse an item, skipping')
            continue
        print(author, get_time, title, read_num, comment_num, point_num, reward_num)
        write.writerow((author, get_time, title, read_num, comment_num, point_num, reward_num))
fp.close()
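After the two `replace` calls above, `get_time` is a plain `'YYYY-MM-DD HH:MM:SS'` string. To sort posts or bucket them by month later, it can be parsed into a `datetime`; a small sketch (the sample value is made up):

```python
from datetime import datetime

def parse_shared_at(s):
    """Parse the cleaned data-shared-at value, e.g. '2017-04-28 12:30:05'."""
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

d = parse_shared_at('2017-04-28 12:30:05')
print(d.year, d.month)  # → 2017 4
```

Storing the parsed value (or an ISO string) in the CSV keeps the later pandas/plotting steps from having to re-clean the timestamps.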

The results look like this:

[screenshot: scraped Jianshu CSV data]

I used a regex to pull the book names out of the post titles, then drew a simple word cloud with wordcloud.

[image: word cloud of book titles]
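The regex step can be as simple as matching the Chinese book-title marks 《…》 in each post title and counting with `Counter`. The original post doesn't show its exact regex, so this is a sketch with made-up sample titles:

```python
import re
from collections import Counter

def count_books(titles):
    """Collect every book name written inside 《...》 across all post titles."""
    books = []
    for t in titles:
        books.extend(re.findall(r'《(.+?)》', t))
    return Counter(books)

# hypothetical sample titles
sample = ['读《活着》有感', '《解忧杂货店》书评', '再读《活着》']
print(count_books(sample).most_common(1))  # → [('活着', 2)]
```

The non-greedy `.+?` matters: a greedy `.+` would swallow everything between the first 《 and the last 》 when a title mentions two books.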

Top 10 books recommended in Jianshu Reading:

'菜根谭': 74,
'如何阅读一本书': 66,
'红楼梦': 64,
'解忧杂货店': 58,
'活着': 56,
'平凡的世界': 43,
'追风筝的人': 39,
'白夜行': 36,
'围城': 33,
'成为作家': 33,

Of the top 10, I've only read five. 《解忧杂货店》 has become my bedside read, I finished 《平凡的世界》 on a train, and 《活着》 left me deeply depressed by the end.

Douban Books:

[screenshot: Douban "most anticipated" top 10]

I didn't expect the "most anticipated" top 10 to include a geography book. Embarrassingly, I haven't read a single one of them.

Top 10 by read count:
![Top 10 by read count](https://img.haomeiwen.com/i4324326/51d0ab5d0e1bc4f6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Top 10 by posts written:

[screenshot: top 10 by posts written]

Top 5 by rewards:

[screenshot: top 5 by rewards]