Python 3.6: Scraping Jianshu's Popular Articles

2018-01-10 james_chang

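To scrape the popular articles, we first need to find where they live: 20 on the home page, 20 under 7-day trending, and another 20 under 30-day trending. So let's grab all 60 of them.
Sixty articles means sixty URLs. I'm not going to hunt them down one by one, or scrape the pages one by one, so the plan is to first scrape the 60 URLs into a list or a dict and then use a for loop to fetch each of the 60 pages in turn.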
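The two-step plan above can be sketched roughly as follows. This is only an illustration: `crawl`, `list_articles`, and `fetch_page` are hypothetical names, not part of this post's code, and the two callables are passed in so the skeleton stays independent of how pages are actually downloaded.

```python
import time


def crawl(listing_urls, list_articles, fetch_page, delay=0.0):
    """Collect article URLs from every listing page, then fetch each article.

    list_articles(listing_url) -> dict mapping title to article URL (hypothetical helper)
    fetch_page(article_url)    -> page source as a string (hypothetical helper)
    """
    # Step 1: gather all article URLs into one dict (repeated titles collapse)
    article_urls = {}
    for listing_url in listing_urls:
        article_urls.update(list_articles(listing_url))

    # Step 2: loop over the collected URLs and fetch each page in turn
    pages = {}
    for title, url in article_urls.items():
        pages[title] = fetch_page(url)
        time.sleep(delay)  # optional pause between requests
    return pages
```

Keeping the URLs in a dict keyed by title also means an article that shows up on more than one listing page is only fetched once.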
Let's get the URLs first:

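First, a look at the page source.

Wow, so tidy!

The URLs must be in there, then. Let's take a look:

Not found (0-0). Why is that? Let's look more carefully, search again, and try a few more things.

So that's where the URL was hiding: the Jianshu home page address plus this string is the article's URL.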
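Joining the site address and the relative path can also be done with the standard library's `urllib.parse.urljoin` instead of plain string concatenation. A minimal sketch; the path `/p/0123456789ab` is a made-up placeholder, not a real article, and `article_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.jianshu.com'


def article_url(href):
    """Turn a site-relative href such as '/p/...' into an absolute URL."""
    return urljoin(BASE_URL, href)


print(article_url('/p/0123456789ab'))
```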

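And the contents of every a tag are so regular that my laziness kicked in again: I can't even be bothered to write a regex. Why not just grab the a tags with BeautifulSoup and slice the resulting strings?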
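To see what slicing a tag's string amounts to, here is a small sketch on a made-up tag. The markup below is an assumption about how the a tags might look, and the title and article id are placeholders; the point is only that fixed offsets work when every tag shares exactly the same layout.

```python
# Hypothetical tag; the real attributes on the page may differ
sample = '<a class="title" href="/p/0123456789ab" target="_blank">Example title</a>'

prefix = '<a class="title" href="'
href_start = len(prefix)                   # where the relative path begins
href_end = sample.index('"', href_start)   # closing quote of the href value
title_start = sample.index('>') + 1        # first character after the opening tag
title_end = -len('</a>')                   # drop the closing tag

href = sample[href_start:href_end]
title = sample[title_start:title_end]
print(href, title)
```

With this assumed layout the four positions come out as 23, 38, 56 and -4. If an attribute were added, reordered, or the id length changed, those numbers would no longer line up.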
Let's give it a try:

import requests
from bs4 import BeautifulSoup


class Grapjianshu(object):

    def __init__(self, url):
        self.url = url
        self.data_dict = {}

    def url_list(self):
        # Fetch the page source
        response = requests.get(self.url).text
        # Create a BeautifulSoup object
        soup = BeautifulSoup(response, 'html.parser')
        # Collect the links to the popular articles
        data = soup.find_all(name='a', class_='title')
        self.data_dict = {}
        for i in data:
            s = str(i)
            # Fixed offsets: they rely on every tag having exactly the same layout
            name = s[56:-4]
            url = 'https://www.jianshu.com'+s[23:38]
            self.data_dict[name] = url
        for a in self.data_dict:
            print(a, self.data_dict[a])

Here's the result:

So we have both the article titles and the links, without using any regex, all stored in the data_dict dictionary.
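Since slicing relies on fixed character positions, an alternative is to read the link and the title from the tag object itself. A minimal sketch, assuming the same `a` tags with class `title`; the function name `extract_links` and the sample HTML are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url='https://www.jianshu.com'):
    """Return a dict mapping each article title to its absolute URL."""
    soup = BeautifulSoup(html, 'html.parser')
    links = {}
    for tag in soup.find_all('a', class_='title'):
        href = tag.get('href')
        if not href:
            continue  # skip anchors without a link
        links[tag.get_text(strip=True)] = urljoin(base_url, href)
    return links


# Made-up markup standing in for a listing page
sample_html = '''
<a class="title" target="_blank" href="/p/0123456789ab">First example</a>
<a class="title" href="/p/ba9876543210" target="_blank"> Second example </a>
'''
print(extract_links(sample_html))
```

This version does not care about attribute order or stray whitespace around the title, which is where fixed offsets would most likely go wrong.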
Next, let's scrape the content of those article pages and write it to files (here is the full code):

import requests
from bs4 import BeautifulSoup


class Grapjianshu(object):

    def __init__(self, url):
        self.url = url
        self.data_dict = {}

    def url_list(self):
        # Fetch the page source
        response = requests.get(self.url).text
        # Create a BeautifulSoup object
        soup = BeautifulSoup(response, 'html.parser')
        # Collect the links to the popular articles
        data = soup.find_all(name='a', class_='title')
        self.data_dict = {}
        for i in data:
            s = str(i)
            # Fixed offsets: they rely on every tag having exactly the same layout
            name = s[56:-4]
            url = 'https://www.jianshu.com'+s[23:38]
            self.data_dict[name] = url
        for a in self.data_dict:
            print(a, self.data_dict[a])

    def url_content(self):
        for i in self.data_dict:
            response = requests.get(self.data_dict[i]).text
            # Save the page source as text (pickling it would not give a readable .html file)
            with open(i+'.html', 'w', encoding='utf-8') as f:
                f.write(response)


a = Grapjianshu('https://www.jianshu.com')
a.url_list()
a.url_content()
# c = Grapjianshu('https://www.jianshu.com/trending/monthly')
# c.url_list()
# c.url_content()
# b = Grapjianshu('https://www.jianshu.com/trending/weekly')
# b.url_list()
# b.url_content()

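Let's run it:

GET IT!!!
Because of time constraints I didn't pull the article text out of the pages; I only saved the source of each article page. The rest should be straightforward, though. I'd also like to thank Jianshu for not turning any anti-scraping measures on a scraping beginner like me.
This is only a prototype, and there is plenty to fix and improve. Suggestions and questions are very welcome. Thanks!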
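One possible improvement along these lines: article titles are used directly as file names, so a title containing a character such as `/` could make `open` fail. Below is a sketch of a helper that cleans the title first and writes the page source as UTF-8 text. `safe_filename` and `save_page` are hypothetical names, and the set of replaced characters is a guess at what is commonly problematic, not an exhaustive list.

```python
import os
import re


def safe_filename(title, max_length=100):
    """Replace characters that are commonly invalid in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    return cleaned[:max_length] or 'untitled'


def save_page(title, html, directory='.'):
    """Write the page source to <directory>/<safe title>.html and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(title) + '.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return path
```

In the class above, the body of the saving loop could then call something like `save_page(i, response)` instead of opening the file by hand.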

Please credit the source when reposting.

Python self-study mutual-help QQ group: 670402334
