Python 3.6: Scraping Jianshu's Trending Articles
2018-01-10
james_chang
To scrape the trending articles, you first need to find where they live: 20 on the homepage, 20 in the 7-day trending list, and another 20 in the 30-day trending list. So let's grab all 60 of them.
Sixty articles means sixty URLs. I'm not going to hunt them down one by one, and I'm certainly not going to scrape each page by hand, so the plan is to first collect those 60 URLs into a list or dictionary and then loop over them with a for loop.
Let's scrape the URLs first:
![](https://img.haomeiwen.com/i9572744/6f21003834b72843.png)
Whoa, so tidy!
![](http://upload-images.jianshu.io/upload_images/9572744-8e89a1bcaa1d9d8d.png)
![](http://upload-images.jianshu.io/upload_images/9572744-e7e94664b2e23a35.jpeg)
What's more, the content inside each a tag is nice and uniform, which brought my laziness right back: I couldn't even be bothered to write a regex. Why not just grab the a tags with BeautifulSoup and slice the strings?
Let's give it a try:
```python
import requests
from bs4 import BeautifulSoup
import pickle


class Grapjianshu(object):
    def __init__(self, url):
        self.url = url
        self.data_dict = {}

    def url_list(self):
        # fetch the page source
        response = requests.get(self.url).text
        # build a BeautifulSoup object from it
        soup = BeautifulSoup(response, 'html.parser')
        # grab the links to the trending articles
        data = soup.find_all(name='a', class_='title')
        self.data_dict = {}
        for i in data:
            s = str(i)
            # slice the tag's string form: the title text and the href portion
            name = s[56:-4]
            url = 'https://www.jianshu.com' + s[23:38]
            self.data_dict[name] = url
        for a in self.data_dict:
            print(a, self.data_dict[a])
```
Let's look at the result:
![](http://upload-images.jianshu.io/upload_images/9572744-2320de13f0b73ce3.png)
Titles and links all came through, no regex needed, and everything is stored in the data_dict dictionary.
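As an aside, the hard-coded slices (s[56:-4] and s[23:38]) only hold while Jianshu's markup stays exactly the same. A minimal sketch of a sturdier variant, assuming the trending links still use class="title" with a relative /p/... href, reads the title and link from the tag's attributes instead:

```python
import requests
from bs4 import BeautifulSoup

# Minimal sketch: extract title/link pairs via tag attributes rather than string slicing.
# Assumes the trending links still carry class="title" and a relative href like /p/xxxx.
response = requests.get('https://www.jianshu.com').text
soup = BeautifulSoup(response, 'html.parser')
data_dict = {}
for tag in soup.find_all('a', class_='title'):
    name = tag.get_text(strip=True)                              # link text = article title
    data_dict[name] = 'https://www.jianshu.com' + tag['href']    # href is the /p/... path
for name, url in data_dict.items():
    print(name, url)
```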
Next, let's fetch each of these article pages and write them to files (here is the full code):
```python
import requests
from bs4 import BeautifulSoup
import pickle


class Grapjianshu(object):
    def __init__(self, url):
        self.url = url
        self.data_dict = {}

    def url_list(self):
        # fetch the page source
        response = requests.get(self.url).text
        # build a BeautifulSoup object from it
        soup = BeautifulSoup(response, 'html.parser')
        # grab the links to the trending articles
        data = soup.find_all(name='a', class_='title')
        self.data_dict = {}
        for i in data:
            s = str(i)
            # slice the tag's string form: the title text and the href portion
            name = s[56:-4]
            url = 'https://www.jianshu.com' + s[23:38]
            self.data_dict[name] = url
        for a in self.data_dict:
            print(a, self.data_dict[a])

    def url_content(self):
        # fetch every article page and dump the source into a file named after the title
        for i in self.data_dict:
            f = open(i + '.html', 'wb')
            response = requests.get(self.data_dict[i]).text
            # note: pickle.dump stores the page source as a pickled string, not raw HTML
            pickle.dump(response, f)
            f.close()


a = Grapjianshu('https://www.jianshu.com')
a.url_list()
a.url_content()
# c = Grapjianshu('https://www.jianshu.com/trending/monthly')
# c.url_list()
# c.url_content()
# b = Grapjianshu('https://www.jianshu.com/trending/weekly')
# b.url_list()
# b.url_content()
```
Let's run it and see:
![](http://upload-images.jianshu.io/upload_images/9572744-7a7ec7f43e7d3d69.png)
GET IT!!!
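One caveat about those files: pickle.dump writes a pickled Python string, so even though they end in .html a browser won't render them directly; you would read them back with pickle.load. If the goal is the raw page source, a minimal sketch (the filename clean-up here is purely illustrative) writes the text straight out:

```python
import requests

# Minimal sketch: save the raw HTML text instead of a pickled string.
# The safe_name clean-up is illustrative; real titles may need more thorough sanitising.
def save_page(title, url):
    html = requests.get(url).text
    safe_name = title.replace('/', '_').replace('\\', '_')
    with open(safe_name + '.html', 'w', encoding='utf-8') as f:
        f.write(html)
```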
For lack of time I didn't pick out the article bodies themselves; I only saved each page's source. The remaining step is straightforward, though. And here I have to thank Jianshu for not throwing any anti-scraping measures at a crawler novice like me.
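If you do want to pull out just the article body, a minimal sketch along the same lines might look like this; the 'show-content' container class is my assumption about Jianshu's article markup at the time and may need adjusting:

```python
import requests
from bs4 import BeautifulSoup

# Minimal sketch: extract only the article body from a Jianshu article page.
# The 'show-content' class name is an assumption and may differ on the live site.
def article_text(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', class_='show-content')
    return body.get_text('\n', strip=True) if body else ''

# usage: print(article_text('https://www.jianshu.com/p/...'))  # hypothetical article URL
```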
This is only a rough first version and there is plenty to fix and improve, so please feel free to send me suggestions and questions. Thanks!
Please credit the source when reposting.
Python self-study mutual-help QQ group: 670402334