Python Web Scraping from Beginner to First Job 03: Full-Site Crawling
Full-site crawling is a common crawling pattern: we crawl the target site in bulk, which requires handling pagination so that we can traverse the entire site.
What this chapter covers:
- Chinese encoding issues in web pages
- Handling pagination to achieve a full crawl
- Extracting functions to reduce duplicated code
- Exception handling
Handling Chinese Encoding
For this project we will crawl the news section (新闻资讯) of 手机天堂 (xpgod.com). After a quick look at the page source, we start with a simple piece of crawling code:
```python
import requests

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        print(rsp.text)
```
Run the code and inspect the returned HTML source: the Chinese text comes out as garbled characters:
![](https://img.haomeiwen.com/i3912051/3f5555476a86a877.png)
Open the same news page in Chrome, right-click and choose "View Page Source":
![](https://img.haomeiwen.com/i3912051/ed4b8da7d1362aac.png)
There are many ways to encode Chinese text; see the article 中文编码杂谈 for the details. With a basic understanding of Chinese encodings, let's adjust the code:
```python
import requests

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'  # tell requests how to decode the response body
        print(rsp.text)
```
Run the code again and the Chinese source is retrieved correctly:
![](https://img.haomeiwen.com/i3912051/9c164a881f5c6d3d.png)
For encoding issues we should understand not only the fix but also the cause. requests uses three methods to detect a page's encoding automatically:
- get_encodings_from_content(): finds the encoding in the page content using a set of built-in regular expressions.
- get_encoding_from_headers(): reads the encoding from the HTTP Content-Type header; when no charset is specified, it defaults to ISO-8859-1.
- The chardet library (a dependency of requests): its detect() function guesses the encoding from the raw bytes.
When requests fails to detect the right encoding, set it manually before reading rsp.text and the body will be decoded correctly.
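As a minimal sketch (using the same URL as above), you can inspect what requests detected and override it before touching rsp.text:

```python
import requests

rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
print(rsp.encoding)           # encoding taken from the HTTP headers (ISO-8859-1 when no charset is given)
print(rsp.apparent_encoding)  # encoding guessed by chardet from the raw response bytes
rsp.encoding = rsp.apparent_encoding  # or simply rsp.encoding = 'gbk', as in the code above
print(rsp.text[:100])         # now decoded with the corrected encoding
```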
Handling Pagination
We will also use a new parsing library, lxml, installed with pip: pip install lxml. Building on the previous tutorial, we first crawl the news items on the first list page:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'  # set the encoding explicitly
        # Handle the first page of the news list
        soup = BeautifulSoup(rsp.text, 'lxml')  # use the more powerful lxml parser (pip install lxml)
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url = urljoin(rsp.url, href)
            # Request the news detail page
            rsp_detail = requests.get(url)
            rsp_detail.encoding = 'gbk'
            soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
            title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
            info = soup_detail.find('div', class_='top_others_lf').text.strip()  # contains publish time and author
            infos = info.split('|')  # split the string on '|'
            publish_time = infos[0].split(':')[-1].strip()  # publish time
            author = infos[1].split(':')[-1].strip()  # author
            summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()  # summary
            article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()  # article body
            images = []  # image URLs
            for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp_detail.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }
            print(data)
```
Run the code and the results appear in the console. Success!
For a full crawl we still have to handle pagination and collect the news links from every list page. First, look at the URL pattern of the list pages:
- Page 1: https://www.xpgod.com/shouji/news/zixun.html
- Page 2: https://www.xpgod.com/shouji/news/zixun_2.html
- Page 3: https://www.xpgod.com/shouji/news/zixun_3.html
In most cases paginating simply means changing the page number in the URL. We first read the maximum page number, then build the URL for every page (special cases need their own handling).
Find the maximum page number in the page source:
![](https://img.haomeiwen.com/i3912051/076f8eb6f0ac587d.png)
Continue with the code:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'
        # Handle the first page of the news list
        soup = BeautifulSoup(rsp.text, 'lxml')
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url = urljoin(rsp.url, href)
            # Request the news detail page
            rsp_detail = requests.get(url)
            rsp_detail.encoding = 'gbk'
            soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
            title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
            info = soup_detail.find('div', class_='top_others_lf').text.strip()
            infos = info.split('|')
            publish_time = infos[0].split(':')[-1].strip()
            author = infos[1].split(':')[-1].strip()
            summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()
            article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()
            images = []
            for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp_detail.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }
        # Pagination
        li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]
        max_page = int(li_node.text.strip())  # .text is a string, convert it to int
        for page in range(2, max_page + 1):  # pages 2..max_page (range excludes the stop value)
            url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)  # string formatting: fill '{}' with the value of page
            print(url)
```
All the list-page URLs are generated successfully:
![](https://img.haomeiwen.com/i3912051/df3a244501576b1c.png)
Now repeat the first step for each page: collect the news links from every list page and crawl every news item:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'
        # Handle the first page of the news list
        soup = BeautifulSoup(rsp.text, 'lxml')
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url = urljoin(rsp.url, href)
            # Request the news detail page
            rsp_detail = requests.get(url)
            rsp_detail.encoding = 'gbk'
            soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
            title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
            info = soup_detail.find('div', class_='top_others_lf').text.strip()
            infos = info.split('|')
            publish_time = infos[0].split(':')[-1].strip()
            author = infos[1].split(':')[-1].strip()
            summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()
            article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()
            images = []
            for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp_detail.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }
        # Pagination
        li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]
        max_page = int(li_node.text.strip())
        for page in range(2, max_page + 1):
            url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)
            rsp_index = requests.get(url)
            rsp_index.encoding = 'gbk'
            # Handle this list page, same as the first page
            soup_index = BeautifulSoup(rsp_index.text, 'lxml')
            for div_node in soup_index.find_all('div', class_='zixun_li_title'):
                a_node = div_node.find('a')
                href = a_node['href']
                url = urljoin(rsp_index.url, href)
                # Request the news detail page
                rsp_detail = requests.get(url)
                rsp_detail.encoding = 'gbk'
                soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
                title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
                info = soup_detail.find('div', class_='top_others_lf').text.strip()
                infos = info.split('|')
                publish_time = infos[0].split(':')[-1].strip()
                author = infos[1].split(':')[-1].strip()
                summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()
                article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()
                images = []
                for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                    src = node['src']
                    img_url = urljoin(rsp_detail.url, src)
                    images.append(img_url)
                data = {
                    'title': title,
                    'publish_time': publish_time,
                    'author': author,
                    'summary': summary,
                    'article': article,
                    'images': images,
                }
                print('News data from page {}:'.format(page), data)
```
Run the code and the full crawl starts!
![](https://img.haomeiwen.com/i3912051/23f8c02e9bf1b8c2.png)
Extracting Functions
Looking back at the code, parsing page 1 and parsing the other list pages is exactly the same, and the code that requests a news detail page is written twice. To avoid this duplication, we move the repeated code into functions and call them instead:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class PhoneHeavenSpider:
    def start(self):
        self.crawl_index(1)

    # Crawl a list page (including the first page)
    def crawl_index(self, page):
        if page == 1:
            url = 'https://www.xpgod.com/shouji/news/zixun.html'
        else:
            url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)
        rsp = requests.get(url)
        rsp.encoding = 'gbk'
        soup = BeautifulSoup(rsp.text, 'lxml')
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url_detail = urljoin(rsp.url, href)
            self.crawl_detail(url_detail, page)
        if page == 1:  # pagination: only needed on the first page
            li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]
            max_page = int(li_node.text.strip())
            for new_page in range(2, max_page + 1):
                self.crawl_index(new_page)

    # Crawl a news detail page
    def crawl_detail(self, url, page):
        rsp = requests.get(url)
        rsp.encoding = 'gbk'
        soup = BeautifulSoup(rsp.text, 'lxml')
        title = soup.find('div', class_='youxizt_top_title').text.strip()
        info = soup.find('div', class_='top_others_lf').text.strip()
        infos = info.split('|')
        publish_time = infos[0].split(':')[-1].strip()
        author = infos[1].split(':')[-1].strip()
        summary = soup.find('div', class_='zxxq_main_jianjie').text.strip()
        article = soup.find('div', class_='zxxq_main_txt').text.strip()
        images = []
        for node in soup.find('div', class_='zxxq_main_txt').find_all('img'):
            src = node['src']
            img_url = urljoin(rsp.url, src)
            images.append(img_url)
        data = {
            'title': title,
            'publish_time': publish_time,
            'author': author,
            'summary': summary,
            'article': article,
            'images': images,
        }
        print('Page {} news:'.format(page), data)
```
The code is much cleaner now. Splitting the project into functions by responsibility makes it far easier to follow, and when something needs to change later we only edit it once inside the function instead of touching every copy. This is a basic principle of software development: avoid duplicated code.
Exception Handling
While crawling we can run into all kinds of unexpected situations, for example a request that fails:
```
Traceback (most recent call last):
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 63, in <module>
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 8, in start
    self.crawl_index(1)
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 30, in crawl_index
    max_page = int(li_node.text.strip())  # .text is a string, convert it to int
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 24, in crawl_index
    href = a_node['href']
  File "E:\Python37\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Python37\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Python37\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\Python37\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "E:\Python37\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.xpgod.com', port=443): Max retries exceeded with url: /shouji/news/17804.html (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001C8D56FB0F0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
```
A large-scale crawl will hit many kinds of problems:
- gaps in the code's own logic
- network and request errors
- parsing errors caused by pages with an unusual layout
- parsing errors caused by dirty data
- other unexpected errors
So the code has to handle exceptions. For today we simply filter them out:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class PhoneHeavenSpider:
    def start(self):
        self.crawl_index(1)

    # Crawl a list page (including the first page)
    def crawl_index(self, page):
        try:
            if page == 1:
                url = 'https://www.xpgod.com/shouji/news/zixun.html'
            else:
                url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)
            rsp = requests.get(url)
            rsp.encoding = 'gbk'  # set the encoding explicitly
            soup = BeautifulSoup(rsp.text, 'lxml')  # the more powerful lxml parser
            for div_node in soup.find_all('div', class_='zixun_li_title'):
                a_node = div_node.find('a')
                href = a_node['href']
                url_detail = urljoin(rsp.url, href)
                self.crawl_detail(url_detail, page)  # crawl the data behind each news URL
            if page == 1:  # pagination: only needed on the first page
                li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]
                max_page = int(li_node.text.strip())  # .text is a string, convert it to int
                for new_page in range(2, max_page + 1):  # pages 2..max_page (range excludes the stop value)
                    self.crawl_index(new_page)  # call crawl_index again; only page 1 enters this branch, so recursion stays shallow
        except:
            pass  # swallow any exception and move on

    # Crawl a news detail page
    def crawl_detail(self, url, page):
        try:
            rsp = requests.get(url)
            rsp.encoding = 'gbk'
            soup = BeautifulSoup(rsp.text, 'lxml')
            title = soup.find('div', class_='youxizt_top_title').text.strip()
            info = soup.find('div', class_='top_others_lf').text.strip()  # contains publish time and author
            infos = info.split('|')  # split the string on '|'
            publish_time = infos[0].split(':')[-1].strip()  # publish time
            author = infos[1].split(':')[-1].strip()  # author
            summary = soup.find('div', class_='zxxq_main_jianjie').text.strip()  # summary
            article = soup.find('div', class_='zxxq_main_txt').text.strip()  # article body
            images = []  # image URLs
            for node in soup.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }
            print('Page {} news:'.format(page), data)
        except:
            pass  # swallow any exception and move on

if __name__ == '__main__':
    PhoneHeavenSpider().start()
```
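A bare except: pass keeps the crawl running, but it also hides every error. A slightly safer variant catches only request-related exceptions and at least logs them. This is just a sketch; safe_get is a hypothetical helper, not part of the tutorial's code:

```python
import requests
from bs4 import BeautifulSoup

def safe_get(url, timeout=10):
    """Hypothetical helper: fetch and parse a page, returning None if the request fails."""
    try:
        rsp = requests.get(url, timeout=timeout)
        rsp.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
        rsp.encoding = 'gbk'
        return BeautifulSoup(rsp.text, 'lxml')
    except requests.exceptions.RequestException as e:
        print('Request failed for {}: {}'.format(url, e))
        return None
```

With something like this, crawl_detail could start with soup = safe_get(url) and simply return when it gets None, instead of wrapping its whole body in try/except.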
Homework:
- 游戏葡萄: crawl the whole site
Answer to the exercise:
Next chapter >> Python Web Scraping from Beginner to First Job 04: Data Storage (still being written...)