07 A case study: crawling Tieba page source with a Python web scraper
2017-11-21
Taking the "李毅" bar as an example, we write a small program that crawls the pages automatically and saves them locally. The code below is written for Python 3. Under Python 2, response.content is already a string and needs no decoding, so simply remove the decode() call; for the details, see the earlier article 04 on the small differences of the requests module between Python 2 and Python 3.
In addition, under Python 2 the encoding parameter in the save_html method must be removed, as noted in the code comments.
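To make that Python 3 behaviour concrete, here is a minimal sketch; the URL is only an example, and everything apart from response.content and decode() is illustrative rather than part of the original program:

# Python 3: response.content is bytes, so it must be decoded to get a str.
import requests

response = requests.get('https://tieba.baidu.com/f?kw=李毅&pn=0')
print(type(response.content))      # <class 'bytes'> under Python 3
html = response.content.decode()   # decodes as UTF-8 by default
# Under Python 2, response.content is already a str, so the decode() call is dropped.

The full spider program is as follows: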
# coding=utf-8
import requests


class TiebaSpider:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # URL template; the page offset is filled in later via format()
        self.temp_url = 'https://tieba.baidu.com/f?kw=' + tieba_name + '&pn={}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
        }

    def get_url_list(self):  # build the list of page URLs
        url_list = [self.temp_url.format(i * 50) for i in range(1000)]
        return url_list

    def parse_url(self, url):  # send the request and get the response
        print('now parse', url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html, page_num):  # save the html locally
        file_path = self.tieba_name + "_" + str(page_num) + ".html"
        # On Windows encoding='utf-8' is required because the default encoding is gbk;
        # under Python 2 this encoding parameter must be removed, otherwise open() raises an error
        with open(file_path, "w", encoding='utf-8') as f:
            f.write(html)
        print("saved successfully")

    def run(self):
        # 1. build the url list
        url_list = self.get_url_list()
        # 2. send the requests and get the responses
        for url in url_list:
            html_str = self.parse_url(url)
            # 3. save
            page_num = url_list.index(url) + 1  # index() gives the page number to save under
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    tieba = TiebaSpider("李毅")
    tieba.run()
Running the code saves each page to a local HTML file; the result is shown in the figure below.
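If you only want a handful of pages rather than the 1000 URLs that get_url_list builds, one small variation is to limit the page count. The SmallTiebaSpider subclass and its max_page parameter below are hypothetical additions for illustration, not part of the original code:

# Hypothetical variation: limit how many pages are crawled (max_page is an added parameter).
class SmallTiebaSpider(TiebaSpider):
    def __init__(self, tieba_name, max_page=3):
        super().__init__(tieba_name)
        self.max_page = max_page

    def get_url_list(self):
        # only the first max_page pages; each Tieba page holds 50 posts
        return [self.temp_url.format(i * 50) for i in range(self.max_page)]


if __name__ == '__main__':
    spider = SmallTiebaSpider("李毅", max_page=3)
    spider.run()  # writes 李毅_1.html, 李毅_2.html, 李毅_3.html to the current directory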