
07 A Case Study: Crawling Tieba Page Source with a Python Spider

2017-11-21  python_spider

Taking the "李毅" (Li Yi) bar as an example, this post writes a small program that crawls the pages automatically and saves them locally. The code here runs under Python 3. Under Python 2, response.content is already a string and does not need decoding, so simply remove the decode() call from the code below; for the details, see the earlier article 04 on the small differences of the requests module between Python 2 and Python 3.
In addition, under Python 2 the encoding argument of the save method must be removed; this is noted in the code comments.
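As a quick illustration, here is a minimal sketch of that difference; the URL and the file name demo.html are placeholders of my own, not part of the original program:

import requests

response = requests.get('https://tieba.baidu.com/f?kw=李毅&pn=0')

# Python 3: response.content is bytes and must be decoded to get a str
html = response.content.decode()
# Python 2: response.content is already a str, so no decode() is needed
# html = response.content

# Python 3 on Windows: pass encoding='utf-8', otherwise open() defaults to gbk
with open('demo.html', 'w', encoding='utf-8') as f:
    f.write(html)
# Python 2: open('demo.html', 'w') without the encoding keyword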

# coding=utf-8
import requests


class TiebaSpider:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # URL template; pn is the page offset and increases by 50 per page
        self.temp_url = 'https://tieba.baidu.com/f?kw=' + tieba_name + '&pn={}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
        }

    def get_url_list(self):  # build the list of page URLs
        url_list = [self.temp_url.format(i * 50) for i in range(1000)]
        return url_list

    def parse_url(self, url):  # send the request and get the response
        print('now parse', url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html, page_num):  # save the html to a local file
        file_path = self.tieba_name + "_" + str(page_num) + ".html"
        # On Windows, encoding='utf-8' is required because the default encoding is gbk;
        # under Python 2, remove the encoding argument or open() will raise an error
        with open(file_path, "w", encoding='utf-8') as f:
            f.write(html)
        print("saved successfully")

    def run(self):
        # 1. build the url list
        url_list = self.get_url_list()
        # 2. send requests and get responses
        for url in url_list:
            html_str = self.parse_url(url)
            # 3. save each page; index() gives the position of the current url, +1 is the page number
            page_num = url_list.index(url) + 1
            self.save_html(html_str, page_num)

if __name__ == '__main__':
    tieba = TiebaSpider("李毅")
    tieba.run()
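One caveat: response.content.decode() assumes the page is UTF-8 and will raise UnicodeDecodeError otherwise. A more defensive variant of parse_url is sketched below as a drop-in replacement for the method in TiebaSpider above; this is my own addition, not part of the original program:

    def parse_url(self, url):  # send the request and return the decoded body
        print('now parse', url)
        response = requests.get(url, headers=self.headers)
        try:
            return response.content.decode()  # assume UTF-8 first
        except UnicodeDecodeError:
            # fall back to the encoding guessed by requests (may be less accurate)
            return response.content.decode(response.apparent_encoding)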

Run the code; the pages saved locally are shown in the screenshot below.


(Screenshot: the HTML files saved locally.)