Scraping a Simple Static Website

2019-02-18

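This walkthrough uses BeautifulSoup and requests to scrape the titles, the chapter links, and the chapter names from the static site http://seputu.com/, and then stores the scraped data as JSON.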

Step 1: Use Requests to fetch the site's HTML document and print it

import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}  # present the request as coming from a browser
r = requests.get('http://seputu.com/', headers=headers)
print(r.text)
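
The request above can be made a little more defensive. The following is only a sketch under a few assumptions: the helper name `fetch_html`, the `get` parameter (there so the function can be exercised without network access), and the 10-second timeout are illustrative choices, not part of the original walkthrough. It raises on HTTP error codes and, when the server declares no charset and requests falls back to ISO-8859-1, switches to the encoding that requests detects from the body.

```python
import requests

# Example User-Agent string; any browser-like value would serve the same purpose.
HEADERS = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}


def fetch_html(url, get=requests.get, timeout=10):
    """Return the decoded HTML at url (hypothetical helper, not from the article)."""
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises an exception on 4xx/5xx responses
    # Without a declared charset, requests assumes ISO-8859-1 for text content,
    # which garbles Chinese pages; fall back to the encoding detected from the body.
    if response.encoding is None or response.encoding.lower() == 'iso-8859-1':
        response.encoding = response.apparent_encoding
    return response.text
```

Calling `fetch_html('http://seputu.com/')` should then return the page as text, ready to hand to the parser in the next step.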

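Step 2: Analyze the structure of the home page and locate the tags to extract

The analysis: each title and its chapters are wrapped in a <div class="mulu"> element. The title sits in the <h2> under <div class="mulu-title">, and the chapters are the <a> elements under <div class="box">.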

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')  # r.text is already decoded text
content = []
for mulu in soup.find_all(class_='mulu'):
    h2 = mulu.find('h2')
    if h2 is not None:
        h2_title = h2.string  # the title
        chapters = []
        for a in mulu.find(class_='box').find_all('a'):  # URL and chapter name from every <a> tag
            href = a.get('href')
            box_title = a.get('title')
            chapters.append({'href': href, 'box_title': box_title})
        content.append({'title': h2_title, 'content': chapters})
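
To check the extraction logic without touching the network, the same loop can be run against a small inline document. This is a minimal sketch: `SAMPLE_HTML` is invented to mimic the structure described above (it is not real page content), and `extract_catalog` is a hypothetical helper name. It also skips any "mulu" block that lacks an <h2> or a "box" div instead of failing on it.

```python
from bs4 import BeautifulSoup

# Invented sample that mimics the described layout; not taken from the real site.
SAMPLE_HTML = """
<div class="mulu">
  <div class="mulu-title"><h2>Book One</h2></div>
  <div class="box">
    <ul>
      <li><a href="http://example.com/1.html" title="Chapter 1">Chapter 1</a></li>
      <li><a href="http://example.com/2.html" title="Chapter 2">Chapter 2</a></li>
    </ul>
  </div>
</div>
<div class="mulu"><div class="box"></div></div>
"""


def extract_catalog(html):
    """Return a list of {'title': ..., 'content': [...]} dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    catalog = []
    for mulu in soup.find_all(class_='mulu'):
        h2 = mulu.find('h2')
        box = mulu.find(class_='box')
        if h2 is None or box is None:
            continue  # block without a title or a chapter list
        chapters = [
            {'href': a.get('href'), 'box_title': a.get('title')}
            for a in box.find_all('a')
        ]
        catalog.append({'title': h2.get_text(strip=True), 'content': chapters})
    return catalog


catalog = extract_catalog(SAMPLE_HTML)
```

Here `catalog` holds a single entry for "Book One" with two chapters; the second, title-less block is ignored.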

Step 3: Store the result as JSON

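import json

with open('qiye.json', 'w', encoding='utf-8') as fp:  # text mode: json.dump writes str, not bytes
    json.dump(content, fp, indent=4, ensure_ascii=False)  # keep Chinese titles readable in the file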
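
One detail worth checking is how non-ASCII titles end up in the file. The sketch below assumes Python 3 and uses an invented one-record `content` list written to a temporary directory, so it does not touch the real qiye.json. With `ensure_ascii=False`, `json.dump` writes the characters as they are rather than as `\uXXXX` escapes, and reading the file back yields the original data.

```python
import json
import os
import tempfile

# Invented placeholder record with the same shape as the scraped data.
content = [
    {
        'title': '卷一',
        'content': [{'href': 'http://example.com/1.html', 'box_title': '第一章'}],
    }
]

# Write to a throwaway directory so the example leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), 'qiye.json')
with open(path, 'w', encoding='utf-8') as fp:
    json.dump(content, fp, indent=4, ensure_ascii=False)

# Read the raw text back and parse it to confirm the round trip.
with open(path, encoding='utf-8') as fp:
    raw = fp.read()
loaded = json.loads(raw)
```

Leaving `ensure_ascii` at its default of `True` would still produce valid JSON, just with every Chinese character escaped, which makes the file harder to read by eye.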
