Python Web Scraping: Scraping Township-Level Administrative Division Codes of Guizhou Province (Part 1)

2020-06-16 模仿打酱油

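Python has plenty of mature tools for web scraping, such as the scrapy framework and the urllib2 module (urllib.request in Python 3). I usually use requests + beautifulsoup. With beautifulsoup you can get by with just its CSS selector method, select, which I find fairly easy for beginners to pick up. Below I use scraping administrative division codes as an example, in the hope that it prompts better ideas from others.
Target URL: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html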

1. Install requests + beautifulsoup

pip install requests
pip install beautifulsoup4

2. Import requests + beautifulsoup

import requests
from bs4 import BeautifulSoup

3. Request the page

url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html'
res = requests.get(url=url).content
soup = BeautifulSoup(res, 'html.parser', from_encoding='GBK')
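The request above has no timeout and does not check the HTTP status, so it can hang or quietly parse an error page. A slightly more defensive sketch follows. The helper name fetch_soup and the 10-second timeout are assumptions made for illustration and are not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and parse the body as GBK-encoded HTML.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    res = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing them.
    res.raise_for_status()
    # Pass raw bytes so that BeautifulSoup performs the decoding.
    return BeautifulSoup(res.content, 'html.parser', from_encoding='GBK')
```

Calling soup = fetch_soup(url) would then stand in for the last two lines of the snippet above.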

4. Inspect the page elements

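Move the cursor over an administrative division code, right-click, and choose "Inspect". You will find that the division codes all sit inside elements with the class name "citytr".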

[Figure: screenshot of the browser's element inspector]
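To see what the selector '.citytr a' matches without touching the site, here is a minimal sketch run against a hand-written fragment. The fragment is an assumption modeled on the structure described above (rows with class citytr, each holding one link for the code and one for the name). The real page's markup and hrefs may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the rows described above;
# it is not copied from the target page.
sample_html = """
<table>
  <tr><td>Code</td><td>Name</td></tr>
  <tr class="citytr">
    <td><a href="52/5201.html">520100000000</a></td>
    <td><a href="52/5201.html">贵阳市</a></td>
  </tr>
  <tr class="citytr">
    <td><a href="52/5203.html">520300000000</a></td>
    <td><a href="52/5203.html">遵义市</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# '.citytr a' selects every <a> that is a descendant of an element
# whose class is "citytr"; the header row has no class and is skipped.
links = soup.select('.citytr a')
texts = [str(a.string) for a in links]
print(texts)
```

Each row contributes two links, so the result alternates between a code and a name.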

5. Select the target elements and print them

citys = soup.select('.citytr a')

for city in citys:
    print(city.string)

Running this prints the text of every matched link, one per line.

[Figure: screenshot of the printed output]
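city.string works here because each link holds a single piece of text. As a hedged aside: when a tag has more than one child, .string is None, whereas get_text() joins all the text nested inside it. The markup below is made up to show the difference and is not taken from the target page.

```python
from bs4 import BeautifulSoup

# Made-up row: the second link wraps part of its text in <b>.
html = (
    '<tr class="citytr">'
    '<td><a>520100000000</a></td>'
    '<td><a><b>贵阳</b>市</a></td>'
    '</tr>'
)
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.select('.citytr a')

print(first.string)       # the link's only child is a text node
print(second.string)      # None, because this <a> has two children
print(second.get_text())  # concatenates all nested text
```

If a scraped page ever prints None for some links, switching to get_text() is one option to try.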

The complete code is as follows:

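import requests
from bs4 import BeautifulSoup


# Page listing the divisions of Guizhou Province (code 52)
url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html'
# Fetch the raw bytes; decoding is left to BeautifulSoup
res = requests.get(url=url).content
# Decode as GBK when parsing (see the note below)
soup = BeautifulSoup(res, 'html.parser', from_encoding='GBK')

# Select every link inside an element with class "citytr"
citys = soup.select('.citytr a')

# Print the text of each link
for city in citys:
    print(city.string)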

In the next part, we will cover how to structure the scraped data and save it.

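Note: the from_encoding='GBK' argument passed when parsing is there to avoid problems caused by different character encodings appearing on the same page.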
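A minimal offline sketch of what that argument does. The bytes below are produced by encoding a sample string as GBK, as a stand-in for res.content rather than real data from the site, and from_encoding tells BeautifulSoup how to decode them.

```python
from bs4 import BeautifulSoup

# Stand-in for res.content: a small HTML fragment encoded as GBK bytes.
raw = '<td><a>贵阳市</a></td>'.encode('gbk')

# Decoding the same bytes with the wrong codec garbles the Chinese text.
print(raw.decode('utf-8', errors='replace'))

# Stating the encoding up front yields the intended characters.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='GBK')
name = str(soup.a.string)
print(name)
```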
