17. API in Practice: Scraping User Location Data from Qiushibaike
2018-02-20
橄榄的世界
- Using BDP Personal Edition
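Target URL: https://www.qiushibaike.com/text/
Target data: user locations
Scraping method: lxml
Approach: split the work into three functions. get_url(url) collects the links to users' profile pages from a listing page; get_address(url) scrapes the location (for example 广东 or 四川) from a profile page; get_geo(address) calls a geocoding API to get the longitude and latitude for each province. Finally, the address, longitude, and latitude are saved to a CSV file.
Code: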
import csv
import json

import requests
from lxml import etree


def get_url(url):
    """Return the profile-page links of the authors on one listing page."""
    r = requests.get(url, headers=headers)
    print(r.status_code)
    html = etree.HTML(r.text)
    url_infos = html.xpath('//div[@class="author clearfix"]')
    user_link_list = []
    for url_info in url_infos:
        # Some author blocks have no profile link; skip those.
        user_part_link = url_info.xpath('a[1]/@href')
        if user_part_link:
            user_link = "https://www.qiushibaike.com" + user_part_link[0]
            user_link_list.append(user_link)
    return user_link_list


def get_address(url):
    """Scrape the location from a profile page and geocode it."""
    r = requests.get(url, headers=headers)
    print(r.status_code)
    html = etree.HTML(r.text)
    items = html.xpath('//div[2]/div[3]/div[2]/ul/li[4]/text()')
    if not items:
        return None
    # Keep only the part before ' · ' (the province, e.g. 广东).
    address = items[0].split(' · ')[0]
    get_geo(address)
    return address


def get_geo(address):
    """Look up the coordinates of an address and write one CSV row."""
    par = {'address': address, 'key': 'cb649a25c1f81c1451adbeca73623251'}
    url = 'http://restapi.amap.com/v3/geocode/geo'
    r = requests.get(url, params=par)
    json_data = json.loads(r.text)
    try:
        geo = json_data['geocodes'][0]['location']
    except (KeyError, IndexError):
        return  # the API returned no match for this address
    longitude, latitude = geo.split(',')
    writer.writerow((address, longitude, latitude))


if __name__ == "__main__":
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1, 14)]
    # The with block closes the file even if a request fails midway.
    with open('F://map.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(('地址', '经度', '纬度'))  # address, longitude, latitude
        for url in url_list:
            for user_link in get_url(url):
                get_address(user_link)
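The geocoding step comes down to pulling a "longitude,latitude" string out of the JSON response. Here is a minimal sketch of just that parsing, runnable without network access. It assumes the response has the shape that get_geo reads (a geocodes list whose first item has a location field); the helper name parse_location and the sample payload are made up for illustration and are not real API output.

```python
def parse_location(json_data):
    """Return (longitude, latitude) as floats, or None if there is no match."""
    geocodes = json_data.get('geocodes') or []
    if not geocodes:
        return None
    location = geocodes[0].get('location')
    if not location:
        return None
    longitude, latitude = location.split(',')
    return float(longitude), float(latitude)


# Hypothetical payload, shaped like the response that get_geo reads.
sample = {'geocodes': [{'location': '113.25,23.125'}]}
print(parse_location(sample))
print(parse_location({'geocodes': []}))
```

Returning None instead of raising keeps the calling loop simple: an address the API cannot resolve is just skipped.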
Result:
![](https://img.haomeiwen.com/i10026411/453667ef42b142b1.png)
Next, tidy the data with Excel's Insert → PivotTable feature, which turns it into:
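If the pivot table is only used to count how many users share each address, roughly the same aggregation can be done in Python with the standard library. This is a sketch under assumptions: it expects the CSV layout written by the script (a header row, then address, longitude, latitude), and count_addresses and the sample rows are illustrative, not real scraped data.

```python
import csv
import io
from collections import Counter


def count_addresses(csv_file):
    """Return (address, longitude, latitude, count) rows, most frequent first."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    counts = Counter(tuple(row[:3]) for row in reader if len(row) >= 3)
    return [(addr, lon, lat, n) for (addr, lon, lat), n in counts.most_common()]


# Made-up rows in the same layout as map.csv.
sample = io.StringIO(
    "地址,经度,纬度\n"
    "广东,113.25,23.125\n"
    "四川,104.0,30.5\n"
    "广东,113.25,23.125\n"
)
for row in count_addresses(sample):
    print(row)
```

For the real file, pass an open file object instead of the StringIO, for example one obtained with open('F://map.csv', newline='').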
![](https://img.haomeiwen.com/i10026411/d2a70766aab52de7.png)
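Then, in BDP Personal Edition: create a new worksheet → upload the data → create a new chart (choose the map chart type) → pick colors, sizes, and so on.
The final result looks like this (each color represents a province, and the larger the marker, the more users are located there):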
![](https://img.haomeiwen.com/i10026411/c7738e6413a55b43.png)
Of course, regular charts are just as easy to make:
![](https://img.haomeiwen.com/i10026411/9689327c6b05dc30.png)
For details, see the following link: https://me.bdp.cn/api/su/5YWKREVA