For beginners: scraping and visualizing simple Wikipedia table data

2017-07-28, treelake

English original: Webscraping and beyond

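Goal: visualize the healthcare rankings of European countries (scrape table data from Wikipedia, merge it, and plot it), using Python 3.
Suppress warning messages (optional)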

import warnings
warnings.filterwarnings("ignore")

How to ignore deprecation warnings in Python

Import the libraries

import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
import re
import urllib.request
from datetime import datetime
import os
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

os.chdir('./your-directory/')
os.getcwd()

Request the page and parse it

url = 'https://en.wikipedia.org/wiki/Healthcare_in_Europe' 
r = requests.get(url)
HCE = BeautifulSoup(r.content, 'html.parser')
type(HCE)
# Use print(HCE.prettify()) to display the prettified HTML

Find all the table tags

htmlpage = urllib.request.urlopen(url)
# The original re-requests the page with urllib here, which is not actually necessary.
# re.findall('<table class="([^"]*)"', r.text) finds all the table class names directly.
lst = []
for line in htmlpage:
    line = line.rstrip()
    if re.search('table class', line.decode('utf-8')) :
        lst.append(line)
print(lst)
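The shortcut mentioned in the code comment above can be sketched like this. The HTML string is a made-up stand-in for r.text, so the class names it yields are illustrative only.

```python
import re

# Hypothetical stand-in for r.text; the real page contains many more tags.
sample_html = '''
<table class="wikitable sortable"><tr><th>Country</th></tr></table>
<table class="navbox"><tr><td>links</td></tr></table>
<table><tr><td>no class attribute</td></tr></table>
'''

# Capture the class attribute value of every <table class="..."> tag.
# Tables without a class attribute are not matched by this pattern.
table_classes = re.findall(r'<table class="([^"]*)"', sample_html)
print(table_classes)
```

Note that the pattern only matches tags where class is the first attribute right after the tag name, which is a limitation of this quick regex approach.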

Find the table by its class name

table = HCE.find('table', {'class': 'wikitable sortable'})
type(table)

Inspect the table and find its headers

headers= [header.text for header in table.find_all('th')]
print(headers)

Read each row of the table

rows = []
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])
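As a variation on the loop above (not the code used in the rest of this walkthrough), the header row can be skipped and whitespace trimmed while the cells are read. This is a minimal sketch on a made-up miniature table; the HTML and its values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; the real Wikipedia table has more columns.
sample_html = '''
<table class="wikitable sortable">
  <tr><th>Country</th><th>Ranking</th></tr>
  <tr><td>Netherlands</td><td>1</td></tr>
  <tr><td>Switzerland</td><td>2</td></tr>
</table>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

# get_text(strip=True) trims leading and trailing whitespace in each cell.
headers = [th.get_text(strip=True) for th in table.find_all('th')]

rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has only <th> cells, so its list is empty
        rows.append(cells)

print(headers)
print(rows)
```

The walkthrough keeps the empty first row and the untrimmed text instead, which is why it later drops a row and strips the Country column by hand.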

Show the first few rows

for n, r in enumerate(rows):
    print(n, r)
    if n > 6: break

Now build the DataFrame df1 from the extracted elements

df1 = pd.DataFrame(rows, columns=headers)
df1.head(7)

Request another, similar page to get health expenditure per country and build the DataFrame df2

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita' 
r = requests.get(url)
HCE = BeautifulSoup(r.content, 'html.parser')
second_table_class_name = re.findall('<table class="([^"]*)"', r.text)[1]
table = HCE.find('table', {'class': second_table_class_name})
headers= [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])
df2 = pd.DataFrame(rows, columns=headers)
df2.head(7)

Check the data types

df1.dtypes

Automatically convert the columns that can be converted to numeric types

# convert_objects exists only in pandas < 1.0 (it was removed in pandas 1.0)
df1 = df1.convert_objects(convert_numeric=True)
df2 = df2.convert_objects(convert_numeric=True)
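On a pandas version without convert_objects, a similar effect can be approximated with pd.to_numeric. The helper below is a hypothetical sketch, not part of pandas and not the original author's code: it assumes that unparsable values should become NaN and that columns with no numeric value at all should stay as they are. The sample values are made up.

```python
import pandas as pd

def to_numeric_where_possible(df):
    """Return a copy of df with columns converted to numbers where possible.

    Hypothetical helper: values that cannot be parsed become NaN, and
    columns in which nothing can be parsed are left unchanged.
    """
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors='coerce')
        if converted.notna().any():
            out[col] = converted
    return out

# Made-up sample values, for illustration only.
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Score': ['10', '12', 'n/a'],
})
converted = to_numeric_where_possible(sample)
print(converted.dtypes)
```

With this helper the calls would read df1 = to_numeric_where_possible(df1), and likewise for df2.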

Check the new data types

df2.dtypes

Replace the column names (headers) with simpler ones and show basic summary statistics

print(df1.columns)
df1.columns = ['Country', 'Ranking', 'Patientrights', 'Accessibility', 'Outcomes', 'Range', 'Prevention', 'Pharmaceuticals']
df1.describe()

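Drop the first row of df1 and df2 (all None, because the header row contains th cells rather than td cells) and rename the df2 columns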

df1 = df1.drop(df1.index[0])
df2 = df2.drop(df2.index[0])
df2.columns = ['Country', 'y2012', 'y2013', 'y2014', 'y2015']
df2.head(5)

Merge the two datasets (df1 and df2)

pd.merge(df1,df2, how='left', on='Country').head(10)

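After the merge, many NaN values (missing data) appear, which suggests that the country names in the two tables do not match. Let's look at which countries the two tables have in common: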

set(df1['Country']) & set(df2['Country'])

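The result is an empty set, even though many countries actually appear in both tables. Let's write both tables to CSV files and open them to take a look.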

df1.to_csv('df1example.csv', sep=",")
df2.to_csv('df2example.csv', sep=",")

Open df1example.csv in Excel and save a screenshot as Screenshot.png; strange characters show up in the Country column.

# Optional: the original reads the screenshot back in and displays it, which is unnecessary
img = mpimg.imread('Screenshot.png')
plt.figure(figsize = (9, 9))
plt.axis('off')
plt.imshow(img)
plt.show()

The repr function lets us see the strange characters in df1

repr(df1.Country)

df2 does not contain them

repr(df2['Country'])

Strip all the strange characters from the Country column of df1

df1.Country = df1.Country.apply(lambda x: x.strip())
repr(df1['Country'])
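Why invisible characters break the set intersection, and why strip fixes it, can be seen on plain strings. The names below are made up, and the specific characters (a non-breaking space and a trailing newline) are an assumption about what scraped cell text can contain; the actual characters in df1 should be checked with repr.

```python
# Made-up country names: the first list imitates scraped cell text carrying
# a non-breaking space (\xa0) and a trailing newline.
scraped = ['\xa0Netherlands\n', '\xa0Norway\n']
clean = ['Netherlands', 'Norway', 'Japan']

# repr shows the escape sequences that print() would render invisibly.
print(repr(scraped[0]))

# The strings differ character by character, so nothing matches.
overlap_before = set(scraped) & set(clean)
print(overlap_before)

# str.strip() removes leading and trailing whitespace, including \xa0 and \n.
stripped = [name.strip() for name in scraped]
overlap_after = set(stripped) & set(clean)
print(sorted(overlap_after))
```

strip only touches the two ends of a string, so a hidden character in the middle of a name would need a different treatment, for example str.replace.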

Check the intersection of countries in the two tables again

set(df1['Country']) & set(df2['Country'])

The merge now succeeds. Some values are still missing, but most are present.

pd.merge(df1,df2, how='left', on='Country')
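To list exactly which countries found no partner in the second table, the indicator option of pd.merge can help. This is a small sketch on made-up miniature frames; the names left and right and all the numbers are hypothetical stand-ins for df1 and df2.

```python
import pandas as pd

# Made-up miniature versions of df1 and df2, for illustration only.
left = pd.DataFrame({'Country': ['Netherlands', 'Norway', 'Albania'],
                     'Ranking': [1, 3, 30]})
right = pd.DataFrame({'Country': ['Netherlands', 'Norway'],
                      'y2014': [5000, 6000]})

# indicator=True adds a _merge column; in a left join each row is marked
# either 'both' (match found) or 'left_only' (no match in the right frame).
merged = pd.merge(left, right, how='left', on='Country', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Rows marked left_only are the ones whose expenditure columns come out as NaN after the merge.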

Merge into a new df3, drop the rows with missing values, and convert to numeric types

df3 = pd.merge(df1,df2, how='left', on='Country')
df3.dropna(how='any', inplace=True)
# pandas < 1.0 only; convert_objects was removed in pandas 1.0
df3 = df3.convert_objects(convert_numeric=True)

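Draw a scatter plot and use the adjustText library (pip install adjustText) to keep the country labels from overlapping. The size of each marker represents that country's health expenditure in 2014.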

import matplotlib as mpl
import matplotlib.pyplot as plt
from adjustText import adjust_text

def plot_df3(adjust=True):
    mpl.rcParams['font.size'] = 12.0
    plt.figure(figsize = (14, 14))
    plt.scatter(df3.Patientrights, df3.Outcomes, facecolors='none', edgecolors='red', linewidth=1.2, s=0.1*df3.y2014)
    texts = []
    plt.title('Relation between different health parameters')
    plt.xlabel('Patient rights')
    plt.ylabel('Outcomes')
    plt.xlim(0, 40)
    plt.ylim(0, 40)

    for x, y, s in zip(df3['Patientrights'], df3['Outcomes'], df3['Country']):
        texts.append(plt.text(x, y, s, size=12))
    if adjust:
        # Move overlapping labels apart without overwriting the title set above
        adjust_text(texts, arrowprops=dict(arrowstyle="-", color='black', lw=0.5))
_ = plot_df3()
plt.show()

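The plot suggests that countries ranking higher on patient rights also tend to rank higher on outcomes.
Countries with higher health expenditure tend to rank higher on both outcomes and patient rights.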
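The visual impression could also be checked numerically with a correlation matrix. The sketch below uses made-up numbers in place of df3, so the printed values say nothing about the real data; only the column names are taken from the walkthrough.

```python
import pandas as pd

# Made-up numbers standing in for df3; they are not the scraped values.
sample = pd.DataFrame({
    'Patientrights': [10, 14, 18, 22],
    'Outcomes': [12, 15, 21, 24],
    'y2014': [1000, 2500, 3000, 5000],
})

# Pearson correlation between every pair of columns.
corr = sample.corr()
print(corr.round(2))
```

On the real data the equivalent call would be df3[['Patientrights', 'Outcomes', 'y2014']].corr(); values close to 1 would support the reading of the scatter plot, while values near 0 would not.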

References

[1]. Data is the new oil. https://www.changethislimited.co.uk/2017/01/data-is-the-new-oil/
[2]. Data on internet. https://www.livescience.com/54094-how-big-is-the-internet.html
[3]. Beautiful Soup. https://www.crummy.com/software/BeautifulSoup/bs4/
[4]. Healthcare Europe. https://en.wikipedia.org/wiki/Healthcare_in_Europe
[5]. Visualizing European healthcare using Tableau. https://rrighart.github.io/HE-Tableau/
[6]. Scraping tables. https://stackoverflow.com/questions/17196018/extracting-table-contents-from-html-with-python-and-beautifulsoup
[7]. Health expenditure. https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita
[8]. Hidden characters. https://stackoverflow.com/questions/31341351/how-can-i-identify-invisible-characters-in-python-strings
[9]. Adjust text package. https://github.com/Phlya/adjustText/blob/master/examples/Examples.ipynb
