
For Beginners: Scraping and Visualizing Simple Wikipedia Table Data

2017-07-28  Read by 972 people  treelake

Original English article: Webscraping and beyond

import warnings
warnings.filterwarnings("ignore")

How to ignore deprecation warnings in Python
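The two lines above silence every warning for the whole session. A narrower variant, shown here as a minimal sketch, ignores only `DeprecationWarning` and only inside one block. The function `old_api` is a hypothetical stand-in for any call that emits such a warning.

```python
import warnings


def old_api():
    """Hypothetical function that emits a DeprecationWarning."""
    warnings.warn("old_api is deprecated", DeprecationWarning)
    return 42


# record=True collects the warnings that get through, so the effect can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from "show everything"
    # Filters added later take precedence, so this one wins for DeprecationWarning
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    value = old_api()                            # its warning is suppressed
    warnings.warn("still visible", UserWarning)  # other categories still get through

print(value, [str(w.message) for w in caught])
```

Once the `with` block ends, the previous warning filters are restored, so the rest of the script is unaffected.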

import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
import re
import urllib.request
from datetime import datetime
import os
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
os.chdir('./your-directory/')
os.getcwd()
url = 'https://en.wikipedia.org/wiki/Healthcare_in_Europe' 
r = requests.get(url)
HCE = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
type(HCE)
# Use print(HCE.prettify()) to display the prettified markup
htmlpage = urllib.request.urlopen(url)
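# The original article requests the page a second time with urllib here, which is not actually necessary
# re.findall('<table class="([^"]*)"', r.text) finds the class names of all tables directly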
lst = []
for line in htmlpage:
    line = line.rstrip()
    if re.search('table class', line.decode('utf-8')) :
        lst.append(line)
print(lst)
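The loop above only serves to discover which `class` values the page's tables use. Below is a minimal sketch of the shorter regular-expression route, run on a made-up HTML fragment. The `sample_html` string is an assumption for illustration, not the real Wikipedia markup.

```python
import re

# Made-up HTML fragment standing in for r.text
sample_html = """
<table class="wikitable sortable"><tr><th>Country</th></tr></table>
<p>Some text</p>
<table class="navbox"><tr><td>links</td></tr></table>
"""

# Capture the value of the class attribute of every <table> tag
table_classes = re.findall(r'<table class="([^"]*)"', sample_html)
print(table_classes)
```

This pattern only matches when `class` is the first attribute of the tag. A parser-based search such as `find_all('table')` is more robust on real pages.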
table = HCE.find('table', {'class': 'wikitable sortable'})  # attrs must be a dict, not a set
type(table)
headers= [header.text for header in table.find_all('th')]
print(headers)
rows = []
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])
for n, r in enumerate(rows):
    print(n, r)
    if n > 6: break
df1 = pd.DataFrame(rows, columns=headers)
df1.head(7)
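The scraping steps above (collect the `th` texts as headers and the `td` texts as rows) can be tried offline on a small table. Below is a minimal sketch that assumes a made-up `sample_html` string with fictional countries and scores. Unlike the code above, it skips the header row right away and strips whitespace from every cell.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table; the country names and scores are placeholders, not real data
sample_html = """
<table class="wikitable sortable">
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>Country A </td><td>10</td></tr>
  <tr><td>Country B </td><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it yields an empty list
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
print(df)
```

Skipping the empty list is a design choice. The code above keeps it instead, which produces an all-empty first row that has to be dropped later.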
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita' 
r = requests.get(url)
HCE = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
second_table_class_name = re.findall('<table class="([^"]*)"', r.text)[1]
table = HCE.find('table', {'class': second_table_class_name})  # attrs must be a dict, not a set
headers= [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])
df2 = pd.DataFrame(rows, columns=headers)
df2.head(7)
df1.dtypes
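# convert_objects() was removed in pandas 1.0; pd.to_numeric is used here instead.
# The first column holds the country names, so only the remaining columns are converted.
for df in (df1, df2):
    for col in df.columns[1:]:
        df[col] = pd.to_numeric(df[col], errors='coerce')  # unparseable cells become NaN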
df2.dtypes
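Scraped cells arrive as strings, so the columns have to be converted before any arithmetic or plotting. Below is a minimal sketch of that conversion with `pd.to_numeric`, run on a made-up frame. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up frame that mimics scraped cells: every value is a string
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "y2014": ["4500", "1200\n", "n/a"],
})
print(df.dtypes)  # y2014 is not numeric yet

# Strip stray whitespace first; errors="coerce" turns unparseable cells into NaN
df["y2014"] = pd.to_numeric(df["y2014"].str.strip(), errors="coerce")
print(df.dtypes)  # y2014 is now a float column
print(df)
```

With `errors="coerce"`, a cell such as `"n/a"` becomes `NaN` instead of raising an exception. That fits scraped tables, where a few cells often hold footnotes or placeholders.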
print(df1.columns)
df1.columns = ['Country', 'Ranking', 'Patientrights', 'Accessibility', 'Outcomes', 'Range', 'Prevention', 'Pharmaceuticals']
df1.describe()
df1 = df1.drop(df1.index[0])
df2 = df2.drop(df2.index[0])
df2.columns = ['Country', 'y2012', 'y2013', 'y2014', 'y2015']
df2.head(5)
pd.merge(df1,df2, how='left', on='Country').head(10)

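After the merge, many values are NaN (missing data), which indicates that the country names in the two tables do not match. Let us look at which countries the two tables have in common: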

set(df1['Country']) & set(df2['Country'])

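The result is an empty set, although the two tables actually share quite a few countries. We write both tables to CSV files and open them to take a look.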

df1.to_csv('df1example.csv', sep=",")
df2.to_csv('df2example.csv', sep=",")

Open df1example.csv in Excel and save a screenshot as Screenshot.png; strange characters show up in the Country column.

# Optional: the original article loads and displays the screenshot, which is unnecessary
img = mpimg.imread('Screenshot.png')
plt.figure(figsize = (9, 9))
plt.axis('off')
plt.imshow(img)
plt.show()
repr(df1.Country)
repr(df2['Country'])
df1.Country = df1.Country.apply(lambda x: x.strip())
repr(df1['Country'])
set(df1['Country']) & set(df2['Country'])
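The effect of invisible characters on a join can be reproduced in isolation. Below is a minimal sketch that assumes two made-up frames, `left` and `right`, with fictional countries. It uses `repr()` to expose a trailing newline and the `indicator=True` option of `pd.merge` to show which rows found a partner.

```python
import pandas as pd

# Made-up frames; the key column of `left` carries a trailing newline
left = pd.DataFrame({"Country": ["Country A\n", "Country B\n"], "Outcomes": [10, 20]})
right = pd.DataFrame({"Country": ["Country A", "Country B"], "y2014": [100, 200]})

# repr() exposes characters that print() renders invisibly
print([repr(name) for name in left["Country"]])

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
before = pd.merge(left, right, how="left", on="Country", indicator=True)

# Strip the whitespace from the key column and merge again
left["Country"] = left["Country"].str.strip()
after = pd.merge(left, right, how="left", on="Country", indicator=True)

print(before["_merge"].tolist())
print(after["_merge"].tolist())
```

Before stripping, every row is marked `left_only` and `y2014` is NaN. After stripping, the rows are marked `both` and the expenditure values are filled in.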
pd.merge(df1,df2, how='left', on='Country')
df3 = pd.merge(df1,df2, how='left', on='Country')
df3.dropna(how='any', inplace=True)
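# convert_objects() was removed in pandas 1.0; convert every column except Country explicitly
num_cols = df3.columns.drop('Country')
df3[num_cols] = df3[num_cols].apply(pd.to_numeric, errors='coerce')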
import matplotlib as mpl
import matplotlib.pyplot as plt
from adjustText import adjust_text

def plot_df3(adjust=True):
    mpl.rcParams['font.size'] = 12.0
    plt.figure(figsize = (14, 14))
    plt.scatter(df3.Patientrights, df3.Outcomes, facecolors='none', edgecolors='red', linewidth=1.2, s=0.1*df3.y2014)
    texts = []
    plt.title('Relation between different health parameters')
    plt.xlabel('Patient rights')
    plt.ylabel('Outcomes')
    plt.xlim(0, 40)
    plt.ylim(0, 40)

    for x, y, s in zip(df3['Patientrights'], df3['Outcomes'], df3['Country']):
        texts.append(plt.text(x, y, s, size=12))
    if adjust:
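        # adjust_text moves overlapping labels apart; its return value is not needed,
        # so the title set above is kept instead of being overwritten
        adjust_text(texts, arrowprops=dict(arrowstyle="-", color='black', lw=0.5))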
_ = plot_df3()
plt.show()

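The plot shows that countries ranking high on patient rights also tend to rank relatively high on outcomes.
Countries with higher health expenditure (the larger circles) tend to rank relatively high on both outcomes and patient rights.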
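A visual impression like this can be cross-checked with a correlation matrix. Below is a minimal sketch using `DataFrame.corr` on made-up numbers. The `scores` frame is an assumption for illustration and does not reproduce the scraped data.

```python
import pandas as pd

# Made-up values that only mimic the shape of the merged table
scores = pd.DataFrame({
    "Patientrights": [10, 15, 20, 25, 30],
    "Outcomes": [12, 14, 22, 24, 33],
    "y2014": [500, 900, 2000, 2500, 4000],
})

# Pearson correlation between every pair of numeric columns
corr = scores.corr()
print(corr.round(2))
```

Applied to the real `df3`, the same call would show whether patient rights, outcomes and expenditure are positively correlated, which is what the scatter plot suggests.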

References

[1]. Data is the new oil. https://www.changethislimited.co.uk/2017/01/data-is-the-new-oil/
[2]. Data on internet. https://www.livescience.com/54094-how-big-is-the-internet.html
[3]. Beautiful Soup. https://www.crummy.com/software/BeautifulSoup/bs4/
[4]. Healthcare Europe. https://en.wikipedia.org/wiki/Healthcare_in_Europe
[5]. Visualizing European healthcare using Tableau. https://rrighart.github.io/HE-Tableau/
[6]. Scraping tables. https://stackoverflow.com/questions/17196018/extracting-table-contents-from-html-with-python-and-beautifulsoup
[7]. Health expenditure. https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita
[8]. Hidden characters. https://stackoverflow.com/questions/31341351/how-can-i-identify-invisible-characters-in-python-strings
[9]. Adjust text package. https://github.com/Phlya/adjustText/blob/master/examples/Examples.ipynb
