Parsing Table Data from a Web Page

2017-07-24  无事扯淡

1. Task

Parse the table on the following web page into a pandas DataFrame:
https://en.wikipedia.org/wiki/Harvard_University


2. Method

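import requests
import pandas as pd
from lxml import etree

# Download the page and parse it into an element tree
response = requests.get('https://en.wikipedia.org/wiki/Harvard_University')
html = etree.HTML(response.text)

# Take the first table with class "wikitable".
# xpath() is a method of the parsed tree, not of the etree module.
table = html.xpath('//table[@class="wikitable"]')[0]
# './/tr' also matches rows that sit inside a <tbody>
tr_array = table.findall('.//tr')

# Collect the text of every cell, row by row
texts = []
for tr in tr_array:
    line = []
    for c in tr.iterchildren():
        # c.text is None when the cell has no direct text
        line.append((c.text or '').strip())
    texts.append(line)

# The first row holds the column names, the first column the row labels
col_names = texts[0][1:]
index_names = [t[0] for t in texts[1:]]

# Turn 'N/A' into None and percentages such as '45%' into integers
values = []
for line in texts[1:]:
    row = []
    for v in line[1:]:
        if v == 'N/A':
            v = None
        elif v.endswith('%'):
            v = int(v[:v.rfind('%')])
        row.append(v)
    values.append(row)

students = pd.DataFrame(values, columns=col_names, index=index_names)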
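The cell-cleaning rule used in the loop above can be pulled out into a small helper and tried on hand-written strings, without downloading anything. This is only a sketch: the name `parse_cell` and the sample values are hypothetical and are not taken from the Wikipedia page.

```python
def parse_cell(text):
    """Convert one raw table cell to None, an int percentage, or a stripped string."""
    # A cell without direct text comes back from lxml as None
    if text is None:
        return None
    v = text.strip()
    if v == 'N/A':
        return None
    if v.endswith('%'):
        # Drop the trailing percent sign; int() expects a whole number such as '45'
        return int(v[:-1])
    return v


# Hypothetical sample row, made up for illustration
raw_row = ['45%\n', ' N/A ', '12%', 'Total']
print([parse_cell(c) for c in raw_row])
```

Keeping the rule in one function makes it easier to extend later, for example if the page starts using fractional percentages, which `int()` would reject.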
Data conversion
>students.dtypes
Undergraduate      int64
Graduate           int64
U.S. Census      float64
dtype: object

Replace the NaN values with 0 and convert every column to int:

dfclearn = students.fillna(0).astype('int64')
Data type conversion
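To see why one column was float64 in the first place, and what the conversion does to it, here is a small stand-alone sketch. The values and the names `demo`, `cleaned` and `nullable` are made up for illustration; only `fillna` and `astype` come from the post. Filling with 0 makes a missing value indistinguishable from a real 0, so the sketch also shows pandas' nullable `Int64` dtype as one possible alternative.

```python
import pandas as pd

# Made-up values; None stands for a cell that was 'N/A' in the table
demo = pd.DataFrame(
    {'Undergraduate': [10, 20], 'U.S. Census': [5, None]},
    index=['Group A', 'Group B'],
)

# A numeric column with a missing value is stored as float64 (None becomes NaN)
print(demo.dtypes)

# Same conversion as in the post: fill the gaps with 0, then cast to int64
cleaned = demo.fillna(0).astype('int64')
print(cleaned.dtypes)

# Alternative: the nullable integer dtype keeps the missing value as <NA>
nullable = demo.astype('Int64')
print(nullable)
```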