Python for Data Analysis: Data Wrangling

2019-08-06 皮皮大

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

Hierarchical indexing

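Data is often scattered across different files or databases
Hierarchical indexing gives a single axis multiple (two or more) index levels
It lets you work with higher-dimensional data in a lower-dimensional form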

# Create a Series whose index is a list of lists (one list per index level)
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

data['b']

# Partial indexing selects a subset of the data
# Slice form
data['b':'c']

# List form
data.loc[['b', 'c']]

data.loc[['b', 'd']]
data.loc[:, 2]
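
One way to see the "lower-dimensional form" idea is to pivot the inner index level into columns with unstack(). The following is a minimal sketch, assuming a Series shaped like data above; the name wide and the fixed random seed are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

# Same two-level index as the Series above; the seed is only for repeatability.
rng = np.random.default_rng(0)
data = pd.Series(rng.standard_normal(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Move the inner level (1, 2, 3) into columns.
# Combinations that do not exist in the index, such as ('b', 2), become NaN.
wide = data.unstack()
print(wide)
```

The result is a 4 x 3 DataFrame with the outer level as rows and the inner level as columns.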

# With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

# Name the row index levels
frame.index.names = ['key1', 'key2']
# Name the column levels
frame.columns.names = ['state', 'color']
frame

Reordering and sorting levels

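Rearrange the order of the levels on an axis
Sort the data by the values in a specific level
swaplevel() takes two level numbers or names and returns a new object with those levels swapped (the data itself is unchanged)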
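
The swaplevel() call described above can be sketched as follows. This assumes the frame built earlier with named levels; swapped and swapped_sorted are illustrative names only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swap the two row levels by name; the rows stay in their original order.
swapped = frame.swaplevel('key1', 'key2')

# Swap by position, then sort on the new outermost level (key2).
swapped_sorted = frame.swaplevel(0, 1).sort_index(level=0)
print(swapped_sorted)
```

Swapping only relabels the index; sorting afterwards is what actually groups the rows by the new outer level.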

# level=0 sorts by the first index level, key1
frame.sort_index(level=0)

# level=1 sorts by the second index level, key2
frame.sort_index(level=1)

Summary statistics by level

The level option specifies which index level to aggregate by
It can be applied to either the rows or the columns
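
A possible sketch of summing by level. Older pandas versions (as used in the book) accepted frame.sum(level='key2'); that keyword was removed in pandas 2.0, so this example assumes a recent pandas and uses groupby(level=...) instead. The column-wise version goes through a transpose, which is one workaround among several.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```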

Combining and merging datasets

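pandas.merge: connects rows in different DataFrames based on one or more keys, similar to a database join
pandas.concat: stacks objects together along an axis
The combine_first method: splices overlapping data together, filling missing values in one object with values from the other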

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})

# By default, merge joins on the overlapping column name (here, key); passing on= makes this explicit
pd.merge(df1,df2)
pd.merge(df1, df2, on='key')

merge

The default is an inner join
The keys in the result are the intersection: only a and b appear in both DataFrames
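
To make the join types concrete, here is a small sketch that reuses df1 and df2 from above and varies the how argument. The variable names inner, outer and left are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Inner join (the default): only keys present in both frames survive.
inner = pd.merge(df1, df2, on='key')

# Outer join: union of the keys; unmatched rows are filled with NaN.
outer = pd.merge(df1, df2, on='key', how='outer')

# Left join: every row of df1 is kept, matched or not.
left = pd.merge(df1, df2, on='key', how='left')

print(len(inner), len(outer), len(left))
```

Here c exists only in df1 and d only in df2, so both are dropped by the inner join and kept by the outer join.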

Merging on index

Sometimes the join key of a DataFrame is in its index
Pass left_index=True or right_index=True (or both) to use the index as the join key
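
A minimal sketch of merging a column against an index. The frames left1 and right1 and their values are made up for illustration, in the style of the book's examples.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Join the 'key' column of left1 against the index of right1.
merged = pd.merge(left1, right1, left_on='key', right_index=True)

# An outer join keeps the unmatched key 'c' with a NaN group_val.
merged_outer = pd.merge(left1, right1, left_on='key',
                        right_index=True, how='outer')
print(merged)
```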

join()

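Merges by index
Can combine several DataFrame objects with the same or similar indexes, provided their columns do not overlap
Defaults to a left join, preserving the row index of the left (calling) DataFrame
For simple index-on-index merges, a list of DataFrames can be passed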
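
The points above can be sketched as follows. The frames left2, right2 and another are illustrative data with non-overlapping column names, which is what join() expects unless suffixes are supplied.

```python
import pandas as pd

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])

# Default left join: the result keeps left2's index (a, c, e).
joined = left2.join(right2)

# Passing a list joins several frames on their indexes in one call.
joined_many = left2.join([right2, another])

# how='outer' keeps the union of all indexes instead.
joined_outer = left2.join([right2, another], how='outer')
print(joined_many)
```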

Concatenating along an axis

Also known as concatenation, binding, or stacking
NumPy's concatenate() function does this for arrays
pandas' concat() function does this for Series and DataFrames
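
A short sketch of both functions. The arrays and Series here are illustrative; the keys argument shown at the end is one way to label each input piece with an outer index level.

```python
import numpy as np
import pandas as pd

# NumPy: glue two arrays together side by side.
arr = np.arange(12).reshape((3, 4))
side_by_side = np.concatenate([arr, arr], axis=1)

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# pandas: the default axis=0 stacks the values and indexes end to end.
stacked = pd.concat([s1, s2, s3])

# axis=1 places each Series in its own column; gaps are filled with NaN.
wide = pd.concat([s1, s2, s3], axis=1)

# keys adds an outer index level identifying the source of each piece.
keyed = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
print(keyed)
```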

Combining data with overlap

Two datasets whose indexes overlap in full or in part

NumPy's where function: a vectorized equivalent of if-else
Series has a combine_first method

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])

# Use .iloc for positional access; a bare integer key on a string index is ambiguous
b.iloc[-1] = np.nan

np.where(pd.isnull(a), b, a)

# combine_first aligns on the index and patches missing values from the other Series
b[:-2].combine_first(a[2:])
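
DataFrame has a combine_first method as well, which patches column by column. A minimal sketch with made-up frames (left_df, right_df and filled are illustrative names):

```python
import numpy as np
import pandas as pd

left_df = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                        'b': [np.nan, 2., np.nan, 6.],
                        'c': range(2, 18, 4)})
right_df = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                         'b': [np.nan, 3., 4., 6., 8.]})

# Values from left_df win; holes are filled from right_df where available.
# The result covers the union of both indexes and both sets of columns.
filled = left_df.combine_first(right_df)
print(filled)
```

Row 4 exists only in right_df, so column c, which exists only in left_df, is NaN there.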