The Way of Python: Everyday Pandas (1)

2019-04-24 oopp8

Introduction

Quick start

Pandas---Series

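A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, and so on). Its axis labels are collectively called the index.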

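Series structure

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5))
print(s)  # view the data
print(s.index)  # .index returns the Series index
print(s.values)  # .values returns the Series values as an ndarray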

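Creating a Series

# From a dict: the keys become the index, and the values can be of any type
dic = {1: 'a', 2: ('a', 'b'), 3: 'c', 4: 4.56, 5: [1, 2, 3]}
pd.Series(dic)
# Keys of mixed types (str and int) are also accepted
dic = {'1': 'a', 2: ('a', 'b'), '3': 'c', 4: 4.56, 5: [1, 2, 3]}
pd.Series(dic)

pd.Series(np.random.randn(4), name='name attribute')  # the name attribute
pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'], dtype=object)  # np.object was removed from NumPy; use the builtin object

pd.Series(10, index=range(4))  # a scalar is repeated for every label, so an index must be provided

d = pd.Series(np.random.randn(4), name='name attribute')
print(d.name)
d.rename('xxx', inplace=True)  # inplace=True renames d itself
dd = d.rename('yyy')  # without inplace, a renamed copy is returned and d keeps its name
print(d.name, dd.name)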

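Series indexing

d = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print(d.iloc[1])  # select by position; d[1] on a labeled index is deprecated in recent pandas
print(d.iloc[1:3])  # positional slice
print(d.iloc[1:3][::-1])  # the same slice in reverse order
print(d[['c', 'd']])  # select several labels

# None is a missing object and NaN is a missing number; pandas treats both as null
d.isnull()
d.notnull()
d.dtype
d[d > 3]  # boolean indexing
d[(d > 3) & (d < 7)]  # combine conditions with & rather than chaining two [] lookups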
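The null checks and the combined boolean mask can be tried on a small Series with known values. This is only a sketch; the data and the name `s` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: both None and np.nan end up as missing values
s = pd.Series([1.0, None, np.nan, 4.0, 6.0])

print(s.isnull().tolist())   # [False, True, True, False, False]
print(s.notnull().sum())     # 3

# Comparisons against a missing value are False, so those rows drop out
print(s[(s > 3) & (s < 7)].tolist())   # [4.0, 6.0]
```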

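Basic Series operations

dd = pd.Series([22, 33], index=['d1', 'd2'])
d = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
d.head(3)  # .head() returns the first rows
d.tail(3)  # .tail() returns the last rows
d.reindex([1, 2, 3, 'a', 'b', 'c'], fill_value=0)  # conform to a new index; labels not in d get fill_value (NaN if it is omitted)
d.drop(['a', 'b'])  # .drop() returns a copy without those labels
d.drop(['a', 'b'], inplace=True)  # inplace=True modifies d itself
d['a'] = 11  # add or modify an element
pd.concat([d, dd])  # join two Series; Series.append() was removed in pandas 2.0

s1 = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
s2 = pd.Series(np.random.rand(3), index=['d', 'a', 'b'])
print(s1)
print(s2)
print(s1 + s2)  # values are aligned by label; labels found in only one Series give NaN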
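With fixed numbers instead of random ones, the label alignment is easier to see. The sketch below uses made-up Series `a` and `b`; `Series.add` with `fill_value` is one way to avoid the NaN results, shown here as an option rather than as part of the original notes.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["d", "a", "b"])

total = a + b                     # labels are matched, not positions
print(total)                      # 'c' and 'd' exist on one side only, so they are NaN

filled = a.add(b, fill_value=0)   # treat a missing label as 0 instead
print(filled)
```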

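Pandas---DataFrame

A DataFrame is a two-dimensional data structure: data is arranged in a table of rows and columns.

Features of a DataFrame:
1. Columns can be of different types
2. Size is mutable
3. Labeled axes (rows and columns)
4. Arithmetic can be performed on rows and columns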
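The four features can be seen together in a short sketch. The table contents and the column names (`name`, `age`, `score`, `age_plus_score`) are invented for the example.

```python
import pandas as pd

# Columns of different types, with labeled rows and columns
df = pd.DataFrame({"name": ["Jack", "Tom"], "age": [11, 22]},
                  index=["r1", "r2"])
print(df.dtypes)

df["score"] = [90.5, 78.0]                       # size is mutable: add a column
df["age_plus_score"] = df["age"] + df["score"]   # arithmetic between columns

numeric = df[["age", "score"]]
print(numeric.sum(axis=0))   # totals down each column
print(numeric.sum(axis=1))   # totals across each row
```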

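DataFrame constructor

Parameter	Description
data	The data, in one of several forms: `ndarray`, `Series`, `map`, `list`, `dict`, a constant, or another DataFrame.
index	Row labels for the resulting frame. Optional; defaults to `np.arange(n)` when no index is passed.
columns	Column labels. Optional; defaults to `np.arange(n)` when no column labels are passed.
dtype	Data type of each column.
copy	Whether to copy the input data. The default was `False` in the pandas release current when this was written.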
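A minimal sketch of how `index`, `columns`, and `dtype` affect the result, using a small made-up array called `values`:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(2, 3)

# No labels passed: rows and columns are numbered from 0
default_labels = pd.DataFrame(values)
print(default_labels.index.tolist(), default_labels.columns.tolist())

# Explicit labels and a forced dtype
labeled = pd.DataFrame(values, index=["r1", "r2"],
                       columns=["a", "b", "c"], dtype=float)
print(labeled)
print(labeled.dtypes)
```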

Creating a DataFrame

# A DataFrame has an index (row labels) and columns (column labels)
# From a dict of lists
data1 = {'name': ['Jack', 'Tom', 'Mary'],
         'age': [11, 22, 33],
         'gender': ['m', 'm', 'f']}
df1 = pd.DataFrame(data1)
df1  # view the data; the type is DataFrame
type(df1)
df1.index  # .index returns the row labels
list(df1.columns)  # .columns returns the column labels
df1.values  # .values returns the values as an ndarray

# From a dict of arrays
data2 = {'one': np.random.rand(3),
         'two': np.random.rand(3)}
pd.DataFrame(data2)
pd.DataFrame(data2, columns=['v', 'c'])  # neither name is a key of data2, so no values are picked up
pd.DataFrame(data2, columns=['one', 'b', 'c'])  # 'one' is kept; 'b' and 'c' are filled with NaN
pd.DataFrame(data2, index=['a', 'b', 'c'])  # the index length must match the arrays

# From a two-dimensional array
ar = np.random.rand(12).reshape(3, 4)
pd.DataFrame(ar, index=['a', 'b', 'c'], columns=['one', 'two', 'three', 'four'])

# From a list of dicts: each dict becomes a row
data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
pd.DataFrame(data, index=['a', 'b'], columns=['one', 'two'])

# From a dict of dicts: outer keys become columns, inner keys become the row index, missing entries are NaN
data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Marry': {'math': 82, 'english': 95, 'art': 92},
        'Tom': {'math': 78, 'english': 67}}
pd.DataFrame(data)

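Selecting from a DataFrame

ar = np.array(range(12)).reshape(3, 4)
df = pd.DataFrame(ar, index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])
# df[] selects columns by default; put column names inside [] (so columns are usually named explicitly rather than left as default integers, which could be confused with the index)
# Selecting one column returns a Series
# Selecting several columns returns a DataFrame
df['a']
df[['a', 'c']]
df.loc['one']  # select a row by index label
df.loc[['one', 'two']]  # select several rows by label
df.iloc[2]  # select a row by integer position (from 0 to length-1 along the axis)
df.iloc[[0, 2]]  # several positions
df.iloc[1:3]  # positional slice

# With integers, df[] selects rows, but only as a slice; a single position such as df[0] does not work
# df[] cannot select a row by its index label (df['one'])
# Key point: df[col] is normally used to select columns by name
df[:2]
# df['one']  # raises KeyError

df < 5  # test every value
df[df < 5]  # keeps the shape: values where the mask is True are kept, the rest become NaN
df[df[['a', 'b']] > 5]  # mask built from columns a and b only
df[df.loc[['one', 'three']] < 5]  # mask built from rows one and three only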
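Rows and columns can also be selected in a single `.loc` or `.iloc` call, which the notes above do not show. The sketch below rebuilds the same 3×4 frame so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=["one", "two", "three"],
                  columns=["a", "b", "c", "d"])

print(df.loc["two", "b"])    # one cell by labels: 5
print(df.iloc[1, 1])         # the same cell by positions: 5

by_label = df.loc["one":"two", "a":"b"]   # label slices include both endpoints
by_position = df.iloc[0:2, 0:2]           # position slices exclude the end
print(by_label.equals(by_position))       # True

# Boolean row filter combined with a list of columns
subset = df.loc[df["a"] > 0, ["a", "d"]]
print(subset)
```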

Sorting a DataFrame

df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100, columns=['a', 'b', 'c', 'd'])
df.sort_values(['a'], ascending=True)  # ascending
df.sort_values(['a'], ascending=False)  # descending
df['z'] = 100
df.loc[2:3, 'z'] = 10  # set rows and column in one .loc call; chained df.loc[2:3]['z'] = 10 may assign to a temporary copy
df.sort_values(['z', 'a'], ascending=True)  # sort by z, then by a

df.sort_index()  # sort by index; defaults are ascending=True, inplace=False
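Because the frame above is random, the ordering is hard to verify by eye. A small sketch with fixed, made-up values shows a multi-column sort with a different direction per column, and that both sort methods return a new object:

```python
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"],
                   "score": [3, 1, 4, 2]},
                  index=[3, 1, 0, 2])

# Sort by group ascending, then by score descending within each group
by_values = df.sort_values(["group", "score"], ascending=[True, False])
print(by_values)

by_index = df.sort_index()
print(by_index)

print(df.index.tolist())   # [3, 1, 0, 2]: the original order is untouched
```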

Pandas---Time series

Time series: Timestamp

import datetime
dt = datetime.datetime(2019, 4, 18, 15, 45, 30)
pd.Timestamp(dt)  # a pandas point in time
pd.Timestamp('2019-04-18')
pd.Timestamp('2019-04-18 15:22:11')

Time series index

pd.to_datetime(dt)  # a single value is converted to a Timestamp
lst_date = ['2019-04-1', '2019-04-2', '2019-04-3']
pd.to_datetime(lst_date)  # several values are converted to a DatetimeIndex

# Build a DatetimeIndex directly; str and datetime.datetime values are accepted
di = pd.DatetimeIndex(
    ['4/1/2019', '4/2/2019', '4/3/2019', '4/4/2019', '4/5/2019'])
pd.Series(np.random.rand(len(di)), index=di)  # a Series indexed by a DatetimeIndex

pd.date_range('4/1/2019', '4/10/2019', normalize=True)
pd.date_range('2017/1/1', '2017/1/2', freq='H', inclusive='left')  # H: hourly (pandas 2.2+ prefers 'h'); inclusive replaced closed, which was removed in pandas 2.0
# pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, inclusive='both', **kwargs)
# start: first timestamp
# end: last timestamp
# periods: number of timestamps to generate
# freq: frequency; when omitted, pd.date_range() uses calendar days and pd.bdate_range() uses business days
# inclusive: 'both' (default) includes both endpoints, 'left' excludes the end, 'right' excludes the start
# tz: time zone
pd.date_range(start='4/1/2019', periods=10)
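As a quick check of the parameters above, the same ten days can be produced either from `start` and `end` or from `start` and `periods`. The comparison with `pd.bdate_range` is an illustrative addition.

```python
import pandas as pd

by_end = pd.date_range(start="2019-04-01", end="2019-04-10")
by_periods = pd.date_range(start="2019-04-01", periods=10)
print(len(by_end), len(by_periods))   # 10 10

# Business days only: 2019-04-06 and 2019-04-07 fall on a weekend
business = pd.bdate_range(start="2019-04-01", end="2019-04-10")
print(len(business))                  # 8
```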

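Many string aliases are given to common time series frequencies. These are called offset aliases.

Alias	Description
B	business day frequency
BQS	business quarter start frequency
D	calendar day frequency
A	year end frequency
W	weekly frequency
BA	business year end frequency
M	month end frequency
BAS	business year start frequency
SM	semi-month end frequency
BH	business hour frequency
BM	business month end frequency
H	hourly frequency
MS	month start frequency
T, min	minutely frequency
SMS	semi-month start frequency
S	secondly frequency
BMS	business month start frequency
L, ms	milliseconds
Q	quarter end frequency
U, us	microseconds
BQ	business quarter end frequency
N	nanoseconds
QS	quarter start frequency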
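A few of these aliases can be compared side by side. The sketch below is limited to `D`, `B`, `W`, `MS`, and `QS`; the start date and the dict name `ranges` are chosen only for illustration.

```python
import pandas as pd

start = "2019-01-01"   # a Tuesday

ranges = {
    alias: pd.date_range(start, periods=3, freq=alias)
    for alias in ["D", "B", "W", "MS", "QS"]
}

for alias, index in ranges.items():
    # Print each alias next to the three dates it generates
    print(alias, index.strftime("%Y-%m-%d").tolist())
```

Note that `W` is anchored on Sunday by default, so its first date is 2019-01-06 rather than the start date.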

Time series index: Period

p = pd.Period('2019', freq='M')  # a monthly Period; '2019' resolves to 2019-01
p.asfreq('M', how='start')  # Period.asfreq(freq, how=...) converts to another frequency
p.asfreq('D', how='end')  # the last day of the month
pr = pd.period_range('1/1/2018', '1/1/2019', freq='M')  # a PeriodIndex
ts = pd.Series(np.random.rand(len(pr)), index=pr)

dates = pd.DatetimeIndex(['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/1/2019', '1/2/2019'])
ts = pd.Series(np.random.rand(6), index=dates)
ts.groupby(level=0).mean()  # group by the index and average the duplicated dates

# Converting between timestamps and periods: .to_period() and .to_timestamp()
p1 = pd.date_range('2019/1/1', periods=10, freq='M')  # month ends; pandas 2.2+ spells this alias 'ME'
p2 = pd.period_range('2018', '2019', freq='M')

ts1 = pd.Series(np.random.rand(len(p1)), index=p1)
print(ts1.head())
print(ts1.to_period().head())
# the last day of each month becomes the month itself

ts2 = pd.Series(np.random.rand(len(p2)), index=p2)
print(ts2.head())
print(ts2.to_timestamp().head())
# each month becomes its first day
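The round trip between timestamps and periods can be checked with fixed values. This sketch uses month-start timestamps (the `MS` alias) so that converting back with `how="start"` returns the original index; the names `stamps`, `periods`, and `back` are illustrative.

```python
import pandas as pd

# Month-start timestamps converted to monthly periods
stamps = pd.Series([1, 2, 3],
                   index=pd.date_range("2019-01-01", periods=3, freq="MS"))
periods = stamps.to_period("M")
print(periods.index)

# Back to timestamps: how="start" picks the first instant of each period
back = periods.to_timestamp(how="start")
print(back.index)
```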

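Resampling time series

Resampling converts a time series from one frequency to another, combining or filling data along the way.
Downsampling: high-frequency data → low-frequency data, e.g. daily data converted to monthly data
Upsampling: low-frequency data → high-frequency data, e.g. yearly data converted to monthly data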
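A small sketch of both directions with fixed numbers. Daily data and 3-day bins are used here instead of the month and year examples, purely to keep the output short.

```python
import numpy as np
import pandas as pd

daily = pd.Series(np.arange(6), index=pd.date_range("2019-01-01", periods=6))

# Downsampling: every 3 daily values are combined into one (here, summed)
down = daily.resample("3D").sum()
print(down.tolist())   # [3, 12]

# Upsampling: new timestamps appear and hold no data until they are filled
sparse = daily.iloc[::2]             # values on Jan 1, 3 and 5 only
up = sparse.resample("D").asfreq()
print(up.tolist())     # [0.0, nan, 2.0, nan, 4.0]
```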

rng = pd.date_range('20190101', periods=28)  # a daily DatetimeIndex
ts = pd.Series(np.arange(len(rng)), index=rng)
ts5d = ts.resample('5D')  # a Resampler object with 5-day bins
ts5d.mean()  # mean
ts5d.max()  # maximum
ts5d.min()  # minimum
ts5d.median()  # median
ts5d.first()  # first value in each bin
ts5d.last()  # last value in each bin
ts5d.ohlc()  # open, high, low, close of each bin

prng = pd.period_range('2017', '2019', freq='M')
ts = pd.Series(np.arange(len(prng))+1, index=prng)
ts.resample('15D').ffill()  # upsampling
ts.resample('Y').mean()  # downsampling

# Low to high frequency: the main question is how to fill the new rows
rng = pd.date_range('2019/1/1 0:0:0', periods=5, freq='H')
ts = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=rng, columns=['a', 'b', 'c'])
ts.resample('15min').asfreq()  # .asfreq(): no filling, new rows are NaN
ts.resample('15min').ffill()  # .ffill(): forward fill from the previous value
ts.resample('15min').bfill()  # .bfill(): backward fill from the next value
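The three fill strategies are easiest to compare on a tiny series with a single gap. The values and the names `s` and `r` below are made up for the sketch.

```python
import pandas as pd

# Two observations with one missing day between them
s = pd.Series([10, 20], index=pd.to_datetime(["2019-01-01", "2019-01-03"]))
r = s.resample("D")

print(r.asfreq().tolist())   # [10.0, nan, 20.0]: the gap stays empty
print(r.ffill().tolist())    # [10, 10, 20]: the gap takes the previous value
print(r.bfill().tolist())    # [10, 20, 20]: the gap takes the next value
```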
