Scientific Computing Study Series 02: Pandas
2019-07-01
我问你瓜保熟吗
- Generating a range of dates
In [50]: pd.date_range('20190701',periods=6)
Out[50]:
DatetimeIndex(['2019-07-01', '2019-07-02', '2019-07-03', '2019-07-04',
'2019-07-05', '2019-07-06'],
dtype='datetime64[ns]', freq='D')
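As a small sketch of the other common ways to call `pd.date_range`, the example below builds the same six days from a start and an end date, and a shorter range with a two-day step. The variable names are illustrative only.

```python
import pandas as pd

# periods counts the entries; the frequency defaults to one day ('D')
daily = pd.date_range('2019-07-01', periods=6)

# start and end can be given instead of periods; both ends are included
by_end = pd.date_range('2019-07-01', '2019-07-06')

# freq changes the step, here every second day
every_other = pd.date_range('2019-07-01', periods=3, freq='2D')

print(daily.equals(by_end))
print(every_other)
```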
- Series
In [55]: pd.Series([1,3,4,8,-2],index=['a','b','c','d','e'])
Out[55]:
a 1
b 3
c 4
d 8
e -2
dtype: int64
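Once a Series has labels, values can be read back either by label or by position. The following is a minimal sketch using the same data as above; the variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 4, 8, -2], index=['a', 'b', 'c', 'd', 'e'])

by_label = s['d']        # look up a value by its index label
by_position = s.iloc[0]  # look up a value by its integer position
subset = s[['a', 'e']]   # a list of labels returns a new Series

print(by_label, by_position)
print(subset)
```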
- Creating a DataFrame from a NumPy array or from data passed in directly
In [49]: pd.DataFrame(np.random.randn(6,4),index=pd.date_range('20190701',periods=6),columns=['a','b','c','d'])
Out[49]:
a b c d
2019-07-01 -0.943911 0.930244 -1.002432 -1.495716
2019-07-02 -0.529640 0.559569 -0.552342 -1.403447
2019-07-03 1.226341 0.277729 0.014151 0.154364
2019-07-04 -1.767719 -0.798156 -0.555459 -0.746608
2019-07-05 -0.922795 0.592672 0.295197 -0.187842
2019-07-06 1.384318 0.924977 1.320110 -0.784771
In [48]: pd.DataFrame(np.arange(15).reshape(3,5))
Out[48]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
- Building a DataFrame from a dictionary
Each key becomes a column name; the row labels are generated automatically, starting from 0
In [59]: df = pd.DataFrame({'A':1.,'B':pd.Timestamp("20190701"),'C':pd.Series(1,index=list(range(4)),dtype='float32'),'D':np.array([3]*4,dtype='int32'),'E':pd.Categorical(["test","train","test","train"]),'F':'foo'})
Out[59]:
A B C D E F
0 1.0 2019-07-01 1.0 3 test foo
1 1.0 2019-07-01 1.0 3 train foo
2 1.0 2019-07-01 1.0 3 test foo
3 1.0 2019-07-01 1.0 3 train foo
- Viewing the data type of each column
In [61]: df.dtypes
Out[61]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
- Viewing the row index
In [62]: df.index
Out[62]: Int64Index([0, 1, 2, 3], dtype='int64')
- Viewing the column names
In [63]: df.columns
Out[63]: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
- Viewing all the values
In [64]: df.values
Out[64]:
array([[1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2019-07-01 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
- Summary statistics
By default describe() summarizes only the numeric columns
In [66]: df.describe()
Out[66]:
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
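The numeric-only behaviour is a default rather than a hard limit. As a hedged sketch on a small made-up frame (not the `df` from the session above), passing `include='all'` also reports `count`, `unique`, `top` and `freq` for non-numeric columns:

```python
import pandas as pd

# A small illustrative frame with one numeric and one text column
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'E': ['test', 'train', 'test', 'train'],
})

numeric_only = df.describe()             # default: numeric columns only
everything = df.describe(include='all')  # adds unique/top/freq rows for column E

print(numeric_only)
print(everything)
```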
- Transposing: rows become columns and columns become rows
In [67]: df.T
Out[67]:
0 1 2 3
A 1 1 1 1
B 2019-07-01 00:00:00 2019-07-01 00:00:00 2019-07-01 00:00:00 2019-07-01 00:00:00
C 1 1 1 1
D 3 3 3 3
E test train test train
F foo foo foo foo
- Sorting
axis=0 sorts by the row index and axis=1 sorts by the column labels; ascending=False sorts in descending order
In [71]: df.sort_index(axis=0,ascending=False)
Out[71]:
A B C D E F
3 1.0 2019-07-01 1.0 3 train foo
2 1.0 2019-07-01 1.0 3 test foo
1 1.0 2019-07-01 1.0 3 train foo
0 1.0 2019-07-01 1.0 3 test foo
Sorting by the values in column E
In [74]: df.sort_values(by='E')
Out[74]:
A B C D E F
0 1.0 2019-07-01 1.0 3 test foo
2 1.0 2019-07-01 1.0 3 test foo
1 1.0 2019-07-01 1.0 3 train foo
3 1.0 2019-07-01 1.0 3 train foo
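`sort_values` also accepts several columns, each with its own direction. The sketch below uses a small made-up frame to sort by `E` ascending and, within equal `E` values, by `D` descending.

```python
import pandas as pd

# Illustrative data: D differs per row so the secondary sort is visible
df = pd.DataFrame({
    'E': ['test', 'train', 'test', 'train'],
    'D': [3, 1, 2, 4],
})

# Sort by E first (ascending), then by D (descending) within each E group
result = df.sort_values(by=['E', 'D'], ascending=[True, False])

print(result)
```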
II. Selecting data
- Selecting a column
In [76]: dates = pd.date_range('20190701',periods=6)
In [79]: df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
Out[79]:
A B C D
2019-07-01 0 1 2 3
2019-07-02 4 5 6 7
2019-07-03 8 9 10 11
2019-07-04 12 13 14 15
2019-07-05 16 17 18 19
2019-07-06 20 21 22 23
In [81]: df.A
Out[81]:
2019-07-01 0
2019-07-02 4
2019-07-03 8
2019-07-04 12
2019-07-05 16
2019-07-06 20
Freq: D, Name: A, dtype: int32
In [82]: df['A']
Out[82]:
2019-07-01 0
2019-07-02 4
2019-07-03 8
2019-07-04 12
2019-07-05 16
2019-07-06 20
Freq: D, Name: A, dtype: int32
- Slicing rows
In [84]: df[0:3]
Out[84]:
A B C D
2019-07-01 0 1 2 3
2019-07-02 4 5 6 7
2019-07-03 8 9 10 11
In [85]: df['2019-07-05':'2019-07-06']
Out[85]:
A B C D
2019-07-05 16 17 18 19
2019-07-06 20 21 22 23
- Selecting by label
A single row
In [94]: df.loc['20190701']
Out[94]:
A 0
B 1
C 2
D 3
Name: 2019-07-01 00:00:00, dtype: int32
Multiple columns
In [98]: df.loc[:,['A','B']]
Out[98]:
A B
2019-07-01 0 1
2019-07-02 4 5
2019-07-03 8 9
2019-07-04 12 13
2019-07-05 16 17
2019-07-06 20 21
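- Selecting by position
Rows at positions 3 and 4 (the fourth and fifth rows) and columns at positions 1 and 2 (the second and third columns); iloc positions start at 0 and the end of a slice is excluded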
In [100]: df.iloc[3:5,1:3]
Out[100]:
B C
2019-07-04 13 14
2019-07-05 17 18
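The difference between the two selectors is easy to trip over: an `iloc` slice excludes its end position, while a `loc` slice includes its end label. A minimal sketch that rebuilds the same 6x4 frame and selects the same block both ways:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190701', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list('ABCD'))

# Positions 3 and 4 for rows, 1 and 2 for columns (end positions excluded)
by_position = df.iloc[3:5, 1:3]

# The same block by label (end labels included)
by_label = df.loc['2019-07-04':'2019-07-05', 'B':'C']

print(by_position.equals(by_label))
```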
- Filtering by comparison (greater than, less than, equal to)
Select all rows where column A is greater than 8
In [103]: df[df.A>8]
Out[103]:
A B C D
2019-07-04 12 13 14 15
2019-07-05 16 17 18 19
2019-07-06 20 21 22 23
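Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A short sketch on the same 6x4 frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190701', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list('ABCD'))

# Rows where A > 8 and B < 21
both = df[(df.A > 8) & (df.B < 21)]

# Rows where A < 4 or D > 20
either = df[(df.A < 4) | (df.D > 20)]

print(both)
print(either)
```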
III. Modifying data
- By label
In [20]: df.loc['20190701','A']=6
In [21]: df
Out[21]:
A B C D
2019-07-01 6 1 2 3
2019-07-02 4 5 6 7
2019-07-03 8 9 10 11
2019-07-04 12 13 14 15
2019-07-05 16 17 18 19
2019-07-06 20 21 22 23
- By position
In [24]: df.iloc[2,2]=22
In [25]: df
Out[25]:
A B C D
2019-07-01 6 1 2 3
2019-07-02 4 5 6 7
2019-07-03 8 9 22 11
2019-07-04 12 13 14 15
2019-07-05 16 17 18 19
2019-07-06 20 21 22 23
- By condition
In [26]: df.loc[df.A>6,'A']=0
In [27]: df
Out[27]:
A B C D
2019-07-01 6 1 2 3
2019-07-02 4 5 6 7
2019-07-03 0 9 22 11
2019-07-04 0 13 14 15
2019-07-05 0 17 18 19
2019-07-06 0 21 22 23
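Conditional assignment is usually written as one `.loc` call with a boolean mask and a column name, so that pandas writes to the original frame rather than to an intermediate object. The sketch below shows that form, plus `Series.where` as a variant that returns a new Series and leaves the frame untouched. It starts from the unmodified 6x4 frame, so the values differ from the session above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190701', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list('ABCD'))

# Non-mutating: keep values where the condition holds, use 0 elsewhere
replaced = df['A'].where(df['A'] <= 6, 0)

# In place: set A to 0 on every row where A > 6
df.loc[df['A'] > 6, 'A'] = 0

print(replaced.tolist())
print(df['A'].tolist())
```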
IV. Adding columns
- Adding a column E with NaN as its default value
In [29]: df['E']=np.nan
In [30]: df
Out[30]:
A B C D E
2019-07-01 6 1 2 3 NaN
2019-07-02 4 5 6 7 NaN
2019-07-03 0 9 22 11 NaN
2019-07-04 0 13 14 15 NaN
2019-07-05 0 17 18 19 NaN
2019-07-06 0 21 22 23 NaN
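- Adding a column F with specified values; the Series must carry an index that matches the DataFrame's index, because the values are aligned by label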
In [31]: df['F'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20190701',periods=6))
In [32]: df
Out[32]:
A B C D E F
2019-07-01 6 1 2 3 NaN 1
2019-07-02 4 5 6 7 NaN 2
2019-07-03 0 9 22 11 NaN 3
2019-07-04 0 13 14 15 NaN 4
2019-07-05 0 17 18 19 NaN 5
2019-07-06 0 21 22 23 NaN 6
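The index matters because pandas aligns a Series to the frame by label, not by position. As a sketch with illustrative names, a Series that covers only some of the dates leaves NaN in the remaining rows, while a plain list is assigned by position and must have the same length as the frame.

```python
import pandas as pd

dates = pd.date_range('20190701', periods=6)
df = pd.DataFrame({'A': list(range(6))}, index=dates)

# Only the first two dates have values; the other rows become NaN
partial = pd.Series([10, 20], index=dates[:2])
df['F'] = partial

# A list has no labels, so it is assigned row by row in order
df['G'] = [1, 2, 3, 4, 5, 6]

print(df)
```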
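V. Handling missing data
- Dropping rows or columns that contain NaN
axis=0 drops rows and axis=1 drops columns; how='any' drops when at least one value is NaN, how='all' drops only when every value is NaN. The default is 'any'. Here df is the original 6x4 frame with two values replaced by NaN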
In [51]: df
Out[51]:
A B C D
2019-07-01 0 NaN 2.0 3
2019-07-02 4 5.0 NaN 7
2019-07-03 8 9.0 10.0 11
2019-07-04 12 13.0 14.0 15
2019-07-05 16 17.0 18.0 19
2019-07-06 20 21.0 22.0 23
In [52]: df.dropna(axis=0,how='any')
Out[52]:
A B C D
2019-07-03 8 9.0 10.0 11
2019-07-04 12 13.0 14.0 15
2019-07-05 16 17.0 18.0 19
2019-07-06 20 21.0 22.0 23
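The other combinations of `axis` and `how` can be sketched on a small made-up frame. The `subset` argument, which limits the check to particular columns, does not appear in the session above and is included here as an extra.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [0, 4, 8],
    'B': [np.nan, 5.0, 9.0],
    'C': [2.0, np.nan, 10.0],
})

rows_any = df.dropna(axis=0, how='any')  # drop rows containing any NaN
cols_any = df.dropna(axis=1, how='any')  # drop columns containing any NaN
rows_all = df.dropna(axis=0, how='all')  # drop rows that are entirely NaN (none here)
subset_b = df.dropna(subset=['B'])       # only look at column B when deciding

print(rows_any)
print(cols_any)
```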
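- Filling NaN with a specified value
The output below reflects a later state of df, in which A and B on 2019-07-01 and B and C on 2019-07-02 had been set to NaN (those steps are not shown)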
In [60]: df.fillna(value=999)
Out[60]:
A B C D
2019-07-01 999.0 999.0 2.0 3
2019-07-02 4.0 999.0 999.0 7
2019-07-03 8.0 9.0 10.0 11
2019-07-04 12.0 13.0 14.0 15
2019-07-05 16.0 17.0 18.0 19
2019-07-06 20.0 21.0 22.0 23
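A single fill value is not always appropriate. As a sketch on a small made-up frame, `fillna` also accepts a dictionary with one value per column, and `ffill` carries the last valid value forward instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [np.nan, 5.0, 9.0],
    'C': [2.0, np.nan, 10.0],
})

# A different fill value per column: 0 for B, the column mean for C
filled = df.fillna({'B': 0, 'C': df['C'].mean()})

# Forward fill: a leading NaN has nothing before it and stays NaN
forward = df.ffill()

print(filled)
print(forward)
```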
- Checking whether any data is missing
df.isnull().values.any()
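Beyond a single yes or no, it is often useful to see where the gaps are. A brief sketch on a made-up frame, counting the missing values per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [0, 4, 8],
    'B': [np.nan, 5.0, 9.0],
    'C': [2.0, np.nan, 10.0],
})

has_missing = df.isnull().values.any()  # True if any cell is NaN
missing_per_column = df.isnull().sum()  # number of NaN cells in each column

print(has_missing)
print(missing_per_column)
```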
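VI. Reading from and writing to files
- Exporting
df.to_csv('a.csv')
- Importing
By default read_csv adds a new integer row index; the index that was written to the file comes back as an ordinary column unless index_col is given
pd.read_csv('a.csv')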
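To get the original date index back, `read_csv` can be told which column holds it. The sketch below does the round trip through an in-memory `io.StringIO` buffer instead of `a.csv`, which is an assumption made only to keep the example self-contained.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('20190701', periods=3)
df = pd.DataFrame(np.arange(6).reshape((3, 2)), index=dates, columns=['A', 'B'])

buffer = io.StringIO()  # stands in for a file on disk
df.to_csv(buffer)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # saved index becomes the column 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)  # index restored as dates

print(default_read.columns.tolist())
print(restored)
```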
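VII. Combining data
1. concat
- Concatenating vertically or horizontally
axis defaults to 0. axis=0 stacks the frames vertically (rows are appended); axis=1 places them side by side (columns are appended).
join defaults to 'outer'. The join applies to the other axis: the column labels when stacking rows, the row labels when placing frames side by side. 'outer' keeps the union of the labels and 'inner' keeps their intersection.
ignore_index=True renumbers the labels along the concatenation axis from 0: the row labels for axis=0, the column labels for axis=1.
join_axes=[df1.index] makes a horizontal concatenation use df1's row index. This parameter was removed in pandas 1.0; calling .reindex(df1.index) on the result of pd.concat gives the same rows.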
In [100]: df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
In [101]: df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])
In [123]: pd.concat([df1,df2],ignore_index=True)
Out[123]:
a b c d e
0 0.0 0.0 0.0 0.0 NaN
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0
In [130]: pd.concat([df1,df2],axis=1)
Out[130]:
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
In [131]: pd.concat([df1,df2],axis=1,ignore_index=True)
Out[131]:
0 1 2 3 4 5 6 7
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
In [137]: pd.concat([df1,df2],axis=1,join_axes=[df1.index])
Out[137]:
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
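Two variations are worth sketching with the same `df1` and `df2`: `join='inner'`, which keeps only the shared columns, and a `reindex` call that on current pandas versions stands in for `join_axes`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)), columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# Vertical concat keeping only the columns both frames share (b, c, d)
inner_rows = pd.concat([df1, df2], join='inner', ignore_index=True)

# Horizontal concat restricted to df1's row labels
on_df1_index = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(inner_rows)
print(on_df1_index)
```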
2. merge
3. Appending rows with append
DataFrame.append was removed in pandas 2.0; the session below predates that.
In [143]: df1
Out[143]:
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [144]: s1 = pd.Series([1,2,3,4],index=['a','b','c','d']); s1
Out[144]:
a 1
b 2
c 3
d 4
dtype: int64
In [145]: df1.append(s1,ignore_index=True)
Out[145]:
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0
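On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same result can be obtained with `pd.concat`. This is one possible sketch: the Series is first turned into a one-row frame so that its labels line up with the columns.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# to_frame() gives a single column; .T turns it into a single row
appended = pd.concat([df1, s1.to_frame().T], ignore_index=True)

print(appended)
```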