Pandas Quick Start (Part 2)

2019-10-08  乔治大叔

This post continues from Pandas Quick Start (Part 1).

Boolean indexing

print(df[df.A > 0])  # select the rows where column A is greater than 0
print(df[df > 0])    # keep values greater than 0; everything else becomes NaN
                   A         B         C         D
2019-09-01  0.586356  1.969502  1.125890 -0.831724
2019-09-04  0.886695  1.543536 -0.170274  0.867814
2019-09-06  0.297143 -0.317093  1.125189  1.023567

                   A         B         C         D
2019-09-01  0.586356  1.969502  1.125890       NaN
2019-09-02       NaN       NaN       NaN       NaN
2019-09-03       NaN  1.162615  0.699749  1.224788
2019-09-04  0.886695  1.543536       NaN  0.867814
2019-09-05       NaN  0.445182       NaN       NaN
2019-09-06  0.297143       NaN  1.125189  1.023567
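
The frame above holds random numbers, so its output changes on every run. As a minimal sketch with a small hand-made frame (the names `demo`, `rows` and `cells` are illustrative and not part of the original post), the difference between a row mask and a cell-by-cell mask can be reproduced like this:

```python
import pandas as pd

# Hypothetical fixed data so that the result is reproducible
demo = pd.DataFrame(
    {"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]},
    index=pd.date_range("20190901", periods=3),
)

# A boolean Series selects whole rows: keep the rows where column A is positive
rows = demo[demo.A > 0]

# A boolean DataFrame masks cell by cell: cells that fail the test become NaN
cells = demo[demo > 0]

print(rows)
print(cells)
```

Here `rows` keeps the first and third row in full, including their negative B values, while `cells` keeps all three rows but blanks out every non-positive cell.
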
Filtering

Use the isin() method to filter:

df['E'] = ['one','two','three','four','five','six']
print(df)
print(df[df['E'].isin(['two','three'])])
                   A         B         C         D      E
2019-09-01  0.586356  1.969502  1.125890 -0.831724    one
2019-09-02 -0.665937 -0.897839 -1.208598 -1.226119    two
2019-09-03 -2.418687  1.162615  0.699749  1.224788  three
2019-09-04  0.886695  1.543536 -0.170274  0.867814   four
2019-09-05 -0.671953  0.445182 -0.614136 -0.064305   five
2019-09-06  0.297143 -0.317093  1.125189  1.023567    six
                   A         B         C         D      E
2019-09-02 -0.665937 -0.897839 -1.208598 -1.226119    two
2019-09-03 -2.418687  1.162615  0.699749  1.224788  three
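
As a small sketch of what isin() returns, assuming a hypothetical frame `demo` with a label column like E above: the method produces a boolean Series, which can be used directly as a row mask or inverted with `~` to select the remaining rows.

```python
import pandas as pd

# Hypothetical frame with a label column, mirroring column E above
demo = pd.DataFrame(
    {"value": [10, 20, 30, 40], "E": ["one", "two", "three", "four"]}
)

# isin() returns a boolean Series: True where E is one of the listed values
mask = demo["E"].isin(["two", "three"])

selected = demo[mask]   # rows whose E is 'two' or 'three'
excluded = demo[~mask]  # ~ inverts the mask: all other rows

print(selected)
print(excluded)
```
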

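Setting values

The standard Python / NumPy expressions for selecting and setting values are intuitive and convenient for interactive work, but for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc.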
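
To make the four accessors concrete, here is a minimal sketch on a hypothetical three-row frame (`demo` and the variables `a` to `d` are illustrative names). `.at` and `.iat` read or write a single cell, while `.loc` and `.iloc` can also address several rows or columns at once.

```python
import pandas as pd

dates = pd.date_range("20190901", periods=3)
demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=dates)

a = demo.at[dates[0], "A"]     # single value by row label and column label
b = demo.iat[0, 1]             # single value by integer position (row 0, column 1)
c = demo.loc[dates[0:2], "A"]  # label-based selection, here two rows of column A
d = demo.iloc[0:2, 0]          # position-based selection; the slice end is excluded

print(a, b)
print(c)
print(d)
```
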

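A Series added as a new column is automatically aligned with the DataFrame by index. The snippet below builds and prints such a Series: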

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20191001', periods=6))
print(s1)
2019-10-01    1
2019-10-02    2
2019-10-03    3
2019-10-04    4
2019-10-05    5
2019-10-06    6
Freq: D, dtype: int64
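
The Series above is only created and printed, never attached to df. Its index also starts on 2019-10-01, while the frame in this post is indexed from 2019-09-01, so assigning it as a column would give nothing but NaN. The sketch below uses hypothetical data (`demo`, `s` and the column name `F` are assumptions) with partly overlapping dates to show the alignment:

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]},
    index=pd.date_range("20190901", periods=3),
)

# Hypothetical Series whose index starts one day later than the frame's index
s = pd.Series([10, 20, 30], index=pd.date_range("20190902", periods=3))

# Values are matched by index label, not by position
demo["F"] = s
print(demo)
```

The first row gets NaN because `s` has no entry for 2019-09-01, and the value for 2019-09-04 is dropped because the frame has no such row.
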

Setting values by label:

dates = pd.date_range('20190901', periods=6)
print(dates)
df.at[dates[0], 'C'] = 0  # dates[0] is 2019-09-01
print(df)
DatetimeIndex(['2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04',
               '2019-09-05', '2019-09-06'],
              dtype='datetime64[ns]', freq='D')

                   A         B         C         D
2019-09-01 -0.397827 -1.102112  0.000000  0.161291
2019-09-02 -0.751784 -0.759627 -1.311447 -0.919117
2019-09-03  0.531277  0.550232 -1.253598  0.647749
2019-09-04 -0.549671  1.000032 -0.927265  0.094845
2019-09-05 -0.046609  0.399075  1.111344  1.722658
2019-09-06 -1.424410 -1.328193  2.587026  0.463605

Setting values by position:

df.iat[0, 2] = 0  # row 0, column 2 (both zero-based)
                   A         B         C         D
2019-09-01 -0.921584 -0.207005  0.000000 -0.548157
2019-09-02 -0.899229  0.561346  0.574105 -1.558532
2019-09-03 -1.277597 -0.583355  1.247190 -0.916555
2019-09-04 -1.227783  0.522624 -2.151186 -0.281190
2019-09-05  0.553149 -0.114055  0.616718  0.875897
2019-09-06  1.140854 -0.052508  0.943119  1.269147

Assigning with a NumPy array:

df.loc[:, 'D'] = np.array([5] * len(df))  # assign a NumPy array; the [] around 5 is required
                   A         B         C  D
2019-09-01  0.260309 -0.786362  0.900311  5
2019-09-02 -1.035287  1.727411 -0.041896  5
2019-09-03 -0.495706  0.687953 -0.121707  5
2019-09-04 -0.365145 -0.844624 -0.764868  5
2019-09-05  0.309504  0.465509 -0.363573  5
2019-09-06 -0.143167 -0.405704 -1.102475  5
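
A short sketch of why the brackets matter, using a hypothetical three-row frame `demo`: repeating a list builds one value per row, whereas multiplying the bare number gives a single scalar.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"C": [0.1, 0.2, 0.3], "D": [1.0, 2.0, 3.0]})

values = np.array([5] * len(demo))  # list repetition: array([5, 5, 5])
scalar = np.array(5 * len(demo))    # plain multiplication: a zero-dimensional array holding 15

demo.loc[:, "D"] = values           # one value per row
print(demo)
```

Whether column D prints as 5 or 5.0 afterwards depends on the pandas version, because newer releases may keep the column's original float dtype.
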

Assignment with a where condition:

df2 = df.copy()
df2[df2 < 0] = -df2  # negate the negative values so that every value becomes non-negative
print(df2)
                   A         B         C         D
2019-09-01  0.608456  1.503148 -0.194184  0.149963
2019-09-02 -0.654379  1.039558 -0.321524  1.771350
2019-09-03 -2.084704 -0.734897  0.260852 -1.163411
2019-09-04 -0.461798  0.311986  1.860293 -1.353793
2019-09-05  0.660783 -2.050908 -0.480054 -1.123917
2019-09-06  0.070030 -0.405595  0.687804  0.119593
                   A         B         C         D
2019-09-01  0.608456  1.503148  0.194184  0.149963
2019-09-02  0.654379  1.039558  0.321524  1.771350
2019-09-03  2.084704  0.734897  0.260852  1.163411
2019-09-04  0.461798  0.311986  1.860293  1.353793
2019-09-05  0.660783  2.050908  0.480054  1.123917
2019-09-06  0.070030  0.405595  0.687804  0.119593
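
The same rule can be written in other ways. The sketch below, on a hypothetical two-row frame `demo`, compares the masked assignment with `DataFrame.where()` and with `abs()`; all names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.5, -2.0], "B": [-0.5, 3.0]})

# Masked assignment, as in the snippet above: only negative cells are overwritten
flipped = demo.copy()
flipped[flipped < 0] = -flipped

# where() keeps the cells that satisfy the condition and takes the rest from `other`
via_where = demo.where(demo >= 0, -demo)

# For this particular rule, abs() gives the same result
via_abs = demo.abs()

print(flipped)
```

Unlike the masked assignment, `where()` and `abs()` return a new frame and leave `demo` unchanged.
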

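Missing data

pandas primarily uses the value np.nan to represent missing data.

df2 = df2[df2 > 0]  # keep values greater than 0; everything else becomes NaN
print(df2)
print(df2.dropna(how='any'))  # drop every row that has at least one missing value
print(df2.fillna(value=5))  # fill missing values with 5
print(pd.isna(df2))  # boolean mask of missing values: True where the value is NaN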
                   A         B         C         D
2019-09-01  2.504590       NaN  1.139982       NaN
2019-09-02       NaN  0.604752  0.655428       NaN
2019-09-03       NaN       NaN  1.086983  0.600510
2019-09-04       NaN       NaN       NaN  0.459104
2019-09-05       NaN       NaN       NaN  1.349749
2019-09-06  0.803654  1.542528  0.041647  1.053980

                   A         B         C        D
2019-09-06  0.803654  1.542528  0.041647  1.05398

                   A         B         C         D
2019-09-01  2.504590  5.000000  1.139982  5.000000
2019-09-02  5.000000  0.604752  0.655428  5.000000
2019-09-03  5.000000  5.000000  1.086983  0.600510
2019-09-04  5.000000  5.000000  5.000000  0.459104
2019-09-05  5.000000  5.000000  5.000000  1.349749
2019-09-06  0.803654  1.542528  0.041647  1.053980

                A      B      C      D
2019-09-01  False   True  False   True
2019-09-02   True  False  False   True
2019-09-03   True   True  False  False
2019-09-04   True   True   True  False
2019-09-05   True   True   True  False
2019-09-06  False  False  False  False
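
For a reproducible version of these calls, here is a minimal sketch on a hypothetical frame `demo` with hand-placed NaN values. It also contrasts `how='any'` with `how='all'`, which the post does not use.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": [2.0, 3.0, np.nan]})

any_dropped = demo.dropna(how="any")  # drop rows with at least one NaN
all_dropped = demo.dropna(how="all")  # drop only rows where every value is NaN
filled = demo.fillna(value=0)         # replace NaN with a constant
nan_count = demo.isna().sum()         # number of NaN values per column

print(any_dropped)
print(all_dropped)
print(filled)
print(nan_count)
```

None of these calls modify `demo`; each returns a new object.
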
