A Summary of Pandas Usage

2017-08-01  Jlan

Indexing

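Index by row label: data.loc['row_name']  (older code used data.ix, which was removed in pandas 1.0)
Index by row position: data.iloc[2]
Build a boolean mask of rows whose column contains any of several substrings: df_data['分类'].str.contains('婚|同居|抚养|赡养')
Select specific columns: data = data.loc[:, ['account', '首贷申请时间']]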
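The four selections above can be tried on a small frame. This is only a sketch: the sample data, the row labels and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; labels and column names are illustrative only.
data = pd.DataFrame(
    {"account": ["001", "002", "003"],
     "category": ["marriage", "loan", "child support"],
     "amount": [10, 20, 30]},
    index=["r1", "r2", "r3"],
)

row_by_label = data.loc["r2"]      # one row, selected by index label
row_by_position = data.iloc[2]     # one row, selected by integer position

# str.contains returns a boolean mask; the pattern is a regex, so '|' means "or".
mask = data["category"].str.contains("marriage|support")
matching = data[mask]              # use the mask to keep only matching rows

# Select a subset of columns while keeping every row.
subset = data.loc[:, ["account", "amount"]]
```

Note that str.contains only produces the mask; the rows are filtered when the mask is passed back into data[...].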

Grouping and sorting

In [84]: df = pd.DataFrame({'key1':['a','a','b','b','a'],
    ...:                    'key2':['one','two','one','two','one'],
    ...:                    'data1':np.random.randn(5),
    ...:                    'data2':np.random.randn(5)})                 

In [85]: df
Out[85]: 
      data1     data2 key1 key2
0  1.579140  0.428876    a  one
1  0.494486  0.397206    a  two
2 -0.445459 -1.447018    b  one
3  1.114477 -1.539330    b  two
4  0.899226 -2.082411    a  one

# Group, then sort within each group by multiple keys; each element of ascending sets the order for the matching sort key
In [86]: sort_func = lambda x: x.sort_values(['data1', 'data2'], ascending=[True, False])
In [87]: dfgs = df.groupby(['key1', 'key2']).apply(sort_func)
In [88]: dfgs
Out[88]: 
                data1     data2 key1 key2
key1 key2                                
a    one  4  0.899226 -2.082411    a  one
          0  1.579140  0.428876    a  one
     two  1  0.494486  0.397206    a  two
b    one  2 -0.445459 -1.447018    b  one
     two  3  1.114477 -1.539330    b  two

# After grouping and sorting, keep only the first n rows of each group (n = 1 here)
In [89]: sort_func = lambda x: x.sort_values(['data1', 'data2'], ascending=[True, False]).head(1)
In [90]: dfgsh = df.groupby(['key1', 'key2']).apply(sort_func)
In [91]: dfgsh
Out[91]: 
                data1     data2 key1 key2
key1 key2                                
a    one  4  0.899226 -2.082411    a  one
     two  1  0.494486  0.397206    a  two
b    one  2 -0.445459 -1.447018    b  one
     two  3  1.114477 -1.539330    b  two
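The same "first n rows per group" result can probably also be reached without apply, by sorting the whole frame once and then calling head on the groupby. This is a sketch with fixed made-up numbers instead of random ones, not the approach used above.

```python
import pandas as pd

# Fixed illustrative values (the session above uses random numbers).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.5, 0.4, -0.4, 1.1, 0.8],
    "data2": [0.4, 0.3, -1.4, -1.5, -2.0],
})

# Sort once, then keep the first row of every (key1, key2) group.
top1 = (
    df.sort_values(["data1", "data2"], ascending=[True, False])
      .groupby(["key1", "key2"])
      .head(1)
)
```

Unlike the apply version, the result keeps the original flat row index instead of a (key1, key2, row) MultiIndex.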

Setting column dtypes when reading data

data = pd.read_excel(host_file, dtype={'emergency_contact_mobile(紧急联系人1)': str,
                                       'emergency_contact_mobile2nd(紧急联系人2)': str,
                                       'account': str})
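Why this matters for phone numbers and account IDs: without a dtype, digit-only columns are parsed as numbers and lose their leading zeros. The sketch below uses read_csv on an in-memory string as a stand-in for the Excel file; the column names and values are made up.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the Excel file.
raw = "account,mobile\n00123,01380000000\n00456,01390000000\n"

# Default parsing: digit-only columns become integers.
as_numbers = pd.read_csv(io.StringIO(raw))

# Forcing str keeps the values exactly as written.
as_text = pd.read_csv(io.StringIO(raw), dtype={"account": str, "mobile": str})

print(as_numbers["account"].tolist())  # leading zeros are lost
print(as_text["account"].tolist())     # leading zeros are kept
```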

Renaming columns

data.rename(columns={'emergency_contact_mobile(紧急联系人1)': 'emergency1',
                     'emergency_contact_mobile2nd(紧急联系人2)': 'emergency2',
                     '首贷申请时间': 'first_time'},
            inplace=True)
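A small sketch of the non-inplace form, with made-up column names. It also shows one thing to watch for: by default, keys in the mapping that match no column are silently ignored, so a typo in a column name goes unnoticed.

```python
import pandas as pd

# Hypothetical frame; column names are illustrative only.
data = pd.DataFrame({"addtime": ["2017-08-01"], "account": ["001"]})

# Without inplace=True, rename returns a new frame and leaves data unchanged.
# 'missing_column' does not exist, and is ignored without an error.
renamed = data.rename(columns={"addtime": "first_time", "missing_column": "x"})
```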

Converting a DataFrame to a dict

data = data.to_dict('records')

Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
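To see what the 'records' orientation produces compared with the default, here is a sketch on a two-row frame with made-up values.

```python
import pandas as pd

# Hypothetical two-row frame.
data = pd.DataFrame({"account": ["001", "002"], "amount": [10, 20]})

# 'records': one dict per row, collected in a list.
records = data.to_dict("records")
# [{'account': '001', 'amount': 10}, {'account': '002', 'amount': 20}]

# Default orientation: one dict per column, keyed by the row index.
by_column = data.to_dict()
# {'account': {0: '001', 1: '002'}, 'amount': {0: 10, 1: 20}}
```

The 'records' form is usually the convenient one when each row is handed on as a separate item, for example as a JSON object.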

Removing NaN, inf, and similar values

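df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]  # keep only rows with no NaN or +/-inf in any column
df_data = df_data[df_data['问题'].notnull()]  # drop rows where the '问题' column is null
df_data.fillna('', inplace=True)  # replace remaining NaN with empty strings
data.dropna(inplace=True)  # drop rows that contain any NaN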
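A runnable sketch of the first filter on made-up data, next to an alternative that turns inf into NaN first and then lets dropna do the work. Both are expected to keep the same rows here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN and one inf.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [1.0, 2.0, np.inf, 4.0]})

# Flag rows where any column holds NaN or +/-inf, then keep the rest.
bad = df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
clean = df[~bad]

# Alternative: replace inf with NaN, then drop rows containing NaN.
clean2 = df.replace([np.inf, -np.inf], np.nan).dropna()
```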

Some examples

An example of reading a file

import pandas as pd

def parse_application_file(application_file):
    # Read a tab-separated txt file; keep the 'account' column as str
    data = pd.read_table(application_file, sep='\t', encoding='utf-8', engine='python', dtype={'account': str})
    data = data.loc[:, ['account', 'addtime']]  # keep only the two columns needed
    data.rename(columns={'addtime': 'first_time'}, inplace=True)  # rename the column
    data = data.set_index('account')  # use the account column as the index
    print(data)
    return data

# For converting between index and columns, see https://www.cnblogs.com/hhh5460/p/7067928.html
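A minimal sketch of the index/column round trip, using set_index as in the function above and reset_index for the way back. The frame and its values are made up.

```python
import pandas as pd

# Hypothetical frame shaped like the one returned before set_index above.
data = pd.DataFrame({"account": ["001", "002"],
                     "first_time": ["2017-01-01", "2017-02-01"]})

indexed = data.set_index("account")   # column -> index
restored = indexed.reset_index()      # index -> column again
```

After set_index, rows can be looked up by account with indexed.loc['001'].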