pandas中的习惯用法示例说明
2019-01-16 本文已影响74人
筝韵徽
import pandas as pd
import numpy as np
pandas中的习惯用法示例说明
df=pd.read_csv('data/sample_data.csv',index_col=0)
df

- 选择列,单列,多列
df['age']
Jane 30
Niko 2
Aaron 12
Penelope 4
Dean 32
Christina 33
Cornelia 69
Name: age, dtype: int64
df[['age','color']]

- 同时选择行列
df.loc[['Niko','Penelope'],['height','age']]

df.iloc[[2,4],-2:]

- 根据条件筛选数据
- 条件1:颜色是red,green
- 条件2:height<80
- 条件2:只选择score列
符合条件的如下
cri1=df['color'].isin(['red','green'])
cri2=df['height']<80
df.loc[(cri1|cri2),'score']
Niko 8.3
Aaron 9.0
Cornelia 2.2
Name: score, dtype: float64
- 索引对齐
- 使用Series给dataframe某列赋值时,要注意行索引一致,否则不是想要的结果
- Series 与Series运算时也是一样的
aa=pd.Series(df['age'].values,index=['Jane1','Niko1','Aaron1','Penelope1','Dean1','Christina1','Cornelia1'])
aa
Jane1 30
Niko1 2
Aaron1 12
Penelope1 4
Dean1 32
Christina1 33
Cornelia1 69
dtype: int64
df['age1']=aa
df

看到上边结果了,索引不一致,没有匹配的,都为nan值
s=df['age']
s+aa
Aaron NaN
Aaron1 NaN
Christina NaN
Christina1 NaN
Cornelia NaN
Cornelia1 NaN
Dean NaN
Dean1 NaN
Jane NaN
Jane1 NaN
Niko NaN
Niko1 NaN
Penelope NaN
Penelope1 NaN
dtype: float64
s+s
Jane 60
Niko 4
Aaron 24
Penelope 8
Dean 64
Christina 66
Cornelia 138
Name: age, dtype: int64
series索引不匹配,都是Nan;索引匹配则进行数值运算
pandas内置函数与python内置函数性能比较
college=pd.read_csv('data/college.csv',index_col=0)
college.head()

s=college.UGDS
%timeit s.any() #pandas内置函数 any函数检查是否至少有一个为true的元素
218 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit any(s) #python 内置函数
475 µs ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s.all() #pandas内置函数 检查所有元素是否都为true
204 µs ± 9.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit all(s) #python内置函数
484 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
明显pandas内置函数效率更高,不在列举,尽量使用pandas内置函数
map vs apply
n = 1000000 # 1 million
s = pd.Series(np.random.randint(1, 101, n))
%timeit s.apply(lambda x: 'odd' if x % 2 else 'even')
647 ms ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
d={i:'odd' if i % 2 else 'even' for i in range(1,101)}
s.map(d)
55.9 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit s.mod(2).map({1:'odd',0:'even'})
87.4 ms ± 6.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
d={i:'odd' if i % 2 else 'even' for i in range(1,101)}
pd.Series([d[val] for val in s])
334 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
尽量使用pandas的map而非apply,尽量使用pandas内置函数
- 下面列一些方法替换apply
df = pd.DataFrame(np.random.rand(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df.head()

df.apply(np.cumsum).head()

df.cumsum().head()

df.apply(lambda x:x.max()-x.min())
a 0.986689
b 0.982148
c 0.943625
d 0.997441
e 0.981994
dtype: float64
df.max()-df.min()
a 0.986689
b 0.982148
c 0.943625
d 0.997441
e 0.981994
dtype: float64
df = pd.DataFrame(np.random.randint(0, 20, (100000, 4)),
columns=['x1', 'y1', 'x2', 'y2'])
df.head()

def dist_calc(s):
x_diff = (s['x1'] - s['x2']) ** 2
y_diff = (s['y1'] - s['y2']) ** 2
return np.sqrt(x_diff + y_diff).round(2)
df['distance'] = df.apply(dist_calc, axis='columns')
df.head()

np.sqrt((df['x1']-df['x2'])**2+(df['y1']-df['y2'])**2).head()
0 5.385165
1 10.198039
2 11.180340
3 16.643317
4 14.866069
dtype: float64
%timeit df.apply(dist_calc, axis='columns')
15.1 s ± 591 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.sqrt((df['x1']-df['x2'])**2+(df['y1']-df['y2'])**2)
12.7 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
时间效率上可见一斑