Python_Pandas_Performance_Improvement

2020-03-19 Kaspar433

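Using NumPy and a few pandas methods appropriately can speed up computations many times over. This article introduces some commonly used approaches and compares their running times.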

First, read the data.

import pandas as pd
import numpy as np

data = pd.read_csv("gun_deaths_in_america.csv",header=0)
data.head()

	year	month	intent	sex	age	race	hispanic	place	education
0	2012	1	Suicide	M	34	Asian/Pacific Islander	100	Home	4
1	2012	1	Suicide	F	21	White	100	Street	3
2	2012	1	Suicide	M	60	White	100	Other specified	4
3	2012	2	Suicide	M	64	White	100	Home	4
4	2012	2	Suicide	M	31	White	100	Other specified	2

The ordinary apply() method

def judge_edu(row):
    if row['education'] > 3:
        return 'high'
    else:
        return row['education']

%timeit data['judge_edu'] = data.apply(judge_edu,axis=1)

out:
1.81 s ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
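
%timeit is an IPython magic, so the measurement above only runs inside IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard-library timeit module. The DataFrame below is synthetic (an assumption standing in for gun_deaths_in_america.csv), so the absolute numbers will not match the ones reported here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption: only the
# 'education' column matters for this measurement).
rng = np.random.default_rng(0)
df = pd.DataFrame({"education": rng.integers(1, 6, size=10_000)})


def judge_edu(row):
    if row["education"] > 3:
        return "high"
    else:
        return row["education"]


# Run the row-wise apply a few times and report the mean time per run.
n_runs = 3
total = timeit.timeit(lambda: df.apply(judge_edu, axis=1), number=n_runs)
print(f"apply(axis=1): {total / n_runs * 1000:.1f} ms per loop")
```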

Using np.where()

np.where(condition, [x, y]) behaves like an element-wise if...else: where the condition is true it returns x, otherwise y. Calls can be nested.

%timeit data['judge_edu'] = np.where(data['education']>3,'high',data['education'])

out:
58.2 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
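
To see what np.where produces, here is a minimal sketch on a tiny hypothetical DataFrame (not the real dataset). One deliberate difference from the timed expression: the integer column is converted to strings first, so both branches have the same type and the result does not depend on how NumPy promotes a mix of strings and integers.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset.
df = pd.DataFrame({"education": [4, 3, 5, 1, 2]})

# Element-wise choice: where the condition holds take 'high',
# otherwise take the education value rendered as a string.
edu_as_text = df["education"].astype(str)
df["judge_edu"] = np.where(df["education"] > 3, "high", edu_as_text)

print(df["judge_edu"].tolist())  # ['high', '3', 'high', '1', '2']
```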

Using np.vectorize()

def judge_edu_2(col):
    if col > 3:
        return 'high'
    else:
        return col
    
vectfunc = np.vectorize(judge_edu_2)
%timeit data['judge_edu'] = vectfunc(data['education'])

out:
52.3 ms ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
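
A caveat worth knowing: by default np.vectorize infers the output type by calling the function on the first input element. A function that returns either a string or a number can therefore behave differently depending on which row comes first. The sketch below, on a small hypothetical DataFrame, passes otypes=[object] to fix the output type up front. The NumPy documentation also describes np.vectorize as a convenience that is essentially a loop, so its speed-up over apply comes mostly from avoiding per-row Series construction.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the first value deliberately falls in the "else" branch.
df = pd.DataFrame({"education": [2, 4, 3, 5, 1]})


def judge_edu_2(value):
    if value > 3:
        return "high"
    else:
        return value


# otypes=[object] declares the output type explicitly, so the result does
# not depend on what the function returns for the first element.
vectfunc = np.vectorize(judge_edu_2, otypes=[object])
df["judge_edu"] = vectfunc(df["education"])

print(df["judge_edu"].tolist())
```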

Multiple conditions: np.select()

apply()

def judge_age(row):
    if row['age'] > 60:
        return 'old'
    elif row['age'] > 40:
        return 'mid'
    elif row['age'] > 20:
        return 'young'
    elif row['age'] > 10:
        return 'teen'
    else:
        return 'child'

%timeit data['judge_age'] = data.apply(judge_age,axis=1)

out:
2.26 s ± 72.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.where()

%timeit data['judge_age_2'] = np.where(data['age']>60,'old',\
                                        np.where(data['age']>40,'mid',\
                                        np.where(data['age']>20,'young',\
                                        np.where(data['age']>10,'teen','child'))))

out:
17.9 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
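
Before trusting the faster version, it is worth checking that it gives the same labels as the apply() version. A minimal sketch of such a check, using randomly generated ages (an assumption, not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic ages standing in for the real 'age' column.
rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.integers(0, 100, size=1_000)})


def judge_age(row):
    if row["age"] > 60:
        return "old"
    elif row["age"] > 40:
        return "mid"
    elif row["age"] > 20:
        return "young"
    elif row["age"] > 10:
        return "teen"
    else:
        return "child"


via_apply = df.apply(judge_age, axis=1)

# Nested np.where: each inner call supplies the "else" branch of the outer one.
via_where = np.where(df["age"] > 60, "old",
            np.where(df["age"] > 40, "mid",
            np.where(df["age"] > 20, "young",
            np.where(df["age"] > 10, "teen", "child"))))

# Compare as plain Python lists to sidestep dtype differences.
same = via_apply.tolist() == via_where.tolist()
print(same)
```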

np.select()

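np.select(condlist, choicelist, default=0) is similar to the CHOOSE function in Excel: for each element it returns the choice paired with the first condition that is true, and default if none is true.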

conditions = [data['age']>60,
             data['age']>40,
             data['age']>20,
             data['age']>10]
choices = ['old','mid','young','teen']

%timeit data['judge_age_3'] = np.select(conditions,choices,default='child')

out:
13.4 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
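
Because np.select takes the first matching condition, the order of the condition list matters, just as the order of the elif branches does. The sketch below uses a small hypothetical DataFrame to show what happens when the broadest condition is listed first.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one per category.
df = pd.DataFrame({"age": [70, 45, 25, 15, 5]})

conditions = [df["age"] > 60,
              df["age"] > 40,
              df["age"] > 20,
              df["age"] > 10]
choices = ["old", "mid", "young", "teen"]

# Narrowest condition first: each age lands in its intended bucket.
in_order = np.select(conditions, choices, default="child")

# Broadest condition first: 'age > 10' matches before anything else can.
reversed_order = np.select(conditions[::-1], choices[::-1], default="child")

print(in_order.tolist())        # ['old', 'mid', 'young', 'teen', 'child']
print(reversed_order.tolist())  # ['teen', 'teen', 'teen', 'teen', 'child']
```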

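Summary: for conditional calculations like these, the vectorized NumPy approaches are far faster than pandas' apply(axis=1). In the timings above, the speed ranking is np.select > np.where > apply: 13.4 ms and 17.9 ms versus 2.26 s for the age example.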
