
Python Pandas: Improving Performance

2020-03-19  Kaspar433


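Using NumPy and a few pandas methods appropriately can speed up computations several times over. This article introduces some commonly used methods and compares their running times.

First, read the data.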

import pandas as pd
import numpy as np

data = pd.read_csv("gun_deaths_in_america.csv",header=0)
data.head()
   year  month   intent  police  sex  age                    race  hispanic            place  education
0  2012      1  Suicide       0    M   34  Asian/Pacific Islander       100             Home          4
1  2012      1  Suicide       0    F   21                   White       100           Street          3
2  2012      1  Suicide       0    M   60                   White       100  Other specified          4
3  2012      2  Suicide       0    M   64                   White       100             Home          4
4  2012      2  Suicide       0    M   31                   White       100  Other specified          2

The plain apply() approach

def judge_edu(row):
    if row['education'] > 3:
        return 'high'
    else:
        return row['education']

%timeit data['judge_edu'] = data.apply(judge_edu,axis=1)

out:
1.81 s ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

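Using np.where()

np.where(condition, [x, y]) works like an element-wise if...else: where the condition holds it returns the value from x, otherwise the value from y. Calls can be nested.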

%timeit data['judge_edu'] = np.where(data['education']>3,'high',data['education'])

out:
58.2 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
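
To make the behaviour of np.where() concrete, here is a minimal sketch on a tiny hand-made frame. The toy DataFrame and the column name edu_label are assumptions for illustration, not part of the article's dataset. Because the result is a single NumPy array, the numeric fallback is converted to strings explicitly in this sketch so that both branches have the same type:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the article's data (hypothetical values).
toy = pd.DataFrame({"education": [4, 3, 5, 2]})

# Element-wise if/else: 'high' where education > 3,
# otherwise the original value rendered as text.
toy["edu_label"] = np.where(
    toy["education"] > 3,
    "high",
    toy["education"].astype(str),
)

print(toy["edu_label"].tolist())
```

The whole column is processed in one call, which is where the speed-up over a row-by-row apply() comes from.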

Using np.vectorize()

def judge_edu_2(col):
    if col > 3:
        return 'high'
    else:
        return col
    
vectfunc = np.vectorize(judge_edu_2)
%timeit data['judge_edu'] = vectfunc(data['education'])

out:
52.3 ms ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
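
One caveat is worth sketching. The NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, and unless otypes is given it infers the output type from the first element it evaluates. The sketch below, on a hypothetical sample array, passes otypes=[object] so that the mixed return values (the string 'high' or the original number) are kept as they are. The name vectfunc_obj is an assumption for illustration:

```python
import numpy as np

def judge_edu_2(col):
    if col > 3:
        return 'high'
    else:
        return col

# otypes=[object] keeps strings and numbers side by side in one object array
# instead of letting the first result decide the output dtype.
vectfunc_obj = np.vectorize(judge_edu_2, otypes=[object])

education = np.array([2, 4, 3, 5])  # hypothetical sample values
labels = vectfunc_obj(education)

print(labels.tolist())
```

Whether this is faster or slower than np.where() on a given dataset is best checked by timing it, as the article does.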

Multiple conditions: np.select()

apply()

def judge_age(row):
    if row['age'] > 60:
        return 'old'
    elif row['age'] > 40:
        return 'mid'
    elif row['age'] > 20:
        return 'young'
    elif row['age'] > 10:
        return 'teen'
    else:
        return 'child'

%timeit data['judge_age'] = data.apply(judge_age,axis=1)

out:
2.26 s ± 72.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.where()

%timeit data['judge_age_2'] = np.where(data['age']>60,'old',\
                                        np.where(data['age']>40,'mid',\
                                        np.where(data['age']>20,'young',\
                                        np.where(data['age']>10,'teen','child'))))

out:
17.9 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.select()

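np.select(condlist, choicelist, default=0) checks the conditions in order and, for each element, returns the choice paired with the first condition that is true, or default if none is true. It is loosely similar to the CHOOSE function in Excel.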

conditions = [data['age']>60,
             data['age']>40,
             data['age']>20,
             data['age']>10]
choices = ['old','mid','young','teen']

%timeit data['judge_age_3'] = np.select(conditions,choices,default='child')

out:
13.4 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
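
The order of the conditions matters, because np.select() takes the first condition that matches each element. A minimal sketch on a hypothetical five-row frame (the values are made up for illustration) shows that an age of 70 is labelled 'old' even though it also satisfies every later condition:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one per expected label.
toy = pd.DataFrame({"age": [70, 45, 25, 15, 5]})

# Conditions are listed from most to least restrictive;
# the first one that is True for a row decides its label.
conditions = [
    toy["age"] > 60,
    toy["age"] > 40,
    toy["age"] > 20,
    toy["age"] > 10,
]
choices = ["old", "mid", "young", "teen"]

toy["age_group"] = np.select(conditions, choices, default="child")

print(toy["age_group"].tolist())
```

If the list were reversed, so that age > 10 came first, every row above 10 would be labelled 'teen'.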

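Summary: for conditional calculations like these, the NumPy functions are far faster than the pandas apply() method. In the age example, the speed ranking is np.select (13.4 ms) > np.where (17.9 ms) > apply (2.26 s).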
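
The %timeit magic used above only works in IPython or Jupyter. As a rough sketch of how the same comparison could be rerun in a plain Python script, the code below uses the standard timeit module on synthetic data. The random ages, the row count, and the helper names with_apply, with_where and with_select are all assumptions for illustration, so the absolute timings will differ from the article's:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the article's dataset (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(0, 90, size=5_000)})

def judge_age(row):
    if row["age"] > 60:
        return "old"
    elif row["age"] > 40:
        return "mid"
    elif row["age"] > 20:
        return "young"
    elif row["age"] > 10:
        return "teen"
    else:
        return "child"

def with_apply():
    # Row-by-row Python function calls.
    return df.apply(judge_age, axis=1).to_numpy()

def with_where():
    # Nested element-wise if/else on the whole column.
    age = df["age"]
    return np.where(age > 60, "old",
           np.where(age > 40, "mid",
           np.where(age > 20, "young",
           np.where(age > 10, "teen", "child"))))

def with_select():
    # First matching condition wins.
    age = df["age"]
    conditions = [age > 60, age > 40, age > 20, age > 10]
    return np.select(conditions, ["old", "mid", "young", "teen"], default="child")

for func in (with_apply, with_where, with_select):
    seconds = timeit.timeit(func, number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```

All three helpers produce the same labels, so any difference in the printed times comes from how the work is done rather than from what is computed.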
