pandas: A Short Summary

2017-09-19 Rokkia

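I had heard long ago that pandas is a powerful data tool, but I was always too lazy to look into it. Over the weekend a friend asked me for help, so I finally took a look.
pandas is very capable at data processing, and paired with matplotlib it works remarkably well.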

Main text

Sticking points:
1. How to use groupby
2. How to filter rows

1. Preparation

Import pandas

import pandas as pd

First, load the data

df = pd.read_excel('/Users/swift/Downloads/TaskData.xls')
# Take a quick look at the first rows
df.head()

  taskNumber  taskLatitude  taskLongitude  taskPrice  taskStatus
0      A0001     22.566142     113.980837       66.0           0
1      A0002     22.686205     113.940525       65.5           0
2      A0003     22.576512     113.957198       65.5           1
3      A0004     22.564841     114.244571       75.0           0
4      A0005     22.558888     113.950723       65.5           0
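
The Excel file itself is not included with this post, so here is a minimal sketch for following along without it. It rebuilds only the five rows printed above as a DataFrame; the real TaskData.xls has more rows, and this stand-in is my assumption, not the original data set.

```python
import pandas as pd

# Hypothetical stand-in for TaskData.xls: only the five rows shown by df.head().
df = pd.DataFrame({
    "taskNumber": ["A0001", "A0002", "A0003", "A0004", "A0005"],
    "taskLatitude": [22.566142, 22.686205, 22.576512, 22.564841, 22.558888],
    "taskLongitude": [113.980837, 113.940525, 113.957198, 114.244571, 113.950723],
    "taskPrice": [66.0, 65.5, 65.5, 75.0, 65.5],
    "taskStatus": [0, 0, 1, 0, 0],
})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # column types, worth checking before grouping or filtering
```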

Plot it with matplotlib.pyplot

import matplotlib.pyplot as plt
plt.scatter(x=df['taskLatitude'], y=df['taskLongitude'], c='r')
plt.show()

[Figure: data.png]

What do we do next?

[Image: image.png]

We group the rows by taskStatus and filter them by taskPrice.

2. Grouping with groupby

g_df = df.groupby('taskStatus')
# Let's see what type g_df is
In [9]: g_df
Out[9]: <pandas.core.groupby.DataFrameGroupBy object at 0x1166d1210>
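
The repr above does not say much, because a groupby object only describes how the rows are split and computes nothing until you ask. A minimal sketch of a few ways to look inside it, using a small made-up frame built from rows shown earlier (the variable names sizes, mean_price and status_1 are mine):

```python
import pandas as pd

# Small illustrative frame; values copied from the rows printed earlier.
df = pd.DataFrame({
    "taskNumber": ["A0001", "A0002", "A0003", "A0004", "A0005"],
    "taskPrice": [66.0, 65.5, 65.5, 75.0, 65.5],
    "taskStatus": [0, 0, 1, 0, 0],
})

g_df = df.groupby("taskStatus")

sizes = g_df.size()                    # number of rows per taskStatus value
mean_price = g_df["taskPrice"].mean()  # mean taskPrice per taskStatus value
status_1 = g_df.get_group(1)           # the sub-DataFrame where taskStatus == 1

print(sizes)
print(mean_price)
print(status_1)
```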

Now let's use g_df. I read online that you can iterate over it with a for loop, so I naively went ahead and did just that.

In [10]: for x in g_df:
    ...:     print(x)
    ...:

I didn't look at the output closely and assumed each item was a DataFrame, so I wrote:

In [13]: for i in g_df:
    ...:     plt.scatter(x=i['taskLatitude'], y=i['taskLongitude'],c='b')
    ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-7c2443f9322a> in <module>()
      1 for i in g_df:
----> 2     plt.scatter(x=i['taskLatitude'], y=i['taskLongitude'],c='b')
      3

TypeError: tuple indices must be integers, not str

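And, thrillingly, it raised an error. I puzzled over it for quite a while before realizing that each item is a tuple of (group key, DataFrame), which has to be unpacked before use: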

In [15]: for i,v in g_df:
    ...:     print(i)
    ...:     print('---------')
    ...:     print(v)
    ...:

0
---------
    taskNumber  taskLatitude  taskLongitude  taskPrice  taskStatus
0        A0001     22.566142     113.980837       66.0           0
1        A0002     22.686205     113.940525       65.5           0
... ...

1
---------
    taskNumber  taskLatitude  taskLongitude  taskPrice  taskStatus
2        A0003     22.576512     113.957198       65.5           1
6        A0007     22.549004     113.972260       65.5           1
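
The tuple structure seen in the output above can be checked directly. A minimal sketch on a three-row made-up frame (the list name seen is mine), recording the type of each item yielded by the loop:

```python
import pandas as pd

# Tiny illustrative frame with two taskStatus values.
df = pd.DataFrame({
    "taskNumber": ["A0001", "A0002", "A0003"],
    "taskStatus": [0, 0, 1],
})

seen = []
for item in df.groupby("taskStatus"):
    # item is a 2-tuple, which is why item['taskLatitude'] raised TypeError.
    key, group = item
    seen.append((type(item).__name__, type(group).__name__, len(group)))
    print(key, type(item).__name__, type(group).__name__, len(group))
```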

As you can see, the second element, v, is the actual DataFrame. Let's try using it:

In [16]: for i,v in g_df:
    ...:     if i == 0:
    ...:         plt.scatter(x=v['taskLatitude'], y=v['taskLongitude'],c='b')
    ...:     else:
    ...:         plt.scatter(x=v['taskLatitude'], y=v['taskLongitude'],c='r')
    ...:
In [17]: plt.show()

[Figure: image.png]

The result looks great.
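
The if/else in the loop above gets awkward once there are more than two status values. As a sketch of one alternative, the colors can live in a dict keyed by taskStatus. The dict name colors, the axis labels and the legend are my additions, and the Agg backend line is only an assumption so that the sketch runs without a display; drop it when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Small illustrative frame; values copied from the rows printed earlier.
df = pd.DataFrame({
    "taskLatitude": [22.566142, 22.686205, 22.576512, 22.564841],
    "taskLongitude": [113.980837, 113.940525, 113.957198, 114.244571],
    "taskStatus": [0, 0, 1, 0],
})

# Hypothetical color table, same colors as the if/else version.
colors = {0: "b", 1: "r"}

fig, ax = plt.subplots()
for status, group in df.groupby("taskStatus"):
    # One scatter call per group, colored by looking up the group key.
    ax.scatter(x=group["taskLatitude"], y=group["taskLongitude"],
               c=colors[status], label=f"taskStatus={status}")
ax.set_xlabel("taskLatitude")
ax.set_ylabel("taskLongitude")
ax.legend()
```

In an interactive session, calling plt.show() afterwards displays the figure as before.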

3. Filtering

There are two ways to do it.

3.1 Using loc

In [18]: n_group = df.loc[(df['taskPrice'] == 85) | (df['taskPrice'] == 65),
    ...:                  ['taskLatitude', 'taskNumber', 'taskPrice']]

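After that we can group and plot the result again. It works the same way as above, so I won't repeat it here (though the column list would then also need taskStatus and taskLongitude, which this selection leaves out).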
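
When matching several prices, chaining == with | gets long. A minimal sketch of the same loc filter written with Series.isin, on a small made-up frame built from rows shown in this post. Unlike the snippet above, I keep taskStatus in the column list here so that the result can still be grouped; the names wanted and counts are mine.

```python
import pandas as pd

# Small illustrative frame; values copied from rows printed in this post.
df = pd.DataFrame({
    "taskNumber": ["A0001", "A0004", "A0022", "A0041"],
    "taskLatitude": [22.566142, 22.564841, 22.515920, 22.662300],
    "taskPrice": [66.0, 75.0, 65.0, 65.0],
    "taskStatus": [0, 0, 1, 0],
})

# Boolean Series: True where taskPrice is one of the listed values.
wanted = df["taskPrice"].isin([85, 65])

# Row selection by mask, column selection by label list.
n_group = df.loc[wanted, ["taskLatitude", "taskNumber", "taskPrice", "taskStatus"]]

# The filtered frame can be grouped exactly like the full one.
counts = n_group.groupby("taskStatus").size()

print(n_group)
print(counts)
```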

3.2 Using a boolean mask

The other way is to index the DataFrame directly with df.taskPrice == xx.x

In [20]: df[df.taskPrice == 65.0]
Out[20]:
    taskNumber  taskLatitude  taskLongitude  taskPrice  taskStatus
21       A0022     22.515920     113.935677       65.0           1
40       A0041     22.662300     114.072997       65.0           0
... ...
... ...

This directly returns the rows whose value in that column matches the condition.
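
To see why this works: the comparison itself produces a boolean Series with one entry per row, and indexing with it keeps the rows marked True. A minimal sketch on a small made-up frame; the names mask, filtered, same and both are mine, and DataFrame.query is shown only as an equivalent spelling.

```python
import pandas as pd

# Small illustrative frame; values copied from rows printed in this post.
df = pd.DataFrame({
    "taskNumber": ["A0001", "A0004", "A0022", "A0041"],
    "taskPrice": [66.0, 75.0, 65.0, 65.0],
    "taskStatus": [0, 0, 1, 0],
})

mask = df.taskPrice == 65.0            # boolean Series, one value per row
filtered = df[mask]                    # keeps only the rows where mask is True
same = df.query("taskPrice == 65.0")   # equivalent filter written as a string

# Conditions combine with & and |; each one needs its own parentheses.
both = df[(df.taskPrice == 65.0) & (df.taskStatus == 1)]

print(mask.tolist())
print(filtered)
print(both)
```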