Getting Started with pandas

2018-01-30  弃用中

Introduction to pandas Data Structures

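We will import pandas as follows (NumPy is imported as well, because some of the examples below call np.exp and np.arange):

import numpy as np
import pandas as pd
from pandas import Series, DataFrame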

Series

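A Series is a one-dimensional, array-like object made up of an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is built from just an array of data: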

In [6]: obj = pd.Series([4,7,-5,3])

In [7]: obj
Out[7]:
0    4
1    7
2   -5
3    3
dtype: int64

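The string representation of a Series shows the index on the left and the values on the right. Since no index was specified for the data, a default integer index running from 0 to N-1 (where N is the length of the data) is created automatically. You can get the array representation and the index object of a Series through its values and index attributes: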

In [8]: obj.values
Out[8]: array([ 4,  7, -5,  3], dtype=int64)

In [9]: obj.index
Out[9]: RangeIndex(start=0, stop=4, step=1)

Often you will want to create a Series with an index that labels each data point:

In [10]: obj2 = pd.Series([4,7,-5,3],index=['d','b','a','c'])

In [12]: obj2
Out[12]:
d    4
b    7
a   -5
c    3
dtype: int64

Unlike a plain NumPy array, a Series lets you use index labels to select a single value or a set of values:

In [13]: obj2['a']
Out[13]: -5

In [15]: obj2['d'] = 7

In [18]: obj2[['c','a','d']]
Out[18]:
c    3
a   -5
d    7
dtype: int64
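
As a small illustration of label-based selection, the sketch below rebuilds obj2 and also shows what happens when a label is not in the index. The label 'z' and the use of Series.get as a fallback are additions for illustration and do not appear in the session above.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A single label returns a scalar; a list of labels returns a new Series.
single = obj2['a']
subset = obj2[['c', 'a']]

# A label that is not in the index raises KeyError ('z' is a made-up label).
try:
    obj2['z']
    missing_raises = False
except KeyError:
    missing_raises = True

# Series.get returns a default instead of raising, much like dict.get.
fallback = obj2.get('z', 0)

print(single, list(subset.index), missing_raises, fallback)
```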

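NumPy array operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) all preserve the link between index and value: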

In [19]: obj2
Out[19]:
d    7
b    7
a   -5
c    3
dtype: int64

In [21]: obj2[obj2>0]
Out[21]:
d    7
b    7
c    3
dtype: int64

In [22]: obj2*2
Out[22]:
d    14
b    14
a   -10
c     6
dtype: int64

In [23]: np.exp(obj2)
Out[23]:
d    1096.633158
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
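
The same point can be checked programmatically. The following is only a sketch, assuming the obj2 values shown above; the variable names positive, doubled and exped are introduced here for illustration.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([7, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj2[obj2 > 0]   # boolean filtering drops 'a' but keeps the other labels
doubled = obj2 * 2          # scalar multiplication keeps every label
exped = np.exp(obj2)        # a NumPy ufunc returns a Series, not a bare array

print(list(positive.index))
print(doubled.index.equals(obj2.index), isinstance(exped, pd.Series))
```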

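You can also think of a Series as a fixed-length, ordered dict, since it maps index values to data values. For example, the in operator tests membership against the index: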

In [24]: 'b' in obj2
Out[24]: True

In [25]: 'e' in obj2
Out[25]: False
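
One consequence of the dict analogy is easy to miss: in tests the index labels, not the data values. The sketch below, which reuses the obj2 values from above, makes that explicit; searching obj2.values and calling to_dict are additions for illustration.

```python
import pandas as pd

obj2 = pd.Series([7, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_found = 'b' in obj2        # True: 'b' is an index label
value_as_key = 7 in obj2         # False: 7 is a value, not a label
value_found = 7 in obj2.values   # True: search the underlying array instead

# to_dict converts the Series to a plain dict mapping label -> value.
as_dict = obj2.to_dict()

print(label_found, value_as_key, value_found, as_dict)
```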

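If your data is in a dict, you can create a Series directly from it, and the index is made up of the dict's keys. In the pandas version used here the keys come out in sorted order; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order instead.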

In [26]: sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}

In [27]: obj3 = pd.Series(sdata)

In [28]: obj3
Out[28]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

Here is another example, this time passing an explicit index as well:

In [30]: states = ['California','Ohio','Oregon','Texas']

In [31]: obj4 = pd.Series(sdata,index=states)

In [32]: obj4
Out[32]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

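In this example, the three values in sdata whose keys match the states index are found and placed in the corresponding positions. No value is found in sdata for "California", so its result is NaN ("not a number"), which pandas uses to mark missing or NA values. "Utah" is left out of the result because it is not in states. The pandas functions isnull and notnull can be used to detect missing data: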

In [33]: pd.isnull(obj4)
Out[33]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [34]: pd.notnull(obj4)
Out[34]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
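
The boolean masks returned by these functions are usually put to work right away. The sketch below is one way to do that, using the method forms Series.isnull and Series.dropna; the names missing_mask, n_missing and present are introduced here for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# isnull is also available as a Series method.
missing_mask = obj4.isnull()
n_missing = int(missing_mask.sum())              # each True counts as 1
missing_labels = list(obj4[missing_mask].index)  # labels with no data

# dropna returns a new Series without the missing entries.
present = obj4.dropna()

print(n_missing, missing_labels, list(present.index))
```

Note that obj4 has dtype float64 even though sdata holds integers, because NaN is a floating-point value.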

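For many applications, the most important feature of a Series is that it automatically aligns data by index label in arithmetic operations. A label that appears in only one of the operands produces NaN in the result: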

In [35]: obj3
Out[35]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [36]: obj4
Out[36]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [37]: obj3 + obj4
Out[37]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
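
If NaN for one-sided labels is not what you want, Series.add accepts a fill_value. The sketch below compares the two behaviors; it builds obj3 and obj4 directly with the values shown above, and the choice of 0 as the fill value is an assumption made for illustration.

```python
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000})
obj4 = pd.Series({'California': float('nan'), 'Ohio': 35000.0,
                  'Oregon': 16000.0, 'Texas': 71000.0})

# Plain +: 'Utah' exists only in obj3, so its sum is NaN.
plain = obj3 + obj4

# add with fill_value: a value missing on one side only is treated as 0.
# 'California' has no usable value on either side, so it stays NaN.
filled = obj3.add(obj4, fill_value=0)

print(plain)
print(filled)
```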

Both the Series object itself and its index have a name attribute, which integrates closely with other key areas of pandas functionality:

In [38]: obj4.name = 'population'

In [39]: obj4.index.name = 'state'

In [40]: obj4
Out[40]:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
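
One place where these names show up is in conversion to a DataFrame. The following sketch assumes the obj4 values above and uses Series.to_frame and Series.reset_index, neither of which is part of the original session.

```python
import pandas as pd

obj4 = pd.Series([float('nan'), 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])
obj4.name = 'population'
obj4.index.name = 'state'

# to_frame uses the Series name as the column label.
as_frame = obj4.to_frame()

# reset_index turns the named index into an ordinary column.
flat = obj4.reset_index()

print(list(as_frame.columns), as_frame.index.name)
print(list(flat.columns))
```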

The index of a Series can be modified in place by assignment:

In [43]: obj.index = ['Bob','Steve','Jeff','Ryan']

In [44]: obj
Out[44]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
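
The new index has to have the same length as the data. The sketch below, an illustration rather than part of the original session, shows that a shorter list is rejected with a ValueError.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # same length: accepted

try:
    obj.index = ['Bob', 'Steve']               # wrong length: rejected
    length_checked = False
except ValueError:
    length_checked = True

print(length_checked, list(obj.index))
```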

DataFrame

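A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.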

There are many ways to construct a DataFrame. One of the most common is to pass a dict of equal-length lists or NumPy arrays:

In [49]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
    ...:         'year':[2000,2001,2002,2001,2002],
    ...:         'pop':[1.5,1.7,3.6,2.4,2.9]}

In [50]: frame = pd.DataFrame(data)

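As with a Series, the DataFrame is given an index automatically, and the columns are placed in sorted order. That ordering is the behavior of the pandas version used here; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order instead.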

In [51]: frame
Out[51]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you specify a sequence of columns, the DataFrame's columns are arranged in that order:

In [52]: pd.DataFrame(data,columns=['year','state','pop'])
Out[52]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
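
Because the default column order depends on the pandas version, passing columns explicitly is a simple way to get the same layout everywhere. The sketch below repeats the construction as a standalone script; the call to head is an addition for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# The columns argument fixes the column order regardless of pandas version.
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])

# head(n) returns only the first n rows, which is handy for large frames.
first_rows = frame.head(2)

print(list(frame.columns), frame.shape)
print(first_rows)
```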

As with a Series, if you pass a column that is not found in the data, it appears with NA values in the result:

In [56]: frame2 = pd.DataFrame(data,columns=['year','state','pop',
    ...: 'debt'],index=['one','two','three','four','five'])

In [57]: frame2
Out[57]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

In [58]: frame2.columns
Out[58]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column of a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [59]: frame2['state']
Out[59]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [60]: frame2.year
Out[60]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
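
Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The "pop" column in this very example is such a clash, as the following sketch shows; the variable names are introduced here for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_attribute = frame2.year       # fine: 'year' does not clash with anything
by_brackets = frame2['pop']      # bracket notation works for any column name
clashes = callable(frame2.pop)   # True: frame2.pop is the DataFrame.pop method

print(by_attribute.name, by_brackets.name, clashes)
```

For that reason bracket notation is the safer default in scripts.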

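The returned Series has the same index as the DataFrame, and its name attribute has been set accordingly. Rows can also be retrieved by name or by position, using the indexing attributes loc (label-based) and iloc (position-based).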

In [62]: frame2.loc['three']
Out[62]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [64]: frame2.iloc[0]
Out[64]:
year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object
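
Both indexers also accept a row and a column selector together. The sketch below, which assumes the frame2 defined above, shows that loc and iloc reach the same row and the same cell by different routes.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame2.loc['three']       # select the row by its index label
row_by_position = frame2.iloc[2]         # select the same row by position

cell_by_label = frame2.loc['three', 'pop']   # row label, column label
cell_by_position = frame2.iloc[2, 2]         # row position, column position

print(row_by_label.equals(row_by_position), cell_by_label, cell_by_position)
```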

Columns can be modified by assignment. For example, the empty "debt" column can be assigned a scalar value or an array of values:

In [65]: frame2['debt'] = 16.5

In [66]: frame2
Out[66]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

In [67]: frame2['debt'] = np.arange(5)

In [68]: frame2
Out[68]:
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4

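When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's index, and missing values are inserted wherever there is a hole: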

In [71]: val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])

In [72]: frame2['debt'] = val

In [73]: frame2
Out[73]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
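
The two rules can be seen side by side in the sketch below. It is an illustration only: the three-element list and the extra label 'six' are made up to trigger the length check and to show that labels absent from the DataFrame's index are ignored.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five'])

# A list is matched by position, so its length must equal the row count.
try:
    frame2['debt'] = [1.0, 2.0, 3.0]   # 3 values for 5 rows
    length_checked = False
except ValueError:
    length_checked = True

# A Series is matched by label: rows without a label get NaN, and the
# label 'six' (not in frame2.index) is simply dropped.
val = pd.Series([-1.2, -1.5, -1.7, 9.9], index=['two', 'four', 'five', 'six'])
frame2['debt'] = val

print(length_checked)
print(frame2)
```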

Assigning to a column that does not exist creates a new column. The del keyword deletes columns:

In [82]: frame2['eastern'] = frame2.state == 'Ohio'

In [83]: frame2
Out[83]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False

In [84]: del frame2['eastern']

In [85]: frame2.columns
Out[85]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
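
del modifies the DataFrame in place. When the original should stay untouched, DataFrame.drop is an alternative that returns a new object. The sketch below contrasts the two on a shortened version of frame2; the name trimmed is introduced here for illustration.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2001],
                       'state': ['Ohio', 'Ohio', 'Nevada']},
                      index=['one', 'two', 'four'])
frame2['eastern'] = frame2['state'] == 'Ohio'

# drop returns a new DataFrame and leaves frame2 unchanged.
trimmed = frame2.drop(columns=['eastern'])
still_there = 'eastern' in frame2.columns

# del removes the column from frame2 itself.
del frame2['eastern']
gone_now = 'eastern' not in frame2.columns

print(list(trimmed.columns), still_there, gone_now)
```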

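The column returned by indexing a DataFrame is a view on the underlying data, not a copy, so in-place modifications to that Series can show up in the DataFrame. To get an independent copy, call the Series's copy method explicitly.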
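
Whether a column obtained by indexing actually shares memory with its DataFrame can depend on the pandas version and its settings, so the sketch below only demonstrates the portable half of the statement: an explicit copy is always independent. The small frame2 used here is a cut-down stand-in for the one above.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])

# copy() gives a Series with its own data.
year_copy = frame2['year'].copy()
year_copy['one'] = 1999          # modify the copy only

print(year_copy['one'], frame2.loc['one', 'year'])
```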

Another common form of data is a nested dict of dicts:

In [86]: pop = {'Nevada':{2001:2.4,2002:2.9},
    ...: 'Ohio':{2000:1.5,2001:1.7,2002:3.6}}

In [87]: frame3 = pd.DataFrame(pop)

In [88]: frame3
Out[88]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

The outer dict keys are used as the columns and the inner keys as the row index.
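
The mapping from nested keys to axes can be verified directly, and transposing swaps the two roles. This is a sketch under the assumption that the pop dict is the one defined above; the checks avoid relying on row order, which may differ between pandas versions.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

columns = set(frame3.columns)       # outer keys become column labels
rows = sorted(frame3.index)         # inner keys are merged into the row index
nevada_2000_missing = pd.isnull(frame3.loc[2000, 'Nevada'])  # no data for this pair

# .T transposes: states become the row index, years become the columns.
transposed = frame3.T

print(columns, rows, nevada_2000_missing)
print(transposed)
```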

Dicts of Series are handled in much the same way:

In [95]: pdata = {'Ohio':frame3['Ohio'][:-1],
    ...: 'Nevada':frame3['Nevada'][:2]}

In [96]: pd.DataFrame(pdata)
Out[96]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

Data that can be passed to the DataFrame constructor:
