Python for Data Analysis

DataFrame

2019-01-31  庵下桃花仙

A DataFrame represents a rectangular table of data and has both a row index and a column index.

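Constructing a DataFrame

One of the most common ways is to pass a dict of equal-length lists or NumPy arrays:
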
In [43]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    ...:         'year' : [2000, 2001, 2002, 2001, 2001, 2003],
    ...:         'pop'  : [1.5, 1.7,  3.6, 2.4, 2.9, 3.2]}

In [44]: frame = pd.DataFrame(data)

In [45]: frame
Out[45]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2001  2.9
5  Nevada  2003  3.2
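
As a standalone sketch of the session above (the session itself does not show its imports, so the import line here is an assumption), the same construction can be run as a script and the resulting shape and labels inspected:

```python
import pandas as pd

# Same data as in the session: a dict of equal-length lists.
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2001, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # column labels taken from the dict keys
print(list(frame.index))    # default integer row labels
```

Each dict key becomes a column, and because no index was given, the rows get a default integer index.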

For a large DataFrame, the head method selects only the first five rows:

In [46]: frame.head()
Out[46]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2001  2.9
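
head also accepts a row count, and tail is its counterpart for the end of the table. A small sketch on a made-up ten-row frame (the column name x is only for illustration):

```python
import pandas as pd

# A hypothetical ten-row frame, just large enough to show the effect.
frame = pd.DataFrame({"x": range(10)})

default_head = frame.head()   # first five rows by default
first_three = frame.head(3)   # first three rows
last_two = frame.tail(2)      # last two rows
```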

If you specify a sequence of columns, the DataFrame's columns are arranged in that order:

In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2001  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column that is not contained in the dict, it appears with missing values in the result:

In [49]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
    ...:                                  index=['one', 'two', 'three', 'four', 'five', 'six'])

In [50]: frame2
Out[50]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2001  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
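
A quick way to confirm that the requested but absent column is entirely missing is isna. This is a sketch on a shortened version of the data, not the original six rows:

```python
import pandas as pd

# Shortened version of the data above (two rows instead of six).
data = {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}

# 'debt' is not a key of the dict, so the column is filled with missing values.
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"],
                      index=["one", "two"])

all_missing = frame2["debt"].isna().all()
```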

A column can be retrieved as a Series either by dict-like notation or by attribute:

In [51]: frame2['state']
Out[51]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [52]: frame2.year
Out[52]:
one      2000
two      2001
three    2002
four     2001
five     2001
six      2003
Name: year, dtype: int64
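
Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The pop column used in these notes is a convenient example, since DataFrame already has a pop method. A minimal sketch on a two-row frame:

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

# For 'year' both styles return the same Series.
year_by_attr = frame2.year
year_by_key = frame2["year"]

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the bound method, not the column.
pop_attr = frame2.pop
pop_col = frame2["pop"]
```

Bracket notation works for any column name, which makes it the safer default.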

Rows can also be selected by label with the special loc attribute (or by integer position with iloc):

In [53]: frame2.loc['three']
Out[53]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
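
The difference between label-based and position-based row selection can be sketched as follows, using a three-row subset of the data:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Ohio"]},
    index=["one", "two", "three"],
)

by_label = frame2.loc["three"]   # row whose index label is 'three'
by_position = frame2.iloc[2]     # third row, counting from zero
```

Both calls return the same row as a Series whose name is the row label.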

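Columns can be modified by assignment. For example, the empty debt column can be assigned a scalar value or an array of values: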

In [54]: frame2['debt'] = 16.5

In [55]: frame2
Out[55]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2001  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

In [56]: frame2['debt'] = np.arange(6.)

In [57]: frame2
Out[57]:
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2001  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0
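
When a list or array is assigned to a column, its length must match the length of the DataFrame. A small sketch of what happens otherwise, on a three-row frame:

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]})

frame2["debt"] = 16.5            # a scalar is broadcast to every row
frame2["debt"] = np.arange(3.0)  # an array must have one value per row

error = None
try:
    frame2["debt"] = [1.0, 2.0]  # two values for three rows
except ValueError as exc:
    error = exc                  # pandas rejects the length mismatch
```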

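If you assign a Series to a column, its labels are realigned exactly to the DataFrame's index, and missing values are inserted for any index labels the Series lacks: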

In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [59]: frame2['debt'] = val

In [60]: frame2
Out[60]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
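
The alignment works in both directions: labels in the DataFrame that the Series lacks become NaN, and labels in the Series that the DataFrame lacks are dropped. A sketch with a made-up extra label 'seven':

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]},
                      index=["one", "two", "three"])

# 'seven' is a hypothetical label that does not exist in frame2's index.
val = pd.Series([-1.2, 9.9], index=["two", "seven"])
frame2["debt"] = val

missing_count = frame2["debt"].isna().sum()  # rows 'one' and 'three'
```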

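Assigning to a column that does not exist creates a new column. The del keyword deletes a column, just as with a dict: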

In [61]: frame2['eastern'] = frame2.state == 'Ohio'

In [62]: frame2
Out[62]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2001  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

In [63]: del frame2['eastern']

In [64]: frame2.columns
Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
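
del modifies the DataFrame in place. If the original should stay untouched, the drop method returns a new DataFrame without the column. A brief sketch comparing the two on a two-row frame:

```python
import pandas as pd

frame2 = pd.DataFrame({"state": ["Ohio", "Nevada"]})
frame2["eastern"] = frame2["state"] == "Ohio"

# drop returns a new object; frame2 still has the column afterwards.
trimmed = frame2.drop(columns="eastern")
still_there = "eastern" in frame2.columns

# del removes the column from frame2 itself.
del frame2["eastern"]
```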

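The column returned by indexing a DataFrame is a view on the underlying data rather than a copy (at least in the pandas version used here), so in-place modifications to the Series are reflected in the DataFrame. To get an independent copy, call the Series's copy method explicitly.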
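
A minimal sketch of the explicit copy: changes made to the copied Series do not reach the DataFrame it came from.

```python
import pandas as pd

frame2 = pd.DataFrame({"debt": [1.0, 2.0, 3.0]})

# An explicit copy is independent of the DataFrame's data.
debt_copy = frame2["debt"].copy()
debt_copy.iloc[0] = 99.0
```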

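Another common form of data is a nested dict of dicts. If it is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row index: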

In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
    ...:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [66]: frame3 = pd.DataFrame(pop)

In [67]: frame3
Out[67]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

You can transpose the DataFrame (swap rows and columns) with the T attribute:

In [68]: frame3.T
Out[68]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
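
The nested-dict construction and the transpose can be checked in a standalone script. The sketch below deliberately avoids relying on the order of the row labels, on the assumption that it may differ between pandas versions:

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)  # outer keys -> columns, inner keys -> index
transposed = frame3.T       # rows and columns swapped

# Nevada has no entry for 2000, so that cell is missing.
nevada_2000_missing = pd.isna(frame3.loc[2000, "Nevada"])
```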

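If an index is specified explicitly, the keys of the inner dicts are not combined and sorted to form the index; the given index is used instead: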

In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[69]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

A dict of Series can also be used to construct a DataFrame, and is treated in much the same way:

In [70]: pdata = {'Ohio': frame3['Ohio'][: -1],
    ...:          'Nevada': frame3['Nevada'][: 2]}

In [71]: pd.DataFrame(pdata)
Out[71]:
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
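
With a dict of Series, the resulting index is the union of the individual Series indexes, and gaps are filled with NaN. A sketch with two hand-built Series (the variable names are only for illustration):

```python
import pandas as pd

ohio = pd.Series([1.5, 1.7], index=[2000, 2001])
nevada = pd.Series([2.4, 2.9], index=[2001, 2002])

combined = pd.DataFrame({"Ohio": ohio, "Nevada": nevada})

# 2002 exists only in the Nevada Series, so Ohio is missing there.
ohio_2002_missing = pd.isna(combined.loc[2002, "Ohio"])
```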

If a DataFrame's index and columns have their name attributes set, these are displayed as well:

In [72]: frame3.index.name = 'year'

In [73]: frame3.columns.name = 'state'

In [74]: frame3
Out[74]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

The values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

In [75]: frame3.values
Out[75]:
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

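If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns: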

In [77]: frame2.values
Out[77]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2001, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
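
The effect of mixed column types on the resulting array can be sketched with to_numpy, the method form of values. The two small frames are made up for illustration:

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})  # int and float columns
mixed = pd.DataFrame({"a": [1, 2], "s": ["x", "y"]})    # int and string columns

numeric_dtype = numeric.to_numpy().dtype  # ints are upcast to float
mixed_dtype = mixed.to_numpy().dtype      # falls back to generic Python objects
```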

Index Objects

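pandas Index objects hold the axis labels and other metadata such as axis names. Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index: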

In [78]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [79]: index = obj.index

In [80]: index
Out[80]: Index(['a', 'b', 'c'], dtype='object')

In [81]: index[1:]
Out[81]: Index(['b', 'c'], dtype='object')

Trying to assign to an element of an Index raises a TypeError:

In [82]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-82-a452e55ce13b> in <module>
----> 1 index[1] = 'd'

c:\users\a\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   3881
   3882     def __setitem__(self, key, value):
-> 3883         raise TypeError("Index does not support mutable operations")
   3884
   3885     def __getitem__(self, key):

TypeError: Index does not support mutable operations
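
In a script the same error can be caught, and the labels can be changed by assigning a whole new index instead of mutating one element. A minimal sketch:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

mutation_failed = False
try:
    obj.index[1] = "d"       # element assignment is not supported
except TypeError:
    mutation_failed = True

# Replacing the entire index is allowed.
obj.index = ["a", "d", "c"]
```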

An Index can be shared between data structures; below, obj2.index is the same object as labels:

In [83]: labels = pd.Index(np.arange(3))

In [84]: labels
Out[84]: Int64Index([0, 1, 2], dtype='int64')

In [85]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [86]: obj2
Out[86]:
0    1.5
1   -2.5
2    0.0
dtype: float64

In [87]: obj2.index is labels
Out[87]: True

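Because Index objects are immutable, they also behave like a fixed-size set, in addition to being array-like: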

In [88]: frame3
Out[88]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [89]: frame3.columns
Out[89]: Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [90]: 'Ohio' in frame3.columns
Out[90]: True

In [91]: 2003 in frame3.columns
Out[91]: False
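
Beyond membership tests, an Index offers set-style operations. The sketch below uses two small hand-made indexes, where the label 'Texas' is only for illustration, and compares the results as Python sets so that it does not depend on label order:

```python
import pandas as pd

left = pd.Index(["Nevada", "Ohio"])
right = pd.Index(["Ohio", "Texas"])  # 'Texas' is a made-up extra label

common = left.intersection(right)    # labels present in both
either = left.union(right)           # labels present in at least one
only_left = left.difference(right)   # labels in left but not in right
```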

Unlike Python sets, a pandas Index can contain duplicate labels:

In [92]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [93]: dup_labels
Out[93]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
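
Selecting with a duplicated label returns every occurrence of that label. A short sketch on a Series that uses the same labels (the values are made up):

```python
import pandas as pd

dup = pd.Series([1, 2, 3, 4], index=["foo", "foo", "bar", "bar"])

foo_rows = dup["foo"]            # a Series with both 'foo' entries
is_unique = dup.index.is_unique  # False, since labels repeat
```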