Pandas: Index and Selection Methods (Part

2018-09-01 zeks有好的生物钟

This post covers:

The three basic pandas indexing methods: .loc, .iloc, and [].
The basic ways to select data in pandas: Selection By Label, Selection By Position, and Selection By Callable.

Basic Indexing Methods

import pandas as pd
import numpy as np

Pandas currently offers three different multi-axis indexing methods:

.loc indexing
.iloc indexing
[] indexing, a.k.a. __getitem__() indexing

.loc Indexing

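To get values out of an object with multi-axis indexing, use the following notation (.iloc works the same way). Note that Panel has been removed from pandas since version 0.25; its rows in the tables below are kept as they appear in the documentation of the time.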

Object Type     Indexers

Series          s.loc[indexer]

DataFrame       df.loc[row_indexer,column_indexer]

Panel           p.loc[item_indexer,major_indexer,minor_indexer]

In the notation above, any of the axis accessors may be the null slice :. Any axis left out of the specification is assumed to be :. For example,
df.loc['a']
is equivalent to
df.loc['a', :]
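
As a quick check of that equivalence, here is a minimal sketch on a toy frame. The name df_demo and its values are made up for illustration and are not part of the original post.

```python
import pandas as pd

# Hypothetical toy frame; labels and values are illustrative only.
df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=list("abc"))

row_short = df_demo.loc["a"]     # column axis omitted, so it is treated as ':'
row_full = df_demo.loc["a", :]   # column axis spelled out explicitly

# Both forms return the same Series for row 'a'.
same = row_short.equals(row_full)
print(same)
```

Both expressions return the row labeled 'a' as a Series indexed by the column labels.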

# .loc indexing
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8,4), index=dates, columns=['A','B','C','D'])

df.loc[:,'A':'C']

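The indexing expression above,
df.loc[:, 'A':'C']
shows that a label-based slice in pandas includes both the start and the stop element. This differs from positional slices in Python and NumPy, which exclude the stop.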
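
To make the contrast concrete, the following sketch compares a label slice with an ordinary Python list slice. The frame and variable names are assumptions chosen for this illustration.

```python
import pandas as pd

letters = list("ABCD")
frame = pd.DataFrame([[1, 2, 3, 4]], columns=letters)

# Label slice with .loc: both 'A' and 'C' are kept.
label_slice_cols = list(frame.loc[:, "A":"C"].columns)

# Plain Python list slice: the stop position (2) is excluded.
position_slice = letters[0:2]

print(label_slice_cols)   # ['A', 'B', 'C']
print(position_slice)     # ['A', 'B']
```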

[] index

The primary function of [] indexing is to select lower-dimensional slices. Indexing a pandas object with [] returns the following:

Object Type     Selection           Return Value Type
Series          series[label]       scalar value
DataFrame       frame[colname]      Series corresponding to colname
Panel           panel[itemname]     DataFrame corresponding to the itemname

Passing a column name to []:

df['A']

2000-01-01    1.002840
2000-01-02    1.339418
2000-01-03    0.934340
2000-01-04   -1.350004
2000-01-05    0.579421
2000-01-06   -0.094316
2000-01-07   -1.327806
2000-01-08   -0.136517
Freq: D, Name: A, dtype: float64

Passing a list of column names to []:

df[['A','B','C']]

df[['B','A']] = df[['A','B']]

df

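The statement

df[['A','B']] = df[['B','A']]

is a convenient way to apply an in-place transform to a subset of the columns.

It is equivalent to:

df.__setitem__(['A','B'], df.__getitem__(['B','A']))

Note, however, that

df.loc[:, ['B', 'A']] = df[['A','B']]

does not swap A and B. When assigning through .loc, pandas aligns all axes. As I understand it, pandas matches the row and column labels of both sides before assigning values, so each column is simply assigned to itself. To swap two columns with .loc, use the raw values instead: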

df.loc[:, ['B', 'A']] = df[['A', 'B']].values

Here, values does the following:

Return a Numpy representation of the DataFrame. Only the values in the DataFrame will be returned, the axes labels will be removed.

This way, the label information is dropped once columns A and B have been selected, so the .loc assignment no longer aligns on the axes.

See the code below:

df

df.loc[:, ['B', 'A']] = df[['A', 'B']]
df

df.loc[:, ['B', 'A']] = df[['A', 'B']].values
df
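
Because the frame above holds random numbers, the effect can be hard to see. Below is a small deterministic sketch of the same two assignments; the frame swap and its values are assumptions made for illustration.

```python
import pandas as pd

swap = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# The right-hand side still carries column labels, so pandas aligns
# A with A and B with B before assigning: nothing is swapped.
swap.loc[:, ["B", "A"]] = swap[["A", "B"]]
after_aligned = swap.copy()

# .values strips the labels, so the assignment is purely positional:
# column B receives the old A values and column A the old B values.
swap.loc[:, ["B", "A"]] = swap[["A", "B"]].values
after_values = swap.copy()

print(after_aligned)
print(after_values)
```

The first assignment leaves the frame unchanged; only the second one exchanges the two columns.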

An Additional, Less Frequently Used Indexing Method: Attribute Access

Attribute access (.attr_name) can also be used for indexing.

Object Type     Selection Result
Series          index label
DataFrame       column
Panel           item (DataFrame)

# Series Attribute Index
sa = pd.Series([1,2,3], index = list("abc"))
sa.a

#DataFrame Attribute Index
dfa = df.copy()
dfa

dfa.A

2000-01-01    1.002840
2000-01-02    1.339418
2000-01-03    0.934340
2000-01-04   -1.350004
2000-01-05    0.579421
2000-01-06   -0.094316
2000-01-07   -1.327806
2000-01-08   -0.136517
Freq: D, Name: A, dtype: float64

When a column already exists in the DataFrame, attribute access can be used to assign to that column.

dfa.A = list(range(len(dfa.index)))

dfa

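** When a column does not exist in the DataFrame, attribute access cannot be used to retrieve it, nor to create it as a new column. **

Trying to create a new column through attribute access does not create a column; it really just adds a new attribute to the object.

Use [] or .loc to create the column instead.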

dfNoA = df.loc[:,['B','C', 'D']].copy()
dfNoA

# Trying to create a new column via attribute access: F ends up as an attribute of dfNoA, not as a column of the DataFrame
dfNoA.F = list(range(len(df.index)))

/Users/zxwang/.pyenv/versions/tensorflow/lib/python3.6/site-packages/ipykernel_launcher.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

dfNoA

dfNoA.F

[0, 1, 2, 3, 4, 5, 6, 7]

# Create a new column with .loc
dfNoA.loc[:,'A'] = list(range(len(df.index)))
dfNoA

# Create a new column with []
dfNoA1 = df.loc[:,['B','C', 'D']].copy()
dfNoA1['A'] = list(range(len(df.index)))
dfNoA1
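
The difference between the two outcomes can also be checked programmatically. This is a minimal sketch with made-up names (frame, F, G); the UserWarning that pandas emits on the attribute assignment is silenced only to keep the output clean.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"B": [1, 2, 3]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    frame.F = [0, 1, 2]   # sets a plain Python attribute, not a column

has_column_after_attr = "F" in frame.columns

frame["G"] = [0, 1, 2]    # [] creates a real column
has_column_after_setitem = "G" in frame.columns

print(has_column_after_attr, has_column_after_setitem)   # False True
```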

Slicing Ranges

When [] is used with a range slice on a DataFrame, the slice applies to the rows.

df

df[:3]
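
A small deterministic sketch of the same behavior, using an assumed frame named grid:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# A bare slice inside [] selects rows by position and keeps every column.
first_two = grid[:2]

print(first_two.shape)         # (2, 3)
print(list(first_two.index))   # [0, 1]
```

This is the one case where [] on a DataFrame works on rows rather than columns, which is easy to forget.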

Selection By Label

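Pandas provides a set of methods for purely label-based indexing.

.loc is the primary access method for label-based indexing. The following are valid inputs for .loc:

A single label, such as 'a' or 5. Note that a single label passed to .loc indexes the row labels, and that 5 here is interpreted as a label of the index, not as an integer position along it.
A list or array of labels, such as ['a', 'b', 'c'].
A slice object with labels, such as 'a':'c'. Note that in pandas label slices, both the start and the stop are included.
A boolean array.
A callable.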

df1 = pd.DataFrame(np.random.randn(6, 4), index=list("abcdef"), columns=list("ABCD"))
df1

# Access with list of index labels
df1.loc[['a','b','c'], :]

#Access with list of col labels
df1.loc[:, ['A', 'C']]

# Access via label slice object
df1.loc['b', 'A':'C']

A   -1.909139
B    0.782975
C    1.155431
Name: b, dtype: float64

# Getting Value with boolean array
df1.loc[:, df1.loc['b'] > 0]

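The statement above means:

Select all values of the columns whose value in row "b" is greater than 0. In other words, this .loc expression selects columns based on the values of a row.

If you run the expression df1.loc['b'] > 0 on its own, you can see that it returns a boolean Series with one entry per column.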

df1.loc['b'] > 0

A    False
B     True
C     True
D    False
Name: b, dtype: bool
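
Since df1 is random, here is a deterministic sketch of the same pattern. The frame scores and its values are assumptions chosen so the result is predictable.

```python
import pandas as pd

scores = pd.DataFrame(
    {"A": [1, -2], "B": [3, 4], "C": [-5, 6]},
    index=["a", "b"],
)

mask = scores.loc["b"] > 0      # boolean Series indexed by column label
picked = scores.loc[:, mask]    # keeps the columns where row 'b' is positive

print(mask.tolist())            # [False, True, True]
print(list(picked.columns))     # ['B', 'C']
```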

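Slicing with Labels: Details

Slicing with labels clearly requires attention to a few details.
In a Python list slice, the start and stop are positions, and positions are always contiguous no matter how long the list is. With label-based slicing on a pandas object, labels are not necessarily contiguous and may even be unordered; column labels could be ['b', 'f', 'a'], for example. What happens when you slice in that case?

The answer: when both labels are present, you get every label located between the start label and the stop label in the index's own order, including the start and the stop themselves.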

s1 = pd.Series(list('abcde'), index=[0,3,2,5,4])

# Get all labels located between the start and stop labels, endpoints included
s1.loc[3:5]

3    b
2    c
5    d
dtype: object

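What happens if the start or stop label of the slice is not among the labels of the pandas object?

That depends on whether the labels of the object are sorted.

If the labels are sorted, slicing still works and selects the labels that rank between the start and the stop, even when neither bound exists in the index.
If the labels are unsorted, an exception (KeyError) is raised.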

# The index labels of s1 are unsorted, so .loc[1:6] raises an exception.
s1.loc[1:6]

# s1.sort_index() returns a Series with sorted labels, so .loc[1:6] succeeds.

s1.sort_index().loc[1:6]

2    c
3    b
4    e
5    d
dtype: object
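
Both cases can be put side by side in one runnable sketch. It rebuilds the same Series under the assumed name s and catches the KeyError instead of letting it stop the script.

```python
import pandas as pd

s = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])

# Unsorted index and bounds (1 and 6) that are not labels: pandas raises KeyError.
try:
    s.loc[1:6]
    raised = False
except KeyError:
    raised = True

# After sorting, the same slice selects the labels that rank between 1 and 6.
sorted_result = s.sort_index().loc[1:6]

print(raised)                       # True
print(list(sorted_result.index))    # [2, 3, 4, 5]
```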

Selection by Position

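Pandas provides a set of methods for purely integer-based indexing. The semantics closely follow Python and NumPy slicing: indexing is 0-based, and a slice includes the start position but excludes the stop position. Using a non-integer, even a valid label, raises an exception.

.iloc is the primary access method for selection by position. The following are valid inputs for it:

An integer, such as 5.
A list or array of integers, such as [4, 3, 0].
A slice object with integers.
A boolean array.
A callable.

These are quite similar to the valid inputs for .loc.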

df2 = pd.DataFrame(np.random.randn(6,4), index=[1,2,3,4,5,6], columns=list("ABCD"))

df2

# Pass slice objects to iloc
df2.iloc[0:3, 1:3]

# Pass an integer to iloc: this selects the 3rd row by position (label 3), not the row whose index label is 2
df2.iloc[2,:]

A   -1.445161
B   -1.066899
C    0.703438
D   -0.389675
Name: 3, dtype: float64
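
The position-versus-label distinction is easiest to see when the index labels are themselves integers. A minimal sketch, with an assumed frame f whose labels start at 1:

```python
import pandas as pd

f = pd.DataFrame({"A": [10, 20, 30]}, index=[1, 2, 3])

by_position = f.iloc[2, 0]    # third row, first column
by_label = f.loc[2, "A"]      # row whose label is 2

# A list of positions returns the rows in the order requested.
picked = f.iloc[[2, 0]]

print(by_position, by_label)   # 30 20
print(list(picked.index))      # [3, 1]
```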

Selection By Callable

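.loc, .iloc, and [] all accept a callable as the indexer. The callable must be a function that takes one argument, the pandas object being indexed, and returns valid output for that indexer.

Taking .loc as an example, the function must return one of the following four kinds of value:

A single label, such as 'a' or 5. Note that a single label passed to .loc indexes the row labels, and that 5 here is interpreted as a label of the index, not as an integer position along it.
A list or array of labels, such as ['a', 'b', 'c'].
A slice object with labels, such as 'a':'c'. Note that in pandas label slices, both the start and the stop are included.
A boolean array.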

# Selection by callable: the lambda returns a boolean Series.
df2.loc[lambda df : df['A'] > 0]

# The statement above is equivalent to the following
df2.loc[df2['A']>0]

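If a callable used as an indexer always has to return a valid indexer, why not pass that indexer directly?

Callables exist mainly for chained selection, where they reduce the need for temporary variables. The official documentation puts it this way:

using callable indexers, you can chain data selection operations without using temporary variable

An example will be added later.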
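
A minimal sketch of what such a chained selection might look like. The frame raw, the derived column C, and the use of assign are assumptions made for this illustration, not the example from the official documentation.

```python
import pandas as pd

raw = pd.DataFrame({"A": [1, -1, 2, -2], "B": [10, 20, 30, 40]})

result = (
    raw.assign(C=lambda d: d["A"] * d["B"])   # derive a new column
       .loc[lambda d: d["C"] > 0]             # filter on it; no name needed for the intermediate frame
       .loc[:, ["A", "C"]]                    # keep only the columns of interest
)

print(result)
```

Without the callable, the frame returned by assign would have to be stored in a temporary variable first, because column C does not exist on raw itself.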

The reference for everything above is:

Indexing and Selecting Data https://pandas.pydata.org/pandas-docs/stable/indexing.html