Python 数据处理（二十四）—— 索引和选择

2021-02-28 本文已影响0人名本无名

8 组合位置和标签索引

如果你想获取 'A' 列的第 0 和第 2 个元素，你可以这样做:

In [98]: dfd = pd.DataFrame({'A': [1, 2, 3],
   ....:                     'B': [4, 5, 6]},
   ....:                    index=list('abc'))
   ....: 

In [99]: dfd
Out[99]: 
   A  B
a  1  4
b  2  5
c  3  6

In [100]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[100]: 
a    1
c    3
Name: A, dtype: int64

这也可以用 .iloc 获取，通过使用位置索引来选择内容

In [101]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[101]: 
a    1
c    3
Name: A, dtype: int64

可以使用 .get_indexer 获取多个索引:

In [101]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[101]: 
a    1
c    3
Name: A, dtype: int64

9 列表中包含缺失标签的索引已被弃用

警告：

对于包含一个或多个缺失标签的列表，使用 .loc 或 [] 将不再重新索引，而是使用 .reindex

在以前的版本中，只要索引列表中存在至少一个有效标签，就可以使用 .loc[list-of-labels]

但是现在，只要索引列表中存在缺失的标签将引发 KeyError。推荐的替代方法是使用 .reindex()。

例如

In [103]: s = pd.Series([1, 2, 3])

In [104]: s
Out[104]: 
0    1
1    2
2    3
dtype: int64

索引列表的标签都存在

In [105]: s.loc[[1, 2]]
Out[105]: 
1    2
2    3
dtype: int64

先前的版本

In [4]: s.loc[[1, 2, 3]]
Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64

但是，现在

In [4]: s.loc[[1, 2, 3]]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-28925a59f003> in <module>
----> 1 s.loc[[1, 2, 3]]
...
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([3], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

reindex

索引标签列表中包含不存在的标签，使用 reindex

In [106]: s.reindex([1, 2, 3])
Out[106]: 
1    2.0
2    3.0
3    NaN
dtype: float64

另外，如果你只想选择有效的键，可以使用下面的方法，同时保留了数据的 dtype

In [107]: labels = [1, 2, 3]

In [108]: s.loc[s.index.intersection(labels)]
Out[108]: 
1    2
2    3
dtype: int64

对于 .reindex()，如果有重复的索引将会引发异常

In [109]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])

In [110]: labels = ['c', 'd']

In [17]: s.reindex(labels)
ValueError: cannot reindex from a duplicate axis

通常，您可以将所需的标签与当前轴做交集，然后重新索引

In [111]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[111]: 
c    3.0
d    NaN
dtype: float64

但是，如果你的索引结果包含重复标签，还是会引发异常

In [41]: labels = ['a', 'd']

In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
ValueError: cannot reindex from a duplicate axis

10 随机抽样

使用 sample() 方法可以从 Series 或 DataFrame 中随机选择行或列。

该方法默认会对行进行采样，并接受一个特定的行数、列数，或数据子集。

In [112]: s = pd.Series([0, 1, 2, 3, 4, 5])

# 不传递参数，返回任意一行
In [113]: s.sample()
Out[113]: 
4    4
dtype: int64

# 选择指定数量的行
In [114]: s.sample(n=3)
Out[114]: 
0    0
4    4
1    1
dtype: int64

# 返回百分比的数据
In [115]: s.sample(frac=0.5)
Out[115]: 
5    5
3    3
1    1
dtype: int64

默认情况下，sample 每行最多返回一次，但也可以使用 replace 参数进行替换采样

In [116]: s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement (default):
In [117]: s.sample(n=6, replace=False)
Out[117]: 
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64

# With replacement:
In [118]: s.sample(n=6, replace=True)
Out[118]: 
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64

默认情况下，每一行被选中的概率相等，但是如果你想让每一行有不同的概率，你可以为 sample 函数的 weights 参数设置抽样权值

In [119]: s = pd.Series([0, 1, 2, 3, 4, 5])

In [120]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [121]: s.sample(n=3, weights=example_weights)
Out[121]: 
5    5
4    4
3    3
dtype: int64

# Weights will be re-normalized automatically
In [122]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [123]: s.sample(n=1, weights=example_weights2)
Out[123]: 
0    0
dtype: int64

这些权重可以是一个列表、一个 NumPy 数组或一个 Series，但它们的长度必须与你要抽样的对象相同。

缺失的值将被视为权重为零，并且不允许使用 inf 值。如果权重之和不等于 1，则将所有权重除以权重之和，将其重新归一化。例如

In [119]: s = pd.Series([0, 1, 2, 3, 4, 5])

In [120]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [121]: s.sample(n=3, weights=example_weights)
Out[121]: 
5    5
4    4
3    3
dtype: int64

# Weights will be re-normalized automatically
In [122]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [123]: s.sample(n=1, weights=example_weights2)
Out[123]: 
0    0
dtype: int64

当应用于 DataFrame 时，您可以通过简单地将列名作为字符串传递给 weights 作为采样权重（前提是您要采样的是行而不是列）。

In [124]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
   .....:                     'weight_column': [0.5, 0.4, 0.1, 0]})
   .....: 

In [125]: df2.sample(n=3, weights='weight_column')
Out[125]: 
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1

sample 还允许用户使用 axis 参数对列进行抽样。

In [126]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

In [127]: df3.sample(n=1, axis=1)
Out[127]: 
   col1
0     1
1     2
2     3

最后，我们还可以使用 random_state 参数为 sample 的随机数生成器设置一个种子，它将接受一个整数（作为种子）或一个 NumPy RandomState 对象

In [128]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# With a given seed, the sample will always draw the same rows.
In [129]: df4.sample(n=2, random_state=2)
Out[129]: 
   col1  col2
2     3     4
1     2     3

In [130]: df4.sample(n=2, random_state=2)
Out[130]: 
   col1  col2
2     3     4
1     2     3

当为该轴设置一个不存在的键时，.loc/[] 操作可以执行放大

在 Series 的情况下，这实际上是一个追加操作

In [131]: se = pd.Series([1, 2, 3])

In [132]: se
Out[132]: 
0    1
1    2
2    3
dtype: int64

In [133]: se[5] = 5.

In [134]: se
Out[134]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

可以通过 .loc 在任一轴上放大 DataFrame

In [135]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
   .....:                    columns=['A', 'B'])
   .....: 

In [136]: dfi
Out[136]: 
   A  B
0  0  1
1  2  3
2  4  5

In [137]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']

In [138]: dfi
Out[138]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

这就像 DataFrame 的 append 操作

In [139]: dfi.loc[3] = 5

In [140]: dfi
Out[140]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

11 快速获取和设置标量值

由于用 [] 做索引必须处理很多情况（单标签访问、分片、布尔索引等），所以需要一些开销来搞清楚你的意图

如果你只想访问一个标量值，最快的方法是使用 at 和 iat 方法，这两个方法在所有的数据结构上都实现了

与 loc 类似，at 提供了基于标签的标量查找，而 iat 提供了基于整数的查找，与 iloc 类似

In [141]: s.iat[5]
Out[141]: 5

In [142]: df.at[dates[5], 'A']
Out[142]: -0.6736897080883706

In [143]: df.iat[3, 0]
Out[143]: 0.7215551622443669

同时，你也可以根据这些索引进行设置值

In [144]: df.at[dates[5], 'E'] = 7

In [145]: df.iat[3, 0] = 7

如果索引标签不存在，会放大数据

In [146]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7

In [147]: df
Out[147]: 
                   A         B         C         D    E    0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN
2000-01-05 -0.424972  0.567020  0.276232 -1.087401  NaN  NaN
2000-01-06 -0.673690  0.113648 -1.478427  0.524988  7.0  NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0

12 布尔索引

另一种常见的操作是使用布尔向量来过滤数据。运算符包括：

|(or)、&(and)、~ (not)

这些必须用括号来分组，因为默认情况下，Python 会将 df['A'] > 2 & df['B'] < 3 这样的表达式评估为 df['A'] > (2 & df['B']) < 3，而理想的执行顺序是 (df['A'] > 2) & (df['B'] < 3)

使用一个布尔向量来索引一个 Series，其工作原理和 NumPy ndarray 一样。

In [148]: s = pd.Series(range(-3, 4))

In [149]: s
Out[149]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [150]: s[s > 0]
Out[150]: 
4    1
5    2
6    3
dtype: int64

In [151]: s[(s < -1) | (s > 0.5)]
Out[151]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [152]: s[~(s < 0)]
Out[152]: 
3    0
4    1
5    2
6    3
dtype: int64

您可以使用一个与 DataFrame 的索引长度相同的布尔向量从 DataFrame 中选择行

In [153]: df[df['A'] > 0]
Out[153]: 
                   A         B         C         D   E   0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236 NaN NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860 NaN NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268 NaN NaN

列表推导式和 Series 的 map 函数可用于产生更复杂的标准

In [154]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

# only want 'two' or 'three'
In [155]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [156]: df2[criterion]
Out[156]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# equivalent but slower
In [157]: df2[[x.startswith('t') for x in df2['a']]]
Out[157]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# Multiple criteria
In [158]: df2[criterion & (df2['b'] == 'x')]
Out[158]: 
       a  b         c
3  three  x  0.361719

我们可以使用布尔向量结合其他索引表达式，在多个轴上索引

In [159]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[159]: 
   b         c
3  x  0.361719

iloc 支持两种布尔索引。如果索引器是一个布尔值 Series，就会引发异常。

例如，在下面的例子中，df.iloc[s.values, 1] 是正确的。但是 df.iloc[s，1] 会引发 ValueError。

In [160]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
   .....:                   index=list('abc'),
   .....:                   columns=['A', 'B'])
   .....: 

In [161]: s = (df['A'] > 2)

In [162]: s
Out[162]: 
a    False
b     True
c     True
Name: A, dtype: bool

In [163]: df.loc[s, 'B']
Out[163]: 
b    4
c    6
Name: B, dtype: int64

In [164]: df.iloc[s.values, 1]
Out[164]: 
b    4
c    6
Name: B, dtype: int64