Python Data Processing (11): Sorting

2021-02-12, by 名本无名

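11 Sorting

pandas supports three kinds of sorting:

Sorting by index labels
Sorting by the values of specified columns
Sorting by a combination of index labels and column values

11.1 Sorting by index

The Series.sort_index() and DataFrame.sort_index() methods sort a pandas object by its index levels.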

In [300]: df = pd.DataFrame(
   .....:     {
   .....:         "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
   .....:         "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
   .....:         "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
   .....:     }
   .....: )
   .....: 

In [301]: unsorted_df = df.reindex(
   .....:     index=["a", "d", "c", "b"], columns=["three", "two", "one"]
   .....: )
   .....: 

In [302]: unsorted_df
Out[302]: 
      three       two       one
a       NaN -1.152244  0.562973
d -0.252916 -0.109597       NaN
c  1.273388 -0.167123  0.640382
b -0.098217  0.009797 -1.299504

# DataFrame
In [303]: unsorted_df.sort_index()
Out[303]: 
      three       two       one
a       NaN -1.152244  0.562973
b -0.098217  0.009797 -1.299504
c  1.273388 -0.167123  0.640382
d -0.252916 -0.109597       NaN

In [304]: unsorted_df.sort_index(ascending=False)
Out[304]: 
      three       two       one
d -0.252916 -0.109597       NaN
c  1.273388 -0.167123  0.640382
b -0.098217  0.009797 -1.299504
a       NaN -1.152244  0.562973

In [305]: unsorted_df.sort_index(axis=1)
Out[305]: 
        one     three       two
a  0.562973       NaN -1.152244
d       NaN -0.252916 -0.109597
c  0.640382  1.273388 -0.167123
b -1.299504 -0.098217  0.009797

# Series
In [306]: unsorted_df["three"].sort_index()
Out[306]: 
a         NaN
b   -0.098217
c    1.273388
d   -0.252916
Name: three, dtype: float64
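Because the session above uses random numbers, its output cannot be reproduced exactly. The following is a minimal sketch with fixed values; the name frame and its contents are made up for illustration. It also shows that sort_index() returns a new object and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical frame with fixed values so the result is reproducible.
frame = pd.DataFrame(
    {"three": [1, 2, 3, 4], "two": [5, 6, 7, 8], "one": [9, 10, 11, 12]},
    index=["a", "d", "c", "b"],
)

by_rows = frame.sort_index()                       # row labels, ascending
by_rows_desc = frame.sort_index(ascending=False)   # row labels, descending
by_cols = frame.sort_index(axis=1)                 # column labels, ascending

print(list(by_rows.index))     # ['a', 'b', 'c', 'd']
print(list(by_cols.columns))   # ['one', 'three', 'two']
print(list(frame.index))       # ['a', 'd', 'c', 'b'] (original is unchanged)
```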

sort_index() also accepts a key parameter: a callable that is applied to the index before sorting.

For MultiIndex objects, key is applied per level, to the levels specified by level.

In [307]: s1 = pd.DataFrame({"a": ["B", "a", "C"], "b": [1, 2, 3], "c": [2, 3, 4]}).set_index(
   .....:     list("ab")
   .....: )
   .....: 

In [308]: s1
Out[308]: 
     c
a b   
B 1  2
a 2  3
C 3  4

In [309]: s1.sort_index(level="a")
Out[309]: 
     c
a b   
B 1  2
C 3  4
a 2  3

In [310]: s1.sort_index(level="a", key=lambda idx: idx.str.lower())
Out[310]: 
     c
a b   
a 2  3
B 1  2
C 3  4
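The same key mechanism works on a flat (single-level) index. The sketch below uses a made-up Series named labelled; the key callable receives the whole Index and is expected to return an Index-like object of the same length.

```python
import pandas as pd

# Hypothetical Series whose index mixes upper- and lower-case labels.
labelled = pd.Series([1, 2, 3], index=["B", "a", "C"])

# Default ordering compares code points, so upper-case sorts before lower-case.
default_order = labelled.sort_index()

# The key receives the whole Index; lower-casing it gives a case-insensitive sort.
case_insensitive = labelled.sort_index(key=lambda idx: idx.str.lower())

print(list(default_order.index))     # ['B', 'C', 'a']
print(list(case_insensitive.index))  # ['a', 'B', 'C']
```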

11.2 Sorting by values

The Series.sort_values() method sorts a Series by its values.

The DataFrame.sort_values() method sorts a DataFrame by its column or row values. Its by parameter specifies one or more columns to sort on.

In [311]: df1 = pd.DataFrame(
   .....:     {"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]}
   .....: )
   .....: 

In [312]: df1.sort_values(by="two")
Out[312]: 
   one  two  three
0    2    1      5
2    1    2      3
1    1    3      4
3    1    4      2

The by parameter also accepts a list of column names:

In [313]: df1[["one", "two", "three"]].sort_values(by=["one", "two"])
Out[313]: 
   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5
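When by is a list, the ascending parameter can also be a list with one flag per column. The sketch below rebuilds the same data under the made-up name frame and sorts "one" ascending while breaking ties on "two" in descending order.

```python
import pandas as pd

# Same values as df1 above, rebuilt so the example is self-contained.
frame = pd.DataFrame(
    {"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]}
)

# One ascending flag per sort column: "one" ascending, "two" descending.
mixed = frame.sort_values(by=["one", "two"], ascending=[True, False])

print(list(mixed.index))  # [3, 1, 2, 0]
```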

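These methods handle NA values specially through the na_position parameter. In the example below, s is a string-dtype Series carried over from an earlier part of this series; it is not defined in this section.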

In [314]: s[2] = np.nan

In [315]: s.sort_values()
Out[315]: 
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2    <NA>
5    <NA>
dtype: string

In [316]: s.sort_values(na_position="first")
Out[316]: 
2    <NA>
5    <NA>
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: string
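Since s is not defined in this section, here is a self-contained sketch of the same behaviour with a small float Series (the name values is made up). By default missing values are placed last; na_position="first" moves them to the front.

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing one missing value.
values = pd.Series([3.0, np.nan, 1.0, 2.0])

na_last = values.sort_values()                      # default: na_position="last"
na_first = values.sort_values(na_position="first")

print(list(na_last.index))   # [2, 3, 0, 1]
print(list(na_first.index))  # [1, 2, 3, 0]
```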

Sorting by values also supports a key parameter: a callable that is applied to the values before sorting.

In [317]: s1 = pd.Series(["B", "a", "C"])

In [318]: s1.sort_values()
Out[318]: 
0    B
2    C
1    a
dtype: object

In [319]: s1.sort_values(key=lambda x: x.str.lower())
Out[319]: 
1    a
0    B
2    C
dtype: object

For a Series, key receives the Series itself (not one element at a time) and should return a Series or array of the same shape.
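To make the "whole Series in, same-shape result out" contract concrete, the sketch below sorts a made-up Series named words by string length. The key is a vectorized operation on the full Series, so it is called once rather than once per element.

```python
import pandas as pd

# Hypothetical Series of words.
words = pd.Series(["pear", "fig", "banana"])

# The key gets the full Series and returns a same-length Series of lengths.
by_length = words.sort_values(key=lambda x: x.str.len())

print(list(by_length))        # ['fig', 'pear', 'banana']
print(list(by_length.index))  # [1, 0, 2]
```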

For a DataFrame, key is applied column by column, so it should still expect a Series and return a Series, for example:

In [320]: df = pd.DataFrame({"a": ["B", "a", "C"], "b": [1, 2, 3]})

In [321]: df.sort_values(by="a")
Out[321]: 
   a  b
0  B  1
2  C  3
1  a  2

In [322]: df.sort_values(by="a", key=lambda col: col.str.lower())
Out[322]: 
   a  b
1  a  2
0  B  1
2  C  3

The name or dtype of each column can be used to apply different functions to different columns.
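One way to do this is to branch on the name attribute of the Series that key receives. The sketch below is illustrative only: the frame, the column names "name" and "score", and the helper column_key are all made up. Text is compared case-insensitively, and negating the numeric column makes larger scores sort first.

```python
import pandas as pd

# Hypothetical data: a text column and a numeric column.
frame = pd.DataFrame({"name": ["B", "a", "C", "a"], "score": [1, 2, 3, 4]})

def column_key(col):
    """Return a sort key for one column, chosen by the column's name."""
    if col.name == "name":
        return col.str.lower()   # case-insensitive text comparison
    if col.name == "score":
        return -col              # larger scores first
    return col                   # any other column: sort on raw values

result = frame.sort_values(by=["name", "score"], key=column_key)

print(list(result.index))  # [3, 1, 0, 2]
```

Branching on col.dtype instead of col.name follows the same pattern.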

11.3 Sorting by indexes and values

In DataFrame.sort_values(), the strings passed to by may refer to either column names or index level names.

# Build MultiIndex
In [323]: idx = pd.MultiIndex.from_tuples(
   .....:     [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)]
   .....: )
   .....: 

In [324]: idx.names = ["first", "second"]

# Build DataFrame
In [325]: df_multi = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)

In [326]: df_multi
Out[326]: 
              A
first second   
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1

Sort by second (an index level) and A (a column):

In [327]: df_multi.sort_values(by=["second", "A"])
Out[327]: 
              A
first second   
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5

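Note

If a string matches both a column name and an index level name, a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version; recent pandas releases appear to raise a ValueError for this case already.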
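As a reproducible sketch of mixing an index level and a column in by, the example below uses a smaller made-up frame. It also compares the result with the more roundabout route of moving the index levels into columns, sorting, and restoring the index.

```python
import pandas as pd

# Hypothetical frame with a two-level index and one data column.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 2), ("b", 1)], names=["first", "second"]
)
frame = pd.DataFrame({"A": [4, 3, 2, 1]}, index=idx)

# "second" names an index level, "A" names a column.
mixed = frame.sort_values(by=["second", "A"])

# Equivalent long form: reset the index, sort on plain columns, restore it.
roundabout = (
    frame.reset_index()
    .sort_values(["second", "A"])
    .set_index(["first", "second"])
)

print(list(mixed["A"]))  # [1, 4, 2, 3]
```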

11.4 searchsorted

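Series has a searchsorted() method, which works like numpy.ndarray.searchsorted(): for each value it returns the position at which that value could be inserted while keeping the Series sorted.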

In [328]: ser = pd.Series([1, 2, 3])

In [329]: ser.searchsorted([0, 3])
Out[329]: array([0, 2])

In [330]: ser.searchsorted([0, 4])
Out[330]: array([0, 3])

In [331]: ser.searchsorted([1, 3], side="right")
Out[331]: array([1, 3])

In [332]: ser.searchsorted([1, 3], side="left")
Out[332]: array([0, 2])

In [333]: ser = pd.Series([3, 1, 2])

In [334]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[334]: array([0, 2])
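The side and sorter parameters are easier to see with duplicate values. In the sketch below (variable names are made up), side="left" returns the position before any equal entries and side="right" the position after them. For unsorted data, sorter supplies the order that would sort the values.

```python
import numpy as np
import pandas as pd

# Hypothetical sorted Series with a repeated value.
ser = pd.Series([10, 20, 20, 30])

left = ser.searchsorted(20)                 # before the first 20
right = ser.searchsorted(20, side="right")  # after the last 20

# For unsorted data, pass the indices that would sort it.
unsorted = pd.Series([30, 10, 20])
order = np.argsort(unsorted.to_numpy())     # positions in ascending order
positions = unsorted.searchsorted([15, 30], sorter=order)

print(int(left), int(right))  # 1 3
print(list(positions))        # [1, 2]
```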

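11.5 Smallest / largest values

Series has nsmallest() and nlargest() methods, which return the n smallest or n largest values.

For a large Series, this can be much faster than sorting the whole Series and calling head(n) on the result.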

In [335]: s = pd.Series(np.random.permutation(10))

In [336]: s
Out[336]: 
0    2
1    0
2    3
3    7
4    1
5    5
6    9
7    6
8    8
9    4
dtype: int64

In [337]: s.sort_values()
Out[337]: 
1    0
4    1
0    2
2    3
9    4
5    5
7    6
3    7
8    8
6    9
dtype: int64

In [338]: s.nsmallest(3)
Out[338]: 
1    0
4    1
0    2
dtype: int64

In [339]: s.nlargest(3)
Out[339]: 
6    9
8    8
3    7
dtype: int64

DataFrame also has nlargest() and nsmallest() methods; they take the column or columns to rank by.

In [340]: df = pd.DataFrame(
   .....:     {
   .....:         "a": [-2, -1, 1, 10, 8, 11, -1],
   .....:         "b": list("abdceff"),
   .....:         "c": [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0],
   .....:     }
   .....: )
   .....: 

In [341]: df.nlargest(3, "a")
Out[341]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN

In [342]: df.nlargest(5, ["a", "c"])
Out[342]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN
2   1  d  4.0
6  -1  f  4.0

In [343]: df.nsmallest(3, "a")
Out[343]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0

In [344]: df.nsmallest(5, ["a", "c"])
Out[344]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0
2  1  d  4.0
4  8  e  NaN
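Column "a" above contains -1 twice, so which row is returned depends on how ties are resolved. These methods take a keep parameter ("first" by default, or "last" or "all") for this purpose. The sketch below rebuilds only column "a" under the made-up name frame; the row order within a result is not asserted, only which rows are selected.

```python
import pandas as pd

# Column "a" from the frame above; -1 appears at positions 1 and 6.
frame = pd.DataFrame({"a": [-2, -1, 1, 10, 8, 11, -1]})

first = frame.nsmallest(2, "a")               # keep="first": earliest tied row
last = frame.nsmallest(2, "a", keep="last")   # latest tied row instead
every = frame.nsmallest(2, "a", keep="all")   # all tied rows, may exceed n

print(sorted(first.index))  # [0, 1]
print(sorted(last.index))   # [0, 6]
print(sorted(every.index))  # [0, 1, 6]
```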

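11.6 Sorting by a MultiIndex column

When the columns are a MultiIndex, you must be explicit and specify every level in by. The example reuses df1 from section 11.2.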

In [345]: df1.columns = pd.MultiIndex.from_tuples(
   .....:     [("a", "one"), ("a", "two"), ("b", "three")]
   .....: )
   .....: 

In [346]: df1.sort_values(by=("a", "two"))
Out[346]: 
    a         b
  one two three
0   2   1     5
2   1   2     3
1   1   3     4
3   1   4     2
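A self-contained version of the same idea, assuming a made-up frame named frame: each sort key is a full tuple naming every column level, and several such tuples can be passed as a list.

```python
import pandas as pd

# Hypothetical frame with two-level column labels.
columns = pd.MultiIndex.from_tuples([("a", "one"), ("a", "two"), ("b", "three")])
frame = pd.DataFrame(
    [[2, 1, 5], [1, 3, 4], [1, 2, 3], [1, 4, 2]], columns=columns
)

# A single key: the tuple names both levels of the column.
by_two = frame.sort_values(by=("a", "two"))

# Several keys: a list of full tuples.
by_one_then_two = frame.sort_values(by=[("a", "one"), ("a", "two")])

print(list(by_two.index))           # [0, 2, 1, 3]
print(list(by_one_then_two.index))  # [2, 1, 3, 0]
```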

12 Copying

The copy() method on pandas objects copies the underlying data (but not the axis indexes, since they are immutable) and returns a new object.

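Note: you rarely need to copy objects. There are only a handful of ways to alter a DataFrame in place:

Inserting, deleting, or modifying a column
Assigning to the index or columns attributes
For homogeneous data, directly modifying the values via the values attribute or advanced indexing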

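To be clear, no pandas method has the side effect of modifying your data; almost every method returns a new object and leaves the original untouched. If data is modified, it is because you did so explicitly.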
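The sketch below illustrates both points with made-up names: a method such as sort_values() returns a new object without touching its input, and an explicit edit to a copy() does not leak back into the original.

```python
import pandas as pd

# Hypothetical single-column frame.
original = pd.DataFrame({"x": [3, 1, 2]})

# Methods return new objects; the input keeps its order.
sorted_frame = original.sort_values("x")

# An explicit in-place edit on a copy leaves the original alone.
duplicate = original.copy()
duplicate.loc[0, "x"] = 100

print(list(original["x"]))      # [3, 1, 2]
print(list(sorted_frame["x"]))  # [1, 2, 3]
print(list(duplicate["x"]))     # [100, 1, 2]
```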