Series Lecture 8: Reshaping / Sorting

2020-09-23 butters001

This lesson covers reshaping and sorting of a pandas Series.

As the name suggests, reshaping means changing the shape of the data.

Reshaping / sorting

Series.argsort()
Series.argmin()
Series.argmax()
Series.reorder_levels()
Series.sort_values()
Series.sort_index()
Series.swaplevel()
Series.unstack()
Series.explode()
Series.searchsorted()
Series.ravel()
Series.repeat()
Series.squeeze()
Series.view()

Details

First, import the required packages.

In [1]: import numpy as np                                                               
In [2]: import pandas as pd

1. `Series.argsort()`

Series.argsort(axis=0, kind='quicksort', order=None)

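Sorts the values and returns a new Series whose values are the integer positions, in the original Series, of the values taken in sorted order. NaN values are given position -1. (This is a little hard to follow in words; the example and diagram below make it clearer.)

Common parameters:

kind: {'mergesort', 'quicksort', 'heapsort'}, default 'quicksort'. Sorting algorithm; of these three, mergesort is the only stable one.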

In [3]: s = pd.Series([1, 3, 2, 2, 5, 6, 5], index=[list('abcdefg')])           

In [4]: s                                                                       
Out[4]: 
a    1
b    3
c    2
d    2
e    5
f    6
g    5
dtype: int64

In [5]: s.argsort()                                                             
Out[5]: 
a    0
b    2
c    3
d    1
e    4
f    6
g    5
dtype: int64

Diagram of argsort (figure: argsort.png)
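
As a rough illustration of what the positions returned by argsort mean, the sketch below feeds them back into iloc to rebuild the sorted Series. It assumes a plain single-level index and uses kind="mergesort" so that tied values keep their original order; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 2, 2, 5, 6, 5], index=list("abcdefg"))

# kind="mergesort" is stable, so equal values keep their original relative order.
order = s.argsort(kind="mergesort")
positions = order.to_numpy()

# positions[k] is the integer position (in s) of the k-th smallest value,
# so selecting by position in that order yields the sorted Series.
sorted_by_position = s.iloc[positions]

print(positions.tolist())
print(sorted_by_position.index.tolist())
print(sorted_by_position.tolist())
```

Selecting with `s.iloc[positions]` gives the same result as `s.sort_values(kind="mergesort")`, which is a convenient way to check one's reading of the argsort output.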

2. `Series.argmin()`

Series.argmin(axis=None, skipna=True, *args, **kwargs)

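Returns the integer position of the smallest value in the Series.

Compared with Series.idxmin(): argmin returns the integer position of the minimum, while idxmin returns its index label. The label is shown as the tuple ('a',) below because s was created with index=[list('abcdefg')], a nested list, which pandas turns into a one-level MultiIndex.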

In [6]: s.argmin()                                                                              
Out[6]: 0

In [7]: s.idxmin()                                                                              
Out[7]: ('a',)

3. `Series.argmax()`

Series.argmax(axis=None, skipna=True, *args, **kwargs)

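Returns the integer position of the largest value in the Series.

Compared with Series.idxmax(): argmax returns the integer position of the maximum, while idxmax returns its index label.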

In [8]: s.argmax()                                                                              
Out[8]: 5

In [9]: s.idxmax()                                                                              
Out[9]: ('f',)
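
The position-versus-label distinction in sections 2 and 3 can be sketched as follows. This version assumes a flat index built with list("abcdefg") (not the nested list used above), so the labels come back as plain strings rather than one-element tuples.

```python
import pandas as pd

s = pd.Series([1, 3, 2, 2, 5, 6, 5], index=list("abcdefg"))

# argmin/argmax give integer positions; idxmin/idxmax give index labels.
min_pos, max_pos = s.argmin(), s.argmax()
min_label, max_label = s.idxmin(), s.idxmax()

# A position is used with iloc and a label with loc; both reach the same value.
min_via_position = s.iloc[min_pos]
min_via_label = s.loc[min_label]

print(min_pos, min_label, min_via_position, min_via_label)
print(max_pos, max_label)
```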

4. `Series.reorder_levels()`

Series.reorder_levels(order)

Rearranges the index levels using the given order. This applies to the levels of a MultiIndex.

Common parameters:

order: list of int representing the new level order. Each int is the position of a level in the MultiIndex.

In [10]: midx = pd.MultiIndex.from_arrays([['Networking', 'Cryptography',   
    ...:                                      'Anthropology', 'Science'],   
    ...:                                              [88, 84, 98, 95]])                        

In [11]: midx                                                                                   
Out[11]: 
MultiIndex([(  'Networking', 88),
            ('Cryptography', 84),
            ('Anthropology', 98),
            (     'Science', 95)],
           )

# Swap the positions of the two index levels
In [12]: midx.reorder_levels([1, 0])                                                            
Out[12]: 
MultiIndex([(88,   'Networking'),
            (84, 'Cryptography'),
            (98, 'Anthropology'),
            (95,      'Science')],
           )

# The same result using swaplevel
In [112]: midx.swaplevel()                                                                      
Out[112]: 
MultiIndex([(88,   'Networking'),
            (84, 'Cryptography'),
            (98, 'Anthropology'),
            (95,      'Science')],
           )
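
The example above calls reorder_levels on the MultiIndex itself. A minimal sketch of the same idea on a Series follows; the third level ("tag") and the level names are hypothetical additions, included only to show a reordering that a single two-level swap cannot express.

```python
import pandas as pd

# Three-level index; the names and the "tag" level are illustrative assumptions.
midx = pd.MultiIndex.from_arrays(
    [["Networking", "Cryptography"], [88, 84], ["x", "y"]],
    names=["subject", "score", "tag"],
)
s = pd.Series([10, 20], index=midx)

# Levels can be referenced by position or by name.
by_position = s.reorder_levels([2, 0, 1])
by_name = s.reorder_levels(["tag", "subject", "score"])

print(list(by_position.index.names))
print(by_position.index.tolist())
print(by_position.equals(by_name))
```

Only the index is rearranged; the values stay attached to the same rows.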

5. `Series.sort_values()`

Series.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

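Sorts a Series by its values.

Common parameters:

ascending: bool, default True. True sorts in ascending order, False in descending order.
kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'. Sorting algorithm.
na_position: {'first', 'last'}, default 'last'. Where NaN values are placed; last by default.
ignore_index: bool, default False. If True, the original index is replaced with labels 0, 1, 2, .... New in version 1.0.0.
key: callable, optional. If not None, the key function is applied to the Series values before sorting. It resembles the key parameter of Python's built-in sorted(). New in version 1.1.0.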

In [13]: s = pd.Series([np.nan, 1, 3, 10, 5])                                                   
In [14]: s                                                                                      
Out[14]: 
0     NaN
1     1.0
2     3.0
3    10.0
4     5.0
dtype: float64

# Ascending
In [15]: s.sort_values(ascending=True)                                                          
Out[15]: 
1     1.0
2     3.0
4     5.0
3    10.0
0     NaN
dtype: float64

# Descending
In [16]: s.sort_values(ascending=False)                                                         
Out[16]: 
3    10.0
4     5.0
2     3.0
1     1.0
0     NaN
dtype: float64

# Position of NaN values
In [17]: s.sort_values(na_position='first')                                                     
Out[17]: 
0     NaN
1     1.0
2     3.0
4     5.0
3    10.0
dtype: float64

# ignore_index=True discards the original index and uses 0, 1, 2, 3, ... as the new index
In [22]: s = pd.Series([1, 3, 2, 2, 5, 6, 5], index=[list('abcdefg')])                          
In [23]: s.sort_values()                                                                        
Out[23]: 
a    1
c    2
d    2
b    3
e    5
g    5
f    6
dtype: int64

In [24]: s.sort_values(ignore_index=True)                                                       
Out[24]: 
0    1
1    2
2    2
3    3
4    5
5    5
6    6
dtype: int64

# Sort with a key function. The key function receives the Series values and should return an array-like
In [27]: s = pd.Series(['a', 'B', 'c', 'D', 'e'])                                               
In [28]: s.sort_values()                                                                        
Out[28]: 
1    B
3    D
0    a
2    c
4    e
dtype: object

In [29]: s.sort_values(key=lambda x: x.str.lower())                                                                        
Out[29]: 
0    a
1    B
2    c
3    D
4    e
dtype: object
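
To make the effect of key easier to see, here is a small sketch that compares the default ordering with a case-insensitive one and then combines key with ascending=False and ignore_index=True. It assumes pandas 1.1.0 or later, since that is when key was introduced.

```python
import pandas as pd

s = pd.Series(["a", "B", "c", "D", "e"])

# Default ordering compares the raw strings, so uppercase letters sort first.
default_order = s.sort_values()

# The key receives the whole Series and must return an array-like of the same
# length; the original values (not the lowercased ones) appear in the result.
case_insensitive = s.sort_values(key=lambda x: x.str.lower())

# key combines with the other parameters as usual.
descending = s.sort_values(
    key=lambda x: x.str.lower(), ascending=False, ignore_index=True
)

print(default_order.index.tolist())
print(case_insensitive.index.tolist())
print(descending.tolist())
```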

6. `Series.sort_index()`

Series.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)

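Sorts a Series by its index labels.

Common parameters:

level: int, optional. If not None, sort on the values of the specified index level.
ascending: bool, default True. True sorts in ascending order, False in descending order.
kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'. Sorting algorithm.
na_position: {'first', 'last'}, default 'last'. Where NaN index labels are placed; last by default.
sort_remaining: bool, default True. If True and the index is a MultiIndex, the other levels are also sorted (in order) after sorting by the specified level.
ignore_index: bool, default False. If True, the original index is replaced with labels 0, 1, 2, .... New in version 1.0.0.
key: callable, optional. If not None, the key function is applied to the index values (not the Series values) before sorting. It resembles the key parameter of Python's built-in sorted(). New in version 1.1.0.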

In [30]: s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])                                
In [31]: s.sort_index()                                                                         
Out[31]: 
1    c
2    b
3    a
4    d
dtype: object

# Descending
In [32]: s.sort_index(ascending=False)                                                          
Out[32]: 
4    d
3    a
2    b
1    c
dtype: object

# Position of NaN index labels
In [33]: s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, np.nan])                           
In [34]: s.sort_index(na_position='first')                                                      
Out[34]: 
NaN    d
1.0    c
2.0    b
3.0    a
dtype: object

# MultiIndex
In [35]: arrays = [np.array(['qux', 'qux', 'foo', 'foo', 
    ...:                     'baz', 'baz', 'bar', 'bar']), 
    ...:           np.array(['two', 'one', 'two', 'one', 
    ...:                     'two', 'one', 'two', 'one'])]                                      

In [36]: s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=arrays)                                  
In [37]: s                                                                                      
Out[37]: 
qux  two    1
     one    2
foo  two    3
     one    4
baz  two    5
     one    6
bar  two    7
     one    8
dtype: int64

# Sort by the second index level in ascending order; by default the remaining levels are sorted as well
In [38]: s.sort_index(level=1)                                                                  
Out[38]: 
bar  one    8
baz  one    6
foo  one    4
qux  one    2
bar  two    7
baz  two    5
foo  two    3
qux  two    1
dtype: int64

# Sort by the second index level only; the other levels are left unsorted
In [39]: s.sort_index(level=1, sort_remaining=False)                                            
Out[39]: 
qux  one    2
foo  one    4
baz  one    6
bar  one    8
qux  two    1
foo  two    3
baz  two    5
bar  two    7
dtype: int64
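
None of the examples above uses the key parameter of sort_index. The sketch below, which assumes pandas 1.1.0 or later and uses a made-up mixed-case index, shows that the key function is handed the index rather than the values.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "b", "C", "d"])

# Default ordering compares the raw labels, so uppercase labels come first.
default_order = s.sort_index()

# The key function receives the Index and returns a transformed Index
# that is used only for the comparison.
case_insensitive = s.sort_index(key=lambda idx: idx.str.lower())

print(default_order.index.tolist())
print(case_insensitive.index.tolist())
```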

7. `Series.swaplevel()`

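Series.swaplevel(i=-2, j=-1, copy=True)

Swaps levels i and j of a MultiIndex.

Common parameters:

copy: bool, default True. Whether to copy the underlying data. If False, changes to the original Series also show up in the new Series.

# s is the Series from the end of the previous section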
In [43]: s                                                                                      
Out[43]: 
qux  two    1
     one    2
foo  two    3
     one    4
baz  two    5
     one    6
bar  two    7
     one    8
dtype: int64

# By default the data is copied
In [44]: s1 = s.swaplevel()                                                             

# Without copying, the new Series references the original data
In [45]: s2 = s.swaplevel(copy=False)

# Modify the original Series; s2 turns out to be modified as well
s.loc['qux', 'two'] = 888

In [62]: s                                                                                      
Out[62]: 
qux  two    888
     one      2
foo  two      3
     one      4
baz  two      5
     one      6
bar  two      7
     one      8
dtype: int64

In [63]: s1                                                                                     
Out[63]: 
two  qux    1
one  qux    2
two  foo    3
one  foo    4
two  baz    5
one  baz    6
two  bar    7
one  bar    8
dtype: int64

In [64]: s2                                                                                     
Out[64]: 
two  qux    888
one  qux      2
two  foo      3
one  foo      4
two  baz      5
one  baz      6
two  bar      7
one  bar      8
dtype: int64
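
The session above relies on the defaults i=-2 and j=-1, which swap the two innermost levels. As a hedged sketch of choosing the levels explicitly, the following uses a three-level index whose level names and "year" level are invented for illustration.

```python
import pandas as pd

# Hypothetical three-level index; the names are assumptions for this sketch.
midx = pd.MultiIndex.from_arrays(
    [["qux", "qux", "foo"], ["two", "one", "two"], [2020, 2021, 2022]],
    names=["first", "second", "year"],
)
s = pd.Series([1, 2, 3], index=midx)

# Defaults: swap the last two levels.
default_swap = s.swaplevel()

# Explicit levels, given by name (integer positions work as well).
outer_swap = s.swaplevel("first", "year")

print(list(default_swap.index.names))
print(list(outer_swap.index.names))
print(outer_swap.index.tolist()[0])
```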

8. `Series.unstack()`

Series.unstack(level=-1, fill_value=None)

Unstacks (pivots) a Series with a MultiIndex to produce a DataFrame.

# s is the Series from the previous section
In [65]: s                                                                                      
Out[65]: 
qux  two    888
     one      2
foo  two      3
     one      4
baz  two      5
     one      6
bar  two      7
     one      8
dtype: int64

In [66]: s.unstack(level=0)                                                                     
Out[66]: 
     bar  baz  foo  qux
one    8    6    4    2
two    7    5    3  888
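
The Series above has a value for every combination of the two levels, so fill_value never comes into play. A minimal sketch with one combination deliberately left out (the data is made up) shows what it does:

```python
import pandas as pd

# ("foo", "two") is intentionally missing.
idx = pd.MultiIndex.from_tuples([("qux", "one"), ("qux", "two"), ("foo", "one")])
s = pd.Series([1, 2, 3], index=idx)

# Default level=-1: the innermost level becomes the columns.
wide = s.unstack()

# Missing combinations become NaN unless fill_value is given.
filled = s.unstack(fill_value=0)

print(wide)
print(filled)
```

Without fill_value the missing cell is NaN; with fill_value=0 the same cell holds 0.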

9. `Series.explode()`

Series.explode(ignore_index=False)

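Transforms each element of a list-like into its own row; the index values are repeated for those rows.

Common parameters:

ignore_index: bool, default False. If True, the original index is replaced with labels 0, 1, 2, .... New in version 1.1.0.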

In [69]: s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])                                          

In [70]: s                                                                                      
Out[70]: 
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object

In [71]: s.explode()                                                                            
Out[71]: 
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object
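
The ignore_index parameter described above is not exercised in the session, so here is a short sketch of it on the same data, assuming a pandas version that supports the parameter.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])

# Default: each exploded row keeps the index label of the row it came from.
repeated = s.explode()

# ignore_index=True renumbers the result 0, 1, 2, ...
renumbered = s.explode(ignore_index=True)

print(repeated.index.tolist())
print(renumbered.index.tolist())
```

In both cases the empty list turns into a single NaN row and the scalar 'foo' passes through unchanged.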

10. `Series.searchsorted()`

Series.searchsorted(value, side='left', sorter=None)

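Finds the positions (insertion points) at which elements would have to be inserted to keep the Series sorted. The lookup is a binary search.

Note ⚠️: the Series must be monotonically sorted, that is, it must already be in sorted order.

Common parameters:

value: array_like. The value(s) to insert.
side: {'left', 'right'}, optional. 'left' gives the position just before the first matching element; 'right' gives the position just after the last matching element.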

In [79]: ser = pd.Series([1, 2, 3, 3])                                                          

In [80]: ser                                                                                    
Out[80]: 
0    1
1    2
2    3
3    3
dtype: int64

# Position at which the value 4 would be inserted
In [81]: ser.searchsorted(4)                                                                    
Out[81]: 4

# Positions at which the values 0 and 4 would be inserted
In [82]: ser.searchsorted([0, 4])                                                               
Out[82]: array([0, 4])

# Insert before the first suitable position
In [83]: ser.searchsorted([1, 3], side='left')                                                  
Out[83]: array([0, 2])

# Insert after the last suitable position
In [84]: ser.searchsorted([1, 3], side='right')                                                 
Out[84]: array([1, 4])
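
For readers who know the standard library, the left/right behaviour lines up with bisect_left and bisect_right. The sketch below makes that comparison and also shows one way to handle an unsorted Series through the sorter parameter (which appears in the signature but is not described above); the unsorted data is made up for illustration.

```python
import bisect

import pandas as pd

ser = pd.Series([1, 2, 3, 3])

left = ser.searchsorted(3, side="left")
right = ser.searchsorted(3, side="right")

# Same answers as the standard-library binary search helpers.
bisect_left = bisect.bisect_left(ser.tolist(), 3)
bisect_right = bisect.bisect_right(ser.tolist(), 3)

# For an unsorted Series, sorter supplies the positions that would sort it.
unsorted = pd.Series([3, 1, 3, 2])
sorter = unsorted.argsort().to_numpy()
pos_in_sorted_order = unsorted.searchsorted(3, sorter=sorter)

print(left, right, bisect_left, bisect_right, pos_in_sorted_order)
```

Note that the position returned with sorter refers to the sorted order of the data, not to the original layout of the unsorted Series.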

11. `Series.ravel()`

Series.ravel(order='C')

Returns the flattened underlying data as an ndarray.

In [86]: ser.ravel()                                                                            
Out[86]: array([1, 2, 3, 3])

In [87]: ser.values                                                                             
Out[87]: array([1, 2, 3, 3])

In [88]: ser.to_numpy()                                                                         
Out[88]: array([1, 2, 3, 3])

12. `Series.repeat()`

Series.repeat(repeats, axis=None)

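Repeats the elements of a Series.

Returns a new Series in which each element is repeated consecutively the given number of times.

Common parameters:

repeats: int or array of ints. The number of repetitions for each element; must be a non-negative integer. Repeating 0 times returns an empty Series.
axis: None. Must be None; the parameter has no effect.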

In [89]: s = pd.Series(['a', 'b', 'c'])                                                 
In [90]: s                                                                                      
Out[90]: 
0    a
1    b
2    c
dtype: object

In [91]: s.repeat(2)                                                                            
Out[91]: 
0    a
0    a
1    b
1    b
2    c
2    c
dtype: object

In [92]: s.repeat([1, 2, 3])                                                                    
Out[92]: 
0    a
1    b
1    b
2    c
2    c
2    c
dtype: object
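
Because repeat also repeats the index labels, a common follow-up is to renumber the result. The sketch below assumes per-element counts that include a zero, which simply drops that element.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

# One count per element; a count of 0 removes the element from the result.
expanded = s.repeat([1, 0, 2])

# The index labels are repeated too; reset_index(drop=True) renumbers them.
renumbered = expanded.reset_index(drop=True)

print(expanded.index.tolist())
print(renumbered.index.tolist())
print(renumbered.tolist())
```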

13. `Series.squeeze()`

Series.squeeze(axis=None)

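A Series or DataFrame with a single element is squeezed to a scalar. A DataFrame with a single column or a single row is squeezed to a Series. Otherwise the object is left unchanged. (In the author's view this method is of limited use, since loc can do the same job.)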

# Series
In [97]: primes = pd.Series([2])                                                         
In [98]: primes                                                                                 
Out[98]: 
0    2
dtype: int64

In [99]: primes.squeeze()                                                                       
Out[99]: 2

# DataFrame
In [100]: df = pd.DataFrame([1, 2], columns=['a'])                                       
In [101]: df                                                                                    
Out[101]: 
   a
0  1
1  2

In [102]: df.squeeze()                                                                          
Out[102]: 
0    1
1    2
Name: a, dtype: int64
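
The remaining cases from the description (a single row, a single cell, and the axis parameter) can be sketched as follows; the small frames are made up for illustration.

```python
import pandas as pd

# One row, two columns: squeezed to a Series indexed by the column names.
one_row = pd.DataFrame([[1, 2]], columns=["a", "b"])
row = one_row.squeeze()

# One row, one column: squeezed all the way down to a scalar.
one_cell = pd.DataFrame([[7]], columns=["a"])
scalar = one_cell.squeeze()

# axis limits which axis is squeezed, so this stops at a Series.
column = one_cell.squeeze(axis="columns")

print(row.index.tolist(), row.tolist())
print(scalar)
print(type(column).__name__, column.tolist())
```

Passing axis is one way to get a predictable return type when the shape of the input is not known in advance.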

14. `Series.view()`

Series.view(dtype=None)

Creates a new view of the Series.

Note ⚠️: modifying values through the new view also changes the original Series.

In [104]: s = pd.Series([-2, -1, 0, 1, 2], dtype='int8')                                 
In [105]: s                                                                                     
Out[105]: 
0   -2
1   -1
2    0
3    1
4    2
dtype: int8

# Create a new view
In [106]: us = s.view('uint8')                                                          
In [107]: us                                                                                    
Out[107]: 
0    254
1    255
2      0
3      1
4      2
dtype: uint8

# Changes made through the new view affect the original Series
In [109]: us[0] = 128                                                                   
In [110]: s                                                                                     
Out[110]: 
0   -128
1     -1
2      0
3      1
4      2
dtype: int8
