pandas Notes (Data Transformation)

2020-06-26  生信start_site

The previous note covered how to handle missing values (data cleaning); this note continues with pandas data handling: data transformation.

Removing Duplicates

# start ipython as before; you can also practice in plain python
$ ipython
Python 3.7.6 (default, Jan  8 2020, 19:59:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
In [1]: import pandas as pd
In [2]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ...: 'k2': [1, 1, 2, 3, 3, 4, 4]}) # build a DataFrame

In [3]: data # the rows at index 5 and index 6 are identical
Out[3]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [4]: data.duplicated() # check for duplicated rows; here index 6 returns True
Out[4]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [5]: data.drop_duplicates() # drop the rows that .duplicated() flags as True, keeping only the rows flagged False
Out[5]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

The above drops entire duplicated rows; you can also deduplicate based on the values in specific columns:

In [6]: data['v1'] = range(7) # add a column v1 to the DataFrame

In [7]: data
Out[7]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6

In [8]: data.drop_duplicates(['k1']) # deduplicate on column k1 only; just the rows at index 0 and 1 come back, since every later value of k1 repeats 'one' or 'two'
Out[8]:
    k1  k2  v1
0  one   1   0
1  two   1   1

duplicated and drop_duplicates keep the first occurrence of each duplicate group by default; you can instead keep the last occurrence:

In [9]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[9]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4 # rows 5 and 6 were duplicates; only the last one (index 6) is kept
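
Besides 'first' and 'last', keep=False drops every member of a duplicate group. A minimal sketch using the same small DataFrame as above:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep='first' (the default) keeps the first row of each duplicate group,
# keep='last' keeps the last, and keep=False drops all of them
no_dups = data.drop_duplicates(['k1', 'k2'], keep=False)

# the duplicated pair ('two', 4) at index 5 and 6 disappears entirely,
# leaving the five unique rows at index 0-4
```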

Transforming Data Using a Function or Mapping

For many datasets, you may want to perform some transformation based on the values in an array, Series, or DataFrame column. For example:

In [11]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
    ...: 'Pastrami', 'corned beef', 'Bacon',
    ...: 'pastrami', 'honey ham', 'nova lox'],
    ...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [12]: data # create a DataFrame with a food column and an ounces column
Out[12]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Suppose you want to add a column indicating the animal each kind of food came from. Start with a mapping from meat to animal:

In [13]: meat_to_animal = {
    ...:     'bacon': 'pig',
    ...:     'pulled pork': 'pig',
    ...:     'pastrami': 'cow',
    ...:     'corned beef': 'cow',
    ...:     'honey ham': 'pig',
    ...:     'nova lox': 'salmon'
    ...: }

The map method on a Series accepts a function or a dict-like object containing a mapping. Here we have a small problem: some of the meats are capitalized and others are not, so they won't all match the keys of meat_to_animal. For example, Pastrami is capitalized in data but lowercase in the dict. We therefore first convert each value to lowercase with the str.lower Series method:

In [14]: lowercased = data['food'].str.lower()

In [15]: lowercased
Out[15]:
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

Then map the lowercased values through meat_to_animal and store the result as a new column of data:

In [16]: data['animal'] = lowercased.map(meat_to_animal)

In [17]: data
Out[17]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

The map above added a column to the full DataFrame; you can also pass a single function that does the lowercasing and the lookup in one step, returning just the mapped Series:

In [18]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[18]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
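
One thing to watch when mapping with a dict: keys missing from the dict silently become NaN rather than raising an error. A small sketch ('Spam' here is a made-up value not present in the mapping):

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig'}  # shortened version of the mapping above
s = pd.Series(['bacon', 'Spam'])   # 'Spam' is a hypothetical unmapped value

# values with no matching key map to NaN instead of raising a KeyError
mapped = s.str.lower().map(meat_to_animal)
```

The lambda version shown above (meat_to_animal[x.lower()]) would instead raise a KeyError on an unmapped value, so the two spellings differ on dirty data.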

Replacing Values

The previous note used fillna to fill in missing values; that is a special case of more general value replacement, which is done with replace:

In [19]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [20]: data
Out[20]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [21]: import numpy as np

In [22]: data.replace(-999, np.nan) # replace -999 with a missing value
Out[22]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [23]: data.replace([-999, -1000], np.nan) # replace both -999 and -1000 with missing values
Out[23]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [24]: data.replace([-999, -1000], [np.nan, 0]) # replace -999 with a missing value and -1000 with 0
Out[24]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [25]: data.replace({-999: np.nan, -1000: 0}) # the same replacement, expressed as a dict
Out[25]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
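
Note that replace, like most pandas methods, returns a new Series and leaves the original untouched. A quick sketch:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# replace returns a new Series; data itself is unchanged
cleaned = data.replace({-999: np.nan, -1000: 0})
```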

Renaming Axis Indexes

Like the values in a DataFrame, the row and column labels can also be transformed with map:

# rename the row labels
In [26]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                index=['Ohio', 'Colorado', 'New York'],
                columns=['one', 'two', 'three', 'four'])

In [27]: data
Out[27]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [28]: transform = lambda x: x[:4].upper()

In [29]: data.index.map(transform)
Out[29]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [30]: data.index = data.index.map(transform)

In [31]: data
Out[31]:
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

# rename the labels without modifying the original:
# title-case the rows, uppercase the columns
In [32]: data.rename(index=str.title, columns=str.upper)
Out[32]:
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

rename can also relabel just a subset of the row and column labels at the same time by passing dicts:

In [33]: data.rename(index={'OHIO': 'INDIANA'},
        columns={'three': 'peekaboo'})
Out[33]:
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

By default rename returns a new DataFrame without overwriting the original; pass inplace=True to modify the original DataFrame in place:

In [34]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [35]: data
Out[35]:
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11

Discretization and Binning

Continuous data is often discretized, or otherwise separated into "bins", for analysis. Suppose you have data about a group of people's ages and want to split them into age groups:

In [36]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

We will divide the ages into 18–25, 26–35, 36–60, and 61 and older. Use the pandas cut function:

In [37]: bins = [18, 25, 35, 60, 100]

In [38]: cats = pd.cut(ages,bins)

In [39]: cats # each number in ages is assigned to its bin
Out[39]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

Besides the interval labels, each element's bin assignment is also recorded as an integer code:

# inspect the bins
In [40]: cats.categories
Out[40]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
# the bin index assigned to each element
In [41]: cats.codes
Out[41]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
# count how many elements fell into each bin
In [42]: pd.value_counts(cats)
Out[42]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
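
By default the intervals are closed on the right, so someone aged exactly 25 falls in (18, 25]. Passing right=False flips this, which may be what you want for age groups like 18–24 / 25–34. A sketch with the same ages and bins:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# right=False makes the bins closed on the left: [18, 25), [25, 35), ...
cats = pd.cut(ages, bins, right=False)

# age 25 (the third element) now lands in [25, 35) instead of (18, 25]
```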

You can also give each bin a name:

In [43]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [44]: pd.cut(ages, bins, labels=group_names)
Out[44]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

The above used integer data; here is the same operation on floating-point data:

In [45]: data = np.random.rand(20)

In [46]: data # 20 random numbers between 0 and 1
Out[46]:
array([0.81110965, 0.47609535, 0.71813052, 0.71125473, 0.50874785,
       0.54215936, 0.51799094, 0.92532814, 0.75840793, 0.59097154,
       0.28318023, 0.580895  , 0.57077378, 0.86404974, 0.10115019,
       0.29316477, 0.57158071, 0.14108119, 0.36732534, 0.18085161])

In [47]: pd.cut(data,4,precision=2) # split the range from the min to the max of the 20 numbers into 4 equal-width bins
# precision=2 rounds the bin edges to 2 decimal places
Out[47]:
[(0.72, 0.93], (0.31, 0.51], (0.51, 0.72], (0.51, 0.72], (0.31, 0.51], ..., (0.1, 0.31], (0.51, 0.72], (0.1, 0.31], (0.31, 0.51], (0.1, 0.31]]
Length: 20
Categories (4, interval[float64]): [(0.1, 0.31] < (0.31, 0.51] < (0.51, 0.72] < (0.72, 0.93]]
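
cut with an integer gives equal-width bins, so the bins may contain very different numbers of points. The related function qcut cuts at sample quantiles instead, so each bin gets roughly the same count. A minimal sketch (the seed is only there for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)           # hypothetical seed, for reproducibility
data = np.random.rand(1000)

# qcut cuts at the sample quantiles: with q=4, each bin holds
# exactly a quarter of the 1000 points
cats = pd.qcut(data, 4)
counts = pd.Series(cats).value_counts()
```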

Detecting and Filtering Outliers

In [48]: data = pd.DataFrame(np.random.randn(1000, 4)) # build a 1000-row, 4-column DataFrame of random numbers

In [49]: data 
Out[49]:
            0         1         2         3
0    0.650892  2.070260  0.538299  0.679726
1    1.074305  0.495664 -0.970136 -0.344586
2   -0.859329  1.453186  1.612602 -0.321719
3    0.558712  0.562590  0.646543 -0.483733
4    0.935728  1.516592  0.852229  1.926105
..        ...       ...       ...       ...
995 -1.304159 -0.085793  0.482636  1.523466
996  0.285450 -1.898452  1.422107 -1.227373
997  2.083328 -1.410110  0.061404 -0.860416
998 -0.438052 -1.744427 -1.677008 -0.481408
999  0.999009 -0.240101 -1.299566 -0.172057

[1000 rows x 4 columns]

In [50]: data.describe() # summary statistics for each column
Out[50]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.004135    -0.041167    -0.030157    -0.015668
std       0.987452     0.996992     1.004676     0.982699
min      -3.438090    -3.179065    -3.055933    -3.336112
25%      -0.683605    -0.714445    -0.679892    -0.677707
50%      -0.033715    -0.066457    -0.122572    -0.039982
75%       0.646951     0.578665     0.643963     0.632722
max       3.625363     3.499694     3.896634     3.082743

To find the values in column 2 whose absolute value exceeds 3:

In [51]: col = data[2]

In [52]: col[np.abs(col) >3]
Out[52]:
106    3.248979
365    3.896634
467    3.659385
475   -3.055933
Name: 2, dtype: float64

To select every row containing a value whose absolute value exceeds 3 in any column:

In [53]: data[(np.abs(data) > 3).any(1)]
Out[53]:
            0         1         2         3
106  0.402003  0.675224  3.248979  0.594704
135  3.625363  0.757227  0.267661  2.870315
143 -0.040237  3.025956 -0.950451  2.369624
250  0.411540  3.499694 -0.268605  0.811673
365 -1.609991 -0.691870  3.896634 -0.761072
439  0.050316  0.433346 -0.938715 -3.277411
467 -0.722803  2.093205  3.659385  0.373497
475  0.326796  0.196501 -3.055933 -0.601524
492 -3.438090  0.132217 -0.305294 -0.574376
616 -0.623880  0.588810 -0.742245 -3.239445
646 -0.009522 -3.179065 -1.350068  0.309251
778  1.959143 -0.304691 -0.760753  3.082743
871 -0.686462  3.470719  1.305890 -1.060105
919  3.452396  1.216505 -0.495451  0.237577
966 -1.427159 -0.546189  1.346565 -3.336112

To cap all values outside the interval [-3, 3]: anything above 3 becomes 3 and anything below -3 becomes -3:

In [54]: data[np.abs(data) > 3] = np.sign(data) * 3

In [55]: data.describe()
Out[55]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.003495    -0.041985    -0.031906    -0.014898
std       0.982432     0.993157     0.998391     0.979717
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.683605    -0.714445    -0.679892    -0.677707
50%      -0.033715    -0.066457    -0.122572    -0.039982
75%       0.646951     0.578665     0.643963     0.632722
max       3.000000     3.000000     3.000000     3.000000

In the code above, np.sign(data) produces 1 or -1 depending on whether each value is positive or negative:

In [57]: np.sign(data)
Out[57]:
       0    1    2    3
0    1.0  1.0  1.0  1.0
1    1.0  1.0 -1.0 -1.0
2   -1.0  1.0  1.0 -1.0
3    1.0  1.0  1.0 -1.0
4    1.0  1.0  1.0  1.0
..   ...  ...  ...  ...
995 -1.0 -1.0  1.0  1.0
996  1.0 -1.0  1.0 -1.0
997  1.0 -1.0  1.0 -1.0
998 -1.0 -1.0 -1.0 -1.0
999  1.0 -1.0 -1.0 -1.0
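
The same capping can be written more directly with the clip method, which bounds values to an interval; a minimal sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # hypothetical seed, for reproducibility
data = pd.DataFrame(np.random.randn(1000, 4))

# clip(-3, 3) caps every value to the interval [-3, 3],
# equivalent to the np.sign trick above
capped = data.clip(-3, 3)
```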

Permutation and Random Sampling

np.random.permutation makes it easy to randomly reorder (permute) a Series or the rows of a DataFrame:

In [58]: df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [59]: df
Out[59]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

In [60]: sampler = np.random.permutation(5)

In [61]: sampler
Out[61]: array([2, 4, 0, 3, 1])

In [62]: df.take(sampler)
Out[62]:
    0   1   2   3
2   8   9  10  11
4  16  17  18  19
0   0   1   2   3
3  12  13  14  15
1   4   5   6   7
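
For random sampling rather than a full permutation, DataFrame also has a sample method; a short sketch on the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# sample without replacement picks a random subset of rows
sub = df.sample(n=3)

# replace=True allows the same row to be drawn more than once,
# useful for bootstrap-style resampling
boot = df.sample(n=10, replace=True)
```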

Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column of a DataFrame has k distinct values, you get a matrix or DataFrame with k columns of 1s and 0s. pandas has a get_dummies function for this:

In [67]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
    ...: 'data1': range(6)})

In [68]: df
Out[68]:
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

In [69]: pd.get_dummies(df['key'])
Out[69]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

You can read the result like this: get_dummies takes a single column of the DataFrame and uses its distinct values as the new column names; each row gets a 1 in the column matching that row's value and 0 everywhere else. For the key column, 'b' appears in rows 0, 1, and 5, so the b column of the returned DataFrame is 1 in those rows and 0 elsewhere, and so on.

You can also add a prefix to the column names of the result:

In [70]: dummies = pd.get_dummies(df['key'], prefix='key')

In [71]: dummies
Out[71]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
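
A common follow-up is to attach the indicator columns back to the rest of the data; one way, using join:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
dummies = pd.get_dummies(df['key'], prefix='key')

# keep the numeric column and put the indicator columns alongside it
df_with_dummy = df[['data1']].join(dummies)
```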